The Web as a Dynamic Graph:Theoretical Growth Models

Theoretical Growth Models

This fractal like structure of an evolving graph, whose growth processes are so organised that the degree distribution of its vertices satisfy power law, which has a large number of bipartite cliques as subgraphs, and exhibits the small world phenomenon has generated a lot of interest among computer scientists recently. Part of the reason is that the web graph does not belong to the much studied Gn,p model which consists of all graphs with n vertices having p as the probability that there is an between any two vertices [7]. Graphs in Gn,p are essentially sparse and are unlikely to have many bipartite cliques as subgraphs. Moreover, for large n the degree distribution function is Poisson. There is consensus that the primary reason for the web graph to be different from a traditional random graph is that edges in the Web exhibit preferential attachment. Hyperlinks from a new web page are more likely to be directed to popular well established website/webpage just as a new manuscript is more likely to cite papers that are already well cited. It is interesting to note that preferential attachment has been used to model evolving phenomena in fields ranging from economics [25], biology [32], to languages [33]. Simon [30] used preferential attachment to explain power law distributions already observed in phenomena like distribution of incomes, distribution of species among genera, distribution of word frequencies in documents etc. A modern exposition of Simon’s argument in the framework of its applicability to web graphs is provided by Mitzenmacher [27], and is the basis of Barabasi et al.’s “mean field theory” based model [8], as well as the “rate equation” approach of Krapivsky and Redner [20]. All three use continuous differential equations to model the dynamics of the evolving web graph. This approach is attractive because of its simplicity and the ease with which it enables one to focus on understanding the issues involved, particularly those relating to power law distributions associated with the Web.

Let us consider the following growth model (we will call this the basic model) which forms the kernel of most of the reported models in literature. At each time step a new node is added to the web graph. This node gets one edge incident at it from one of the existing nodes in the graph, and has one edge pointing out of it. Let us assume that the tail (the node from which the edge emanates) of the edge which is incident at the node just added is chosen uniformly at random from the existing nodes. The head (the node at which the edge is incident) of the edge which emanates from the node just added is chosen with probability proportional to the indegree of the head in keeping with the preferential attachment paradigm that new web pages tend to attach themselves to popular web pages. We assume, to keep matters simple, that whole process starts with a single node with one edge emanating which is incident at the node itself.

Let Ik (t) and Ok (t) denote the number of nodes in the evolved web graph with indegree and outdegree equal to k respectively after t timesteps. Ik (t) and Ok (t) are random variables for all k, t 1. We will assume for the purposes of what follows that their expected values are concentrated around their means, and we will use Ik (t) and Ok (t) to denote the expected value also. Note that, at time t, one node and two directed edges are added to the graph. The expected increase in the value of Ik (t) is controlled by two processes. One is the expected increase in the value of Ik−1(t), and the other is the expected decrease in the value of Ik (t). The expected increase is (k 1)Ik−1(t)/t. This is because of our assumption that the probability is proportional to the indegree of the head of the node at which the edge emanating from the node just added is incident and t is the total number of nodes in the graph. Reasoning in the same way we get that the expected decrease is kIk (t)/t. Since this change has has taken place over one unit of time we can write

image

This implies Ok 2−k . That the outdegree distribution is exponential and not power law should not come as a surprise if we recall that the process that affected outdegree was uniform random and had no component of preferential attachment associated with it.

It would be instructive to note that very simple changes in the growth model affect the degree distribution functions substantially. Consider the following variation on the growth model analysed above. The edge emanating out of the node added does not just attach itself to another node on the basis of its indegree. With probability β the edge points to a node chosen uniformly at random, and with probability 1 β the edge is directed to a node chosen proportionally to its indegree. The eq. (51.1 ) now takes the form

image

satisfies the above recurrence. It should be noted that in this case the power law coefficient can be any number larger than 2 depending upon the value of β.

How do we get power law distribution in the out degree of the nodes of the graph? Simplest modification to the basic model to ensure that would be to choose the tail of the edge incident at the added node to be chosen with probability proportional to the out degree of the nodes. Aiello et al. [4], and Cooper and Frieze [13] have both given analysis of the version of the basic model in which, at any time epoch, apart from edges incident and emanating out of the added node being chosen randomly and according to the out and in degree distributions of the existing nodes, edges are added between existing nodes of the graph. Both in and out degree distributions show power law behaviour. From Web modeling perspective this is not particularly satisfying because there is no natural analogue of preferential attachment to explain the process that controls the number of hyperlinks in a web page. Never-the-less all models that enable power law distribution in out degree of nodes in literature resort to it in one form or the other. We will now summarize the other significant variations of the basic model that have been reported in literature.

Kumar et al. [21] categorise evolutionary web graph models according to the rate of growth enabled by them. Linear growth models allow one node to be added at one time epoch along with a fixed number of edges to the nodes already in the graph. Exponential growth models allow the graph to grow by a fixed fraction of the current size at each time epoch. The models discussed above are linear growth models. Kumar et al. in [21] introduce the notion of copying in which the head of the edge emanating out of the added node is chosen to be the head of an edge emanating out of a “designated” node (chosen randomly).

The intuition for copying is derived from the insight that links out of a new page are more likely to be directed to pages that deal with the “topic” associated with the page. The designated node represents the choice of the topic and the links out of it are very likely links to “other” pages relating to the topic. Copying from the designated node is done with probability 1 β. With probability β the head is chosen uniformly at random. The exponential growth model that they have analyzed does not involve the copying rule for distribution of the new edges to be added. The tail of a new edge is chosen to be among the new nodes with some probability factor. If the tail is to be among the old nodes, then the old node is chosen with probability proportional to its out degree. The analysis, as in [4, 13], is carried out totally within the discrete domain using martingale theory to establish that the expected values are sharply clustered. For the linear growth model the power law coefficient is the same as in eq. (51.4). The copying model is particularly interesting because estimates of the distribution of bipartite cliques in graphs generated using copying match those found experimentally.

Another approach that has been used to model evolutionary graphs is the so called Master Equation Approach introduced by Dorogovtsev et al. [16] which focuses on the probability that at time t a node introduced at time i, has degree k. If we denote this quantity by p(k, i, t), then the equation controlling this quantity for indegree in the basic model becomes

image

Notice that the solution to eq. (51.6) with appropriate initial conditions is of the form P (k) k2 which is the same as that obtained by working in the continuous domain.

Rigorous analysis of the stochastic processes done so far in [4, 13, 21] has so far taken into account growth, i.e. birth, process only. The combinatorics of a web graph model that involves death processes also has still to be worked out. It must be pointed out that using less rigorous techniques a series of results dealing with issues like non-linear preferential attachment and growth rates, growth rates that change with time (aging and decay), and death processes in the form of edge removals have appeared in literature. This work is primarily being done by physicists. Albert and Barabasi [5], and Dorogovtsev and Mendes [15] are two comprehensive surveys written for physicists which the computer scientist swould do well to go through. However, with all this work we are still far away from having a comprehensive model that takes into account all that we understand of the Web. All models view web growth as a global process. But we know that a web page to propagate Esperanto in India is more likely to have hyperlinks to and from pages of Esperanto enthusiasts in rest of the world. From modeling perspective every new node added during the growth process may be associated with the topic of the node chosen, let us say by preferential attachment, to which the added node first points to. This immediately reduces the set of nodes from which links can point to it. In effect the probability for links to be added between two nodes which are associated with some topics will be a function of how related the topics are. That there is some underlying structure to the web graph that is defined by the topics associated with the nodes in the graph has been in the modeling horizon for some time. Search engines that use web directories organised hierarchically as graphs and trees of topics have been designed [11]. Experiments to discover the topic related underlying structures have been carried out [10, 12, 18]. A model that takes into account the implicit topic based structure and models web growth as a number of simultaneously taking place local processes [2] is as step in this direction. All that is known about the Web and the models that have been developed so far are pointers to the realisation that development mathematically tractable models that model the phenomena faithfully is a challenging problem which will excite the imagination and the creative energies of researchers for some time.

Comments

Popular posts from this blog

Data Structure Visualization:Introduction and Value of Data Structure Rendering

Collision Detection:Penetration Depth Computation

Concurrent Data Structures:Linked Lists