The Web as a Dynamic Graph:Introduction and Experimental Observations
Introduction
The World Wide Web (the Web) was started as an experimental network in 1991. Its growth since then can only be termed explosive. It has several billion pages today and is growing exponentially with time. This growth is totally distributed. There is no central authority to control the growth. The hyperlinks endow the Web with some structure in the sense that viewing the individual web pages as nodes and the hyperlinks as directed edges between them, the Web can be looked upon as a directed graph. What stands out is that this directed graph is not only dynamic—it is rapidly growing and changing—it has been much too large for some time to even have a complete snapshot. Experimental understanding of its structure is based on large but partial web crawls. What properties are being investigated is itself driven by the requirements of the increasingly sophisticated nature of the applications being developed as well as analogies and insights from fields like bibliometrics involving study of citations in academic literature [17].
Let us briefly consider topic search, discussed in detail in Chapter 50, which involves searching for pages on the Web which correspond closely to a given search topic. The seminal work in this area is Kleinberg’s HITS algorithm [19] that assumes that for any topic on the Web there are pages which could be considered to be “authoritative” on that topic, and pages which are “hubs” in the sense that they contain links to relevant pages on that topic. Given a collection of pages and links between them, selected by some sampling method as pertaining to the given topic, HITS algorithm ranks the pages by weights which are representative of the quality of the pages as hubs or authorities. These weights are nothing but principal eigen values, and are, in some sense, a “measure” of the “denseness” of the interconnections between the pages. This model of dense interconnection between hubs and authorities of a given topic gave rise to the notion of “cyber communities” in the Web associated with different topics. Underlying this model of cyber communities was the hypothesis that a subgraph representing a Web community would contain “bipartite cores”.
Bipartite cores, as the name suggests, are complete bipartite graphs corresponding to the hubs and authorities around which the communities are supposed to have developed.
Experimental investigations of the structure of the web graph, taking place on the graphs extracted out of the partial crawls, has confirmed much of the above and more. The structural understanding resulting from the experimental investigations has fueled both, theoretical model building which attempts to explain experimentally observed phenomena, and development of new algorithmic techniques that solve traditional problems of search and information retrieval on the web graph in novel ways. Moreover, the reality of the Web as a structure which is too large and continuously changing, makes the standard off-line and on-line models for algorithm design totally inapplicable. Over the last seven-eight years researchers have attempted to grapple with these unique issues of complexity in the study of the Web. Interestingly, contributions to this study of the Web have come not only from from computer scientists, but also from physicists who have brought to Web model building techniques from statistical mechanics that have been successful in predicting macro level behavior of a variety of natural phenomenon from millions of its constituent parts. In this chapter an attempt is made to put together what to the author are the main strands of this rapidly evolving model building. The rest of the chapter is organized as follows: Section 2 surveys the experimental observations and reflects what are the major trends in the findings. Section 3 contains the basic theoretical framework developed to explain the experimental findings. Section 4 contains examples of web algorithmics. Section 5 is crystal gazing and reflects what to the author are the grand challenges.
Experimental Observations
Recent literature contains reports of a number of experiments conducted to investigate topological properties satisfied by the web graph [6, 9, 22]. These experiments were conducted over a period of time, and using Web samples of varying sizes. Albert, Jeong and Barabasi [6] used nd.edu subset of the Web. Kumar et al.[22] used a cleaned up version of a 1997 web crawl carried out by Alexa Inc. Broder et al. [9] based their measurements on an Altavista crawl having about 200 million pages and 1.5 billion hyperlinks. The most fundamental observation that emerges from these experiments conducted at different times, focusing on different subparts of the Web, is that the degree distribution of nodes in the web graph follows a power law. The degree distribution is said to satisfy power law if the fraction of nodes of degree x is proportional to x−α for α > 0. Power law distribution is observed for
both the indegrees and the outdegrees of the web graph. Broder at al. report that for indegrees the power coefficient —indexexperimental observations!indegree distributionα ≈ 2.1, and for outdegrees α ≈ 2.72. There is a very close match in literature in the value of α for indegree distribution. For outdegree distribution the value of α reported varies from 2.4 to 2.72 [6].
Broder et al. [9] also analysed the crawl for connectedness. Viewing the web graph as an undirected graph, it was observed that 91% of the nodes were connected to each other and formed a giant connected component. Interestingly, it was found that the distribution of the number of connected components by their sizes also satisfied power law (α ≈ 2.5).
Power law distribution in the sizes of the components was observed even when the graph was viewed as directed. However, the size of the largest strongly connected component (giant SCC) was only 28% of the total web crawl. The giant SCC was reachable from about 22% of the nodes (the set IN). About similar percentage of nodes were reachable from the giant SCC (the set OUT). A significant portion of the rest of the nodes constituted, in Broder et al.’s terminology, “tendrils”, nodes reachable from IN or from which the set OUT is reachable. All the experiments done so far point to fractal like self similar nature of the Web in the sense that structure described above is likely to be exhibited in any non-trivial crawl carried out at any time.
Kumar et al. [22] also carried out experiments to measure the number of bipartite cores in the Web. In a cleaned up version of the web graph consisting of 24 million nodes, they reported discovering around 135,000 bipartite cliques Ki,j with i ≥ 3 and j = 3. The number of Ki,j ’s with i, j = 3 was approximately 90,000, and the numbers dropped exponentially with increase in j. Finding such cliques in the web graph is an algorithmically challenging problem which we will further discuss in the section on Web algorithmics.
Measurements have also been made of the “diameter” of the Web [6, 9]. If the Web has the structure as asserted in [9], then the probability of a path existing between any two random vertices is approximately 24%, and the average shortest path length is 16. Albert et al. [6] measured the average shortest path length on a directed graph generated having in and outdegree distribution satisfying the power law coefficients of 2.1 and 2.45 respectively.
In a directed graph with 8 × 108 vertices the average shortest path length was determined to be approximately 19, and was a linear function of the logarithm of the number of vertices.
The Web, therefore, is considered to exhibit the “small worlds” phenomenon [31].
Comments
Post a Comment