# Data Encounters Many New Challenges Computer Science Essay

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

## ABSTRACT

Data mining faces many new challenges as the amount of information in data repositories (data warehouses, databases, the World Wide Web, etc.) increases. Documents in these repositories have become a main resource for many purposes, and people want to find the information they need efficiently. Search engines play a vital role in retrieving the required information from this huge volume of data. In this thesis we treat the World Wide Web as a data repository and propose a dynamic and efficient algorithm that lets a search engine return quality results by scoring the relevance of web documents. Compared with the original algorithm, it increases the degree of relevance and decreases the time needed to find the required web documents. Search engines generally return a large number of pages in response to user queries. To help users navigate the result list, ranking methods are applied to the search results. Most of the ranking algorithms proposed in the literature are either link oriented or content oriented and do not consider user usage trends. In this thesis, a page ranking mechanism called PRABUBB (Page Ranking Algorithm Based on User Browsing Behavior) is devised for search engines. It builds on Google's basic ranking algorithm, PageRank, and additionally takes the number of visits of the inbound links of web pages into account. As described in the literature, the PageRank algorithm uses the link structure to calculate the importance (rank value) of pages and, on the basis of that importance, sorts the list of pages returned by the search engine in response to a user's query. The rank value of a page does not change as long as the link structure of the web remains constant. In other words, PageRank uses only web structure mining to calculate the rank value of pages. To make the rank value of pages dynamic rather than static, this thesis proposes and describes PRABUBB, which takes users' behavior, i.e. link-visit information, into account when calculating the importance of pages. This concept is useful for displaying the most valuable pages at the top of the result list on the basis of user browsing behavior, which greatly reduces the search space.

The thesis also presents a method for finding the link-visit counts of web pages, because no known method exists for obtaining users' link-visit information. The implementation of PRABUBB and a comparison of PRABUBB with other PageRank algorithms are also shown. Finally, PRABUBB is applied to a sample web graph to calculate the importance of pages, and its results are compared with those of the original PageRank.

## CHAPTER 1: INTRODUCTION

## Introduction to page rank algorithm

Web mining is defined as the application of data mining techniques to the World Wide Web (WWW) to find hidden information. This hidden information, i.e. knowledge, may be contained in the content of web pages, in the link structure of the WWW, or in web server logs. The WWW is a vast resource of hyperlinked and heterogeneous information including text, images, audio, video, and metadata. It is estimated that the WWW has expanded by about 2000% since its inception and is doubling in size every six to ten months. With the rapid growth of the information sources available on the WWW and the growing needs of users, it is becoming difficult to manage the information on the web and satisfy user needs. In effect, we are drowning in data but starving for knowledge. It has therefore become increasingly necessary for users to rely on information retrieval techniques to find, extract, filter, and order the desired information.

The World Wide Web (Web) is a popular and interactive medium for propagating information today. The Web is a huge, diverse, dynamic, and widely distributed global information service center. Today the WWW is the largest information repository for knowledge reference. With the rapid growth of the Web, users easily get lost in its rich hyperlink structure. Providing relevant information to users to cater to their needs is the primary goal of website owners. Therefore, finding the content of the Web and retrieving users' interests and needs from their behavior have become increasingly important. When a user submits a query to a search engine, it generally returns a large number of pages in response. This result list contains both relevant and irrelevant pages with respect to the user's query, whereas users expect more relevant pages in the result list. To help users navigate the result list, various ranking methods are applied to the search results.

## Parameters

In this section the different parameters selected for web page ranking are discussed. Page ranking is done by taking a weighted average of all or some of the parameters. The weight given to a particular parameter depends on the category of the page. In the proposed algorithm a single query may give a different ranking to a page depending on the category of the page, which is not possible in existing search engines. The algorithm is flexible in the sense that, just by changing the weights, the same algorithm provides rankings for different types of pages.
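The weighted-average idea can be sketched as follows. This is only a minimal illustration: the parameter names, the page categories, and all weight values are hypothetical and are not taken from the thesis.

```python
# Hypothetical categories and weights, chosen only for illustration.
CATEGORY_WEIGHTS = {
    "portal": {"relevance": 0.2, "hub": 0.5, "authority": 0.1, "links": 0.2},
    "article": {"relevance": 0.6, "hub": 0.0, "authority": 0.3, "links": 0.1},
}


def page_score(params, category):
    """Weighted average of a page's parameter values for one page category."""
    weights = CATEGORY_WEIGHTS[category]
    total_weight = sum(weights.values())
    weighted_sum = sum(w * params.get(name, 0.0) for name, w in weights.items())
    return weighted_sum / total_weight


# The same page receives a different score under each category's weights.
page = {"relevance": 0.8, "hub": 0.1, "authority": 0.6, "links": 0.3}
article_score = page_score(page, "article")
portal_score = page_score(page, "portal")
print(article_score, portal_score)
```

Changing only the entries of `CATEGORY_WEIGHTS` changes the ranking behavior, which is the flexibility the paragraph above describes.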

## 1.2.1 Relevance Weight

Relevance weight measures the relevance of a page with respect to a query topic by counting the number of occurrences of the query topic, or of part of the query topic, within the text of the document. The term frequency matrix provides useful information for calculating the relevance weight. Existing approaches include the Vector Space Model [15] [16], Cover Density Ranking [17], and the Three Level Scoring method [14].
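As a rough sketch of the counting step only (not of the Vector Space Model, Cover Density Ranking, or Three Level Scoring), the occurrences of the whole query and of each query term can be counted as below. The function name `relevance_counts` and the simple tokenizer are assumptions made for this example.

```python
import re


def relevance_counts(query, text):
    """Count occurrences of the full query phrase and of each query term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = re.findall(r"[a-z0-9]+", query.lower())
    term_counts = {term: tokens.count(term) for term in terms}
    n = len(terms)
    # Slide a window of n tokens over the text and count exact phrase matches.
    phrase_count = sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == terms
    ) if n else 0
    return phrase_count, term_counts


sample_text = "Java course notes. This Java course covers Java basics."
phrase_count, term_counts = relevance_counts("java course", sample_text)
print(phrase_count, term_counts)
```

How the phrase count and the per-term counts are combined into a single weight is left open here, since the text above does not fix a formula.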

## 1.2.2 Hub and Authority Weight

The hub and authority weight of a page is calculated using the HITS algorithm. Given a user query, the HITS algorithm first creates a neighborhood graph for the query. The neighborhood contains roughly the top 200 matched web pages retrieved from a content-based web search engine, together with all the pages these 200 pages link to and all the pages that link to them.

## 1.2.3 Link Analysis of a Page

The HITS algorithm analyzes the link structure information of a web graph. The hyperlink information of a single page (e.g. the number of links, the anchor text, and the positions of the linked pages in the domain tree with respect to that page) is also found to give useful information during the syntactic categorization of a web page.

## 1.2.3.1 Number of Hyperlinks

The number of hyperlinks of a page is calculated by counting its `<a href>` tags. To get the exact number of hyperlinks, the number of `<frame src>` tags should be added to the number of `<a href>` tags, and links to the same page should be excluded.
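A minimal sketch of this count using Python's standard `html.parser` module is given below. The class name `LinkCounter`, the helper `count_hyperlinks`, and the sample page are assumptions made for illustration; "links to the same page" is interpreted here as links whose target, ignoring any `#fragment`, equals the page's own URL.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCounter(HTMLParser):
    """Counts <a href> and <frame src> targets that leave the current page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = urldefrag(page_url).url
        self.count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            target = attrs.get("href")
        elif tag == "frame":
            target = attrs.get("src")
        else:
            return
        if not target:
            return  # e.g. <a name="..."> without an href
        absolute = urldefrag(urljoin(self.page_url, target)).url
        if absolute != self.page_url:  # exclude links to the same page
            self.count += 1


def count_hyperlinks(html, page_url):
    parser = LinkCounter(page_url)
    parser.feed(html)
    return parser.count


sample_html = """
<a href="/about">About</a>
<a href="#top">Back to top</a>
<a href="http://example.com/index.html">This page</a>
<frame src="menu.html">
<a name="section1">No href here</a>
"""
link_total = count_hyperlinks(sample_html, "http://example.com/index.html")
print(link_total)
```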

## 1.2.3.2 Anchor Text

The anchor text can be used to calculate the weight of links when measuring hub and authority weight. By analyzing anchor text, glossary pages can be identified very easily.

## 1.2.3.3 Positions of Hyperlinked Pages in the Domain Tree with Respect to a Particular Page

It has been found that portals have a large number of hyperlinks pointing to same-level nodes in the domain tree rooted at the next higher-level node of the source of the page; e.g., if the source is a.b.com, the hyperlinks have the form x.b.com or y.b.com. Site maps and home pages have a large number of hyperlinks pointing to lower-level nodes in the domain tree rooted at the source of the page; e.g., if the source is a.b.com, the hyperlinks have the form a.b.com/x or a.b.com/y.

## 1.2.4 Types of Content

The syntactic analysis of the content also gives useful properties about the type of a page. Examples of these types of properties are:

1. Number of images in a page

2. Ratio of text length to number of images

3. Relevance weight of the query string within special tags like Heading tag, title tag etc.

## Brief Overview of Page Ranking Algorithm

Ranking is an integral component of any information retrieval system [5]. Over the past decade, the Web has grown exponentially both in size and in variety. Because of this rapid growth, a simple keyword search can match hundreds of thousands of web pages. A human can usually check only the first twenty or so URLs returned by a search engine. Users therefore rely heavily on search engines not only to retrieve the web pages related to their information need but also to rank those pages correctly according to their relevance to the query. Thus, the ordering of the search results becomes a crucial factor in evaluating the effectiveness of a search engine. Given a query, text-based search engines normally return a large number of relevant web pages. To be more effective, the returned pages must be adequately ranked according to their importance with respect to the user's information need. Link graph features derived from hyperlinks on web pages, such as in-degree and out-degree, have been shown to significantly improve the performance of text-based retrieval algorithms on the Web.

The three most representative link-based page ranking algorithms are PageRank, HITS (Hyperlink-Induced Topic Search), and Weighted PageRank.

## 1.3.1 Page Rank Algorithm

PageRank was proposed in 1998 by Lawrence Page and Sergey Brin, then graduate students at Stanford, and has been used as the core ranking algorithm of Google, today's most widely used search engine. The PageRank score of each page is pre-computed for the entire Web graph, which contains more than 50 billion pages today. It must be updated periodically (e.g., every three months or so), and each update needs hundreds of thousands of high-end computers and 3 to 5 days to finish. The PageRank algorithm is based on the idea that if a page has important pages linking to it, then its own links to other pages are also considered important. PageRank considers the back links of a page when deciding its rank score: if the sum of the ranks of a page's back links is large, the page receives a large rank. A simplified version of PageRank is given by:

$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$

where the PageRank value of a web page u depends on the PageRank value of each web page v in the set $B_u$ (the set of all pages linking to web page u), divided by the number L(v) of links from page v. An example of back links is shown in the figure below. U is the back link of V and W, and V and W are the back links of X.

Illustration of back links

## 1.3.2 HITS Algorithm

HITS was proposed by Kleinberg in 1999 and is now used by Ask.com. While Stanford was developing PageRank, the IBM Almaden research center was defining HITS. The notion behind HITS is the distinction between authorities and hubs. Authorities are pages with good content, whereas hubs are pages with links to good pages. Hubs and authorities exhibit a mutually reinforcing relationship. HITS is not a global ranking algorithm. It is query-dependent and is computed at query time. The original HITS algorithm uses similarity measures only partially and is susceptible to spamming. HITS ranks web pages by processing their in-links and out-links. In this algorithm a web page is called an authority if it is pointed to by many hyperlinks, and a hub if it points to many other pages. An illustration of hubs and authorities is shown in Figure 4.

Figure 4: Illustration of Hub and Authorities

HITS is technically a link-based algorithm. The web pages are first collected by analyzing their textual contents against a given query. After the pages have been collected, HITS concentrates only on the structure of the web and neglects the textual contents. The original HITS algorithm has some problems, which are given below.

(i) High rank value is given to some popular website that is not highly relevant to the given query.

(ii) Topic drift occurs when a hub covers multiple topics, because equal weights are given to all of the out-links of a hub page. Figure 5 shows an illustration of the HITS process.

Figure 5: Illustration of HITS process
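The mutually reinforcing relationship between hubs and authorities can be sketched as a simple iteration. This is a minimal illustration of the commonly described update rule (authority score = sum of the hub scores of pages linking in, hub score = sum of the authority scores of pages linked to, followed by normalization); the function name `hits`, the iteration count, and the tiny example graph are assumptions, and the neighborhood-graph construction step is not shown.

```python
def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up the hub scores of all pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub update: add up the authority scores of all pages linked to.
        hub = {p: sum(auth[t] for t in graph.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical neighborhood graph: h1 and h2 act as hubs, a1 and a2 as authorities.
example_graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub_scores, auth_scores = hits(example_graph)
print(max(auth_scores, key=auth_scores.get), max(hub_scores, key=hub_scores.get))
```

In this example a1 is pointed to by both hubs and therefore ends up as the strongest authority, while h1 links to both authorities and ends up as the strongest hub.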

## 1.3.3 Weighted Page Rank Algorithm

The Weighted PageRank algorithm (WPR) was proposed by Wenpu Xing and Ali Ghorbani as a modification of the original PageRank algorithm. WPR decides the rank score based on the popularity of the pages by taking into consideration the importance of both the in-links and the out-links of the pages. The algorithm assigns higher rank values to more popular pages and does not divide the rank of a page equally among its out-link pages. Instead, every out-link page receives a rank value based on its popularity, which is determined by its number of in-links and out-links. WPR was simulated using the website of Saint Thomas University, and the simulation results show that WPR finds a larger number of relevant pages than the standard PageRank algorithm.
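The following sketch shows one way to implement this weighting. It follows the formulation usually given for WPR, in which the share that page v passes to page u is weighted by u's in-link count and out-link count relative to all pages that v links to; the exact formula is not stated in the text above, so it should be read as an assumption, as should the function name `weighted_pagerank`, the damping factor, and the example graph.

```python
def weighted_pagerank(graph, d=0.85, iterations=100):
    """graph maps each page to the list of pages it links to."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    out_links = {p: list(graph.get(p, [])) for p in pages}
    in_count = {p: 0 for p in pages}
    for src in pages:
        for target in out_links[src]:
            in_count[target] += 1
    out_count = {p: len(out_links[p]) for p in pages}

    def w_in(v, u):
        # Share of u's in-links among the in-links of all pages v links to.
        total = sum(in_count[p] for p in out_links[v])
        return in_count[u] / total if total else 0.0

    def w_out(v, u):
        # Share of u's out-links among the out-links of all pages v links to.
        total = sum(out_count[p] for p in out_links[v])
        return out_count[u] / total if total else 0.0

    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        rank = {
            u: (1 - d) + d * sum(
                rank[v] * w_in(v, u) * w_out(v, u)
                for v in pages if u in out_links[v]
            )
            for u in pages
        }
    return rank


wpr = weighted_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(wpr)
```

In the example, page A links to both B and C, but C has two in-links and B only one, so C receives the larger part of A's rank instead of an equal half.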

## 1.3.4 The SALSA Algorithm

An alternative algorithm, SALSA, was proposed by Lempel and Moran; it combines ideas from both HITS and PageRank. As in the case of HITS, visualize the graph G as a bipartite graph in which hubs point to authorities. The SALSA algorithm performs a random walk on the bipartite hubs-and-authorities graph, alternating between the hub side and the authority side. The random walk starts from some authority node selected uniformly at random and then proceeds by alternating between backward and forward steps. When at a node on the authority side of the bipartite graph, the algorithm selects one of the incoming links uniformly at random and moves to a hub node on the hub side. When at a node on the hub side, the algorithm selects one of the outgoing links uniformly at random and moves to an authority. The authority weights are defined to be the stationary distribution of this random walk. Formally, the Markov chain of the random walk has transition probabilities

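$$P_a(i,j) = \sum_{k \in B(i) \cap B(j)} \frac{1}{|B(i)|} \cdot \frac{1}{|F(k)|}$$

where $B(i)$ denotes the set of hubs that point to authority i and $F(k)$ denotes the set of authorities that hub k points to.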

Recall that $G_a = (A, E_a)$ denotes the authority graph, where there is an (undirected) edge between two authorities if they share a hub. This Markov chain corresponds to a random walk on the authority graph $G_a$, where we move from authority i to authority j with probability $P_a(i,j)$. Let $W_r$ denote the matrix derived from matrix W by normalizing the entries such that, for each row, the sum of the entries is 1, and let $W_c$ denote the matrix derived from matrix W by normalizing the entries such that, for each column, the sum of the entries is 1. Then the stationary distribution of the SALSA algorithm is the principal left eigenvector of the matrix $M_S = W_c^T W_r$.
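The random walk described above can be sketched in plain Python. The code builds the authority-to-authority transition probabilities from one backward step (pick an incoming link uniformly at random) followed by one forward step (pick an outgoing link uniformly at random), and then approximates the stationary distribution by repeated multiplication. The function name `salsa_authority`, the iteration count, and the small example graph are assumptions made for illustration.

```python
def salsa_authority(links, iterations=200):
    """links is a list of (hub, authority) edges of the bipartite graph."""
    hubs = sorted({h for h, _ in links})
    auths = sorted({a for _, a in links})
    in_hubs = {a: [h for h, t in links if t == a] for a in auths}
    out_auths = {h: [t for s, t in links if s == h] for h in hubs}

    # Transition probability from authority i to authority j:
    # backward step to a hub k of i, then forward step from k to j.
    P = {i: {j: 0.0 for j in auths} for i in auths}
    for i in auths:
        for k in in_hubs[i]:
            for j in out_auths[k]:
                P[i][j] += 1.0 / len(in_hubs[i]) / len(out_auths[k])

    # Power iteration for the left eigenvector: pi <- pi * P.
    pi = {a: 1.0 / len(auths) for a in auths}
    for _ in range(iterations):
        pi = {j: sum(pi[i] * P[i][j] for i in auths) for j in auths}
    return pi


edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a3"), ("h3", "a1")]
authority_weights = salsa_authority(edges)
print(authority_weights)
```

For a connected authority graph such as this one, the resulting weights are proportional to the in-degree of each authority (3, 1, and 1 incoming links here), which is a known property of SALSA.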

## Objective

In a search engine there are many points where the searching process can be improved, but two main points can be improved with respect to the search results.

First, how can one define the relevance between documents? Current approaches use text matching as a solution for this task. However, these approaches have many disadvantages. For example, if someone submits a query based on a few keywords, it may be impossible to infer what they are talking about. If someone wants information about windows, can a search system infer whether they mean the operating system Windows or the windows of a house? In the same way, java may refer to a programming language, to an island, or to coffee. These are just examples, and one might think that the best way to solve them is to attach semantic information to the content, so that when someone gives keywords to the search system, the system understands their meaning. However, the Semantic Web is still too far from being a valid approach for the whole WWW.

Second, the current methods and algorithms for ranking the results given by a search engine are not optimal. They can be improved in many ways, and the best way is to know and include information about the user. If the search engine knows the user's interests, it can filter the results automatically instead of letting the user do it and waste time. There are several ways to obtain information about the user's interests. One is to take the history of queries submitted by the user into account: if the search system knows that the user has searched for information related to computer science, it can infer that "windows" and "java" probably refer to the operating system and the programming language. Another is to ask the user to provide information about himself (called a user profile), in which the search system can find useful facts (for instance, that he is a computer science engineer). The third, and the important one in the context of this thesis, is the user's browsing information, i.e. how the user visits pages on the web.

## Proposed Work

The two main objectives of the proposed ranking system are:

1. To make the ranking system more dynamic than the existing ranking algorithms of search engines.

2. To minimize the time a user needs to find the desired results in the set of results returned by the search engine for a given query.

The ranking algorithm considers the user's web browsing behavior. As introduced in the previous section, PRABUBB uses the user's browsing information, rather than the link structure alone, to calculate the rank of a document. Because it takes browsing information into account, PRABUBB is more dynamic than other ranking algorithms.

## Organization of Dissertation

## CHAPTER 2: LITERATURE SURVEY

With the rapid growth of the Web, users easily get lost in its rich hyperlink structure. Providing relevant information to users to cater to their needs is the primary goal of website owners. Therefore, finding the content of the Web and retrieving users' interests and needs from their behavior have become increasingly important. Web mining is used to categorize users and pages by analyzing the users' behavior, the content of the pages, and the order of the URLs that tend to be accessed in sequence. Web structure mining plays an important role in this approach. Two page ranking algorithms, HITS and PageRank, are commonly used in web structure mining. Both algorithms treat all links equally when distributing rank scores. Several algorithms have been developed to improve the performance of these methods. The Weighted PageRank algorithm (WPR), an extension to the standard PageRank algorithm, was introduced by Xing and Ghorbani. WPR takes into account the importance of both the in-links and the out-links of the pages and distributes rank scores based on the popularity of the pages. The results of their simulation studies show that WPR performs better than the conventional PageRank algorithm in terms of returning a larger number of relevant pages for a given query.

The Web is expanding day by day, and people generally rely on search engines to explore it. In such a scenario it is the duty of the service provider to return proper, relevant, and quality information to internet users in response to the queries they submit to the search engine. Doing so by using only the web page contents and the hyperlinks between web pages is a challenge for the service provider.

Most explanations of user behavior while interacting with the web are based on a top-down approach, where the entire Web, viewed as a vast collection of pages and interconnection links, is used to predict how users interact with it. A prominent example of this approach is the random-surfer model, the core ingredient behind Google's PageRank. This model exploits the linking structure of the Web to estimate the percentage of web surfers viewing any given page. In contrast to the top-down approach, a bottom-up approach starts from the user and incrementally builds the dynamics of the web as the result of the users' interaction with it. The second approach has not been widely investigated, although it has numerous advantages over the top-down approach regarding (at least) personalization and decentralization of the required infrastructure for web tools. In this thesis, we propose a bottom-up approach to study the web dynamics based on web-related data browsed, collected, tagged, and semi-organized by end users. Our approach has been materialized into a hybrid bottom-up search engine that produces search results based solely on user-provided web-related data and their sharing among users. We conduct an extensive experimental study to demonstrate the qualitative and quantitative characteristics of user-generated web-related data, their strengths and weaknesses, and to compare the search results of our bottom-up search engine with those of a traditional one. Our study shows that a bottom-up search engine starts from a core consisting of the most interesting part of the Web (according to user opinions) and incrementally (and measurably) improves its ranking, coverage, and accuracy. Finally, we discuss how our approach can be integrated with PageRank, resulting in a new page ranking algorithm that can uniquely combine link analysis with users' preferences.

The main goal of this thesis is to identify interesting and uninteresting web pages from user browsing behavior. These web pages are stored in the user profile as positive and negative documents. We propose a ranking algorithm based on a combination of different information resources collected from the reference ontology, the user profile, and the original search engine's ranking. Experiments show that our model offers improved performance over the Google search engine.

The present thesis discusses a web page ranking algorithm that consolidates web page classification with web page ranking to offer flexibility to the user and to produce more accurate search results. The classification is based on several properties of a web page that do not depend on the meaning of its content. The existence of this type of classification is supported by applying the fuzzy c-means algorithm and neural network classification to a set of web pages. We propose changing the typical interface of a web search engine to a more flexible one that takes the type of the web page along with the search string.

Based on the algorithm used, the ranking algorithm provides a definite rank for each resultant web page. A typical search engine should use web page ranking techniques based on the specific needs of its users. After an exhaustive analysis of web page ranking algorithms against parameters such as methodology, input parameters, relevancy of results, and importance of results, it is concluded that existing techniques have limitations, particularly in terms of response time, accuracy of results, importance of results, and relevancy of results. An efficient web page ranking algorithm should meet these challenges while remaining compatible with global standards of web technology.

## Search Engine

A search engine receives a user's query, processes it, and searches its index for relevant documents, i.e. documents that are likely to be related to the query and of interest to the user. The search engine then ranks the documents it found relevant and shows them as results. This process can be divided into the following tasks:

## Crawling:

A crawler is in charge of visiting as many pages as it can and retrieving the needed information from them. This information is stored for later use by the search engine.

## Indexing:

The information provided by a crawler has to be stored so that the search engine can access it. Because the user is waiting at the computer for the search engine's answer, response time becomes an important issue. That is why this information is indexed: to decrease the time needed to look things up in it.

## Searching:

The web search engine represents the user interface needed to permit the user to query the information. It is the connection between the user and the information repository.

## Sorting/Ranking:

Because of the huge amount of information on the web, when a user sends a query about a general topic (e.g. java course), a very large number of pages are related to the query, but only a small part of them will really be interesting to the user. That is why current search engines incorporate ranking algorithms to sort the results.
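The indexing, searching, and ranking tasks can be illustrated with a toy example. The sketch below assumes that crawling has already produced a small dictionary of page texts; the document ids, the page texts, and the ranking by summed term frequency are all assumptions chosen for illustration and are far simpler than what a real search engine does.

```python
from collections import Counter, defaultdict


def build_index(documents):
    """Indexing: build an inverted index, term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return index


def search(index, query):
    """Searching and ranking: docs containing every query term, best first."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    matches = set(postings[0]).intersection(*postings[1:])
    scores = {doc: sum(p[doc] for p in postings) for doc in matches}
    # Sort by descending score, breaking ties by document id.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical output of a crawler.
crawled_pages = {
    "d1": "java course for beginners java basics",
    "d2": "coffee from java island",
    "d3": "advanced java course",
}
page_index = build_index(crawled_pages)
results = search(page_index, "java course")
print(results)
```

The index is built once and reused for every query, which is why lookups stay fast even though the user is waiting for the answer.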

## CHAPTER 3:

## 3.1.1 Page Rank Algorithm

The original PageRank algorithm was described by Lawrence Page and Sergey Brin in several publications. It is given by

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where

- PR(A) is the PageRank of page A,
- PR(Ti) is the PageRank of pages Ti which link to page A,
- C(Ti) is the number of outbound links on page Ti and
- d is a damping factor which can be set between 0 and 1.

So, first of all, we see that PageRank does not rank web sites as a whole, but is determined for each page individually. Further, the PageRank of page A is recursively defined by the PageRanks of those pages which link to page A.

The PageRank of pages Ti which link to page A does not influence the PageRank of page A uniformly. Within the PageRank algorithm, the PageRank of a page T is always weighted by the number of outbound links C(T) on page T. This means that the more outbound links a page T has, the less will page A benefit from a link to it on page T.

The weighted PageRank of pages Ti is then added up. The outcome of this is that an additional inbound link for page A will always increase page A's PageRank.

Finally, the sum of the weighted PageRanks of all pages Ti is multiplied by a damping factor d, which can be set between 0 and 1. This reduces the extent to which a page benefits from the PageRank of another page linking to it.
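A single application of the formula PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) can be written as a short function. The function name `pagerank_of` and the example values (three linking pages with PageRank 1.0 and 2, 4, and 10 outbound links) are hypothetical and serve only to illustrate the points made above.

```python
def pagerank_of(page, ranks, inbound, outbound_count, d=0.85):
    """One evaluation of PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti))."""
    return (1 - d) + d * sum(ranks[t] / outbound_count[t] for t in inbound[page])


ranks = {"T1": 1.0, "T2": 1.0, "T3": 1.0}
outbound_count = {"T1": 2, "T2": 4, "T3": 10}

# Page A with two inbound links, then with one additional inbound link.
before = pagerank_of("A", ranks, {"A": ["T1", "T2"]}, outbound_count)
after = pagerank_of("A", ranks, {"A": ["T1", "T2", "T3"]}, outbound_count)
print(before, after)
```

The extra link from T3 raises the result even though T3 has ten outbound links, and T1 (two outbound links) contributes more than T2 (four outbound links) despite both having the same PageRank.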

## 3.1.2 A Different Notation of the PageRank Algorithm

Lawrence Page and Sergey Brin have published two different versions of their PageRank algorithm in different papers. In the second version of the algorithm, the PageRank of page A is given as

PR(A) = (1-d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where N is the total number of all pages on the web. The second version of the algorithm, indeed, does not differ fundamentally from the first one. Regarding the Random Surfer Model, the second version's PageRank of a page is the actual probability for a surfer reaching that page after clicking on many links. The PageRanks then form a probability distribution over web pages, so the sum of all pages' PageRanks will be one.

In contrast, in the first version of the algorithm the probability of the random surfer reaching a page is weighted by the total number of web pages. So, in this version PageRank is an expected value for the random surfer visiting a page when he restarts this procedure as often as the web has pages. If the web had 100 pages and a page had a PageRank value of 2, the random surfer would reach that page twice on average if he restarted 100 times.

As mentioned above, the two versions of the algorithm do not differ fundamentally from each other. A PageRank that has been calculated using the second version of the algorithm has to be multiplied by the total number of web pages to get the corresponding PageRank that would have been calculated using the first version. Even Page and Brin mixed up the two versions in their most popular paper, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", where they claim that the first version of the algorithm forms a probability distribution over web pages with the sum of all pages' PageRanks being one.
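The relationship between the two versions can be checked numerically on a small graph. The sketch below is an illustration under stated assumptions: the function name `pagerank`, the `normalized` flag, the iteration count, and the three-page graph are chosen for this example, and every page is assumed to have at least one outbound link.

```python
def pagerank(graph, d=0.85, normalized=False, iterations=100):
    """Iterate either version of the formula on a graph {page: [targets]}.

    normalized=False uses (1 - d) as the constant term (first version);
    normalized=True uses (1 - d) / N (second version).
    """
    pages = sorted(graph)
    n = len(pages)
    base = (1 - d) / n if normalized else (1 - d)
    rank = {p: (1.0 / n if normalized else 1.0) for p in pages}
    for _ in range(iterations):
        rank = {
            p: base + d * sum(rank[q] / len(graph[q]) for q in pages if p in graph[q])
            for p in pages
        }
    return rank


web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
version1 = pagerank(web, normalized=False)
version2 = pagerank(web, normalized=True)

print(sum(version1.values()), sum(version2.values()))
print(version1["A"], len(web) * version2["A"])
```

The second version's values sum to one, the first version's values sum to the number of pages, and multiplying a second-version value by the number of pages gives the first-version value.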

In the following, we will use the first version of the algorithm. The reason is that PageRank calculations by means of this algorithm are easier to compute, because we can disregard the total number of web pages.

## 3.1.3 The Characteristics of PageRank

The characteristics of PageRank shall be illustrated by a small example.

We regard a small web consisting of three pages A, B and C, whereby page A links to the pages B and C, page B links to page C and page C links to page A. According to Page and Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5. The exact value of the damping factor d admittedly has effects on PageRank, but it does not influence the fundamental principles of PageRank. So, we get the following equations for the PageRank calculation:

PR(A) = 0.5 + 0.5 PR(C)

PR(B) = 0.5 + 0.5 (PR(A) / 2)

PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

These equations can easily be solved. We get the following PageRank values for the single pages:

PR(A) = 14/13 = 1.07692308

PR(B) = 10/13 = 0.76923077

PR(C) = 15/13 = 1.15384615

It is obvious that the sum of all pages' PageRanks is 3 and thus equals the total number of web pages. As shown above this is not a specific result for our simple example.

For our simple three-page example it is easy to solve the corresponding system of equations to determine the PageRank values. In practice, the web consists of billions of documents and it is not possible to find a solution by inspection.
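When a direct solution is impractical, the values can instead be approximated by applying the formula repeatedly. The sketch below does this for the three-page example with d = 0.5 and compares the result with the exact fractions given above; the starting value of 1.0 for every page and the number of iterations are assumptions made for this illustration.

```python
from fractions import Fraction

# Page A links to B and C, page B links to C, page C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.5

rank = {page: 1.0 for page in links}
for _ in range(50):
    # Recompute every page from the previous round's values.
    rank = {
        page: (1 - d) + d * sum(
            rank[src] / len(targets)
            for src, targets in links.items() if page in targets
        )
        for page in links
    }

exact = {"A": Fraction(14, 13), "B": Fraction(10, 13), "C": Fraction(15, 13)}
for page in sorted(links):
    print(page, round(rank[page], 8), round(float(exact[page]), 8))
```

Each round only needs the link structure and the previous values, so the same loop scales to graphs where solving the equations by hand is out of the question.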

## 3.2 Background of PageRanking Algorithm

PageRank was developed at Stanford University by Larry Page (hence the name Page-Rank) and Sergey Brin as part of a research project about a new kind of search engine. Sergey Brin had the idea that information on the web could be ordered in a hierarchy by "link popularity": a page is ranked higher the more links there are to it. The first paper about the project, describing PageRank and the initial prototype of the Google search engine, was co-authored by Rajeev Motwani and Terry Winograd and published in 1998; shortly after, Page and Brin founded Google Inc., the company behind the Google search engine. While just one of many factors that determine the ranking of Google search results, PageRank continues to provide the basis for all of Google's web search tools.

PageRank was influenced by citation analysis, developed early on by Eugene Garfield in the 1950s at the University of Pennsylvania, and by Hyper Search, developed by Massimo Marchiori at the University of Padua. In the same year that PageRank was introduced (1998), Jon Kleinberg published his important work on HITS. Google's founders cite Garfield, Marchiori, and Kleinberg in their original paper.

A small search engine called "RankDex" from IDD Information Services designed by Robin Li was, since 1996, already exploring a similar strategy for site-scoring and page ranking. The technology in RankDex would be patented by 1999 and used later when Li founded Baidu in China. Li's work would be referenced by some of Larry Page's U.S. patents for his Google search methods.

## 3.3 PageRank calculation

To calculate the PageRank for a page, all of its inbound links are taken into account. These are links from within the site and links from outside the site.

PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))

That's the equation that calculates a page's PageRank. It's the original one that was published when PageRank was being developed, and it is probable that Google uses a variation of it but they aren't telling us what it is. It doesn't matter though, as this equation is good enough.

In the equation 't1 - tn' are pages linking to page A, 'C' is the number of outbound links that a page has and 'd' is a damping factor, usually set to 0.85.

We can think of it in a simpler way:

a page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)

"share" = the linking page's PageRank divided by the number of outbound links on the page.

A page "votes" an amount of PageRank onto each page that it links to. The amount of PageRank that it has to vote with is a little less than its own PageRank value (its own value * 0.85). This value is shared equally between all the pages that it links to.

From this, we could conclude that a link from a page with PR4 and 5 outbound links is worth more than a link from a page with PR8 and 100 outbound links. The PageRank of a page that links to yours is important but the number of links on that page is also important. The more links there are on a page, the less PageRank value your page will receive from it.

If the differences between the PageRank values PR1, PR2, ..., PR10 were equal, then that conclusion would hold up, but many people believe that the values between PR1 and PR10 (the maximum) are set on a logarithmic scale, and there is very good reason for believing it. Nobody outside Google knows for sure one way or the other, but the chances are high that the scale is logarithmic, or similar. If so, it takes a lot more additional PageRank for a page to move up to the next PageRank level than it did to move up from the previous PageRank level. The result is that it reverses the previous conclusion, so that a link from a PR8 page that has lots of outbound links is worth more than a link from a PR4 page that has only a few outbound links.
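The two conclusions can be compared numerically. The sketch below is only an illustration: the real scale is not public, so the base of 10 is purely an assumption chosen to show the effect, and `link_share` is a hypothetical helper name.

```python
def link_share(pagerank_value, outbound_links):
    """PageRank passed through one link (before damping)."""
    return pagerank_value / outbound_links


# Linear reading: the toolbar number is the PageRank value itself.
linear_pr4 = link_share(4, 5)      # PR4 page with 5 outbound links
linear_pr8 = link_share(8, 100)    # PR8 page with 100 outbound links
print(linear_pr4 > linear_pr8)     # the PR4 link is worth more

# Logarithmic reading: each level needs BASE times more PageRank.
# BASE = 10 is an ASSUMPTION; nobody outside Google knows the real scale.
BASE = 10
log_pr4 = link_share(BASE ** 4, 5)
log_pr8 = link_share(BASE ** 8, 100)
print(log_pr8 > log_pr4)           # now the PR8 link is worth more
```

Any base large enough to outweigh the difference in outbound link counts gives the same reversal.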

Whichever scale Google uses, we can be sure of one thing. A link from another site increases our site's PageRank. Just remember to avoid links from link farms.

For a page's calculation, its existing PageRank (if it has any) is abandoned completely and a fresh calculation is done where the page relies solely on the PageRank "voted" for it by its current inbound links, which may have changed since the last time the page's PageRank was calculated.

The equation shows clearly how a page's PageRank is arrived at. But what isn't immediately obvious is that it can't work if the calculation is done just once. Suppose we have 2 pages, A and B, which link to each other, and neither has any other links of any kind. This is what happens:

Step 1: Calculate page A's PageRank from the value of its inbound links

Page A now has a new PageRank value. The calculation used the value of the inbound link from page B. But page B has an inbound link (from page A) and its new PageRank value hasn't been worked out yet, so page A's new PageRank value is based on inaccurate data and can't be accurate.

Step 2: Calculate page B's PageRank from the value of its inbound links

Page B now has a new PageRank value, but it can't be accurate because the calculation used the new PageRank value of the inbound link from page A, which is inaccurate.

It's a Catch-22 situation. We can't work out A's PageRank until we know B's PageRank, and we can't work out B's PageRank until we know A's PageRank.

Now that both pages have newly calculated PageRank values, can't we just run the calculations again to arrive at accurate values? No. We can run the calculations again using the new values and the results will be more accurate, but we will always be using inaccurate values for the calculations, so the results will always be inaccurate.

The problem is overcome by repeating the calculations many times. Each pass produces slightly more accurate values. In fact, total accuracy can never be achieved because the calculations are always based on inaccurate values, but 40 to 50 iterations are sufficient to reach a point where further iterations wouldn't change the values enough to matter. This is precisely what Google does at each update, and it's the reason why the updates take so long.
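The two-page case can be traced with a short loop. This is a minimal sketch under stated assumptions: both pages start from a guess of 0.0, each has exactly one outbound link, and the function name `iterate_two_pages` is hypothetical. With d = 0.85, the equation for this pair is satisfied when both values equal 1.0, and the loop creeps towards that value.

```python
def iterate_two_pages(iterations, d=0.85, start=0.0):
    """Repeat Step 1 and Step 2 for pages A and B that link only to each other."""
    a = b = start
    history = []
    for _ in range(iterations):
        a = (1 - d) + d * b  # Step 1: A from B's current (inaccurate) value
        b = (1 - d) + d * a  # Step 2: B from A's new value
        history.append((a, b))
    return history


history = iterate_two_pages(50)
for n in (1, 2, 10, 50):
    a, b = history[n - 1]
    print(f"iteration {n:2d}: PR(A)={a:.6f}  PR(B)={b:.6f}")
```

Each pass shrinks the remaining error by a constant factor, which is why a few dozen passes are enough in this small case even though no pass is ever exact.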

One thing to bear in mind is that the results we get from the calculations are proportions. The figures must then be set against a scale (known only to Google) to arrive at each page's actual PageRank. Even so, we can use the calculations to channel the PageRank within a site around its pages so that certain pages receive a higher proportion of it than others.
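To see how link structure channels PageRank within a site, the same equation can be iterated over a small link graph. The sketch below is an assumption-laden illustration: the three-page site (`home`, `about`, `products`) is hypothetical, every page starts at 1.0, and the results are the raw proportions, not values on Google's scale.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(t)/C(t)) over a whole link graph.

    links: dict mapping each page to the list of pages it links to.
    """
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            # Sum the share passed on by every page that links to `page`.
            share = sum(
                ranks[src] / len(outs)
                for src, outs in links.items()
                if page in outs
            )
            new_ranks[page] = (1 - d) + d * share
        ranks = new_ranks
    return ranks


# Hypothetical site: both inner pages link back to the home page only.
site = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
}
ranks = pagerank(site)
for page, value in ranks.items():
    print(f"{page}: {value:.4f}")
```

Because both inner pages pass their whole share to `home`, the home page ends up with the largest proportion, while the total across the three pages stays close to 3.0.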

## 3.4.1 System Analysis

[Diagram not reproduced. Node labels: Page ranking Homepage; Login form; Administrator; User of System; Internal User; External User; Admin Login; Pageindex incharge new user form; Admin Page; Pageindex incharge Page; Member Login; Schedule Master Page.]

## Figure: Decision tree for Page rank.

## 3.4.2 DATAFLOW FOR Page rank

[Flowchart not reproduced. It runs from START to STOP, with connectors A, B, C and D. Actors: Administrator, Super User, Internal User, External User, Pageindex incharge. Steps: Login; Enter ID & Password; Authentication User; Allocation of Privileges; Create, Modify, Delete, Mail, Chat; Add News & Events; Event Is Saved; Displayed Updated File; View Reports; Internal User Creation; Log Monitoring; Online/Offline Communication; Chat; Enter into the Original System (Page rank); Denied Of Services. Decisions: "If Login Successfully" and "Is ID & Password Match?" (Yes/No).]