The World Wide Web continues to grow at an exponential rate, so fetching information about a topic of interest is gaining significance. This poses outstanding scaling challenges for traditional crawlers and search engines. This paper describes a web crawling approach called a "focused crawler". The objective of a focused crawler is to find pages that are pertinent to given keywords. Instead of collecting all available web documents related to a query, a focused crawler analyses its crawl boundary to pursue the links that are highly pertinent to the crawl and avoids inappropriate links, which leads to significant savings in hardware as well as network resources. In this paper, an efficient focused crawling method is implemented to enhance the quality of web navigation. First, the query is given to the Google search engine, and the top k result pages are retrieved as seed URLs. Based on these seed URLs, a Topic Specific Weight Table is created. The crawler then extracts the first link from the frontier, fetches the corresponding web page, parses the page, and retrieves the links present in that page one by one. Before inserting a link into the frontier, the crawler first checks whether it is a duplicate. If the link is a duplicate, it is simply skipped; otherwise the URL score is calculated from several factors, such as the link's description in the Google search engine and its anchor-text relevancy, by computing the similarity of the description with the given query or topic keywords. The relevancy score is calculated using the vector space model, and the URL queue is optimised by discarding duplicate links.
Keywords - WWW, Focused web crawler, TF-IDF, Query specific search, Queue optimisation.
I. INTRODUCTION
The World-Wide Web now contains over 25 billion pages, and its growth so far shows no sign of ending. Dynamic content on the web is growing as time-sensitive information such as news, financial data, and entertainment becomes widely dispersed through the web. Search engines are therefore increasingly challenged when trying to keep their indices current using exhaustive crawling. Web crawlers are the heart of search engines; they update the database as pages are added to or removed from the web. Because of the growing and dynamic nature of the web, it has become a challenge to traverse all URLs in web documents and to handle them. A focused crawler is an agent that targets a particular topic and visits and gathers only relevant web pages.
Focused crawlers [2, 3] aim to search and retrieve only the subset of the world-wide web that pertains to a precise topic of relevance. The perfect focused crawler retrieves the maximal set of relevant pages while traversing the minimal number of irrelevant documents on the web. Focused crawlers thus offer a potential solution to the currency problem by allowing standard exhaustive crawls to be supplemented with focused crawls for categories whose content changes quickly. A focused crawler has two main tasks: (i) to find the relevancy of a specific web page to the given topic, and (ii) to decide how to proceed from the seed pages. A previously proposed focused crawling strategy is founded on the perception that relevant pages mostly contain relevant links. Unfortunately, such crawlers exhibit a disadvantage: when the pages related to a given topic are not directly connected, crawling may stop prematurely. Focused crawlers are well suited to efficiently generating indices for niche search engines maintained by portals and user groups, where limited bandwidth and storage space are the norm. Also, because a good focused crawler uses limited resources, users already run personal-computer-based implementations. Ultimately, focused crawlers may become the method of choice for users performing comprehensive searches of web-related materials.
The crawler is kept focused through a crawling approach that determines the relevancy of a web page to the predefined topic keywords; based on this relevancy, a judgment is made about whether to download the page. Focused crawlers have also been helpful for other applications, such as distributed processing of the whole web, with each crawler allocated to a small part of the web.
The rest of the paper is structured as follows. Section II presents an appraisal of related work. Section III discusses the system architecture. Section IV describes the proposed approach in detail. Section V presents the algorithm, and Section VI discusses the performance analysis. Finally, we conclude the paper with future work.
II. RELATED WORK
S. Chakrabarti et al. observe that topic-oriented crawlers try to focus the crawling process on pages pertinent to the topic or query. They keep the overall number of downloaded web pages to a minimum while maximizing the percentage of pertinent pages [12, 13]. Their performance mainly depends on the selection of starting URLs or starting pages, also known as seed pages or seed URLs. In general, users provide a set of seed pages as input to the crawler; otherwise, seed pages are selected from the top answers returned by a web search engine when the topic is issued as a query [10, 11]. High-quality seed pages are either pages pertinent to the topic or pages from which pertinent pages can be reached within a small number of routing hops.
S. Chakrabarti et al. state that a focused crawler efficiently seeks out documents about a specific topic and guides the search based on both the content and the link structure of the web. A focused crawler uses a strategy that associates a score with each link in the pages it has downloaded [12, 13, 14]. The links are sorted by score and inserted into a queue. This strategy ensures that the focused crawler preferentially pursues promising crawl paths.
A focused crawler is a computer program for finding information related to a specific topic on the WWW. The main goal of focused crawling is that the crawler selects and retrieves only pertinent pages and does not need to gather all web pages. As the crawler is only a computer program, it cannot know in advance how pertinent a web page is. In an attempt to find pages of a specific type or on a specific topic, focused crawlers aim to recognize links that probably lead to target documents and to pass up links that lead off topic. The Fish-Search and Shark-Search algorithms were previously used for crawling with the topic keywords mentioned in the query.
The Fish-Search algorithm is driven by the query. Starting from a set of seed pages, it follows only those pages whose content is similar to the given query, together with their related pages. Fish-Search allocates binary priority values (0 for not relevant, 1 for relevant) to candidate pages using simple keyword matching; hence, all pertinent pages are assigned the same priority value. Shark-Search is a refinement of Fish-Search in which a child link inherits a fraction of the score of its parent link, and this score is combined with a value that depends on the anchor text surrounding the link in the web page. The Shark-Search method uses the Vector Space Model (VSM) to allocate non-binary priority values to candidate pages.
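The score-inheritance idea behind Shark-Search can be illustrated with a small sketch. The parameter names and values (`decay`, `anchor_weight`) are our own illustrative assumptions, not those of the original algorithm:

```python
# Sketch of Shark-Search-style priority inheritance. The parameter
# names and default values are illustrative assumptions, not the
# original algorithm's constants.
def child_score(parent_score, anchor_similarity, decay=0.5, anchor_weight=0.8):
    """Combine a non-binary anchor-text similarity (in [0, 1]) with a
    decayed fraction of the parent link's score."""
    inherited = decay * parent_score
    return anchor_weight * anchor_similarity + (1 - anchor_weight) * inherited

# A relevant parent plus a highly relevant anchor yields a high priority.
print(child_score(parent_score=0.9, anchor_similarity=0.8))
```

Unlike Fish-Search's binary 0/1 priorities, this produces a continuous score, so candidate pages can be ranked rather than merely accepted or rejected.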
A focused crawling process consists of two important stages. The first is to decide the starting URLs and identify the user's interest, because the crawler cannot traverse the web without starting URLs. The second is the crawling technique itself. A focused crawler carefully chooses a direction in which to traverse the web. The best route-selection technique is to order the URLs so that the most pertinent ones sit at the front of the queue; the queue is then sorted by the relevancy of the URLs. The efficiency and performance of a focused crawler are primarily decided by this ordering strategy, which determines the order of page retrieval.
III. SYSTEM ARCHITECTURE
Fig. 1 shows the system architecture. The keyword given by the user first goes to the Google search engine, which produces a set of result pages. The top K pages are the seed URLs for our focused crawler. The focused crawler creates the Topic Specific Weight Table from these seed URLs and inserts them into the frontier, the list of unvisited URLs that the crawler maintains at all times. The crawler extracts the first link from the frontier and fetches the corresponding web page. It then parses the page and extracts its links one by one. Before placing a link into the frontier, the crawler first checks whether the link already exists there. If it exists, it is a duplicate, so the crawler skips it and extracts the next link from the page. If the link is not a duplicate, the crawler checks whether it is relevant to the topic by calculating its link score. If the link is relevant, it is added to the frontier; otherwise it is skipped.
Fig. 1 System Architecture
IV. PROPOSED APPROACH
Extraction of Seed URL
We make use of the Google search engine to obtain the seed URLs. First, we give the user query to the Google search engine, and its top k result pages are taken as the seed URLs.
Construction of Topic Specific Weight Table
The weight table states the crawling goal. To create the Topic Specific Weight Table, we take the seed URLs and fetch the web pages corresponding to them. These pages are parsed while discarding ineffective words, called stop words, such as articles and prepositions, and Porter's stemming algorithm is then used to stem the remaining words. For example, write and writing share the common word stem write and are different occurrences of the same word.
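As an illustration, this preprocessing step can be sketched as below. The stop-word list is a tiny assumed sample, and `crude_stem` is a deliberately simplified suffix stripper standing in for Porter's full algorithm:

```python
# Simplified preprocessing sketch: stop-word removal followed by a
# crude suffix stripper. This is NOT Porter's full algorithm, only an
# illustration of the stemming idea; the stop-word list is a tiny sample.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is"}

def crude_stem(word):
    # Strip a few common suffixes, keeping at least a 3-letter stem.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [w.lower() for w in text.split() if w.isalpha()]
    return [crude_stem(w) for w in tokens if w not in STOP_WORDS]

print(preprocess("Writing the rules of cricket"))  # ['writ', 'rule', 'cricket']
```

In a real implementation one would use a full Porter stemmer and a complete stop-word list rather than these toy rules.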
Term Frequency - Inverse Document Frequency, generally known as TF-IDF, is a numerical statistic. It measures the significance of a word to a document in a collection.
1. Term Frequency (TF): The term frequency is the weight allocated to each term in a document, based on how often that term occurs in the document:

tf(t, d) = freq(t, d)

where freq(t, d) is the number of occurrences of term t in document d. After calculating tf(t, d) for each word, the frequency must be normalised to avoid a bias towards longer documents. Normalisation divides the frequency by the maximum frequency of any term within the document:

tf(t, d) = freq(t, d) / max{ freq(k, d) : k ∈ d }
2. Inverse Document Frequency (IDF): Certain terms have little power in differentiating documents, for example stop words, i.e. meaningless terms such as pronouns, articles, and prepositions. The IDF dampens the weight of such terms:

idf(t, D) = log( |D| / |{ d ∈ D : t ∈ d }| )

where |D| is the total number of documents in the corpus and |{ d ∈ D : t ∈ d }| is the number of documents in which the term appears, i.e. tf(t, d) ≠ 0. If the term is not present in the collection, this creates a division by zero; the denominator is therefore changed to

1 + |{ d ∈ D : t ∈ d }|
The tf-idf is then computed as

tfidf(t, d, D) = tf(t, d) × idf(t, D)

In other words, tfidf(t, d, D) assigns a weight to term t in document d.
The first m words with the highest weights are taken as the topic keywords. The weights are then normalised as

Wi = Wti / Wtmax          (5)

where Wti and Wtmax are the weights of keyword i and of the keyword with the highest weight, respectively.
For instance, take the topic keyword "tennis". We retrieved the top 5 results from Google; these 5 links act as the seed URLs. Using TF-IDF we then construct the Topic Specific Weight Table shown in Table I, which contains the top 10 weighted words. Before the words are placed into the table, they are normalised using equation (5).
TABLE I. TOPIC SPECIFIC WEIGHT TABLE
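The construction of the weight table can be sketched as follows. The seed documents here are toy token lists rather than parsed Google results, and combining per-document weights by taking the maximum is an assumption of this sketch, since the paper does not specify the aggregation:

```python
# Sketch of building the Topic Specific Weight Table from tokenized
# seed pages, following the tf, idf (smoothed denominator), and
# max-weight normalization (eq. 5) given above. Toy data; the
# max-aggregation across documents is this sketch's own assumption.
import math
from collections import Counter

def tfidf_weight_table(docs, top_m=3):
    n_docs = len(docs)
    df = Counter()                        # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = {}
    for doc in docs:
        counts = Counter(doc)
        max_freq = max(counts.values())
        for term, freq in counts.items():
            tf = freq / max_freq                     # normalized term frequency
            idf = math.log(n_docs / (1 + df[term]))  # idf with smoothed denominator
            weights[term] = max(weights.get(term, 0.0), tf * idf)
    # Keep the top-m weighted terms and normalize by the highest weight (eq. 5).
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_m]
    w_max = top[0][1] if top and top[0][1] > 0 else 1.0
    return {term: w / w_max for term, w in top}

seed_docs = [["tennis", "racket", "tennis", "court"],
             ["tennis", "player", "serve"],
             ["football", "goal", "player"]]
print(tfidf_weight_table(seed_docs))
```

Note that with the smoothed denominator, terms appearing in every seed document receive low idf; in the paper's setting the table is built from many pages, so topic keywords still surface with high weights.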
Calculation of Relevancy of page
The weight of the words in the page that match the keywords in the Topic Specific Weight Table is calculated with the same technique used to build the weight table itself. To calculate the relevancy of a page with respect to a particular topic, we use the cosine similarity measure:

Relevancy(t, p) = Σk (wkt · wkp) / ( sqrt(Σk wkt²) · sqrt(Σk wkp²) )

where t is the topic specific weight table, p is the web page under examination, and wkt and wkp are the weights of keyword k in the weight table and in the web page, respectively.
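The cosine measure can be sketched directly; the keyword weights below are illustrative values, not taken from the paper's tables:

```python
# Sketch of the cosine relevancy measure between the topic weight
# table and a page's term-weight vector. Illustrative weights only.
import math

def relevancy(weight_table, page_weights):
    keys = set(weight_table) | set(page_weights)
    dot = sum(weight_table.get(k, 0.0) * page_weights.get(k, 0.0) for k in keys)
    norm_t = math.sqrt(sum(w * w for w in weight_table.values()))
    norm_p = math.sqrt(sum(w * w for w in page_weights.values()))
    if norm_t == 0.0 or norm_p == 0.0:
        return 0.0   # an empty vector has no defined direction
    return dot / (norm_t * norm_p)

print(relevancy({"tennis": 1.0, "racket": 0.5}, {"tennis": 0.8, "court": 0.3}))
```

The result lies in [0, 1] for non-negative weights: 1 for identical directions, 0 when the page shares no keywords with the weight table.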
Calculation of Link Score
In our approach, the link score is used to assign scores to the unvisited links extracted from a downloaded page. The score is assigned based on the link's Anchor_Relevancy_Score and its Relevancy_Score_URL_Description.
Link_Score(i) = Anchor_Relevancy_Score(i) + [Relevancy(P1) + Relevancy(P2) + … + Relevancy(Pn)]
where Anchor_Relevancy_Score is the relevancy score between the anchor text and the topic keywords; our approach uses it because the anchor text conveys some information about the URL. Relevancy_Score_URL_Description is the relevancy score of the URL description with respect to the topic keywords. Table II shows the weight of the topic keywords in the descriptions of the URLs.
The efficiency of the focused crawler depends not only on crawling the maximum number of significant pages but also on the time required to complete the crawling process. We can reduce this time by reducing the size of the frontier. Many pages on the web point to the same page, which produces duplicate links and inflates the frontier. Therefore, whenever a new link arrives, before calculating its link score we first check whether the link already exists in the frontier. If it is a duplicate, we skip it and consider the next link. By avoiding duplicate links in this way, we keep the frontier small.
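This duplicate check can be made O(1) by backing the frontier queue with a set of already-seen URLs, as in the following sketch:

```python
# Sketch of the queue-optimization step: a frontier backed by a set of
# already-seen URLs so duplicate links are skipped in O(1).
from collections import deque

class Frontier:
    def __init__(self):
        self.queue = deque()   # unvisited URLs, in crawl order
        self.seen = set()      # every URL ever enqueued

    def enqueue(self, url):
        if url in self.seen:   # duplicate: skip it, keeping the frontier small
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def dequeue(self):
        return self.queue.popleft() if self.queue else None

f = Frontier()
f.enqueue("http://example.com/a")
f.enqueue("http://example.com/a")   # duplicate, silently skipped
print(len(f.queue))                 # 1
```

Keeping the seen-set separate from the queue means a URL stays recognized as a duplicate even after it has been dequeued and crawled.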
V. FOCUSED CRAWLING ALGORITHM WITH QUEUE OPTIMISATION
Step 1: Input Keyword and Maximum URL to Crawl = MaxURL
Step 2: Retrieve the top M pages from Google and enqueue their links in the Queue/frontier
Step 3: Create Topic specific weight table using TF-IDF
Step 4: Dequeue link from Queue/frontier and fetch respective page
Step 5: Parse the Web page
Step 6: Extract all links from the page
Step 7: Check for duplicate links; if a link is a duplicate, skip it; if not, then
Step 8: Assign a score to each extracted non-duplicate link based on
a. Title relevancy score,
b. Relevancy score of the URL description,
c. Anchor text relevancy score
Step 9: Rank the links based on link score and remove the bottom non-relevant links
Step 10: Enqueue the selected links in the Queue/frontier
Step 11: If Crawled Link = MaxURL then stop crawling
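The steps above can be sketched in Python. Fetching, parsing, and scoring are stubbed with a toy link graph and a placeholder score function (both our own illustrative assumptions), and Step 9's ranking is reduced to a simple threshold filter:

```python
# Sketch of the focused-crawling loop (Steps 1-11). Page fetching,
# parsing, and scoring are stubbed with toy data; a real crawler would
# hit the network and apply the TF-IDF / cosine measures above.
from collections import deque

TOY_WEB = {  # hypothetical link graph standing in for fetched pages
    "seed": ["a", "b", "a"],   # note the duplicate link to "a"
    "a": ["c"],
    "b": [],
    "c": [],
}

def link_score(url):
    # Placeholder for anchor/description/title relevancy scoring.
    return 1.0 if url != "b" else 0.0   # pretend "b" is off-topic

def focused_crawl(seed, max_urls, threshold=0.5):
    frontier, seen, crawled = deque([seed]), {seed}, []
    while frontier and len(crawled) < max_urls:
        url = frontier.popleft()                 # Step 4: dequeue a link
        crawled.append(url)                      # fetch + parse (stubbed)
        for link in TOY_WEB.get(url, []):        # Step 6: extract links
            if link in seen:                     # Step 7: skip duplicates
                continue
            seen.add(link)
            if link_score(link) >= threshold:    # Steps 8-9: score and filter
                frontier.append(link)            # Step 10: enqueue
    return crawled                               # Step 11: stop at MaxURL

print(focused_crawl("seed", max_urls=10))
```

On this toy graph the crawler visits the seed and the on-topic pages "a" and "c", skipping the off-topic page "b" and the duplicate link to "a".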
VI. PERFORMANCE ANALYSIS
The purpose of these experiments is to study the performance of the crawler on different topics. Each topic is issued as a query to Google, and the results are inspected on the basis of the relevancy of each link and its link score. Only if the link score is greater than the threshold value is the link added to the frontier (URL queue).
In our experiments, the performance of the crawler depends on the size of the frontier: as the frontier shrinks, performance improves. The idea is that a smaller frontier contains fewer links, so crawling takes less time and the crawler is more efficient. By applying the techniques above, we therefore try to minimize the frontier size and enhance the efficiency of the crawler.
The following crawlers are compared:
1) Non Focused Crawlers:
Breadth First Crawler
2) Classic Focused Crawlers:
a) Focused Crawler without Queue Optimisation
b) Focused Crawler with Queue Optimisation
For the experiments, the chosen topics are
All crawlers were initialized using the same set of seed pages.
We illustrate a crawled result on a two-dimensional graph.
Topic : Cricket
Fig.2 : Performance of three crawlers for the topic Cricket
Fig. 2 shows the performance of the three crawlers for the topic "Cricket": BFS (the Breadth First Search crawler), NFC (the non-optimised focused crawler), and OFC (the optimised focused crawler). The X-axis represents the maximum number of URLs to crawl, i.e. the number of result pages we want after giving a query to the Google search engine. The Y-axis represents the total number of crawled pages. Suppose we want MaxURL = 15. To fulfil this requirement, BFS crawled a total of 4029 pages, NFC crawled 3318 pages, and OFC crawled 1384 pages. The fewer pages crawled, the more efficient the crawler.
The BFS crawler crawled 4029 pages, many of which are not relevant to the topic "cricket" but were crawled anyway because this algorithm performs no relevancy check on the pages. The NFC crawler crawled 3318 pages, saving some links from being crawled by checking the relevancy of each page to the topic; it performs noticeably better than BFS. The OFC crawler crawled only 1384 pages, because it optimises the queue by avoiding duplicate links; in total it saved 1934 duplicate links from being crawled. The comparison in Fig. 2 demonstrates the poor performance of the breadth-first crawler, while OFC outperforms both BFS and NFC, making it the most efficient crawler.
Success Ratio Table:
The comparison between a conventional crawler and the focused crawler can also be expressed as a success ratio: the number of unwanted links saved by the focused crawler over the conventional crawler.
SUCCESS RATIO TABLE
Columns: Links Crawled using BFS | Links Crawled using FC without Queue Optimisation | Links Crawled using FC with Queue Optimisation | Success Ratio without Queue Optimisation | Success Ratio with Queue Optimisation
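Under one plausible reading of the success ratio, the fraction of links saved relative to BFS, the Cricket figures from Section VI (4029, 3318, and 1384 crawled pages) give the values below; the exact formula is our assumption, since the paper does not spell it out:

```python
# One plausible reading of the success ratio: the fraction of links the
# focused crawler saved relative to the conventional (BFS) crawler.
# The formula is an assumption; the figures are the paper's Cricket run.
def success_ratio(bfs_links, fc_links):
    return (bfs_links - fc_links) / bfs_links

print(success_ratio(4029, 3318))  # without queue optimisation
print(success_ratio(4029, 1384))  # with queue optimisation
```

By this reading, queue optimisation roughly triples the share of unwanted links avoided (about 0.18 versus about 0.66).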
Success Ratio Graph:
Fig. 3 shows the success-ratio graph of the focused crawler with and without queue optimisation over the non-focused crawler.
Topic : Cricket
Fig.3 : Success Ratio for the topic Cricket
VII. CONCLUSION
A focused crawler is a web crawler that aims to download web pages pertinent to a predefined query or topic. To determine whether a web page is related to a particular topic, our focused crawler uses a link-score technique. Link scoring decides which links are to be crawled and assigns priorities among them. Accurate selection of the pertinent links provides a performance enhancement in both page relevance and execution time.
In this paper, a focused crawler is implemented using several link-score techniques and queue-optimisation techniques. The queue is optimised on the basis of duplicate links and content similarity. Applying these techniques provides a substantial performance improvement.