The Vertical Web Search Computer Science Essay

A web crawler is software that browses and downloads unstructured web data methodically and in an automated manner. Ant, spider, bot and robot are a few other names for it. A web crawler downloads each page it visits and stores it for later processing, mainly so that search engines can index the pages and provide fast web search. The crawler takes a URL as a seed, fetches and parses the page, extracts the URLs it contains, and repeats the process for each of them.

Keywords: Web Crawler, Web Crawling Issues, Web Crawler Prevention.

Introduction

Web crawlers retrieve data before it is indexed. At first glance a crawler does not seem to have much work to do, but at the scale of the web even the tiniest problems become big ones. Crawlers are not responsible for processing and indexing the data, yet gathering the data to that point is hard work on its own. The steps of a crawling process are listed below [1], and a small code sketch of this loop follows the figure:

Let S be the set of pages we want to index

S is initially {p}, where p is the seed page

Take a page p from S

Parse p to obtain the set of links L contained in p

Now S = S + L - p

Repeat as many times as required

Fig 1. Web Crawler Architecture [2]
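
The steps above can be sketched as a simple crawl loop. The following is a minimal illustration in Python, not a production crawler; the crude regular-expression link extraction and the lack of error handling are simplifications made for brevity.

# Minimal sketch of the crawl loop described above (illustrative only).
from urllib.request import urlopen
from urllib.parse import urljoin
import re

def fetch_page(url):
    # Download the raw HTML of a page (no error handling for brevity).
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def extract_links(base_url, html):
    # Very crude link extraction; a real crawler would use an HTML parser.
    return {urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)}

def crawl(seed, max_pages=100):
    frontier = {seed}              # S, initially {p}
    visited = set()
    while frontier and len(visited) < max_pages:
        page = frontier.pop()      # take a page p from S
        if page in visited:
            continue
        html = fetch_page(page)
        links = extract_links(page, html)   # parse p to obtain links L
        visited.add(page)
        frontier |= links - visited         # S = S + L - p
    return visited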

Web crawlers handle huge amounts of data, especially those that run on a global scale. It is important that they do not paralyse the web sites they visit, so that end users can still access them. Four policies help to ensure this, and crawler algorithms should behave in the manner those policies define.

The selection policy aims to visit the more important web pages first.

The re-visit policy specifies that the data gathered from web sites should be kept as up to date as possible.

The politeness policy states that a web site should remain accessible while it is being crawled; the crawler should not disrupt it.

The parallelization policy states how distributed crawlers coordinate with each other.

General Web Search

General web search engines must balance quality and coverage. Web crawlers have finite resources, so a careful combination of the policies defined above helps to reach the needed balance.

Vertical Web Search

Vertical search means the crawl aims to gather a particular subset of information from the web. This subset can be defined linguistically, topically or geographically.

Focused Crawling

Focused crawling takes a description of a topic or a set of example documents as input. Only the pages that are relevant to the specified topic get crawled.

Web Characterization

Web characterization is the derivation of statistical information about a set of web sites. It is a difficult issue for web crawlers, because it is hard to determine whether the sample taken is representative of those sites.

Mirroring

Mirroring is keeping a partial or complete copy of a web site. Its purpose is to share the load across servers and shorten the response time of the web site.

Features a Crawler Should Provide

A few characteristics of the web itself affect crawlers badly and make crawling difficult to perform: the excessive volume of the web, the dynamic generation of pages and the web's fast rate of change. These difficulties must be taken into consideration when developing a web crawler if we want to avoid webmasters blocking it.

Freshness

How recent is the stored object? It is important to keep stored pages up to date; in many cases it is useless to serve an older copy of a page. It is therefore important to revisit pages to detect changes, in addition to discovering new pages.

Quality

Crawlers differ from each other. Some aim to store topic-centric, high-quality data, while others are intended to cover a wide range of the web, with the quality of the stored data varying accordingly.

Fig 2. Types of Crawlers [3]

Volume

The web is a large space to discover, and this brings a real problem for web crawlers. Storing high-quality or high-quantity data fills a lot of space, and limited bandwidth also limits how many files can be downloaded in a given time. Solutions such as prioritization appear at that point.

Efficiency

The re-visit policy of a web crawler is very important for webmasters and end users, because if a crawler revisits a page too frequently, complaints about low bandwidth begin. Crawlers should not consume too many resources of the sites they are crawling.

Crawling Policies

Selection Policy

Large search engines have to collect large amounts of data to return better search results to users, but they can only ever reach a portion of the web. To get better results, crawlers need metrics for prioritization, and this is done by estimating the importance of web pages.

There are various methods for prioritizing web pages by importance [4]. For example, Cho used PageRank to order pages by importance. Abiteboul designed a strategy based on the OPIC (On-line Page Importance Computation) algorithm; briefly, OPIC distributes "cash" from a page among the pages it links to. Chakrabarti came up with the idea of focused crawling: the web is enormous and it is hard to take a sample that represents it all, so the crawler focuses on similar pages and downloads only them, saving precious time and storage.
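
As an illustration of such prioritization, the sketch below keeps the crawl frontier in a priority queue ordered by an importance score. The score_page function is a hypothetical placeholder for a real metric such as PageRank, OPIC cash or topical similarity; it is not any of those algorithms itself.

import heapq

def score_page(url):
    # Hypothetical importance estimate; in practice this would come from
    # PageRank, OPIC cash, or topical similarity for a focused crawler.
    return 1.0

class PriorityFrontier:
    """Frontier that always yields the most important known URL first."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so negate the score to pop the highest first.
            heapq.heappush(self._heap, (-score_page(url), url))

    def pop(self):
        _, url = heapq.heappop(self._heap)
        return url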

Re-visit Policy

Crawlers spend a lot of time crawling a set of pages; it can take weeks or months. By the time a crawl finishes, some pages have already become outdated. Keeping indexed pages as up to date as possible is therefore an important feature.

Not having a fresh copy of a page is considered a cost. This cost is measured by either the page's freshness or its age.

F_p(t) = 1 if the local copy of p is equal to the live page at time t, 0 otherwise (1)

The freshness of a page is calculated by comparing the local copy with the page online. If the local copy is equal to the online page, the page is fresh; otherwise it is not. This is a binary metric.

A_p(t) = 0 if p has not been modified since its last download, t - (time of the first modification of p after the last download) otherwise (2)

If the page has not been modified since it was downloaded, its age is zero; otherwise the age is how much time has passed since the first modification of the page after its last download.
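
A minimal sketch of the two metrics, assuming we can hash the stored and live copies and that we know when the page was first modified after its last download (both assumptions, since the source does not fix an implementation):

from datetime import datetime

def freshness(local_copy_hash, live_copy_hash):
    # Binary freshness as in (1): 1 if the stored copy equals the live page, else 0.
    return 1 if local_copy_hash == live_copy_hash else 0

def age(first_modification_after_download, now=None):
    # Age as in (2): 0 while the page is unmodified since the last download,
    # otherwise the time elapsed since that first modification.
    if first_modification_after_download is None:
        return 0.0
    now = now or datetime.utcnow()
    return (now - first_modification_after_download).total_seconds()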

Politeness Policy

Crawlers can reach data much faster than humans. The power to process data more quickly brings with it the possibility of overloading servers, because a crawler issues requests one after another and downloads files in a short time. Even a single crawler can exhaust a server, and multiple crawlers can make it unresponsive.

Crawlers definitely come with a cost. They consume server bandwidth extensively when retrieving data, and their consecutive or simultaneous requests increase the response time for user requests. Poorly written crawlers can even crash servers or routers when handling the files they were meant to download.
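
A common way to respect politeness is to enforce a minimum delay between two requests to the same host. The sketch below blocks until enough time has passed; the 30-second default is an assumption, not a standard value.

import time
from urllib.parse import urlparse

class PolitenessThrottle:
    """Blocks until enough time has passed since the last request to a host."""
    def __init__(self, delay_seconds=30.0):
        self.delay = delay_seconds
        self.last_access = {}   # host -> timestamp of last request

    def wait(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_access.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_access[host] = time.time()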

Parallelization Policy

This policy concerns only parallel, and especially distributed, crawlers. These crawlers use multiple processes to gather data and must handle situations such as receiving the same URL from two different pages, or skipping a page another process has just crawled. The aim is to increase the download rate.
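
One simple coordination scheme, sketched below under the assumption of a fixed number of crawler processes, is to assign each host to exactly one process by hashing the host name; two processes then never fetch the same URL, and per-host politeness stays in one place.

import hashlib
from urllib.parse import urlparse

def assigned_process(url, num_processes):
    # Hash the host name so every URL of a site goes to the same process.
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_processes

def keep_for_me(url, my_id, num_processes):
    # A process only crawls URLs assigned to it and forwards the rest.
    return assigned_process(url, num_processes) == my_id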

Issues In Web Crawlers

Networking

Variable QoS

Crawlers retrieve data from many pages continuously, downloading pages or files from web sites. Sometimes the response time of a server is too high, or the server is not responding at all because it is down. Instead of permanently skipping pages that cannot be reached at the moment, the crawler should re-try them later. The reason is that some sites may be down for long periods such as hours, days or weeks; if the crawler re-tries a couple of times, the fraction of unreachable pages in a set will be lower.
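
A sketch of such a re-try strategy; the number of attempts and the waiting times below are illustrative, and a real crawler would re-queue the URL rather than sleep in place.

import time
from urllib.request import urlopen
from urllib.error import URLError

def fetch_with_retries(url, attempts=3, wait_seconds=(60, 3600, 86400)):
    # Try again after increasing delays (1 minute, 1 hour, 1 day) instead of
    # dropping the page after the first failure.
    for attempt in range(attempts):
        try:
            with urlopen(url, timeout=30) as response:
                return response.read()
        except URLError:
            if attempt == attempts - 1:
                raise
            time.sleep(wait_seconds[attempt])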

Web Server Administrators Concerns

When web crawlers first appeared back in the 1990s they caused suspicion, and web site administrators complained about them because they consumed valuable bandwidth and threatened system security. Some of those concerns are still in place.

A lot of web sites use hosting providers, so many sites share the same IP address. When a crawler requests too much from a set of pages hosted by the same provider, it may be banned from all of those sites.

The disturbance will trigger alarms. Rather than getting banned, it is better for the crawler to identify itself as a crawler and provide contact information for its administrator in the first place. Another good way to avoid sounding the alarms is to use delays between repeated accesses to the same pages.
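
Identification is usually done through the User-Agent header; a descriptive value with a contact URL or e-mail address lets an administrator reach the operator before resorting to a ban. The crawler name and addresses below are made up for illustration.

from urllib.request import Request, urlopen

# Hypothetical crawler name and contact details, used only as an example.
USER_AGENT = "ExampleCrawler/1.0 (+http://example.org/crawler; admin@example.org)"

def identified_fetch(url):
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=30) as response:
        return response.read()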

Inconsistent Firewall Configurations

Some web servers have inconsistent firewall configurations. There are situations where a server accepts a TCP (Transmission Control Protocol) connection from the crawler and the subsequent write succeeds, but no result ever comes back. This can be handled by defining an expiration time (timeout) for network operations.
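
A minimal sketch of such an expiration time, using a default socket timeout so that no network operation can hang forever; the 30-second value is an assumption to be tuned per crawl.

import socket
from urllib.request import urlopen
from urllib.error import URLError

# Give up on any network operation that makes no progress for 30 seconds.
socket.setdefaulttimeout(30)

def fetch(url):
    try:
        with urlopen(url) as response:
            return response.read()
    except (socket.timeout, URLError):
        return None   # treat the page as temporarily unreachable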

Massive DNS Resolving

Crashing Local DNS

The local DNS server might crash when it is overloaded. Without any notification, the crawler will not be aware of the DNS failure and will act as if the web pages do not exist, making the crawling cycle useless. To prevent that, a predefined failure rate should be used: if the fraction of DNS lookups returning failures exceeds it, the crawling cycle is cancelled. Limiting the number of lookups sent to the same DNS server within a given time is also useful.

Temporary DNS Failures

In small organizations the web server and the DNS server are usually administered by the same person, and they might even be on the same physical machine. When DNS fails, the default caching policy (one week) means the failure can go unnoticed for days. To achieve high coverage, the crawler should re-try to resolve the record about one week after a failed DNS lookup.

Malformed DNS Records

A significant portion of DNS records are malformed [5]. They might contain inconsistent serial numbers or misspellings. DNS resolving should be error tolerant so that malformed records can still be handled, and resolvers should still try to retrieve an IP address for the web page address.

Wrong DNS Records

Wrong DNS records can occur accidentally or maliciously. When a DNS record points to the wrong IP address, indexing will be faulty, and it may prevent the actual web site from being downloaded because it will be treated as a duplicate. Even after the wrong DNS record is fixed, search results for the actual web site may still redirect to the wrong one.

Use of the "www" Prefix

Web sites usually have both "www.url.com" and "url.com" addresses, and both resolve to the same IP; that is the expected behaviour. Sometimes, however, the two addresses serve slightly different content, e.g. different advertisements, so the crawler cannot detect them as duplicates by content alone. This problem is solved by treating both addresses as one web page.
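
A minimal normalization sketch that treats "www.url.com" and "url.com" as the same host before URLs are compared or stored:

from urllib.parse import urlparse, urlunparse

def normalize_host(url):
    # Strip a leading "www." so both forms map to one canonical page key.
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return urlunparse(parts._replace(netloc=host))

# normalize_host("http://www.url.com/a") == normalize_host("http://url.com/a")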

HTTP Implementations

Accept Headers Not Honoured

At first sight it is not possible to determine whether a URL ending in ".pdf" is a downloadable file or a dynamically generated page. User agents such as web browsers or web crawlers cannot handle all types of data, so they issue requests that specify which types they can handle. This is given in the Accept header of the request.

Accept: text/html (3)

Most web browsers make requests containing */* in their Accept headers. Web servers therefore tend to ignore the Accept header, and crawlers find themselves downloading data they cannot handle. The Content-Type header of the response then helps identify the content that was actually downloaded.
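
A sketch of that defensive pattern: send an Accept header, but verify the Content-Type of what actually came back before trying to parse it.

from urllib.request import Request, urlopen

def fetch_html_only(url):
    # Ask politely for HTML, but check the Content-Type of the response,
    # since many servers ignore the Accept header.
    request = Request(url, headers={"Accept": "text/html"})
    with urlopen(request, timeout=30) as response:
        content_type = response.headers.get("Content-Type", "")
        if "text/html" not in content_type:
            return None   # skip content the crawler cannot handle
        return response.read()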

Range Errors

As the politeness policy states, the crawler should not disturb the web sites it crawls. To achieve that, crawlers have limits that tell them how much data they may download from a single page. This is defined in the Range header.

Range: 0-300000 (4)

A typical limit is 300-400 KB per page. In some cases, however, smaller pages respond with an HTTP 416 status code, which signals a range error. To handle that, the crawler should repeat the request without a Range header after receiving an HTTP 416 error.
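
A sketch of that fallback, assuming a 300 KB limit; the standard library raises HTTPError for the 416 status, which is caught and followed by a retry without the Range header.

from urllib.request import Request, urlopen
from urllib.error import HTTPError

MAX_BYTES = 300_000   # assumed per-page download limit

def fetch_limited(url):
    request = Request(url, headers={"Range": "bytes=0-%d" % MAX_BYTES})
    try:
        with urlopen(request, timeout=30) as response:
            return response.read()
    except HTTPError as error:
        if error.code == 416:
            # Some servers reject ranges on small pages; retry without one.
            with urlopen(Request(url), timeout=30) as response:
                return response.read(MAX_BYTES)
        raise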

Response Lacking Headers

Some web sites respond to the user agent's request with some headers missing. Most web browsers handle this by displaying whatever is available; some display an error message instead.

Web crawlers should handle such responses too. The response may be missing headers, or the headers may be incomplete, e.g. a Location header might exist but be empty.

"Found" Instead of "Not Found"

Every user has seen the HTTP 404 (Not Found) error code before. Web site administrators find such situations annoying and prefer to redirect users to friendly visual error pages rather than a bare error message, often without signalling the error condition at all. These pages are called "soft-404" pages. Administrators should not omit the error status of pages that result in an error, otherwise the soft-404 pages will be indexed too.

Wrong Dates in Headers

Crawlers check the Last-Modified date before taking any action on a page. Sometimes, however, the Last-Modified date has an irrational value, e.g. a date in the future, a date decades in the past, or no date at all. In some cases the cause is a wrong time, date or time zone configuration; in other cases the date is falsified intentionally, for example to keep using shareware in trial mode. The solution is simple: if a page's Last-Modified date falls between the invention of the Web and 24 hours from now, the page is treated as modified; otherwise the date is ignored.
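
A sketch of that sanity check; the cutoff date for the invention of the Web is an assumption (taken here as 1991), and parsing uses the standard HTTP date format.

from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime

WEB_EPOCH = datetime(1991, 8, 6)   # approximate birth of the public Web

def plausible_last_modified(header_value, now=None):
    # Accept the date only if it lies between the invention of the Web and
    # 24 hours from now; otherwise treat the header as unusable.
    if not header_value:
        return None
    try:
        modified = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        return None
    modified = modified.replace(tzinfo=None)   # compare naive datetimes
    now = now or datetime.utcnow()
    if WEB_EPOCH <= modified <= now + timedelta(hours=24):
        return modified
    return None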

HTML Coding

Malformed Markup

HTML is often written by hand, by humans. In the absence of a source code editor, writing errors may occur: tags in the wrong order, mixed double and single quotes, or no quotes at all. Such a developer is satisfied as soon as the page renders in their browser. It is better to use a source code editor to avoid situations like that.

Physical Over Logical Content Representation

In most cases HTML files contain both presentation markup and the content itself, which can mislead the crawler when indexing. Style rules should be kept in a separate file to allow better indexing.

Web Content Characteristics

Duplicate Detection

To save valuable storage space, duplicates should be ignored. Two web pages with completely different visual presentation but the same content should be treated as duplicates. One solution is to parse the HTML first, then hash the extracted content of both pages and compare the hashes; this way only one copy of the content is stored. This method does not, however, avoid the unnecessary download itself.
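
A minimal sketch of that approach, using the standard library HTML parser to strip the markup and a hash over the remaining text:

import hashlib
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the textual content, ignoring tags and attributes."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data.strip())

def content_fingerprint(html):
    # Hash the text alone, so pages differing only in layout get the same hash.
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(chunk for chunk in extractor.chunks if chunk)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# Two pages are treated as duplicates when
# content_fingerprint(page_a) == content_fingerprint(page_b).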

Blogging, Mailing Lists, Forums

Mailing lists and forums are large sources of data containing postings by individual users. Such a source is very important when it is the only page where a given topic is discussed. Recently, spammers have started posting links in blog comments, but Google suggested an extension that counters links posted automatically in comments.

Server Application Programming

Embedded Session Ids

Some web sites track the user with a session id embedded in the URL. Two pages differing only in their session ids are most likely duplicates. Crawlers have a few predefined keywords to identify whether a URL parameter is a session id; a few of them are cfid, jsessionid and phpsessid.
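
A sketch that removes such parameters before URLs are compared or stored; the keyword list follows the examples above.

from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

SESSION_KEYS = {"cfid", "jsessionid", "phpsessid"}

def strip_session_ids(url):
    # Drop query parameters that look like session ids so that two URLs
    # differing only in their session id map to the same page.
    parts = urlparse(url)
    query = [(key, value) for key, value in parse_qsl(parts.query)
             if key.lower() not in SESSION_KEYS]
    return urlunparse(parts._replace(query=urlencode(query)))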

Repeated Path Components

Dynamic pages can mistakenly repeat path components in the URL. If the crawler does not recognize this, it treats the URLs as different pages even though they are exactly the same page.

Slower or Erroneous Pages

Dynamically generated web pages are slower than static ones, sometimes by a factor of 10 or even 100. This forces the crawler to keep its connections open for a long time. Using a timeout solves this issue.

Web Crawler Software

Various web crawler software exists in both the public and proprietary domains. Most proprietary crawlers are used by search engines such as Google, Bing or Baidu; they are not open to the public and are specialized for gathering data for those search engines.

There are commercial web crawlers too (e.g. Mozenda, Tricom), but the public, open source ones are mostly preferred.

Dozens of open source web crawlers exist. Some are supported by universities, and some are projects run by people willing to share the data with the public.

A few selected crawlers of the types mentioned above are listed below:

Googlebot: Googlebot is the web crawler which Google search engine uses to collect data. It is written in C++ and Python.

YandexBot: Yandex launched a non-Russian search engine Yandex.com in May 2010. Yandexbot is the web crawler that Yandex uses.

Yahoo! Slurp: Slurp is the name of the web crawler that gathers content for Yahoo! search engine.

Bingbot: Bingbot is a web crawling robot developed by Microsoft to gather data for Bing. It replaced msnbot in October 2010.

Baiduspider: The biggest search engine in China, Baidu, uses Baiduspider as its web crawler.

Majestic-12: Majestic-12 is a distributed search engine project established in late 2004. It gathers data for Majestic-12 search engine.

Dotbot: Dotbot is an open source web crawler. Its development team aims to publish the data they gather publicly, and they provide a link to an index of the World Wide Web obtained from their crawls. Dotbot is written in C and Python.

Nutch: Nutch is an open source web search project conducted by the Apache Software Foundation. The project started in late 2004, and Nutch's web crawler was written from scratch. Search engines such as Krugle, mozDex and search2.net were built with Nutch. It is written in Java.

Heritrix: Heritrix is a web crawler designed for web archiving by the Internet Archive in early 2003. It is written in Java and stores the data it gathers in the ARC file format. Organizations such as the British Library, CiteSeerX and the Smithsonian Institution Archives use Heritrix.

Lemur: The Lemur Project is a collaboration between the University of Massachusetts Amherst and Carnegie Mellon University. It is an open source toolkit for information retrieval, written in C and C++.

Preventing Web Crawlers

Web crawlers sometimes become annoying because of their frequent revisits and bandwidth usage. You can configure your web site to prevent a crawler from collecting its data, but only if the crawler obeys the robots exclusion standard. Malicious web crawlers ignore the robots exclusion file, robots.txt [6].

In robots.txt you can specify which directories are off-limits to robots, or tell a crawler to download pages more slowly. Sample declarations look like the ones below:

User-agent: *

Disallow: /

The Disallow directive tells the web crawler which directories should not be indexed, so it should stay away from them. The declaration above disallows every directory of the web site. The next declaration states that only the cgi-bin directory under the root is disallowed.

User-agent: *

Disallow: /cgi-bin/

User-agent defines which web crawler should obey the rule specified. The wildcard * specifies that all web crawlers are subject to the rule.

User-agent: *

Crawl-delay: 10

Crawl-delay is not a directive that prevents the crawler from doing its job, but it slows the crawler down by adding a delay between page downloads.
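
Python's standard library ships a parser for this file, so a well-behaved crawler can check both the Disallow rules and the Crawl-delay value before fetching. A minimal sketch follows; the site URL and crawler name are placeholders.

from urllib import robotparser

robots = robotparser.RobotFileParser()
robots.set_url("http://example.org/robots.txt")   # placeholder site
robots.read()

if robots.can_fetch("MyCrawler", "http://example.org/cgi-bin/report"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")

# crawl_delay() returns the Crawl-delay value for the agent, if any
# (available in Python 3.6+); None means no delay was specified.
delay = robots.crawl_delay("MyCrawler")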

Another prevention method is to password-protect the directories you do not wish the crawler to gather and index. This can be done by editing the .htaccess file.

There are two more ways to discourage web crawlers, but they are not honoured by all crawlers. The first is adding a META tag to the <HEAD> section of the page:

<META NAME="robots" CONTENT="noindex">

This tag says that the page is not indexable, so the crawler should not download it.

<META NAME="robots" CONTENT="nofollow">

The nofollow value tells the crawler to ignore the links on the page it is crawling.

<META NAME="robots" CONTENT="noarchive">

The noarchive value prevents caching by the crawler. Crawlers take snapshots of web pages when they crawl; these snapshots are called cached copies. Some sites do not want this and use this tag so that users only ever see up-to-date data, avoiding stale information such as an outdated price of an electronic device.

One last prevention technique is the CAPTCHA, a computer-generated test that helps identify whether the user is a human or a machine. This technique only prevents deep crawling, though.

Conclusion

This paper describes the fundamentals of web crawlers, the types of crawlers and the policies that define how they have to behave. Crawlers deal with huge amounts of data and accordingly should be robust and polite at the same time.

A crawler that demands too many resources from the web site it crawls can be annoying, and such situations call for prevention techniques. Techniques that keep crawlers away from your web site are described in the previous section.