Process of Web Crawler Algorithm

Databases are widely used on the internet to store data for future use. Internet use keeps increasing because most individuals access the internet to acquire information. Forouzan defines the World Wide Web (WWW) as a repository of information collected from different sources (2007, p. 851). The author also says that the main purpose of the WWW is to retrieve the documents containing the required data from this repository (2007, p. 854). Sharma, Sharma, and Gupta say that the data in the WWW databases change at regular intervals of time (2011, p. 38). Individuals therefore use the WWW extensively to acquire the information they require.

The information retrieved is rarely all relevant. According to Sharma, Sharma, and Gupta, the major reason for irrelevant information is the sheer abundance of data, which makes the retrieval process challenging (2011, p. 38). The authors say that the retrieval of relevant data can be achieved efficiently using search engines (2011, p. 38). Search engines are designed to locate the data stored in a database by identifying its indices. Forms are one method of entering the information sought: data are retrieved by the search engine based on the entries given in the forms, and this process works faster when the forms and queries are filled in appropriately.
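As a rough illustration of this form-to-query step, the sketch below (in Python) encodes hypothetical form fields into the query string of a made-up search URL; the endpoint and field names are assumptions, not part of any cited system.

```python
from urllib.parse import urlencode

# Hypothetical fields a user might fill in on a search form.
form_entries = {"q": "hidden web crawler", "lang": "en"}

# The entries become the query string of a (hypothetical) search URL;
# the engine then looks up indexed records matching these values.
query_url = "https://search.example.com/results?" + urlencode(form_entries)
print(query_url)  # https://search.example.com/results?q=hidden+web+crawler&lang=en
```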

Ramakrishnan and Gehrke define indexing as a technique that helps in faster retrieval of the required information (2003, p. 274). Indexes are assigned to web pages so that the pages can be retrieved quickly. According to Singh and Sharma, databases whose data cannot be accessed directly are called the hidden web, invisible web, or deep web (2013, p. 292). The authors say that the most relevant data are present in these hidden databases (p. 292). They describe that traditional search engines do not have access to the index of a hidden database because its forms are not filled automatically, so search engines have to be developed that can locate the relevant information accurately (p. 292). Search engines implement web crawler software to identify data from hidden databases efficiently (Agrawal & Agrawal, 2013). According to the authors, “Web crawler is the software that explores the WWW in an efficient, organized and methodical manner” (2013, p. 12). According to Kurose and Ross, the contents of a web page are indexed for faster retrieval (2013, p. 274). Hence data can be retrieved quickly if they have an index.
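To make the idea of indexing concrete, the following minimal sketch (not taken from the cited sources) builds an inverted index that maps each term to the pages containing it, so a lookup can answer a query without scanning every stored page.

```python
from collections import defaultdict

# Toy page contents keyed by URL (illustrative data only).
pages = {
    "http://example.com/a": "deep web crawler architecture",
    "http://example.com/b": "reinforcement learning for link selection",
}

# Inverted index: term -> set of pages that contain the term.
index = defaultdict(set)
for url, text in pages.items():
    for term in text.lower().split():
        index[term].add(url)

# A query term is answered by a direct lookup instead of a scan of all pages.
print(index["crawler"])  # {'http://example.com/a'}
```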

According to Agrawal and Agrawal, the main purpose of the web crawler is to find the index of the web pages in the hidden database, download those pages, and send them back to the requesting user (2013). The authors describe how the requested web pages are downloaded and stored in the local database (2013, p. 12). They state that indices are assigned to the downloaded web pages (2013, p. 12). According to Singh and Sharma, an intelligent agent technique is used to identify the relevant data from the hidden database efficiently (2013, p. 292).
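A minimal sketch of this download-and-store step is given below, assuming a SQLite table as the local database; the schema, function name, and URL are illustrative assumptions, not the authors' implementation.

```python
import sqlite3
from urllib.request import urlopen

# Local database acting as the crawler's page store (in-memory for this sketch).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT)")

def download_and_store(url):
    """Fetch a requested page and store it locally; the row id serves as its index."""
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    cursor = db.execute("INSERT OR IGNORE INTO pages (url, html) VALUES (?, ?)", (url, html))
    db.commit()
    return cursor.lastrowid

# Example call (requires network access):
# download_and_store("http://example.com/")
```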

According to Singh and Sharma, the intelligent agent determines the link to be crawled through (2013, p. 296). The authors say that the links are determined based on the feedback from previous selections (p. 296). They describe that this can be achieved efficiently using reinforcement learning (p. 296), a technique that determines the related links to follow from a given link (p. 296). The related link is chosen based on the knowledge gained by interacting with the environment, such as the data in the database (p. 296). The authors explain that links are rejected if the data retrieved from them was found irrelevant in previous selections (p. 296).
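The sketch below is a loose illustration of this feedback idea, not the authors' algorithm: each candidate link keeps a learned value that is updated from the relevance of the pages it led to, and links whose value drops below a threshold are rejected. All names and constants are assumptions.

```python
import random

# Learned value per candidate link (optimistic start so every link is tried at least once).
link_value = {"http://site/a": 1.0, "http://site/b": 1.0}
ALPHA = 0.5        # learning rate
EPSILON = 0.1      # small probability of exploring a non-best link
THRESHOLD = 0.2    # links whose learned value falls below this are rejected

def choose_link():
    """Epsilon-greedy selection among links that have not been rejected."""
    candidates = {u: v for u, v in link_value.items() if v >= THRESHOLD}
    if random.random() < EPSILON:
        return random.choice(list(candidates))
    return max(candidates, key=candidates.get)

def update(link, reward):
    """Move the link's value toward the observed relevance (reward in [0, 1])."""
    link_value[link] += ALPHA * (reward - link_value[link])

link = choose_link()
update(link, reward=0.0)   # feedback: the page behind this link was not relevant
```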

Search engines play a major role in identifying related data in a database. Every search engine implements a web crawler algorithm to retrieve the data related to the user's request. The traditional web crawler is inefficient at retrieving such related data (Singh & Sharma, 2013). Singh and Sharma propose a web crawler algorithm that utilizes an intelligent agent technique (2013, p. 294). The architecture of this web crawler is shown in Figure 1. According to Singh and Sharma, there are three main components in the web crawler (2013, p. 294).

Figure 1: The architecture of the web crawler (Singh & Sharma, 2013, p. 294)

The three components are the crawler, the classifier, and the link manager (Singh & Sharma, 2013). According to Singh and Sharma, the classifier determines whether the retrieved information is relevant (p. 294). The authors say that the link manager links the relevant information retrieved and provides it to the requesting user (p. 294). They also say that the retrieved information is stored in the local database on the server for future retrieval (p. 294). This paper deals with the detailed process of each component in the web crawler algorithm: the crawler, the classifiers, and the link manager.
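A schematic of how these three components might be wired together is sketched below, assuming plain Python classes; the class and method names are placeholders rather than the authors' interfaces.

```python
class Crawler:
    def fetch(self, seed_urls):
        """Retrieve the pages addressed by the seed URLs (stubbed out here)."""
        return [f"<html>page for {u}</html>" for u in seed_urls]

class Classifier:
    def relevant(self, page):
        """Decide whether a retrieved page is relevant (stub heuristic)."""
        return "page" in page

class LinkManager:
    def __init__(self):
        self.local_db = []          # relevant results kept for future retrieval

    def deliver(self, pages):
        self.local_db.extend(pages)
        return pages                # handed back to the requesting user

crawler, classifier, manager = Crawler(), Classifier(), LinkManager()
pages = crawler.fetch(["http://example.com/"])
results = manager.deliver([p for p in pages if classifier.relevant(p)])
```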

The first process involved in the web crawler algorithm is the crawler. Every search engine has a local database to store the retrieved information. According to Huang, Li, Li, and Yan, the reason for storing the information in the local database is easier retrieval when the same data is requested again (p. 1081). The information retrieved is the whole web page, which is stored in the local database. According to Sharma, Sharma, and Gupta, a web page contains multiple pages within a single page, which are called nodes, and the hyperlinks between them are called edges (2011, p. 38). The authors say that a crawler browses through all the edges to reach the nodes (p. 38).

Sharma, Sharma, and Gupta state that a web crawler requires huge network resources such as storage and memory because the crawler visits millions of web sites in a short period of time (2011, p. 38). The authors also state that this process should be distributed since it consumes so much memory and so many resources (p. 38). According to Kurose and Ross, a web page contains many elements, called objects, such as images, text, and videos (2012, p. 19). The authors also describe that the main aim of the web crawler is to discover new web objects and to identify changes in previously discovered web objects (p. 38). Kurose and Ross say that a process is triggered for the retrieval of each web object (2012, p. 20).

According to Sharma, Sharma, and Gupta, it is impossible in the current web for a single crawler to scan the entire web, since the web is growing exponentially; hence multiple processes are invoked to search it (p. 38). Search engines implement these multiple processes to acquire the pages, an arrangement called a parallel crawler (Sharma, Sharma, & Gupta, 2011, p. 38). The web pages are retrieved based on the data entered into the search engine. Singh and Sharma define the seed Uniform Resource Locators (URLs) as the URLs requested by the user (2013, p. 294). According to the authors, the crawler is loaded with the seed URLs as its basic input (p. 294). They also describe that the pages requested by the user through these URLs are retrieved and sent to the page classifier (p. 294).
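A rough sketch of distributing the crawl across several workers is given below, using a thread pool; the seed URLs and the fetch logic are illustrative assumptions, not the parallel crawler described by the cited authors.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Seed URLs: the starting points requested by the user (illustrative).
seed_urls = ["http://example.com/", "http://example.org/"]

def fetch(url):
    """One crawl process: retrieve a page so it can be handed to the page classifier."""
    with urlopen(url, timeout=10) as response:
        return url, response.read()

# Several fetch processes run in parallel instead of one crawler scanning sequentially.
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, body in pool.map(fetch, seed_urls):
        print(url, len(body), "bytes")
```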

The second process involved in the web crawler algorithm is the classifiers. There are different types of classifiers: page, link, and form classifiers (Singh & Sharma, 2013). According to Singh and Sharma, the main functionality of the page classifier is to identify the domain of the web page. Every web page belongs to a domain. According to Kurose and Ross, there are various top-level domains, such as com, org, net, edu, and gov, and country top-level domains, such as uk, fr, ca, and jp. The authors also say that an address called the Internet Protocol (IP) address is assigned to every end or destination system. The domain of each web page is identified and the IP address of the web page is resolved (Sharma, Sharma, & Gupta, 2011). The authors describe this process as the Domain Name System (DNS). The required page is then retrieved from the database.
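For example, the resolution of a page's domain to its IP address, which is the lookup DNS performs before the page can be fetched, can be sketched with the standard socket library; the host name is illustrative only.

```python
import socket

# The domain part of a web page's URL is resolved to the IP address
# of the host serving it, as the Domain Name System does.
domain = "example.com"          # the top-level domain here is "com"
ip_address = socket.gethostbyname(domain)
print(domain, "->", ip_address)
```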

Singh and Sharma state that a page can be assigned to a domain on the basis of the similarity between the domain and the page (p. 294). The authors also describe that a two-step classification technique is used to identify the similarity between the page and the domain (p. 294). They say that the web page and the domain details are collected as text (p. 294). They also state that the advantage of the two-step classification technique over a traditional focused crawler is that the similarity output is more precise and the results more relevant (p. 294). According to the authors, a threshold, which is a constant value, is assumed, and a ratio is calculated based on the similarity between the page and the domain (p. 295). The authors conclude that if the calculated value is greater than the threshold value, the page and the domain are considered similar; otherwise the page is discarded as irrelevant (2013). The page found to be relevant is then fed into the link classifier.
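A hedged sketch of this threshold test is shown below, using a simple cosine similarity between term counts as the similarity measure; the actual two-step classifier in the cited paper may compute the ratio differently, and the texts and threshold are invented for illustration.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Similarity of two texts based on their term-count vectors (range 0 to 1)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

THRESHOLD = 0.3                       # assumed constant threshold value
page_text = "book search form with author and title fields"
domain_text = "book search with author title and publisher"

ratio = cosine_similarity(page_text, domain_text)
decision = "similar to the domain" if ratio > THRESHOLD else "discarded as irrelevant"
print(round(ratio, 2), decision)
```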

According to Singh and Sharma, the link classifier determines the links within the relevant pages retrieved (2013, p. 295). The authors also describe that the purpose of this extraction is to identify the intended target page in the domain (p. 295). They say that the links in a page redirect to a different relevant form, although there may be a delay in the redirection (p. 295). According to the authors, the links are extracted from the URLs containing hyperlinks, which are used to identify relevance (p. 295). The authors describe that the likelihood of identifying relevant data is high when the search term is a substring of the URL (p. 295). According to the authors, the domain is retrieved when the hyperlinks are followed (p. 295). Search engines use the concept of a form to identify relevant data: the details about the hyperlinks are entered into the form, on the basis of which the target pages are identified.
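A loose sketch of the substring heuristic mentioned above is given here: links whose URL contains the search term are treated as more likely to lead to the target page. The term and URLs are invented for illustration.

```python
search_term = "books"

# Candidate hyperlinks extracted from the relevant page (illustrative URLs).
links = [
    "http://example.com/books/advanced-search",
    "http://example.com/about-us",
    "http://example.com/library/books",
]

# A link is a strong candidate when the search term is a substring of its URL.
promising = [url for url in links if search_term in url.lower()]
print(promising)
```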

The form provides details about the domain to which the requested page is related. According to Singh and Sharma, the main functionality of the form classifier is to identify the searchable and non-searchable forms of the domain (2013, p. 295). The authors define a searchable form as a form through which the user can directly enter data into web databases, for example a simple form in which values are filled in for querying (p. 295). They define a non-searchable form as a form containing details to be submitted to the web database rather than entered as query information, for example login and registration forms (p. 295). The authors state that the searchable forms of the domain are identified and the output is stored in the database (p. 295). The links in the page are then sent to the link manager.
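The distinction between searchable and non-searchable forms could be approximated by a heuristic like the sketch below; the keyword list and function are assumptions for illustration, not the authors' classifier.

```python
# Field names that typically indicate a login or registration form rather than a query form.
NON_SEARCHABLE_HINTS = {"password", "confirm_password", "email", "username"}

def is_searchable(form_fields):
    """Treat a form as searchable unless its fields look like login/registration details."""
    fields = {field.lower() for field in form_fields}
    return not (fields & NON_SEARCHABLE_HINTS)

print(is_searchable(["title", "author", "year"]))   # True: a simple query form
print(is_searchable(["username", "password"]))      # False: a login form
```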

The third process involved in the proposed architecture is the link manager. The authors state that the searchable domain and the domain of interest are provided as input to the link manager (2013, p. 296). According to Sharma, Sharma, and Gupta, the URL containing the domain information is used to retrieve the web page (2011, p. 39). According to Singh and Sharma, the output retrieved while crawling, namely the links of the web page, is fed into the feature learner (2013, p. 296). The authors explain that the feature learner utilizes these links to identify the path through which the relevant web page is to be retrieved (2013). The paths indicate the links to be traversed when the same query is requested again by the user. The authors also state that the successful paths are stored in the database (2013).
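One way to picture the storage of successful paths is a simple mapping from a query to the ordered links that previously led to relevant pages, as in the assumption-laden sketch below; it is not the paper's feature learner, and the query and URLs are invented.

```python
# Successful crawl paths remembered per query, so a repeated query can reuse them.
successful_paths = {}

def record_path(query, links):
    """Store the ordered links that led to a relevant page for this query."""
    successful_paths[query] = list(links)

def lookup_path(query):
    """Return the remembered path, or None if this query has not been seen before."""
    return successful_paths.get(query)

record_path("computer science books",
            ["http://example.com/", "http://example.com/books", "http://example.com/books/cs"])
print(lookup_path("computer science books"))
```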

According to the authors, a feature set is formed from the URL and the text around it (p. 296). The feature set contains information from the web page such as hyperlinks, text, and images. The authors also say that unwanted words, such as stop words and the terms before the text, are removed from the feature set (2013). According to the authors, the top terms are selected based on the number of occurrences of each word (p. 296). According to the authors, “The frequency of the term is increases by one when the term from the set, obtained earlier, becomes the substring of other term in the URL feature set” (p. 296). Thus the feature set contains only data relevant to the input URL, because the words are chosen based on the URL.
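A rough sketch of building such a URL feature set follows: stop words are removed, term frequencies are counted, and a term's count is also incremented when it is a substring of another term in the set. The tokenization, stop-word list, and example inputs are assumptions rather than the authors' exact procedure.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "for", "in"}

def build_feature_set(url, surrounding_text):
    """Build term frequencies from a URL and the text surrounding it."""
    tokens = url.lower().replace("/", " ").split() + surrounding_text.lower().split()
    freq = Counter(t for t in tokens if t not in STOP_WORDS)
    # Increase a term's frequency when it is a substring of another term in the set.
    for term in list(freq):
        for other in freq:
            if term != other and term in other:
                freq[term] += 1
    return freq

features = build_feature_set("http://example.com/books/search",
                             "search the book catalogue and new books")
print(features.most_common(3))   # the top terms are selected by frequency
```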

It can be inferred that the data set generated by the feature learner is modified based on the input given by the user. Singh and Sharma define this process as automatic feature selection (p. 296). According to the authors, the link to be followed is determined using intelligent agent coordination (p. 296). This process determines the link of each requested page, which in turn runs the request against the database and retrieves the data hidden in it.

The crawler, the classifiers, and the link manager are the processes involved in the web crawler discussed in this paper. The web crawler is one technique for identifying relevant data in a hidden database. There are various other methodologies that can be implemented to discover the data hidden in a database. Hristidis, Hu, and Ipeirotis say that data in a hidden database can be retrieved using a relevance-based approach (2011, p. 1555). The authors say that ranks are assigned to the data that are requested frequently (2011). They also say that this method uses a ranking technique to identify the relevance between data items based on the assigned ranks (2011). This method retrieves data that are relevant to the data requested.

Web technologies are growing exponentially. Much of the data is kept in hidden databases for reasons such as security against hacking and protection against contamination of the data, that is, unauthorized editing or deletion of data from the database. Hence the search engine should be adaptable in discovering the data hidden in the database, and it should be built to process requests faster with less consumption of memory and internet resources.

References

Agrawal, S., & Agarwal, K. (2013). Deep web crawler: A review. International Journal of Innovative Research in Computer Science & Technology (IJIRCST), 1(1), 12-14.

Hristidis, V., Hu, Y., & Ipeirotis, P. G. (2011). Relevance-based retrieval on hidden-web text databases without ranking support. IEEE Transactions on Knowledge and Data Engineering, 23(10), 1555-1558.

Huang, Q., Li, Q., Li, H., & Yan, Z. (2012). An approach to incremental deep web crawling based on incremental harvest model. 2012 International Workshop on Information and Electronics Engineering (IWIEE), 29, 1081-1087. doi:10.1016/j.proeng.2012.01.093

Sharma, S., Sharma, A. K., & Gupta, J. P. (2011). A novel architecture of a parallel web crawler. International Journal of Computer Applications, 14(4), 0975-8887.

Singh, L., & Sharma, D. K. (2013). An architecture for extracting information from hidden web database using intelligent agent technology through reinforcement learning. Proceedings of the 2013 IEEE Conference on Information and Communication Technologies (ICT 2013), 292-297.