While the business community closely monitors every move of Google, Yahoo! and Microsoft in the Internet search engine space, a large number of vertical search engines are quietly launching almost every other day. This stems from general dissatisfaction with horizontal search engines, which cannot provide context-aware search results. However, vertical search engines are complex to build, even though each one is aligned to a specific business domain. Almost all vertical search engines face two major challenges. First, they must extract information from dynamic web pages, which store their data in backend databases and render it using DHTML-like technologies. Second, they must extract meaningful, context-aware information from diverse HTML-based web pages, which is difficult because HTML pages are unstructured. To address these problems, HCL has developed a vertical search engine framework that provides the backbone and can be easily customized to create a search engine solution for any business domain. Independent Software Vendors (ISVs) can reuse this framework to develop vertical search engines in a short time span and at reduced cost.
The framework comprises a backend subsystem and a front-end subsystem.
The backend crawls a set of target websites (configured in the system by the end user) and indexes the information found there. A unique feature of the backend is its ability to simulate human submission of HTML forms to reach hidden data stored in databases. This is called deep Web (or Invisible Web) content; traditional search engines do not retrieve it, so it is not searchable through them. To get such hidden data, the backend crawler automatically generates meaningful queries and issues them against search forms. Once the data is retrieved, it is parsed, and contextual information is extracted, stored and indexed using a data dictionary provided by the end user.
The framework provides a front-end search API subsystem through which end users can search for information using either Java-based or Web Services-based APIs.
With the explosion of digital information available online in recent years, people increasingly find it difficult to locate information quickly through traditional search engines such as those of Google, Yahoo! and Microsoft. A recent Convera survey of more than 1000 business professionals showed that only one in 10 professionals always finds exactly what he or she is looking for on the first attempt. This has given rise to Vertical Search Engines that are designed for specific purposes (such as Jobs, Healthcare and Finance) and provide better search results to end users.
However, developing a Vertical Search Engine is no easy task, notwithstanding the plethora of new-age technologies currently available. Even though Vertical Search Engines focus on a narrow domain and thus know the context of the user's search query, they still have to gather content that websites mostly store in backend databases and show as dynamic pages. These search engines also need a technique for uniquely identifying key information across such disparate web pages. For example, a Vertical Job Search Engine should be able to extract the Job Title, Job Description and Salary Offered from target websites. Similarly, a Vertical Search Engine for Home Tutors needs to determine each tutor's education and qualifications and the subjects offered for tuition from each website.
The difficulties outlined above drove the need for a framework that ISVs can leverage when creating new vertical search engines. The framework provides a robust and scalable architecture while remaining flexible. This allows ISVs to develop products that meet stringent performance requirements and deliver better search results with a shorter time to market. Time to market is critical for the success of any vertical search engine, since more and more advertisers are placing their ads in search engines related to their business domain. Therefore the framework not only reduces product development cost but also drives significantly higher revenue for the ISV.
The Vertical Search Engine framework comprises two subsystems, namely, a front-end system and a backend system.
The basic function of the backend subsystem is to crawl a list of targeted websites, extract hidden data from them by simulating different end-user queries, and then use pattern matching techniques and artificial intelligence to identify unique attributes (and their values) in these web pages. Finally, the backend indexes both the page content and the domain-specific attribute values.
The backend has a highly scalable architecture that can support a large number of websites. Figure 1 below shows a high-level architecture of the backend system.
Figure 1 - Backend Crawler
A Vertical Search Engine, by definition, is suitable for a specific business domain. This requires crawling a list of websites that are specific to that domain. It is different from the way Generic Search Engines (such as Google) work - they keep on adding new hyperlinks to their database while crawling through web pages. Hence, the backend subsystem needs to be configured with the list of websites or URLs that are to be crawled.
Another important configuration is the definition of a domain-specific data dictionary within the backend system. This metadata helps the backend extract specific parametric values from each web page it crawls.
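To make the role of the data dictionary concrete, the following is a minimal Python sketch of dictionary-driven extraction. The source does not describe the dictionary's format, so everything here is an assumption for illustration: the dictionary is modelled as a mapping from attribute name to a regular expression, and the names DATA_DICTIONARY, strip_tags and extract_attributes are hypothetical, not part of the framework.

```python
import re

# Hypothetical data dictionary for a Tutors vertical: attribute name -> regex
# with one capture group. The patterns are illustrative assumptions only.
DATA_DICTIONARY = {
    "name": re.compile(r"Name:\s*([A-Za-z .'-]+)"),
    "fees": re.compile(r"Fees:\s*\$?(\d+(?:\.\d+)?)"),
    "experience": re.compile(r"Experience:\s*(\d+)\s*years?"),
}


def strip_tags(html):
    """Crude tag removal: replace each tag with a newline, collapse spaces."""
    text = re.sub(r"<[^>]+>", "\n", html)
    return re.sub(r"[ \t]+", " ", text)


def extract_attributes(html, dictionary=DATA_DICTIONARY):
    """Return the attributes from the dictionary that are found in the page."""
    text = strip_tags(html)
    found = {}
    for attribute, pattern in dictionary.items():
        match = pattern.search(text)
        if match:
            found[attribute] = match.group(1).strip()
    return found


page = (
    "<html><body><p>Name: Jane Doe</p><p>Fees: $40</p>"
    "<p>Experience: 7 years</p></body></html>"
)
record = extract_attributes(page)
```

Here record holds one value per attribute found on the page, and attributes with no match are simply left out. A production content filter would need a real HTML parser and far more tolerant patterns; the sketch only shows how per-domain metadata can drive the extraction step.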
The backend system revolves around a message-oriented middleware that helps break up the crawling process into granular steps. There are three main components in the backend, namely, Crawl Manager, Deep Crawler and Indexer.
Crawl Manager retrieves the list of sites from the database, identifies how each site is to be crawled and then sends that information to a message queue.
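As a rough illustration of this hand-off, the sketch below uses Python's in-process queue.Queue as a stand-in for the message-oriented middleware. The site records, the has_search_form flag and the "deep"/"shallow" crawl modes are assumptions made for the example; the source does not say how the Crawl Manager decides on a crawl strategy.

```python
import queue


def enqueue_sites(sites, crawl_queue):
    """Decide a crawl mode per site and publish one message per site."""
    for site in sites:
        # Assumed rule: sites that expose a search form get deep-crawled.
        mode = "deep" if site.get("has_search_form") else "shallow"
        crawl_queue.put({"url": site["url"], "mode": mode})


# Hypothetical site list, standing in for rows read from the database.
sites = [
    {"url": "http://jobs.example.com", "has_search_form": True},
    {"url": "http://tutors.example.com", "has_search_form": False},
]

crawl_queue = queue.Queue()
enqueue_sites(sites, crawl_queue)
first = crawl_queue.get()  # what a Deep Crawler worker would receive first
```

Because the producer only writes messages and never calls the crawler directly, crawler workers can be added or removed without changing the Crawl Manager, which is the main benefit of splitting the process around a queue.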
Deep Crawler receives site information from the message queue, dynamically identifies the HTML search form and then retrieves all the option values in that form. It then permutes these values and submits the form once per combination, one at a time, much as a human would submit search queries. Each response page is parsed to retrieve the links to summary pages and to actual content pages. Every summary page is then deep crawled in turn, and all the content page links are extracted from it. These links are dispatched as messages to the Indexer through the message queue.
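The form-submission step can be sketched as follows. This is a minimal Python illustration, not the framework's implementation: "permuting" is read here as taking the Cartesian product of the option values of each form field, and the form fields, their values and the function name form_queries are hypothetical.

```python
from itertools import product
from urllib.parse import urlencode


def form_queries(options):
    """Yield one field -> value dict per combination of option values."""
    fields = sorted(options)  # fixed field order keeps the output repeatable
    for values in product(*(options[field] for field in fields)):
        yield dict(zip(fields, values))


# Hypothetical option values scraped from the <select> elements of a form.
search_form = {
    "subject": ["Math", "Physics"],
    "city": ["New York", "Boston"],
}

queries = list(form_queries(search_form))
# Each query can be encoded and sent as a GET or POST body by the crawler.
first_query_string = urlencode(queries[0])
```

With two fields of two options each, the crawler would issue four submissions. The number of combinations grows multiplicatively with each extra field, so a real crawler would likely need to cap or sample the combinations for large forms.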
The Indexer comprises two subparts: the Content Filter and the Indexer proper. The Content Filter parses the HTML content page to extract the values of domain-specific parameters, which is what sets vertical search engines apart. For example, in the case of a Tutors search engine it may extract the Name, Fees and Experience of the tutor. These parametric values are finally indexed using Apache's Lucene Indexer. The framework is also capable of indexing complete page contents and ranking pages by how often simple keywords appear on them. For example, if the keyword "New York" is present 10 times in one page and 12 times in another, then the second page would have a higher rank than the first in the index.
Figure 2 describes the typical deployment of the backend crawler.
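The keyword-count ranking in the "New York" example can be reproduced with a few lines of Python. This is only a sketch of the counting idea described above; it is not Lucene's scoring model, and the page identifiers and the function name rank_by_keyword are made up for the example.

```python
def rank_by_keyword(pages, keyword):
    """Rank pages by case-insensitive occurrence count of the keyword."""
    needle = keyword.lower()
    counts = {url: text.lower().count(needle) for url, text in pages.items()}
    # Highest count first, matching "more occurrences means higher rank".
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Two synthetic pages mentioning the keyword 10 and 12 times respectively.
pages = {
    "page-a": "New York " * 10,
    "page-b": "New York " * 12,
}
ranking = rank_by_keyword(pages, "New York")
```

The page with 12 occurrences is ranked ahead of the page with 10, as in the example above. Raw counts favour long pages, which is one reason production indexers normalise term frequency in their scoring.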
Figure 2 - Deployment Architecture of Backend
The frontend subsystem provides a search API (both Java-based and Web Service-based) to client applications. The frontend comprises a search manager and search servers, as depicted in Figure 3 below.
Figure 3 - Search Framework
For better scalability and search response time, a number of search servers are deployed across various nodes in a cluster. The search manager interacts with these search servers to serve a search query.
The search manager broadcasts the search query to all the search nodes in the cluster. After receiving the results from each search node, it combines them and sorts them in descending rank order before sending them back to the client application.
Each search server parses the query, runs it against its index and sends the results back to the search manager.
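The interaction between the search manager and the search servers follows a scatter-gather pattern, which the Python sketch below illustrates. It is a sketch under stated assumptions: each search server is modelled as an in-memory object holding its own shard, scoring is a plain term count, and the names SearchServer and search_all are hypothetical rather than the framework's API.

```python
from concurrent.futures import ThreadPoolExecutor


class SearchServer:
    """Stand-in for one search node holding its own index shard."""

    def __init__(self, shard):
        self.shard = shard  # list of (doc_id, text) pairs

    def search(self, query):
        # "Parse" the query into lowercase terms, then score by term count.
        terms = query.lower().split()
        hits = []
        for doc_id, text in self.shard:
            words = text.lower().split()
            score = sum(words.count(term) for term in terms)
            if score > 0:
                hits.append((doc_id, score))
        return hits


def search_all(servers, query):
    """Broadcast the query to every server, merge, sort by score descending."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        partials = list(pool.map(lambda server: server.search(query), servers))
    merged = [hit for partial in partials for hit in partial]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)


servers = [
    SearchServer([("job-1", "java developer java"), ("job-2", "python developer")]),
    SearchServer([("job-3", "java java java architect")]),
]
results = search_all(servers, "java")
```

Merging works here because every node scores with the same function, so scores from different shards are directly comparable. In a real cluster the calls would go over the network, and the manager would also need timeouts for slow or failed nodes.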
The framework has been tested on two verticals, namely, Jobs and Tutors. In both cases, it was able to extract and index vertical-specific parametric information from the target websites and deliver better search results.
The vertical search engine framework provides a robust platform for developing new vertical search engines with minimal effort. This will allow vertical search engine ISVs to develop products with a faster time to market and at reduced cost.
For More Information contact
Pradeep Kumar Mishra & Asvin Ramesh
HCL Technologies America Inc.
330 Potrero Avenue
Sunnyvale, CA 94085
Phone +1 408 523 8333 Fax +1 408 733 0482 Mobile +1 408 368 6814 +91-9818456560(India)