This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines.
The aim of the Project is to extract text and images from World Wide Web and matching the Images with the text associated with them. This tool uses a WebCrawler to do the job. The Crawler can be configured according to the needs. The Depth and no of links per depth can be adjusted. The extracted Images and text can be grouped in accordance with the relevance.
The Crawler is built is capable of spawing multiple threads. The no of threads that uses is given as choice for the users, although network configuration effect the efficiency of multithreading.
This chapter gives the detailed description about the system environment used for the development of the system. It also gives the brief explanation about the various technologies and tools used in the development of the system.
An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.
An object is an entity that is represented by information in a database. User queries are matched against the database information. Depending on the application the data objects may be, for example, text documents, images, or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata.
Most IR systems compute a numeric score on how well each object in the database match the query, and rank the objects according to this value. The top ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query.
The aim of the Project is to collect data from World Wide Web running required analysis and matching the Images with the text associated with them.
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, or Web spider, Web robot
This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code.
A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier (queue). URLs from the frontier are recursively visited according to a set of policies.
As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contains the most relevant pages and not just a random sample of the Web.
This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the latter is the case of vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Web site). Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.
Restricting followed links
A crawler may only want to seek out HTML pages and avoid all other MIME types. In order to request only HTML resources, a crawler may make an HTTP HEAD request to determine a Web resource's MIME type before requesting the entire resource with a GET request. To avoid making numerous HEAD requests, a crawler may examine the URL and only request a resource if the URL ends with certain characters such as .html, .htm, .asp, .aspx, .php, .jsp, .jspx or a slash.
Some crawlers intend to download as many resources as possible from a particular web site. So path-ascending crawler was introduced that would ascend to every path in each URL that it intends to crawl.
The importance of a page for a crawler can also be expressed as a function of the similarity of a page to a given query. Web crawlers that attempt to download pages that are similar to each other are called focused crawler or topical crawlers.
The main problem in focused crawling is that in the context of a Web crawler, we would like to be able to predict the similarity of the text of a given page to the query before actually downloading the page..
The Web has a very dynamic nature, and crawling a fraction of the Web can take weeks or months. By the time a Web crawler has finished its crawl, many events could have happened, including creations, updates and deletions.
From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most-used cost functions are freshness and age.
Crawlers can retrieve data much quicker and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, a server would have a hard time keeping up with requests from multiple crawlers.
The use of Web crawlers is useful for a number of tasks, but comes with a price for the general community.
The costs of using Web crawlers include:
network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time;
server overload, especially if the frequency of accesses to a given server is too high;
poorly-written crawlers, which can crash servers or routers, or which download pages they cannot handle; and
Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.
A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol that is a standard for administrators to indicate which parts of their Web servers should not be accessed by crawlers.
A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes.
Web crawler architectures
A crawler must not only have a good crawling strategy, as noted in the previous sections, but it should also have a highly optimized architecture.
Web crawlers are a central part of search engines, and details on their algorithms and architecture are kept as business secrets. When crawler designs are published, there is often an important lack of detail that prevents others from reproducing the work. There are also emerging concerns about "search engine spamming", which prevent major search engines from publishing their ranking algorithms.
1.5 Web crawler architecture
This project is the development of web crawler which would in turn facilitate the development of Information Retrieval System for the purpose of Web Mining on data in text and image format over the internet.
The Web Crawler which is built using Java and uses concepts of multi threading and object oriented programming. These web crawlers facilitate the purpose of web mining by automating the metadata collection and downloading of documents from the urls given.
The brief summary of the requirements being carried out is as follows:
3.1 Functional Requirements
This tool can extract the text, images from the urls visited. The two main tasks of this tool are saving data from a URL to a file and extracting hyperlinks. Because of Javas URL class, both tasks are rather simple. In the process of extracting data from web, it saves only image and text data while suppress other MIME types.
3.2. Software Requirements
The Software requirements in this project are as follows:
Windows Operation System
3.3 JAVA and its Features:
3.3.1 JAVA is safe and simple
Java was designed from the ground up to allow for secure execution of code across a network, even when the source of that code was untrusted and possibly malicious.
Mostly there are no pointers in Java. Java programs cannot access arbitrary addresses in memory. All memory access is handled behind the scenes by the trusted runtime environment. Furthermore Java has strong typing. Variables must be declared, and variables do not change types when we aren't looking. Casts are strictly limited to casts between types that make sense. Thus we can cast an int to a long or a byte to a short but not a long to a Boolean or an int to a String. Java implements a robust exception handling mechanism to deal with both expected and unexpected errors.
Java was designed to make it much easier to write bug free code. The most important part of helping programmers write bug-free code is keeping the language simple.
3.3.2 JAVA is Object-Oriented
In object-oriented programs data is represented by objects. Objects have two sections, fields (instance variables) and methods. Fields tell you what an object is. Methods tell you what an object does. These fields and methods are closely tied to the object's real world characteristics and behaviour. When a program is run messages are passed back and forth between objects. When an object receives a message it responds accordingly as defined by its methods.
Object oriented programming is alleged to have a number of advantages including:
â€¢ Simpler, easier to read programs
â€¢ More efficient reuse of code
â€¢ Faster time to market
â€¢ More robust, error-free code
In practice object-oriented programs have been just as slow, expensive and buggy as traditional non-object-oriented programs.
3.3.3 JAVA is a Platform Independent
Java is compiled to an intermediate form called byte-code. A Java program never really executes natively on the host machine. Rather a special native program called the Java interpreter reads the byte code and executes the corresponding native machine instructions. Thus to port Java programs to a new platform all that is needed is to port the interpreter and some of the library routines. Even the compiler is written in Java. The byte codes are precisely defined, and remain the same on all platforms. However the virtual machine itself and some parts of the class library must be written in native code. These are not always as easy or as quick to port as pure Java programs.
3.3.4 JAVA is Multi-Threaded
Java is inherently multi-threaded. A single Java program can have many different threads executing independently and continuously. Three Java applets on the same page can run together with each getting equal time from the CPU with very little extra effort on the part of the programmer.
This makes Java very responsive to user input. It also helps to contribute to Java's robustness and provides a mechanism whereby the Java environment can ensure that a malicious applet doesn't steal all of the host's CPU cycles.
3.3.5 JAVA has Garbage Collector
We do not need to explicitly allocate or de-allocate memory in Java. Memory is allocated as needed, both on the stack and the heap, and reclaimed by the garbage collector when it is no longer needed. There are constructors and these do allocate memory on the heap, but this is transparent to the programmer.
The exact algorithm used for garbage collection varies from one virtual machine to the next. The most common approach in modern VMs is generational garbage collection for short-lived objects, followed by mark and sweep for longer lived objects. I have never encountered a Java VM that used reference counting.
3.3.6 Use of Try and Catch Mandatory in JAVA
For the most severe and common errors in Java the use of catch and try is mandatory. It is built into Java's grammar that these error checking precautions must be present or else our program will not compile.
3.4. Hardware Requirements
The Hardware Requirements in this project are as follows
1. 256MB RAM
2. 40GB Hard Disk
3.Pentium 4 2.0Ghz processor
Web Crawler Architecture
4.1 Processing items in a queue
Web crawling speed is governed not only by the speed of internet connection, but also by the speed of the sites that are to be crawled. Especially if crawling sites are from multiple servers, the total crawling time can be significantly reduced, if many downloads are done in parallel.
5.1.1 Processing items in a queue
Web crawling can be regarded as processing items in a queue. When the crawler visits a web page, it extracts links to other web pages. So the crawler puts these URLs at the end of a queue, and continues crawling to a URL that it removes from the front of the queue.
It is obvious, that every algorithm that just works by processing items that are independent of each other can easily be parallelized. Therefore, it is desirable to write a few classes that handle multithreading that can be reused. In fact, the classes that I wrote for web crawling originally were reused exactly as they are for a machine learning program.
Java provides easy-to-use classes for both multithreading and handling of lists. (A queue can be regarded as a special form of a linked list.) For multithreaded web crawling, it just needs to enhance the functionality of Java's classes a little. In the web crawling setting, it is desirable that one and the same webpage is not crawled multiple times. Therefore do not only use a queue, but also a set that contains all URLs that have so far been gathered. Only if a new URL is not in this set, it is added to the queue.
5.1.2 Implementation of the queue
It is possible to limit either the number of WebPages to be visited or the link depth. The methods in the queue interface resemble the desired functionality. Here if the depth level is limited it needs more than one queue. If it is only one queue, the crawler could not easily determine the link depth of the URL it is just visiting. But regardless of the link depth allowed, two queues are sufficient. In this case, the crawlers to only fetch URLs from queue 1 and add URLs to queue 2. When all URLs in queue 1 are processed, the queues are switched.
5.1.3 Implementation of a thread controller
As mentioned above, Java has a good interface for handling threads. A 'controller' should create new threads, if and only if there are still items in the queue to process and if the total number of threads does not exceed an upper bound.
In implementation of such a thread controller, provide the controller class on construction - among other parameters - with the class object for the thread class the queue. The queue should be pre-filled with at least one item. So require from our thread class, that it implements the run method it inherits from the Thread class.
5.1.4 Implementation of a thread
It is expected from the thread in the run method, that it fetches new items from the queue, and that it ends itself if there are no items left. This is common for all our possible threads;
If there are no more items to process, the Controllable Thread can terminate itself, but has to inform the Thread Controller about this. In the WebCrawler scenario, this is important when all URLs for the link depth of 2 are processed and the next deeper level is reached. The Thread Controller is now responsible for shifting the queues and starting new threads as needed.
During program execution, it might sometimes be necessary that a 'working thread' tells the 'main thread' what it does at the moment. In the web crawler application e.g. the user might be interested in what page the crawler is currently visiting. For this purpose the Message Receiver interface exists. The main class can register itself or another class as a message receiver. The message receiver class is then notified, when a thread does something, a thread is terminated or all threads are terminated. Reconsider that in the object oriented paradigm calling a method is equivalent to sending a message. In this scenario it gets quite clear why this is the case.
5.2 Crawling a web page
5.2.1 Saving a page
The two main tasks of a WebCrawler are of course saving data from a URL to a file and extracting hyperlinks. Because of Java's URL class, both tasks are rather simple. Although in this case this might not be the proper object oriented approach, these tasks are implemented in static methods in the SaveURL class. This class contains only static methods. A main method is supplied, so that the class can be used in a stand-alone version to save a file from a given URL
When doing input/output operations, it use buffering, because transferring data 'en-bloc' is more efficient than transferring data byte by byte. Java provides buffering off the shelf. In the saveURL(URL, Writer) method Java's BufferedInputStream is used for reading from the URL. In the saveURL(URL, OutputStream) method buffering buffering is done through a byte array.
5.2.2 Extracting hyperlinks
Using Java's String class, extracting hyperlinks from a web page is almost trivial. Here getURL method mentioned above is used to retrieve a String object containing the complete HTML page. The basic idea is to use the indexOf method of the String class to extract the beginning '≶a href'. A couple of other indexOf-calls and a StringTokenizer extract the URL and the link text from the HTML tag., first it need to convert the 'raw' HTML page from its mixed-case form into a form that is only lower case to allow easy extraction. it use the to Lowercase method of the String class to do this. it also convert all whitespaces (i.e. tabs, line breaks etc.) into simple spaces using the replaceAll method. This is necessary because it might be confusing if e.g. in the HTML page the '≶a' is followed by a line break instead of a blank.
The extractLinksWithText method in the SaveURL class stores URL/text pairs in a Map, extractLinks stores the URLs only in a Vector.
5.2.3 Putting the parts together
The two sample applications that are included are the 'PExtractor' and the 'TEXTCrawler'. The ' PExtractor ' saves image files from the web based on their file name extension. The ' TEXTCrawler ' saves the text data in the form of html files from the web, not based on the file name extension but on the content.
5.3 The 'PExtractor Application
The sample PExtractor application crawls the web and saves all images files that are linked.
A standard use for such a program is e.g. crawling a friend's web page where his/her holiday pictures are stored. Usually these 'photo book' pages tend to have small 'preview' pictures and links to the full resolution images. In this usage scenario, the PExtractor would save the full resolution images.
The testing process focuses on the logical internals of the software assuring that all the statements have been tested and also on the functional externals by conducting tests to uncover errors. This process also ensures that defined input will produce actual results that agree with required results. Testing is an important part of the Development Life Cycle. The amount of testing required is related to the size and complexity of the application. In this chapter, various testing strategies adopted in testing are explained.
6.1 UNIT TESTING
Unit testing concentrates on each unit of the software implemented in the source code. The developer at the module level carries this out and this testing is white-box oriented. The module interface is tested to ensure that information properly flows into and out of the program unit under test. The local data structure is examined to ensure that data stored temporarily maintains its integrity during all steps in an algorithm's execution. Boundary conditions are tested to ensure that the module operates properly at boundaries established to limit or restrict processing. All independent paths through the control structure are exercised to ensure that all statements in a module have been executed at least once. And finally, all error-handling paths are tested.
The local data structure for a module is a common source of errors. Test cases should be designed to uncover errors in the following categories:
Improper or inconsistent typing
Erroneous initialization or default values
Incorrect (misspelled or truncated) variable names
Inconsistent data types
Underflow, overflow and addressing exceptions
6.2 INTEGRATION TESTING
The testing team carries this out and this testing is black box oriented. Integration testing is a systematic technique for constructing the program structure while at the same time conducting tests to uncover errors associated with interfacing. The objective is to take unit-tested modules and build a program structure that has been dictated by the design. Both integration and testing is done in this phase.
6.2.1 VALIDATION TESTING
The testing team carries out this test. Software validation is achieved through a series of black box tests that demonstrate conformity with requirements. A test plan outlines the classes of tests to be conducted and a test procedure defines specific test cases that will be used to demonstrate conformity with requirements.
6.3 SYSTEM TESTING
The software and other system elements are tested as a whole. The testing team or the project team carries this out and is black box oriented. At this stage, both the target hardware and software are available. Test cases derived using black box methodology from the Software Requirement Specification. While functional testing can be taken up in any hardware, the system testing has to be taken up only with the target hardware. In addition to functionality, other aspects such as usability, performance, recovery etc., are tested.
6.3.1 STRESS TESTING
Stress testing is designed to confront programs with abnormal situations. Stress testing executes a system that demands resources in abnormal quantity, frequency or volume. For doing stress testing, the application under test is subject to peak volume of data in a short time. This helps in the fine-tuning of the system.
6.3.2 PERFORMANCE TESTING
Performance testing is designed to test the run-time performance of software within the context of an integrated system. Performance testing occurs throughout all steps in the testing process. If system cannot meet performance specifications, no reasonable change of hardware / tuning can work. More repeat tests can be conducted and each slightly different set of tests has potential to uncover new errors. This can be a part of regression testing to make sure that performance has not degraded from build to build.
6.3.3 SECURITY TESTING
Security testing attempts to verify that protection mechanisms built into a system will protect it from improper penetration. During this the tester plays the role of the individual who desires to penetrate the system. The tester may attempt to acquire passwords through external clerical means, may attack the system with custom software designed to breakdown any defenses that have been constructed, hoping to find the key to system entry.
6.3.4 USABILITY TESTING
This testing enables the user Check for human factor problems:
Are outputs meaningful?
Are error diagnostics straightforward?
Does GUI have conformity of Syntax conventions, format, style, and abbreviations?
Is it easy to use?
It should provide on-line help or user manual. This should be consistent in its function and overall design.
It was a wonderful experience doing this project. Through this project I got the exposure to coding standards was of immense help in coding for the web applications. The project also gave me an opportunity to experience an entire software life cycle in a real life situation. It was good to experience what we have only read in books and have heard from others.
There are so many possibilities of adding improvements in this project. This project can be extended to higher capabilities depending upon the needs.