With the rapid growth of information sources available on the World Wide Web, it has become increasingly necessary for users to rely on automated tools to find the information resources they need and to track and analyze their usage patterns. These factors give rise to the need for server-side and client-side intelligent systems that can mine for knowledge. Web mining can be broadly defined as the discovery and analysis of valuable information from the World Wide Web. This covers the automatic search of information resources available online, i.e., Web content mining, and the discovery of user access patterns from Web servers, i.e., Web usage mining.
Web mining is the extraction of implicit patterns and valuable information from artifacts or activity associated with the World Wide Web. Three knowledge discovery fields are relevant to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web Content Mining is the process of extracting knowledge from the content of documents or their descriptions. Resource discovery based on concept indexing and web document text mining also fall into this category. Web Structure Mining is the process of inferring knowledge from the organization of the World Wide Web and the links between its documents. Finally, Web Usage Mining, also known as Web Log Mining, is the extraction of interesting patterns from web access logs.
1.2 CLASSIFICATION OF WEB MINING
Web content mining is a process that goes beyond keyword extraction. Since the content of a text document presents no machine-readable semantics, some approaches have suggested restructuring the document content into a representation that machines can exploit. The usual approach to exploiting the structure in a document is to use wrappers to map documents to some data model. Techniques that use lexicons for content interpretation are, however, still to come. There are basically two categories of web content mining strategies: those that directly mine the content of documents, and those that improve on the content search of other tools such as search engines.
The World Wide Web can reveal more information than that contained in its documents. For instance, links pointing to a document indicate its popularity, while links coming out of a document indicate the richness, or perhaps the variety, of topics covered in the document. When a paper is cited often, it is likely to be important. The PageRank and CLEVER methods take advantage of the information conveyed by links to find relevant web pages. By means of counters, higher levels of a concept hierarchy accumulate the number of artifacts subsumed by the concepts they hold, and counters of hyperlinks into and out of documents retrace the structure of the web artifacts.
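The hyperlink counters mentioned above can be illustrated with a minimal sketch. It is not PageRank or CLEVER; it only counts links into and out of each page. The function name `link_counts` and the toy link graph are assumptions made for illustration.

```python
from collections import Counter


def link_counts(links):
    """Count in-links and out-links per page from (source, target) pairs."""
    in_links, out_links = Counter(), Counter()
    for source, target in links:
        out_links[source] += 1  # one more link leaving the source page
        in_links[target] += 1   # one more link pointing at the target page
    return in_links, out_links


# Hypothetical toy link graph: each pair is (linking page, linked page).
toy_links = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d"), ("a", "b")]
in_links, out_links = link_counts(toy_links)

for page in sorted({p for pair in toy_links for p in pair}):
    print(page, "in:", in_links[page], "out:", out_links[page])
```

Link-analysis methods build on counts of this kind, but they also weight each link by the importance of the page it comes from.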
Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the web access logs of web sites can help in understanding user behaviour and the web structure, and thereby in improving the design of this enormous collection of resources. There are two main tendencies in Web Usage Mining, driven by the applications of the discoveries: General Access Pattern Tracking and Customized Usage Tracking. General access pattern tracking analyzes the web logs to understand access patterns and trends, which can be used to improve the structure and grouping of resource providers. Many web analysis tools exist, but they are limited and usually unsatisfactory. One web log data mining tool, WebLogMiner, has been designed, together with techniques for applying data mining and On-Line Analytical Processing (OLAP) to transformed and treated web access files. Applying data mining techniques to access logs reveals access patterns that can be used to restructure sites into a more efficient grouping, pinpoint effective advertising locations, and target specific users with specific selling ads. Customized usage tracking analyzes individual trends. Its purpose is to tailor web sites to users: the information displayed, the depth of the site structure, and the format of the resources can all be dynamically customized for each user over time, based on their access patterns. Although it is encouraging to see the various applications of web log file analysis, it is equally essential to know that the success of these applications depends on how much valid and reliable knowledge can be discovered from the large raw log data. Current web servers store only limited information about accesses. Scripts custom-tailored for particular sites may store additional information. For effective web usage mining, however, a data cleaning and transformation step may be needed before analysis.
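A minimal sketch of such a cleaning and transformation step is given below. It assumes that the server writes entries in the Common Log Format and that failed requests and requests for embedded resources (images, style sheets, scripts) are irrelevant to the analysis; the sample lines and the list of ignored suffixes are hypothetical.

```python
import re

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)
# Assumed noise: embedded resources that do not reflect a page view.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def clean_log(lines):
    """Parse raw log lines and keep only successful page requests."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # malformed entry
        record = match.groupdict()
        if record["status"] != "200":
            continue  # failed or redirected request
        if record["path"].lower().endswith(IGNORED_SUFFIXES):
            continue  # embedded resource, not a page view
        records.append(record)
    return records


sample = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:55:37 -0700] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /missing.html HTTP/1.0" 404 209',
    'not a log line',
]
for record in clean_log(sample):
    print(record["host"], record["time"], record["path"])
```

The cleaned records can then be grouped by host or session before any pattern discovery is attempted.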
1.3 WEB MINING PROS AND CONS
Web mining has many advantages that make the technology attractive to corporations and government agencies alike. It has enabled e-commerce companies to do personalized marketing, which ultimately results in higher trade volumes, and government agencies use it to classify threats and fight terrorism. Society can benefit from the predictive ability of mining applications in identifying criminal activities. Companies can establish better customer relationships by giving customers exactly what they need; they can understand customer requirements better and meet them faster. Companies can find, attract, and retain customers, and they can save on costs by using the insight gained into customer requirements. They can boost profitability through target pricing based on the profiles created. They can even identify a customer who might defect to a competitor and then try to retain that customer with targeted promotional offers, thereby reducing the risk of losing customers.
Web mining does not create issues by itself, but it can be a cause for concern when data of a personal nature is used. Invasion of privacy is the most criticized ethical issue involving web mining. Privacy is considered lost when information concerning an individual is obtained, used, or circulated, especially if this occurs without their consent. The obtained data is analyzed and grouped to form profiles, and it is made anonymous before grouping so that no individual can be linked directly to a profile. These applications nevertheless de-individualize users by judging them by their mouse clicks. De-individualization can be defined as a tendency to judge and treat people on the basis of group characteristics instead of their individual merits. Another important concern is that companies collecting data for a specific purpose might use it for a totally different purpose, which essentially violates the user's interests. The growing trend of selling personal data as a commodity encourages website owners to trade the personal data obtained from their sites. This trend has increased the amount of data being captured and traded, and with it the likelihood of one's privacy being invaded. The companies that buy the data are obliged to make it anonymous, and they are legally liable for the contents of any release: inaccuracies in a release can result in lawsuits. There is, however, no specific law that prevents them from trading the data. Some mining algorithms may use controversial attributes such as race or religion to categorize individuals, and these practices could violate anti-discrimination legislation.
These applications make it hard to identify the use of such controversial attributes, and there are no strong rules against the use of algorithms with such attributes. The process could therefore result in the denial of a service or a freedom to an individual based on race or religion. This situation can be avoided if data mining companies maintain high ethical standards. The collected data is made anonymous so that the obtained data and patterns cannot be traced back to an individual. It might seem as if this poses no danger to one's privacy, but in fact much additional information can be inferred by an application that combines two separate pieces of data about the user.
Using automated tools to find the desired information has become increasingly important because of the massive growth in the amount of information available on the web today. Web mining has enabled organizations to do personalized marketing in e-commerce and supports the process of business decision making.
WEB CONTENT MINING
With the extraordinary growth of the web, an ever-increasing volume of data is published in numerous web pages, and research in Web mining aims to develop new techniques to extract and mine useful information from these pages. Because Web data is heterogeneous and lacks structure, automated discovery of targeted or unexpected information is a challenging task. It calls for new methods that draw on a range of fields, from machine learning and natural language processing to data mining, databases, information retrieval, and statistics. The past few years have seen an explosive expansion of activity in the Web mining field, which comprises Web structure, Web usage, and Web content mining. The aim of web content mining is to extract information from Web page contents.
The Web offers both opportunities and challenges for data mining, because of the following characteristics:
- The amount of data available on the Web is huge and still growing rapidly and it is also accessible easily.
- The coverage of information on the web is wide and diverse and one can very easily find information about almost anything on the Web.
- Data of all types such as structured tables, multimedia data (e.g., images and movies), texts, etc. is available on the web.
- Information available on the Web is heterogeneous. Multiple Web pages may present similar information using entirely different formats or syntaxes, which makes integrating the information a difficult task.
- Much of the information on the web is semi-structured, owing to the nested structure of HTML code and the need to present information in a simple manner to facilitate human viewing and browsing.
- The information on the web is linked: there are links among the pages within a site and across different websites. These links serve as an information organization tool and also indicate trust in the linked pages and sites.
- Information is redundant, i.e., the same information may appear in many pages. This property has been exploited in many data mining tasks.
- A web page contains a mix of many kinds of information like advertisements, main contents etc.
- The Web consists of the surface web and the deep web. The surface web contains pages that can be browsed with a normal Web browser and are searchable through popular search engines. The deep web, on the other hand, consists of databases that are accessed through parameterized queries.
- Several websites allow a person to perform operations using input parameters, that is, they provide services.
- The Web is not only about data, information, and services; it is also about interaction among people and organizations.
- The Web is dynamic, i.e., the information on the Web changes continuously, and keeping up with the changes is important for many applications.
2.2 WEB CONTENT MINING TASKS
The web is a fascinating place; it offers many opportunities for data mining. There are many tasks associated with Web content mining. Web page classification and page clustering are the traditional mining tasks that are applied directly to the web data.
2.2.1 Structured Data Extraction
Structured data extraction is important and popular because structured data on the Web often carries essential information, such as lists of products and services. Extracting such data makes it possible to provide value-added services, for example meta-search and comparative shopping. Structured data is also easier to extract than unstructured text. There are several approaches to structured data extraction, which is also known as wrapper generation. The first is to write an extraction program manually for each website, based on the observed format patterns of the site; this approach is laborious and time-consuming, so it does not scale to a large number of sites. The second approach, wrapper induction or wrapper learning, is currently the main technique. It works as follows: the user manually labels a set of training pages, and a learning system then generates extraction rules from them. The resulting rules are applied to extract target items from other web pages; STALKER is an example of a wrapper induction system. The third approach is automatic: automatic methods aim to find patterns in web pages and use them to extract data.
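The idea of learning extraction rules from labelled training pages can be sketched as follows. This is a deliberately simplified illustration, not the STALKER algorithm: it assumes that the target item is surrounded by the same delimiter text on every page, and it learns that delimiter pair from the labelled examples. The function names and the sample pages are hypothetical.

```python
import os


def common_suffix(strings):
    """Longest string that every input string ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def learn_rule(examples):
    """Learn (left, right) delimiters from (page, labelled_value) pairs."""
    befores, afters = [], []
    for page, value in examples:
        start = page.index(value)
        befores.append(page[:start])               # text before the label
        afters.append(page[start + len(value):])   # text after the label
    return common_suffix(befores), os.path.commonprefix(afters)


def apply_rule(rule, page):
    """Extract the text between the learned delimiters on a new page."""
    left, right = rule
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]


# Hypothetical training pages with the price labelled by the user.
training = [
    ("<h1>Pen</h1><b>Price:</b> <i>12.50</i><p>In stock</p>", "12.50"),
    ("<h1>Ink</h1><b>Price:</b> <i>7.99</i><p>Sold out</p>", "7.99"),
]
rule = learn_rule(training)
print(rule)
print(apply_rule(rule, "<h1>Cap</h1><b>Price:</b> <i>3.10</i><p>In stock</p>"))
```

Real wrapper induction systems learn more flexible rules, for example sequences of landmarks with wildcards, so that they tolerate variation between pages.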
2.2.2 Unstructured Text Extraction
Most web pages can be seen as text documents. The extraction of information from web documents has been studied by many researchers, and this research is related to text mining and information retrieval. Some work uses common language patterns (common sentence structures used to express certain facts) together with the redundancy of information on the web to find concepts and named entities. These patterns can be specified by human users or learned automatically. Another line of research is web question answering. Question answering was first studied in the information retrieval literature, and it is very important on the web because the web offers the biggest source of information and the objective of many web search queries is to obtain answers to simple questions. Web question-answering systems extend question answering to the web through query transformation, query expansion, and then answer selection.
2.2.3 Web Information Integration
Because of the sheer scale of the Web, different websites may use dissimilar syntaxes to express similar information. To make use of information from multiple sites and to provide value-added services such as meta-search and deep web search, one needs to integrate the information from these sources semantically. Two problems stand out: Web query interface integration, which enables querying multiple databases hidden in the deep Web, and schema matching, for instance integrating the Yahoo and MSN directories by matching the concepts in their hierarchies.
2.2.4 Building Concept Hierarchies
Because of the huge size of the Web, organizing information is an important issue. It is hard to organize the whole web, but it is feasible to organize the search results of a given query. A ranked list of pages produced by a search engine is insufficient for many applications, and the standard way to organize information is a concept hierarchy or categorization. Text clustering groups similar search results together in a hierarchical fashion. A different approach does not use clustering but instead exploits the existing organizational structures in the original web documents, emphasizing tags and language patterns, to mine the important concepts and their hierarchical relationships. In other words, it uses the information redundancy property and the semi-structured nature of the web to find which concepts are important and how they are related.
2.2.5 Segmenting Web Pages and Detecting Noise
A web page consists of many blocks or areas, for example the main content area, advertisements, and the navigation area. For many practical applications it is helpful to separate these areas automatically. In web data mining, identifying the main content areas, or removing noisy blocks such as advertisements and navigation panels, leads to better results, because the information contained in noisy blocks can seriously harm mining. Identifying the different content blocks also makes it possible to reorganize the layout of a page so that the main content is easily seen without losing any other information from the page.
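As a rough sketch of noise removal, the following code keeps only the text that lies outside blocks assumed to be noisy. The assumption that noise is marked by the tags listed in `NOISY_TAGS` is made purely for illustration; real pages often need layout or frequency analysis to separate blocks.

```python
from html.parser import HTMLParser

# Assumed markers of noisy blocks; real pages rarely label noise this cleanly.
NOISY_TAGS = {"nav", "aside", "footer", "script", "style"}


class MainContentExtractor(HTMLParser):
    """Collect text that is not nested inside any noisy block."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # how many noisy blocks are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISY_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISY_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def main_text(html):
    """Return the page text with noisy blocks removed."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<nav>Home | About</nav>"
    "<div><p>Crude oil rose today.</p></div>"
    "<aside>Buy now!</aside>"
)
print(main_text(page))
```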
2.2.6 Mining Web Opinion Sources
Before the Web was available the opinion of the consumer was very difficult to obtain. The companies used to conduct consumer surveys or appoint external consultants to find opinions about their products and their competitors. But now a great deal of the information is publicly available on the Web. Numerous websites and pages are available containing consumer opinions, like customer reviews of products, discussion groups, and blogs. Techniques are now being developed to utilize these sources to help companies and even individuals to gain information effectively and easily.
The web is not only about textual data but also about the interaction among people and also among organizations. The aim of Web Content Mining is to extract the information from web page contents. The information on the web keeps on changing, so keeping up with this change is vital in order to utilize the information optimally.
WEB WAREHOUSE-AN INFORMATION FUSION TOOL IN WEB MINING
We have witnessed an explosive growth in the amount of online information available on the World Wide Web, which is a vast repository of information about many areas of interest. However, its potential is far from being fully explored. Users need applications that help them find and extract useful information from web data. The web was designed so that its information could be interpreted by humans, not processed automatically by software applications. It is difficult to design an efficient system that can handle the complexity of web data in reasonable time, and because of the heterogeneity of standards it is not easy to interpret the data automatically.
To address these challenges, a data warehousing approach can be adopted. The basic idea is to extract information from the web, transform it, and load it into a system called a Web Warehouse (WWh). A WWh provides access methods that enable the automatic processing of data. Web warehousing can extend the lifetime of web data and enable its reuse by different applications. Integration of web data is required for automatic offline processing by web mining applications. Companies such as Google and Amazon rely on WWh for their business. Characteristics of the web, such as the size and formats of its content, can be analyzed, but it is not easy to recognize which of them will affect the design of the WWh. Several methodologies are used to gather samples of the web and derive models, such as the analysis of web server traffic or query logs, but the methodology should be chosen with care so that it reflects the attributes of the portion of the web that will feed the WWh. Because of the complexity and large size of the web, a significant amount of resources and adequate tools are required.
Only part of the stored information is in the form of structured data, which belongs mainly to Online Transaction Processing (OLTP) systems. This structured data is extracted from the operational systems, transformed, and then loaded into the data mart or data warehouse. Based on the knowledge created by data mining and knowledge discovery technologies, the data warehouse is used in business intelligence, for example in business decision making.
A large percentage of the information available on the web is semi-structured textual information, stored as HTML pages that can be viewed with a browser. Web sites that provide search engines offer limited query functions, and the results they return are also HTML pages. XML (Extensible Markup Language) can be converted to HTML using a conversion tool that is available online, provided by Serprest. These pages contain much valuable hidden knowledge, so business organizations should by no means neglect this information, as it can be used in business decision making. Almost 80% of an organization's information is in textual format, and organizations use information from the internet for decision making. More and more organizations now do business on the internet, and semi-structured web pages play an important role in providing the latest information about their business domain. The information collected should therefore be placed in a Web Warehouse (WWh). To support high-level decision making, users need to use and analyze data from varied sources. A web warehouse is simply a data warehouse containing data obtained from web sources. Designing a web warehouse includes transforming the schema that describes the source data into a multi-dimensional schema for the information that will be analyzed and queried by the users. Constructing a web warehouse amounts to fusing web information from varied sources into a WWh. This web information includes semi-structured text as well as structured data.
3.2 WEB WAREHOUSE ARCHITECTURE
The WWh architecture emerged in response to evolving data and the growing requirement for web information. Originally, the role of the data warehouse was to extract transactional data from operational systems in order to perform OLAP. Data comes in different types, such as structured data and semi-structured text, but the traditional data warehouse handles only structured data. For this reason, the data warehouse needs to evolve into a WWh. The WWh architecture is shown in the figure below.
The general WWh architecture consists of four layers:
- Data source layer
- Warehouse construction layer
- Web mining layer
- Knowledge utilization layer
The data source layer consists of the organization's internal data sources, including internal files and OLTP data, together with external electronic messages and textual repositories. This layer provides the data foundation for WWh construction. Note that while the data sources of a classic data warehouse are mainly internal, those of a WWh are largely external. Hence it is very important to maintain an integrated schema that gives a unified view of the data.
The WWh is constructed using the EFML model, i.e., extraction, fusion, mapping, and loading. In this model, the fusion process integrates heterogeneous information from the web using mediation services; the WWh is thus used as a web information fusion tool. Once the information extracted from the web has been loaded into the WWh using the EFML model, mining methods such as OLAP or clustering can be used to explore the hidden knowledge and consequently to build a knowledge repository.
In the knowledge utilization layer of the WWh, users can search and query the knowledge repository through knowledge portals to obtain the information they need to make a decision.
3.3 EFML PROCESS MODEL
An overview of the EFML model for the construction of the WWh is given in figure 3.2. Note that this is simply Layer II of figure 3.1 in unfolded form. The model has five well-defined activities: extracting data from web pages; integrating information from varied sources; mapping the information into a schema; refining the data and schema; and finally loading the refined schema into the WWh.
The wrapper service plays a very important role in the extraction of web information. Its goal is to access the source, extract the useful data, and present that data in a specific format. The Object Exchange Model (OEM) is well suited to representing semi-structured data. A configurable extraction program that converts web pages into database objects is used as a wrapper to retrieve the relevant data in OEM. Python and YACC are examples of other tools for extracting web information.
The input to a wrapper is a specification that states where the relevant data is located on the HTML page and how that data should be packaged into objects. Such a wrapper is based on text patterns that identify the start and the end of the relevant data. Because the extractor relies on simple pattern matching rather than artificial intelligence techniques, it can analyze large amounts of web information. The extraction program parses the HTML pages, as can be seen in the figure. The extractor specification file contains a sequence of commands, each of which defines one extraction step. Every command has the form [variables, source, pattern], where source specifies the input text, pattern specifies the text of interest in the source, and variables holds the extracted result.
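The command sequence described above can be imitated with a small interpreter. The sketch below is an illustration only: it uses Python regular expressions as a stand-in for the pattern language of the actual extractor, whose syntax is not given here, and the names `run_spec`, `spec`, and the sample page are hypothetical.

```python
import re


def run_spec(commands, initial):
    """Run (variable, source, pattern) commands in order over a variable store.

    Each command applies `pattern` to the text held in `source` and stores
    the first captured group in `variable` (or "" if nothing matches).
    """
    store = dict(initial)
    for variable, source, pattern in commands:
        match = re.search(pattern, store[source], flags=re.DOTALL)
        store[variable] = match.group(1) if match else ""
    return store


# Hypothetical page fragment and extraction specification.
page = (
    "<html><body><table>"
    "<tr><td>Last</td><td>71.25</td></tr>"
    "</table></body></html>"
)
spec = [
    ("table", "page", r"<table>(.*?)</table>"),      # isolate the table
    ("row", "table", r"<tr>(.*?)</tr>"),             # isolate the first row
    ("last", "row", r"<td>Last</td><td>(.*?)</td>"), # pull out the value
]
result = run_spec(spec, {"page": page})
print(result["last"])
```

Each command narrows the text handed to the next one, which is why the order of commands in the specification matters.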
Extracting data through wrapper services may yield many structured data sets with different schemas. It is vital to maintain an integrated schema that gives a unified view of the extracted data. Two types of mediation service are used to fuse schemas, one for data without structural heterogeneities and one for data with them: the former, "m", fuses data with similar schemas or structures, and the latter, "M", integrates heterogeneous schemas or structures. From these two mediators an integrated schema covering the heterogeneous information is generated. A detailed illustration is given in figure 3.3. To resolve conflicts in the fusion of information, a conceptual representation of the data warehouse is needed. The main idea is to specify the appropriate matching and reconciliation operations to be used to resolve conflicts among data from varied sources. An entirely different solution to information fusion is the use of ontology-based services, whose goal is to perform the mediation process by resolving the heterogeneity problem.
The mapping, or transformation, process is vital in the construction of a WWh. High-level schema transformations are applied to the integrated schema in order to design the data warehouse. The designer applies mapping services or transformations that follow design strategies such as normalization or denormalization. In a normalized design, each fact is stored in one place, so redundancy is avoided. In a denormalized design, a fact may be stored in many places in the system; this kind of design carries redundant data, but it is still preferred for producing reports and browsing data.
After the above steps, the information is classified and indexed so that metadata can be created in terms of domain relationships. The domain concepts and usage constraints are specified in the WWh, and data pedigree information is added to the metadata descriptors. In addition, data analysis and web mining techniques are applied to discover patterns in the data. The metadata needs to be loaded into the WWh so that the hidden knowledge can be explored.
The loaded information should be indexed for fast retrieval purposes. When there are multiple users to be supported, it must be indexed by threads.
3.4 A CASE STUDY
In this case study, a decision must be made about the trading of crude oil on the basis of trading volume and web quotation information. To collect the related information, the website Money Control (http://www.moneycontrol.com/) is used as one of the information sources. The site reports the futures prices for the next few months and some earlier trading volumes for crude oil. A snapshot of the crude oil futures quotation is shown in the figure. Since the futures prices and the earlier trading volumes are not on the same web page, a WWh is needed to integrate this scattered information for the purpose of decision making.
3.4.2 Using the EFML model
The quotation information is displayed in HTML format, so the user cannot query it directly. The contents of the quotation tables therefore have to be extracted from the original HTML pages. The extraction is carried out in five steps, or commands. The first step fetches the contents of the source file, whose uniform resource locator (URL) is given, and stores them in a variable. After the file is fetched, the extractor filters out unwanted data such as HTML tags and other uninteresting text. In the second step, the result of applying a pattern to the source variable is stored in a new variable, which now contains the information of interest. In the third step, the extractor splits the contents of this new variable into chunks of text. These chunks are stored in a temporary variable whose contents are not part of the resulting OEM object. In the fourth step, the extractor copies the contents of the temporary array to a new array. The final step extracts the individual values of each cell from the new array and assigns them to the respective named variables.
After all the steps have been executed and the variables hold the data, the data is packaged into an OEM object whose structure follows the extraction process. OEM is suited to accommodating the semi-structured data found on the web because it is a schema-less model. Data in OEM is represented as a graph of objects, each of which has a label, a type, and a value.
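A minimal sketch of such a label, type, and value structure is shown below. The class name `OEMObject`, the helper `package_quote`, the type names, and the sample values are assumptions made for illustration; they do not reproduce the actual OEM implementation.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class OEMObject:
    """One node of the graph: a label, a type, and a value.

    The value is either an atomic value or a list of child objects,
    so no fixed schema has to be declared in advance.
    """
    label: str
    type: str
    value: Union[str, float, List["OEMObject"]]


def package_quote(fields):
    """Wrap extracted name/text pairs into one complex 'quotation' object."""
    children = [OEMObject(name, "string", text) for name, text in fields.items()]
    return OEMObject("quotation", "set", children)


# Hypothetical values produced by the extraction steps.
quote = package_quote({"time": "Dec", "last": "71.25", "high": "72.10"})
for child in quote.value:
    print(child.label, child.type, child.value)
```

Because each object carries its own label and type, two quotation objects may hold different sets of children without violating any schema.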
This process, however, can only extract information from the web; it cannot integrate information from separate sources. To fuse the web data and create a WWh, the mediation and mapping services must be used. The figure below shows the overall EFML process for the construction of the WWh.
There are two groups of web pages in the above example. In the extraction process, the crude oil quotation information in the first group is extracted by a specific extractor program, or wrapper service. The trading volume information in the second group is extracted in a similar way. During the fusion and mapping processes, data conflicts between different web pages of the same group are resolved by the mapping services and the m mediator. The fusion between the groups is performed by the M mediator and the mapping services. An integrated schema is thus obtained from the two sub-schemas. In the integration process, the mediators use correspondences between the schemas to integrate the different information: the M mediator uses the correspondence between the time attributes of sub-schema 1 and sub-schema 2 to perform the information fusion. The integrated schema consists of the relation quotation trading with the attributes time, last, high, low, changes, and volume.
Uniform access to web data, which enables its automatic processing, can be provided by creating a Web Warehouse. The idea is to extract the information from the web, transform it, and load it into a system, the Web Warehouse. This extends the lifetime of the web content and allows it to be reused by various applications.
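The fusion performed by the M mediator can be pictured as a join of the two sub-schemas on their time attribute. The sketch below assumes that each sub-schema has already been extracted into a list of records; the function name `fuse_on_time` and all sample values are hypothetical and are not taken from the Money Control site.

```python
def fuse_on_time(quotations, volumes):
    """Join quotation records with volume records that share the same time."""
    volume_by_time = {row["time"]: row["volume"] for row in volumes}
    fused = []
    for row in quotations:
        if row["time"] in volume_by_time:
            # Integrated record: time, last, high, low, changes, volume.
            fused.append({**row, "volume": volume_by_time[row["time"]]})
    return fused


# Sub-schema 1: quotations (made-up values for illustration).
quotations = [
    {"time": "2009-12", "last": 71.25, "high": 72.10, "low": 70.40, "changes": 0.35},
    {"time": "2010-01", "last": 72.05, "high": 72.90, "low": 71.30, "changes": 0.80},
]
# Sub-schema 2: trading volumes (made-up values for illustration).
volumes = [
    {"time": "2009-12", "volume": 15200},
    {"time": "2010-02", "volume": 9800},
]

for record in fuse_on_time(quotations, volumes):
    print(record)
```

Only times present in both sub-schemas appear in the fused relation; a real mediator would also have to reconcile differing time formats before matching.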
THE WGET APPLICATION
GNU Wget is a computer program that retrieves content from web servers. Its name is derived from World Wide Web and get, which reflects its primary function. Wget supports downloading via the HTTP, HTTPS, and FTP protocols, which are the most popular TCP/IP-based protocols used for web browsing.
Wget's features include recursive downloading, conversion of links for offline viewing of local HTML, and much more. Written in portable C, Wget can be easily installed on any Unix-like system and has been ported to many environments, including Microsoft Windows, Mac OS, AmigaOS, and OpenVMS. It first appeared in 1996, in tune with the growing popularity of the Web, which led to wide use among Unix users and distribution with most major Linux distributions. Wget is free software and has been used as the basis for graphical programs such as Gwget for the GNOME desktop.
5.2 FEATURES OF WGET
- Portability: GNU Wget is written in a highly portable style of C with minimal dependence on third-party libraries; little more than a C compiler and a BSD-like interface to TCP/IP networking is required. It is designed as a Unix program and has been ported to numerous Unix-like environments and systems, including Mac OS X and, via Cygwin, Microsoft Windows.
- Robustness: Wget has been designed for robustness over unstable network connections. If a download does not complete for some reason, Wget automatically tries to continue it from where it left off, and repeats this until the whole file has been retrieved.
- Recursive Download: Wget can also work like a web crawler: it extracts the resources linked from HTML pages and downloads them in sequence, repeating the process recursively until all the pages have been downloaded or a maximum recursion depth has been reached. The downloaded pages are saved in a directory structure resembling the one on the remote server. Recursive downloading thus allows partial or complete mirroring of web sites via HTTP. The links in the downloaded HTML pages can be adjusted to point to the locally downloaded content for offline viewing. When performing this kind of automatic mirroring, Wget honours the Robots Exclusion Standard (unless the option -e robots=off is given). Recursive download works with FTP as well: Wget issues the LIST command to find which additional files to download, and repeats this process for the directories and files under the one specified in the top URL. Shell-like wildcards are supported when the download of FTP URLs is requested.
While downloading recursively over HTTP or FTP, GNU Wget can be instructed to compare the timestamps of local and remote files and to download only the remote files that are newer than the corresponding local ones. This makes mirroring of HTTP and FTP sites easy, but it is considered inefficient and more error-prone than a program designed specifically for mirroring. On the other hand, no special server-side software is required for this task.
- Non-interactiveness: Wget is non-interactive in the sense that, once started, it requires no user interaction and does not need to control a TTY, since it can log its progress to a separate file for later inspection. The user can therefore start Wget and log off, leaving the program unattended. In contrast, most graphical or text user interface web browsers require the user to remain logged in and to restart failed downloads manually, which can be a hindrance when transferring a lot of data.
- Some other features of Wget:
- Wget supports downloading through proxies, which are deployed to provide web access inside company firewalls and to cache and quickly deliver frequently accessed content.
- Persistent HTTP connections are used where available.
- IPv6 is supported on systems that include the appropriate interfaces.
- SSL/TLS is supported for encrypted downloads using the OpenSSL library.
- Files larger than 2 GiB are supported on 32-bit systems that include the appropriate interfaces.
- Download speed can be throttled to avoid using up all of the available bandwidth.
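The recursive download feature listed above follows links until a maximum recursion depth is reached. The C++ sketch below illustrates that idea as a breadth-first traversal with a depth limit. It is not Wget's implementation: the web is replaced by a hypothetical in-memory map from each page to the pages it links to, and the names LinkGraph and crawl() are assumptions made for this example.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for the web: page URL -> URLs it links to.
using LinkGraph = std::map<std::string, std::vector<std::string>>;

// Visit pages breadth-first from `start`, following links at most
// `maxDepth` levels deep, and return the pages in visiting order.
std::vector<std::string> crawl(const LinkGraph& web, const std::string& start,
                               int maxDepth) {
    assert(maxDepth >= 0);
    std::vector<std::string> order;
    std::set<std::string> seen;
    seen.insert(start);
    std::queue<std::pair<std::string, int>> pending;
    pending.push(std::make_pair(start, 0));

    while (!pending.empty()) {
        std::pair<std::string, int> current = pending.front();
        pending.pop();
        order.push_back(current.first);  // a real crawler would download here
        if (current.second == maxDepth) continue;  // do not go any deeper
        LinkGraph::const_iterator it = web.find(current.first);
        if (it == web.end()) continue;  // page has no known outgoing links
        for (const std::string& next : it->second) {
            // Queue each page only once, even if several pages link to it.
            if (seen.insert(next).second)
                pending.push(std::make_pair(next, current.second + 1));
        }
    }
    return order;
}
```

With maxDepth set to 1 the sketch visits the start page and the pages it links to directly, which corresponds in spirit to calling Wget with -r -l1.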
5.3 USING WGET
5.3.1 Basic usage
The most typical use of GNU Wget is to invoke it from the command line, providing one or more URLs as arguments.
- To download the title page of test.com to a file named index.html: wget http://www.test.com/
- To download Wget's source code from the GNU FTP site: wget ftp://ftp.gnu.org/pub/gnu/wget/wget-latest.tar.gz
- To download only *.mid files from a website: wget -e robots=off -r -l2 --no-parent -A.mid http://www.jespero.com/dir/goto
- To download the title page of xyz.com together with the images and style sheets needed to display it, and to convert the links in it to refer to locally available content: wget -p -k http://www.xyz.com/
- To download the full contents of abc.com: wget -r -l 0 http://www.abc.com/
5.3.2 Advanced usage
- To read the list of URLs from a file: wget -i file
- To create a mirror image of a website with a single try per document, saving the log to gnulog: wget -r -t 1 http://www.mit.edu/ -o gnulog
- To retrieve the first layer of msn links: wget -r -l1 http://www.msn.com/
- To retrieve the index.htm of www.jocks.com and show the original server headers: wget -S http://www.jocks.com/
- To save the server headers with the file: wget --save-headers http://www.jocks.com/
- To retrieve the first three levels of ntsu.edu and save them to /tmp: wget -r -l3 -P/tmp ftp://ntsu.edu/
- To restart after Wget was interrupted in the middle of a download without clobbering the files already downloaded: wget -nc -r http://www.ntsu.edu/
- To keep a mirror of a page, use --mirror (or -m), which is shorthand for -r -N -l inf --no-remove-listing.
- To put Wget in the crontab file so that it rechecks a site every Sunday: 0 0 * * 0 wget --mirror http://www.zuma.org/pub/zumacs/ -o /home/mme/weeklog
- To write the documents to standard output instead of to files: wget -O - http://qwerty.pk/ http://www.qwerty.pk/
- To combine the two options and build a pipeline that retrieves the documents listed in a remote hotlist: wget -O - http://jot.list.com/ | wget --force-html -i -
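Command lines like the ones above can also be issued from a program. The following C++ sketch shows one way of assembling a Wget command line and handing it to the shell with std::system. It assumes a POSIX shell, in which single quotes protect characters such as & and ? in URLs; on Windows the quoting rules differ. The helper names shellQuote(), buildWgetCommand() and runWget() are hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Wrap an argument in single quotes for a POSIX shell, writing any
// embedded single quote as '\'' so the argument is passed through intact.
std::string shellQuote(const std::string& arg) {
    std::string quoted = "'";
    for (char c : arg) {
        if (c == '\'') quoted += "'\\''";
        else quoted += c;
    }
    quoted += "'";
    return quoted;
}

// Join "wget" and its quoted arguments into a single command line.
std::string buildWgetCommand(const std::vector<std::string>& args) {
    std::string command = "wget";
    for (const std::string& arg : args) command += " " + shellQuote(arg);
    return command;
}

// Run the command and return the status reported by std::system.
int runWget(const std::vector<std::string>& args) {
    assert(!args.empty());  // at least one URL or option is expected
    return std::system(buildWgetCommand(args).c_str());
}
```

For example, runWget({"-r", "-l1", "http://www.msn.com/"}) would execute the first-layer retrieval shown in the list, provided that Wget is installed and on the search path.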
5.4 AUTHORS AND COPYRIGHTS
GNU Wget was written by Hrvoje Nikšić with contributions from Dan Harkless, Mauro Tortonesi and Ian Abbott. Significant contributions are credited in the AUTHORS file included in the distribution, and the remaining ones are documented in the changelogs, which are also included with the program. Micah Cowan maintains the Wget software program. The copyright to Wget belongs to the Free Software Foundation, whose policy is to require copyright assignments for significant contributions to GNU software.
Wget is the descendant of Geturl, an earlier program by the same author, whose development started in late 1995; the name was later changed to Wget. At the time, no single program could download files via both the FTP and HTTP protocols. The existing programs either supported only FTP (such as dl and NcFTP) or were written in Perl. Wget took inspiration from the features of these existing programs, but its aim was to support both HTTP and FTP and to be buildable using only the standard tools found on every UNIX system.
At that time, many UNIX users also struggled with extremely slow dial-up connections, which led to a growing need for a downloading agent that could deal with transient network failures without assistance from a human operator.
5.5.1 NOTABLE RELEASES
The following releases mark milestones in the development of Wget. The features introduced by each release are listed alongside it.
- Geturl 1.0 was released in January 1996 and was the first publicly available release. The first English-language version was Geturl 1.3.4, released in June.
- Wget 1.4.0 was released in December 1996 and was the first version to use the name Wget.
- Wget 1.4.3 was released in February 1997 and was the first version released as part of the GNU project.
- Wget 1.5.3 was released in September 1998 and was a milestone in the program's recognition. This version was bundled with many Linux distributions.
- Wget 1.6 was released in December 1999 and incorporated many bug fixes for the 1.5.3 release.
- Wget 1.7 was released in June 2001 and introduced SSL support, persistent connections and cookies.
- Wget 1.8 was released in December 2001; this version added new progress indicators and introduced breadth-first traversal of the hyperlink graph.
- Wget 1.9 was released in October 2003 and included experimental IPv6 support and the ability to POST data to HTTP servers.
- Wget 1.10 was released in June 2005 and introduced large file support, IPv6 support on dual-family systems, SSL improvements and NTLM authorization. Maintainership was picked up by Mauro Tortonesi.
- Wget 1.11 was released in January 2008 and moved to version 3 of the GNU General Public License. It added preliminary support for the Content-Disposition header, which is often used by CGI scripts to specify the name of a file for downloading. Security-related improvements were also made to the HTTP authentication code.
- Wget 1.12 was released in September 2009 and added support for parsing URLs from CSS content on the web and for handling Internationalized Resource Identifiers.
5.5.2 Development and release cycle
Wget is developed in an open fashion. Its design decisions are discussed on a public mailing list followed by users and developers. Patches and bug reports are relayed to the same list.
GNU Wget is distributed under the terms of the GNU General Public License, version 3 or later, with a special exception that allows the distribution of binaries linked against the OpenSSL library. The exception clause is expected to be removed once Wget is modified to link with the GnuTLS library. Wget's documentation, in the form of a Texinfo reference manual, is distributed under the terms of the GNU Free Documentation License, version 1.2 or later. The man page usually distributed on Unix-like systems is automatically generated from a subset of the Texinfo manual and falls under the terms of the same license.
GNU Wget retrieves files using HTTP, HTTPS and FTP. It is a non-interactive command-line tool, so it can be called from scripts and from terminals that have no windowing system. Wget is particularly useful when a large number of files has to be retrieved and for mirroring websites.
The World Wide Web is a system of interlinked hypertext documents accessed via the Internet. Web pages can be viewed with a web browser, and a web search engine can be used to find the desired information on the web. Search engines store information retrieved from the HTML of the pages. The pages are retrieved by a Web Crawler, a kind of automated browser that follows every link on a website. The content of each page is then analysed so that the page can be indexed, which makes answering a search query very quick.
When a query is entered into a search engine as a set of keywords, the index is examined and the matching web pages are returned. Traditionally, a search engine looks for the exact word or phrase, but advanced features enable proximity search as well. The search engine ranks the results so that the most relevant result appears first. The ranking methodology differs from one search engine to another, and these methodologies keep evolving as the amount of data available on the web increases.
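The indexing and keyword lookup described above can be illustrated with a minimal inverted index in C++. This is a sketch under simplifying assumptions: words are split on whitespace, matching is exact and case-sensitive, and no ranking is performed. The names Index, addPage() and lookup() are hypothetical and do not describe how any particular search engine is built.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Inverted index: keyword -> set of page URLs whose text contains it.
using Index = std::map<std::string, std::set<std::string>>;

// Index one page by recording its URL under every word of its text.
void addPage(Index& index, const std::string& url, const std::string& text) {
    std::istringstream words(text);
    std::string word;
    while (words >> word) index[word].insert(url);
}

// Return the pages that contain every keyword of the query.
std::set<std::string> lookup(const Index& index,
                             const std::vector<std::string>& keywords) {
    assert(!keywords.empty());  // a query needs at least one keyword
    std::set<std::string> result;
    bool first = true;
    for (const std::string& keyword : keywords) {
        Index::const_iterator it = index.find(keyword);
        if (it == index.end()) return std::set<std::string>();  // no match
        if (first) {
            result = it->second;
            first = false;
            continue;
        }
        // Keep only the pages that also contain this keyword.
        std::set<std::string> kept;
        for (const std::string& url : result)
            if (it->second.count(url)) kept.insert(url);
        result = kept;
    }
    return result;
}
```

Because the index is consulted instead of the pages themselves, answering a query only requires a few map lookups and set intersections, which is why indexing makes query answering quick.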
The design element of the Thesis consists of a thorough understanding of the Wget application and its use in a C++ program, which is compiled and executed in Dev-C++. GNU Wget serves as the program's interface to the Internet. The aim of the program is to build a search result application that does not require a web browser. Furthermore, multiple searches can be carried out simultaneously, and the results are stored in a file containing all the web addresses, i.e. the URLs.
6.2 THE DESIGN CODE
The program is written in C++ and uses Wget as its interface to the internet. It defines two functions, parsefile() and getfile(), which are called from the main function. The function parsefile() searches the internet for the desired file and getfile() retrieves the result. The main function calls these two functions and stores the result in the output file. The stored result gives the user the search details, such as the search number, the object that was searched for and the internet source from which it was retrieved. The user can enter multiple searches and obtain the results simultaneously.
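The source of parsefile() and getfile() is not reproduced in this section, so the following C++ sketch is not the thesis code. It only illustrates, under stated assumptions, how a page downloaded by Wget could be turned into result lines holding the search number, the searched object and the source URL. The function names extractLinks() and formatResults(), the tab-separated output format and the restriction to href attributes beginning with "http" are all assumptions made for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Collect the values of href="..." attributes that start with "http".
// This is a plain substring scan, not a full HTML parser.
std::vector<std::string> extractLinks(const std::string& html) {
    const std::string marker = "href=\"";
    std::vector<std::string> links;
    std::string::size_type pos = 0;
    while ((pos = html.find(marker, pos)) != std::string::npos) {
        pos += marker.size();
        std::string::size_type end = html.find('"', pos);
        if (end == std::string::npos) break;  // unterminated attribute
        std::string url = html.substr(pos, end - pos);
        if (url.compare(0, 4, "http") == 0) links.push_back(url);
        pos = end + 1;
    }
    return links;
}

// One output line per hit: search number, searched keyword, source URL.
std::string formatResults(int searchNumber, const std::string& keyword,
                          const std::vector<std::string>& links) {
    assert(searchNumber > 0);
    std::string out;
    for (const std::string& url : links)
        out += std::to_string(searchNumber) + "\t" + keyword + "\t" + url + "\n";
    return out;
}
```

In such an arrangement, the text returned by formatResults() for each search would be appended to the output file, so that several searches end up in a single list of URLs.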
Search engines operate using algorithms and human input. The methodology they use keeps changing and improving because of the ever-increasing amount of information on the web. The application designed here is capable of searching for multiple keywords simultaneously without the use of a web browser, with GNU Wget acting as its interface to the internet.
CONCLUSION AND FUTURE WORK
The aim of research in Web Mining is to use novel techniques to extract and mine information from the web effectively. Web Content Mining makes it possible to search the content of the web without going through the actual data manually. Such techniques, when applied, lead to considerable time savings, which is a vital factor in business decision making.
The creation of a Web Warehouse is not only suitable for Web Mining and knowledge extraction, but also acts as a platform for web information fusion.
The Wget program is a powerful utility that enables users to download files from the internet via the command line and makes it possible to automate the downloading process. Learning the different Wget options allows the user to control precisely what is downloaded, which makes the application very effective. Using Wget as an interface to the internet in a C++ program creates an application that helps in mining knowledge and in obtaining the required search results from a given website, such as a search engine.
6.2 OPERATION PROBLEMS AND LESSONS LEARNED
Lessons were learned and opportunities for improvement were discovered during the life cycle of this Thesis. The root causes of various problems in the field of Web Content Mining were identified and understood so that they can be applied to future projects. The amount of information on the web continues to grow rapidly, but not all of the available data is reliable. Different data sources give different estimates, and sometimes data that exists cannot be accessed because of various restrictions.
6.3 FUTURE WORK
The design and implementation of the Search Result Application is complete, but this is certainly not the end of the work. Web Content Mining is a field in which considerable advances are yet to be made towards the optimal utilization of the information and knowledge available on the web. One possible improvement to the various Web Mining applications is the continuous adjustment of the system in response to user reactions.