Methods To Retrieve Web Documents Computer Science Essay


Different research papers related to text mining have been reviewed in this survey in order to understand the problem domain, the possible approaches to solving the problem, and the improvements suggested by different authors. This may help in designing a new approach that solves the problem and offers some improvements. Thus, in this section the summaries of the different approaches, as well as the problems associated with them, are discussed.

Annotation: A Consistent Web Documents Based Text Clustering Using Concept Based Mining Model

Author Name: V.M. Navaneet Kumar, Dr. C.C. Chandrasekhar

Published Year: July 2012

Work Built on: Methods to retrieve web documents using text clustering based on a concept-based mining model.

New/Idea/Algorithm/Architecture: A new concept-based mining model that bridges text mining and natural language processing. The components of the mining model capture appropriate concepts from the text in order to improve text clustering quality.

Result Obtained: In this paper, the new mining method for web documents is compared with existing clustering processes. The performance of the proposed web document clustering method has been analyzed against concept-based mining models on different datasets using the F-Measure and Entropy measures. The results show that the proposed model improves clustering efficiency.

Conclusion: The experimental results show that the proposed algorithm performs well in terms of sentence similarity and clustering efficiency compared with an existing document clustering algorithm.


Annotation: A survey of Text Mining Techniques and Applications

Author Name: Vishal Gupta

Published Year: 2009

Work Discussed: Application domains where text mining can be used are discussed, such as information retrieval, topic tracking, summarization, categorization, clustering, concept linkage, information visualization and question answering. Text mining operations are discussed, such as feature extraction, search and retrieval, clustering, summarization, text-based navigation and categorization. A detailed description of the primary objectives of each of these operations is given.


Annotation: An efficient Concept-Based Mining Model for Enhancing Text Clustering

Author Name: M. Menaga, B. Hemapriya

Published Year: 2013

Problem Addressed: Concept based mining model to calculate similarity between two documents.

Work Built on: Concept-based statistical analyzer, conceptual ontological graph and concept-based extractor algorithm.

New/Idea/Algorithm/Architecture: Porter's algorithm for text mining
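Porter's algorithm reduces inflected words to a common stem by stripping suffixes through five ordered phases of context-sensitive rules. A much-simplified sketch of the suffix-stripping idea follows (the suffix list and length check here are illustrative, not the full algorithm):

```python
def simple_stem(word):
    """Strip a common English suffix -- a simplified sketch of the
    suffix-stripping idea behind Porter's algorithm (the real
    algorithm applies five ordered phases of context-sensitive rules)."""
    for suffix in ("ational", "ization", "fulness", "iveness",
                   "ement", "ation", "izer", "ness", "able", "ible",
                   "ment", "ing", "ed", "ly", "es", "s"):
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["clustering", "clustered", "clusters"]:
    print(w, "->", simple_stem(w))
```

In a clustering pipeline, stemming collapses morphological variants ("clustering", "clustered", "clusters") onto one index term before term weights are computed.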

Result Obtained: A large set of text mining datasets is used to evaluate the proposed concept-based mining model. Experimental results show a fundamental improvement in clustering quality using the document-based, sentence-based, corpus-based and combined concept analysis approaches. The quality of text clustering is significantly enhanced using the concept-based text mining model in comparison with traditional single-term-based approaches.

Conclusion: By analyzing the semantic structures of the sentences in documents, a large improvement in the clustering result is achieved.


Annotation: Concept Based Mining Model

Author Name: Shady Shehata, Fakhri Karray and Mohamed Kamel

Published Year: 2010

Work Discussed: Different text mining techniques use statistical analysis, phrase analysis and word analysis to capture term frequency within a document. When two terms have the same frequency within a document, the term that contributes more to capturing the semantics of the text receives more weight in the mining model.


Annotation: Deploying approaches for pattern refinement in Text Mining

Author Name: Sheng-Tang Wu, Yuefeng Li and Yue Xu

Published Year: 2006

Problem Addressed: Varying weighting schemes related to word frequency and inverse document frequency have been analyzed for their effectiveness in document clustering.
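The standard combination of these two quantities is the tf-idf weight, tf(t, d) · log(N / df(t)). A minimal sketch of one common variant (not necessarily the exact schemes compared in the paper):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights over tokenized documents:
    weight(t, d) = tf(t, d) * log(N / df(t)),
    where df(t) is the number of documents containing t."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["text", "mining", "text"], ["web", "mining"], ["web", "text"]]
w = tf_idf(docs)
```

Note that a term appearing in every document gets weight zero (log(1) = 0), which is exactly the discriminative-power argument behind inverse document frequency.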

Work Built on: Two methods of deploying process are proposed - Pattern Deploying with Relevance Function and Pattern Deploying Method.

Result Obtained: A comparative analysis of the proposed methods with existing techniques such as Pr, Rocchio and the PTM method has been carried out.

Conclusion: In this paper, two new pattern refinement techniques have been proposed to deploy discovered document patterns into a feature space that is adopted to express the concepts of documents. The experimental study proves that the pattern refinement techniques can improve the efficiency of pattern-based methods.


Annotation: Enhancing Text Clustering Using Concept-based Mining Model

Author Name: Lincy Lipatha, Raja K. G.Tholkappais Arasu

Published Year: 2010

New/Idea/Algorithm/Architecture: A concept-based analysis algorithm has been proposed.

Result Obtained: Results show that the concept-based similarity measure achieves an accurate calculation of pair-wise document similarity, devised by combining the factors affecting the weights of concepts at the sentence, document and corpus levels.

Conclusion: The authors conclude that the four components of the new concept-based mining model are used to improve the quality of text clustering. Better clustering quality has been achieved by extracting the semantic structure of sentences in documents.


Annotation: Feature Engineering for Text Classification

Author Name: Sam Scott, Stan Matwin

Published Year: 2011

Problem Addressed: Most text mining classification algorithms use a bag-of-words representation of text. However, this way of representing text is not suitable for rule-based learners.

New/Idea/Algorithm/Architecture: Ensemble architecture to combine results of multiple classifiers together and find out final result by using classifier voting techniques.
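Classifier voting can be as simple as taking the majority label across base classifiers. A minimal sketch with hypothetical class labels (the labels and classifiers here are illustrative, not from the paper):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the class labels predicted by several base classifiers
    for one document; ties break toward the first label seen."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical labels from three base classifiers for three documents:
per_doc = [["sports", "sports", "politics"],
           ["tech", "tech", "tech"],
           ["politics", "sports", "politics"]]
final = [majority_vote(p) for p in per_doc]
print(final)  # ['sports', 'tech', 'politics']
```

More elaborate schemes weight each classifier's vote by its validation accuracy, but the aggregation step has the same shape.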

Conclusion: The authors show a significant improvement in results. The combined-classifier approach consistently outperforms a single classifier.


Annotation: Performance Evaluation of K-Means and Fuzzy C-Means Clustering Algorithms for Statistical Distributions of Input Data Points

Author Name: T. Velmurugan, T. Santhanam

Published Year: 2010

Work built on: This research work deals with two widely used clustering algorithms, namely the centroid-based K-Means and the representative-object-based Fuzzy C-Means.

New/Idea/Algorithm/Architecture: These two algorithms are implemented and the performance is analyzed based on their clustering result quality.

Result Obtained: The behavior of both algorithms depends on the number of data points as well as on the number of clusters. The input data points are generated in two ways, one using a normal distribution (via the Box-Muller transform) and the other using a uniform distribution. The performance of the algorithms is investigated over different executions of the program on the input data points. The execution time of each algorithm is also analyzed and the results are compared with one another.

Conclusion: The K-Means algorithm performs better than FCM for both normal and uniform distributions. FCM produces results close to K-Means clustering, yet it requires more computation time than K-Means because of the fuzzy membership calculations involved in the algorithm.
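The centroid-update loop at the heart of K-Means can be sketched as follows (1-D data and fixed initial centroids for determinism; FCM differs in maintaining soft membership degrees for every point, which is where its extra computation comes from):

```python
def kmeans(points, centroids, iters=20):
    """Plain K-Means on 1-D data: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Empty clusters keep their previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.2, 0.8, 8.0, 8.4, 7.6]
print(kmeans(points, [0.0, 10.0]))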


Annotation: Statistical Phrases in Automated Text Categorization

Author Name: Maria Fernanda Caropreso, Stan Matwin and Fabrizio Sebastiani

Published Year: 2009

Work built on: In this work we investigate the usefulness of n-grams in TC independently of any specific learning algorithm. We do so by applying feature selection to the pool of all α-grams (α ≤ n), and checking how many n-grams score high enough to be selected in the top σ α-grams.
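Building the pool of all k-grams for k ≤ n, which feature selection then scores and ranks as described above, can be sketched as:

```python
def word_ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def gram_pool(tokens, n):
    """Pool of all k-grams for k <= n: the candidate feature set
    that a selection function (e.g. information gain) then ranks."""
    return [g for k in range(1, n + 1) for g in word_ngrams(tokens, k)]

print(gram_pool("text mining survey".split(), 2))
```

For n = 2 this yields the three unigrams followed by the two bigrams; feature selection then checks how many multi-word grams survive alongside single words.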

Conclusion: In TC, the dimensionality of the feature space is an important parameter, and because of this any comparison between different representation schemes is significant only if the numbers of features used are the same.


Annotation: Study on Information Extraction Methods from Text Mining and Natural Language Processing Perspectives

Author Name: Farhad Soleimanian Gharehchopogh, Zeinab Abbasi Khalifehlou

Published Year: 2011

Problem Addressed: In this paper the authors address an issue related to unstructured data. It is very difficult to extract information from unstructured data because it is stored in archival form in text files and data warehouses.

Work Built on: The authors have worked on natural language processing and artificial intelligence techniques to extract actionable information from unstructured data.

Conclusion: The authors conclude with the remark that text mining and natural language processing techniques have the capability to understand the semantics of web texts. Text mining and natural language processing are two areas of artificial intelligence that are used to extract useful information using machine learning, speech recognition and knowledge management systems.


Annotation: Cluster-Preserving Dimension Reduction Methods for Document Classification

Author Name: Peg Howland and Haesun Park

Published Year: 2007

Problem Addressed: Dimension reduction is a major problem when dealing with massive amounts of data.

Work Built on: The authors have proposed a new two-stage algorithm using the support vector machine technique.

New/Idea/Algorithm/Architecture: Author has proposed a two-stage approach that combines the theoretical advantages of linear discriminant analysis with the computational advantages of factor analysis methods.
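One simple reading of the centroid idea (a sketch for illustration, not the authors' exact formulation) is to represent each document by its inner products with the class centroids, reducing the dimension from the vocabulary size to the number of classes while preserving cluster structure:

```python
def centroid_reduce(docs, labels):
    """Centroid-style dimension reduction sketch: compute one centroid
    per class, then represent each document vector by its dot product
    with each centroid. Output dimension = number of classes."""
    classes = sorted(set(labels))
    centroids = []
    for c in classes:
        members = [d for d, l in zip(docs, labels) if l == c]
        centroids.append([sum(col) / len(members) for col in zip(*members)])
    return [[sum(x * y for x, y in zip(d, c)) for c in centroids]
            for d in docs]

# Four 4-dimensional documents in two classes -> 2-dimensional output.
docs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(centroid_reduce(docs, [0, 0, 1, 1]))
```

The reduced representation keeps documents of the same class close together, which is the cluster-preserving property the paper is concerned with.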

Result Obtained: The authors' experiments show that the centroid method provides a sufficiently accurate SVD approximation for the purpose of dimension reduction.

Conclusion: The authors conclude with the notable remark that dimension reduction should only be performed in the pre-processing stage of whatever classifier is used for document classification. Document classification and document retrieval are directly related to each other. The expense of dimension reduction is offset by its effectiveness in reducing the cost involved in the classification process.


Annotation: Automatic Discovery of Similar Words

Author Name and Year: Pierre Senellart and Vincent D. Blondel

Problem Addressed: In this paper, the authors review some methods used for the extraction of similar words from different kinds of sources, such as the World Wide Web, large corpora of documents and monolingual dictionaries. The authors were searching for advanced methods for the automatic discovery of synonyms. It has been found that it is very difficult to automatically distinguish antonyms, synonyms and general words that are semantically very close to each other.

New/Idea/Algorithm/Architecture: The authors developed two techniques that, given an input word, automatically compile a list of synonyms and near-synonyms, and can also generate a thesaurus.
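The authors' graph-similarity method is more elaborate, but the underlying distributional idea (words that appear in similar contexts tend to be similar) can be illustrated with cosine similarity over co-occurrence vectors. This is a simple stand-in for illustration, not the paper's method:

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words appearing within `window`
    positions of it across a corpus of tokenized sentences."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    vecs[w][sent[j]] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

sents = [["the", "car", "is", "fast"],
         ["the", "automobile", "is", "fast"],
         ["the", "dog", "barks", "loudly"]]
vecs = cooccurrence_vectors(sents)
```

Here "car" and "automobile" share identical contexts and score near 1.0, while "car" and "dog" overlap only on the stop word "the". The hard part the paper highlights, distinguishing synonyms from antonyms, remains: antonyms also share contexts.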

Result Obtained: The results of different methods, such as the Distance method, ArcRank and the authors' graph-similarity-based method, are compared; to evaluate their relevance, the authors examine the first ten results given by each of them for four words, chosen for their variety.

Conclusion: A number of different methods exist for the automatic discovery of similar words. Most of these methods are based on various text corpora, and three of these are described in this chapter. Each of them may be more or less suited to a specific problem. The use of a more structured source, a monolingual dictionary, for the discovery of similar words has also been described. None of these methods is perfect, and in fact none of them competes favourably with human-written dictionaries in terms of reliability. Computer-written thesauri, however, have other advantages, such as being easy to build and maintain.


Annotation: Principal Direction Divisive Partitioning with Kernels and k-Means Steering

Author Name: Dimitrios Zeimpekis and Efstratios Gallopoulos

Published Year: 2007

Goal for Implementation: The authors' goal was to improve the kernel k-means algorithm with deterministic approaches that are expected to give better results when there are strong nonlinearities in the data at hand.

Work built on: We propose, implement, and evaluate several schemes that combine partitioning and hierarchical algorithms, specifically k-means and principal direction divisive partitioning (PDDP).

New/Idea/Algorithm/Architecture: The paper proposes a kernel version of the PDDP algorithm, along with some variants that appear to improve the kernel versions of both PDDP and k-means. The memory implications caused by the use of the term-document-matrix Gramian in kernel methods are currently under investigation. Using available theory regarding the solution of the clustering indicator vector problem, 2-means is used to induce a partitioning around fixed or varying cut-points; 2-means is applied either on the data or on its projection onto a one-dimensional subspace.

Result Obtained: Extensive experiments demonstrate the performance of the above methods and suggest that it is advantageous to steer PDDP using k-means. It is also shown that KPDDP can provide results of superior quality compared to kernel k-means.

Conclusion: Results indicate that our hybrid clustering methods that combine k-means and PDDP can be quite successful in addressing the non-determinism in k-means, while achieving at least its "average-case" effectiveness. The selection of a specific technique can be dictated by the quality or run-time constraints imposed by the problem.


Annotation: Text Clustering with Local Semantic Kernels

Author Name: Loulwah AlSumait and Carlotta Domeniconi

Published Year: 2007

Problem Addressed: The clustering of documents presents difficult challenges due to the sparsity and high dimensionality of text data, and to the complex semantics of natural language.

Work built on: The proposed approach, called semantic LAC, is evaluated using benchmark datasets.

New/Idea/Algorithm/Architecture: This paper presents a subspace clustering technique based on a locally adaptive clustering (LAC) algorithm. To improve the subspace clustering of documents and the identification of keywords achieved by LAC, kernel methods and semantic distances are deployed. The basic idea is to define a local kernel for each cluster by which semantic distances between pairs of words are computed to derive the clustering and local term weightings.

Result Obtained: The experiments show that semantic LAC is capable of improving clustering quality. In addition, the experimental results show that other kernel methods, for example semantic smoothing of the VSM, LSK and diffusion kernels, may provide more sophisticated semantic representations.

Conclusion: In this paper, the effect of embedding semantic information within subspace clustering of text documents was investigated. A semantic distance based on a GVSM kernel approach is embedded in a locally adaptive clustering algorithm to enhance the subspace clustering of documents, and the identification of relevant terms. Results have shown improvements over the original LAC algorithm in terms of error rates for all datasets tested. In addition, the semantic distances resulted in more robust and stable subspace clustering.


Annotation: Knowledge Discovery in Text Mining Techniques Using Association Rule Extraction

Author Name: Vaishali Bhujade and N.J. Janwe

Published Year: 2011

Problem Addressed: The Apriori algorithm takes multiple scans of the original documents. It takes a long time to generate keywords that satisfy the threshold weight value and their frequency in each document. The large number of candidate keyword sets created by the Apriori system therefore causes a large performance gap between systems.
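The repeated-scan cost described above is visible in a minimal Apriori sketch: every pass re-counts all current candidates against the full transaction (document keyword) set before joining the survivors into larger candidates:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori over sets of items: each pass re-scans every
    transaction to count the current candidates (the repeated-scan
    cost), then joins surviving k-itemsets into (k+1)-item candidates."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    current = [frozenset([i]) for i in items]
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        current = list({a | b for a, b in combinations(survivors, 2)
                        if len(a | b) == k + 1})
        k += 1
    return frequent

baskets = [frozenset(b) for b in (["bread", "milk"], ["bread", "beer"],
                                  ["bread", "milk", "beer"], ["milk"])]
freq = apriori(baskets, min_support=2)
```

With keyword sets standing in for market baskets, each `while` iteration is one full scan of the document collection, which is exactly the overhead the GARW-based EART system is designed to reduce.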

Work Built on: This paper describes automatic extraction of association rules from large amount of textual documents using text mining techniques.

New/Idea/Algorithm/Architecture: EART system based on GARW algorithm

Result Obtained: The experimental study shows various comparisons between EART and other rule-based systems. The results indicate that the EART system based on the GARW algorithm always outperforms the Apriori-based system for all values of minimum support.

Conclusion: The Extracting Association Rules from Text (EART) algorithm automatically indexes documents by labelling each document with a set of keywords that satisfy the given weight constraints based on the weighting scheme.


Annotation: A Framework for Emotion Mining from Text in Online Social Networks

Author Name: Mohamed Yassine H and Hazem Hajj

Published Year: 2010

Research Objective: The goal is to extract the emotional content of texts in online social networks.

Work built on: In this paper, a new framework is proposed for characterizing emotional interactions in social networks, and then using these characteristics to distinguish friends from acquaintances.

New/Idea/Algorithm/Architecture: The framework includes a model for data collection, database schemas, data processing and data mining steps.

Result Obtained: Experiments run on the data showed high efficiency of the method. The model predicted the right class with 88% accuracy. Additionally, when the resulting model was used to predict relationship strength between two users, the prediction reported an accuracy of 87% based solely on the subjective content of comments shared online.

Conclusion: This paper discusses a novel sentiment mining technique for texts in online social networks. It presents a new perspective for studying friendship relations and emotions' expression in online social networks where it deals with the specific nature of these sites and the nature of the language used.


Annotation: A Text Mining Model for Strategic Alliance Discovery

Author Name: Yilu Zhou, Yi Zhang, Nicholas Vonortas and Jeff Williams

Published Year: 2012

Problem Addressed: This research addresses the limitations by proposing a text mining framework that automatically extracts alliances from news articles.

Work Built on: Text mining models of Strategic Alliance Discovery

New/Idea/Algorithm/Architecture: We propose the ADT method, a template-based relation extraction method that utilizes entity extraction, POS tagging and dependency parse trees. Moreover, we aggregate information from single sentences and whole documents and generate ACRank, a corpus-based multi-feature ranking algorithm. An alliance knowledge portal is proposed to support alliance researchers in searching, browsing and visualizing the alliance extraction results. This portal could provide researchers with evidence of alliance announcements and assist in answering strategy and policy questions. The model not only integrates meta-search, entity extraction and shallow and deep linguistic parsing techniques, but also proposes an innovative ADT-based relation extraction method to deal with the extremely skewed and noisy news articles, and ACRank to further improve precision using various linguistic features.

Result Obtained: Evaluation on an IBM alliances case study showed that ADT-based extraction achieved 78.1% recall, 44.7% precision and 0.569 F-measure, and eliminated over 99% of the noise in the document collections. ACRank further improved precision to 97% for the top 20% of extracted alliance instances. Our case study also showed that the widely cited Thomson SDC database covered less than 20% of the total alliances, while our automatic approach covered 67%.

Conclusion: This research will not only encourage new research and discovery in economics and public policy, but also will advance techniques in text and Web mining research. By bringing text mining and knowledge discovery techniques into the field of economics and public policy, the research will foster the awareness of cross-disciplinary research and enrich collaboration between social science and computer science paradigms.

Future Work: In the future, we plan to expand our case study by including longer time periods and adding more companies from different industries. We also plan to study additional features in our ACRank algorithm and to add other templates to the relation extraction component.


Annotation: A Text Mining Technique Using Association Rules Extraction

Author Name: Hany Mahgoub, Dietmar Rösner, Nabil Ismail and Fawzy Torkey

Published Year: 2008

Problem Addressed: The manual keyword-assignment approach has many drawbacks. To remove these drawbacks, an automatic index of textual documents should be generated.

Work Built on: An algorithm for association rules generation based on the weighting scheme (GARW).

New/Idea/Algorithm/Architecture: This paper presents a new text mining technique for automatically extracting association rules from a collection of documents based on keyword features. The system has been designed to accept documents with different structures and formats and transform them into structured form; it is domain-independent, so it is flexible enough to apply to different domains.

Result Obtained: We compared the performance of our system, which is based on the GARW algorithm, with a system that uses the Apriori algorithm. We noticed that the GARW algorithm reduces the execution time compared to the Apriori algorithm.

Conclusion: The EART system extracted more interesting rules than the system it was compared against.

Future Work: We plan to extend our text mining system to use concept features to represent text and to extract more useful and meaningful association rules.


Annotation: Using Decision Trees and Text Mining Techniques for Extending Taxonomies

Author Name: Hans Friedrich Witschel

Published Year: 2012

Problem Addressed: Lexical taxonomies have tree-like structures and can thus be extended to become decision trees that serve for their own extension.

Work Built on: In this paper, a novel technique for extending lexical taxonomies was introduced that uses large corpora to identify concepts and calculate word similarities and a machine learning approach (decision trees) together with these similarities to insert new concepts at the right position of the tree.

New/Idea/Algorithm/Architecture: In this paper, a semi-automatic procedure for extending lexical taxonomies is proposed that makes use of term extraction methods for identifying new concepts and that uses co-occurrence data from large corpora to generate the necessary features (semantic descriptions) of the decision tree's nodes.

Result Obtained: Results show that (for reasonably large trees) the classification and learning accuracy can be improved quite significantly compared to a baseline algorithm that classifies all new concepts as direct hyponyms of the root node.

Conclusion: The automatic extension of lexical taxonomies is still a very difficult problem, and without human intervention it will fail to reach acceptable performance.