Automated Text Classification Using Machine Learning

The number of text documents is increasing exponentially with the advent of the internet and related technologies. Most of these documents are unstructured data, and finding patterns in them requires considerable effort and sophisticated techniques.

Many data mining techniques have been proposed for mining useful patterns in text documents and for automated text classification. More than 80% of the world's data is unstructured, yet many questions about this form of data remain unresolved. How to use and update discovered patterns effectively, how to process unstructured data into a structured form, and how to classify documents are still open research issues in text mining. Most existing text mining methods adopt term-based approaches. Other techniques, such as support vector machines (SVM), k-nearest neighbor, and further machine learning and artificial intelligence algorithms, are also used to address these problems and to improve system performance.

Automated classification refers to assigning documents to a set of pre-defined classes based on their textual content. The classification can be flat or hierarchical. This research will address document classification using machine learning techniques, supported by data mining concepts.
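As an illustration of flat classification into pre-defined classes, the following is a minimal sketch only. It uses a multinomial Naive Bayes classifier as a familiar stand-in; the essay does not commit to this algorithm, and the class names, the toy training documents, and the function names are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    # Words never seen during training carry no evidence and are skipped.
    tokens = [t for t in tokenize(text) if t in vocabulary]
    best_label, best_score = None, float("-inf")
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            smoothed = word_counts[label][token] + 1
            score += math.log(smoothed / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set with two pre-defined classes.
TRAINING_DOCS = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the bank raised interest rates", "finance"),
    ("shares fell on the stock market", "finance"),
]

MODEL = train_naive_bayes(TRAINING_DOCS)
print(classify("a goal in the football match", MODEL))
```

A hierarchical variant could apply the same kind of classifier at each level of a class tree, first choosing a top-level class and then a subclass within it.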

The performance and robustness of the proposed technique will be demonstrated by applying it to an existing benchmark dataset and comparing the results with those of relevant existing techniques.
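A comparison of this kind is usually reported with standard measures such as accuracy, precision, recall, and F1. The essay does not name the measures it will use, so the sketch below is an assumption about how such an evaluation might be computed; the function name and the example labels are hypothetical.

```python
def evaluate(true_labels, predicted_labels, positive_class):
    """Return accuracy, plus precision, recall and F1 for one chosen class."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(1 for truth, guess in pairs if truth == guess)
    # True positives, false positives and false negatives for the chosen class.
    tp = sum(1 for t, g in pairs if t == positive_class and g == positive_class)
    fp = sum(1 for t, g in pairs if t != positive_class and g == positive_class)
    fn = sum(1 for t, g in pairs if t == positive_class and g != positive_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical gold labels and classifier output for four test documents.
gold = ["sports", "sports", "finance", "finance"]
predicted = ["sports", "finance", "finance", "finance"]
print(evaluate(gold, predicted, positive_class="sports"))
```

Computing the same measures for each competing technique on the same held-out documents gives a like-for-like comparison.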

Clear Statement of the Problem

Text from media such as blogs, Facebook, magazines, books, and digital libraries is unstructured, and the information is not couched in a manner that is amenable to automatic processing.

To handle this kind of unstructured data, I want to apply text mining with machine learning algorithms and a rule-based approach to classify the documents.


To get a better understanding of the area.

To mine textual data in an intelligent way.

To apply certain machine learning techniques to classify text documents.

To improve on the accuracy of the state of the art.


More than 80% of the world's data is held in unstructured formats such as web pages, emails, books, and digital libraries. Information-intensive business processes demand that we go beyond simple document retrieval to "knowledge" discovery. This observation motivates me to convert unstructured data into a structured format and to classify text documents.

Build classifiers with high accuracy in all applicative contexts.

Classifying documents manually for use in the training phase is costly.

Classification of news stories and web pages according to their content.

Classifying an incoming stream of documents.

Introduction and Background

Text mining, also known as text data mining or knowledge discovery from textual databases, refers generally to the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents. It can be viewed as an extension of data mining or knowledge discovery from (structured) databases.

As the most natural form of storing information is text, text mining is believed to have a commercial potential higher than that of data mining. In fact, a recent study indicated that 80% of a company's information is contained in text documents. Text mining, however, is also a much more complex task (than data mining) as it involves dealing with text data that are inherently unstructured and fuzzy. Text mining is a multidisciplinary field, involving information retrieval, text analysis, information extraction, clustering, categorization, visualization, database technology, machine learning, and data mining.

This article presents a general framework for text mining consisting of two components: Text refining that transforms free-form text documents into an intermediate form; and knowledge distillation that deduces patterns or knowledge from the intermediate form. We then use the proposed framework to study and align the state-of-the-art text mining products and applications based on the text refining and knowledge distillation functions as well as the intermediate form that they adopt.

Text mining can be visualized as consisting of two phases: text refining, which transforms free-form text documents into a chosen intermediate form, and knowledge distillation, which deduces patterns or knowledge from the intermediate form. The intermediate form (IF) can be semi-structured, such as the conceptual graph representation, or structured, such as the relational data representation. An intermediate form can be document-based, wherein each entity represents a document, or concept-based, wherein each entity represents an object or concept of interest in a specific domain. Mining a document-based IF deduces patterns and relationships across documents. Document clustering/visualization and categorization are examples of mining from a document-based IF. Mining a concept-based IF derives patterns and relationships across objects or concepts. Data mining operations, such as predictive modeling and associative discovery, fall into this category. A document-based IF can be transformed into a concept-based IF by realigning or extracting the relevant information according to the objects of interest in a specific domain. It follows that a document-based IF is usually domain-independent, whereas a concept-based IF is domain-dependent.

Text refining converts unstructured text documents into an intermediate form (IF). IF can be document-based or concept-based. Knowledge distillation from a document-based IF deduces patterns or knowledge across documents. A document-based IF can be projected onto a concept-based IF by extracting object information relevant to a domain. Knowledge distillation from a concept-based IF deduces patterns or knowledge across objects or concepts.
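The two kinds of intermediate form can be sketched in a few lines. This is a simplified illustration, not the framework's actual implementation: the document-based IF is assumed to be a per-document term-frequency table, the concept-based IF is obtained by projecting it through a small domain lexicon, and the lexicon, document ids, and function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical domain lexicon mapping surface terms to domain concepts.
CONCEPT_LEXICON = {
    "gene": "GENE",
    "genes": "GENE",
    "protein": "PROTEIN",
    "proteins": "PROTEIN",
    "enzyme": "ENZYME",
}


def refine_to_document_if(documents):
    """Text refining: map each document id to its term-frequency table."""
    return {
        doc_id: Counter(re.findall(r"[a-z]+", text.lower()))
        for doc_id, text in documents.items()
    }


def project_to_concept_if(document_if, lexicon):
    """Re-key the document-based IF by concept: concept -> {doc_id: frequency}."""
    concept_if = {}
    for doc_id, term_counts in document_if.items():
        for term, count in term_counts.items():
            concept = lexicon.get(term)
            if concept is not None:  # keep only domain-relevant terms
                concept_if.setdefault(concept, Counter())[doc_id] += count
    return concept_if


# Hypothetical free-form input documents.
DOCUMENTS = {
    "d1": "The gene encodes a protein.",
    "d2": "Proteins interact with other proteins and genes.",
}

DOCUMENT_IF = refine_to_document_if(DOCUMENTS)
CONCEPT_IF = project_to_concept_if(DOCUMENT_IF, CONCEPT_LEXICON)
print(CONCEPT_IF)
```

The document-based table needs no knowledge of the domain, while the projection depends entirely on the lexicon, which mirrors the domain-independent versus domain-dependent distinction drawn above.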

Whereas data mining is largely language independent, text mining involves a significant language component. It is essential to develop text refining algorithms that process multilingual text documents and produce language-independent intermediate forms. While most text mining tools focus on processing English documents, mining from documents in other languages allows access to previously untapped information and offers a new host of opportunities.

Domain knowledge, not catered for by any current text mining tools, could play an important role in text mining. Specifically, domain knowledge can be used as early as the text refining stage. It is interesting to explore how one can take advantage of domain information to improve parsing efficiency and derive a more compact intermediate form. Domain knowledge could also play a part in knowledge distillation. In a classification or predictive modeling task, domain knowledge helps to improve learning/mining efficiency as well as the quality of the learned model (or mined knowledge). It is also interesting to explore how a user's knowledge can be used to initialize a knowledge structure and make the discovered knowledge more interpretable. [1]

Related Work

M. Ikonomakis used automated text classification to manage and process vast amounts of documents in digital form. A central problem in text categorization is that the number of features can easily reach tens of thousands, which raises big hurdles in applying many sophisticated learning algorithms. M. Ikonomakis applies SVM and k-nearest neighbor techniques to overcome this problem. [2]
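To make the feature-count problem concrete, here is a minimal sketch of a k-nearest neighbor classifier that first prunes rare terms by document frequency. It is not the method of [2]: the pruning threshold, the cosine similarity measure, the toy training set, and every function name are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def select_features(texts, min_doc_freq=2):
    """Keep only terms that occur in at least min_doc_freq documents."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(tokenize(text)))
    return {term for term, freq in doc_freq.items() if freq >= min_doc_freq}


def vectorize(text, features):
    """Term-frequency vector restricted to the selected features."""
    return Counter(t for t in tokenize(text) if t in features)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, training, features, k=3):
    """Majority vote among the k training documents most similar to text."""
    query = vectorize(text, features)
    scored = sorted(
        ((cosine(query, vectorize(doc, features)), label) for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set.
TRAINING = [
    ("football match ended in a draw", "sports"),
    ("the football team won the match", "sports"),
    ("the team lost the cup match", "sports"),
    ("the bank cut interest rates", "finance"),
    ("interest rates hit the stock market", "finance"),
    ("the stock market and the bank rallied", "finance"),
]

FEATURES = select_features([doc for doc, _ in TRAINING])
print(knn_classify("the football team played a match", TRAINING, FEATURES))
```

Even on this tiny corpus the pruning step shrinks the vocabulary from twenty distinct terms to nine; on a real collection the same idea is one simple way to keep the feature space manageable.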

In disgital and data [3]

Debora Maria Rossi de Medeiros presents a few experiments applying text mining and machine learning techniques to help associate meaning with gene clusters. These experiments were applied to paper abstracts and interaction database data related to Saccharomyces cerevisiae genes, both for identifying the content of texts and for explaining the biological meaning of the gene clusters found. [4]

Jae-Hong Eom and Byoung-Tak Zhang introduce PubMiner, an intelligent machine-learning-based text mining system for mining biological information from the literature, which performs efficient interaction mining of biological entities such as genes, proteins, and enzymes. [5]

Pak Chung Wong, Wendy Cowley, and Harlan Foote present data mining and visualization techniques for the discovery of sequential patterns in large datasets. They conclude that the strengths of the two approaches can compensate for each other's weaknesses. They then introduce a visual data mining environment that contains a data mining engine to discover the patterns and their support values, and a visualization front end to show the distribution and locality of the patterns. Their results show that one can learn more, and more quickly, in such an integrated visual data mining environment. [6]

Hiroki Arimura and Junichiro Abe investigate a new access method, based on text mining, for large text databases on the internet. First, they formalize the text mining problem as an optimized pattern discovery problem using a statistical measure. Then they give fast and robust pattern discovery algorithms that are applicable to large collections of unstructured text data in real time. Their computer experiments on interactive document browsing and keyword discovery from the Web show the efficiency of the method on real text databases. [7]

Project Plan / Schedule



Research Survey: 3 Weeks

Technique Development: 1 Month

(activity not listed): 2 Weeks

(activity not listed): 3 Weeks

Result Generation: 2 Weeks

Result Analysis: 2 Weeks

Thesis Writing: 1 Month

Resources Required




Visual Studio 2010

SQL Server 2008

SmartDraw / CorelDRAW


Journal Access