Automated Text Classification Using Machine Learning

The number of text documents is growing rapidly with the spread of the internet and related technologies. Most of these documents are unstructured, so finding patterns in them requires considerable effort and sophisticated techniques.

Many data mining techniques have been proposed for mining useful patterns in text documents and for automated text classification. It is often estimated that more than 80% of the world's data is unstructured, yet many questions about this kind of data remain unresolved. How to use and update discovered patterns effectively, how to process unstructured data into a structured form, and how to classify documents are still open research issues in text mining. Most existing text mining methods adopt term-based approaches. Other techniques, such as support vector machines (SVM), k-nearest neighbor, and further machine learning algorithms, are also used to address these problems and improve system performance.

Automated classification assigns documents to a set of pre-defined classes based on their textual content. The classification can be flat or hierarchical. This research will address document classification using machine learning techniques supported by data mining concepts.
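
As a rough illustration of flat classification into pre-defined classes, the sketch below trains a small multinomial Naive Bayes classifier using only the Python standard library. Naive Bayes is chosen here purely for brevity; this proposal does not commit to it, and the class names, the training sentences, and the NaiveBayesTextClassifier helper are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Hypothetical helper: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)      # class -> number of documents
        self.word_counts = defaultdict(Counter)      # class -> word -> count
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence and are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            score = math.log(doc_count / self.total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocabulary)
            for token in tokens:
                # Add-one smoothing avoids log(0) for words unseen in this class.
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for illustration only).
train_docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the bank raised interest rates",
    "shares fell as the market closed",
]
train_labels = ["sport", "sport", "finance", "finance"]

classifier = NaiveBayesTextClassifier().fit(train_docs, train_labels)
prediction = classifier.predict("the goal decided the match")
print(prediction)
```

A hierarchical variant could apply the same idea level by level, first choosing a top-level class and then a subclass within it.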

The performance and robustness of the proposed technique will be evaluated by applying it to existing benchmark datasets and comparing the results with those of relevant existing techniques.

Clear Statement of the Problem

Text from media such as blogs, Facebook, magazines, books, and digital libraries is unstructured, and its information is not expressed in a form that is amenable to automatic processing.

To handle this kind of unstructured data, I propose to apply text mining with machine learning algorithms and a rule-based approach to classify the documents.

Objectives

To gain a better understanding of the area.

To mine textual data in an intelligent way.

To apply selected machine learning techniques to classify text documents.

To improve on the accuracy of the state of the art.

Motivation

More than 80% of the world's data is held in unstructured formats such as web pages, emails, books, and digital libraries. Information-intensive business processes demand that we go beyond simple document retrieval to "knowledge" discovery. This observation motivates me to convert unstructured data into a structured format and to classify text documents.

Building classifiers with high accuracy across application contexts.

Manually classifying documents for use in the training phase is costly.

Classifying news stories and web pages according to their content.

Classifying an incoming stream of documents.

Introduction and Background

Text mining, also known as text data mining or knowledge discovery from textual databases, refers generally to the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents. It can be viewed as an extension of data mining or knowledge discovery from (structured) databases.

Because text is the most natural form for storing information, text mining is believed to have greater commercial potential than data mining. In fact, one study indicated that 80% of a company's information is contained in text documents. Text mining, however, is also a much more complex task than data mining because it deals with text data that are inherently unstructured and fuzzy. Text mining is a multidisciplinary field, involving information retrieval, text analysis, information extraction, clustering, categorization, visualization, database technology, machine learning, and data mining.

A general framework for text mining consists of two components: text refining, which transforms free-form text documents into an intermediate form, and knowledge distillation, which deduces patterns or knowledge from the intermediate form. This framework can be used to study and align state-of-the-art text mining products and applications according to their text refining and knowledge distillation functions and the intermediate form that they adopt.

Text mining can be visualized as consisting of two phases: text refining, which transforms free-form text documents into a chosen intermediate form, and knowledge distillation, which deduces patterns or knowledge from the intermediate form. The intermediate form (IF) can be semi-structured, such as a conceptual graph representation, or structured, such as a relational data representation. An intermediate form can be document-based, wherein each entity represents a document, or concept-based, wherein each entity represents an object or concept of interest in a specific domain. Mining a document-based IF deduces patterns and relationships across documents. Document clustering, visualization, and categorization are examples of mining from a document-based IF. Mining a concept-based IF derives patterns and relationships across objects or concepts. Data mining operations, such as predictive modeling and associative discovery, fall into this category. A document-based IF can be transformed into a concept-based IF by realigning or extracting the relevant information according to the objects of interest in a specific domain. It follows that a document-based IF is usually domain-independent and a concept-based IF is domain-dependent.

Text refining converts unstructured text documents into an intermediate form (IF). IF can be document-based or concept-based. Knowledge distillation from a document-based IF deduces patterns or knowledge across documents. A document-based IF can be projected onto a concept-based IF by extracting object information relevant to a domain. Knowledge distillation from a concept-based IF deduces patterns or knowledge across objects or concepts.
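
The two phases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document-based IF is taken to be a bag-of-words count per document, the concept-based IF is obtained through a small hypothetical domain lexicon (CONCEPT_LEXICON), and the "knowledge distillation" step is reduced to listing concept pairs that occur in the same document. None of these choices comes from the cited framework itself.

```python
import re
from collections import Counter

# Assumed stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "with"}

# Hypothetical domain lexicon: maps surface terms to domain concepts.
CONCEPT_LEXICON = {
    "gene": "GENE", "genes": "GENE",
    "protein": "PROTEIN", "proteins": "PROTEIN",
    "enzyme": "ENZYME", "enzymes": "ENZYME",
}


def refine_to_document_if(documents):
    """Text refining: map each document id to a bag-of-words term count."""
    document_if = {}
    for doc_id, text in documents.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        document_if[doc_id] = Counter(t for t in tokens if t not in STOP_WORDS)
    return document_if


def project_to_concept_if(document_if, lexicon):
    """Project the document-based IF onto concepts: concept -> {doc_id: count}."""
    concept_if = {}
    for doc_id, term_counts in document_if.items():
        for term, count in term_counts.items():
            concept = lexicon.get(term)
            if concept is None:
                continue  # term is not an object of interest in this domain
            concept_if.setdefault(concept, Counter())[doc_id] += count
    return concept_if


def cooccurring_concepts(concept_if):
    """Toy knowledge distillation: concept pairs that share at least one document."""
    pairs = {}
    concepts = sorted(concept_if)
    for i, first in enumerate(concepts):
        for second in concepts[i + 1:]:
            shared = set(concept_if[first]) & set(concept_if[second])
            if shared:
                pairs[(first, second)] = sorted(shared)
    return pairs


# Toy documents (invented for illustration only).
documents = {
    "d1": "The enzyme binds to the protein.",
    "d2": "Expression of the gene and the protein in yeast.",
    "d3": "Genes for heat tolerance.",
}
document_if = refine_to_document_if(documents)
concept_if = project_to_concept_if(document_if, CONCEPT_LEXICON)
pairs = cooccurring_concepts(concept_if)
print(concept_if)
print(pairs)
```

Note that refine_to_document_if needs no domain knowledge, whereas project_to_concept_if depends entirely on the lexicon, which mirrors the domain-independent versus domain-dependent distinction described above.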

Whereas data mining is largely language independent, text mining involves a significant language component. It is essential to develop text refining algorithms that process multilingual text documents and produce language-independent intermediate forms. While most text mining tools focus on processing English documents, mining from documents in other languages allows access to previously untapped information and offers a new host of opportunities.

Domain knowledge, which is not catered for by current text mining tools, could play an important role in text mining. Specifically, domain knowledge can be used as early as the text refining stage. It is interesting to explore how one can take advantage of domain information to improve parsing efficiency and derive a more compact intermediate form. Domain knowledge could also play a part in knowledge distillation. In a classification or predictive modeling task, domain knowledge helps to improve learning/mining efficiency as well as the quality of the learned model (or mined knowledge). It is also interesting to explore how a user's knowledge can be used to initialize a knowledge structure and make the discovered knowledge more interpretable. [1]

Related Work

M. Ikonomakis used automated text classification to manage and process vast amounts of documents in digital form. A central problem in text categorization is that the number of features can easily reach the tens of thousands, which raises major hurdles for applying many sophisticated learning algorithms. M. Ikonomakis applies SVM and k-nearest neighbor techniques to overcome this problem. [2]
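
To make the dimensionality issue concrete, the following sketch restricts the vocabulary to the terms with the highest document frequency and then classifies with k-nearest neighbor over cosine similarity. It is an assumption-laden illustration, not the method of [2]: the document-frequency cutoff, the raw term-count vectors, the function names, and the toy data are all chosen for the example, and a practical system would also remove stop words and typically weight terms (for instance with TF-IDF).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def select_features(documents, max_features):
    """Keep the max_features terms that occur in the most documents."""
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(tokenize(text)))
    # Descending document frequency, ties broken alphabetically for determinism.
    ranked = sorted(doc_freq, key=lambda term: (-doc_freq[term], term))
    return set(ranked[:max_features])


def vectorize(text, features):
    """Sparse term-count vector restricted to the selected features."""
    return Counter(t for t in tokenize(text) if t in features)


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(text, train_vectors, train_labels, features, k=3):
    """Majority vote among the k training documents most similar to text."""
    query = vectorize(text, features)
    scored = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy training data (invented for illustration only).
train_docs = [
    "football match goal team",
    "team goal striker match",
    "tennis match player serve",
    "bank market shares rates",
    "market rates interest bank",
    "shares market trading profit",
]
train_labels = ["sport", "sport", "sport", "finance", "finance", "finance"]

features = select_features(train_docs, max_features=6)
train_vectors = [vectorize(doc, features) for doc in train_docs]

sport_guess = knn_predict("late goal wins the match", train_vectors, train_labels, features)
finance_guess = knn_predict("bank shares fall", train_vectors, train_labels, features)
print(sorted(features), sport_guess, finance_guess)
```

Here the fifteen distinct training terms are cut to six before any similarity is computed; on a real corpus the same step would reduce tens of thousands of features to a size that the chosen learner can handle.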

In digital and data [3]

Debora Maria Rossi de Medeiros presents a few experiments that apply text mining and machine learning techniques to help associate meaning with gene clusters. The experiments were applied to paper abstracts and interaction database data related to Saccharomyces cerevisiae genes, both to identify the content of the texts and to explain the biological meaning of the gene clusters found. [4]

Jae-Hong Eom and Byoung-Tak Zhang introduce PubMiner, an intelligent machine-learning-based text mining system for mining biological information from the literature. It performs efficient interaction mining of biological entities such as genes, proteins, and enzymes. [5]

Pak Chung Wong, Wendy Cowley, and Harlan Foote present data mining and visualization techniques for discovering sequential patterns in large datasets. They conclude that the strengths of the two approaches can compensate for each other's weaknesses. They then introduce a visual data mining environment that contains a data mining engine to discover the patterns and their support values, and a visualization front end to show the distribution and locality of the patterns. Their results show that one can learn more, and more quickly, in such an integrated visual data mining environment. [6]
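
For readers unfamiliar with support values, the sketch below computes the support of a sequential pattern under the usual textbook definition: the fraction of sequences that contain the pattern's items in order, not necessarily adjacent. This definition and the toy event sequences are assumptions for illustration; they are not taken from [6].

```python
def contains_subsequence(sequence, pattern):
    """True if the pattern's items appear in the sequence in the same order."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the matched item,
    # so later pattern items must appear after earlier ones.
    return all(item in remaining for item in pattern)


def support(sequences, pattern):
    """Fraction of sequences that contain the pattern as a subsequence."""
    matches = sum(contains_subsequence(seq, pattern) for seq in sequences)
    return matches / len(sequences)


# Toy event sequences (invented for illustration only).
sequences = [
    ["login", "search", "view", "buy"],
    ["login", "view", "logout"],
    ["search", "view", "buy"],
    ["login", "search", "logout"],
]

search_then_buy = support(sequences, ["search", "buy"])
buy_then_search = support(sequences, ["buy", "search"])
print(search_then_buy, buy_then_search)
```

Because order matters, the pattern "search, buy" and its reverse receive different support values on the same data.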

Hiroki Arimura and Junichiro Abe investigate a new access method, based on text mining, for large text databases on the internet. First, they formalize the text mining problem as an optimized pattern discovery problem using a statistical measure. They then give fast and robust pattern discovery algorithms that are applicable to large collections of unstructured text data in real time. Their computer experiments on interactive document browsing and keyword discovery from the Web show the efficiency of the method on real text databases. [7]

Project Plan / Schedule

Steps                    Duration

Research Survey          3 Weeks
Technique Development    1 Month
Experimentation          2 Weeks
Design                   3 Weeks
Result Generation        2 Weeks
Result Analysis          2 Weeks
Thesis Writing           1 Month

Resources Required

Library

Books

MATLAB

Visual Studio 2010

SQL Server 2008

SmartDraw / CorelDRAW

Internet

Journals Access

Eclipse
