Arabic Text Mining Using Associative Approach Computer Science Essay



A well-known classification learning problem in data mining is text categorisation, which involves assigning text documents in a test data collection to one or more pre-defined classes (categories) based on their content. Text categorisation has been an active research problem for four decades, and it has recently attracted many researchers because of the large number of documents available on the World Wide Web, in emails and in digital libraries. In this project, we investigate the performance of different rule-based classification approaches in data mining on the problem of text categorisation for Arabic text collections. Initially, we identified the following rule-based classification approaches: decision trees (C4.5), rule induction (RIPPER), associative (CBA, MCAR), greedy (PRISM), and hybrid (PART). In particular, we would like to conduct a comprehensive literature review and comparative experimental studies of these algorithms on a large, unprocessed Arabic text collection from the Saudi Press Agency (SPA). The comparison is based on evaluation measures from machine learning such as the one-error rate. We use open source business intelligence tools (WEKA, CBA) to perform the experiments. The primary research question is which of these classification approaches are appropriate for the Arabic text categorisation problem in data mining.

1. Background

There are several different operational definitions of text mining that have been proposed by many authors. [12] defined text mining as "the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents". It can be viewed as an extension of data mining or knowledge discovery from (structured) databases. Text mining is useful since it enables us to analyse and classify large amounts of textual data and to reveal the knowledge buried in it. Below are some points showing how important text mining is, and how it can help business [10].

It allows users to access documents by their topics.

It transforms huge volumes of data into detailed information, providing an overview of its contents.

It helps users to discover either hidden and meaningful similarities among documents or any related information.

It looks for new ideas or relations in topics.

Text mining methods have been widely used in many different areas such as homeland security, health care, law enforcement, and bioinformatics. Many text mining approaches from data mining and machine learning exist, such as decision trees [9] and neural networks [11]. Text mining tools have focused mainly on processing English documents, and researchers have paid little attention to applying the techniques to Arabic documents. Arabic belongs to the Semitic family of languages, in which words may be formed by modifying the root itself internally and not simply by concatenating affixes and roots, as occurs in inflecting languages (such as Latin) and agglutinating languages (such as Turkish and Japanese) [8]. The study of this kind of word formation is known as morphology. Arabic morphology has a great impact on word formation, and a word may appear in a text in different morphological variations. Using morphological analysis to support text mining in Arabic is therefore an important research problem. The underlying motivation for this research is to conduct an experimental study of different rule-based classification algorithms on Arabic text, in order to extract non-trivial information in the form of "If-Then" rules from an Arabic corpus.

In the past few years, the Arab world has witnessed a number of attempts to develop Arabic text mining systems, and the current study is one of these attempts. However, a number of problems have arisen, for example language issues such as morphology, and the processing of very large data sets for mining. Some of these problems have been solved, such as infixes and broken plurals, while others remain open problems in computational linguistics, such as two-letter verbs (nom, نم; kom, قم) [1]. We focus on Arabic text mining, and the reason for this lies in modern history. The countries of the Arabian Gulf and North Africa have developed enormously since the discovery of oil in the 1930s, and this has dramatically affected the lives of the millions of people living there in terms of lifestyle, commerce and security. The discovery of oil also stimulated the development and growth of other sectors in the Arab world, such as technology, education and trade. This development has produced massive Arabic data collections that contain useful information and knowledge for decision makers. Thus, there is a need for new studies that determine which intelligent techniques are suited to discovering useful information in the large Arabic data collections now available.

There are many classification approaches for extracting knowledge from data, such as divide-and-conquer (decision trees) [9], separate-and-conquer [2] (also known as rule induction), greedy [12], and associative [5] [6] [7]. The divide-and-conquer approach starts by selecting an attribute as the root node using a criterion such as the Gini index, and then makes a branch for each possible value of that attribute. This splits the training data into subsets, one for each value of the attribute. The same process is repeated on each branch until all the data that fall in a branch have the same class or the remaining data cannot be split any further.
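The attribute-selection step of the divide-and-conquer approach can be illustrated with a minimal Python sketch. It assumes the Gini index is used as the splitting criterion; the function names and the toy word-presence data are hypothetical and are not taken from C4.5, WEKA, or the SPA collection.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure subset)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def weighted_gini(rows, labels, attribute):
    """Weighted Gini impurity of the subsets produced by splitting on one attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        # One subset (branch) per distinct value of the attribute.
        subsets.setdefault(row[attribute], []).append(label)
    total = len(labels)
    return sum(len(s) / total * gini(s) for s in subsets.values())


def best_attribute(rows, labels, attributes):
    """Choose the attribute whose split leaves the lowest weighted impurity."""
    return min(attributes, key=lambda a: weighted_gini(rows, labels, a))


# Made-up toy data: each row records whether a document contains a word.
rows = [
    {"has_oil": "yes", "has_match": "no"},
    {"has_oil": "yes", "has_match": "no"},
    {"has_oil": "no", "has_match": "yes"},
    {"has_oil": "no", "has_match": "no"},
]
labels = ["economy", "economy", "sport", "sport"]

root = best_attribute(rows, labels, ["has_oil", "has_match"])
```

A full tree learner would call `best_attribute` recursively on each subset until a branch is pure or no attributes remain; only the root selection is shown here.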

The separate-and-conquer approach, on the other hand, builds up the rules one by one. After a rule is found, all instances covered by the rule are removed, and the same process is repeated until the best rule found has a large error rate. Statistical approaches compute the probabilities of classes in the training data set, using the frequency of the attribute values associated with them, in order to classify test instances. Greedy algorithms take each of the available classes in the training data in turn and look for a way of covering most of the training instances of that class in order to come up with high-accuracy rules. Lastly, associative classification (AC) is considered a special case of association rule mining in which only the class attribute is allowed in the rule's consequent (right-hand side, RHS); for example, in a rule such as X → Y, Y must be a class attribute in AC.
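To make the associative classification idea concrete, the sketch below enumerates rules whose right-hand side is restricted to a class label and keeps those that meet minimum support and confidence thresholds. It is a brute-force illustration under stated assumptions, not the CBA or MCAR algorithm: the function name, the thresholds, and the toy documents are all hypothetical.

```python
from itertools import combinations


def class_association_rules(transactions, min_support, min_confidence, max_len=2):
    """Return (itemset, class, support, confidence) tuples meeting both thresholds.

    transactions: list of (set_of_words, class_label) pairs.
    The rule consequent is always a class label, as required in AC.
    """
    n = len(transactions)
    items = sorted({item for itemset, _ in transactions for item in itemset})
    rules = []
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            body = set(candidate)
            # Class labels of the documents that contain every word in the rule body.
            covered = [label for itemset, label in transactions if body <= itemset]
            if not covered:
                continue
            for label in sorted(set(covered)):
                both = covered.count(label)
                support = both / n                # body and class together, over all documents
                confidence = both / len(covered)  # body and class together, over body matches
                if support >= min_support and confidence >= min_confidence:
                    rules.append((candidate, label, support, confidence))
    return rules


# Made-up toy documents, each reduced to a set of words plus one category.
docs = [
    ({"oil", "price"}, "economy"),
    ({"oil", "export"}, "economy"),
    ({"match", "goal"}, "sport"),
    ({"match", "price"}, "economy"),
]

rules = class_association_rules(docs, min_support=0.5, min_confidence=0.9)
```

On this toy data only two rules survive, `oil → economy` and `price → economy`, each with support 0.5 and confidence 1.0. Practical AC algorithms avoid the exhaustive enumeration used here by pruning candidate itemsets that fall below the support threshold.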

Numerous algorithms have been based on these approaches such as decision trees [9], PART [12], RIPPER [2], CBA [6], MCAR [10] and others.

The above classification approaches have been investigated mainly on classic English classification benchmarks, which are simple, medium-sized data sets. Furthermore, with regard to text mining, these approaches have been applied to English data collections. Thus, one primary goal of this project is to investigate these classification approaches on Arabic text mining in order to evaluate their effectiveness and suitability for such a problem.

2. Aims and Objectives

The ultimate goal of this research is to compare state-of-the-art rule-based classification algorithms on Arabic text documents, using the WEKA and CBA business intelligence tools. Text categorisation (TC) is one of the important problems in text mining and data mining. The problem is considered large and complex since the data are massive and have high dimensionality. Given a data set containing large quantities of online documents or journals, where each document is associated with its corresponding categories, categorisation involves building a model from the classified documents in order to classify previously unseen documents as accurately as possible. This project aims to investigate how well different rule-based classification algorithms solve the TC problem for Arabic text collections. Another primary aim, besides the experiments and evaluation, is a comprehensive literature review of the state-of-the-art classification methods that are related to Arabic text mining. The research has the following objectives:

A comprehensive and critical study of state-of-the-art rule-based classification algorithms and Arabic text mining.

Design of a relational/object-relational database that will hold the documents and their categories for large text data collections.

A large experimental study comparing the performance of the different classification algorithms, with respect to one-error rate and number of rules generated, on the Arabic text collection called SPA.

An extensive analysis and comparison of the results derived by the selected classification algorithms.
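The one-error rate named in the objectives can be sketched as follows. This assumes the usual multi-label definition, in which a document counts as an error when the classifier's top-ranked category is not among its true categories; the function name and the sample predictions are hypothetical.

```python
def one_error(top_predictions, true_label_sets):
    """Fraction of documents whose top-ranked predicted category is not a true category."""
    if len(top_predictions) != len(true_label_sets):
        raise ValueError("inputs must have the same length")
    if not top_predictions:
        raise ValueError("at least one document is required")
    misses = sum(
        1 for predicted, truth in zip(top_predictions, true_label_sets)
        if predicted not in truth
    )
    return misses / len(top_predictions)


# Made-up example: four documents, two of which are misclassified.
predicted = ["economy", "sport", "politics", "sport"]
actual = [{"economy"}, {"sport", "culture"}, {"economy"}, {"politics"}]

rate = one_error(predicted, actual)
```

When every document has exactly one category, this measure reduces to the ordinary classification error rate.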

3. Approach

In a digital library, there are large numbers of journals which belong to several categories. Assigning a journal to one or more applicable categories by hand requires care and experience. A classifier system that assigns journals to the correct category or set of categories based on the words they contain could reduce time and error substantially. The proposed approach will be compared against traditional classification techniques such as rule induction [2], decision trees [9] and neural networks [11].

In this project, we will use the mixed research method [3] as the general research methodology. This type of research includes both quantitative and qualitative techniques. Since we are using data sets for experimentation, and we are also comparing different existing classification techniques with our associative classification technique according to a number of evaluation measures, the mixed research method is highly suitable for our project.

We can divide the project research method into five phases. Firstly, comprehensive literature reviews of Arabic text mining and rule-based classification algorithms in data mining are conducted. This is important since we would like to shed light on the problems and challenges associated with Arabic text mining as well as the associated classification algorithms. Secondly, the Arabic data set (SPA) will be processed and normalised in order to ease the process of mining. This phase involves (1) removing unnecessary keywords, numbers and symbols, eliminating stop words, stemming, etc., and (2) designing and implementing an object-relational database that is able to hold the processed information output by the operations described in step (1). We are going to build the database in an open source relational/object-relational database system.
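The text-processing operations of the second phase (removing numbers and symbols, eliminating stop words, and stemming) might look roughly like the Python sketch below. The stop-word list, the prefix list, and the sample sentence are small illustrative assumptions, not the resources used for SPA, and the prefix stripping is only a crude stand-in for a proper Arabic stemmer.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real study would use a fuller one.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}

# Assumption: a few common prefixes for a crude "light stemming" step.
PREFIXES = ("وال", "بال", "ال")

DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # short-vowel marks and tatweel
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")     # digits, symbols, Latin text


def light_stem(word):
    """Strip one known prefix, provided at least three letters remain."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            return word[len(prefix):]
    return word


def preprocess(text):
    """Normalise raw Arabic text into a list of stemmed content words."""
    text = DIACRITICS.sub("", text)    # drop diacritics and elongation marks
    text = NON_ARABIC.sub(" ", text)   # replace numbers and symbols with spaces
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [light_stem(t) for t in tokens]


# Made-up sentence ("The price of oil rose in 2008!").
tokens = preprocess("ارتفع سعر النفط في عام 2008!")
```

For the sample sentence, the number and punctuation are dropped, the stop word في is removed, and the definite article is stripped from النفط. The resulting token lists are what would then be stored in the database described in step (2).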

Once the Arabic corpus has been processed and loaded into the relational database, the third phase begins, which involves running large numbers of experiments with the selected classification algorithms using two open source business intelligence tools (WEKA, CBA). In this step, we are going to modify the source code of WEKA [13] and CBA [14] so that they can deal with Arabic text, since these tools are designed to deal with English text. The results consist of the hidden knowledge and relationships in the SPA data set. Lastly, a critical analysis of the generated results is conducted, focusing on the one-error rate and the number of rules produced by the algorithms.

4. Plan

(1) A comprehensive and critical study of the state of the art in associative classification and in English and Arabic text mining.

(2) Design a relational/object-relational database that will hold the documents and their categories for large text data collections.

(3) Design the associative classification model that will discover and extract the most obvious category to which a document belongs.

(4) Implement the model designed in step (3) using an object-oriented programming language.

(5) Perform an extensive experimental study on common text mining data collections such as Reuters and SPA, to compare the derived results with those of current traditional classification approaches.