Owing to the rapid advance of technology, and especially of the Internet as a primary source of information in recent years, the world is witnessing a huge accumulation of valuable information that grows every day. Although this information is valuable, and most of it is text, its sheer volume makes it a challenge for people to identify the information or knowledge most relevant to them. Text categorization (TC) plays a crucial role in helping information users overcome this challenge. As knowledge has advanced and information has accumulated, new fields have emerged to investigate new phenomena, and TC is the field concerned with organizing documents into categories. Since information and knowledge are stored and divided into categories of documents or texts, TC helps users navigate to the information they wish to obtain.
TC, as defined by Manning and Schütze (1999) and Sebastiani (2002), is the task of automatically assigning documents to categories from a pre-defined set of categories. It is also referred to as document classification or topic spotting. It has many applications, such as document indexing (Biebricher et al. 1988), document organization (Larkey 1999) and hierarchical categorization of Web pages (Ozgure 2003). The task is usually solved by combining information retrieval (IR) technology and machine learning (ML) technology, which work together to assign keywords to documents and classify them into specific categories (Sebastiani 2002). ML makes it possible to categorize the documents automatically, and IR makes it possible to represent the text as a set of attributes.
Supervised learning, with which this research is concerned, is a very popular ML approach (Moens 2006). Given a large number of labeled examples (the training set), a TC algorithm learns classification patterns from them in order to build a TC model. The model can then be used to predict the category of new, unseen examples (the testing set).
Statistical algorithms, Bayesian classification, distance-based algorithms, k-nearest neighbors and decision tree-based methods are among the ML algorithms that have been applied to TC (Dunham 2003). Most of the algorithms applied in previous TC studies were designed and tested for documents in English. Some TC approaches have, however, been applied to other European languages such as German, Italian and Spanish (Ciravegna et al. 2000), and others to Chinese and Japanese (F. Peng et al. 2003; J. He et al. 2003). For Arabic, which is the core concern of the current study, little work has been carried out; one example of an Arabic TC system is the "Sakhr Categorizer".
Moreover, compared with TC in the other languages mentioned above, developing TC systems for documents written in Arabic is a challenging task because of the complex and rich nature of the language. Arabic is highly inflectional and morphologically rich, and this complex linguistic system poses serious obstacles to automatic processing and classification that must be overcome. Applying automatic TC techniques to Arabic is therefore time-consuming and laborious. It is further complicated by the fact that some automatic TC techniques are not as efficient for Arabic documents as for English ones, because the linguistic structures of the two languages, especially their morphology and syntax, are very different. These appear to be some of the main reasons (Samir 2005; El-Halees 2007) for the lack of research on Arabic TC compared with TC in other languages, especially English. In general, automatic TC involves two problems: the first is the extraction of feature terms that serve as effective keywords in the training phase, and the second is the actual classification of a document using these feature terms in the test phase.
In the current study, a classification technique that is new to Arabic TC, the Frequency Ratio Accumulation Method (FRAM), is investigated. It was proposed by (Suzuki and Hirasawa 2007) and is characterized by classifying documents without extracting feature terms in a separate feature selection stage. To assess the effectiveness of the proposed method, it is compared with the state-of-the-art Naïve Bayes method, one of the best-known techniques currently available. Naïve Bayes is applied with six feature selection methods, namely Mutual Information (MI), χ² Statistic (CHI), Information Gain (IG), GSS Coefficient (GSS), Odds Ratio (OR) and F1-measure (F1).
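To make the idea of scoring a feature term concrete, the following is a minimal sketch of the CHI score in its common two-by-two contingency form. The exact formulation used in this study is not given in this chapter, so the function name, its parameters and the handling of empty cells are assumptions made for illustration only.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square score of one term for one category (illustrative sketch).

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        # Assumption: a term or category with no documents gets a score of 0.
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator
```

A term that occurs independently of the category scores 0, and the score grows as the term becomes more strongly associated with (or against) the category. A feature selection step would keep only the highest-scoring terms.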
1.2 INTRODUCTION TO THE ARABIC LANGUAGE
The Arabic alphabet consists of the following 28 characters:
ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي
In addition, there is the Arabic hamza (ء), which is regarded as a letter by some Arabic linguists. The three letters (ا و ي) are considered vowel letters, and the remaining letters are consonants. Arabic is written from right to left, the opposite of English. Moreover, Arabic letters take different shapes when used in a word. The shape depends on the position of the letter in the word (beginning, middle or end) and on whether the letter can be connected to its neighboring letters. For example, the letter (ع) takes the shape (عـ) when it begins a word or follows a letter that does not connect to it, as in سرعة (speed); the shape (ـعـ) in the middle of a word, as in تعامل (to deal or transact); and the shape (ـع) at the end of a word, as in بديع (adorable). Finally, the same letter has the isolated shape (ع) when it is at the end of a word but disconnected from the letter to its right, as in زرع (plant). Diacritics in Arabic are signs placed above or below letters, either to double the letter in pronunciation or to act as short vowels comparable to English vowels. They include shadda, damma, fatha, kasra, sukun, double damma, double fatha and double kasra. The different letter shapes and diacritics make parsing Arabic text a complex task (for a more detailed introduction to the Arabic language, see Duwairi 2002).
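The effect of letter shapes and diacritics on text processing can be illustrated with a small normalization sketch. It assumes the text is Unicode, where the eight diacritics named above occupy the code points U+064B to U+0652 and positional letter shapes, when they are stored explicitly, are compatibility characters that NFKC normalization folds back to the base letter. The function name is hypothetical and this is not the pre-processing procedure of the study.

```python
import re
import unicodedata

# Fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")
# Tatweel is the elongation stroke used only to stretch a word visually.
TATWEEL = "\u0640"


def normalize_arabic(text: str) -> str:
    """Fold positional letter shapes and drop diacritics (illustrative sketch)."""
    # NFKC maps presentation forms such as U+FECB (initial ain) to U+0639.
    folded = unicodedata.normalize("NFKC", text)
    without_marks = DIACRITICS.sub("", folded)
    return without_marks.replace(TATWEEL, "")
```

With this step, a fully vowelled word and its unvowelled spelling reduce to the same string, so a classifier counts them as one term.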
1.2.1 Challenges of Arabic Language in TC Tasks
The difficulty of Arabic TC comes from several sources. Some of them, as listed by Khoja (2001) and Hmeidi, Hawashin & El-Qawasmeh (2008), are the following:
Arabic differs from Indo-European languages in its syntax, morphology and semantics (Khoja, 2001).
Compared with English, Arabic is sparser, meaning that Arabic words are repeated less often than English words in a text of the same length. Sparseness results in lower weights for Arabic terms (features) than for English ones. Because the weights of Arabic word features differ less from one another than those of English features, it becomes harder to discriminate between Arabic words, which may reduce the effectiveness of an Arabic text classifier (Yahya, 1989; Goweder & De Roeck, 2001).
In written Arabic, most letters take several written forms. Moreover, the punctuation attached to some letters may change the meaning of two otherwise identical words; that is, two words may be written with the same letters but differ in punctuation and in meaning. Examples are shown in Table 5.1, where each row contains the same Arabic word in two forms. A TC system may treat these forms as different words unless special care is taken in its pre-processing phase.
The diacritics (short vowels) of written Arabic, known in Arabic as "altashkiil", can be omitted.
The punctuation attached to a letter may change the meaning of the Arabic word (see Table 5.1).
Different forms for the first letter
Different forms for the last letter
Difference in the conjunction letter
Different punctuation on the last letter
Table 5.1: Identical Arabic words with different forms.
Compared with English roots, Arabic roots are more complex (Darwish, 2002). Multiple Arabic words may be derived from the same root and, conversely, depending on the context in which it is used, the same Arabic word may be derived from different roots.
Table 5.2 shows examples of Arabic words derived from the same root 'كتب' "ktb", and Table 5.3 shows the possible roots of the single word "ايمان" "AymAn".
He is writing
Table 5.2: Some Arabic words derived from the same root "ktb".
Two poor people
Table 5.3: Four possible roots for the word "AymAn".
There is a lack of publicly available Arabic TC corpora that can be used to train and test Arabic text classifiers. Without a common corpus, it is difficult to compare different Arabic TC approaches fairly.
Despite the challenges mentioned above, and others not mentioned, the Arabic natural language processing community has made some efforts to analyze Arabic texts in order to classify Arabic documents better. Compared with other languages, however, the analyses carried out in recent years remain limited. Most previous studies of Arabic TC have examined the "text classifier construction phase", and little work has investigated the "document pre-processing phase". To overcome some of the challenges listed above, careful pre-processing of the Arabic TC corpus is required before the text classifier is constructed. This justifies the Arabic-specific pre-processing steps applied in the current study, whose purpose is to transform the Arabic documents into a compact form suitable for training the text classifier.
1.3 RESEARCH PROBLEM
Compared with other languages, there is still a limited body of research on Arabic TC, owing to the complex and rich nature of the Arabic language. Most of this research uses supervised ML approaches (Moens 2004) such as Naïve Bayes, KNN, Support Vector Machine and Decision Tree. Most of these techniques have complex mathematical models, and they are difficult to implement successfully for Arabic TC because of the complex morphology and large vocabulary of the Arabic language. Moreover, most of them do not usually lead to accurate results for Arabic TC, and previous research has tended to treat feature selection and classification as independent problems in automatic TC, which leads to high cost and computational complexity.
As a result, the need arises to apply new techniques suited to the Arabic language and its complex morphology. One of the simplest methods, proposed by (Suzuki and Hirasawa 2007), is the Frequency Ratio Accumulation Method (FRAM). It has a simple mathematical model and has been shown to achieve acceptable results in automated TC. Moreover, the method combines two of the main components of a TC system, feature selection and categorization, into one process, which reduces the computational operations of an Arabic TC system.
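The following is a minimal sketch of how a frequency-ratio classifier of this kind might work. This chapter does not give the FRAM formulas, so the sketch rests on an assumption: the ratio of a term for a category is its relative frequency within that category, normalized so that the ratios of the term sum to 1 across categories, and a document is assigned to the category with the largest accumulated ratio. The function names and the toy data in the usage note are hypothetical.

```python
from collections import Counter, defaultdict


def train_fram(labeled_docs):
    """Return {term: {category: ratio}} from (tokens, category) pairs."""
    counts = defaultdict(Counter)  # category -> term counts
    for tokens, category in labeled_docs:
        counts[category].update(tokens)

    ratios = defaultdict(dict)
    for category, term_counts in counts.items():
        total = sum(term_counts.values())
        for term, n in term_counts.items():
            # Relative frequency of the term inside this category.
            ratios[term][category] = n / total

    # Normalize each term so that its ratios sum to 1 over the categories.
    for by_category in ratios.values():
        norm = sum(by_category.values())
        for category in by_category:
            by_category[category] /= norm
    return dict(ratios)


def classify_fram(tokens, ratios):
    """Accumulate the ratios of the document's terms and pick the best category."""
    scores = Counter()
    for term in tokens:
        for category, ratio in ratios.get(term, {}).items():
            scores[category] += ratio
    # Assumption: a document with no known terms is left unclassified.
    return max(scores, key=scores.get) if scores else None
```

For example, after training on `[(["goal", "match", "team"], "sport"), (["bank", "market", "team"], "economy")]`, the document `["goal", "team"]` accumulates a larger score for "sport". Every term contributes to the score, so no separate feature selection pass is needed in this sketch.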
1.4 RESEARCH OBJECTIVES
The objectives of this research are:
To design a model for Arabic TC based on the Frequency Ratio Accumulation Method (FRAM).
To evaluate the proposed FRAM-based Arabic TC model by comparing it with the state-of-the-art Naïve Bayes (NB) method combined with six feature selection (FS) methods.
1.4 RESEARCH SCOPE AND LIMITATIONS
The scope and limitations of the current study are summarized in the following points:
The research focuses on supervised ML: it designs a model for automated Arabic TC based on FRAM and applies it to Arabic documents collected from websites.
The study uses the Light Arabic Stemmer to remove the most frequent prefixes and suffixes from the Arabic words in the corpus.
Text is represented using bag-of-words (BOW) and character-level n-grams of length 3, 4 and 5.
The study implements the state-of-the-art Naïve Bayes method with six feature selection methods and evaluates the proposed method (FRAM) by comparing its results with theirs.
Finally, several performance evaluation measures are used to evaluate system performance.
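The stemming and text representation steps listed above can be sketched as follows. The affix lists are a small illustrative subset chosen for this example, not the lists of the stemmer used in the study, and extracting n-grams inside each word (rather than across the whole text) is likewise an assumption. All function names are hypothetical.

```python
from collections import Counter

# Illustrative subsets only; longer prefixes are tried before shorter ones.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def char_ngrams(word: str, n: int) -> list:
    """All character n-grams of a word; empty if the word is shorter than n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def bag_of_words(text: str) -> Counter:
    """Bag-of-words over light-stemmed, whitespace-separated tokens."""
    return Counter(light_stem(token) for token in text.split())


def bag_of_ngrams(text: str, n: int) -> Counter:
    """Bag of character n-grams taken inside each whitespace-separated token."""
    bag = Counter()
    for token in text.split():
        bag.update(char_ngrams(token, n))
    return bag
```

Under these assumptions, `light_stem("المكتبات")` drops the prefix "ال" and the suffix "ات", and either bag can then be passed to a classifier as the document's feature counts.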
1.5 MOTIVATION OF CARRYING OUT THE STUDY
The researcher's personal motivations for carrying out the study were to learn more about Arabic TC research, in particular the framework and methodology suitable for Arabic TC, and to provide a software resource to others who wish to use text categorization methods in their software projects. These motivations derive from the researcher's strong interest in ML methods used for various purposes and his enjoyment of working with natural languages. Moreover, the prospect of corpus-based work in Natural Language Processing, which combines ML and linguistics, and the reading of previous literature on the topic encouraged the researcher to undertake this work as a foundation for his future project.
Other motivation for carrying out this study comes from the importance of automated TC in everyday life. Its applications include automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories, selective dissemination of information to information consumers, automated population of hierarchical catalogues of Web resources, and spam filtering. At the start of this work, the researcher found that only a limited body of research had investigated common TC techniques for Arabic documents compared with other languages. This was a further source of motivation for the researcher to contribute additional ideas and information by obtaining more significant and accurate findings from this study.
1.6 SIGNIFICANCE OF THE STUDY
The significance of the current study has several aspects. The first derives from the urgent need to classify the huge number of electronic Arabic documents made available by the rapid growth of the Internet. Manual classification requires a person to read each text, in whole or in part, in order to assign the correct category to it, which is costly and demands considerable time and effort. The significance of the study therefore emerges from applying automated TC to meet this need and to help readers keep up with the growing volume of information in Arabic. Finally, the expected accuracy of automated TC, as reflected in the findings, can be significant, since the accuracy of such systems can rival that of manual TC.
This thesis is organized into five chapters. Chapter I is an introductory chapter that presents the background, problem statement, research objectives and motivation, as well as the scope and significance of the research. Chapter II reviews the literature, including the components of the TC process, related work and a discussion of Arabic TC. Chapter III describes the methodology used in this study. Chapter IV presents the results and compares those obtained with the proposed method against those of the state-of-the-art methods. Finally, Chapter V presents the conclusions of the thesis and suggestions for future work.