Arabic Information Filtering System English Language Essay


Recently, automated text categorization has become a hot topic in the research community because of its useful applications. Many tools have therefore been designed to categorize the short texts that are common in social media. However, these tools can be used for English text only, which highlights the need for more effort in Arabic text categorization research. Research in this field is limited, and much of it focuses on long Arabic text. This paper therefore attempts to identify the problems of designing a text categorization tool, since identifying these problems is a step toward designing efficient tools. Moreover, drawing attention to these challenges can encourage researchers to conduct studies that address them.


1 Introduction

Over the last decade, the number of Internet users in the Middle East has increased dramatically. This increase has greatly enriched the amount of Arabic content on the web. The categorization of this enormous content therefore deserves more attention, especially the content found in social media. Text categorization is a task that aims to classify text into one or several predefined labels, also known as categories.

The recent interest in research on automatic text categorization results from its essential applications and its usefulness. Automatic text categorization is used mainly for sorting files into folder hierarchies, topic identification, text filtering, and document organization for databases and web pages. Furthermore, these tools are useful because performing categorization manually is time-consuming and labor-intensive.

Even though there is increasing interest in developing effective techniques for text categorization, most of these techniques are developed only for English content. There is therefore an obvious shortage of research on Arabic text categorization. In addition, most of the available research focuses only on long text categorization, whereas social media content is usually no longer than 140 characters. Our research will focus on one of the most popular social media services, Twitter.

Twitter is an online social networking and micro-blogging service that enables its users to send and read text-based messages of up to 140 characters, known as "tweets" [1]. This lets users present information in only a few words, optionally followed by a link to a more detailed source of information. Each Twitter user usually follows a group of people, chosen according to his or her interests, in order to receive all the messages they post. However, the user may want to read only the tweets related to a specific topic. Twitter therefore provides the ability to manually create lists, each containing a group of users who are interested in a specific topic. However, the lists still contain noise in the form of irrelevant messages. For this reason, Twitter lists need an information filtering mechanism that classifies tweets as relevant or irrelevant in order to remove unrelated tweets from the list feed. The goal of the filtering mechanism is to automatically classify incoming tweets into different categories so that users are not overwhelmed by the raw data. This is particularly useful when Twitter is accessed via handheld devices such as smartphones. This research aims to present Arabic text classification techniques in general and to suggest an information-filtering algorithm for Arabic tweets.

This paper is organized as follows: Section 2 gives background information about text categorization. Section 3 presents related work on text categorization for Arabic and other languages. Section 4 describes the algorithm used for the presented experiment. Finally, Section 5 presents the conclusion and the future work suggested by this paper.

2 Background

Irrelevant tweets are filtered by classifying each tweet as relevant or irrelevant to the topic of the list and removing the irrelevant ones. This section presents background information about the text classification process.

Text classification, also known as text categorization, is the process of classifying natural-language text into a number of predefined categories based on its content [2]. It has important real-world applications such as spam filtering, topic spotting, language identification and text genre detection [3].

Text classification has two main approaches. The first is knowledge engineering, which involves manually defining rules that encode expert knowledge on how to classify documents under a given category (Figure 1). This approach has some disadvantages: it requires a large amount of time, and it is difficult to redefine rules and create new ones as text formats multiply. Moreover, it needs highly skilled experts to create and maintain the knowledge rules and categories [3].

Figure 1: Knowledge engineering for TC [4]

The second approach is machine learning, which gives better classification accuracy than classifiers built by knowledge engineering. It builds a classifier by learning the characteristics of the categories from a set of pre-classified training documents.

There are two common techniques for machine learning: supervised and unsupervised learning. A supervised classifier uses an external source of information to carry out classification and has to be trained on a set of predefined classifications. In unsupervised learning (also known as clustering), the system creates new categories and groups the documents under them [5].

Classifying a text using the machine-learning approach follows a set of steps: preprocessing the text, applying dimensionality reduction techniques and, finally, classifying by applying one of the classification or clustering algorithms.


2.1 Preprocessing

Preprocessing is an important step of the learning process and has a significant effect on the accuracy of the classifier. It consists of different methods; the most commonly used are stop word removal and stemming.

Stop Word Removal

Stop word removal is the process of eliminating words that do not contribute meaning to a text. These words are called stop words. Categories of stop words cover adverbs, conditional pronouns, prepositions, pronouns, transformers (verbs, letters), referral names and affixes (prefixes, infixes and postfixes). A few examples of stop words are: (ابو، منها، قبل، حتى، ان، اما، دون) [6].
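
As a minimal sketch of this step, the snippet below filters tokens against a stop-word set. The set contains only the seven example words quoted above, whitespace tokenization is an assumption, and the function name `remove_stop_words` is hypothetical; a real system would load a much larger list.

```python
# Illustrative stop-word set: only the seven examples quoted in the text.
STOP_WORDS = {"ابو", "منها", "قبل", "حتى", "ان", "اما", "دون"}


def remove_stop_words(text):
    """Split `text` on whitespace and drop every token found in STOP_WORDS."""
    return [token for token in text.split() if token not in STOP_WORDS]


# Usage: the middle word is a stop word and is filtered out.
tokens = remove_stop_words("وصل قبل الموعد")
```

Using a set keeps each membership check constant-time, which matters when the stop-word list grows to hundreds of entries.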


Stemming

Stemming is the process of truncating a word to a simpler form. There are two major approaches to Arabic stemming: light stemming, where a word's prefixes and postfixes are removed (e.g. لاستعمالاتهم: استعمل), and root-based stemming, which reduces a word to its root (e.g. لاستعمالاتهم: عمل). Before stemming a word, it is helpful to normalize it by replacing (أ، آ، إ) with ا, replacing ى with ي and replacing ة with ه [6].
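
The normalization rules listed above can be sketched with a character translation table. This covers only the normalization step, not light or root-based stemming itself; the function name `normalize` is hypothetical, and the letters are written as Unicode escapes so that each key is guaranteed to be a single code point.

```python
# Translation table for the three normalization rules described in the text.
NORMALIZATION_TABLE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0622": "\u0627",  # alef with madda above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064a",  # alef maksura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})


def normalize(word):
    """Return `word` with the alef, yeh and teh-marbuta variants unified."""
    return word.translate(NORMALIZATION_TABLE)


# Usage: the initial hamza-carrying alef becomes a bare alef.
normalized = normalize("\u0623\u062d\u0645\u062f")
```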

2.2 Document representation

The original text must be converted to a suitable representation before it is processed by one of the learning algorithms. The text is represented as a series of feature-value pairs. The features can be arbitrarily abstract (as long as they are easily computable) or very simple. There are two important text representation modes in text mining: bag of words and N-gram-based representation.

Bag of words:

Bag of words (BoW) is one of the basic methods of representing a document, in which the occurrence of each word is used as a feature for training a classifier [3].
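
A bag-of-words vector can be sketched as a mapping from each word to its count. The example assumes the document has already been tokenized; the function name `bag_of_words` is hypothetical.

```python
from collections import Counter


def bag_of_words(tokens):
    """Map each distinct token to the number of times it occurs."""
    return Counter(tokens)


# Usage: word order is discarded, only the counts remain.
bow = bag_of_words(["news", "sport", "news"])
```

Words that never occur simply have a count of zero, so documents of different lengths can be compared over a shared vocabulary.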


N-grams:

An N-gram is a sequence of symbols (bytes, characters or words) of length N. For example, if N=3 and the word is "المطارات", the character N-grams are الم، لمط، مطا، طار، ارا، رات. This method represents a document by using the occurrence of each N-gram as a feature for training a classifier [6].
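
Character N-grams can be extracted with a sliding window, as in the sketch below. The function name `char_ngrams` is hypothetical, and overlapping windows with a step of one character are assumed.

```python
def char_ngrams(word, n=3):
    """Return every run of `n` consecutive characters in `word`, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Usage: trigrams of the example word from the text.
trigrams = char_ngrams("المطارات", n=3)
```

A word shorter than N yields an empty list, so very short tokens contribute no features under this scheme.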

2.3 Dimensionality reduction techniques

These techniques reduce the size of document vectors and increase the speed of the learning and categorization phases for many classifiers.

Document frequency

Document Frequency (DF) measures are used in text classification. These measures compute the frequency of terms within a document group and rank them to find a set of terms that can represent the category topics, in such a way that most documents belonging to the same category should use these terms [7].
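
The sketch below counts, for each term, the number of documents that contain it and keeps the highest-ranked terms. Keeping a fixed number `k` of terms is an assumption made for illustration (a minimum-frequency cutoff is another common choice), and both function names are hypothetical.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # set() so repeats inside one document count once
    return df


def top_terms(documents, k):
    """Return the k terms that appear in the largest number of documents."""
    return [term for term, _ in document_frequency(documents).most_common(k)]


# Usage on three tiny tokenized documents.
docs = [["goal", "match", "goal"], ["goal", "team"], ["match", "goal"]]
df = document_frequency(docs)
```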

TFIDF Method

The Term Frequency-Inverse Document Frequency (TFIDF) measure is used in text classification. It indicates how important a term is to a document within a document set. It is computed by multiplying the term frequency (TF), which measures how often the word occurs within a document, by the inverse document frequency (IDF), which becomes smaller the more documents in the set contain the word [7].
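
Several TFIDF variants exist. The sketch below assumes one common form, with TF as the relative frequency of a term in a document and IDF as the natural logarithm of the number of documents divided by the number of documents containing the term. The exact weighting used in [7] may differ, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (documents are token lists)."""
    n_docs = len(documents)
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # document frequency of each term
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Usage: "goal" occurs in every document, so its weight drops to zero.
weights = tf_idf([["goal", "match"], ["goal", "vote"]])
```

Under this variant a term that appears in every document gets weight zero, which reflects that it cannot help to tell the documents apart.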

2.4 Classification Methods

As mentioned above, a text classification system can be supervised or unsupervised. Some of the best-known supervised learning techniques are:

Naive Bayes classifiers

Naive Bayes is a probabilistic classifier that predicts the category of a given document by using the joint probabilities of words and categories. It is the most frequently used method in text categorization because it is more efficient than other approaches. However, it performs poorly on data sets that are sparse and inconsistent, which are characteristics of our data (tweets) [8].
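
To make the idea concrete, the sketch below implements a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. This is one common variant chosen for illustration and is not necessarily the formulation used in [8]; the function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Collect class and word counts from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(model, tokens):
    """Return the label with the highest log prior plus log likelihood."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing avoids zero probability for unseen pairs.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Usage with a toy training set.
model = train_naive_bayes([
    (["goal", "match"], "sport"),
    (["match", "team"], "sport"),
    (["vote", "election"], "politics"),
])
predicted = classify(model, ["match"])
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.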


K-Nearest Neighbor

The k-nearest neighbors (KNN) classifier predicts the category of a given document by calculating the distance between the vector representing the document and each of the vectors representing documents from the training set. The K nearest instances are then selected and the document is assigned the majority class (Figure 2) [8].

Figure 2: K-Nearest Neighbor using a majority-voting scheme [4]
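
The majority-voting scheme can be sketched as follows. Euclidean distance over equal-length numeric vectors is an assumption (cosine distance is also common for text), and the function names and toy vectors are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training, query, k=3):
    """Assign `query` the majority label among its k nearest training vectors."""
    neighbours = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Usage: three training vectors near the origin, two far away.
training = [
    ([0, 0], "relevant"), ([0, 1], "relevant"), ([1, 0], "relevant"),
    ([5, 5], "irrelevant"), ([5, 6], "irrelevant"),
]
label = knn_classify(training, [0.2, 0.1], k=3)
```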

2.5 Evaluation measures of Text Classifier

Some popular measures used to evaluate the performance of text classifiers are recall, precision and their combination, called the F1 measure. These measures are computed as follows [7]:

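$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Here TP is the number of documents correctly assigned to a category, FP is the number of documents incorrectly assigned to it, and FN is the number of documents that belong to the category but were not assigned to it.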
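
The three measures can be computed from counts of true positives, false positives and false negatives for one category. The sketch below assumes the standard definitions, treats one label as the positive class, and returns 0.0 when a denominator is zero; the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Compute precision, recall and F1 for the class `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Usage: one correct hit, one false alarm and one miss for the "rel" class.
scores = precision_recall_f1(
    ["rel", "rel", "irr", "irr"],
    ["rel", "irr", "rel", "irr"],
    positive="rel",
)
```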

3 Literature review

A text classification system consists of several steps: preprocessing, dimensionality reduction and classification. The following sections review related work on each of these steps.

3.1 Preprocessing

[3] [9] proposed a new stemming mechanism for Arabic words called the ETS2 stemmer. It is based on the concept of the local stem, which is the shortest form of a word among syntactically related words in a document. For example, in a document containing the words "loves" and "loving", the two words are syntactically related and "loves" is the local stem. The ETS2 stemmer and other leading stemmers (root-based stemming and light stemming) were tested to observe their effect on the performance of Arabic classification. The author found that the ETS2 stemmer outperformed the other leading stemmers in improving classification accuracy.

[14] [10] proposed a method that aims to enhance the classification performance of the KNN classifier for Farsi text classification. The work also studies how using character N-grams to convert documents to numerical vectors affects classification performance. The authors improved the KNN text classifier by inserting a factor into the KNN formula to account for unbalanced training datasets and by using N-grams longer than 3 characters in text preprocessing. The proposed methods were tested on the Hamshahri1 Farsi corpus and on articles from some archived newspapers using performance measures such as precision, recall and F-measure. The experimental results show that the KNN classifier outperforms SVM, with 92%, 91% and 91% micro precision, micro recall and micro F-measure respectively, whereas SVM achieved 88%, 90% and 89%.

[15] [11] improved automatic key phrase extraction by using additional semantic features and pre-processing steps. These features include the use of signal words and Freebase categories. Two pre-processing methods are used. The first is light filtering, which assigns a relevance measure to each sentence of the article using centrality-as-relevance methods that calculate pair-wise distances between sentences and find a centroid for the article. The second is co-reference normalization, which normalizes multiple forms of the same named entity into a single form (e.g., "Jackson" or "Michael" is normalized to "Michael Jackson"). The experimental results show that the system using the proposed pre-processing and features achieved the best results on both measures, with 78.99% precision and 55.4% nDCG, compared with 68.93% and 49.4% without them.

3.2 Dimensionality reduction techniques

[6][7] evaluated and compared five different reduction techniques: Document Frequency (DF), TFIDF, Latent Semantic Indexing (LSI), light stemming and root-based stemming. The dataset used in the experiment is a set of prophetic traditions, or "Hadiths". It includes 453 documents distributed over 14 categories. The performance of the categorization system was evaluated by the macro-averaged F-measure. The experimental results showed that the performance of back-propagation learning in neural networks (BPNN) was improved by using reduction techniques. The F-measure of BPNN using the DF, TFIDF or LSI methods is better than the F-measure of BPNN using stemming, light stemming or all features (without reduction). The DF, TFIDF and LSI methods are therefore favorable in terms of classification accuracy when compared with the two other methods.

3.3 Classification Methods

[4] [12] A modified version of the Artificial Neural Network (ANN) method for classifying Arabic texts is proposed by [8]. The authors used Singular Value Decomposition (SVD) for data representation, which provides a new representation space for the observations. A collection of Hadeeth of the Prophet Mohammad (peace be upon him) was gathered from the "Nine Hadeeth Book". The data consists of 453 documents associated with fourteen categories. The proposed method (ANN with SVD) was compared with the original ANN on this Arabic corpus using the F1 evaluation measure. The results revealed that ANN with SVD outperformed the classic ANN as dimensionality increased.

[5] [13] evaluated the performance of four different classification approaches on the CCA text collection: decision tree (C4.5), rule induction (RIPPER), simple rule (OneRule) and hybrid (PART). The CCA, taken from the online sources of Leeds University, consists of 415 documents. The authors processed the data before training by removing punctuation marks, numerical data, non-Arabic text, function words and stop words. Moreover, the documents in the CCA data set were stemmed using the Khoja stemmer, which reduces every Arabic word to its root. The experimental results showed that the C4.5 algorithm achieved higher average precision than the OneRule, RIPPER and PART algorithms, by 77.6%, 1% and 1.2% respectively. Additionally, it achieved higher average recall than OneRule, RIPPER and PART, by 59.7%, 1.6% and 1.1% respectively.

[9] [14] evaluated the performance of two common clustering techniques, SVD and K-means, on 611 tweets on different topics gathered from Twitter using TweetMotif. The authors processed the data before training by removing stop words and stemming the words to their base form using the Porter stemmer. The experimental tests show that a graph-based approach using affinity propagation performs best in clustering short text data, with minimal cluster error.

[13] [15] proposed a multi-classifier system to improve the accuracy of text categorization, in which a separate model is built for each of the predefined categories. The model for each category acts as the main classifier for that category and as an auxiliary for all the other categories. The classifier system is composed of the following parts: (1) document scoring, which computes the relative frequencies of words with respect to each predefined category; (2) classification rule generation, which sets a threshold value for the document score such that documents with scores above the threshold belong to the category and the rest do not; (3) tie breaking, in which a formula is defined to resolve the cases where the initial classification assigns a document to more than one class; and (4) residual classification, which assigns a class to the remaining (unclassified) documents. The author tested the proposed system on the BBC Sports corpus. The results showed good classification accuracy, with macro-averaged and micro-averaged F-measures of 94.7% and 94.3% respectively.

[20] [16] evaluated the performance of two common data mining approaches, the Naïve Bayesian method (NB) and Support Vector Machine (SVM), on different Arabic data sets. The data set consists of 2244 Arabic documents of different lengths that belong to 5 categories. The experimental tests revealed that the SVM classifier achieved 6.9%, 6.5% and 7% higher recall, precision and F1 than NB respectively.

[2] [17] proposed a system to detect novel Arabic news in two steps: topic detection followed by novelty detection. Topic detection aims to detect the topic of a newly arriving news article and is based on manual extraction of Arabic keywords related to each category. The title words of the new article are extracted, and a similarity score is computed between the title words and the keywords of each category. The article is then assigned to the category whose keywords are most similar to the extracted title words, that is, the category with the highest similarity score. The second major step is novelty detection, which proceeds in several stages: the news article is preprocessed by removing stop words and finding the stem of each term, and the cosine similarity between the article's vector and the vectors of the other news articles in the dataset is then computed, with a threshold of 60%. The news article is considered new when its computed similarity score is low; otherwise it is considered redundant.
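
The novelty check can be illustrated with a cosine similarity over term-count vectors. This is a sketch under stated assumptions, not the cited system: raw counts stand in for whatever term weighting the authors used, an article is treated as novel only if its similarity to every earlier article is below the threshold, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the term-count vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def is_novel(article, previous_articles, threshold=0.6):
    """True if `article` stays below `threshold` against every earlier article."""
    return all(
        cosine_similarity(article, old) < threshold for old in previous_articles
    )


# Usage: a repeated article is redundant, an unrelated one is novel.
archive = [["election", "vote", "result"]]
redundant = not is_novel(["election", "vote", "result"], archive)
fresh = is_novel(["match", "goal"], archive)
```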

3.4 Short Text classification

Short text mining suffers from highly sparse representations. Most related work addresses this problem either by extending the short text with additional information from external sources such as Wikipedia and WordNet, so that it resembles a large document, or by using an additional feature set rather than words alone as features. The following paragraphs review related work that deals with the short text problem:

[7] [18] proposed a novel approach to clustering short text by enriching the original short text with an additional set of auxiliary data. Unlike other enrichment techniques, which ignore the semantic and topic inconsistencies between the short text and the auxiliary data, the authors proposed a novel topic model, Dual Latent Dirichlet Allocation (DLDA), which jointly learns a set of target topics on the short texts and another set of auxiliary topics on the long texts while carefully modeling the attribution to each type of topic when generating documents. The experimental results show that DLDA can outperform other methods, which indicates that clustering quality on short text can be improved by considering the difference between auxiliary data and target data.

[8] [19] proposed a framework to improve the performance of short text clustering by combining internal and external semantics, where the internal semantics are derived from the original text and the external concepts from world knowledge. The experimental results show that the proposed framework outperformed previously proposed knowledge-based short text clustering methods.

[10] [20] proposed a method for clustering text by enriching the document representation with Wikipedia concept and category information using two approaches. The first, called exact-match, is a dictionary-based approach that maps topical terms used in a document to the Wikipedia concepts that exactly match them and denote the same topic. The second mapping approach is called relatedness-match. Instead of mapping Wikipedia concepts to each document directly, this approach builds the connection between Wikipedia concepts and each document based on the contents of Wikipedia articles. This approach is more useful when Wikipedia concepts cannot fully cover the topical domain of a collection. The proposed method was tested with two clustering approaches (agglomerative and partitional clustering) on three datasets: 20NG, LATimes and TDT2. The experimental results show that clustering performance improves significantly when the document representation is enriched with Wikipedia concepts and categories.

[16] [21] presented a method that uses hyperlink text in Wikipedia to improve the quality of document topic identification. The authors proposed an improvement of Schonhofen's method [22], which computes the relatedness between a document and a Wikipedia category based on the weights of words that occur in both the document and the articles belonging to that category. The improvement exploits the hyperlink texts in articles and combines them with the titles of articles and the categories of Wikipedia. The method was evaluated on 2000 Wikipedia articles on Computing and Team Sport and achieved better results than Schonhofen's method.

[18] [23] proposed a novel approach that identifies a document's topic using Wikipedia. The authors used this approach to group researchers on academic social networks on the basis of their publication topics. Each researcher's publications are treated as a single document. A complete list of academic topic names, with their equivalent and semantically related topic names, was obtained from the Wikipedia category-concept hierarchy to generate a topic dictionary. Each researcher's publications were then scanned for occurrences of these topics in order to generate a document-topic mapping vector. Next, a soft clustering algorithm, Adaptive Rough Fuzzy Based Leader (ARFL), was applied to group related researchers. Concise topics were obtained for each cluster, and researchers in the same cluster were linked through their common topics.

[12] [24] presented different approaches for improving the task of clustering company tweets, using different methodologies for enriching the term representation of tweets (S-TEM, TEM-Wiki, TEM-Positive-Wiki, TEM-Full). These methodologies were evaluated with respect to F-measure on the WePS-3 corpus. The subset used in the experiments includes only 20 companies, and each selected company must have at least 90 associated tweets in the collection. The experimental results show that clustering company tweets is a difficult task because of the writing style of tweets: poor grammatical structure with many out-of-vocabulary words. This kind of data leads to low performance for most clustering methods.

[11] [25] proposed an approach to classify tweets into a predefined set of generic classes such as News, Events, Opinions, Deals and Private Messages, using eight features: one nominal feature (author) and seven binary features (presence of shortened words and slang, time-event phrases, opinionated words, emphasis on words, currency and percentage signs, "@username" at the beginning of the tweet, and "@username" within the tweet). In the classification step, the learning model is trained on these features. The experiments were conducted using a Naïve Bayes classifier, and a preprocessing step such as stop word removal was applied to a dataset of 5407 tweets from 684 authors. The experimental results showed that the authorship feature yields a significant improvement in accuracy compared with the traditional bag-of-words strategy and the other seven features. Moreover, the proposed approach using all eight features outperforms the traditional bag-of-words strategy.

[1] [26] built a filtering system for Twitter lists that classifies tweets as relevant or irrelevant in order to remove unrelated tweets from the list feed. The author uses 10 different features that help in building an accurate classifier, such as a text-based similarity score, which measures the similarity of a post's text to a feed's key topics, the authorship feature, and additional features extracted from social network information. The system also uses temporal features, which are extracted from the creation timestamps of posts, and link domain features, which are extracted from links included in the text of posts. The system was evaluated on a labeled dataset of lists and achieved accuracies between 85% and 95%.

4 Summary and Conclusion

Text classification has become a topic of interest in recent years because of its essential applications and its usefulness. One of the most useful applications is filtering, which is used for email spam and in social networks such as Twitter. In the case of Twitter, the goal of the filtering mechanism is to automatically classify incoming tweets into different categories so that users are not overwhelmed by the raw data. A tweet is a short text limited to 140 characters, and short text classification is considered a challenging topic. This research attempts to support this field by discussing some challenges of this process and suggesting solutions to deal with them. Moreover, it presents the classification process, covering preprocessing techniques, dimensionality reduction techniques and classification methods, and suggests the techniques that succeed with Arabic text.