A Review of Feature Selection Strategies

1. Introduction

Managing and filtering the large amounts of data that grow explosively every day, especially textual data, is a complex task, so the need to manage these data effectively and help users retrieve what they want is increasing. Typical textual data include text files, web documents, and natural language speech. In addition, the high dimensionality of data causes many problems, so a feature selection process is included in systems that deal with data mining and data management tasks. Feature selection (FS) means selecting the optimal or most relevant subset of features from the whole feature set [1]. It is a data reduction process that reduces a high-dimensional space to a small set of dimensions representing the best features of the data [2]. In the data mining discipline, feature selection affects the performance of data mining tasks (classification, association rule mining for classification, etc.); it is therefore not independent of the data mining process. FS is an important preprocessing step in building text classification systems: identifying the features (terms or words) that discriminate between different classes is a critical problem, and a good choice of features enhances classification accuracy and minimizes classification errors. Many research efforts have therefore addressed the feature selection problem, in addition to efforts on building and implementing efficient classification systems. Research on feature selection has a long history, and many methods and approaches have been applied to feature selection for text classification, such as wrapper methods [3][4][5] and filter and ranking methods [6][7][8].

On the other hand, association rule mining techniques have been used over the last decade to build classification models. Associative classification, which merges association rule mining with classification rule mining, is a classification method proposed in many studies such as [9][10][11][12][13].

Many methods can be used for association rule discovery, e.g., the Apriori association rule mining algorithm [14], the Partition algorithm [15], the Frequent Pattern (FP) Growth method [16], the FOIL algorithm [17], TID-list intersections [18], and others. Unfortunately, the Arabic language suffers from a lack of research on organizing and managing information, on the feature selection problem, and on association rule mining for classification, even though Arabic is the standard and mother language of twenty-two (Arab) countries and is used by roughly 300 million people [19]. Nevertheless, some studies have focused on feature selection [20][21][22][23] and text classifiers [24][25][26][27] for Arabic.

In this paper we review existing techniques of feature selection and association rule mining for text classification, together with related work on Arabic text. We begin by discussing the feature selection problem. Next, we define the general feature selection methods and review the selection approaches with regard to the research that applied them, and we then present current research that focuses on feature selection for Arabic text classifiers. After that, we move to the association rule mining algorithms that have been used for text classification: we discuss the rule mining techniques and present research that uses them. The last section concludes this review.

2. Feature Selection Process and Its Architecture

Feature selection is the process of selecting the most relevant and informative features (a subset of features) from the whole feature set. This best subset of features is the most important for accurate output prediction in machine learning. To select a subset f of features from the whole feature vector F, a discriminating approach is required such that the selected features are the most discriminative for the data. For the classification problem, suppose that F = {f1, f2, f3, …, fn} is the whole set of features that form the documents, which are labeled with their classes C. Given that (f1, c1), (f2, c2), …, (fn, cn) are the training documents and their classes, where fi ∈ F and ci ∈ C, the aim is to find the most important features f that construct the classifier and minimize its errors; these features are considered the most descriptive of the data.
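This search over subsets can be sketched as follows; the scoring function and the feature names below are hypothetical stand-ins for a real discriminative measure, and the exhaustive search is only feasible for small feature sets:

```python
from itertools import combinations

def select_features(F, score, k):
    """Exhaustive search over all size-k subsets of the feature set F,
    returning the one that maximizes the discriminative score.
    Illustration only: real systems use greedy/heuristic search for large F."""
    best_subset, best_score = None, float("-inf")
    for subset in combinations(F, k):
        s = score(subset)
        if s > best_score:
            best_subset, best_score = subset, s
    return list(best_subset)

# Toy score: count how many "informative" features the subset contains.
informative = {"goal", "match", "team"}
F = ["goal", "the", "match", "of", "team"]
print(select_features(F, lambda sub: sum(f in informative for f in sub), 3))
```

In practice the score would be a measure such as information gain or chi-square computed from the training documents, not a hand-made set membership test.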

According to [], the feature selection process has four steps: feature generation, feature evaluation, stopping criteria, and feature validation. The next figure presents this process.

[Figure: Original Data → Feature Generation → Feature Evaluation → Sub-set of Features → Feature Validation]

Figure 1: Feature Selection Process Steps.
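A minimal sketch of the loop behind Figure 1, assuming a generic subset-quality function; the random generation strategy, evaluation measure, and iteration budget are illustrative choices, not taken from the figure:

```python
import random

def feature_selection_loop(features, evaluate, max_iters=100):
    """Sketch of the generation -> evaluation -> stopping-criterion cycle.
    'evaluate' is any subset-quality measure (hypothetical)."""
    best_subset, best_score = None, float("-inf")
    for _ in range(max_iters):                  # stopping criterion: iteration budget
        k = random.randint(1, len(features))
        candidate = random.sample(features, k)  # feature generation
        s = evaluate(candidate)                 # feature evaluation
        if s > best_score:
            best_subset, best_score = candidate, s
    return best_subset                          # validated downstream by the classifier
```

The final validation step corresponds to training and testing a classifier on the returned subset.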

2.1 Feature Extraction and Feature Selection

Feature extraction and feature selection are two categories of dimensionality reduction. Feature extraction compacts the feature space and produces a new, transformed set of feature dimensions by using all feature dimensions and the measurement space, without losing information at the transformation stage []. Feature extraction can also be called feature construction, because the process reconstructs the data, reduces its dimensionality, and produces the most predictive features.

2.2 Feature Selection and Classification

3. Feature Selection Methods: General Methods

According to the literature, there are two main types of methods for solving the feature selection problem: wrapper methods and filter methods [ ], …..

In addition to wrapper and filter methods, hybrid methods that combine the two have been proposed in many studies……

Besides the above categories of FS methods, embedded methods [] are also considered a type of FS method. These methods ……..

Ranking methods are a branch of filter methods which ………

3.1 Review of Existing Feature Selection Strategies and Research

For text classification, many methods have been introduced to solve the feature selection problem as a preprocessing step for classifiers. The following are some of the new and popular feature selection methods used and proposed in many studies:

Mutual information

Information gain

Chi-square

Term strength

Odds ratio

Document frequency

Term frequency / inverse document frequency
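As an illustration of one metric from the list, the chi-square statistic for a term and a class can be computed from a 2×2 contingency table of document counts; the counts in the example below are invented:

```python
def chi_square(A, B, C, D):
    """Chi-square statistic for a term t and class c from a 2x2 table:
    A: docs of class c containing t,  B: docs of other classes containing t,
    C: docs of class c without t,     D: docs of other classes without t."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# A term that appears mostly inside class c scores high:
print(chi_square(A=8, B=0, C=2, D=10))
# A term distributed independently of the class scores zero:
print(chi_square(A=5, B=5, C=5, D=5))
```

Terms are then ranked by their chi-square score and the top-ranked ones are kept.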



Multi-class odds ratio (MOR) and the class discriminating measure (CDM) are two feature evaluation metrics for use with the Naïve Bayes classifier, presented in []; they were applied to multi-class text collections. These metrics were built as modifications of the odds ratio, which computes the probability of term occurrence with respect to the positive and negative classes. Reuters and Chinese text are the two corpora used for the experiments in this study, and the methods were compared with three variant odds-ratio methods. The experimental results show that CDM and MOR are the best-performing metrics for the Naïve Bayes classifier, and that CDM is computationally simpler than the other feature metrics.
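The exact formulas are not reproduced in this review; a common formulation of CDM compares the log-probabilities of a term in the positive and negative classes, which can be sketched as follows (the smoothing constant is our choice, not necessarily the cited paper's):

```python
import math

def cdm(p_pos, p_neg, eps=1e-6):
    """Class Discriminating Measure sketch: |log P(t|c+) - log P(t|c-)|,
    with eps smoothing to avoid log(0)."""
    return abs(math.log((p_pos + eps) / (p_neg + eps)))

# A term far more frequent in the positive class is highly discriminating:
print(cdm(0.4, 0.01) > cdm(0.2, 0.15))
```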

SAGA [] is a hybrid algorithm that combines the advantages of several existing algorithms for selecting a feature subset from a large set of features. SAGA is based on simulated annealing, a genetic algorithm, a generalized regression neural network, and a greedy search algorithm. It first employs simulated annealing for global search guidance in the solution space. Next, it uses the genetic algorithm to optimize the solution. Finally, a local search is performed on the k best solutions using the greedy algorithm, and the best neighbors (defined in terms of the Euclidean distance between a pair of feature subsets) are selected. The generalized regression neural network is used to assess candidate solutions. SAGA was applied to synthetic datasets, real-world benchmark datasets, and one new real-world dataset (a smoking dataset), and was compared to other algorithms; it showed the best performance over various time intervals.
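A highly simplified sketch of SAGA's two-stage idea, global exploration followed by greedy local refinement; the GA and GRNN components are omitted, and the evaluation function is a hypothetical stand-in:

```python
import random

def saga_sketch(features, evaluate, iters=200):
    """Simplified SAGA-style pipeline: random exploration (standing in for
    the SA/GA global search) then greedy local search over feature flips."""
    # Stage 1: random exploration of the subset space
    best = max(
        (frozenset(random.sample(features, random.randint(1, len(features))))
         for _ in range(iters)),
        key=evaluate,
    )
    # Stage 2: greedy local search over single-feature flips
    improved = True
    while improved:
        improved = False
        for f in features:
            neighbor = best ^ {f}  # add f if absent, drop it if present
            if neighbor and evaluate(neighbor) > evaluate(best):
                best, improved = neighbor, True
    return set(best)
```

A real implementation would replace stage 1 with annealing/crossover/mutation and score candidates with a trained regression network rather than a hand-written function.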

Graph-mining-based feature extraction, proposed by [], has been used as an approach to text classification. In this study, documents are represented as graphs. This representation captures term stems, term order, sentence structure, and other aspects of a text document. In addition, a weighted sub-graph mining method is proposed, and the most relevant sub-graphs are then extracted for classification. The authors used three classifiers to evaluate the graph-mining-based extraction method. The results show that the proposed approach outperforms existing text classification algorithms on some datasets.

A class-dependent feature weighting method is presented in [], combining class-dependent feature weighting with recursive feature elimination. In this research, documents are represented using the bag-of-words model, and a weight is assigned to each term based on the class label of the document. The method uses a naïve Bayes classifier and was applied to two text datasets: a newsgroup dataset and a protein sequence dataset. The results show that this method outperforms the methods it was compared against.

The research in [] presents a method called the Quantum-Inspired Clone Genetic Algorithm (QCGA). QCGA is used to search the space of all subsets of the whole feature set. When a new subset is generated, the clone process operates on the antibodies selected from the generated subset. These antibodies are cloned according to their affinities. After the clone step, a crossover operator is applied to the qubits, and the mutation operator is then used to change the antibody characteristics and prevent early convergence of the QCGA. The Reuters-21567 dataset was used for the experiments, and the results show that QCGA performs better than other common methods.

In addition to the research above, there are many other new and well-known feature selection methods used for classification, for example: feature selection by clustering using conditional mutual information [], feature selection using Fisher's linear discriminant [], feature selection using ant colony optimization [], iterative feature construction with the X2 statistic [], and many others.

The following table presents some of these methods. It lists several feature selection (FS) methods used for text classification (TC); we can see that English is considered more than other languages, with many FS methods applied to it. Table 1 below summarizes some of these methods.

Table 1: Summary of Some of Feature Selection Approaches.




| Study | Dataset(s) | FS Method(s) | Classifier(s) | Key Results |
|---|---|---|---|---|
| Yang et al. | Reuters-22173 and OHSUMED collection | Information Gain (IG), Mutual Information (MI), Document Frequency (DF), Term Strength (TS), X2 test | — | DF, IG, and X2 are strongly correlated (up to 90% term removal); TS is comparable with up to 50-60% term removal; MI has inferior performance compared to the other methods. |
| Jing et al. | CUM standard text data | TF-IDF based FS | Naive Bayes | TF-IDF results in classification accuracy up to 76%. |
| Yu et al. | UCI Machine Learning Repository | Fast Correlation-Based Filter (FCBF) | C4.5 | 89.13% average accuracy of C4.5; 7 features selected on average across all datasets; 995 ms average running time of FCBF. |
| Gabrilovich et al. | Collection of 100 datasets | Aggressive feature selection | SVM, KNN, C4.5 | Accuracy with 100% of features: SVM 76%, KNN 80%, C4.5 74%; accuracy at the optimal FS level: SVM 85%. |
| Lei Yu et al. | UCI Machine Learning Repository, UCI KDD Archive, NIPS benchmark data | Correlation-based method for relevance and redundancy | NBC and C4.5 | Achieves a high degree of dimensionality reduction and enhances or maintains predictive accuracy with the selected features. |
| Doan et al. | Reuters-21578 benchmark data | FS based on multiple criteria per feature | Naive Bayes (NB) | Macro-averaged F1 with this method is 72.83%; break-even point (BEP) is 69.26%. |
| Bakus et al. | Reuters-21578 and 20 Newsgroups | MIFS-C variant of mutual information | NB, Rocchio, KNN, C4.5, SVM | With 320 features, break-even is: NB 82.33%, Rocchio 81.26%, KNN 82.76%, C4.5 76.81%, SVM 84.06%; MIFS-C improves classification break-even and F-measure. |
| Shang et al. | Reuters-21578 and other datasets | Gini index theory based FS | Fuzzy KNN | The Gini index gives better performance and simpler computation than other FS methods. |
| Mondelle et al. | OHSUMED, 20 Newsgroups, and Reuters-21578 | Categorical Proportional Difference (CPD) | — | CPD outperformed other frequently studied FS methods. |
| Shoushan Li et al. | 20 Newsgroups; Movie and DVD datasets | Weighed Frequency and Odds (WFO) | — | WFO performs robustly across different domains and feature numbers. |
| Sotoca et al. | Twelve artificial and real databases from the 92AV3C and DAISEX'99 data sources and the UCI repository | Supervised FS by clustering using conditional mutual-information-based distances | SVM, KNN, C4.5 | Performs better than the other methods for most of the databases and classifiers used. |

Advantages and Disadvantages of Feature Selection Process

The motivations for using feature selection and dimensionality reduction in text classifiers are [][]:

Reducing the size of the data and saving computer resources, i.e., memory and time.

Improving classifier performance by reducing noise, stop words, and redundancy.

Improving the scalability of the text classifier.

Improving classifier accuracy and reducing classification errors.

The following table shows the advantages and disadvantages of the different feature selection approaches presented in many studies.

Summary of feature selection methods:

[Table comparing the advantages and disadvantages of wrapper, filter, and hybrid W/F approaches]

3.2 Arabic Related Work

3.2.1 The Arabic Language Structure

Arabic is a Semitic language; its structure and possibilities make it very rich and difficult. Arabic is the mother language in 22 (Arab) countries and a widely spoken language in the world; it is the native language of almost 300 million people, and it is used by roughly 1.2 billion Muslims in religious ceremonies (prayers, the holy book, etc.). Arabic is unlike any other language and is well known for its rich vocabulary [25محمد]. These features make interest in Arabic grow fast, both in presenting new techniques for computerizing the language and in evaluating the techniques available for other languages by applying them to Arabic.

Arabic Morphology

One property of Arabic is that it is a derivational language, so morphology plays a significant role when we deal with computerized Arabic systems (natural language processing systems). The term morpheme means the smallest element that has meaning and cannot be divided into smaller meaningful units. In any language, some morphemes can exist as words and as morphemes at the same time [survey]. For example, the word كتب (he wrote) in Arabic is both a morpheme and a word.

Arabic morphemes consist of three, four, or five consonants, though most consist of three. Affixes can be added before, after, or inside the morpheme to produce a new word; these are prefixes, suffixes, and infixes, respectively.
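Affix stripping of this kind is what Arabic light stemmers do during text preprocessing; the following is an illustrative sketch with a deliberately tiny affix list, not a complete stemmer:

```python
def light_stem(word):
    """Illustrative (not exhaustive) Arabic light stemmer: strip at most one
    common prefix and one common suffix, keeping a stem of >= 3 letters.
    Real stemmers use longer affix lists and additional rules."""
    prefixes = ("ال", "و", "ب", "ك", "ف")
    suffixes = ("ات", "ون", "ين", "ها", "ة")
    for p in prefixes:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in suffixes:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(light_stem("الكتاب"))  # strips the prefix "ال"
```

Infixes are deliberately ignored here; handling them requires pattern-based (root-template) analysis rather than simple string stripping.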

The Differences between Arabic and English Texts

Arabic and English are two different languages with different alphabets. The English alphabet has 26 characters, while the Arabic alphabet has 28. The writing direction also differs: Arabic text is written and read from right to left, while English text is written and read from left to right. Unlike English, Arabic has no capital letters. In addition, the shape of an Arabic character is context sensitive, depending on its location within a word, while English characters have the same shape in any position. For example, the letter ب (baa) in Arabic has the following shapes:

At the beginning of a word: بــ , as in باب (baab, door).

In the middle of a word: ــبــ , as in حبيب (habeeb, lover).

At the end of a word: ــب , as in قلب (qalb, heart).

When isolated, or at the end of a word after certain Arabic characters: ب , as in سحاب (sahab, cloud) and قلوب (quloob, hearts).

Moreover, the following points present some of the differences between Arabic and English:

The usual word order in Arabic is verb-subject-object (VSO), while in English the usual order is subject-verb-object (SVO).

Arabic is highly inflectional and derivational, which makes morphological analysis a very complex task, while English is concatenative [].

Most words in Arabic have different forms for male/female and singular/plural.

Most Arabic characters can be connected on both sides, right and left.

Some Arabic letters/sounds are not found in English such as: ح خ ص ض ط ظ ع غ ق

The English sounds P and V are not found in Arabic.

Arabic Word Categories

The Arabic word, like a word in any language, is a single, isolated lexeme that conveys a meaning on its own or in context; alternatively, it is a unit between two spaces. All characters of an Arabic word must be Arabic letters [الرساله, survey]. Arabic has only 3 word categories (3 parts of speech), unlike English, which has 8. The Arabic word classes are noun, verb, and article. The following table compares the parts of speech of English and Arabic [][]:

Table 2: Arabic and English Part of Speech Classes.






| Arabic Class | English Equivalents | Description |
|---|---|---|
| اسم (Noun) | Noun, Pronoun, Adjective, Adverb | The word in this class is not linked to time and indicates a meaning in itself. |
| فعل (Verb) | Verb | Also indicates a meaning in itself, but is linked to time (hence the concept of tenses). |
| حرف (Article) | Preposition, Conjunction, Article | Does not have a fully independent meaning of its own; it indicates the meaning of something else. |

3.2.2 Related Work of Feature Selection and Classification for Arabic Text Documents

Most systems and methods proposed for classification do not deal with Arabic. However, a few studies are concerned with Arabic text classification, such as [13], which proposes a categorization system based on a machine learning approach. Two classifiers are used: K-Nearest Neighbor and Rocchio. In addition, three stemming algorithms were evaluated to see which is best for Arabic text; the results show that a hybrid of a light stemmer and a statistical stemmer is the applicable method for Arabic. Several term selection methods are also used, and a hybrid term selection method combining a document frequency threshold with information gain is proposed; this method gave good results. After term selection, every document is represented as a vector of term weights. Normalized term frequency and inverse document frequency (TF-IDF) is the weighting method suggested by the authors. A collection of 1,132 documents gathered by the authors from Egyptian newspapers, covering six categories, is used for the experiments, and the results show that the Rocchio classifier (up to 98% accuracy) performs better than K-Nearest Neighbor.

A statistical approach was used in [14] to construct an Arabic classification system for news articles. The system is used for clustering and classification without any text preprocessing. The experiments were performed on the Arabic NEWSWIRE corpus from the LDC, which covers four categories (politics, economy, culture, and sports). For document clustering the system produces good results, but the precision of the classification process did not exceed 50%. The authors considered this result satisfactory because no morphological analysis of the data was performed before classification.

Another classification system for Arabic text is presented in [15], which compares six classification methods: inner product, cosine, Jaccard, Dice, naïve Bayesian, and Euclidean. The purpose of the study is to evaluate which of these methods is most suitable for classifying Arabic text. In the first phase, documents are represented as vectors using the vector space model based on term frequency and inverse document frequency. The document vectors are then used to evaluate four similarity measures related to the vector space model (inner product, cosine, Jaccard, and Dice). The results of this phase show that the cosine measure, with precision of 85%, is the best of the four for classification. The second phase compares the cosine-based classifier with the Euclidean and naïve Bayesian classifiers. The evaluation data are taken from the Saheeh Al-Bukhari book, which contains Prophet Mohammed's sayings. The results show that the naïve Bayesian method (with precision of 91%) is the best of all the methods used in the study.
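The vector space model with cosine similarity described above can be sketched as follows; raw term counts stand in for TF-IDF weights, and the toy documents are invented:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc_vec, class_vecs):
    """Assign the class whose vector is most similar to the document
    (vector-space classification, as in the study above)."""
    return max(class_vecs, key=lambda c: cosine(doc_vec, class_vecs[c]))

sports = Counter("goal match team goal".split())
economy = Counter("market price trade".split())
doc = Counter("goal team win".split())
print(classify(doc, {"sports": sports, "economy": economy}))
```

The inner product, Jaccard, and Dice measures from the study plug into the same `classify` skeleton in place of `cosine`.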

In addition, similarity measures are used in conjunction with N-gram frequency statistics in [12]. In this study, the Manhattan distance dissimilarity measure was compared to the Dice similarity measure. A collection of documents from Arabic newspapers was used for the experiments. The results show that classification using the N-gram method with the Dice measure is better than classification using N-grams with the Manhattan distance measure.
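The N-gram profile with the Dice measure can be sketched as follows; character trigrams are assumed, and the example strings are ours:

```python
def ngrams(text, n=3):
    """Character n-gram profile of a text (set of n-grams)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def dice(a, b):
    """Dice similarity between two n-gram sets: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

# Similar strings share many trigrams:
x, y = ngrams("classification"), ngrams("classifier")
print(round(dice(x, y), 2))
```

In an N-gram classifier, a document's profile is compared against each category's profile and the category with the highest Dice score (or lowest Manhattan distance over n-gram frequencies) is chosen.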