Text mining has no canonical definition. The most commonly used definition of text mining is "the non-trivial extraction of previously unknown, interesting facts from a collection of texts" [24, 26, 27], which has been derived from the definition of data mining.
Intelligent text analysis and knowledge discovery in text (KDT) are alternative names for text mining. Information retrieval, information extraction, data mining, machine learning, statistics, computational linguistics, pattern recognition, artificial intelligence, etc. are sub-processes or related research areas of text data mining [28, 29, 30, 31], whereas text mining is an extension or subfield of data mining with regard to the application of standard data mining techniques to textual data. According to [32, 33], "Text mining attempts to discover new, previously unknown information by applying techniques from information retrieval, natural language processing and data mining."
Other authors describe text (or text data) mining as the process of finding useful or interesting patterns, models, directions, trends, or rules in unstructured text with the help of data mining techniques. Text mining has also been viewed as a natural extension of data mining and as a form of corpus-based computational linguistics [36, 37].
Overview of Data Mining
Here the data mining process and techniques are briefly summarized along with their benefits and disadvantages.
A review of the literature shows how different scholars describe data mining [38-46]; it is worth repeating what [47, 48] said in 1999:
Data mining is also known as knowledge discovery, pattern discovery, data dredging, knowledge mining and data archaeology. Data mining techniques include statistics, decision support, data management and warehousing, parallel processing, visualization, and machine learning. Data mining is the process of posing various queries and extracting useful information, patterns, and trends, often previously unknown, from large quantities of data, possibly stored in databases.
It is appropriate to commence with a quotation from Simon [cited in 1, 2, 3] that any study of machine learning should begin with a formal definition of what is meant by learning: "Learning denotes changes in the systems that are adaptive in the sense that they enable the system to do the same task (or tasks drawn from a population of similar tasks) more effectively the next time". Machine learning refers to a system capable of automatically integrating knowledge that it has learned from experience. The details of machine learning depend on the underlying knowledge representation (e.g., learning in a neural network differs from learning in a rule-based system).
Learning is the ability to improve one's behaviour based on experience, and it represents an important element of computational intelligence. Machine learning is the core concept of data mining; it blends artificial intelligence, statistics, and human-computer interaction, and is concerned with programs that learn from experience [6,7,8] and from "examples". Machine learning involves people from the fields of computer science and artificial intelligence. In machine learning, patterns [i] drawn from the database are presented to the learning machine, which makes predictions [10, 11].
Types of Machine Learning
Learning strategies can be classified according to the amount of inference the system must perform on its training data. [ii] Some authors describe three types of learning: rote learning, [iii] supervised learning and unsupervised learning, whereas most of the literature defines only the last two [7, 12, 13, 14, 15].
In a supervised learning approach, training examples consist of inputs and corresponding outputs. Rule sets, neural networks, and decision trees are examples of supervised learning models. The learner should be able to generalize from the presented data to unseen examples. Classification is referred to as supervised learning because the classes are determined before examining the data; there is a priori desired output feedback. The problem of supervised learning involves learning a function from examples of its inputs and outputs, and the presence of the outcome variable guides the learning process. In supervised learning, a teacher set provides a label for each pattern in a training set.
In unsupervised learning the algorithm is presented with input data and a set of training examples, but there is "no feedback in the form of desired output" [5, p. 135]. We only describe how the datasets are organized or clustered, without any measurements of the outcome. No external teacher set is used: a training set may be provided, but no labeling of the desired outcome is included. In an unsupervised approach, the system learns without knowledge of the correct answers.
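The contrast can be sketched in code. The following is a minimal illustration, not taken from the text: the data, the nearest-centroid rule, and the two-means grouping are all invented for the example. The supervised learner receives (input, label) pairs; the unsupervised learner sees only the inputs and must group them itself.

```python
# Minimal sketch contrasting supervised and unsupervised learning
# on made-up 1-D data. All names and numbers are illustrative only.

def nearest_centroid_train(examples):
    """Supervised: each training example is an (input, label) pair."""
    sums, counts = {}, {}
    for x, label in examples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def nearest_centroid_predict(centroids, x):
    """Generalize to an unseen input: pick the closest class centroid."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def two_means_cluster(points, iters=10):
    """Unsupervised: no labels; group the inputs (crude 2-means in 1-D)."""
    c0, c1 = min(points), max(points)  # crude initialisation
    for _ in range(iters):
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0 = sum(g0) / len(g0) if g0 else c0
        c1 = sum(g1) / len(g1) if g1 else c1
    return c0, c1

train = [(1.0, "low"), (1.2, "low"), (9.8, "high"), (10.1, "high")]
centroids = nearest_centroid_train(train)
prediction = nearest_centroid_predict(centroids, 9.0)   # -> "high"
clusters = two_means_cluster([1.0, 1.2, 9.8, 10.1])     # two group centres
```

The supervised model can answer "which class?" for a new input because labels guided its training; the unsupervised procedure can only report how the inputs group together.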
Classification (HMM, Bayesian and Trees)
The main focus in data mining is the process of building a model to represent a data set. This process is common to all data mining products; what is not common to them is the manner in which the model is built.
In classification, each tuple/sample is assumed to belong to a predefined group or class. Classification is referred to as supervised learning because the classes are determined before examining the data, and it is a two-step process. In the first step, the training set (a set of records whose class labels are known) is used to extract an accurate model that maps the training data into the users' predefined classes. In the second step, the classification model is used either to classify any new tuple or dataset, or to extract more accurate classification rules by applying it to the test set, which consists of records with unknown class labels [18, 19].
In general, classification algorithms require that the classes be defined based on attribute values; they often describe these classes by examining the characteristics of data already known to belong to them.
Hidden Markov Model - forward algorithm
"An HMM is a doubly stochastic process with an underlying stochastic process that is not observable(it is hidden), but can only be observed through another set of stochastic processes that produce the sequence of observed symbols" 
From a natural language processing perspective, information extraction has a long history, starting from manual annotation and moving to statistical methods including hidden Markov models (HMMs). An HMM is a probabilistic finite-state machine commonly used for speech and pattern recognition, and especially in bioinformatics for computational biology [50, 35]. In an HMM, the "transition probabilities between states and the probabilities of emissions of certain symbols, usually words, of the alphabet are modeled". The Markov chain (a set of states that produce symbols/words, and a set of transitions between states) is the fundamental part of an HMM and can be used along with a sliding window. In a Markov model diagram, states are shown as circles and transitions are represented by arrows, and each transition has an associated transition probability. An HMM has three components: priors (the initial state probabilities), a state transition model, and an evidence/observation model; the transition and observation models do not change. The specification of an HMM can be described as follows:
Q = q1, q2, …, qN - a set of N states
A = a11, a12, …, aij, …, aNN - a transition probability matrix, in which each aij represents the probability of moving from state i to state j, such that Σj aij = 1 for every state i
O = o1, o2, …, oT - a sequence of T observations (words/symbols) drawn from a vocabulary V = v1, v2, …, vV
B = bi(ot) - a sequence of emission probabilities (observation likelihoods), each expressing the probability of an observation ot being generated from a state i
q0, qF - a start state and an end/final state that are not associated with observations
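To make the specification concrete, one possible encoding is shown below. The two-state "hot"/"cold" example and all probability values are invented for illustration and are not from the text.

```python
# Toy encoding of the HMM specification above: states Q, transition
# matrix A, emission probabilities B, and an initial distribution pi.
# The two-state "hot"/"cold" model is illustrative only.

states = ["hot", "cold"]                 # Q, a set of N states
vocab = ["1", "2", "3"]                  # V, the observation vocabulary

pi = {"hot": 0.8, "cold": 0.2}           # initial state distribution
A = {                                    # transition matrix: A[i][j] = a_ij
    "hot":  {"hot": 0.7, "cold": 0.3},
    "cold": {"hot": 0.4, "cold": 0.6},
}
B = {                                    # emission probabilities: B[i][o] = b_i(o)
    "hot":  {"1": 0.2, "2": 0.4, "3": 0.4},
    "cold": {"1": 0.5, "2": 0.4, "3": 0.1},
}

# Each row of A and B must be a probability distribution (sum to 1).
for s in states:
    assert abs(sum(A[s].values()) - 1.0) < 1e-9
    assert abs(sum(B[s].values()) - 1.0) < 1e-9
```

This sketch folds the start state into the initial distribution `pi` rather than storing explicit q0/qF entries; either representation carries the same information.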
The Forward, Viterbi, and Baum-Welch algorithms can be used to solve the following three canonical problems associated with HMMs:
Evaluation/likelihood: given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O|λ)
Decoding: given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q
Learning: given training observation sequences O and the general structure of the HMM (the number of hidden and visible states), determine the HMM parameters (the triplet A, B, π) that best fit the training data
For the soundex, we will use the HMM forward algorithm instead of Bayes' algorithm. With Bayes' algorithm (the equation below) we can search all possible state sequences to compute P(O|λ), but this is of order O(N^T) for N states and T observations, which is computationally infeasible for large N and T. With the HMM forward algorithm we can reuse earlier calculations instead of repeating them.
P(O|λ) = Σ over all state sequences Q of P(O|Q, λ) P(Q|λ) -- Bayes' algorithm
function FORWARD(observations of len T, state graph of len N) returns forward probability
    create a probability matrix forward[N+2, T]
    for each state s from 1 to N do                          ; initialization step
        forward[s,1] ← a0,s * bs(o1)
    for each time step t from 2 to T do                      ; recursion step
        for each state s from 1 to N do
            forward[s,t] ← Σ over s' from 1 to N of forward[s',t-1] * as',s * bs(ot)
    forward[qF,T] ← Σ over s from 1 to N of forward[s,T] * as,qF    ; termination step
    return forward[qF,T]
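The pseudocode above can be made runnable. The sketch below folds the start state into an initial distribution `pi` and sums over all states at termination, rather than using explicit q0/qF rows; the toy parameters are invented for illustration.

```python
# Runnable sketch of the HMM forward algorithm: dynamic programming
# over partial-path probabilities, O(N^2 * T) instead of O(N^T).
# Parameters are a made-up two-state model.

def forward(obs, states, pi, A, B):
    """Return the likelihood P(O | model) of an observation sequence."""
    # initialization: probability of starting in s and emitting obs[0]
    f = {s: pi[s] * B[s][obs[0]] for s in states}
    # recursion: extend every partial path by one time step, summing
    # over all predecessor states s' (this is the reuse of earlier
    # calculations that makes the algorithm feasible)
    for o in obs[1:]:
        f = {s: sum(f[sp] * A[sp][s] for sp in states) * B[s][o]
             for s in states}
    # termination: sum over all final states
    return sum(f.values())

states = ["hot", "cold"]
pi = {"hot": 0.8, "cold": 0.2}
A = {"hot": {"hot": 0.7, "cold": 0.3},
     "cold": {"hot": 0.4, "cold": 0.6}}
B = {"hot": {"1": 0.2, "2": 0.4, "3": 0.4},
     "cold": {"1": 0.5, "2": 0.4, "3": 0.1}}

likelihood = forward(["3", "1", "3"], states, pi, A, B)
```

Because each time step only needs the previous step's N values, the table of partial results replaces the exponential enumeration of all state sequences.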
A Hidden Markov Model (HMM) is the triple (Ps , Ptt , Ptw)
Let t be the number/set of states
Let w be the number of observations or words
Let Ps be the initial state distribution
Let Ptt be the state transition probability matrix
Let Ptw be the observation probability distribution (emission Probability)
We refer to states as t and to observations or words as w
Definition: A HMM is a triplet M = (S, Q, T) where:
• S is an alphabet of symbols
• Q is a finite set of states, capable of emitting symbols from the alphabet S
• T is a set of probabilities, comprised of:
  o State transition probabilities, denoted by akl for each k, l ∈ Q.
  o Emission probabilities, denoted by ek(b) for each k ∈ Q and b ∈ S.
We now need to distinguish the sequence of states from the sequence of symbols. The path π is a sequence of states. The path itself follows a simple Markov chain, so the probability of a state depends only on the previous state. As in the Markov chain model, we can define the state transition probabilities in terms of π:
akl = P(πi = l | πi-1 = k)
This is the probability of going to l given that we are at k. Since we have decoupled the symbols b from the states k, there is no longer a 1-1 correspondence between states and symbols. Thus we must introduce a new set of parameters for the model:
ek(b) = P(xi = b | πi = k)
This is the probability that the symbol b is seen in state k. These are known as emission probabilities.
Where for our convenience we denote π0 = begin and πL+1 = end.
First a state π1 is chosen according to the probabilities a0i. In that state an emission is emitted according to the distribution eπ1 for that state. Then a new state π2 is chosen according to the transition probabilities aπ1i, and so forth.
P(X, π) = a0,π1 · Π(i=1..L) eπi(xi) · aπi,πi+1 is the joint probability of an observed sequence X and a state sequence π.
In practice this is not very useful, since very often we do not know the path. So we have to estimate the path, either by finding the most likely path or by using an a posteriori distribution over states.
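When the path is known, the joint probability is just a product of one begin transition and then an emission and a transition for each position. The sketch below uses invented two-state parameters; "begin" and "end" play the roles of the silent states π0 and πL+1.

```python
# Sketch of the joint probability P(X, pi) of an observed sequence X
# and a known state path pi. Parameters are made up for illustration.

a = {  # transition probabilities a_kl, including silent begin/end states
    "begin": {"k": 0.5, "l": 0.5},
    "k": {"k": 0.6, "l": 0.3, "end": 0.1},
    "l": {"k": 0.2, "l": 0.7, "end": 0.1},
}
e = {  # emission probabilities e_k(b)
    "k": {"A": 0.9, "B": 0.1},
    "l": {"A": 0.2, "B": 0.8},
}

def joint_prob(symbols, path):
    """P(X, pi) = a_{0,pi1} * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}}."""
    p = a["begin"][path[0]]
    for i, (b, k) in enumerate(zip(symbols, path)):
        nxt = path[i + 1] if i + 1 < len(path) else "end"
        p *= e[k][b] * a[k][nxt]
    return p

p = joint_prob(["A", "B"], ["k", "l"])
```

For an unknown path this quantity would have to be maximized (Viterbi) or summed over all paths (forward), as discussed above.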
Bayesian classifiers are statistical classifiers that can be used in applications where the relationship between the attribute set and the class variable is non-deterministic; they include the naïve Bayesian classifier, in which the attribute values are assumed independent, and Bayesian belief networks, in which subsets of attributes have dependency relationships. Bayesian classification algorithms exhibit both descriptive and predictive behaviour: the probabilities are descriptive, and they are used to predict the class membership of a test record. Bayesian classifiers use Bayes' theorem, a statistical theorem for combining prior knowledge of the classes with new evidence gathered from data, to model probabilistic relationships between the attributes and the class variable.
Before we describe the naïve Bayesian classifier, we must describe Bayes' theorem. Bayes' theorem relates two conditional probabilities that are the reverse of each other; its components and equation are:
P(H) is the prior probability associated with hypothesis H
P(X) is the prior probability (the probability of occurrence) of data value X
P(X|H) is the conditional probability of X given H
P(H|X) is the posterior probability of H conditioned on X
P(H|X) = P(X|H) P(H) / P(X)
When comparing the posterior probabilities for different values of H, P(X) is always constant, and P(H) can be easily estimated from the training set; so to estimate the class-conditional probabilities P(X|H), we use either the naïve Bayes classifier or a Bayesian belief network.
Now we will look at how Bayes' theorem is used in the naïve Bayesian classifier.
Let T be the training set of records and their associated class labels, where each record is represented by an n-dimensional attribute vector X = (x1, x2, …, xn).
Let there be m classes C1, C2, …, Cm. Given a tuple X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci iff
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
By Bayes' theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X).
Computing P(X|Ci) directly is computationally expensive, so we make the naïve assumption of class-conditional independence:
P(X|Ci) = Π(k=1..n) P(xk|Ci)
In other words, naïve Bayes classifiers assume that the effect of a variable's value on a given class is independent of the values of the other variables. This assumption is called class-conditional independence. It is made to simplify the computation and is in this sense considered naïve [20, 21].
The Bayes classifier offers a simple and powerful supervised classification technique. The model assumes all input attributes to be of equal importance and independent of one another. The Bayes classifier can be applied to datasets containing both categorical and numeric data and, unlike many statistical classifiers, to datasets containing a wealth of missing items.
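The derivation above can be sketched directly in code: estimate P(Ci) and each P(xk|Ci) from counts in the training set, then pick the class maximizing their product. The toy weather records below are invented, and this sketch omits smoothing for unseen attribute values.

```python
# Minimal naive Bayes classifier over categorical attributes, following
# the decision rule above: choose Ci maximizing P(Ci) * prod_k P(x_k|Ci).
# Toy data; no smoothing, so unseen values get probability zero.

from collections import Counter

def train_nb(records, labels):
    n = len(labels)
    prior = {c: cnt / n for c, cnt in Counter(labels).items()}  # P(Ci)
    cond = {}  # cond[c][k][v] = P(attribute k has value v | class c)
    for c in prior:
        rows = [r for r, l in zip(records, labels) if l == c]
        cond[c] = []
        for k in range(len(records[0])):
            counts = Counter(r[k] for r in rows)
            cond[c].append({v: cnt / len(rows) for v, cnt in counts.items()})
    return prior, cond

def predict_nb(prior, cond, x):
    def score(c):
        p = prior[c]
        for k, v in enumerate(x):
            p *= cond[c][k].get(v, 0.0)  # class-conditional independence
        return p
    return max(prior, key=score)

records = [("sunny", "hot"), ("sunny", "mild"),
           ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
prior, cond = train_nb(records, labels)
label = predict_nb(prior, cond, ("rainy", "mild"))   # -> "yes"
```

Note that P(X) is never computed: as stated above, it is constant across classes, so comparing the numerators P(X|Ci) P(Ci) is enough.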
Decision Trees (DT)
Decision trees (DT) are among the most popular predictive modeling techniques, used in classification, clustering, and prediction tasks. Decision trees use a "divide and conquer" strategy to split the problem search space into subsets, much like the "Twenty Questions" game that children play.
In order to successfully apply a decision tree algorithm, a well-defined set of classes (categories) and a training set of pre-classified data are necessary. Decision tree quality is closely tied to the classification accuracy achieved on the training dataset, as well as to the size of the tree. DT algorithms are two-phase processes, consisting of a building phase and a pruning phase. In the building phase, the training data set is recursively partitioned until all instances in a partition belong to the same class; in the pruning phase, nodes are pruned to prevent over-fitting and to obtain a tree with higher accuracy.
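The building phase can be sketched for the simplest possible case: one numeric attribute, recursive splitting until every partition is pure. The data and split heuristic below are invented for illustration, and a real implementation would add the pruning phase and a proper impurity measure such as entropy or the Gini index.

```python
# Sketch of the DT building phase on one numeric attribute: recursively
# partition the pre-classified training data until each partition is
# pure (all one class). Illustrative only; no pruning phase is shown,
# and the data must be separable for the recursion to terminate.

def build(points):
    """points: list of (value, label). Returns a nested tree or a leaf label."""
    labels = {label for _, label in points}
    if len(labels) == 1:               # partition is pure -> leaf node
        return labels.pop()
    xs = sorted(v for v, _ in points)
    best = None
    # candidate thresholds: midpoints between consecutive distinct values
    for lo, hi in zip(xs, xs[1:]):
        if lo == hi:
            continue
        thr = (lo + hi) / 2
        left = [p for p in points if p[0] <= thr]
        right = [p for p in points if p[0] > thr]
        # crude impurity: how many child partitions are still mixed
        impure = sum(len({l for _, l in part}) > 1 for part in (left, right))
        if best is None or impure < best[0]:
            best = (impure, thr, left, right)
    _, thr, left, right = best
    return {"thr": thr, "left": build(left), "right": build(right)}

def classify(tree, v):
    """Walk the tree like a game of Twenty Questions."""
    while isinstance(tree, dict):
        tree = tree["left"] if v <= tree["thr"] else tree["right"]
    return tree

tree = build([(1, "a"), (2, "a"), (8, "b"), (9, "b")])
label = classify(tree, 7)   # -> "b"
```

Each internal node asks one question about the attribute value; the recursion in `build` is the partitioning step described above, stopping exactly when a partition contains a single class.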