The Overview of Text Mining in Email

Modern society has seen a massive explosion in email. This report examines the major financial and time costs associated with that growth, and studies how techniques such as text mining can be used to filter email into categories. These techniques work in real time, examining the content of incoming emails and classifying or filtering them as needed.

2 Overview of Text Mining in Email

2.1 The Rise in Email

In recent years, the number of emails a user receives daily, including email that is filtered out, has risen dramatically. Email has become a fundamental part of the modern economy and is central to communication between organizations, but users also receive a huge number of spam emails every day. A recent Symantec report [1] on the types of cybercrime attack found that the number of spam emails dropped in 2011, yet spam still accounted for seventy-five per cent of all email sent during the year, a total of 42 billion spam emails per day. A report by the Radicati Group [2] stated that an average corporate user receives around 105 emails daily and that, even excluding filtered spam, the user still receives around 20 spam messages per day.

Because of this rise, employers have recognized the need for some type of filtering application, not just to filter spam emails but also to categorize or classify emails so that they can be sent to the appropriate departments. This avoids unnecessary time wasting when emails are sent to the organization.

2.2 Characteristics of Email Mining

Text mining itself is a relatively new research area. Email analysis falls within the area of text mining, although it has certain characteristics that make it differ from ordinary text mining. The characteristics of email mining include:

Length of Email. The text typed in an email can be very brief. This could make email unsuitable for regular text mining techniques, which require large amounts of data to classify text.

Different Themes. One email may contain two or more themes. This could mean that categorizing the email may become extremely awkward.

New Words. As new words appear in email analysis, these must be dealt with appropriately. Although similar to problems with ordinary text mining, where new words form in everyday language, email analysis may require new classes to be formed.

Testing the Application. Because email is privately owned, testing of an application may be difficult. There are some datasets available online to achieve this but they may not be suitable for every application.

Noise. Noise is a big problem with email analysis. Attached documents and images and the actual code behind the email may have to be removed before classification. Spam emails have evolved to purposely contain noise to deceive email filtering applications.

Required Filtering. The required filtering of emails may be different from person to person.

Errors in the Text. The way in which some people write emails is becoming similar to that of text messages. In this type of message, the text may be written in a format unknown to the classifier. Spelling errors are also common.

Header. Although most of the code behind an email should be removed as noise, the header can contain vital information about the email itself. This could be used for classification.

3 Email Analysis

The steps involved in email analysis are as follows:

Pre-processing

Feature Selection

Email Classification

3.1 Pre-processing

The first step in analysing email in order to filter and classify it is pre-processing. This step involves extracting the raw data and turning it into a structure that can be understood by the application. Many emails now contain HTML code, which is used to format the text within the email. This code could be removed as noise with the use of an HTML parser, although certain HTML features could themselves be examined to classify the email, as explained by Corney, Vel, Anderson, & Mohay, 2002 [3]. They use the total number of HTML tags contained within the email, and how those tags are used, as a separate feature or attribute.
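
The HTML handling described above can be sketched with Python's standard `html.parser` module. This is a minimal illustration and not the method used in [3]: the class name `EmailHTMLStripper` and the function `strip_html` are hypothetical, and the only feature counted is the number of opening tags.

```python
from html.parser import HTMLParser


class EmailHTMLStripper(HTMLParser):
    """Collects visible text and counts opening tags in an HTML email body."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.tag_count = 0

    def handle_starttag(self, tag, attrs):
        # Every opening tag counts towards the HTML-tag feature.
        self.tag_count += 1

    def handle_data(self, data):
        # Text between tags is kept for later tokenization.
        self.text_parts.append(data)


def strip_html(body):
    """Return (plain_text, number_of_opening_tags) for one email body."""
    parser = EmailHTMLStripper()
    parser.feed(body)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    text = " ".join(" ".join(parser.text_parts).split())
    return text, parser.tag_count


plain, n_tags = strip_html("<html><body><p>Please <b>reply</b> today.</p></body></html>")
```

Here `plain` holds the text to be tokenized, while `n_tags` could be kept as an extra attribute of the email rather than discarded as noise.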

The model most commonly used is the standard "Term Vector Model", which represents every email as an ordered array of data. Each element of the array corresponds to a "token" from the "bag of words" built over all emails. In the simplest case, each element is given a Boolean value that records whether the token is present in the email. Alternatively, each element is given a "weight" derived from the number of occurrences of the token, scaled to a value between 0 and 1. Tokens need not be single words: multiple words may be expressed as a single token. Although these multiword expressions are particularly difficult to determine, they can be extremely important when classifying emails for certain departments. As explained by Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002) [4], a simple phrase such as "Oakland Raiders" could be construed incorrectly if it is not treated as one token. Therefore, it is important that relevant multiword expressions be defined during pre-processing.
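
A small sketch may make the term vector representation concrete. It assumes a hand-written list of multiword expressions (`MULTIWORD_EXPRESSIONS` is hypothetical) and uses relative frequency as the weight, which is only one of several ways to keep weights between 0 and 1.

```python
import re
from collections import Counter

# Hypothetical list of multiword expressions to keep as single tokens.
MULTIWORD_EXPRESSIONS = ["oakland raiders"]


def tokenize(text, multiword=MULTIWORD_EXPRESSIONS):
    """Lower-case the text, join known multiword expressions, split into tokens."""
    text = text.lower()
    for phrase in multiword:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return re.findall(r"[a-z0-9_']+", text)


def term_vectors(emails):
    """Return (vocabulary, boolean_vectors, weighted_vectors) for a list of emails."""
    token_lists = [tokenize(e) for e in emails]
    vocabulary = sorted(set(t for tokens in token_lists for t in tokens))
    boolean_vectors, weighted_vectors = [], []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        boolean_vectors.append([1 if term in counts else 0 for term in vocabulary])
        # Relative frequency keeps every weight between 0 and 1.
        weighted_vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, boolean_vectors, weighted_vectors


vocab, boolean, weighted = term_vectors([
    "The Oakland Raiders game is on Sunday",
    "Sunday meeting moved, meeting room changed",
])
```

In this toy run "Oakland Raiders" becomes the single token `oakland_raiders`, so neither word is counted on its own.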

Other issues that arise during pre-processing include frequently occurring words and words that share a stem. Shared-stem words are words from the same family, for example "require", "required" and "requiring", and may need to be treated as one token. Stemming algorithms such as Snowball or Porter's make this straightforward. To deal with frequent but uninformative words, such as "it" or "and", a "stop-words" operator could be applied before classification. One reason not to remove these words is their use in author identification: as stated by Vel, Anderson, Corney, & Mohay, 2001 [5], the way in which these words are used, and how often, can be an influential factor when identifying an author.
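
The two operations can be illustrated as follows. Note the assumptions: the stop-word set is a tiny illustrative subset, and `crude_stem` is a deliberately simple suffix stripper written for this example, not Porter's algorithm or Snowball.

```python
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "to"}  # illustrative subset
SUFFIXES = ("ing", "ed", "es", "s", "e")  # crude rules, not Porter's algorithm


def crude_stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(tokens, remove_stop_words=True):
    """Drop stop-words (optionally) and map the remaining tokens to their stems."""
    kept = [t for t in tokens if not (remove_stop_words and t in STOP_WORDS)]
    return [crude_stem(t) for t in kept]


stems = normalize(["it", "is", "required", "and", "requiring", "the", "require"])
```

The `remove_stop_words` flag reflects the point made above: an application concerned with who wrote an email may prefer to keep these words.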

The significance of words and where they are situated within the email are important considerations. For example, the word "From" is far more important if contained within the header. Examples like this may be treated as different tokens if they appear in different sections. The importance of a word can also be determined by using the TF-IDF (Term Frequency - Inverse Document Frequency) algorithm. This algorithm determines the weight of a token by first calculating how often the token occurs within the email and comparing it to how often it occurs in other emails. A token is considered significant if it occurs frequently within an email and infrequently in others.
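
As a rough sketch, one common form of the TF-IDF weight is the token's relative frequency in the email multiplied by log(N / df), where N is the number of emails and df is the number of emails containing the token. Other variants exist; the function name `tf_idf` and the toy emails are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(token_lists):
    """Return one {token: weight} dict per email using tf * log(N / df)."""
    n_docs = len(token_lists)
    # Document frequency: number of emails that contain each token.
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            t: (c / total) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        })
    return weights


emails = [
    ["invoice", "payment", "invoice", "due"],
    ["meeting", "payment", "agenda"],
    ["meeting", "agenda", "room"],
]
scores = tf_idf(emails)
```

In the first email, "invoice" receives the highest weight because it occurs twice there and nowhere else, while "payment" is penalized for also appearing in the second email.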

3.2 Feature Selection

As the feature set builds, the number of features may grow beyond what the available resources can handle. In fact, the number can grow into the tens if not hundreds of thousands. A report in 2010 by Harvard University and Google researchers found the English language to contain over one million words [6]. Feature Selection is therefore required to reduce the number of features to a workable feature set size. Algorithms such as the TF-IDF weighting explained earlier can be used to select features by importance. They do this by ranking each feature in the bag of words by some determining factor and selecting the "n" highest ranked features.

More popular algorithms used for feature selection are "Information Gain" (IG) and "Chi Squared" (CHI), explained by Yang & Pedersen, 1997 [7]. They found these methods to be the most effective at removing features without loss of accuracy.
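
A chi-squared ranking can be sketched from a 2x2 contingency table of term presence against category membership. The sketch below scores each term by its maximum chi-squared value over the categories and keeps the "n" best; the function names and the four toy emails are assumptions made for illustration.

```python
def chi_squared(docs, labels, term, category):
    """Chi-squared statistic between a term and a category (2x2 contingency table)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # term present, email in category
        elif has_term:
            b += 1  # term present, email in another category
        elif in_cat:
            c += 1  # term absent, email in category
        else:
            d += 1  # term absent, email in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(docs, labels, n_features):
    """Rank terms by their maximum chi-squared score over categories; keep the top n."""
    vocabulary = sorted(set(t for tokens in docs for t in tokens))
    categories = sorted(set(labels))
    scored = [
        (max(chi_squared(docs, labels, t, cat) for cat in categories), t)
        for t in vocabulary
    ]
    # Highest score first; ties are broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:n_features]]


docs = [{"invoice", "please"}, {"invoice", "payment"}, {"meeting", "please"}, {"meeting", "room"}]
labels = ["finance", "finance", "admin", "admin"]
top = select_features(docs, labels, 2)
```

The word "please" appears equally in both categories, so its score is zero and it is the first candidate for removal.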

3.3 Email Classification

Step three of the email analysis is email classification, in which each email is assigned to its respective category. Two types of classification exist: "Flat" and "Hierarchical". In flat classification, all classes are at the same level, whereas in hierarchical classification, classes are split into classes and sub-classes. To build a model which classifies emails, one or more classifiers are applied. Examples of classifiers include "Naïve Bayes", "Support Vector Machines" and "Back-Propagation Neural Networks" (BPNN).
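
As an illustration of a flat classifier, the following is a minimal multinomial Naïve Bayes with add-one smoothing. The class name, the training emails and the two categories are invented for the example, and tokens never seen in training are simply ignored, which is one possible design choice.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, token_lists, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(token_lists, labels):
            self.token_counts[label].update(tokens)
        self.vocabulary = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.token_counts[label].values())
            # Log prior plus the smoothed log likelihood of each known token.
            score = math.log(n_label / n_docs)
            for t in tokens:
                if t in self.vocabulary:
                    count = self.token_counts[label][t]
                    score += math.log((count + 1) / (total + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(
    [["invoice", "payment", "due"], ["payment", "overdue"],
     ["meeting", "room"], ["meeting", "agenda", "room"]],
    ["finance", "finance", "admin", "admin"],
)
prediction = model.predict(["payment", "for", "invoice"])
```

A hierarchical scheme could be built from the same pieces by training one such classifier per level, for example one to choose the department and another to choose the sub-class within it.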

Originally, the most common classifier used in the categorization of emails was Naïve Bayes, but as early as 2001 Carreras & Marquez [8] demonstrated better-performing algorithms. They showed that the "AdaBoost" algorithm outperformed Naïve Bayes and Decision Trees for spam email filtering. More recently, a "Semantic Feature Space" (SFS), which is a technique for extracting the more important features from the dataset, has been used along with modified versions of the BPNN, and this combination has improved classification results. Zhu & Yu, 2009 [9] proposed this mechanism. The SFS is used to reduce the number of dimensions that are fed into the BPNN, and the BPNN is also modified to save computational time. Huang & Li, 2012 [10] used a similar technique to achieve a rich feature set. They built an SFS from training data and a thesaurus of words from the relationships between them, combined the two, and applied the result to an Adaptive Back-Propagation Neural Network (ABPNN). The ABPNN algorithm applies statistical methods to assess each learning stage.

3.4 Email Clustering

The next step in the process is optional. The goal is to place each email into its respective folder (cluster), which is done automatically by a clustering algorithm. The most popular algorithm for this step is the "k-means" algorithm.
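
The k-means procedure alternates between assigning each email to its nearest centroid and moving each centroid to the mean of its members. The sketch below assumes emails have already been turned into numeric term-weight vectors and, to keep the run deterministic, seeds the centroids with the first k points; practical implementations usually choose the seeds at random.

```python
def k_means(points, k, iterations=20):
    """Cluster equal-length numeric vectors; returns one cluster index per point."""
    # Assumption: the first k points seed the centroids, which keeps the run deterministic.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            distances = [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
            assignment[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy term-weight vectors over the vocabulary (invoice, payment, meeting, room).
vectors = [
    [1.0, 0.8, 0.0, 0.0],
    [0.0, 0.1, 0.9, 1.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.0, 1.0, 0.8],
]
folders = k_means(vectors, k=2)
```

Each value in `folders` is the index of the folder an email would be placed in; the two finance-like emails share one folder and the two meeting-like emails share the other.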

4 Uses of Email Mining

There are many reasons why a modern organization would employ some kind of email mining, and most of them centre on saving time. As explained earlier, the amount of time spent examining emails is creating major problems for organizations. Email mining and the subsequent automated handling of emails can save a significant number of hours.

4.1 Automated Email Response

After an email has been categorized, a response can potentially be sent automatically. This is done using a Question-Answer (QA) system (Gupta, Kashyap, Kumar, & Mittal, 2005) [11]. First, a classifier finds and categorizes the "question" within the email. The question is then parsed to extract relevant information. After the question has been processed, candidate responses are weighted and ranked, and the most relevant one is submitted to the user. This can be very helpful in call centres, where questions on one subject can be numerous.

4.2 Email Separation by Folder

Many email programs today allow the user to separate emails by folder. The level of importance of an email, or how emails are separated, can then be determined by the user. Email classification can automate this process and allow employers to categorize email by importance, e.g. business emails over personal emails. A study by Koprinska, Poon, Clark & Chan, 2007, [12] showed the difficulty in classifying emails in this way. Their study showed that user classification style greatly affected results. The classifier performed well for subjects like "sender" but performed poorly when attempting to classify by areas such as "action performed". There are many software applications available to classify emails in this way, such as TITUS, POPFile and janusSEAL.

4.3 Email Summarization

Email Summarization incorporates two areas, i.e. Collective Message Summarization (CMS) and Individual Message Summarization (IMS). CMS is the summarization of a collection of messages pertaining to one subject, while IMS is the summarization of individual messages. Before a meeting, for example, an employee may need to review a conversation on a particular subject. CMS would solve this and has been demonstrated using "Clue Words" to determine if messages belong to a certain conversation (Carenini, Ng & Zhou, 2007) [13]. CMS is relatively new and is still under research.

IMS has been in use for a little longer, as it is simpler to summarize each message independently. A system called CLASSY, introduced by the IDS and CCS, has been shown to summarize text documents, and later email messages, into smaller, more manageable text without loss of important content. Conroy, Schlesinger, O'Leary & Goldstein, 2006, [14] showed how they used CLASSY to split, trim and score sentences, achieving high scores when the summaries were evaluated with the ROUGE package.

4.4 Spam Filtering

The filtering of spam email messages is becoming more and more complex as spammers alter their methods. Spammers' motives are usually financial and, therefore, as spam filters evolve, so do spamming methods. There are two types of spam filter: list-based (non-statistical) and machine learning (statistical). Non-statistical methods use DNS blacklists, which are lists of domain names of identified spam sources, and whitelists, which are lists of accepted domain names, to filter spam messages. Statistical methods use machine learning to detect spam. If a user flags an email as spam, the filter uses its content as further training data. A classifier is then applied; a neural network, for example, can use weighted learning to classify the messages.
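
The list-based approach can be sketched in a few lines. The domain names and the in-memory sets below are hypothetical; a real filter would query a DNS blacklist service instead of a hard-coded set. Senders that appear on neither list are left for a statistical classifier to judge.

```python
import re

# Hypothetical lists; a real deployment would query a DNS blacklist service.
BLACKLIST = {"spam-source.example"}
WHITELIST = {"partner.example"}


def sender_domain(from_header):
    """Extract the lower-cased domain from a From header value."""
    match = re.search(r"@([A-Za-z0-9.-]+)", from_header)
    return match.group(1).lower() if match else None


def list_based_verdict(from_header):
    """Return 'accept', 'spam' or 'unknown' using only the whitelist and blacklist."""
    domain = sender_domain(from_header)
    if domain in WHITELIST:
        return "accept"
    if domain in BLACKLIST:
        return "spam"
    # Unknown senders would be passed on to a statistical classifier.
    return "unknown"


verdicts = [
    list_based_verdict("Alice <alice@partner.example>"),
    list_based_verdict("offers@Spam-Source.example"),
    list_based_verdict("bob@elsewhere.example"),
]
```

Checking the whitelist first means an accepted sender is never blocked by a blacklist entry, which reflects the cost of misfiling a legitimate email.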

Spam filtering can be further split into two categories: server-side and client-side. In server-side filtering, ISPs filter spam on their email servers, and the flagged messages are delivered to the user marked as spam or junk. This takes some of the burden from users, but may still require them to check the spam messages to ensure none have been incorrectly marked as spam. The handling of spam therefore carries greater importance than most email mining: putting an important email into a spam folder could be a serious error, so a classifier should be extremely accurate in its decisions.

There have been many studies into which classifier works best for filtering spam, with the common consensus being Naïve Bayes. More recently, a combination of statistical and non-statistical methods has been applied to combat spam. Wu, 2008, [15] examined the possibility of using a BPNN with spamming behaviours, instead of keywords, and found behaviours to be a good identifier, although the convergence time of the BPNN was "unstable".

New forms of spam include image spam, which is more difficult to detect. Regular spam filters use the text within a message to detect spam, so spammers have started embedding the message in an image to avoid filters. Khanum & Ketari, 2012, [16] noted that most image spam filtering techniques today use pattern matching and that these messages are considerably larger than conventional messages. Filtering them is therefore more computationally expensive and could use up a lot of server resources.

4.5 Email Ownership

In forensic investigation, email ownership is very important. Emails could be mined to attribute ownership using many of an email's characteristics. Apart from the obvious header information, which can be modified, traits within the body text are used, e.g. greetings and farewells, blank spaces, and the length of words and sentences.
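
A few of these traits can be measured with very little code. The sketch below is illustrative only: it treats the first and last non-empty lines as stand-ins for the greeting and farewell, and the feature names are assumptions rather than an established feature set.

```python
import re


def style_features(body):
    """Compute a few simple writing-style measurements for one email body."""
    lines = body.splitlines()
    non_empty = [ln for ln in lines if ln.strip()]
    words = re.findall(r"[A-Za-z']+", body)
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    return {
        # First and last non-empty lines stand in for greeting and farewell.
        "greeting": non_empty[0].strip() if non_empty else "",
        "farewell": non_empty[-1].strip() if non_empty else "",
        "blank_lines": len(lines) - len(non_empty),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


features = style_features("Hi Tom,\n\nThe report is done. Send it on.\n\nCheers, Ann")
```

Feature dictionaries like this one, computed for emails of known authorship, could then be fed to any of the classifiers described earlier.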

5 Conclusion