E-mail Mining and Stylometric Analysis


E-mail communication is abused for numerous illegitimate purposes because of its simple and inherently vulnerable nature. Common e-mail mediated cyber crimes include spamming, phishing, drug trafficking, cyber bullying, child pornography, and sexual harassment. In this context, forensic analysis plays a major role by examining suspected e-mail accounts to gather evidence. We apply statistical analysis, e-mail mining, and stylometric analysis to e-mail forensics in order to help investigators gather clues and evidence in investigations in which e-mail communication is relevant.

Keywords: E-mail forensics, Classification, Clustering, Statistical analysis


1. Introduction

In the majority of e-mail mediated cyber crimes, the victimization tactics vary from simple anonymity to identity theft and impersonation. E-mail communication is exposed to such illegitimate uses because of two inherent limitations. First, there is no mechanism for message encryption at the sender end and/or for an integrity check at the recipient end. Second, the widely used e-mail protocol, Simple Mail Transfer Protocol (SMTP), lacks a source authentication mechanism. In fact, the metadata in the header of an e-mail, which contains information about the sender and the path along which the message has traveled, can easily be forged or anonymized. Installing antivirus software, filters, firewalls, and scanners is insufficient to secure e-mail communication. In this context, cyber forensic investigation (also called digital investigation) is employed to collect credible evidence by analyzing e-mail collections. The scope of e-mail analysis ranges from simple keyword searching to authorship attribution of anonymous e-mails. For instance, an investigator may want an overview of an e-mail collection, obtained by computing simple statistics such as the distribution of e-mails per sender or recipient domain. In some situations an investigator may try to narrow the scope of investigation by separating the (usually few) malicious e-mails from regular ones. For this purpose, content-based clustering is usually applied to divide e-mails into groups on the basis of their subject matter. The subject matter could be the type of crime, such as pornography, hacking, or terrorism. E-mails can also be clustered on the basis of stylometric features to determine the writing styles of the different individuals represented in an e-mail collection.

2. Literature Review

Wei et al. proposed a clustering algorithm that detects relationships among spam e-mails in order to identify relationships between spam campaigns. The features they used are extracted and derived from the headers and attachments of spam e-mails. An investigator may also be interested in detecting similarity between certain e-mails, for example in plagiarism detection and authorship analysis. Stolfo et al. suggested that e-mail social network analysis techniques be used to study the communication patterns of individuals at the e-mail account level, without analyzing the actual contents of e-mails. There is a need for an integrated e-mail analysis tool that combines the above-mentioned techniques. Such a tool would help forensic experts analyze e-mail collections (which are usually huge) efficiently within a limited time frame. E-mail Mining Toolkit (EMT) is one such framework; it computes behavior profiles of users based on their e-mail accounts. These profiles are then employed to detect anomalous behavior of those users. The toolkit is useful for generating reports that summarize e-mail archives. However, it does not address authorship attribution or similarity detection. Abbasi, Chen, and Zheng et al. proposed a stylometry-based framework that is used for e-mail authorship identification only. Rachid Hadjidj et al. designed and implemented a comprehensive software toolkit called Integrated E-mail Forensic Analysis Framework (IEFAF). The framework helps investigators gather clues and evidence during investigations in which e-mail communications are relevant. It is usually assumed that the stylometric features found in one's documents remain consistent and are not controlled, either consciously or unconsciously, by the writer. However, substantial variation in an individual's style can be seen in both the contents and the stylometric features, depending on the recipient and the context.

3. Proposed approach

The theoretical foundation of our study rests on well-established techniques of statistical analysis, e-mail mining (classification and clustering), and stylometric feature analysis. Stylometry is the statistical study of writing style through five kinds of features: lexical, syntactic, structural, domain-specific, and idiosyncratic. Stylometric feature analysis is applied to learn about a user's writing behavior at the content level. In forensic investigation, it is imperative to localize individuals and their resources in order to collect more concrete evidence.

4. E-mail statistical analysis

Statistical analysis of e-mail accounts, based on observing their communication patterns, reveals a great deal of information. For instance, to obtain a brief summary of an e-mail corpus, simple statistics such as the number of e-mails per sender, per recipient, per sender domain, per recipient domain, per class, and per cluster are calculated. Moreover, computing similar statistics for a user, including e-mailing frequency during different parts of the day, average e-mail size, and average attachment size (if any), helps reveal some non-trivial information. For instance, an e-mail user may send on average 20-30 e-mails to his/her co-workers during the day, which may drop to 5-10 e-mails at night. Similarly, the average mail size of a user may be 2-5 KB, usually with small attachments. If the same e-mail account suddenly transmits hundreds of large e-mails with heavy attachments to unknown recipients, this points to possibly suspicious behavior. Such observations may help investigators narrow the scope of the investigation by shortlisting suspects. More explicitly, e-mail accounts that show some kind of unusual behavior are selected for further investigation. Determining the total number of users (senders/recipients) within an e-mail collection, finding all the recipients of each user, and determining whether an e-mail has been replied to also help during an investigation. Statistical distributions can be computed over a certain period of time and for a specific set of e-mails. Additional statistics can be computed dynamically by sending appropriate SQL queries to the database. A more advanced use of statistical distributions can help compute user profiles that can be used for authorship identification. To compute statistics on an e-mail corpus, each e-mail is first loaded from its raw file, and relevant fields, such as the sender, recipient, subject, and message body, are extracted. The extracted information is stored in database tables.
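The kind of SQL-driven statistics described above can be sketched as follows. This is only an illustration: the table name `emails`, its columns, the sample rows, and the 08:00-19:59 definition of "day" are all assumptions made for the example, since the essay does not specify a schema.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a hypothetical 'emails' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE emails ("
        "sender TEXT, recipient TEXT, sent_hour INTEGER, size_kb REAL)"
    )
    rows = [  # made-up sample data
        ("alice@corp.example", "bob@corp.example", 10, 3.0),
        ("alice@corp.example", "carol@corp.example", 14, 4.0),
        ("alice@corp.example", "x@unknown.example", 2, 950.0),
        ("bob@corp.example", "alice@corp.example", 11, 2.5),
    ]
    conn.executemany("INSERT INTO emails VALUES (?, ?, ?, ?)", rows)
    return conn

def emails_per_sender(conn):
    """Number of e-mails per sender address."""
    cur = conn.execute(
        "SELECT sender, COUNT(*) FROM emails GROUP BY sender ORDER BY sender"
    )
    return cur.fetchall()

def emails_per_recipient_domain(conn):
    """Number of e-mails per recipient domain (text after the '@')."""
    cur = conn.execute(
        "SELECT substr(recipient, instr(recipient, '@') + 1) AS domain, COUNT(*) "
        "FROM emails GROUP BY domain ORDER BY domain"
    )
    return cur.fetchall()

def volume_by_period(conn, sender):
    """E-mail count and average size for one sender, split into day and night.

    'Day' is assumed to be 08:00-19:59; the essay gives no exact boundaries.
    """
    cur = conn.execute(
        "SELECT CASE WHEN sent_hour BETWEEN 8 AND 19 THEN 'day' ELSE 'night' END "
        "AS period, COUNT(*), AVG(size_kb) "
        "FROM emails WHERE sender = ? GROUP BY period ORDER BY period",
        (sender,),
    )
    return cur.fetchall()
```

In this made-up data, the single large night-time message to an unknown domain stands out against the sender's small day-time messages, which is the sort of deviation an investigator would look at more closely.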

5. E-mail mining

Data mining is a mathematical process designed to explore large amounts of data by capturing consistent patterns and relationships between data objects. By employing mathematical models, the knowledge acquired from interesting patterns is applied to make predictions about unseen data. The application of data mining techniques to e-mail datasets has been very successful in cyber crime investigation. Several studies underline the importance of e-mail mining for resolving issues of identity theft and plagiarism in e-mail forensic investigation. Classification is used to identify the topic and/or the author of e-mails. Clustering, on the other hand, is used to group e-mails on the basis of contents and stylometric features.

5.1. E-mail classification

In general, the process of classification starts with data cleaning, followed by feature extraction. The extracted feature vectors are split into two groups, a training set and a testing set. Each instance of the training data has a definite category, called its class label. The training set is given as input to a classification function (classifier) to develop a model. Common classifiers include decision trees, neural networks, and Support Vector Machines (SVM). The developed model is tested with the testing set by assuming that the class labels are not known. The validated model is then employed for classification of unseen data. Usually, the larger the training set, the better the accuracy of the model. In the context of e-mail classification, the body and subject of an e-mail are converted to a vector of metrics called features. Usually, each e-mail (subject and body) is converted into a stream of characters. Using the Java tokenizer API, each character stream is converted into distinct tokens or words. Some of the words may appear in different forms (for instance, verb, noun, and adjective) or different tenses (such as present, past, and future). Such words are stemmed to their common root. For instance, finance, financial, and financing may be converted to finance. Syntactic features, also known as style markers (punctuation and all-purpose short words called stop-words), are treated differently in different data mining applications. For example, they are dropped in topic-based classification and kept in author-based classification because of their significant discriminating power in identifying authors from their writings. Certain word sequences, such as 'United States of America' and 'United Arab Emirates', often appear together; treating their words separately may increase the dimensionality of the feature space. A module is developed to scan for those sequences automatically and treat them as single tokens. Using the vector space model representation, all the e-mails are converted into feature vectors, and normalization is applied to the columns as needed. The purpose of normalization is to limit all the values of a certain feature to a specific range and to avoid overweighting some attributes relative to others.
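The preprocessing steps described above (tokenizing, dropping stop-words for topic-based use, stemming, building vectors, and normalizing columns) can be sketched in a few lines. The essay's system uses a Java tokenizer; the Python below is only an illustration, and the stop-word list, the suffix list, and all function names are assumptions. In particular, `crude_stem` is a toy suffix stripper, not a real stemming algorithm such as Porter's, so it maps finance, financial, and financing to the truncated root "financ" rather than to a dictionary word.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for"}  # tiny illustrative list
SUFFIXES = ("ing", "ial", "e", "s")  # toy suffix list, an assumption

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_counts(text, drop_stop_words=True):
    """Count stemmed tokens; keep stop-words when drop_stop_words is False."""
    tokens = tokenize(text)
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(crude_stem(t) for t in tokens)

def to_vectors(texts):
    """Return (vocabulary, rows) where each row holds raw term counts."""
    counts = [term_counts(t) for t in texts]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]

def min_max_normalize(vectors):
    """Rescale every column to the range [0, 1]; constant columns become 0."""
    normalized_cols = []
    for col in zip(*vectors):
        lo, hi = min(col), max(col)
        span = hi - lo
        normalized_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*normalized_cols)]
```

For author-based classification, the same functions could be called with `drop_stop_words=False` so that the style markers are retained.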

5.1.1 Topic-based classification

Most spam filtering and scanning techniques are based on topic or content-based classification. Analogously, in forensic analysis, e-mails are classified as malicious if their contents match a particular cyber crime taxonomy. In contrast to traditional keyword searching, which is inefficient and error prone, classification techniques are more precise and robust to noise and dimensionality. For instance, to identify e-mails (usually from a huge e-mail collection) that promote drug trafficking, one can perform a simple search with the word 'drug' or other related keywords. However, the criminal community often uses special expressions and encrypted messages to communicate covertly. Most culprits use different names and speech artifacts to hide information. Classifiers, on the other hand, are not limited to a few keywords; they are trained on multidimensional data and thus are less affected by information hiding. Topic classification is achieved using a classical text mining approach. Depending on the number of target groups/categories, each instance in the data set is assigned one of two class labels or one of several. The investigator, for instance, may want to classify an e-mail as 'malicious' or 'non-malicious' (normal), or classify e-mails into more than two categories, such as 'pornography', 'spamming', and 'terrorism'. The class label in this case is the crime type/group. In topic-based classification, the context-independent words, called stop words (function words and punctuation), are removed and only the content-specific features are retained. The frequency of each token is calculated. The resultant frequencies are normalized to a value between 0 and 1. As a result, each e-mail Ei is represented as {f1, ..., fn}, where each feature fi is the normalized frequency of a word wi. The next step is to apply a classification model to the set of feature vectors. For this purpose, the data mining software Weka is used. The feature vectors are converted into a Weka-compatible format, Attribute-Relation File Format (ARFF).
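To make the ARFF conversion concrete, the sketch below writes normalized feature vectors and their class labels as ARFF text. The function name `to_arff`, the relation name, and the sample features are hypothetical, and the sketch assumes attribute names and labels are simple tokens without spaces or special characters (ARFF would otherwise require quoting).

```python
def to_arff(relation, feature_names, rows, labels):
    """Render numeric feature vectors plus a nominal class attribute as ARFF text.

    Assumes feature names and labels need no quoting.
    """
    class_values = sorted(set(labels))
    lines = [f"@relation {relation}", ""]
    for name in feature_names:
        lines.append(f"@attribute {name} numeric")
    # The class attribute is nominal: its allowed values are listed in braces.
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines.append("")
    lines.append("@data")
    for row, label in zip(rows, labels):
        lines.append(",".join(str(v) for v in row) + "," + label)
    return "\n".join(lines) + "\n"
```

Calling `to_arff("emails", ["drug", "financ"], [[0.0, 1.0], [1.0, 0.0]], ["normal", "malicious"])` yields a header with two numeric attributes, a nominal class attribute, and one data row per e-mail, which can then be saved to a `.arff` file and opened in Weka.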

5.1.2. Author-based classification

The second application of classification is to identify the author of an anonymous e-mail. The class label used for this purpose is the 'author' or 'sender' of an e-mail.

5.2. E-mail clustering

Clustering is the process of grouping data into semantically similar sets, which simplifies the data by modeling it through its clusters. In the case of e-mail mining, clustering is used to group e-mails on the basis of discussion topic and authorship.

5.2.1. E-mail discussion topic

To cluster e-mails by discussion topic, e-mails are first processed for feature extraction. Three commonly used clustering algorithms are Expectation Maximization (EM), K-Means, and bisecting K-Means. Once the clusters are obtained, each cluster is tagged with the most and the least frequent words/phrases found in it. Tagging clusters with the least frequent words helps in finding inter-cluster relationships. In addition to identifying the 'subject matter' of a group of e-mails, clustering can also be employed to speed up query-based keyword searching. Instead of scanning each e-mail for a keyword, all the e-mails are first clustered and each cluster is tagged with its most frequent words, which are then matched against the keyword in question. The matched clusters are retrieved in order of relevance to the search criterion (query contents). Another application of clustering is to identify the most plausible author of an anonymous e-mail. In this case, the stylometric features are not discarded but are used to differentiate between the writings of different suspects. Clustering is applied to the anonymous e-mails together with e-mails of known authors. The resulting clusters are tagged with their most frequent senders. If the anonymous e-mail appears in a cluster where a specific sender is the most frequent, that sender is declared the most probable author of the disputed e-mail, because he or she has the most e-mails similar to it.
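The clustering-based attribution described above can be sketched with a minimal K-Means and a sender vote. This is an illustration under stated assumptions, not the essay's implementation: centroids are seeded with the first k vectors to keep the run deterministic, the feature vectors are assumed to be already extracted, and the function names are hypothetical.

```python
from collections import Counter

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(vector, centroids[c]))

def k_means(vectors, k, iterations=20):
    """Plain K-Means returning one cluster index per vector.

    Seeding with the first k vectors is an assumption made for determinism.
    """
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    centroids = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        assignments = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

def most_probable_author(vectors, senders, anonymous_index, k):
    """Most frequent known sender in the anonymous e-mail's cluster, or None.

    senders holds None for e-mails whose author is unknown.
    """
    assignments = k_means(vectors, k)
    target = assignments[anonymous_index]
    votes = Counter(
        sender
        for i, (sender, cluster) in enumerate(zip(senders, assignments))
        if cluster == target and i != anonymous_index and sender is not None
    )
    return votes.most_common(1)[0][0] if votes else None
```

With stylometric vectors for known e-mails from two suspects and one anonymous e-mail, `most_probable_author` returns whichever suspect dominates the cluster the anonymous e-mail falls into.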

5.2.2. E-mail authorship attribution

Anonymity in e-mail communication is one of the main weaknesses exploited by terrorists, pedophiles, and scammers. Falsifying the sender name, the address, and the path along which an e-mail travels is generally termed spoofing or forging, and can be done even by a novice user. Traditionally, fingerprints are used to uniquely identify individuals during criminal investigations. Analogously, word-prints or write-prints, constituted by the writing style features of an author, can be used to discriminate his/her writings from those of others. The goal is to determine the likelihood that a specific individual is the author of an anonymous e-mail by examining his/her previously written e-mails. The problem of authorship identification in the context of e-mail forensics is distinct from traditional authorship problems in two ways. First, by assumption, the true author should certainly be one of the suspects. Second, e-mails, though short, usually contain rich information, as an e-mail normally consists of a header, subject, body, and attachments. Authorship identification is treated as a text classification problem. The process starts by extracting the writing style features from the previously known e-mails of a person. Using these features, a classifier is trained. Authorship attribution techniques have been successful in resolving ownership disputes over literary and historic documents.

6. Stylometric features

Writing styles are defined in terms of stylometric features. Writing patterns are usually characterized by word usage, word sequences, composition and layout, common spelling and grammatical mistakes, vocabulary richness, hyphenation, and punctuation. However, there is no single feature set that is optimized for and equally applicable to all domains. The features commonly used in authorship analysis studies are lexical, syntactic, structural, content-specific, and idiosyncratic. Lexical (token-based) features are collected either in terms of characters or words. In terms of characters, for instance, the frequency of letters, the frequency of capital letters, the total number of characters per token, and the character count per sentence are the most relevant metrics. Word-based lexical features may include word length distribution, words per sentence, and vocabulary richness. Structural features measure the overall appearance and layout of a document. For instance, average paragraph length, number of paragraphs per document, and the presence of greetings and their position within an e-mail are common structural features. Content-specific features are collections of keywords commonly found in a specific domain and may vary from context to context even for the same author. Idiosyncratic features include common spelling mistakes, such as writing 'f' instead of 'ph' (say, in 'phishing'), and grammatical mistakes, such as sentences containing an incorrect verb form. The list of such characteristics varies from person to person and is difficult to control.
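A few of the lexical and structural features named above can be computed with simple string processing. The sketch below is illustrative only: the feature names, the greeting list, the sentence and paragraph splitting rules, and the use of the type-token ratio as a vocabulary richness measure are assumptions, and a real feature set would be considerably larger.

```python
import re

GREETINGS = ("hi", "hello", "dear")  # illustrative list, an assumption

def stylometric_features(body):
    """Compute a handful of lexical and structural features of an e-mail body."""
    paragraphs = [p for p in re.split(r"\n\s*\n", body) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    words = re.findall(r"[A-Za-z']+", body)
    lowered = [w.lower() for w in words]
    letters = [ch for ch in body if ch.isalpha()]
    return {
        # Character-based lexical features
        "char_count": len(body),
        "capital_ratio": sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0,
        "punctuation_count": sum(ch in ".,;:!?" for ch in body),
        # Word-based lexical features
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "vocabulary_richness": len(set(lowered)) / len(words) if words else 0.0,  # type-token ratio
        # Structural features
        "paragraph_count": len(paragraphs),
        "starts_with_greeting": bool(lowered) and lowered[0] in GREETINGS,
    }
```

Each e-mail body then maps to one dictionary, whose values can be laid out in a fixed order to form the feature vector used for clustering or classification.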

6.1. Proposed attribution approach

An investigator is provided with e-mails previously written by potential suspects. The available e-mails could be in different formats, written in different languages, and may contain images, video clips, and/or HTML/XML tags. Only the textual part of the e-mail body written in English is extracted; all other parts of an e-mail message are dropped. The proposed approach consists of two major steps: e-mail grouping or categorization, followed by classification. First, the entire e-mail collection Ei of a suspect Si, where Si ∈ {S1, ..., Sn}, is divided into distinct groups {SiG1, ..., SiGk}. For instance, grouping is performed on the basis of e-mail recipient, e-mail sender, e-mail time stamp, or a combination of them. In the case of the e-mail body, the well-known data mining technique of clustering is applied to detect similarity among e-mails based on contents. Clustering is performed on contents and stylometric features. Next, using sender-recipient, sender-time stamp, and cluster tag as class labels, a classifier is built. The classifier thus built captures the isolated and distinct styles without being misled by the overlapping behavior of an author. The anonymous e-mail is parsed and its features are extracted. The extracted features are applied to the developed classification model to identify its true author. In this case, the matching paths within the classifier are increased, thus increasing the chances that the anonymous e-mail is precisely attributed to its true author.
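As a minimal sketch of the parsing step, the Python standard library's `email` package can pull out the header fields and the plain-text body while ignoring HTML parts and attachments. The function name `extract_fields` and the returned dictionary keys are hypothetical, and language detection (keeping only English text) is not attempted here.

```python
from email import message_from_string
from email.policy import default

def extract_fields(raw_message):
    """Return sender, recipient, subject, date, and plain-text body of a raw e-mail.

    Non-text parts (HTML, images, attachments) are ignored.
    """
    msg = message_from_string(raw_message, policy=default)
    # get_body() picks the best matching part; here only text/plain is accepted.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "sender": str(msg["From"] or ""),
        "recipient": str(msg["To"] or ""),
        "subject": str(msg["Subject"] or ""),
        "date": str(msg["Date"] or ""),
        "body": body,
    }
```

Because header fields can be forged, the values returned here describe what the message claims rather than who actually sent it; they are useful for grouping known e-mails of a suspect, not as proof of origin.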

6.1.1. Categorization phase: mining class labels

The e-mails of a suspect are grouped on the basis of the e-mail body as well as the e-mail header information. To perform the first type of grouping, clustering is used, on the basis of either contents or writing style features. The second type of grouping is straightforward and uses the e-mail sender, e-mail recipient, and e-mail time stamp. At the end of the grouping phase, each e-mail of a group is tagged with the respective group label. These labels are later used as class labels during classification.

Categorization based on e-mail body

There are two types of clustering: content-based and stylometry-based. Content-based clustering is used to determine the topic of discussion within a dataset. Stylometry-based clustering, on the other hand, is used to identify the different writing styles contained within a data collection. The process of applying clustering is the same in both cases; the only difference is in the preprocessing step. In content-based clustering, the common type of preprocessing is performed. More explicitly, once each e-mail is converted into a bag-of-words, the style markers (function words and punctuation) are dropped. In stylometry-based clustering, the syntactic features are retained. Once all the e-mails of each author are converted into feature vectors, clustering is applied to the e-mails of each author independently. The resultant clusters of an author, for example S1, are labeled {S1C1, S1C2, ..., S1Ck}. Similarly, e-mails of another author, S2, are clustered separately, and the resultant clusters are labeled {S2C1, S2C2, ..., S2Ck}. The cluster labels are used as class labels during the classification phase.

Categorization based on e-mail header

In the traditional classification approach to authorship identification, the e-mail sender is used as the class label. Here, e-mails of the same author are divided into different groups. This division is based on e-mail recipient and e-mail time stamp, and differentiates the different writing styles of the same user. People usually communicate with different categories of people at different times. For instance, most of the e-mails that a person writes during the day are exchanged with his/her co-workers. Similarly, e-mails written in the evening may be exchanged with family members and friends. Likewise, very few of the e-mails exchanged around midnight are likely to be addressed to one's colleagues. For simplicity, we divide the 24-hour day into three time brackets: morning, evening, and night. E-mails of a sender are therefore divided into three categories: e-mails sent in the morning are tagged as SM, e-mails sent in the evening are tagged as SE, and those sent at night are tagged as SN, where S represents the sender. SM, SE, and SN are used as class labels during the classification phase.
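The time-bracket labeling can be sketched as below. The essay does not state where the brackets begin and end, so the boundaries used here (morning 06:00-13:59, evening 14:00-21:59, night otherwise) and the function names are assumptions chosen only for illustration.

```python
from datetime import datetime

def time_bracket(hour):
    """Map an hour (0-23) to 'M', 'E', or 'N' using assumed boundaries."""
    if 6 <= hour < 14:
        return "M"  # morning
    if 14 <= hour < 22:
        return "E"  # evening
    return "N"      # night

def class_label(sender_id, sent_at):
    """Combine a sender identifier such as 'S1' with the time bracket, e.g. 'S1M'."""
    return f"{sender_id}{time_bracket(sent_at.hour)}"
```

For example, `class_label("S1", datetime(2024, 1, 5, 9, 30))` gives `"S1M"`, so all morning e-mails of sender S1 share one class label during classification.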

6.1.2. Classification phase

Once e-mails of all the senders are divided into distinct groups and the respective class labels for each group are determined, the next phase is to apply classification. This phase consists of feature extraction, model generation, and model application.

Feature extraction

Each e-mail body is converted into an n-dimensional vector of features. A feature could be a word frequency, a ratio of two quantities, or a boolean value.

Model generation and validation

Prior to the application of classification algorithms, each e-mail group is first divided into training and testing sets. At the end of the feature extraction phase, there are two sets of feature vectors (training and testing) for each suspect. Using the training set, selected classifiers are employed to generate a model. Using the testing set, the generated models are validated prior to their actual use.
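The train, generate, and validate cycle can be illustrated with a deliberately simple stand-in classifier. The essay does not name the classifiers used at this step; the nearest-centroid rule below is a swapped-in technique chosen only because it fits in a few lines, and the deterministic every-third-item split, the sample vectors, and the function names are likewise assumptions.

```python
from collections import defaultdict

def split(vectors, labels, test_every=3):
    """Send every test_every-th instance to the testing set, the rest to training."""
    train, test = [], []
    for i, item in enumerate(zip(vectors, labels)):
        (test if i % test_every == test_every - 1 else train).append(item)
    return train, test

def train_centroids(train):
    """Model generation: one mean vector (centroid) per class label."""
    grouped = defaultdict(list)
    for vector, label in train:
        grouped[label].append(vector)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }

def predict(model, vector):
    """Return the label whose centroid is closest to the vector."""
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(vector, model[label])),
    )

def accuracy(model, test):
    """Validation: fraction of testing instances whose label is predicted correctly."""
    if not test:
        return 0.0
    return sum(predict(model, v) == y for v, y in test) / len(test)
```

The same `predict` call is what the model application step would use on the feature vector of an anonymous e-mail, once the validation accuracy is judged acceptable.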

7. Conclusion

As a result of growing e-mail misuse, investigators need efficient automated methods and tools for analyzing e-mails. In our work, we apply statistical analysis, e-mail mining, and stylometric analysis to e-mail forensics in order to help investigators gather clues and evidence in investigations in which e-mail communication is relevant. This includes functionalities ranging from e-mail storing, editing, searching, and querying to more advanced ones such as authorship attribution. To obtain more credible results, the level of cohesion and harmony among the different analysis techniques needs to be increased.