Filtering Emails Using A Spam Filter Computer Science Essay



Junk e-mail has long been a problem on the Internet, and the problem has now become extremely serious. The growing popularity and low cost of e-mail have attracted the attention of marketers. With readily available bulk e-mail software and lists of e-mail addresses harvested from web pages and newsgroup archives, sending messages to millions of recipients is very easy and so cheap that it can be considered almost free. Consequently, these unsolicited e-mails bother users and fill their e-mail folders with unwanted messages. Few users, if any, have never received unsolicited e-mails. These unwanted messages are generally called unsolicited e-mails or spam. Spam also describes the action of sending out such mails. These unsolicited e-mails are also known as bulk mail, because they are generally sent out in large batches, and as junk mail, because they are worthless to most recipients.

Added to this, spammers are becoming more sophisticated and constantly manage to outsmart 'static' methods of fighting spam. The techniques currently used by most anti-spam software are static, meaning that they are fairly easy to evade by tweaking the message a little. To do this, spammers simply examine the latest anti-spam techniques and find ways to dodge them. To combat spam effectively, a new adaptive technique is needed. This method must keep up with spammers' tactics as they change over time. It must also be able to adapt to the particular organization that it is protecting from spam. The answer lies in Bayesian mathematics.

As the number of users connected to the Internet continues to skyrocket, electronic mail (E-mail) is quickly becoming one of the fastest and most economical forms of communication available. Since E-mail is extremely cheap and easy to send, it has gained enormous popularity not simply as a means for letting friends and colleagues exchange messages, but also as a medium for conducting electronic commerce. Unfortunately, the same virtues that have made E-mail popular among casual users have also enticed direct marketers to bombard unsuspecting E-mailboxes with unsolicited messages regarding everything from items for sale and get-rich-quick schemes to information about accessing pornographic Web sites.

With the proliferation of direct marketers on the Internet and the increased availability of enormous E-mail address mailing lists, the volume of junk mail (often referred to colloquially as "spam") has grown tremendously in the past few years. As a result, many readers of E-mail must now spend a non-trivial portion of their time on-line wading through such unwanted messages. Moreover, since some of these messages can contain offensive material (such as graphic pornography), there is often a higher cost to users of actually viewing this mail than simply the time to sort out the junk. Lastly, junk mail not only wastes user time, but can also quickly fill up server storage space, especially at large sites with thousands of users who may all be getting duplicate copies of the same junk mail.

As a result of this growing problem, automated methods for filtering such junk from legitimate E-mail are becoming necessary. Indeed, many commercial products are now available which allow users to handcraft a set of logical rules to filter junk mail. This solution, however, is problematic at best. First, systems that require users to hand-build a rule set to detect junk assume that their users are savvy enough to be able to construct robust rules. Moreover, as the nature of junk mail changes over time, these rule sets must be constantly tuned and retuned by the user. This is a time-consuming and often tedious process, which can be notoriously error-prone.

Rationale of the Research

OBJECTIVES OF THE WORK

Create an efficient filter

Pass both valid and junk mails and update the databases

Train the filter

Classify the mails as spam or not

Reduce the possibility of false positives

CHAPTER 2

LITERATURE REVIEW

2.1 DEFINITION OF TERMS:

The term "spam" is sometimes used loosely to mean any message broadcast to multiple recipients (regardless of intent) or any message that is undesired. Here we intend the narrower, stricter definition: unsolicited commercial email sent to an account by a person unacquainted with the recipient.

The term "ham" is used to refer to any legitimate mail (mail that a user wants to receive). The classification of which mail is spam and which mail is ham varies from individual to individual.

The term "false positive" is used to refer to any legitimate mail (ham) that is wrongly classified as spam. This may have drastic consequences, since an important mail might be lost, filtered out, or overlooked because it was classified as spam.

The term "false negative" is used to refer to a spam mail that is wrongly classified as ham. This is not as drastic as a false positive; at most it causes a little irritation, since the user has to go through one extra mail.

The term "stop-list" refers to a list of commonly used words that are filtered out so that they are not stored in the database.

The term "black-list" refers to the list of e-mail addresses from which a client will not receive mail. Mail received from these addresses is blocked directly, without passing through the filters.

The term "white-list" refers to a list of e-mail addresses whose mail does not pass through the filter and goes directly to the inbox.

The term "UBE" refers to Unsolicited Bulk email.

The term "UCE" refers to Unsolicited Commercial email. This is the most common type of spam. It does not include chain letters or religious messages though.

The term "spamvertising" refers to advertising through the medium of spam.
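
The way the black-list, white-list and stop-list defined above act before the filter itself can be illustrated with a minimal Python sketch. The function names (route, tokens_for_database), the sample addresses, the sample stop words, and the choice to check the black-list before the white-list are all assumptions made for illustration; the essay does not specify them.

```python
# Hypothetical stop-list: common words that should not be stored in the database.
STOP_LIST = {"the", "a", "and", "of", "to"}


def route(sender, black_list, white_list):
    """Decide what happens to a mail before any content filtering.

    Assumption: the black-list is checked first, so an address that is
    on both lists is blocked.
    """
    sender = sender.lower()
    if sender in black_list:
        return "blocked"   # never reaches the filter
    if sender in white_list:
        return "inbox"     # bypasses the filter
    return "filter"        # unknown sender: pass to the spam filter


def tokens_for_database(text, stop_list=STOP_LIST):
    """Split a message into lowercase words and drop stop-list words."""
    return [word for word in text.lower().split() if word not in stop_list]


# Illustrative data only.
black_list = {"spammer@example.com"}
white_list = {"friend@example.com"}

decision_blocked = route("Spammer@example.com", black_list, white_list)
decision_inbox = route("friend@example.com", black_list, white_list)
decision_filter = route("stranger@example.com", black_list, white_list)
kept_tokens = tokens_for_database("The price of a ticket")
```

Only mail from senders on neither list reaches the content filter, so the lists act as a cheap first pass.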

2.2 BAYESIAN FILTERING TECHNIQUE:

Most people spend significant time every day on the task of distinguishing spam from useful e-mail. We have better things to do. Spam-filtering software can help. This article discusses one of many possible mathematical foundations for a key aspect of spam filtering: generating an indicator of "spamminess" from a collection of tokens representing the content of an e-mail. The approach described here truly has been a distributed effort in the best open-source tradition. Paul Graham suggested an approach to filtering spam in his on-line article, "A Plan for Spam". We took his approach for generating probabilities associated with words, altered it slightly and proposed a Bayesian calculation for dealing with words that hadn't appeared very often. We then suggested an approach based on the chi-square distribution for combining the individual word probabilities into a combined probability (actually a pair of probabilities, see below) representing an e-mail.

For each word that appears in the corpus, we calculate:

s(w) = (the number of times the word w occurs in spam e-mails) / (the total number of occurrences of w)

h(w) = (the number of times the word w occurs in ham e-mails) / (the total number of occurrences of w)

where the total number of occurrences of w = the number of times w occurs in spam e-mails + the number of times w occurs in ham e-mails.
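
The per-word quantities s(w) and h(w) defined above can be computed with a short Python sketch. The function name word_scores, the whitespace tokenization, the lowercasing, and the tiny sample corpus are assumptions made for illustration and are not part of the original description.

```python
from collections import Counter


def word_scores(spam_messages, ham_messages):
    """Return {word: (s(w), h(w))} using the occurrence-based definitions above.

    Assumption: words are obtained by lowercasing and splitting on whitespace.
    """
    spam_counts = Counter(w for m in spam_messages for w in m.lower().split())
    ham_counts = Counter(w for m in ham_messages for w in m.lower().split())

    scores = {}
    for word in spam_counts.keys() | ham_counts.keys():
        # Total occurrences = occurrences in spam + occurrences in ham.
        total = spam_counts[word] + ham_counts[word]
        scores[word] = (spam_counts[word] / total, ham_counts[word] / total)
    return scores


# Illustrative corpus only.
spam = ["buy cheap pills now", "cheap pills cheap offer"]
ham = ["meeting notes attached", "cheap flights for the meeting"]

scores = word_scores(spam, ham)
cheap_s, cheap_h = scores["cheap"]  # "cheap" occurs 3 times in spam, 1 time in ham
```

With these definitions s(w) + h(w) = 1 for every word in the corpus, so a word seen only in spam gets s(w) = 1 and a word seen only in ham gets s(w) = 0.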
