Junk e-mail has long been a problem on the Internet, and it has now become extremely serious. The growing popularity and low cost of e-mail have attracted the attention of marketers. With readily available bulk e-mail software and lists of addresses harvested from web pages and newsgroup archives, sending messages to millions of recipients is easy and almost free. Consequently, unsolicited messages bother users and fill their e-mail folders. Few users, if any, have never received one. These unwanted messages are generally called unsolicited e-mail or spam; the word "spam" also describes the act of sending them. They are also known as bulk mail, because they are generally sent out in large batches, and as junk mail, because they are worthless to most recipients. In addition, spammers are becoming more sophisticated and constantly manage to outsmart static methods of fighting spam. The techniques used by most anti-spam software are static, meaning that they are fairly easy to evade by tweaking the message a little: spammers simply examine the latest anti-spam techniques and find ways to dodge them. To combat spam effectively, an adaptive technique is needed. Such a method must keep up with spammers' tactics as they change over time, and it must also be able to adapt to the particular organization that it is protecting. The answer lies in Bayesian mathematics.
As the number of users connected to the Internet continues to skyrocket, electronic mail (E-mail) is quickly becoming one of the fastest and most economical forms of communication available. Since E-mail is extremely cheap and easy to send, it has gained enormous popularity not simply as a means for letting friends and colleagues exchange messages, but also as a medium for conducting electronic commerce. Unfortunately, the same virtues that have made E-mail popular among casual users have also enticed direct marketers to bombard unsuspecting E-mailboxes with unsolicited messages regarding everything from items for sale and get-rich-quick schemes to information about accessing pornographic Web sites.
With the proliferation of direct marketers on the Internet and the increased availability of enormous E-mail address mailing lists, the volume of junk mail (often referred to colloquially as "spam") has grown tremendously in the past few years. As a result, many readers of E-mail must now spend a non-trivial portion of their time on-line wading through such unwanted messages. Moreover, since some of these messages can contain offensive material (such as graphic pornography), the cost to users of actually viewing this mail is often higher than simply the time needed to sort out the junk. Lastly, junk mail not only wastes user time but can also quickly fill up server storage space, especially at large sites with thousands of users who may all be getting duplicate copies of the same junk mail.
As a result of this growing problem, automated methods for filtering such junk from legitimate E-mail are becoming necessary. Indeed, many commercial products are now available that allow users to handcraft a set of logical rules to filter junk mail. This solution, however, is problematic at best. First, systems that require users to hand-build a rule set to detect junk assume that their users are savvy enough to construct robust rules. Moreover, as the nature of junk mail changes over time, these rule sets must be constantly tuned and retuned by the user. This is a time-consuming, often tedious, and notoriously error-prone process.
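The brittleness of hand-built rules is easy to demonstrate. The sketch below assumes a hypothetical rule set of fixed phrases (RULES) and a hypothetical function matches_static_rules; neither comes from any particular product. It only illustrates how a small change in spelling slips past an exact-match rule.

```python
# Hypothetical hand-written rule set: fixed phrases that mark a subject as junk.
RULES = ["free money", "get rich quick"]


def matches_static_rules(subject):
    """Flag a subject line if it contains any hand-written phrase."""
    lowered = subject.lower()
    return any(phrase in lowered for phrase in RULES)


print(matches_static_rules("Get rich quick today"))  # True: exact phrase present
print(matches_static_rules("G3t r1ch qu1ck today"))  # False: tweaked spelling evades the rule
```

Catching the second subject line would require the user to add yet another rule, which is exactly the constant retuning described above.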
Rationale of the Research
OBJECTIVES OF THE WORK
Create an efficient filter
Pass both valid and junk mails through the filter and update the databases
Train the filter
Classify the mails as spam or not
Reduce the possibility of false positives
2.1 DEFINITION OF TERMS:
The term "spam" is sometimes used loosely to mean any message broadcast to multiple recipients (regardless of intent) or any message that is undesired. Here we intend the narrower, stricter definition: unsolicited commercial email sent to an account by a person unacquainted with the recipient.
The term "ham" is used to refer to any legitimate mail (mail that a user wants to receive). The classification of which mail is spam and which mail is ham varies from individual to individual.
The term "false positive" refers to any legitimate mail (ham) that is wrongly classified as spam. This can have drastic consequences, since an important mail may be lost, filtered out, or overlooked because it was classified as spam.
The term "false negative" refers to a spam mail that is wrongly classified as ham. This is less drastic than a false positive: at most, it causes the user the minor irritation of having to go through an extra mail.
The term "stop-list" refers to a list of commonly used words that are filtered out and never stored in the database.
The term "black-list" refers to the list of e-mail addresses from which a client will not receive mail. Mail received from these addresses is blocked directly, without passing through the filters.
The term "white-list" refers to a list of e-mail addresses whose mail does not pass through the filter and goes directly to the inbox.
The term "UBE" refers to Unsolicited Bulk email.
The term "UCE" refers to Unsolicited Commercial email. This is the most common type of spam. It does not, however, include chain letters or religious messages.
The term "spamvertising" refers to advertising through the medium of spam.
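Several of the terms defined above describe mechanisms that can be sketched in a few lines. The example below is an illustrative assumption, not the implementation used in this work: it assumes the black-list is checked before the white-list, that addresses are compared case-insensitively, and that tokens are split on whitespace. The names route_message, strip_stop_words, and count_errors, as well as the sample addresses and words, are hypothetical.

```python
# Hypothetical lists; real ones would be loaded from the user's configuration.
BLACK_LIST = {"bulk@spammer.example"}
WHITE_LIST = {"colleague@office.example"}
STOP_LIST = {"the", "a", "an", "to", "and", "of"}


def route_message(sender):
    """Decide what happens to a message before any content filtering."""
    sender = sender.lower()
    if sender in BLACK_LIST:  # assumption: the black-list is checked first
        return "blocked"
    if sender in WHITE_LIST:
        return "inbox"
    return "filter"


def strip_stop_words(text):
    """Drop stop-list words so they are never stored in the database."""
    return [token for token in text.lower().split() if token not in STOP_LIST]


def count_errors(results):
    """Count false positives and false negatives.

    `results` is a list of (actual, predicted) label pairs,
    where each label is either "spam" or "ham".
    """
    false_positives = sum(1 for a, p in results if a == "ham" and p == "spam")
    false_negatives = sum(1 for a, p in results if a == "spam" and p == "ham")
    return false_positives, false_negatives


print(route_message("Bulk@Spammer.example"))
print(strip_stop_words("Send the report to the team"))
print(count_errors([("ham", "spam"), ("spam", "ham"), ("spam", "spam")]))
```

Only messages routed to "filter" would reach the Bayesian classifier; the other two outcomes bypass it entirely.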
2.2 BAYESIAN FILTERING TECHNIQUE:
Most people spend significant time every day distinguishing spam from useful e-mail. We have better things to do, and spam-filtering software can help. This section discusses one of many possible mathematical foundations for a key aspect of spam filtering: generating an indicator of "spamminess" from a collection of tokens representing the content of an e-mail. The approach described here has truly been a distributed effort in the best open-source tradition. Paul Graham suggested an approach to filtering spam in his on-line article, "A Plan for Spam". We took his approach for generating probabilities associated with words, altered it slightly, and proposed a Bayesian calculation for dealing with words that had not appeared very often. We then suggested an approach based on the chi-square distribution for combining the individual word probabilities into a combined probability (actually a pair of probabilities, see below) representing an e-mail.
For each word w that appears in the corpus, we calculate:
s(w) = (the number of times w occurs in spam e-mails) / (the total number of occurrences of w)
h(w) = (the number of times w occurs in ham e-mails) / (the total number of occurrences of w)
where the total number of occurrences of w = (the number of times w occurs in spam e-mails) + (the number of times w occurs in ham e-mails).
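The ratios above can be computed directly from token counts. The following is a minimal sketch, assuming a simple whitespace tokenizer and two small in-memory lists of training messages. The function names count_tokens and word_ratios and the sample messages are illustrative and not part of the original work.

```python
from collections import Counter


def count_tokens(messages):
    """Count how many times each lowercase token occurs across messages."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts


def word_ratios(word, spam_counts, ham_counts):
    """Return (s(w), h(w)) for a word, or None if the word was never seen."""
    spam_n = spam_counts[word]  # a Counter returns 0 for unseen words
    ham_n = ham_counts[word]
    total = spam_n + ham_n      # total occurrences of w in the corpus
    if total == 0:
        return None
    return spam_n / total, ham_n / total


# Hypothetical training data, for illustration only.
spam_messages = ["buy cheap pills now", "cheap loans buy now"]
ham_messages = ["meeting moved to noon", "can you buy milk"]

spam_counts = count_tokens(spam_messages)
ham_counts = count_tokens(ham_messages)

print(word_ratios("buy", spam_counts, ham_counts))
print(word_ratios("cheap", spam_counts, ham_counts))
```

Because both ratios share the same denominator, s(w) + h(w) = 1 for every word that has been seen at least once. A word that appears only in spam, such as "cheap" in this toy corpus, gets s(w) = 1 and h(w) = 0.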