Phishing is a malicious activity in which attackers lure users into visiting fraudulent websites. Even though many web users are aware of such attacks, a large number still fall victim to them. A phishing attack typically steals a user's confidential information, such as login username, password, and credit card details. A successful phishing detection system distinguishes phishing websites from legitimate ones.
Phishing is a social engineering technique that deceives users and exploits the poor usability of current web security technologies. To lure the victim into giving up sensitive information, the message may include imperatives such as "verify your account" or "confirm billing information". Once the victim has revealed a password, the attacker can access and use the victim's account for fraudulent purposes or spamming. Only specialists can identify these phishing websites immediately; most web users are not specialists in computer engineering and fall victim by providing their personal and financial details to the phishing artist. Thus, an efficient mechanism is required to distinguish phishing websites from legitimate ones in order to protect users' credentials.
II. Study of existing Fake website detection techniques
A. WHOIS: It is a "query and response" protocol widely used for querying databases that store the registered users or assignees of an Internet resource, such as a domain name, an IP address block, or an autonomous system, but it is also used for a wider range of other information. The protocol stores and delivers database content in a human-readable format. This can provide insight into a domain's history and additional information. A WhoIs lookup can be used to see who owns a domain name, how many pages from a site are listed with Google, or even to search WhoIs address listings for a website's owner.
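As a rough illustration, the sketch below parses a raw WHOIS response for the registrar and registration date, which feature-based detectors often examine (a newly registered domain is suspicious). The field-name synonyms and the sample response are assumptions of this sketch; real WHOIS output varies widely between registries.

```python
import re
from datetime import datetime

def parse_whois(text):
    """Pull a few fields of interest out of a raw WHOIS response.

    WHOIS output is line-oriented 'Key: Value' text, but field names differ
    between registries, so several synonyms are tried per field.
    """
    fields = {
        "created": ("creation date", "created", "registered on"),
        "registrar": ("registrar",),
    }
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        for name, synonyms in fields.items():
            if key in synonyms and name not in result:
                result[name] = value
    return result

def domain_age_days(created, now):
    """Age of the domain in days -- very young domains are suspicious."""
    return (now - datetime.fromisoformat(created.rstrip("Z"))).days
```

A detector would combine the parsed age and registrar with the other features described below rather than use it alone.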
B. Browser-Integrated Anti-phishing toolbars:
Google Safe Browsing: Google Safe Browsing is a service provided by Google that supplies lists of URLs of web resources containing malware or phishing content. The disadvantage of this approach is that phishing sites not yet blacklisted are not recognized.
NetCraft toolbar: Netcraft provides a browser toolbar to report and block phishing sites identified by the toolbar user community.
Toolbar features include:
It displays the hosting location of a given website (e.g., the main online banking site of a large US bank should not be hosted in the former Soviet Union).
Once a phishing URL is reported, it is blocked. The toolbar natively traps cross-site scripting and other suspicious URLs containing characters that have no common purpose other than to deceive.
Netcraft supervisor validation is used to contain the impact of any false reporting of URLs.
It displays the browser navigational controls (toolbar and address bar) in all windows, to defend against pop-up windows that attempt to hide the navigational controls and disguise their location.
eBay toolbar: the eBay solution is designed for eBay and PayPal and uses a so-called "Account Guard" that changes color if the user is on a spoofed site. VeriSign provides a commercial anti-phishing service.
McAfee SiteAdvisor: SiteAdvisor is a service that reports on the safety of websites by crawling the web and testing the sites it finds for malware and spam. It includes automated crawlers that browse websites, perform tests, and create threat ratings for each visited site. One popular solution to this problem is to add security features within an Internet browser that warn users whenever a phishing site is being accessed. Such browser security is often provided by a mechanism known as 'blacklisting', which matches a given URL against a list of URLs belonging to a blacklist. Large companies such as Microsoft, McAfee, and Google maintain blacklists of phishing websites.
C. Related Work
In , it is observed that blacklists are easily evaded by attackers, who attach new top-level domains to existing URLs and modify them so that they do not appear in blacklists, allowing identity theft and the capture of internet users' financial details. The approach generates new URLs using various heuristics and then applies a matching algorithm to match each new URL against entries in the blacklist.
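The URL-generation idea can be sketched as follows: candidate URLs are derived from each blacklist entry by re-attaching common top-level domains, and a suspect URL is matched against the expanded set. The TLD list and the single "swap the TLD" heuristic are simplifying assumptions here; the cited approach uses several heuristics.

```python
from urllib.parse import urlsplit

# Assumption: a small sample of common TLDs attackers re-attach.
COMMON_TLDS = ("com", "net", "org", "info", "biz")

def variants(url):
    """Generate candidate URLs an attacker might derive from a known one,
    by re-attaching common top-level domains to the same host name."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    base = host.rsplit(".", 1)[0]          # strip the current TLD
    return {f"{parts.scheme}://{base}.{tld}{parts.path}" for tld in COMMON_TLDS}

def matches_blacklist(url, blacklist):
    """Flag a URL if it matches the blacklist directly or matches any
    TLD variant generated from a blacklisted entry."""
    expanded = set()
    for entry in blacklist:
        expanded |= variants(entry)
    return url in expanded or url in blacklist
```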
In , a method based on feature extraction of websites and a classification approach is used. The classifier is trained with features extracted from a training set of legitimate and non-legitimate websites. An unknown website is then tested by the classifier, which decides whether the website is fake or real.
Source code of the non-legitimate website is captured.
Features are extracted in two phases: identity extraction and feature extraction.
More focus is placed on the URL and source code.
The page URL is checked for the number of slashes.
A. Feature Extraction:
The features extracted are listed in Table I:
Phishing websites often use an IP address as the URL.
Starts with http or https
If the URL starts with https, it is less likely to be a phishing page.
If the website is listed in the Alexa top 5000, it is considered a safe URL.
Dots in URL
The URL in the source code should not contain too many dots; a large number of dots suggests a phishing website.
Slash in page address
The page address should not contain too many slashes. If it contains more than five slashes, the URL is considered a phishing URL.
Slash in url
The URL should not contain too many slashes. If it contains more than five slashes, it is considered a phishing URL.
Use of @ Symbol
The presence of an @ symbol in the page address means that everything before the @ is ignored by the browser (it is treated as user information) and the address after it is used instead. So the page URL should not contain an @ symbol.
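A minimal sketch of the URL-based checks listed above (IP address as host, https, dot and slash counts, @ symbol). The five-slash threshold follows the text; treating any host that parses as an IP address as suspicious, and leaving the dot count as a raw number rather than a threshold, are interpretations in this sketch.

```python
import ipaddress
from urllib.parse import urlsplit

def lexical_features(url):
    """Compute the simple URL-based features described in the text."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "host_is_ip": host_is_ip,                 # IP address used as URL
        "uses_https": parts.scheme == "https",    # starts with https
        "dots": url.count("."),                   # dots in URL
        "too_many_slashes": url.count("/") > 5,   # more than five slashes
        "has_at_symbol": "@" in url,              # @ hides the real target
    }
```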
The details of a phishing website will usually not be available in the "whois" database. The "whois" database is checked for the existence of data pertaining to a particular website.
The <meta> tag provides metadata about the HTML document. Meta elements are typically used to specify the page description, keywords, author of the document, last modified date, and other metadata.
If there is no relevance between the URL address and the contents of the META tag, the page can be a phishing website.
META Keyword Tag
The META Keyword Tag provides keywords related to the web page which may be the identity of a web page. If there is no relevance between the URL address and contents of the META Keyword tag then it can be a phish.
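One hedged way to implement the META keyword relevance check: collect the keywords with Python's built-in HTML parser and test whether any of them appears in the page's domain name. Matching keywords against the domain as plain substrings is a deliberate simplification of "relevance".

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class MetaCollector(HTMLParser):
    """Collect the content of <meta name="keywords"/"description"> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            name = (a.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = a.get("content", "")

def meta_matches_url(url, html):
    """True if any META keyword also appears in the URL's domain name."""
    parser = MetaCollector()
    parser.feed(html)
    domain = (urlsplit(url).hostname or "").lower()
    words = [w.strip().lower() for w in parser.meta.get("keywords", "").split(",")]
    return any(w and w in domain for w in words)
```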
An anchor tag contains an href attribute whose value is the URL to which the page links. A foreign anchor occurs when the domain name in that URL differs from the domain of the page URL. A website may contain some foreign anchors, but if the number of foreign anchors exceeds a threshold, it is a sign of a phishing website. All anchor <a> tags are checked.
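The foreign-anchor feature can be sketched by counting <a> tags whose href resolves to a host other than the page's own. Returning a ratio rather than applying a fixed cutoff is a design choice of this sketch, since the text does not specify the exact threshold.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class AnchorCounter(HTMLParser):
    """Count anchors whose href points at a domain other than the page's."""
    def __init__(self, page_domain):
        super().__init__()
        self.page_domain = page_domain
        self.total = 0
        self.foreign = 0
    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        self.total += 1
        host = urlsplit(href).hostname   # None for relative links
        if host and host != self.page_domain:
            self.foreign += 1

def foreign_anchor_ratio(page_url, html):
    counter = AnchorCounter(urlsplit(page_url).hostname)
    counter.feed(html)
    return counter.foreign / counter.total if counter.total else 0.0
```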
Server Form Handler (SFH)
Forms enable a user to pass data to a server. The action attribute of the form tag specifies the URL to which the form data should be transferred. In the case of a phishing website, the value of the action attribute of the form tag often comprises one of the following:
1) the value is empty,
2) the value is #,
3) the value is void.
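A minimal check for the suspicious action values listed above; treating a missing action attribute the same as an empty one is an assumption of this sketch.

```python
from html.parser import HTMLParser

# Values of the form action attribute flagged as suspicious (from the text).
SUSPICIOUS_ACTIONS = {"", "#", "void"}

class FormActionCheck(HTMLParser):
    """Flag forms whose action attribute is empty, '#', or 'void'."""
    def __init__(self):
        super().__init__()
        self.suspicious = False
    def handle_starttag(self, tag, attrs):
        if tag == "form":
            action = (dict(attrs).get("action") or "").strip().lower()
            if action in SUSPICIOUS_ACTIONS:
                self.suspicious = True

def has_suspicious_form(html):
    parser = FormActionCheck()
    parser.feed(html)
    return parser.suspicious
```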
Websites often request images, scripts, CSS files, etc. from other websites. A phishing website imitating a legitimate one will request these objects from the same pages as the legitimate site; in that case, the domain name used for the requests will not match the page URL. Request URLs are collected from the src attribute of the <img> and <script> tags, the href attribute of the link tag, and the codebase attribute of the object and applet tags. If the domain in these URLs is a foreign domain, it indicates a phishing website.
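The request-URL feature can be sketched the same way as the anchor check: collect resource URLs from the tags named above and measure what fraction come from a foreign domain. The ratio form is this sketch's choice; the text gives no explicit threshold.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Tag -> attribute naming the requested resource, as described in the text.
REQUEST_ATTRS = {"img": "src", "script": "src", "link": "href",
                 "object": "codebase", "applet": "codebase"}

class RequestURLCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []
    def handle_starttag(self, tag, attrs):
        attr = REQUEST_ATTRS.get(tag)
        if attr:
            value = dict(attrs).get(attr)
            if value:
                self.urls.append(value)

def foreign_request_ratio(page_url, html):
    """Fraction of requested objects served from a domain other than the page's."""
    page_domain = urlsplit(page_url).hostname
    collector = RequestURLCollector()
    collector.feed(html)
    hosts = [urlsplit(u).hostname for u in collector.urls]
    foreign = sum(1 for h in hosts if h and h != page_domain)
    return foreign / len(hosts) if hosts else 0.0
```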
A blacklist contains a list of suspected websites and is provided as a third-party service. The page URL is checked against the blacklist. If the page URL is present in the blacklist, it is a phishing website and the value of feature 5 is assigned -1; otherwise the value is 1.
The following table, Table II, shows the various parameters of the classification algorithms, namely MultiLayer Perceptron, Decision Tree (J48), and Naïve Bayes. The prediction accuracy is measured as the ratio of the number of correctly classified instances in the test dataset to the total number of test cases.
Another parameter compared is the time taken to build the model.
A solution proposed in  describes a hybrid phish detection method based on information extraction (IE) and information retrieval (IR) techniques. The identity-based component detects phishing webpages by directly comparing the inconsistency between their actual identity and the identity they claim. The keywords-retrieval component uses IR algorithms and exploits the power of search engines to detect phishing websites. This method requires no training data and no prior knowledge of phishing signatures, and is thus robust against new phishing patterns.
CANTINA is a content-based approach to detecting phishing websites. It is based on TF-IDF (term frequency/inverse document frequency), a measure used in information retrieval, and more specifically on the Robust Hyperlinks algorithm previously developed for overcoming broken hyperlinks.
Robust Hyperlink: if a particular page is not found at its basic URL, a lexical signature is formed from the 5 words with the highest TF-IDF values and entered into a search engine to locate a robust hyperlink whose signature closely matches it. If no such link is found, the URL belongs to a fake website; otherwise it is legitimate. CANTINA examines the content of a web page to determine whether it is legitimate, in contrast to approaches that look at other characteristics of a webpage, such as the URL and its domain name. Results show that CANTINA is good at detecting phishing sites, catching 94-97% of them.
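A toy version of the lexical-signature step: the five highest-TF-IDF words of a page's text. The small in-memory corpus stands in for the document-frequency statistics a real deployment would need, and the tokenizer is deliberately crude.

```python
import math
import re
from collections import Counter

def lexical_signature(page_text, corpus, k=5):
    """Top-k TF-IDF words of a page: the 'lexical signature' CANTINA feeds
    to a search engine. `corpus` is a list of other documents used only to
    estimate document frequencies (a stand-in for real web statistics)."""
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())
    tf = Counter(tokenize(page_text))
    n_docs = len(corpus) + 1
    def idf(word):
        df = 1 + sum(word in tokenize(doc) for doc in corpus)
        return math.log(n_docs / df)
    scored = sorted(tf, key=lambda w: tf[w] * idf(w), reverse=True)
    return scored[:k]
```

Distinctive brand terms such as a bank's name get high TF-IDF scores, which is why the signature tends to lead a search engine back to the legitimate site.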
The solution described in  presents a novel classification method that identifies malicious web pages based on static attributes. It analyzes the underlying static attributes of the initial HTTP response and HTML code. Static attributes that characterize malicious actions can be used to identify a majority of malicious web pages. Combining a generic classifier, high-interaction client honeypots, and this new classification method into a hybrid system leads to significant performance improvements.
The following is a brief description of the classifiers mainly used in phishing website detection systems:
A. Naïve Bayes
The Naïve Bayes classifier works on a simple, yet comparatively intuitive, concept, and in some cases it outperforms many more complex algorithms. It uses the variables contained in the data sample by observing them individually, independently of each other. The Naïve Bayes classifier is based on the Bayes rule of conditional probability.
It classifies data in two steps
(a) Using the training samples, the method estimates the parameters of a probability distribution, assuming features are conditionally independent given the class.
(b) For any unseen test sample, the method computes the posterior probability of that sample belonging to each class.
The method then classifies the test sample according to the largest posterior probability.
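The two steps above can be sketched with a tiny categorical Naïve Bayes over feature dictionaries; add-one smoothing and log-space scoring are standard implementation choices, not something the text specifies.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Step (a): estimate class priors and per-feature conditional counts
    from (feature_dict, label) pairs."""
    priors = Counter(label for _, label in samples)
    counts = defaultdict(Counter)   # (feature, label) -> value counts
    for features, label in samples:
        for name, value in features.items():
            counts[(name, label)][value] += 1
    return priors, counts, len(samples)

def classify_nb(features, model):
    """Step (b): pick the class with the largest (log) posterior,
    using add-one smoothing for unseen feature values."""
    priors, counts, n = model
    best, best_score = None, -math.inf
    for label, prior in priors.items():
        score = math.log(prior / n)
        for name, value in features.items():
            c = counts[(name, label)]
            score += math.log((c[value] + 1) / (sum(c.values()) + 2))
        if score > best_score:
            best, best_score = label, score
    return best
```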
B. Multilayer Perceptron
A multi-layer perceptron is a neural network model trained with the backpropagation algorithm. It maps sets of input data onto corresponding outputs and is a standard model for supervised pattern recognition. To overcome the representational limit of simple perceptrons, a multi-layer perceptron uses hidden layers between the input layer and the output layer. If the hidden layer is large enough, it can approximate any complex function.
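The XOR function is the classic example of the representational limit mentioned above: no single perceptron can compute it, but one hidden layer can. The weights below are hand-picked rather than learned by backpropagation, purely to keep the sketch short.

```python
def step(x):
    """Threshold activation of a classic perceptron unit."""
    return 1 if x > 0 else 0

def xor_mlp(x1, x2):
    """XOR via one hidden layer: h_or fires when either input is on,
    h_and when both are, and the output fires for 'or but not and'."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - 2 * h_and - 0.5)
```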
C. Decision Tree Induction (J48)
A decision tree is a predictive machine-learning model that decides the target value (dependent variable) of a new sample based on various attribute values of the available data. The internal nodes of a decision tree denote the different attributes, the branches between the nodes tell us the possible values that these attributes can have in the observed samples, while the terminal nodes tell us the final value (classification) of the dependent variable.
The attribute that is to be predicted is known as the dependent variable, since its value depends upon, or is decided by, the values of all the other attributes. The other attributes, which help in predicting the value of the dependent variable, are known as the independent variables in the dataset.
The J48 Decision tree classifier follows a simple algorithm. To classify a new item, it first creates a decision tree based on the attribute values of the available training data. Whenever it encounters a training set, it identifies the attribute that discriminates the various instances most clearly. This attribute, which tells us the most about the data instances so that we can classify them best, is said to have the highest information gain. Among the possible values of this attribute, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category all have the same value for the target variable, then we terminate that branch and assign to it the target value obtained. See Fig. 1 for an example of a decision tree.
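The attribute selection described above reduces to computing information gain, the entropy drop from splitting on an attribute. A minimal sketch over (feature_dict, label) samples:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(samples, attribute):
    """Entropy reduction obtained by splitting (feature_dict, label) samples
    on one attribute -- the quantity J48 maximizes at each node."""
    labels = [label for _, label in samples]
    by_value = {}
    for features, label in samples:
        by_value.setdefault(features[attribute], []).append(label)
    remainder = sum(len(subset) / len(samples) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder
```

An attribute that splits the samples into pure subsets gets the maximum gain and would be chosen as the root of the subtree.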
Fig 1: A decision tree
Decision Tree Algorithm:
As can be seen from Table II, the Decision Tree algorithm's prediction accuracy is 98.5%, whereas its mean absolute error is 0.292, which is substantially greater than that of the Multilayer Perceptron and Naïve Bayes algorithms. Hence, if improvements are made to decrease the mean absolute error of the Decision Tree algorithm, the improved algorithm can be used to detect phishing websites rapidly and successfully. Further work can therefore be done to decrease the mean absolute error of the Decision Tree.
PHP will be used for feature extraction. Features will be extracted into either a .txt or .xls file, which will be converted to .arff format for feeding data to the classifier. See Fig. 2.
The classifier will then decide whether the given website is fake or legitimate.
Improved Decision Tree
In this paper, different techniques to detect fake websites have been presented. As we have seen, a large number of such techniques exist, but false positives (false alarms) are still present. Analysis shows that improvements are needed to decrease the mean absolute error of the Decision Tree algorithm; an improved Decision Tree algorithm can be developed.
Mrs. Hetal Rajpura wishes to acknowledge Prof. Hiteishi Diwanji for showing her the right path to carry out her research work and for her constant support and guidance, and all the staff members of the Computer Department, L. D. College of Engineering, for extending their kind support throughout the work. Prof. Hiteishi Diwanji wishes to acknowledge her family and all the staff members at L. D. College of Engineering.