Improving Accuracy of SMS Spam Detection using Machine Learning


08/02/20 Information Technology


Improving the Accuracy of SMS Spam Detection Using Machine Learning, Deploying It in a Cloud Environment, and Comparing Performance Between the Cloud Platform and a Local Machine

Abstract

Cell phones are among the most widely used communication devices in the world, and the number of mobile phone users increases every day. This rapid growth has led to a significant increase in SMS spam. Spam messages are sent to many cell phone users at once, regardless of whether the recipients are interested. Besides being annoying, spam messages are risky: they can trick you into installing malware on your device, which might steal your personal data. Very little SMS spam filtering software is currently available, partly because of the limited availability of public datasets.

In this paper, I present how the accuracy of detecting spam messages can be increased using different machine learning classifiers such as Decision Tree, KNN, Naïve Bayes, and Support Vector Machine (the final selection is yet to be decided). I will also implement this machine learning model on the AWS cloud platform and examine the key challenges, benefits, and cost. In addition, I will try to answer how the implementation differs in the cloud and how performance and accuracy compare between a local machine and the cloud.

 

Keywords—SMS Spam, Machine Learning Classifiers, Cloud platforms, Amazon Sagemaker.

I.     Introduction

Short Message Service (SMS) is the exchange of short text messages between a sender and a receiver over the cellular network. A single text message is limited to 160 alphanumeric characters, which is why it is called a short message service. SMS is one of the most important means of communication in places where sending an SMS is much cheaper than making a phone call. Messages are carried over the control path between the mobile phone and the tower using communication protocols. The popularity of SMS has increased greatly, leading network providers to reduce the cost of SMS to a level that anyone can afford. In some areas of Asia, text messaging is free and unlimited. This has opened a channel for spammers. SMS spam refers to any unsolicited or irrelevant text message, usually sent to a large group of users for advertising, spreading malware, phishing, and so on. A Cloudmark study [1] reported that the amount of spam varies widely from region to region. In North America, for example, much less than 1% of SMS messages were spam in 2010, whereas in parts of Asia up to 30% of messages were spam.

SMS spam carries many risks: spammers can obtain personal information from users, install malware by asking users to click on specific links, and sell phone numbers to other spammers if a user replies to a spam message. There is also the possibility of blocking a legitimate message by mistaking it for spam, which might be one of the reasons why network providers do not provide spam filtering software. Similarly, academic researchers face difficulties in developing a sophisticated application that filters spam efficiently because of the limited availability of public SMS datasets. To address this, I will use the largest SMS dataset known to me, which was also used by T. A. Almeida, J. M. Gómez Hidalgo, and T. P. Silva in one of their research works [2].

The rest of this paper is organized as follows. Section II reviews the existing work and its limitations (Literature Review). Section III gives an overview of my Research Plan. Finally, Section IV provides the conclusions and outlines future work.

Machine Learning in cloud platforms: The connection between machine learning and cloud computing comes into the picture when we need infrastructure. Machine learning needs a large amount of processing power, as well as storage space for the data. This can be achieved by setting up a cluster of several powerful machines, but that hardware is expensive and may not need to run all the time. A cloud platform solves these problems by providing hardware infrastructure over the internet that can be released once the work is done. Cloud platforms also offer other benefits for implementing machine learning models, such as pay-as-you-go pricing, where the user is charged based on usage; scalability of the platform; and API-based machine learning services, with which a developer can add intelligence to any application using pre-trained models.

II.    Literature Review

A.    Existing work on Spam filtering

Tiago A. Almeida and others [3] evaluated several machine learning classifiers and tabulated the accuracy of each. The research was performed on the latest dataset, a collection of 5574 messages including 747 spam messages. Duplicate messages were eliminated using plagiarism detection techniques. The dataset was divided 30%-70% for training and testing, respectively. Finally, they compared the accuracy achieved by the classifiers, and the results showed that the Support Vector Machine algorithm performed best.
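The two preparation steps mentioned above, duplicate removal and a 30%-70% train/test split, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: exact matching after lowercasing and whitespace cleanup stands in for the plagiarism detection techniques used in [3], the per-class (stratified) split is my own choice, and the toy messages are invented.

```python
import random
from collections import defaultdict


def drop_duplicates(messages):
    """Keep the first copy of each message, ignoring case and extra spaces.

    Assumption: exact matching is a simple stand-in for the plagiarism
    detection techniques described in the literature.
    """
    seen = set()
    unique = []
    for label, text in messages:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append((label, text))
    return unique


def stratified_split(messages, train_fraction=0.3, seed=0):
    """Split each class separately so both parts keep a similar spam ratio."""
    by_label = defaultdict(list)
    for label, text in messages:
        by_label[label].append((label, text))
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


# Invented toy data: 10 ham, 10 spam, plus one duplicate spam message.
messages = [("ham", f"see you at {hour} pm") for hour in range(1, 11)]
messages += [("spam", f"WIN a prize, call {n}") for n in range(10)]
messages.append(("spam", "win a  prize, CALL 3"))

unique = drop_duplicates(messages)
train, test = stratified_split(unique, train_fraction=0.3)
print(len(messages), len(unique), len(train), len(test))
```

With a real dataset, the same two functions would be applied to the loaded (label, text) pairs before any classifier is trained.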

Gordon V. Cormack and others [4] claimed that SMS spam detection can be performed well using the filtering systems applied to email spam detection, but that email spam filters require some changes to achieve better performance on SMS. The best-performing email spam filters were used to detect SMS spam. The experiment concluded that a larger dataset and more experiments are required to filter spam efficiently. As mentioned earlier, academic researchers were looking for a large public dataset.

Tiago A. Almeida and others proposed a new technique in a research work [5] that can improve the efficiency of machine learning classifiers. A short and messy text message was first normalized and expanded to obtain more features, which enhances classification performance. The technique was based on semantic and lexicographic dictionaries along with content detection and semantic analysis. The results showed that these preprocessing techniques are well suited to content-based filtering. No single feature set (merging rule) or classifier clearly stood out, so the new features are intended to complement the traditional bag-of-words features. After a feature reduction step, the most relevant features are selected to improve the classification capabilities.
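To make the idea of normalizing a short, messy message concrete, here is a minimal Python sketch. It is not the method of [5]: the `SLANG` lookup table is a tiny hypothetical dictionary written for this example, and the sketch covers only lexical normalization, not the semantic expansion step.

```python
import re

# Hypothetical mini-dictionary; a real system would load a much larger
# lexicographic resource.
SLANG = {
    "u": "you",
    "ur": "your",
    "txt": "text",
    "pls": "please",
    "2": "to",
    "4": "for",
}


def normalize(text, slang=SLANG):
    """Lowercase the message, tokenize it, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [slang.get(token, token) for token in tokens]


tokens = normalize("Pls txt ur PIN 2 claim")
print(tokens)
```

The normalized tokens can then be fed to any bag-of-words feature extractor in place of the raw message text.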

L. Zhang, J. Zhu, and others [6] focused on five common supervised learning approaches for email spam detection, using three English datasets and one Chinese dataset. Three major conclusions were drawn from this research. First, the bag-of-words model is quite effective for the spam filtering task. Second, the top performers in spam filtering are maximum entropy, AdaBoost, and Support Vector Machine. Third, the message header information is as crucial as the message body. Header information had not previously been considered important, but this research concluded that classifiers using all the information from both the header and the body yield better results.
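As an illustration of the bag-of-words model, the sketch below represents each message as a table of word counts and feeds those counts to a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. Naïve Bayes is chosen here only because it is one of the candidate classifiers in this proposal; the class name, the whitespace tokenizer, and the four training messages are assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a message as word counts, ignoring word order."""
    return Counter(text.lower().split())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior of the class plus log likelihood of each known word.
            score = math.log(doc_count / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word, n in bag_of_words(text).items():
                if word in self.vocab:  # words never seen in training are skipped
                    score += n * math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training set.
texts = [
    "win a free prize now",
    "free entry call now",
    "are we meeting for lunch",
    "call me when you are home",
]
labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("free prize call now"))
print(model.predict("are you home for lunch"))
```

Four messages are far too few to say anything about accuracy; the point is only to show how word counts become the features that a classifier consumes.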

B.    Amazon Sagemaker

Amazon Web Services (AWS), Microsoft Azure, and Google Cloud are the leading cloud service providers in the market. I will be using the Amazon cloud platform for this research. AWS provides several machine learning and artificial intelligence services, and Amazon Sagemaker is one of them. It allows the user to build, train, and deploy machine learning models easily and quickly. It comes with a hosted Jupyter Notebook through which we can fetch, clean, and transform the data. Sagemaker makes it easy to preprocess the data by providing several built-in libraries, which saves a considerable amount of time. AWS provides a storage solution called Amazon S3 [7] where we can store our input data and results. The Jupyter Notebook instances provided by Sagemaker make it easy to connect to Amazon S3 to read the input. Sagemaker also provides several built-in algorithms for various problem types that can be used to train our model. Once the model is trained, Sagemaker lets us evaluate its accuracy, either with the AWS SDK for Python (Boto) or with the high-level Sagemaker Python library. Finally, once the model is trained and tuned, Sagemaker makes it easy to deploy it to a production environment, where it can be used for predictions on real-time data. Sagemaker deploys the model in an auto-scaling cluster, decoupling the model from the application code.
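As a rough sketch of the storage step, the snippet below shows how a local dataset file might be copied to Amazon S3 with the AWS SDK for Python. The bucket name, object key, and file name are hypothetical placeholders, and the upload itself needs the `boto3` package and valid AWS credentials, so the call is left commented out.

```python
def s3_uri(bucket, key):
    """Build the s3:// address under which an uploaded object can be read."""
    return f"s3://{bucket}/{key}"


def upload_dataset(local_path, bucket, key):
    """Copy a local file to S3 and return its s3:// address.

    boto3 is imported here, not at module level, so that s3_uri() can be
    used on a machine where the SDK is not installed.
    """
    import boto3

    boto3.client("s3").upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)


# Hypothetical names, for illustration only.
# upload_dataset("sms_spam.csv", "my-example-bucket", "input/sms_spam.csv")
uri = s3_uri("my-example-bucket", "input/sms_spam.csv")
print(uri)
```

A notebook instance would then read the training input from the returned address.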

III.   Research Plan

Fig. 1 illustrates the full roadmap of this research. In any scientific research, reliable data is crucial; a lack of proper data can seriously affect evaluation and comparison. The first step, therefore, is to find a reliable dataset for the experiment. For this research, I have already found a dataset, which is a collection of 5573 text messages. This dataset must be preprocessed, which includes procedures such as duplicate analysis, inspection for irrelevant data, and elimination of outliers. Machine learning offers different training methods, such as supervised, unsupervised, and semi-supervised learning; the right choice depends on the dataset and the business problem. In this research, I will use one of these three methods. Once the method is finalized, I need to determine which algorithms best suit this problem or give the best results. Examples of such algorithms are Decision Tree, KNN, Naïve Bayes, and Support Vector Machine. Amazon Sagemaker also provides several built-in algorithms, such as Linear Learner, XGBoost, Factorization Machines, and Seq2seq. Once the algorithms are finalized, model training can begin. The performance and accuracy of all the selected algorithms will then be evaluated on the local machine as well as on the cloud platform. Finally, the top performers will be selected, and a comparative analysis between the local machine and the Amazon cloud platform with respect to performance, usability, cost, and accuracy will be reported. In this way, I will try to answer the following questions: How can the accuracy of spam detection be improved on the local machine? Does the accuracy increase if we make use of the built-in algorithms provided in Amazon Sagemaker? How does the implementation in the cloud platform differ from that on the local machine? How do the local machine and Amazon Sagemaker compare with respect to performance, accuracy, cost, and usability?
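One way to keep the local and cloud results comparable is to run the same evaluation routine in both environments. The sketch below is a minimal example of such a routine: it reports accuracy, precision, recall, and wall-clock prediction time for any classifier exposed as a function. The `keyword_rule` classifier and the four labelled messages are invented stand-ins, not part of the planned experiment.

```python
import time


def evaluate(predict, samples, positive="spam"):
    """Score a classifier on labelled samples and time its predictions."""
    tp = fp = tn = fn = 0
    start = time.perf_counter()
    for label, text in samples:
        guess = predict(text)
        if guess == positive:
            if label == positive:
                tp += 1  # spam correctly flagged
            else:
                fp += 1  # legitimate message wrongly flagged
        else:
            if label == positive:
                fn += 1  # spam that slipped through
            else:
                tn += 1  # legitimate message correctly passed
    elapsed = time.perf_counter() - start
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "seconds": elapsed,
    }


def keyword_rule(text):
    """Stand-in classifier: flags any message containing the word 'free'."""
    return "spam" if "free" in text.lower() else "ham"


# Invented toy test set.
samples = [
    ("spam", "FREE entry, text WIN now"),
    ("spam", "Urgent: claim your prize"),
    ("ham", "Are you free tonight?"),
    ("ham", "Running late, see you soon"),
]

report = evaluate(keyword_rule, samples)
print(report)
```

Precision matters here because a false positive means a legitimate message is blocked, which is the risk raised in the introduction; the timing field only measures prediction time on the machine where the code runs.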

IV.   Conclusion

In this paper, the introduction has given an overview of SMS spam, its risks, and the importance of spam detection, along with a brief introduction to machine learning as a cloud service and its benefits. Four research works in the field of spam detection were discussed in the Literature Review, which showed that there was a scarcity of public datasets on which to perform experiments. Datasets are now available, however, and this research explores how accuracy can be improved using either a local machine or a cloud platform. A short introduction to Amazon Sagemaker is also given in the Literature Review. A detailed plan of my research workflow is provided in the Research Plan section, where I have built a roadmap for my future work and stated the research questions that this research will answer.

Fig. 1. Roadmap of Research Plan.

REFERENCES

[1]      http://www.cloudmark.com/en/article/

[2]      T. A. Almeida, J. M. Gómez Hidalgo, and T. P. Silva, “Towards SMS Spam Filtering: Results under a New Dataset,” International Journal of Information Security Science, Vol. 2, No. 1

[3]      Tiago A. Almeida, José María Gómez Hidalgo, and Akebo Yamakami, “Contributions to the Study of SMS Spam Filtering: New Collection and Results”

[4]      Gordon V. Cormack, José María Gómez Hidalgo, and Enrique Puertas Sánz, “Feature Engineering for Mobile (SMS) Spam Filtering,” http://www.esi.uem.es/jmgomez/papers/sigir07.pdf

[5]      Tiago A. Almeida, Tiago P. Silva, Igor Santos, and José M. Gómez Hidalgo, “Text Normalization and Semantic Indexing to Enhance Instant Messaging and SMS Spam Filtering,” http://paginaspersonales.deusto.es/isantos/papers/2016/2016- almeida-kbs-short spam.pdf

[6]      Le Zhang, Jingbo Zhu, and Tianshun, “An Evaluation of Statistical Spam Filtering Techniques,” https://dl.acm.org/citation.cfm?id=1039625

[7]      https://aws.amazon.com/s3/

[8]      https://aws.amazon.com/glue/
