Sentiment analysis is a branch of NLP that involves the contextual mining of text to identify and extract subjective information, which can help businesses understand the social sentiment around their brands. The main aim of this project is to examine the use of sentiment analysis on drug reviews, which can help identify new opportunities and challenges for a pharmaceutical business. The project classifies reviews of the specified drugs by polarity, using their ratings as labels. Several conventional and deep learning methods are applied to the data to attain reasonable results. The results indicate better performance from the deep learning models; at the same time, analyzing this data raised some disadvantages and problems, which are highlighted in this project.
Almost 2.5 quintillion bytes of data are generated around the world every day, and sentiment analysis has been a key tool in helping people understand and make sense of this volume of data. Within NLP, sentiment analysis is also widely known as opinion mining, which tries to identify and extract the opinion expressed in a text.
Machine learning algorithms are broadly divided into supervised and unsupervised learning methods. Supervised models require a training dataset with known outputs, also known as labels, which is used to derive the relationship between the features and the output to be predicted, whereas unsupervised models do not use labels and instead cluster the data into common groups. A further category is semi-supervised learning, where unsupervised methods are used to develop labels for supervised machine learning.
The models fall into two main categories: conventional machine learning models and deep learning models. With recent developments in various deep learning models, the ability to apply deep learning to NLP and text analysis has increased drastically. In recent applications, deep learning models combined with methods that project words as vectors have led to impressive results.
Taking these techniques into account, this project applies two conventional methods, SVM (support vector machine) and NB (Naïve Bayes), along with two deep learning models: the popular CNN with word2vec, where words are represented as vectors, and the LSTM model.
- Related Work
Sentiment analysis on text is not a new concept, and many works have been carried out in this field. The UCI Machine Learning Repository, from which this dataset was taken, lists a few papers that aid comparison, such as Felix Graber, Surya Kallumadi et al. (2018), which demonstrates cross-domain and cross-data learning. Another paper of interest for this project was Daniel Jurafsky & James H. Martin (2018), which explains neural network concepts and their workings in detail and was used to understand and apply them. Overall, comparatively little work has been done on this dataset. To implement the deep learning models, it was important to understand the various techniques; the paper by Arman S. Zharmagambetov was very useful for understanding what CBOW and skip-gram are, how they apply to text, and how well they work.
- Description of Data
The data is obtained from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/ . It contains patient reviews of specific drugs along with their related conditions. The reviews are grouped into three categories: benefits, side effects and overall comment. Additionally, there is a rating column for the drug's effectiveness.
From these categories, the main columns under consideration for this project are the overall comments, with the column head "comment Review", and the rating column. The ratings range from 1 to 10 and are therefore divided either into 3 classes (Good, Bad, Neutral) or into 2 classes (Good and Bad); the resulting class is used as the target when predicting the polarity of the comment Review column.
Fig1. Sample of the Dataset
Fig2.Defining the classes from the rating.
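The mapping from rating to class can be sketched as below. The actual cut-offs appear only in Fig2 and are not reproduced in the text, so the thresholds and function names here are assumptions chosen purely for illustration.

```python
def rating_to_3class(rating, bad_max=4, neutral_max=6):
    """Map a 1-10 rating to Bad / Neutral / Good.

    bad_max and neutral_max are assumed cut-offs, not the project's values.
    """
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    if rating <= bad_max:
        return "Bad"
    if rating <= neutral_max:
        return "Neutral"
    return "Good"


def rating_to_2class(rating, bad_max=5):
    """Map a 1-10 rating to Bad / Good (bad_max is an assumed cut-off)."""
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    return "Bad" if rating <= bad_max else "Good"


three_class = [rating_to_3class(r) for r in (1, 5, 9)]
two_class = [rating_to_2class(r) for r in (5, 6)]
```

Keeping the cut-offs as parameters makes it easy to check how sensitive the class balance is to where the boundaries are drawn.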
The dataset is provided as two files, a train file and a test file, both containing the two columns of interest for this project: rating and comments Review. Because this project uses its own train-test split and analyzes the effect of dataset size on the results, the required columns are stored in two data frames, which are then merged into one dataset with any duplicates removed. The split is then performed on this merged dataset at different levels to get better results. The resulting dataset contains around 4k long comments with their respective ratings and derived sentiment; this data is cleaned and pre-processed and then used for training and testing to arrive at an output.
The dataset is investigated before any analysis is performed; the class weights in the data are depicted in the following figures.
The graphs show that most of the data has positive sentiment (around 68%), while about 22% is negative and about 11% is neutral. Given this imbalance, the models can be expected to perform better on the 2-class problem than on the 3-class problem. The models are also fed a small dataset of only around 4k long comments; with more data, the models would likely perform better. The data had a few missing values, which were cleaned. More about the distribution can be seen in figures A1 and A2 in the appendix.
The goal of this project is to classify the comments by polarity with the aid of the rating. This is done in two ways, with 3-class models and with 2-class models, and four algorithms are applied to the dataset to arrive at acceptable results. The conventional models are SVM and Naïve Bayes; the deep learning models are CNN and LSTM. Their results are compared and tabulated.
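A minimal sketch of the merge-and-deduplicate step, assuming pandas is used and that the review column is named `commentsReview` (the text spells it in more than one way). The two small data frames stand in for the real train and test files.

```python
import pandas as pd

# Stand-ins for the train and test files; the column names are assumptions.
train = pd.DataFrame({
    "rating": [9, 2, 9],
    "commentsReview": ["worked well", "bad rash", "worked well"],
})
test = pd.DataFrame({
    "rating": [9, 5],
    "commentsReview": ["worked well", "no change"],
})

cols = ["rating", "commentsReview"]

# Keep only the two columns of interest, stack both files, drop exact duplicates.
merged = (
    pd.concat([train[cols], test[cols]], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
```

The split into training and test portions is then made on `merged` rather than on the original files.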
4.1 Data Preparation
As mentioned above, the dataset is first stored in a data frame with just the required columns, rating and comments Review. It contains some empty values, which are removed before any other processing is carried out. The data is preprocessed before training and testing by:
- Changing all comments to lowercase
- Apostrophe lookup
- Replacing special characters with space
- Replacing numbers with space
- Removing words with length 1.
- Removing any URL or HTML tags in the comments.
The cleaned data is then preprocessed for NLP:
- Segmentation of sentences
- Lemmatization: removing any inflectional endings
- Removing stop words
- Normalizing the words
Depending on the method used, the data is in some cases also converted to vectors, which makes it easier to process and results in better performance.
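The cleaning steps listed above can be sketched with the standard library alone. The lookup tables are tiny illustrative assumptions, and lemmatization is left out because it needs an external NLP library; this is not the project's actual code.

```python
import re

# Tiny illustrative tables; a real run would use much fuller lists.
APOSTROPHES = {"don't": "do not", "can't": "cannot", "it's": "it is", "i'm": "i am"}
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "to", "of", "for", "do", "am"}


def clean_comment(text):
    """Return the cleaned tokens of one review comment."""
    text = text.lower()                                   # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)                  # remove HTML tags
    text = text.replace("\u2019", "'")                    # curly to straight apostrophe
    for short, full in APOSTROPHES.items():               # apostrophe lookup
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"[^a-z\s]", " ", text)                 # special characters and numbers
    # Drop one-letter words and stop words.
    return [t for t in text.split() if len(t) > 1 and t not in STOP_WORDS]


tokens = clean_comment("I don't like it!<br> Took 2 pills, see http://x.com")
```

URLs and HTML tags are removed before the special-character step, since that step would otherwise break them into stray word fragments.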
Fig4.Example of preprocessed data
4.2 SVM-Support Vector Machines
SVM is used because, once the texts are transformed into vectors, it determines the best decision boundary between them. Here an SVM with grid search is applied to the 2-class model, and OneVsRestClassifier is used for the multiclass model.
In the multiclass model, the preprocessed data is vectorized using a CountVectorizer, and the sentiments are converted into targets with the values good = 2, neutral = 1 and bad = 0. The data is then split using the train_test_split function with test_size=0.4, the OneVsRestClassifier is trained on the training portion, and predictions are made on the test portion.
Fig5.Parameters 3-class classification
- C: penalty parameter; setting it high forces a stricter separation of the classes.
- Gamma: kernel coefficient.
- Probability estimates are enabled.
- The class weights are balanced.
- A non-linear kernel is used.
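The 3-class setup described above might look roughly like this in scikit-learn. The toy comments and the values of `C` and `gamma` are assumptions for illustration (the real values are in Fig5), and the RBF kernel is one possible choice of non-linear kernel.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Toy stand-in for the preprocessed comments: bad = 0, neutral = 1, good = 2.
bad = ["severe rash", "awful nausea", "made it worse", "terrible headache", "bad reaction"]
neutral = ["no change", "some effect", "mixed results", "hard to tell", "average outcome"]
good = ["great relief", "works well", "pain gone", "very effective", "helped a lot"]
comments = bad + neutral + good
y = [0] * 5 + [1] * 5 + [2] * 5

X = CountVectorizer().fit_transform(comments)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

# C and gamma are assumed values, not the ones used in the project.
svm = SVC(C=10, gamma="scale", kernel="rbf", probability=True, class_weight="balanced")
clf = OneVsRestClassifier(svm).fit(X_train, y_train)
pred = clf.predict(X_test)
```

With so few toy samples the predictions themselves are not meaningful; the point is only the order of the steps.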
In the 2-class model:
The preprocessed data is vectorized and split with train_test_split and test_size=0.2. StratifiedKFold is then applied, and a pipeline is used to perform the grid search with GridSearchCV; the model was observed to perform better with the resulting parameters than with the others tried. The model is then fitted on the training data, used to predict on the test data, and scored.
Fig6.Parameters 2-class classification
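A hedged sketch of the 2-class grid search follows. The parameter grid, the number of folds and the toy data are assumptions (the project's values are in Fig6), and the vectorizer is placed inside the pipeline here so that it is refitted on each fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-in for the preprocessed comments: bad = 0, good = 1.
good = ["great relief", "works well", "pain gone", "very effective", "helped a lot",
        "feel much better", "good results", "no side effects", "highly recommend",
        "symptoms improved"]
bad = ["severe rash", "awful nausea", "made it worse", "terrible headache",
       "bad reaction", "did not work", "constant dizziness", "stopped taking it",
       "felt much worse", "painful cramps"]
comments = good + bad
labels = [1] * 10 + [0] * 10

X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.2, random_state=0, stratify=labels
)

pipe = Pipeline([("vect", CountVectorizer()), ("svm", SVC())])
param_grid = {"svm__C": [1, 10], "svm__kernel": ["linear", "rbf"]}  # assumed grid
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

search = GridSearchCV(pipe, param_grid, cv=cv)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)
```

`search.best_params_` then holds the best combination found on the training folds, and `test_accuracy` is measured on the held-out 20%.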
4.3 Naive Bayes
Naive Bayes is a very simple classification method, yet it yields good results. It builds on the concept of conditional probability: it classifies by working out the probability of the different attributes given the class under consideration. It works under the assumption that the features are independent of each other given the class. The first method used built a bag-of-words (BoW) feature set and applied the NaiveBayesClassifier, but the performance of that model was considerably lower for both the 2-class and 3-class problems. The project therefore switched to a better performing model, given below, in which a class was defined to fit the classifier on the training dataset, predict the sentiment of each comment and score the result. The fit function computes the NB classification probabilities as defined by Daniel Jurafsky & James H. Martin. The predict function is used to predict on the test data, and the score function is used to score the model; this is done for both the 2-class and 3-class models to compare their accuracy.
Fig7.Model 1 parameters for NB classification
Fig 8 . Model 2 for NB
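A class with fit, predict and score methods could be sketched as follows. This is the standard multinomial Naive Bayes with add-one smoothing; whether it matches the project's own class in every detail is an assumption, and the class name is hypothetical.

```python
import math
from collections import Counter


class SimpleNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for d in docs for w in d}
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(w for d in class_docs for w in d)
            total = sum(counts.values()) + len(self.vocab)
            self.log_likelihood[c] = {
                w: math.log((counts[w] + 1) / total) for w in self.vocab
            }
        return self

    def predict(self, docs):
        preds = []
        for d in docs:
            # Words never seen in training are skipped.
            scores = {
                c: self.log_prior[c]
                + sum(self.log_likelihood[c][w] for w in d if w in self.vocab)
                for c in self.classes
            }
            preds.append(max(scores, key=scores.get))
        return preds

    def score(self, docs, labels):
        preds = self.predict(docs)
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)


train_docs = [["great", "relief"], ["works", "great"], ["awful", "rash"], ["awful", "nausea"]]
train_labels = ["good", "good", "bad", "bad"]
nb = SimpleNaiveBayes().fit(train_docs, train_labels)
predictions = nb.predict([["great", "works"], ["rash", "awful"]])
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied for a long comment.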
A convolutional neural network (CNN) was used as one of the deep learning models. It was used together with the word2vec model, since vector representations of words carry more semantic meaning and are more helpful in sentiment analysis. There are two word2vec models, continuous bag of words (CBOW) and skip-gram (SG); this project uses CBOW. The word2vec word embedding represents the words of the preprocessed comments in a continuous vector space where nearby points represent similar words. CBOW predicts a target word from its surrounding context words.
Fig9.Word2vec using CBOW (Source: Arman S. Zharmagambetov, Alexandr A. Pak)
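To make the CBOW idea concrete, the sketch below builds the (context, target) training pairs that a CBOW model learns from. It shows only how the examples are formed, not the embedding training itself; the function name and window size are illustrative assumptions.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW model."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target form its context.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


pairs = cbow_pairs(["drug", "reduced", "pain", "quickly"], window=1)
```

Skip-gram uses the same windows in the opposite direction, predicting each context word from the target.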
After CBOW is performed on the preprocessed data, train_test_split is used with test_size=0.2 to separate the data into 80% training and 20% testing. After this, one_hot_seq and LabelEncoder are used to represent the categories as numbers, and 15% of the training dataset is then set aside for validation (overall, 68% training, 12% validation and 20% testing). The next step is the neural network definitions. First a base model is defined, with 3 dense layers and activations ReLU -> ReLU -> softmax. Next, further models are defined (a reduced model, a regularized model and a model with dropout) to observe the various effects on performance. The word2vec model is applied to the neural network, and dropout helps prevent overfitting. The parameters defined in the network are:
- Kernel size: the length of the convolution window
- Number of filters: the output dimensionality
- Activation function: ReLU, a non-linear activation function
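The roles of the kernel size, the number of filters and the ReLU activation can be illustrated with a small NumPy sketch of a 1D convolution over a sequence of word vectors. The shapes and random values are assumptions for illustration, not the project's architecture.

```python
import numpy as np


def conv1d_relu(embedded, kernels, bias):
    """1D convolution followed by ReLU.

    embedded: (seq_len, emb_dim) word vectors of one comment
    kernels:  (n_filters, kernel_size, emb_dim)
    bias:     (n_filters,)
    """
    n_filters, kernel_size, _ = kernels.shape
    steps = embedded.shape[0] - kernel_size + 1
    out = np.empty((steps, n_filters))
    for t in range(steps):
        window = embedded[t:t + kernel_size]  # kernel_size consecutive words
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU keeps only positive responses


rng = np.random.default_rng(0)
embedded = rng.normal(size=(7, 4))    # 7 words, 4-dimensional embeddings
kernels = rng.normal(size=(3, 2, 4))  # 3 filters, kernel size 2
features = conv1d_relu(embedded, kernels, np.zeros(3))
```

Each filter slides over windows of `kernel_size` consecutive words, so the output has one column per filter, which is why the number of filters sets the output dimensionality.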
Fig.10 CNN flow, function and architecture with word2vec
The LSTM extracts and represents the contextual meaning of words through statistical computation; it identifies patterns in the text and the relationships between them. The process involved passing the pre-processed data into the defined LSTM architecture, defining the model classes, training the network and then testing it.
Fig.12 LSTM model
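How an LSTM carries context from word to word can be sketched with a single generic LSTM cell in NumPy. The architecture in Fig.12 is not reproduced in the text, so the sizes and random weights below are assumptions and the code shows only the standard cell update.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate cell state
    c = f * c_prev + i * g         # keep part of the old memory, add new
    h = o * np.tanh(c)             # hidden state passed to the next word
    return h, c


rng = np.random.default_rng(0)
D, H = 4, 3                        # embedding size and hidden size (assumed)
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):  # a comment of 5 word vectors
    h, c = lstm_step(x, h, c, W, U, b)
```

After the last word, `h` summarizes the whole comment and would be passed to a final classification layer.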
The scores for the various methods are shown in the table below, and the individual scores can be seen in appendix figures A4 to A9.
Fig.13 Consolidated results of all models
The models perform well under a simple 2-class classification. A possible reason is the size of the dataset: with more data, the models might yield better results in the multiclass category.
The best result comes from the CNN model trained with word2vec, with 85% accuracy in the 2-class classification and, in the 3-class classification, a highest accuracy of 78%.
In the binary classification the models performed as expected, with SVM giving the better results among the conventional methods and CNN among the deep learning methods. In the 3-class models, however, NB seems to have performed better than SVM. This could be because the test size played an important role in boosting the results for SVM in the 2-class case, where it worked better with a non-linear kernel and grid search. Even after various rounds of tuning, the 3-class SVM failed to produce the expected results and showed a lot of confusion between classes, as can be seen in the confusion matrix in appendix A4.
In this project, sentiment analysis was performed on the drug review dataset using various conventional and deep learning models in order to evaluate their performance. The deep learning models were observed to perform better. Accuracy appeared to improve whenever vector representations were used, in any method. Unequal class distribution is presumed to have been a major issue in training and testing, yielding lower accuracy. The small size of the dataset could also have contributed to the models performing significantly below expectations. Overall, the binary classification worked almost as expected and yielded better accuracy than the 3-class classification.
- Felix Graber, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In Proceedings of the 2018 International Conference on Digital Health (DH ’18). ACM, New York, NY, USA, 121-125.
- Ioannis Korkontzelos, Azadeh Nikfarjam, Matthew Shardlow, Abeed Sarker, Sophia Ananiadou, and Graciela H. Gonzalez. 2016. Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts. Journal of Biomedical Informatics.
- Daniel Jurafsky and James H. Martin. 2018. Neural Networks and Neural Language Models. Chapter 7.
- Arman S. Zharmagambetov and Alexandr A. Pak. Sentiment Analysis of a document using deep learning approach and decision trees.
A1.Distribution for 2 and 3 class
A2.Distribution for 2 and 3 class
A4.SVM 3 and 2 class Results
A5.NB 2 and 3 class Results
A6.CNN 3-class model results
A7.CNN 2-class model results
A8 LSTM 2 and 3 Class Results
A9 NB model 1 results