Abstract— The digital world has made it possible to form friendships through social networks, electronic communication and online relationships. One may have thousands of ‘friends’ without ever seeing them or knowing anything about their real lives. In this kind of setting, it is fairly easy to provide a false name, age, gender and location to hide one’s true identity. It would therefore be useful if social networking profiles could be cross-verified and monitored for false identities on the basis of automated text analysis. This paper provides an overview of the text-based analyses performed by researchers, which include major steps such as selecting the right dataset, preprocessing the dataset, extracting various text features and classifying texts by age and gender using machine learning. Finally, it explains the research plan I intend to use for my own research.
Keywords— cyberbullying, feature extraction, machine learning, classifiers, safety monitoring, harmful tweets
I. introduction
In recent years, interest in social networking sites has grown tremendously. Websites like Twitter, Facebook and Quora have expanded enough to attract people from all age groups. However, it is also easy for people to provide false identities, such as a false age, gender, location or name. This makes it easy for online criminals like paedophiles and bullies to target their victims without the worry of getting caught. Online law enforcement agencies and social network moderators find it difficult to track down these criminals manually, as the large number of profiles posing as adolescents makes it practically impossible to catch them all. It has, therefore, become necessary to use automated techniques to help narrow down the search.
Natural language processing has enabled many researchers to detect bullying and aggressive behaviour in recent years. This paper focuses on these phenomena on Twitter; the work could be extended to other social networking sites in future. There are a few obstacles that may be faced when using Twitter. Firstly, tweets are short and may contain many grammatical mistakes. Secondly, despite spam detection, Twitter has many spam accounts, and filtering them out may be a difficult task. Thirdly, each tweet provides fairly little context (Chatzakouy, et al., 2017), so a mean or aggressive tweet taken on its own may not seem aggressive unless it is read along with other comments in the same context. Finally, given the speed at which chat language is developing, an ongoing challenge will be to constantly retrain the algorithms to pick up new variations in the language used online.
II. related work
A number of researchers have demonstrated the extraction of age and gender using text-based analysis. Below are a few whose techniques and clarity encouraged me to take this on as my research topic.
In (Agrawal & Awekar, 2018) the authors demonstrate the efficiency of deep neural networks for detecting cyberbullying across multiple social media platforms.
In (Dadvar, et al., 2012) the authors demonstrate cyberbullying detection using a gender-based approach.
In (Nguyen, et al., 2011) the authors demonstrate the use of linear regression for author age prediction.
In (Ugheoke, 2014) the author explains in depth Twitter terms, user profiles and user tweeting behaviour.
This paper draws on the various techniques used by researchers in the past to formulate the technique best suited to this problem. The steps involved are as follows:
a) Selection of datasets
b) Data Preprocessing
c) Feature Extraction
d) Model generation
Finally, it describes the two hypotheses that will be used in my research, followed by a conclusion and the further research I intend to work on, given the time available for this practicum.
III. selection of datasets
The size of the dataset plays a vital role in training machine learning algorithms. In (Peersman, et al., 2011) the authors demonstrate that a text categorization approach to identifying gender and age can be used with adequate reliability on social network communication despite its challenging characteristics, such as short texts that often contain nonstandard, ever-changing language. This helps a great deal for my research, as Twitter limits each tweet to just 280 characters.
Ideally, a large training dataset would be best suited, preferably with separate datasets for age and gender. This is to compensate for the short length of each tweet. A point of concern is whether to use a balanced or unbalanced dataset, that is, whether it is best to use an equal number of male and female users, or an equal number of people in each age group. This is something that can only be determined through experimentation.
IV. data preprocessing
Pre-processing is important for data taken from platforms like Twitter, as tweets are short and people tend to use short forms of words. There is also the use of symbols and emoticons, which need to be filtered out.
a) Cleaning: Clean the data if it contains noise. The text needs to be tokenized, with numbers, stop words and punctuation removed. Also, convert all characters to lowercase.
b) Removing spammers: In (Chatzakouy, et al., 2017) the authors use Wang et al.’s approach, which combines two main indicators of spam: (i) using too many hashtags to boost the account’s visibility and (ii) posting too many similar tweets.
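The two preprocessing steps above could be sketched roughly as follows. This is only a minimal illustration: the stop-word list, the function names and both spam thresholds are assumptions made here, not values taken from the cited papers, and "similar tweets" is simplified to tweets that are identical after cleaning.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"a", "an", "the", "and", "or", "is", "are", "to", "of", "in", "on", "it", "i"}

def clean_tweet(text):
    """Lowercase a tweet, drop URLs, mentions, numbers and punctuation,
    tokenize it and remove stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    # Keep alphabetic tokens only (optionally with an inner apostrophe, e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    return [t for t in tokens if t not in STOP_WORDS]

def looks_like_spammer(tweets, max_avg_hashtags=5.0, max_repeat_ratio=0.5):
    """Flag an account using the two indicators described above.
    Both thresholds are hypothetical and would need tuning on real data."""
    if not tweets:
        return False
    # Indicator (i): average number of hashtags per tweet.
    avg_hashtags = sum(t.count("#") for t in tweets) / len(tweets)
    # Indicator (ii): share of tweets that repeat an earlier tweet after cleaning.
    distinct = len(set(" ".join(clean_tweet(t)) for t in tweets))
    repeat_ratio = (len(tweets) - distinct) / len(tweets)
    return avg_hashtags > max_avg_hashtags or repeat_ratio > max_repeat_ratio
```

A real pipeline would probably replace the exact-match test with a proper similarity measure, since spammers often vary their tweets slightly.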
V. feature extraction
Stylometry is the study of writing style, which can be used to classify authors. Researchers who use age as a criterion for feature extraction have noticed that teenagers tend to write about their friends and moods, people in their 20s write about college life, and people in their 30s write about work, marriage and politics (Rezaei, 2014). As with age, people of different genders also write about different topics, while people of the same gender tend to use similar words. Feature extraction from text can be done using the methods below:
a) Character based features
Character-based features analyse text at the level of individual characters. They include the total number of characters, letters, uppercase letters, lowercase letters, digits, special characters and white spaces.
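As a rough sketch, the character counts listed above can be obtained with Python's built-in string predicates. The function name and dictionary keys are illustrative choices, and "special characters" is assumed here to mean anything that is neither alphanumeric nor whitespace.

```python
def character_features(text):
    """Count the character classes listed above for a single text."""
    return {
        "total_chars": len(text),
        "letters": sum(c.isalpha() for c in text),
        "uppercase": sum(c.isupper() for c in text),
        "lowercase": sum(c.islower() for c in text),
        "digits": sum(c.isdigit() for c in text),
        "whitespace": sum(c.isspace() for c in text),
        # Assumption: "special" = not a letter, digit or whitespace character.
        "special": sum(not (c.isalnum() or c.isspace()) for c in text),
    }
```

For example, `character_features("Hi Bob, 2 pm!")` counts 13 characters in total, of which 7 are letters, 1 is a digit, 3 are spaces and 2 are special characters.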
b) Lexical or Word based features
(i) Hapax legomena is the number of words that occur exactly once in the entire text.
(ii) Hapax dislegomena is the number of words that occur exactly twice in the entire text.
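(iii) Yule’s K is another measure, which captures the diversity of words used in a text.
A commonly cited form of the formula is $K = 10^{4} \cdot \frac{\sum_{i} i^{2} V_{i} - N}{N^{2}}$, where $N$ is the total number of words in the text and $V_{i}$ is the number of distinct words that occur exactly $i$ times. A lower $K$ indicates a richer vocabulary.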
(iv) Simpson’s D is calculated by considering two words selected at random from the text. It is the probability that both are the same word. The lower the probability, the richer the text.
(v) Honoré’s R is another measure of vocabulary richness. It relates the number of words that occur only once to the size of the vocabulary and the total number of words in the text.
(vi) Entropy is used to measure the randomness of the data.
Formulas for all the above measures are available in (Rezaei, 2014).
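The lexical measures above can be sketched in a few lines once the text has been tokenized. The formulas in the comments are the commonly cited textbook definitions and are an assumption here; they may differ in detail from the variants given in (Rezaei, 2014). The function name and dictionary keys are likewise illustrative.

```python
import math
from collections import Counter

def lexical_features(tokens):
    """Vocabulary-richness measures for a list of word tokens.
    N = number of tokens, V = number of distinct words,
    V_i = number of distinct words occurring exactly i times."""
    n = len(tokens)
    if n == 0:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)              # word -> frequency
    v = len(counts)
    spectrum = Counter(counts.values())   # frequency i -> V_i
    v1, v2 = spectrum[1], spectrum[2]     # hapax legomena, hapax dislegomena
    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2
    yules_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    # Simpson's D = sum_i V_i * (i / N) * ((i - 1) / (N - 1))
    simpsons_d = (
        sum(vi * (i / n) * ((i - 1) / (n - 1)) for i, vi in spectrum.items())
        if n > 1 else 0.0
    )
    # Honore's R = 100 * ln(N) / (1 - V_1 / V); undefined when every word is unique.
    honore_r = 100 * math.log(n) / (1 - v1 / v) if v1 < v else None
    # Shannon entropy (in bits) of the word-frequency distribution.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "hapax_legomena": v1,
        "hapax_dislegomena": v2,
        "yules_k": yules_k,
        "simpsons_d": simpsons_d,
        "honore_r": honore_r,
        "entropy": entropy,
    }
```

On very short texts such as single tweets these measures are unstable, so it may make more sense to compute them over all tweets of a user rather than per tweet.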
c) Syntactic based features
Whereas lexical features are word based, syntactic features are sentence based. They count the total number of single quotes (‘), commas (,), periods (.), colons (:), semi-colons (;), question marks (?), multiple question marks, exclamation marks (!), multiple exclamation marks and ellipses (…).
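A minimal sketch of these punctuation counts is shown below. It assumes that the repeated item in the list is meant to be "multiple exclamation marks", that a "multiple" mark is a run of two or more, and that an ellipsis is either three or more consecutive periods or the single character "…". The function name and keys are illustrative.

```python
import re

def syntactic_features(text):
    """Count the punctuation marks listed above for a single text."""
    return {
        "single_quotes": sum(text.count(q) for q in ("'", "‘", "’")),
        "commas": text.count(","),
        # Lone periods only, so the dots of an ellipsis are not counted here.
        "periods": len(re.findall(r"(?<!\.)\.(?!\.)", text)),
        "colons": text.count(":"),
        "semicolons": text.count(";"),
        "question_marks": text.count("?"),
        "multiple_question_marks": len(re.findall(r"\?{2,}", text)),
        "exclamation_marks": text.count("!"),
        "multiple_exclamation_marks": len(re.findall(r"!{2,}", text)),
        "ellipses": len(re.findall(r"\.{3,}|…", text)),
    }
```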
d) Structural or morphological features
After sentences, structural features move on to a broader analysis of units like paragraphs and of writing style. They include the total number of sentences, total number of paragraphs, average sentences per paragraph, average words per paragraph, average characters per paragraph, average words per sentence and total number of blank lines.
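The structural counts above might be computed along the following lines. The sketch assumes that paragraphs are separated by blank lines and that a sentence ends at a run of '.', '!' or '?', which is a naive rule that ignores abbreviations. The function name and keys are illustrative.

```python
import re

def structural_features(text):
    """Compute the paragraph- and sentence-level counts listed above."""
    blank_lines = sum(1 for line in text.split("\n") if not line.strip())
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Assumption: a sentence ends at a run of '.', '!' or '?'.
    sentences = [
        len([s for s in re.split(r"[.!?]+", p) if s.strip()]) for p in paragraphs
    ]
    words = [len(p.split()) for p in paragraphs]
    chars = [len(p) for p in paragraphs]
    n_par, n_sent = len(paragraphs), sum(sentences)
    return {
        "sentences": n_sent,
        "paragraphs": n_par,
        "avg_sentences_per_paragraph": n_sent / n_par if n_par else 0.0,
        "avg_words_per_paragraph": sum(words) / n_par if n_par else 0.0,
        "avg_chars_per_paragraph": sum(chars) / n_par if n_par else 0.0,
        "avg_words_per_sentence": sum(words) / n_sent if n_sent else 0.0,
        "blank_lines": blank_lines,
    }
```

Since a single tweet rarely contains more than one paragraph, these features are likely to be more informative when computed over longer texts or over a user's combined tweets.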
VI. model generation and machine learning
Classification can be approached in three ways, covered in the subsections below: classifying age, classifying gender, and classifying age using gender information. In the first two, the age and gender datasets are classified separately. In the third, gender is classified first and this information is then used to classify age. The results of the approaches can then be compared to see which performs better.
A. Age Classification
In (Peersman, et al., 2011), the authors classify age using adult vs. adolescent groupings such as min16 vs. plus16, min16 vs. plus18, min16 vs. plus25, and min16_male vs. min16_female vs. plus25_male vs. plus25_female. In another paper (Simaki, et al., n.d.), the authors divide age into the following classes: Group A (14-19), Group B (20-24), Group C (25-34), Group D (35-44), Group E (45-59) and Group F (>60). They report results for various classifiers such as SVM, Bayes Net and MLP; the best one turned out to be the Random Forest algorithm.
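To make the two labelling schemes concrete, the sketch below maps a known age to a class label. Two points are assumptions made here rather than statements from the cited papers: age 60 itself is placed in Group F (the ranges as quoted leave it unassigned), and "min16"/"plusN" are read as "under 16" and "N or older", with ages in between left unlabelled.

```python
# Age ranges (inclusive) for Groups A to E as quoted above.
AGE_GROUPS = [("A", 14, 19), ("B", 20, 24), ("C", 25, 34), ("D", 35, 44), ("E", 45, 59)]

def age_group(age):
    """Return the six-class label for an age, or None if the age is below 14."""
    for label, low, high in AGE_GROUPS:
        if low <= age <= high:
            return label
    # Assumption: 60 is treated as part of Group F.
    return "F" if age >= 60 else None

def adolescent_vs_adult(age, adult_from=18):
    """Binary label in the min16 vs. plusN style (N = 16, 18 or 25).
    Ages between 16 and N fall into neither class and return None."""
    if age < 16:
        return "min16"
    if age >= adult_from:
        return f"plus{adult_from}"
    return None
```

Leaving a gap between the two classes, as in min16 vs. plus25, discards some data but makes the classes easier to separate.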
B. Gender Classification
(Rezaei, 2014) demonstrates the use of four widely used algorithms for gender prediction: Naïve Bayes, Logistic Regression, Decision Trees and Support Vector Machine. According to the reported results, Support Vector Machine provided the best accuracy of the four.
C. Age Classification using Gender information
This goes one step beyond the two approaches above: knowing the gender is an advantage when analyzing age. This approach was used by (Peersman, et al., 2011) with the groups min16_male vs. min16_female vs. plus25_male vs. plus25_female, where the SVM classifier yielded good accuracy.
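One possible way to use gender information for age classification is a two-stage pipeline, sketched below with scikit-learn. This is an illustration under several assumptions and not the setup used by (Peersman, et al., 2011): the features are already numeric vectors, gender is encoded as 0/1, a linear SVM predicts gender, and a Random Forest predicts age from the features plus the predicted gender. The function names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

def fit_two_stage(X, gender, age_group):
    """Stage 1: fit a gender classifier. Stage 2: fit an age classifier
    on the original features plus the predicted gender as an extra column."""
    gender_clf = LinearSVC().fit(X, gender)
    # Use predicted (not true) gender so training matches prediction time.
    X_aug = np.column_stack([X, gender_clf.predict(X)])
    age_clf = RandomForestClassifier(n_estimators=50, random_state=0)
    age_clf.fit(X_aug, age_group)
    return gender_clf, age_clf

def predict_two_stage(gender_clf, age_clf, X):
    """Predict gender first, then age using the predicted gender."""
    gender_pred = gender_clf.predict(X)
    age_pred = age_clf.predict(np.column_stack([X, gender_pred]))
    return gender_pred, age_pred
```

Comparing this pipeline against an age classifier trained on the features alone, on the same held-out data, would show whether the gender information actually helps.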
VII. research plan
The research plan I intend to use draws on most of what is covered in this literature review.
A. Hypothesis I
(i) Datasets: Three datasets will be used for this approach. The first has age classified into groups as done by (Simaki, et al., n.d.), namely Group A (14-19), Group B (20-24), Group C (25-34), Group D (35-44), Group E (45-59) and Group F (>60), which is a broader set than in (Daelemans, 2003), where only adults and adolescents are compared. The second has text along with the genders of the users. The third, for experimentation, will have a wide range of mean/harmful tweets.
(ii) Preprocessing: The data will be cleaned of any noise and stop words and then tokenized. To remove spam, I will use Wang et al.’s approach (Chatzakouy, et al., 2017) of removing accounts that have too many similar tweets and too many hashtags to boost visibility.
(iii) Feature extraction: Feature extraction plays a vital role in text-based analysis. The features used will be the lexical, syntactic and structural features described in Section V of this paper.
(iv) Classification algorithms: I will use Support Vector Machine for gender-based classification and the Random Forest algorithm for age-based classification.
B. Hypothesis II
(i) Datasets: This approach will use four datasets. The first is for age, the second for gender and the third for experimentation, as in Hypothesis I. The fourth will be a set of mean tweets for experimentation, for which gender is predicted first and the result is then used to detect age. I will check whether the algorithms work better when age and gender are detected separately or when age is detected knowing the gender.
(ii) Preprocessing: Similar to Hypothesis I.
(iii) Feature Extraction: Similar to Hypothesis I.
(iv) Classification Algorithms: Again, I will use Support Vector Machine for gender. For age prediction, I will use Support Vector Machine in addition to Random Forest.
The two hypotheses will help me determine which dataset is better along with which algorithm is best suited for age and gender prediction.
VIII. conclusion and future work
The rise in online activity has brought a rise in cyberbullying and aggression among users. Simply logging out or deleting an account is not a way to cope with this issue. The use of text-based sentiment analysis and author profiling will help combat this situation to a great extent. There may be obstacles during the research, but this is nevertheless an interesting area of research. Online language will always be changing and diverse depending on age and gender, making this a continuous piece of work. The algorithms used in various papers will help a great deal in concluding which one might be best, along with its accuracy. Feature extraction is an important part of this research, and I will attempt to implement all the methods mentioned in this paper. Support Vector Machine and Random Forest algorithms will be the main focus of study, alongside a parallel search for any new or better algorithms.
In future stages this work will be extended to other social networking sites like Facebook and Quora too.
J. van de Loo, G. De Pauw and W. Daelemans, “Text-Based Age and Gender Prediction for Online Safety Monitoring,” CLiPS – Computational Linguistics Group – University of Antwerp, 2003. [Online]. Available: https://www.clips.uantwerpen.be/sites/default/files/age_gender_paper_published.pdf.
D. Chatzakouy, N. Kourtellisz, J. Blackburnz, E. D. Cristofaro, G. Stringhini and A. Vakali, “Mean Birds: Detecting Aggression and Bullying on Twitter,” Aristotle University of Thessaloniki, Telefonica Research, University College London, 12 May 2017. [Online]. Available: https://arxiv.org/pdf/1702.06877.pdf.
C. Peersman, L. V. Vaerenbergh and W. Daelemans, “Predicting age and gender in online social networks,” Research Gate, October 2011. [Online]. Available: https://www.researchgate.net/publication/221615645_Predicting_age_and_gender_in_online_social_networks.
T. O. Ugheoke, “Detecting the Gender of a Tweet Sender,” University of Regina, May 2014. [Online]. Available: http://www2.cs.uregina.ca/~hilder/my_students_theses_and_project_reports/ugheokeMScProjectReport.pdf.
D. Nguyen, N. A. Smith and C. P. Rose, “Author Age Prediction from Text using Linear Regression,” Language Technologies Institute, Carnegie Mellon University, 24 June 2011. [Online]. Available: http://aclweb.org/anthology/W11-1515.
S. Agrawal and A. Awekar, “Deep Learning for Detecting Cyberbullying Across Multiple Social Media Platforms,” Indian Institute of Technology, Guwahati, 19 Jan 2018. [Online]. Available: https://arxiv.org/abs/1801.06482.
M. Dadvar, F. d. Jong, R. Ordelman and D. Trieschnigg, “Improved Cyberbullying Detection Using Gender Information,” Human Media Interaction Group, University of Twente, 23 Feb 2012. [Online]. Available: https://www.researchgate.net/publication/230701861_Improved_Cyberbullying_Detection_Using_Gender_Information.
V. Simaki, I. Mporas and V. Megalooikonomou, “Age Identification of Twitter Users: Classification Methods and Sociolinguistic Analysis,” University of Patras, University of Hertfordshire, [Online]. Available: https://www.researchgate.net/publication/317585154_Age_Identification_of_Twitter_Users_Classification_Methods_and_Sociolinguistic_Analysis.
A. M. Rezaei, “Author Gender Identification from Text,” Institute of Graduate Studies and Research, Eastern Mediterranean University, July 2014. [Online]. Available: http://i-rep.emu.edu.tr:8080/xmlui/bitstream/handle/11129/1845/RezaeiAtoosa.pdf?sequence=1.