
Big Data Model for Tweets

5048 words (20 pages) Essay in Information Technology

23/09/19 Information Technology

Disclaimer: This work has been submitted by a student.

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UK Essays.


      Social Media Analytics

Table of Contents

1 Introduction

1.1 Background

1.2 Aim and Objectives

1.3 Scope

1.4 Measurable Organisational Value

2 Literature Review

2.1 Big Data Platforms

2.2 Text mining Techniques

3 Research and Implementation

Research Methodology

4 Resources

4.1 MongoDB

4.2 MongoDB Compass

4.3 Docker

4.4 R for Text Mining

4.5 Python for Text Mining

4.6 Office 365

5 Project Risks

5.1 Time Frame

5.2 Insufficient Communication

5.3 Hardware/Software Risks

5.4 Lack of Expertise

6 Project Phases

6.1 Research

6.2 Installation of MongoDB

6.3 Implementing MongoDB

6.4 Text Mining using R or Python

7 Gantt Chart

8 References

 

 

Table of Figures

Figure 1 MongoDB

Figure 2 MongoDB Compass

Figure 3 Docker

Figure 4 Sentiment analysis

Figure 5 Office 365

Figure 6  Gantt Chart

1       Introduction

1.1      Background

Data is not only growing endlessly but also evolving, and there has been a major paradigm shift in its nature. Structured data, which could be handled by relational databases and spreadsheets, is increasingly being displaced by unstructured data. By a simple definition, unstructured data is data that cannot readily be managed by a relational database; it includes emails, videos, images, social media text, blogs, sensor information, audio files and so on. It is estimated that almost 80% of data will be unstructured in future, which will demand special attention in business decision making, political analysis, social growth and many other areas (Schneider, 2016). Analysing unstructured data is still in its infancy, and handling its inherent complexity has become a challenge for data scientists and data analysts. Many techniques for managing and analysing unstructured data exist and are still evolving, as there is no single right way to gain insights from such data.

There are many Big Data platforms, also known as NoSQL applications, that are used to store and handle unstructured data. A few of the popular ones are Cassandra, MongoDB, Hadoop and Couchbase, which help to create a database for such data in a way quite similar to a relational database. There are also many software packages and analytical tools for applying machine learning techniques to unstructured data, for both unsupervised and supervised learning.

 

1.2      Aim and Objectives

Aim

Our aim is to utilise Big Data platforms to handle unstructured data and to link them to Machine Learning tools in order to analyse Big Data.

Objectives

  • Create a database using a Big Data platform, MongoDB.
  • Implement data modelling and upload data into the MongoDB database.
  • Link this database to Machine Learning tools such as R or Python.
  • Perform descriptive and predictive analysis on unstructured data.

1.3      Scope

The scope of this project is to create an algorithm that implements Big Data techniques to store, pre-process and maintain a database of unstructured social media text. Social media text is limited to Twitter for this project. The Big Data platform for the database will be MongoDB, with MongoDB Compass used for visualisation of the data.

To perform analysis on tweets, the built-in text mining packages of R or Python will be used. For this project, text mining code for both will be tested to arrive at results. Depending on ease of procedure and accuracy of results, one of the two machine learning tools will be applied.

1.4      Measurable Organisational Value

  • Investigate the Twitter data to find valuable insights.
  • Perform visualisation and descriptive analysis on the Twitter texts.
  • Implement supervised learning and predictive analytics on social media text.
  • Perform text mining and sentiment analysis on the tweet data.

2       Literature Review

2.1      Big Data Platforms

Cassandra

(Featherston, 2010) describes Cassandra as a distributed database designed to be highly scalable in terms of both storage volume and request throughput, while not being subject to any single point of failure. He characterises it as a distributed key-value store capable of scaling to arbitrarily large data sets with no single point of failure. According to the author, these data sets may span server nodes, racks and even multiple data centres. With many organisations measuring their structured data storage needs in terabytes and petabytes rather than gigabytes, technologies for working with information at scale are becoming more important. Because such systems span local and wide area networks, it is important that they continue to operate in the face of problems such as failed nodes and faulty links. The probability that any single component of a distributed system fails is usually low, but the chance of a failure somewhere increases in direct proportion to the number of components.

MongoDB

(UGstudents D. & RRCE) describe how, with the emergence of Big Data, the use of NoSQL has grown among internet companies and other organisations. Its benefits include horizontal scaling, finer control over availability and simplicity of design. NoSQL databases are regarded as an alternative to relational databases, as their schema-less data model is considered better suited to handling large volumes of structured and unstructured data. The authors note that Big Data is a buzzword used to describe a volume of both structured and unstructured data that is so large that it is hard to process with traditional software and database techniques. The volume of data is estimated to be growing by 40% each year and is expected to grow 44 times between 2009 and 2020. Most of this data is unstructured, as it is in text form.

(Arora R. & Aggarwal, 2013) note that, with the constant growth of data volumes, the storage, maintenance and support of information have become a major challenge. Relational database products tend to fall behind when applications have to be scaled to match incoming traffic. Because of the massive data storage involved, a growing number of users and developers have started turning to NoSQL databases. The authors describe how relational databases have been used for general data storage in business products and on the web. With the arrival of Web 2.0 applications, in which millions of users read and write data, a more scalable solution is required. The data stores for these applications need to provide good horizontal scalability, meaning the ability to distribute the data and the read and write operations over many servers. Relational databases have little capability to scale horizontally over many servers, and NoSQL databases emerged to meet such large-scale data requirements. The term “NoSQL” was first used by Carlo Strozzi in 1998 for his RDBMS, Strozzi NoSQL. More recently, the term NoSQL (Not Only SQL) has come to be used for databases that do not use SQL as their query language and do not require fixed table schemas.

Hadoop

(Heinzlreiter P., Krieger M. T. & Leitner I., 2012) note that, with the continuous growth of data in application domains in both industry and research, the requirements for data processing have increased drastically over recent years, often beyond the capabilities of the software that had previously been used in those domains. Technologies such as Bigtable, GFS and MapReduce were developed at Google and are used within web-based applications such as the Google search engine to handle big data. Hadoop provides open-source counterparts: the Hadoop Distributed Filesystem (HDFS) corresponds to GFS, whereas Bigtable corresponds to HBase within Hadoop.

Comparison Table

| Comparison | Cassandra | MongoDB | Hadoop |
| --- | --- | --- | --- |
| Architecture | It is a distributed column storage system. (Bhamra, 2018) | It is a document-based system that provides scalability through replication and sharding. (Bhamra, 2018) | It has a file system called the Hadoop File System, which can store massive volumes of data. (Bhosale, H. S., & Gadekar, D. P., 2014) |
| Implementation Language | Java (Bhosale, H. S., & Gadekar, D. P., 2014) | C++ (Bhosale, H. S., & Gadekar, D. P., 2014) | Java (Bhosale, H. S., & Gadekar, D. P., 2014) |
| Consistency Concepts | Eventual and immediate (Bhosale, H. S., & Gadekar, D. P., 2014) | Eventual and immediate (Bhosale, H. S., & Gadekar, D. P., 2014) | Immediate (Bhosale, H. S., & Gadekar, D. P., 2014) |
| Database Model | Wide column store (Bhosale, H. S., & Gadekar, D. P., 2014) | Document store (Bhosale, H. S., & Gadekar, D. P., 2014) | Wide column store (Bhosale, H. S., & Gadekar, D. P., 2014) |

2.2      Text mining Techniques

(Talib, R., Hanif, M. K., Ayesha, S., & Fatima, F., 2016) concisely explain the techniques, applications and issues of text mining for discovering patterns, making predictions and gathering insights from unstructured data. Text mining broadly draws on information retrieval, machine learning, data mining and computational linguistics. The standard procedure starts with the extraction and collection of unstructured data; the data is then processed to remove irregularities, patterns are derived, and finally machine learning is applied for decision making. Information Extraction (IE) is the process of obtaining relevant information from a document. Information Retrieval (IR) extracts patterns based on the relationships and associations between a given set of words; IR systems help to track user behaviour and return related information accordingly. Natural Language Processing (NLP) techniques are used to process and analyse data automatically in order to derive information. One such analysis is Named Entity Recognition, which must cope with abbreviations and synonyms and therefore relies on complex algorithms to identify entities and the relations between them. Co-reference resolution is applied along with Named Entity Recognition to establish logical relationships, for example to recognise the role of a person in an organisation. Another technique is clustering, in which text documents are grouped on the basis of set algorithms; clustering methods include distribution-based, hierarchical, centroid-based and density-based approaches. Text summarisation is performed on the raw document to abridge the text. It is carried out using a weighted heuristics approach that applies set rules, and it can be applied to many documents at a time. Text mining is extensively applied today in fields such as academia, business intelligence and social media.
In a nutshell, with the huge volume of unstructured data available, text mining is a valuable way to turn this data into knowledge.
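
The standard procedure described above (collect text, clean it, derive patterns) can be illustrated with a minimal sketch. It uses only the Python standard library; the stop-word list and the two sample sentences are assumptions made for illustration and are not taken from the cited paper.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "on"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(documents):
    """Count how often each non-stop-word occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    return counts


# Illustrative input documents (not real project data).
docs = [
    "Text mining is the analysis of unstructured text.",
    "Mining unstructured data for patterns.",
]
freqs = term_frequencies(docs)
print(freqs.most_common(3))
```

Term counts like these are the simplest kind of pattern; in practice they would usually feed a later clustering or classification step.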

(Batrinca, B., & Treleaven, P. C., 2015) survey the tools and techniques for analysing social media data, covering social media scraping, data storage, data cleaning and sentiment analysis. They focus mainly on sentiment analysis of Twitter data and on finding insights from it. Getting access to social media for research is generally easy, but accessing raw data from certain platforms such as Facebook and Google is extremely difficult. In contrast, Twitter has made its data available to researchers, which makes scraping convenient and allows analysis of a huge data set. The next challenge is to clean the unstructured data, chiefly by normalising it, so that it is ready for analysis. Data protection is an important concern for researchers: data should be secured once the database is created. Working with unstructured data and obtaining insights from it is difficult, and visualising the data and results is also challenging. There are many other aspects to consider in social media analysis. Data can be historic or a real-time feed, depending on the nature of the analysis. A data source also needs to be linked to other sources to derive information such as its impact on financial markets. Twitter data is isolated from other data sources, which makes it difficult to analyse its impact, and researchers are constantly trying to link such sources to make optimum use of resources. To make use of the huge amount of Twitter data, opinion mining, or sentiment analysis, is applied, using machine learning techniques such as support vector machines, bag-of-words and semantic analysis. Sentiment analysis is about gathering information on the emotion or opinion of the writer behind the text. The intensity of the emotion can also be determined and analysed with the help of automated tools.
Sentiment analysis is a broad subject, and analysing opinions on social media can be productive for making predictions and decisions, depending on the research undertaken.
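
As a rough illustration of the bag-of-words idea mentioned above, the sketch below scores a text by counting words from a positive and a negative word list. This is a simple lexicon-based stand-in, not the support vector machine approach, and both word lists are hypothetical and far smaller than any real sentiment lexicon.

```python
import re

# Hypothetical mini-lexicons, assumed purely for illustration.
POSITIVE = {"good", "great", "love", "happy", "win"}
NEGATIVE = {"bad", "sad", "hate", "terrible", "lose"}


def sentiment_score(text):
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z]+", text.lower())
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    return positives - negatives


def sentiment_label(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment_label("I love this great day"))
```

The magnitude of the score gives a crude notion of intensity, although a word-counting approach like this cannot detect irony or sarcasm.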

(Fornacciari, P., Mordonini, M., & Tomaiuolo, M., 2015) apply analytics to study Twitter opinions, facts and thoughts by combining the approaches of social network analysis and sentiment analysis. In essence, they try to find a relation between social connections and the sentiments behind them. The conventional approach in social analytics is to understand the network topology through connections, which results in a hierarchy of communities under a main topic. Twitter, in particular, allows connections to be tracked even where the relationship is not mutual. It simply gives the number of followers for a topic as a measure of its popularity, but does not indicate whether that popularity is positive or negative. A problem with microblogging platforms like Twitter is that they allow only a limited number of characters, so users find it difficult to express their thoughts fully and in turn use emoticons or slang. Applying sentiment analysis alone to such text does not always give accurate results. The algorithm applied here connects the two analyses, of the social network and of the sentiments in tweets, to overcome these limitations. The researchers tested this concept on different Twitter channels and the results were satisfactory. The difficulty of detecting sentiments such as irony and sarcasm was tackled by combining sentiment analysis with the study of the social network topology.

(Balahur, 2013) designed a method to analyse sentiment in social media texts, especially tweets. The approach studies tweets in real time and takes their structure and language into account to attain good results. The research focuses mainly on pre-processing the text to normalise it and to generate a vocabulary for sentiment testing. It uses minimal linguistic processing, to make the model portable across languages and to identify popular sentiments in tweets. Heuristics are then applied for feature selection, and Support Vector Machine (SVM) learning is used to classify the data. In tweet pre-processing, repeated punctuation is removed, emoticons are replaced with the type of emotion they express, and slang is handled by including certain words and phrases from a reliable source. Words are modified, and hashtags are treated as topics for study. SVM was then applied to perform sentiment analysis on the data, and the results were quite good. The author also concludes that the method avoids overfitting and that the analysis can be performed on various datasets with minimal linguistic processing.
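
The pre-processing steps listed above can be sketched roughly as follows. This is an illustrative approximation rather than the author's implementation: the emoticon table is a hypothetical four-entry mapping, and slang handling is omitted because it depends on an external word list.

```python
import re

# Hypothetical emoticon-to-emotion mapping, assumed for illustration.
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad"}


def preprocess_tweet(text):
    """Return (normalised_text, topics) for a single tweet."""
    # Treat hashtags as topics, as described in the paragraph above.
    topics = [tag.lower() for tag in re.findall(r"#(\w+)", text)]
    # Replace each emoticon with the type of emotion it expresses.
    for emoticon, emotion in EMOTICONS.items():
        text = text.replace(emoticon, f" {emotion} ")
    # Collapse runs of the same punctuation mark, e.g. "!!!" becomes "!".
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    # Drop the hash sign but keep the word, then normalise whitespace and case.
    text = text.replace("#", "")
    text = re.sub(r"\s+", " ", text).strip().lower()
    return text, topics


clean, topics = preprocess_tweet("Great win today!!! :) #MAGA")
print(clean, topics)
```

Emoticons are replaced before punctuation is collapsed so that a character sequence such as ":)" is still intact when it is looked up.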

3       Research and Implementation

Research Methodology

We will use a qualitative research methodology for our project, after evaluating Trump's Twitter data obtained from the internet. The qualitative method covers several techniques used within organisations, mainly recognising customer requirements and actions and estimating the value of technology. After gaining an understanding of the situation, we decided to organise the qualitative data in MongoDB. Using MongoDB, we can put the data that we have collected from the internet into a proper structure.

4       Resources

4.1      MongoDB

MongoDB is an open-source platform for handling Big Data and is classified as a NoSQL database program. It is a document-oriented database which stores data in JSON-like documents, allowing the schema to be flexible. It is a distributed database, so it is horizontally scalable and can be geographically distributed, and its high availability makes it easy to use. (What is MongoDB)

Figure 1 MongoDB (MongoDB)
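
To make the idea of flexible, JSON-like documents concrete, the sketch below models two tweets as Python dictionaries with different sets of fields and round-trips them through JSON. All field names and values are hypothetical; MongoDB itself stores documents in a binary JSON format (BSON), so the standard `json` module merely stands in for the database here.

```python
import json

# Two hypothetical tweet documents; the field names are assumptions.
tweet_a = {
    "text": "Example tweet text",
    "created_at": "2018-01-01T12:00:00Z",
    "user": {"screen_name": "example_user", "followers": 100},
    "hashtags": ["example"],
}
# The second document has a different set of fields: no fixed schema is imposed.
tweet_b = {
    "text": "Another tweet",
    "retweet_count": 5,
}

collection = [tweet_a, tweet_b]

# Serialise to JSON text and read it back, as a stand-in for storage.
serialized = json.dumps(collection, indent=2)
restored = json.loads(serialized)
print(restored[0]["user"]["screen_name"])
```

Nested fields such as `user` would require a separate table and a join in a relational design, whereas a document store keeps them inside the same record.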

4.2      MongoDB Compass

MongoDB Compass is a popular Graphical User Interface (GUI) for MongoDB which enables the user to visualise and explore the data easily. It also lets the user run queries quickly and helps to optimise query performance. (MongoDB Compass)

Figure 2 MongoDB Compass (Webinar: MongoDB Compass – Data navigation made easy)

4.3      Docker

Docker is an open platform used to run applications in isolated environments known as containers. Docker lets the user separate an application from its infrastructure, which makes it faster to get results from the code. Docker containers are also lightweight, so more containers than virtual machines can run on the same hardware, which makes it easier to deploy multiple pieces of software. We will use Docker to run our MongoDB code and will also try to use it for text mining in R or Python. (Docker)

Figure 3 Docker (What Does Build, Ship and Run Any App, Anywhere Really Mean?)

4.4      R for Text Mining

R has various packages for text mining, of which ‘SentimentAnalysis’ is one that determines the sentiment of a text. Other packages and code are also available to analyse text and generate results. We will apply the sentiment analysis package for our research.

4.5      Python for Text Mining

Similar to R, Python has many packages of its own for extracting sentiment from text. Libraries such as Tweepy, TextBlob and NLTK can be used to implement text mining in Python.

Figure 4 Sentiment analysis (Twitter Sentiment analysis)

4.6      Office 365

Microsoft Office 365 will be used for many project tasks, such as generating reports and storing data: MS Word for the proposal document and final report, MS PowerPoint for presentations and MS Excel for the initial data import as CSV files.

Figure 5 Office 365 (12 reasons to use Microsoft Office 365)

5       Project Risks

This project has a few risks which need to be mitigated for the project to succeed within the given time frame.

5.1      Time Frame

This project has been granted a total of eight weeks for completion. A project with an implementation like this requires a lot of research, reading and knowledge transfer sessions. We therefore held weekly meetings, kept in touch through social media and emailed documents to each other to overcome these risks and meet our target. We also used a Gantt chart and github.com, so that each person could share anything new they found and everyone could see the status of each person's work. Most importantly, we set deadlines for our work, so that we knew where we stood and what remained to be done. The Gantt chart shows the work to be completed, and it helped us a lot in this project.

5.2      Insufficient Communication

Projects often fail because of communication problems. As mentioned earlier, the types of communication used are emails, face-to-face conversations and social media conversations (WhatsApp). Each of these three methods carries risks, and if team members do not communicate with each other during the project, they will face serious problems at the end. Another risk is that team members and lecturers have different agendas. General discussions and appropriate documentation are therefore mandatory; these cannot happen in a short period of time and need to be properly organised. To overcome these risks, we will need to hold regular meetings.

5.3      Hardware/Software Risks

Hardware and software failures may happen at any time. Running a big database requires a good computer with sufficient RAM, disk space and a compatible operating system. If any of these fails, it will directly affect our project and deadlines. To avoid these risks, we need to keep backups whenever we make changes to the databases. Since we are downloading open-source software directly from the web, we need to be very careful with versions; this can be time consuming because we need to download the correct, up-to-date version. We also need to be very careful when downloading, because an incorrect version of such software can sometimes contain a virus.

5.4      Lack of Expertise

Compared to Kumash, I did not have any knowledge of big data because my background is in networking, but with Kumash's help I learned what big data is. Kumash, for her part, faced the big challenge of studying the Big Data platform MongoDB. She is also acquiring knowledge of text mining using R or Python; she managed to learn about it and is guiding me as well. This clearly shows that we both face a lot of challenges in this task. We need to do a lot of research and reading to overcome them. Our hard work helps us to learn something new, and in this way we manage this risk.

6       Project Phases

6.1      Research

Initially, a lot of research is needed to understand the project deliverables. This includes learning about Big Data platforms and their architecture, data modelling, schema types, coding structure, limitations and user interfaces. Research is an ongoing process: after understanding and implementing the database phase, we need to research text mining, including its concepts, the types of mining, which method to apply to the data, which software to use and the code required to perform the analysis.

6.2      Installation of MongoDB

There are various ways to install and use MongoDB to create a database and upload data. We need to learn about the installation methods and decide on one that is easy and fast. We also need to check whether it is more useful to install MongoDB and MongoDB Compass on the local machine or to use a Docker container as an isolated environment.

6.3      Implementing MongoDB

After MongoDB has been successfully installed, the next step will be to create a database in MongoDB, upload the tweet data that has been downloaded from Kaggle.com and explore the data visually in MongoDB Compass.
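
A minimal sketch of the upload preparation step is shown below. It turns CSV rows into a list of document dictionaries using only the standard library. The column names and sample rows are hypothetical, since the layout of the downloaded file is not specified here, and the actual insertion into MongoDB is only indicated in a comment because it needs a running server.

```python
import csv
import io

# Hypothetical CSV content standing in for the downloaded tweet file.
SAMPLE_CSV = """text,created_at,retweet_count,favorite_count
First example tweet,2018-01-01 12:00:00,10,25
Second example tweet,2018-01-02 08:30:00,3,7
"""


def rows_to_documents(csv_file):
    """Convert CSV rows into dictionaries with typed numeric fields."""
    documents = []
    for row in csv.DictReader(csv_file):
        documents.append({
            "text": row["text"],
            "created_at": row["created_at"],
            # CSV values are strings, so counts are converted to integers.
            "retweet_count": int(row["retweet_count"]),
            "favorite_count": int(row["favorite_count"]),
        })
    return documents


documents = rows_to_documents(io.StringIO(SAMPLE_CSV))
# Assumption: with the pymongo driver, a list like this could be passed to
# a collection's insert_many() method; that step is omitted in this sketch.
print(len(documents))
```

Converting the numeric columns before upload means that later queries and aggregations can treat them as numbers rather than strings.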

6.4      Text Mining using R or Python

Once the data has been successfully uploaded to MongoDB and is ready after running queries, we need to connect the database to R or Python to perform descriptive and predictive analysis. The initial idea is to perform sentiment analysis using the packages available in the software, and to carry out further analysis once desirable results are achieved.
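
As a small illustration of the descriptive side of this phase, the sketch below summarises a list of tweet documents by month and by average retweet count. The documents are hypothetical in-memory stand-ins for records fetched from the database, and the field names are assumptions.

```python
from collections import Counter
from statistics import mean

# Hypothetical documents, standing in for records read from the database.
tweets = [
    {"text": "first", "created_at": "2018-01-01 12:00:00", "retweet_count": 10},
    {"text": "second", "created_at": "2018-01-15 09:00:00", "retweet_count": 4},
    {"text": "third", "created_at": "2018-02-03 18:30:00", "retweet_count": 7},
]


def tweets_per_month(docs):
    """Count tweets per calendar month using the 'YYYY-MM' date prefix."""
    return Counter(d["created_at"][:7] for d in docs)


def mean_retweets(docs):
    """Average retweet count over all documents."""
    return mean(d["retweet_count"] for d in docs)


monthly = tweets_per_month(tweets)
average = mean_retweets(tweets)
print(dict(monthly), average)
```

Summaries of this kind are a natural first check on the data before any sentiment or predictive model is applied.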

7       Gantt Chart

Figure 6  Gantt Chart

8       References

  • 12 reasons to use Microsoft Office 365. (n.d.). Retrieved from https://www.focus.net.nz/blog/category/general/office-365-12-reasons-why-plus-12-must-use-features
  • Arora, R., & Aggarwal, R. R. (2013). Modeling and querying data in mongodb. International Journal of Scientific and Engineering Research, 4(7), 141-144.
  • Balahur, A. (2013). Sentiment analysis in social media texts. In Proceedings of the 4th workshop on computational approaches to subjectivity, sentiment and social media analysis (pp. 120-128).
  • Batrinca, B., & Treleaven, P. C. (2015). Social media analytics: a survey of techniques, tools and platforms. Ai & Society, 30(1), 89-116.
  • Bhamra, K. (2018). A Comparative Analysis of MongoDB and Cassandra. (Master’s thesis) The University of Bergen.
  • Bhosale, H. S., & Gadekar, D. P. (2014). A review paper on Big Data and Hadoop. International Journal of Scientific and Research Publications, 4(10), 1-7.
  • Blandford, A., Furniss, D., & Makri, S. (2016). Qualitative HCI Research: Going Behind the Scenes. Synthesis Lectures on Human-Centered Informatics (p. 115). Morgan & Claypool.
  • Docker. (n.d.). Retrieved from https://docs.docker.com/engine/docker-overview/#next-steps
  • Featherston, D. (2010). Cassandra: Principles and Application. Department of Computer Science University of Illinois at Urbana-Champaign.
  • Fornacciari, P., Mordonini, M., & Tomaiuolo, M. (2015). Social Network and Sentiment Analysis on Twitter: Towards a Combined Approach. In KDWeb (pp. 53-64).
  • Heinzlreiter, P., Krieger, M. T., & Leitner, I. (2012). Hadoop-Based Genome Comparisons. In Cloud and Green Computing (CGC), 2012 Second International Conference on (pp. 695-701). IEEE.
  • MongoDB. (n.d.). Retrieved from https://codeburst.io/fawn-transactions-in-mongodb-988d8646e564
  • MongoDB Compass. (n.d.). Retrieved from https://www.mongodb.com/products/compass
  • Schneider, C. (2016, May 25). The biggest data challenges that you might not even know you have. Retrieved from https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/
  • Talib, R., Hanif, M. K., Ayesha, S., & Fatima, F. (2016). Text mining: techniques, applications and issues. International Journal of Advanced Computer Science & Applications, 1(7), 414-418.
  • Twitter Sentiment analysis. (n.d.). Retrieved from https://realpython.com/twitter-sentiment-python-docker-elasticsearch-kibana/
  • UGstudents D. & RRCE, B. R. (n.d.). Implementation of an Efficient MongoDB NoSQL Explorer for Big Data Visualization.
  • Webinar: MongoDB Compass – Data navigation made easy. (n.d.). Retrieved from https://www.slideshare.net/mongodb/webinar-mongodb-compass-data-navigation-made-easy
  • What Does Build, Ship and Run Any App, Anywhere Really Mean? (n.d.). Retrieved from https://nickjanetakis.com/blog/what-does-build-ship-and-run-any-app-anywhere-really-mean
  • What is MongoDB. (n.d.). Retrieved from https://www.mongodb.com/what-is-mongodb

 
