Strategies for the Analysis of Big Data


29th Mar 2018 Computer Science


Disclaimer: This work has been submitted by a university student.

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of UKEssays.com.

CHAPTER: 1 INTRODUCTION

  1.1 General

The amount of data generated is increasing drastically day by day. “Big data” is the popular term used to describe data whose volume is measured in zettabytes. Governments, companies and many other organizations try to obtain and store data about their citizens and customers in order to know them better and predict their behavior. A prominent example is social networking websites, which generate new data every second; managing such a huge volume of data is one of the major challenges companies face. The difficulty arises because the data stored in data warehouses is in a raw format: to produce usable information from it, the data must be properly analyzed and processed. Many tools are being developed to handle such large amounts of data in a short time. Apache Hadoop is a Java-based programming framework used for processing large data sets in a distributed computing environment. Hadoop is used in systems where multiple nodes are present and can process terabytes of data. Hadoop uses its own file system, HDFS, which facilitates fast data transfer and can sustain node failures without the system failing as a whole. Hadoop uses the MapReduce programming model, which breaks big data down into smaller parts and performs the operations on them. Various technologies work hand in hand to accomplish this task: the Spring for Apache Hadoop (Spring Data Hadoop) framework for the basic foundations and running of the MapReduce jobs, Apache Maven for building the code, REST web services for communication, and lastly Apache Hadoop for distributed processing of the huge dataset.
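The idea of breaking a large input into smaller parts, processing each part independently and combining the results can be illustrated with a minimal Python sketch. It runs on one machine and does not use Hadoop; the function names and the sample lines are assumptions made only for illustration.

```python
from collections import Counter


def split_into_chunks(lines, chunk_size):
    """Break the full input into smaller parts of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def count_words(chunk):
    """Process one chunk independently (the work a single node would do)."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def combine(partial_counts):
    """Merge the per-chunk results into one final result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


# Hypothetical input; a real job would read from HDFS instead.
lines = [
    "big data needs big tools",
    "hadoop processes big data",
    "data is stored in hdfs",
]
chunks = split_into_chunks(lines, chunk_size=2)
result = combine(count_words(chunk) for chunk in chunks)
print(result.most_common(2))
```

Because each chunk is counted without looking at the others, the per-chunk work could in principle be handed to different nodes, which is the property a distributed framework relies on.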

  1.2 Literature Survey

There are many analysis techniques, but the six types of analysis we should know are:

  • Descriptive
  • Exploratory
  • Inferential
  • Predictive
  • Causal
  • Mechanistic
  1.2.1 Descriptive

The descriptive analysis technique is used for statistical calculation and is suited to large data sets. It is used only for univariate and bivariate analysis. It explains the “what, who, when, where” of the data, not the cause. Its limitation is that it cannot help to find what causes a particular motivation, performance or amount. This type of technique is used only for observations and surveys.
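As a small illustration of a univariate descriptive summary, the sketch below uses Python's standard statistics module. The sample values are hypothetical and stand in for one observed variable; the summary describes the data but says nothing about causes.

```python
import statistics

# Hypothetical sample: number of tweets sent per user in a small survey.
tweets_per_user = [3, 7, 7, 2, 9, 4, 7, 5]

summary = {
    "count": len(tweets_per_user),
    "mean": statistics.mean(tweets_per_user),
    "median": statistics.median(tweets_per_user),
    "mode": statistics.mode(tweets_per_user),
    "stdev": statistics.stdev(tweets_per_user),
    "min": min(tweets_per_user),
    "max": max(tweets_per_user),
}
for name, value in summary.items():
    print(f"{name}: {value}")
```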

  1.2.2 Exploratory

Exploratory analysis is the investigation of a problem or case in order to provide insight for research. The research is meant to provide a small amount of information. It may use a variety of methods, such as interviews, group discussions and testing, to gain information. This technique is particularly useful for defining future studies and questions, because exploratory analysis works on existing (old) data sets.

  1.2.3 Inferential

The inferential data analysis technique allows us to study a sample and generalize the findings to the population data set. It can be used for hypothesis testing and is an important part of technical research. Statistics are used to describe the sample and to estimate the effect of independent or dependent variables. This technique shows some error, because the sampling data is never perfectly accurate.
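Generalizing from a sample to a population, together with the sampling error this involves, can be sketched as follows. The population is simulated and every number is an assumption; the interval uses the usual normal approximation for a sample mean.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: daily tweet counts for 10,000 users.
population = [random.gauss(50, 12) for _ in range(10_000)]

# We only observe a sample, not the whole population.
sample = random.sample(population, 200)

sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))

# 95% confidence interval using the normal approximation.
z = statistics.NormalDist().inv_cdf(0.975)
low = sample_mean - z * standard_error
high = sample_mean + z * standard_error

print(f"sample mean: {sample_mean:.2f}")
print(f"95% interval for the population mean: {low:.2f} to {high:.2f}")
print(f"true population mean: {statistics.mean(population):.2f}")
```

The width of the interval is a direct measure of the sampling error: a larger sample gives a smaller standard error and therefore a narrower interval.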

  1.2.4 Predictive

Predictive analysis is one of the most important techniques. It can be used for sentiment analysis and depends on predictive modeling. It is very hard, mainly because it concerns future events. We can use this technique to estimate likelihoods. Companies such as Yahoo, eBay and Amazon use this technique, and they provide publicly available data sets that we can use to perform investigations. Twitter also provides data sets, which we separate into positive, negative and neutral categories.

  1.2.5 Causal

Causal analysis determines cause and effect: we identify the key points of a given cause and the effect of the correlation between variables. Causal analysis is used in the market for in-depth analysis. We can apply it to the selling price of a product and to various parameters such as competition and natural factors. This type of technique is used only in experimental and simulation-based studies; simulation means we use mathematical fundamentals related to a real-life scenario. So we can say that the causal technique examines how a single variable affects the outcome of an activity.

  1.2.6 Mechanistic

This is the last and hardest analysis technique. It is hard because it is used for biological purposes, such as studying human physiology and expanding our knowledge of human disease. In this technique we use biological data sets for analysis; after the investigation is performed, it gives a result about the human disease.

CHAPTER: 2 AREA OF WORK

The Hadoop framework is used by many big companies, such as Google, IBM and Yahoo, for applications such as search engines. In India, the best-known user of Hadoop is the “Aadhaar” scheme.

2.1 Apache Hadoop goes realtime at Facebook.

Facebook uses the Hadoop ecosystem, which is a combination of HDFS and MapReduce. HDFS is the Hadoop Distributed File System, and MapReduce jobs can be written as scripts in languages such as Java, PHP and Python. These are the two components of Hadoop: HDFS is used for storage, and MapReduce reduces an immense program to a simple form. Facebook uses Hadoop because of its fast response time and low latency. On Facebook, millions of users are online at a time; if they all shared a single server, the workload would be so high that problems such as server crashes and downtime would occur, so Facebook uses the Hadoop framework to tolerate this type of problem. The first big advantage of Hadoop is its distributed file system, which helps to achieve fast access times. Facebook requires very high throughput and large storage capacity; for these workloads, large amounts of data are read from and written to disk sequentially. Facebook data is unstructured, so it cannot be managed in rows and columns, which is another reason to use a distributed file system. In a distributed file system, data access is fast and data recovery is good: if one disk (data node) goes down, another one keeps working, so we can still easily access the data we want. Facebook generates a huge amount of data, and it is real-time data that changes within microseconds. Hadoop is used both to manage and to mine this data. Facebook uses a new generation of storage: MySQL is good for read performance but suffers from low write throughput, whereas Hadoop offers fast read and write operations.
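The recovery behavior described above, where a block remains readable after one data node goes down, can be illustrated with a toy model. This is only a sketch: the classes and functions are hypothetical and are not the HDFS API, and the replica count of 2 is an assumption chosen for the example.

```python
class DataNode:
    """A toy stand-in for an HDFS data node (not the real HDFS API)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.alive = True


def write_block(nodes, block_id, data, replicas=2):
    """Store the same block on several nodes so one failure loses nothing."""
    for node in nodes[:replicas]:
        node.blocks[block_id] = data


def read_block(nodes, block_id):
    """Return the block from the first live node that holds a copy."""
    for node in nodes:
        if node.alive and block_id in node.blocks:
            return node.name, node.blocks[block_id]
    raise IOError(f"no live replica of block {block_id}")


nodes = [DataNode("node-1"), DataNode("node-2"), DataNode("node-3")]
write_block(nodes, "block-0", "user posts ...")

nodes[0].alive = False  # simulate one data node going down
served_by, data = read_block(nodes, "block-0")
print(f"block-0 served by {served_by}")
```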

2.2 Yelp uses AWS and Hadoop

Yelp originally depended upon giant RAIDs (Redundant Arrays of Independent Disks) to store their logs, along with a single-node local instance of Hadoop. When Yelp made the move to Amazon Elastic MapReduce, they replaced the RAIDs with Amazon Simple Storage Service (Amazon S3) and immediately transferred all Hadoop jobs to Amazon Elastic MapReduce. Yelp uses Amazon S3 to store a huge amount of daily logs and photos, generating around 10GB of logs per hour. The company also uses Amazon Elastic MapReduce to power approximately 30 separate batch scripts, most of those processing the logs. Features powered by Amazon Elastic MapReduce include:

  1. People Who Viewed this Also Viewed
  2. Review highlights
  3. Auto complete as you type on search
  4. Search spelling suggestions
  5. Top searches
  6. Ads

Yelp uses MapReduce, which is about the simplest way to break a big job down into little pieces. Basically, mappers read lines of input and emit (key, value) pairs. Each key and all of its corresponding values are sent to a reducer.
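The mapper and reducer roles can be sketched in plain Python, with an explicit grouping step standing in for what the framework does between the two phases. The log format and the search terms are assumptions for illustration; this is not Yelp's code and it runs without Hadoop.

```python
from collections import defaultdict


def mapper(line):
    """Read one log line and emit (key, value) pairs."""
    # Assumed log format: "<user> <search term>", one search per line.
    _user, term = line.split(" ", 1)
    yield term.strip().lower(), 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Receive one key with all of its values and produce a single result."""
    return key, sum(values)


log_lines = [
    "u1 pizza",
    "u2 sushi",
    "u3 Pizza",
    "u1 coffee",
    "u4 pizza",
]
pairs = [pair for line in log_lines for pair in mapper(line)]
counts = dict(reducer(key, values) for key, values in shuffle(pairs).items())
top_searches = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(top_searches)
```

Counting search terms this way corresponds to a feature such as "Top searches": each mapper sees only its own lines, and each reducer sees only one key with all of its values.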

CHAPTER: 3 THE PROPOSED SCHEMES

We overcome the problem of analyzing big data by using Apache Hadoop. The processing is done in several steps, which include creating a server of the required configuration using Apache Hadoop on a single-node cluster. Data on the cluster is stored using MongoDB, which stores data in the form of key: value pairs, an advantage over a relational database for managing large amounts of data. Various languages, such as Python, Java and PHP, allow us to write scripts that store data from Twitter in MongoDB collections; the stored data is then exported to JSON, CSV and text files, which can be processed in Hadoop as per the user’s requirements. Hadoop jobs are written in a framework, and these jobs implement MapReduce programs for data processing. Six jobs are implemented for data processing in a location-based social networking application. A record of the whole session has to be maintained in a log file using aspect-oriented programming in Python. The output produced after data processing in the Hadoop job has to be exported back to the database. The old values in the database have to be updated immediately after processing, to avoid loss of valuable data. The whole process is automated by using Python scripts and tasks written in a tool for executing JAR files.
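One common way to keep a session log in an aspect-oriented style in Python is a decorator that wraps each pipeline step, so the logging concern stays outside the step's own code. The sketch below is an assumption about how this could look, not the authors' implementation; the step name export_tweets is hypothetical, and an in-memory stream replaces the log file.

```python
import functools
import io
import logging

# In a real run this would be a logging.FileHandler pointing at a log file;
# an in-memory stream is used here so the sketch has no side effects.
log_stream = io.StringIO()
logger = logging.getLogger("session")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(log_stream))


def logged(func):
    """Record the start and end of every call without touching the job code."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("start %s", func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            logger.info("end %s", func.__name__)
    return wrapper


@logged
def export_tweets(tweets):
    """Hypothetical pipeline step: here it only counts the records."""
    return len(tweets)


exported = export_tweets([{"text": "hello"}, {"text": "world"}])
print(log_stream.getvalue())
```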

CHAPTER: 4 METHOD AND MATERIAL

4.1 INSTALL HADOOP FRAMEWORK

Install and configure the Hadoop framework. After installation, we perform operations using MapReduce and the Hadoop Distributed File System.

4.1.1 Supported Platforms

  • Linux LTS (12.4): an open source operating system. Hadoop supports many platforms, but Linux is the best one.
  • Win32/64: Hadoop supports both 32-bit and 64-bit Windows, but Win32 is not supported as a production platform.

4.1.2 Required Software

  1. Any version of the JDK (Java)
  2. Secure Shell (SSH) installed on the local host, which is used for data communication.
  3. MongoDB (database)

These requirements are for a Linux system.

4.1.4 Prepare the Hadoop Cluster

Extract the downloaded Hadoop file (hadoop-0.23.10). In the distribution, edit the file csbin/hadoop-env.sh and set the Java and Hadoop environment variables.

Try the following command:

$ sbin/hadoop

A Hadoop cluster can run in one of three modes:

  • Local Standalone Mode
  • Pseudo Distributed Mode
  • Fully Distributed Mode

Local Standalone Mode

In local standalone mode, Hadoop is installed in its default configuration and runs in non-distributed mode.

Pseudo-Distributed Mode

In pseudo-distributed mode, Hadoop runs on a single-node cluster, with each Hadoop daemon running in a separate Java process. This is the mode I configured and used for the operations in this work.

Configuration

Hadoop is configured by editing a few files: core-site.xml, mapred-site.xml and hdfs-site.xml. After changing these files, Hadoop can be run.
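Hadoop configuration files are XML documents made of name and value properties. The Python sketch below generates two such fragments for a single-node setup. The property names fs.defaultFS and dfs.replication are standard Hadoop settings that normally live in core-site.xml and hdfs-site.xml, but the host, port and replication value shown are assumptions, and the helper build_config is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_config(properties):
    """Build a Hadoop-style <configuration> document from a dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Assumed values for a single-node setup; adjust host, port and replication.
core_site = build_config({"fs.defaultFS": "hdfs://localhost:9000"})
hdfs_site = build_config({"dfs.replication": 1})

print(core_site)
print(hdfs_site)
```

A replication factor of 1 only makes sense on a single node, where there is no second data node to hold a copy.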

Fully-Distributed Mode

This mode is used for setting up a fully distributed, non-trivial cluster.

4.2 Data Collection

The Twitter data collection program captures three attributes.

1) User id

2) Twitter user (who sent the tweet)

3) Tweet text

The Twitter ID is used to extract the tweets sent to the specified ID. In our analysis, we collect the tweets sent to Sachin Tendulkar, using the Twitter APIs. The collected Twitter data is organized around the key attributes that we mine: user ID, tweet text and tweet user (who sent the tweet). All key attributes are saved in MongoDB, the database where all tweets are stored. After collecting all the data, we export it to CSV and text files, which are used for the analysis.

Fig. 1. Twitter data collection procedure
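The export step of the collection procedure can be sketched with Python's standard csv and json modules. The records below are hypothetical stand-ins for documents read from the tweet collection, and the field names are assumptions; to write real files, the in-memory buffer would be replaced by files opened with open().

```python
import csv
import io
import json

# Stand-in for documents read from the MongoDB tweet collection.
tweets = [
    {"user_id": 101, "tweet_user": "fan_one", "tweet_text": "Great innings today!"},
    {"user_id": 102, "tweet_user": "fan_two", "tweet_text": "Well played, as always."},
]
FIELDS = ["user_id", "tweet_user", "tweet_text"]


def to_csv(records):
    """Write the records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json_lines(records):
    """Write one JSON document per line, a convenient input for Hadoop jobs."""
    return "\n".join(json.dumps(record) for record in records)


def to_text(records):
    """Write only the tweet text, one tweet per line."""
    return "\n".join(record["tweet_text"] for record in records)


csv_data = to_csv(tweets)
print(csv_data)
print(to_json_lines(tweets))
print(to_text(tweets))
```

Using csv.DictWriter rather than joining fields by hand matters here, because tweet text often contains commas and quotes that must be escaped.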

Extracting Twitter data using Python

To run the Python code, we first create a Twitter developer account, which provides a consumer key, consumer secret, access token and access token secret. These credentials are required by the Twitter API, and with them we can fetch all the tweets. The code then initializes a connection to the MongoDB instance. In this code, tweet db is the database name, and MongoDB stores the tweets in a collection.
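The storage half of such a script can be sketched as below. This is not the authors' code: the raw tweet layout follows the classic Twitter REST API tweet object and should be treated as an assumption, the database name tweetdb and the helper names are hypothetical, and fetching the tweets is left out entirely. The connect function relies on the third-party pymongo package and a running MongoDB server, so the demo at the bottom uses a small in-memory stand-in instead.

```python
def build_document(raw_tweet):
    """Keep only the three attributes the study mines from each tweet."""
    # Field names follow the classic Twitter REST API tweet object;
    # treat them as an assumption if your client returns another shape.
    return {
        "user_id": raw_tweet["user"]["id"],
        "tweet_user": raw_tweet["user"]["screen_name"],
        "tweet_text": raw_tweet["text"],
    }


def store_tweets(collection, raw_tweets):
    """Insert the reduced documents and return how many were stored."""
    documents = [build_document(tweet) for tweet in raw_tweets]
    if documents:
        collection.insert_many(documents)
    return len(documents)


def connect(uri="mongodb://localhost:27017", database="tweetdb", name="tweet"):
    """Open the MongoDB collection (requires the third-party pymongo package)."""
    from pymongo import MongoClient  # imported lazily so the rest runs without it
    return MongoClient(uri)[database][name]


class InMemoryCollection:
    """Tiny stand-in for a MongoDB collection, used only for this demo."""

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        self.documents.extend(documents)


raw = [
    {"id": 1, "text": "Congratulations!",
     "user": {"id": 101, "screen_name": "fan_one"}},
]
collection = InMemoryCollection()
stored = store_tweets(collection, raw)
print(stored, collection.documents)
```

With a MongoDB server available, passing connect() instead of the in-memory stand-in to store_tweets would write the same documents to the real collection.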

>show dbs

This command lists all the databases present in MongoDB.

>use <database name>

Selects the particular database we want to use.

>db

Shows which database is currently open.

>show collections

This command shows all collections, which are the equivalent of tables.

>db.tweet.find()

Shows all the data stored in the tweet collection of the current database.

>db.tweet.find().count()

Shows how many tweets are stored in the database.

CHAPTER: 5 SENTIMENT ANALYSIS OF BIG DATA

The last and most important part of the data analysis works on the data extracted from Twitter. Supervised and unsupervised techniques are the types of techniques used for the analysis of “big data”. Sentiment analysis has come to play a key role in text mining applications for customer relationship management, brand and product positioning, consumer attitude detection and market research. Recent advances have opened several promising new directions for developing and advancing sentiment analysis research. Sentiment classification identifies whether the semantic orientation of a given text is positive, negative or neutral. Most existing approaches rely on supervised learning models, and they classify only positive and negative opinions. Three machine learning techniques, Naïve Bayes, SVM and Maximum Entropy classification, do not perform as well on sentiment classification as on topic-based classification. Sentiment analysis techniques may help researchers to study opinions on the Internet. They help to find out whether a given text is subjective or objective, as well as whether a subjective passage contains positive or negative opinions. Supervised machine learning techniques use labeled documents for classification. The machine learning approach treats the opinion classification problem as a topic-based text classification problem. In a comparison between Naïve Bayes, Maximum Entropy and SVM for sentiment classification, the best precision is achieved using SVM.
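To make the supervised approach concrete, here is a minimal sketch of a Naïve Bayes sentiment classifier in plain Python, using word counts with add-one smoothing. The training sentences and labels are invented for illustration, and a real study would need far more labeled data; this is a toy, not the classifier used in the work described here.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability for the text."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in text.lower().split():
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labeled tweets; labels are assigned by hand.
training_data = [
    ("what a great innings", "positive"),
    ("love this brilliant player", "positive"),
    ("great match love it", "positive"),
    ("terrible shot very poor", "negative"),
    ("poor form bad match", "negative"),
    ("the match starts at noon", "neutral"),
    ("the team arrives at the ground", "neutral"),
]
model = train(training_data)
print(classify("great player", model))
print(classify("poor shot", model))
```

Working with log-probabilities instead of multiplying raw probabilities keeps the scores numerically stable when texts are long.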

CHAPTER: 6 SCREENSHOT

Browser view:

This view is used only in the browser and shows the log files of the data node and the name node.

Hadoop cluster running:

This screenshot shows the data node and the name node running, which means the single-node Hadoop cluster is properly installed and configured.

Database view:

This screenshot shows the extracted Twitter data stored in MongoDB, the database where all tweets are stored.

Number of tweets stored in the database:

CHAPTER: 7 CONCLUSIONS

We have developed an architecture that uses Python and MongoDB in combination with the Twitter APIs to study tweets sent to a specific user. We use our architecture to classify the tweets as positive, negative or neutral, and to analyze the number of retweets and the names and IDs of the users sending the tweets. The data we collect and analyze can be used in conjunction with available results from queuing theory to study the transient and steady-state performance of social networks. The proposed architecture can also be used to monitor the correlation between user behaviors and their locations. The outcomes obtained can be applied to study the development of the population under research. Sentiment analysis can be carried out on large data sets using a Naïve Bayes classifier with the Hadoop ecosystem. We configured Hadoop on a single-node cluster and also showed how to fetch or extract Twitter data using an API from any language; the Hadoop cluster file system can do a decent job even in the big data analysis domain.
