Introduction to Big Data Technologies


Research and development in database technology over the past decade has been characterized by the pressing need for better application support beyond the traditional setting, in which primarily high volumes of simply structured data had to be processed efficiently. Advanced database technology provides new and much-needed solutions in many important areas, yet these same solutions often require careful consideration in order to avoid introducing new problems. There have been times when an alternative database technology threatened to take a piece of the action, such as object databases in the 1990s, but these alternatives never got anywhere. After such a long period of relational dominance, the current excitement about non-relational databases comes as a surprise.

The RASP Pvt. Ltd. firm is a startup company that is experiencing a huge burst in the amount of data to be handled, along with a rapidly growing customer base. This exponential growth in the business must be supported with mechanisms that handle an enormous amount and variety of data, provide scalability and availability, and deliver increased performance. Besides this, reliability has to be ensured by enabling automatic failover and recovery. We aim to provide solutions which can help overcome these hurdles.

Our approach is to explore new technologies besides the mature and prevalent traditional relational database systems. We researched various non-relational database technologies and the features and functionalities they offer. The most important aspect of these new technologies is polyglot persistence, that is, using different databases for different needs within an organization. We attempt to provide a few solutions by combining the powerful features of these technologies into an integrated approach to the problem at hand.

Outline

Introduction

History

Current Trends

Merits

Demerits

In Depth Problem Review

Storage

Computation

Performance

Scalability

Availability

Introduction to Big Data

What Is Big Data

Hadoop

Map Reduce

HDFS

NoSQL Ecosystem

Document Oriented

Merits

Demerits

Case Study – MongoDB

Key Value

Merits

Demerits

Case Study – Azure Table Storage

Column Store

Merits

Demerits

Case Study – Cassandra

Graph

Merits

Demerits

Case Study – Neo4j

Solution Approach

NoSQL Methods to MySQL

Problem Addressed

Challenges

MongoDB & Hadoop

Problem Addressed

Challenges

Cassandra & Hadoop

Problem Addressed

Challenges

Azure Table Storage & Hadoop

Problem Addressed

Challenges

Neo4j

Problem Addressed

Challenges

Conclusion

References

1. Introduction

1.1 History

Database systems have been in use since the 1960s and have evolved considerably over the last five decades. Relational database concepts were introduced in the 1970s, and the RDBMS was born with such strong advantages and usability that it has been sustained for almost 40 years now. In the 1980s, structured query languages were introduced, which further enriched the use of traditional database systems by making it possible to retrieve useful data in seconds with a two-line query. More recently, the internet has been used to empower databases, giving rise to distributed database systems.


1.2 Current Trends

Databases have become an inevitable part of the IT industry and have their own significance in every application model. Almost all applications have a separate data layer that defines how data is stored and retrieved, and almost every language provides a mechanism to access the database. The scope of the IT industry is expanding with new technologies such as mobile computing, and new types of databases are being introduced frequently. Storage capacity, which was an issue until recently, has largely been addressed by cloud technologies. These new trends also introduce new challenges for traditional database systems, such as very large volumes of data, dynamically created data, storage issues, and retrieval problems.

1.2.1 Merits

The main advantage of a relational database system is that it provides ACID properties and allows concurrency. The database designer takes care of redundancy control and data integrity by applying normalization techniques. Data sharing and transaction control are added advantages of an RDBMS. Data security is also provided to some extent, with built-in encryption facilities to protect data. Backup and recovery subsystems provided by the DBMS help to recover from data loss caused by hardware failure. Structured query languages provide easy retrieval and easy management of the database, and multiple views can be supported for different users.

1.2.2 Demerits

Database design is the most important part of the system, and it is difficult to design a database that delivers all of the advantages mentioned above; the process is complex and hard to understand. Sometimes the cost of retrieval increases after normalization. Security is limited, and it is costly to manage database servers. A single server failure can badly affect the entire business. The large amount of data generated on a regular basis is difficult to manage through a traditional system, and some types of data, such as media files, are still not well supported.

1.3 In Depth Problem Review

1.3.1 Storage

Ever-increasing data has always been a challenge for the IT industry, and this data is often in unstructured formats. A traditional database system is not capable of storing such large amounts of unstructured data; as the volume increases, it becomes difficult to structure, design, index and retrieve the data. Traditional database systems also use physical servers to store data, which can lead to single points of failure, and maintaining physical database servers is costly. Recovery is also complicated and time-consuming in a traditional database system.

1.3.2 Performance

Normalization often affects performance. A highly normalized database contains a large number of tables, with many keys and foreign keys created to relate these tables to each other. Multiple joins are then needed to retrieve a record and its related data, and queries containing multiple joins deteriorate performance. Updates and deletes also require many reads and writes. The designer of the database should consider all of these factors while designing the database.

1.3.3 Scalability

In the traditional database model, data structures are defined when the table is created. For some data, especially text, it is hard to predict the length: if you allocate more length than needed, the space is wasted; if you allocate less, the data may be silently truncated to whatever fits in that length, without any error. You also have to be very specific with the data type. If you try to store a floating-point value in an integer column, and other fields are calculated from that field, then all of that data can be affected. In addition, traditional databases focus more on performance than on scaling out.

1.3.4 Availability

As mentioned before, data is stored on database servers. Large companies have their own data stores located worldwide, and to increase performance the data is split and stored at different locations. Routine tasks such as daily backups are conducted to protect the data. If for any reason (natural calamities, fire, flood, etc.) data is lost, the application will be down, since restoring the data takes time.

2. Introduction to Big Data

2.1 What is Big Data?

Big data is a term used to describe the exponential growth, availability, reliability and use of structured, semi-structured and unstructured data. There are four dimensions to Big Data:

Volume: Data is generated from various sources and collected in massive amounts. Social websites, forums and transactional data retained for later use generate terabytes and petabytes of data. We need to store this data in a meaningful way in order to extract value from it.

Velocity: Velocity is not just about producing data faster; it also means processing data fast enough to meet the requirement. RFID tags, for example, generate loads of data and demand systems that can ingest and process it quickly. Dealing with that much data while improving on velocity is difficult for many organizations.

(Source: http://www.sas.com/big-data/index.html)

Variety: Today, organizations collect or generate data from many sources, including traditional and hierarchical databases created by OLAP systems and end users. There are also unstructured and semi-structured data such as email, video, audio, transactional data, forums, text documents and meter-collected data. Most of this data is not numeric, but it is still used in making decisions.

(Source: http://www.sas.com/big-data/index.html)

Veracity: As the variety and number of sources grow, it becomes difficult for decision makers to trust the information they use for analysis, so ensuring trust in Big Data is a challenge.

2.2 Hadoop

Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. It was derived from Google's MapReduce and Google File System (GFS) papers. It is written in the Java programming language and supports applications that run on large clusters, providing them with reliability. Hadoop implements MapReduce and uses the Hadoop Distributed File System (HDFS). It is considered reliable and available because both MapReduce and HDFS are designed to handle node failures, keeping data available at all times.

Some of the advantages of Hadoop are as follows:

Cheap and fast

Scales to large amounts of storage and computation

Flexible with any type of data

Works with many programming languages

Figure 1: Multi-node cluster (source: http://en.wikipedia.org/wiki/File:Hadoop_1.png)

2.2.1 MapReduce:

MapReduce was developed for processing large amounts of raw data, for example crawled documents or web request logs. This data is distributed across thousands of machines so that it can be processed faster; the distribution implies parallel processing, with each machine computing the same problem over a different part of the data set. MapReduce is an abstraction that allows engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.

Figure 2: MapReduce Implementation (Source: http://code.google.com/edu/parallel/mapreduce-tutorial.html)

The MapReduce library in the user program first splits the input files into M pieces, each typically 16 MB to 64 MB, and then starts up many copies of the program on the cluster.

One of these copies is the master; the rest are workers. The master has M map tasks and R reduce tasks to assign to the worker nodes. It picks idle workers and assigns each one a map task or a reduce task.

A worker assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate pairs produced by the Map function are buffered in memory.

Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these partitions on the local disk are passed back to the master, which forwards them to the workers that perform the reduce function.

A worker assigned a reduce task uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all of its intermediate data, it sorts the data by the intermediate keys so that all occurrences of the same key are grouped together.

The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.

After successful completion, the output of the MapReduce execution is available in the R output files.
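To make the map and reduce phases concrete, the following is a minimal word-count sketch written as two Hadoop Streaming scripts in Python. The file names wc_mapper.py and wc_reducer.py are illustrative assumptions, not part of the original text; a job would then be launched with the Hadoop Streaming jar, passing the two scripts as the mapper and reducer together with HDFS input and output paths (the exact jar location and flags depend on the Hadoop version).

#!/usr/bin/env python
# wc_mapper.py: the Map step. Reads raw text lines from stdin and emits one
# tab-separated "word<TAB>1" pair per word on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))

#!/usr/bin/env python
# wc_reducer.py: the Reduce step. Hadoop Streaming delivers the mapper output
# sorted by key, so all pairs for the same word arrive consecutively and can
# be summed with a single running counter.
import sys

current_word = None
current_count = 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word = word
        current_count = int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))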

2.2.2 Hadoop Distributed File System (HDFS):

The Hadoop Distributed File System is a portable, scalable and distributed file system, written in Java for the Hadoop framework. An HDFS cluster consists of a single NameNode, which manages the file system metadata, and a cluster of DataNodes, which store the actual data. Each DataNode serves blocks of data over the network using a block protocol specific to HDFS. Since the file system operates over the network, it uses the TCP/IP layer for communication, and clients use RPC to communicate with the nodes. HDFS stores large files, typically in blocks of 64 MB, across multiple machines. It achieves reliability by replicating the data across multiple hosts, so it does not need RAID on the host servers. With the default replication value of 3, data is stored on three nodes: two on the same rack and one on a different rack. The nodes communicate with each other to rebalance data, move copies around and keep the replication of data high.
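As a rough illustration of how an application can write and read HDFS files, the following Python sketch uses the third-party hdfs package, which talks to the NameNode over WebHDFS. The NameNode URL, user name and file paths are illustrative assumptions only.

# Minimal sketch using the third-party "hdfs" Python package (a WebHDFS client).
# The NameNode address, user and paths below are illustrative assumptions.
from hdfs import InsecureClient

client = InsecureClient('http://namenode:50070', user='hadoop')

# Write a file; HDFS splits it into blocks and replicates each block
# (replication factor 3 by default) across DataNodes.
client.write('/data/events/2015-01-01.log',
             data=b'user=42 action=view page=/home\n',
             overwrite=True)

# Read the file back; the client streams the blocks from the DataNodes
# that hold them.
with client.read('/data/events/2015-01-01.log') as reader:
    content = reader.read()

print(content)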

HDFS has high-availability capabilities: the main metadata server (the NameNode) can be failed over, manually or automatically, to a backup in the event of failure. The file system also has a Secondary NameNode, which connects to the primary NameNode to build snapshots of the primary's directory information. These snapshots are stored in local or remote directories and can be used to restart a failed primary NameNode without replaying the entire journal of file-system actions. Because the NameNode is the single point for storing and managing metadata, it can become a bottleneck when accessing huge numbers of small files. HDFS Federation addresses this by serving multiple namespaces through separate NameNodes.

Another advantage of HDFS is the communication between the JobTracker and the TaskTrackers about data location. Knowing where the data resides, the JobTracker assigns map or reduce jobs to TaskTrackers accordingly: if node P holds data (l, m, n) and node Q holds data (a, b, c), then node Q is assigned the map or reduce tasks for a, b, c and node P is assigned those for l, m, n. This helps reduce unwanted traffic on the network.

Figure 3: HDFS Architecture (Source: http://hadoop.apache.org/docs/r0.20.2/images/hdfsarchitecture.gif)

2.3 NoSQL Ecosystem

NoSQL refers to non-relational database management systems that differ from traditional relational database management systems in significant ways. NoSQL systems are designed for distributed data stores that require large-scale data storage; they are schema-less and scale horizontally. Relational databases rely upon very hard-and-fast, structured rules to govern transactions; these rules are encoded in the ACID model, which requires that the database always preserve atomicity, consistency, isolation and durability in each transaction. NoSQL databases instead follow the BASE model, which offers three looser guarantees: basic availability, soft state and eventual consistency.

The term NoSQL was coined by Carlo Strozzi in 1998 for his open-source, lightweight database which had no SQL interface. Later, in 2009, Eric Evans, a Rackspace employee, reused the term for databases which are non-relational, distributed and do not guarantee atomicity, consistency, isolation and durability. In the same year, NoSQL was discussed extensively at the "no:sql(east)" conference held in Atlanta, USA, and eventually NoSQL saw unprecedented growth.

Two primary reasons to consider NoSQL are: to handle data access with sizes and performance demands that require a cluster; and to improve the productivity of application development by using a more convenient data interaction style. The common characteristics of NoSQL are:

Not using the relational model

Running well on clusters

Open-source

Built for 21st century web estates

Schema-less

Each NoSQL solution uses a different data model, which can be put into four widely used categories in the NoSQL ecosystem: key-value, document, column-family and graph. Of these, the first three share a common characteristic of their data models called aggregate orientation. Next, we briefly describe each of these data models.

2.3.1 Document Oriented

The main concept of a document-oriented database is the notion of a "document". The database stores and retrieves documents, which encapsulate and encode data in standard formats or encodings such as XML, JSON or BSON. These documents are self-describing, hierarchical tree data structures, and the database can offer different ways of organizing and grouping them:

Collections

Tags

Non-visible Metadata

Directory Hierarchies

Documents are addressed in the database via a unique key which represents the document. Also, beyond a simple key-document lookup, the database offers an API or query language that allows retrieval of documents based on their content.

2.3.1.1 Merits

Intuitive data structure

Simple “natural” modeling of requests with flexible query functions

Can act as a central data store for event storage, especially when the data captured by the events keeps changing.

With no predefined schemas, they work well in content management systems or blogging platforms.

Can store data for real-time analytics; since parts of the document can be updated, it is easy to store page views and new metrics can be added without schema changes.

Provides a flexible schema and the ability to evolve data models without expensive database refactoring or data migration, which suits e-commerce applications.

2.3.1.2 Demerits

Higher hardware demands, because queries are more dynamic and are partly executed without prepared data.

Redundant storage of data (denormalization) in favor of higher performance.

Not suitable for atomic cross-document operations.

Since the data is saved as an aggregate, if the design of an aggregate is constantly changing, aggregates have to be saved at the lowest level of granularity. In this case, document databases may not work.

2.3.1.3 Case Study – MongoDB

MongoDB is an open-source document-oriented database system developed by 10gen. It stores structured data as JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. Language support includes Java, JavaScript, Python, PHP and Ruby, and sharding is supported via configurable data fields. Each MongoDB instance has multiple databases, and each database can have multiple collections. When a document is stored, we have to choose which database and collection the document belongs in.

Consistency in a MongoDB database is configured by using replica sets and choosing to wait for the writes to be replicated to a given number of slaves. Transactions at the single-document level are atomic: a write either succeeds or fails. Transactions involving more than one operation are not possible, although there are a few exceptions. MongoDB implements replication, providing high availability using replica sets, in which two or more nodes participate in asynchronous master-slave replication. MongoDB has a query language expressed via JSON, with a variety of constructs that can be combined to create queries; we can query the data inside a document without having to retrieve the whole document by its key and then introspect it. Scaling in MongoDB is achieved through sharding: the data is split by a certain field and then moved to different Mongo nodes, and it is dynamically moved between nodes to ensure that shards are always balanced. We can add more nodes to the cluster and increase the number of writable nodes, enabling horizontal scaling for writes.
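The following Python sketch, using the official pymongo driver, shows what schema-less document storage and content-based querying look like from an application. The connection string, database, collection and field names are illustrative assumptions.

# Minimal sketch using the official Python driver (pymongo).
# Connection string, database, collection and field names are illustrative.
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
orders = client['shop']['orders']    # database "shop", collection "orders"

# Documents are schema-less JSON/BSON; two documents need not share fields.
orders.insert_one({'customer': 'alice',
                   'items': [{'sku': 'A1', 'qty': 2}],
                   'total': 30})
orders.insert_one({'customer': 'bob', 'total': 12, 'coupon': 'WELCOME'})

# Query on document content without first retrieving documents by key.
for doc in orders.find({'total': {'$gt': 20}}):
    print(doc['customer'], doc['total'])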

2.3.2 Key-value

A key-value store is a simple hash table, primarily used when all access to the database is via a primary key. Key-value stores allow the application to store its data in a schema-less way; the data could be stored as a datatype of a programming language or as an object. The following types exist: eventually consistent key-value stores, hierarchical key-value stores, hosted services, key-value caches in RAM, ordered key-value stores, multivalue databases, tuple stores and so on.

Key-value stores are the simplest NoSQL data stores to use from an API perspective. The client can get or put the value for a key, or delete a key from the data store. The value is a blob that is stored without the store knowing what is inside; it is the responsibility of the application to understand what is stored.
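The entire API surface of a key-value store fits in a few lines. The Python sketch below is only a toy in-memory illustration of the get/put/delete contract, not any particular product; the class and key names are illustrative assumptions.

# Toy in-memory illustration of the key-value contract: the store treats each
# value as an opaque blob, and only the application knows how to interpret it.
import json

class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value          # value is an opaque blob (bytes)

    def get(self, key):
        return self._data[key]

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
# The application serializes and deserializes; the store never inspects the value.
store.put('session:42', json.dumps({'user': 'alice', 'cart': ['A1']}).encode())
session = json.loads(store.get('session:42'))
store.delete('session:42')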

2.3.2.1 Merits

High and predictable performance.

Simple data model.

Clear separation of saving from application logic (because of lacking query language).

Suitable for storing session information.

User profiles, product profiles, preferences can be easily stored.

Best suited for shopping cart data and other E-commerce applications.

Can be scaled easily since they always use primary-key access.

2.3.2.2 Demerits

Limited range of functions

High development effort for more complex applications

Not the best solution when relationships between different sets of data are required.

Not suited for multi operation transactions.

There is no way to inspect the value on the database side.

Since operations are limited to one key at a time, there is no way to operate upon multiple keys at the same time.

2.3.2.3 Case Study – Azure Table Storage

For structured forms of storage, Windows Azure provides Tables, which hold entities made up of typed key-value pairs (properties). Table storage uses a NoSQL model based on key-value pairs for querying structured data that does not reside in a typical database. A table is a bag of typed properties that represents an entity in the application domain, and data stored in Azure tables is partitioned horizontally and distributed across storage nodes for optimized access.

Every table has a property called the partition key, which defines how data in the table is partitioned across storage nodes: rows that have the same partition key are stored in the same partition. In addition, tables can also define row keys, which are unique within a partition and optimize access to a row within that partition. When present, the pair {partition key, row key} uniquely identifies a row in a table. Access to the Table service is through REST APIs.
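To make the partition key and row key model concrete, here is a minimal sketch using the azure-data-tables Python SDK. The connection string, table name and entity properties are illustrative assumptions, and the table is assumed to already exist.

# Sketch using the azure-data-tables Python SDK; connection string, table name
# and entity properties are illustrative assumptions. The table is assumed to
# exist already (it can be created with table.create_table()).
from azure.data.tables import TableClient

table = TableClient.from_connection_string(conn_str="<connection string>",
                                           table_name="Customers")

# PartitionKey decides which storage partition holds the row;
# RowKey must be unique within that partition.
entity = {
    "PartitionKey": "UK",
    "RowKey": "customer-0001",
    "Name": "Alice",
    "Tier": "gold",
}
table.create_entity(entity=entity)

# {PartitionKey, RowKey} uniquely identifies the row, so point reads are cheap.
fetched = table.get_entity(partition_key="UK", row_key="customer-0001")
print(fetched["Name"])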

2.3.3 Column Store

Column-family databases store data in column families as rows that have many columns associated with a row key. These stores allow storing data with a key mapped to values, where the values are grouped into multiple column families, each column family being a map of data. Column families are groups of related data that are often accessed together.

The column-family model is a two-level aggregate structure. As with key-value stores, the first key is often described as a row identifier, picking out the aggregate of interest. The difference with column-family structures is that this row aggregate is itself formed of a map of more detailed values, and these second-level values are referred to as columns. The model allows accessing the row as a whole, while operations also allow picking out a particular column.
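The two-level aggregate structure can be pictured as a map of maps. The Python sketch below is only an illustration of the data model, not a driver for any particular store; the row key, column-family names and values are illustrative assumptions.

# Illustration of the column-family model as a two-level map:
# row key -> column family -> {column name: column value}.
rows = {
    "user:1234": {                       # first level: the row key
        "profile": {                     # column family "profile"
            "name": "Alice",
            "city": "London",
        },
        "visits": {                      # column family "visits"
            "2015-01-01": "3",
            "2015-01-02": "7",
        },
    },
}

# Read the whole row, or pick out a single column.
whole_row = rows["user:1234"]
city = rows["user:1234"]["profile"]["city"]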

2.3.3.1 Merits

Designed for performance.

Native support for persistent views, in contrast to plain key-value stores.

Sharding: Distribution of data to various servers through hashing.

More efficient than row-oriented systems during aggregation of a few columns from many rows.

Column-family databases with their ability to store any data structures are great for storing event information.

Allows storing blog entries with tags, categories, links, and trackbacks in different columns.

Can be used to count and categorize visitors of a page in a web application to calculate analytics.

Provides a functionality of expiring columns: columns which, after a given time, are deleted automatically. This can be useful in providing demo access to users or showing ad banners on a website for a specific time.

2.3.3.2 Demerits

Limited query options for data

High maintenance effort when changing existing data, because all lists have to be updated.

Less efficient than row-oriented systems when accessing many columns of a row.

Not suitable for systems that require ACID transactions for reads and writes.

Not good for early prototypes or initial tech spikes as the schema change required is very expensive.

2.3.3.3 Case Study – Cassandra

A column is the basic unit of storage in Cassandra. A Cassandra column consists of a name-value pair, where the name behaves as the key. Each of these key-value pairs is a single column and is stored with a timestamp value, which is used to expire data, resolve write conflicts, deal with stale data, and so on. A row is a collection of columns attached or linked to a key, and a collection of similar rows makes a column family. Each column family can be compared to a container of rows in an RDBMS table, where the key identifies the row and the row consists of multiple columns. The difference is that various rows do not need to have the same columns, and columns can be added to any row at any time without having to add them to other rows.

By design Cassandra is highly available, since there is no master in the cluster and every node is a peer. A write operation in Cassandra is considered successful once it is written to the commit log and to an in-memory structure known as a memtable. While a node is down, the data that was supposed to be stored by that node is handed off to other nodes; as the node comes back online, the changes made to the data are handed back to it. This technique, known as hinted handoff, allows faster restoration of failed nodes. In Cassandra, a write is atomic at the row level, which means inserting or updating columns for a given row key is treated as a single write and will either succeed or fail. Cassandra has a query language that supports SQL-like commands, known as the Cassandra Query Language (CQL); we can use CQL commands to create a column family. Scaling in Cassandra is done by adding more nodes. As no single node is a master, adding nodes to the cluster improves its capacity to support more writes and reads. This allows for maximum uptime, as the cluster keeps serving requests from clients while new nodes are being added.
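The following Python sketch shows creating a column family and writing and reading a row with CQL through the DataStax cassandra-driver package. The keyspace, table, replication settings and data are illustrative assumptions.

# Sketch using the DataStax cassandra-driver package; keyspace, table and
# replication settings are illustrative assumptions.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])   # every node is a peer; any node can coordinate
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS shop.orders (
        customer text,
        order_id text,
        total    int,
        PRIMARY KEY (customer, order_id)
    )
""")

# A write is atomic at the row level: all columns for this row key
# either succeed or fail together.
session.execute(
    "INSERT INTO shop.orders (customer, order_id, total) VALUES (%s, %s, %s)",
    ('alice', 'o-1001', 30)
)

rows = session.execute(
    "SELECT order_id, total FROM shop.orders WHERE customer = %s", ('alice',)
)
for row in rows:
    print(row.order_id, row.total)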

2.3.4 Graph

Graph databases store entities and the relationships between them. Entities are also known as nodes, which have properties; relationships are known as edges, which can also have properties. Edges have directional significance, and nodes are organized by relationships, which makes it possible to find interesting patterns between the nodes. This organization lets the data be stored once and then interpreted in different ways based on the relationships.

Relationships are first-class citizens in graph databases; most of the value of graph databases is derived from the relationships. Relationships do not only have a type, a start node and an end node; they can have properties of their own. Using these properties, we can add intelligence to the relationship, for example since when two people have been friends, what the distance between the nodes is, or which aspects the nodes share. These properties on the relationships can also be used to query the graph.

2.3.4.1 Merits

Very compact modeling of networked data.

High performance efficiency.

Can be deployed and used very effectively in social networking.

Excellent choice for routing, dispatch and location-based services.

As nodes and relationships are created in the system, they can be used to make recommendation engines.

They can be used to search for patterns in relationships to detect fraud in transactions.

2.3.4.2 Demerits

Not appropriate when an update is required on all or a subset of entities.

Some databases may be unable to handle lots of data, especially in global graph operations (those involving the whole graph).

Sharding is difficult as graph databases are not aggregate-oriented.

2.3.4.3 Case Study – Neo4j

Neo4j is an open-source graph database implemented in Java. It is described as an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. Neo4j is ACID compliant and is easily embedded in individual applications.

In Neo4j, a graph is created by making two nodes and then establishing a relationship between them; a minimal code sketch follows the scaling options listed below. Graph databases ensure consistency through transactions and do not allow dangling relationships: the start node and the end node always have to exist, and nodes can only be deleted if they have no relationships attached to them. Neo4j achieves high availability by providing for replicated slaves. It can be queried with languages such as Gremlin (a Groovy-based traversal language) and Cypher (a declarative graph query language). There are three ways to scale graph databases:

Adding enough RAM to the server so that the working set of nodes and relationships is held entirely in memory.

Improving the read scaling of the database by adding more slaves with read-only access to the data, with all the writes going to the master.

Sharding the data from the application side using domain-specific knowledge.
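As promised above, here is a minimal sketch of creating two nodes and a relationship and then querying them with Cypher through the official neo4j Python driver. The URI, credentials, labels and property names are illustrative assumptions.

# Sketch using the official neo4j Python driver and Cypher; the URI,
# credentials and node/relationship labels are illustrative assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two nodes and a relationship carrying its own property ("since").
    session.run(
        "CREATE (a:Person {name: $a})-[:FRIEND {since: 2010}]->(b:Person {name: $b})",
        a="Alice", b="Bob"
    )
    # Query the graph through the relationship and its properties.
    result = session.run(
        "MATCH (a:Person)-[f:FRIEND]->(b:Person) "
        "WHERE f.since < 2015 RETURN a.name AS a, b.name AS b"
    )
    for record in result:
        print(record["a"], "->", record["b"])

driver.close()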

3. Solution Approach

3.1 NoSQL Methods to MySQL

3.1.1 Problem Addressed

The ever-increasing performance demands of web-based services have generated significant interest in providing NoSQL access methods to MySQL, enabling users to maintain all of the advantages of their existing relational database infrastructure while gaining fast performance for simple queries through an API that complements regular SQL access to their data.

There are many features of MySQL Cluster that make it well suited for applications that are considering NoSQL data stores: scaling out, performance on commodity hardware, in-memory real-time performance and flexible schemas are some of them. In addition, MySQL Cluster adds transactional consistency and durability, and various NoSQL APIs can be combined simultaneously with full-featured SQL, all working on the same data set.

The MySQL Java APIs have the following features:

– Persistent classes

– Relationships

– Joins in queries

– Lazy loading

– Table and index creation from object model

By eliminating data transformations via SQL, users get lower data access latency and higher throughput. In addition, Java developers have a more natural programming method to directly manage their data, with a complete, feature-rich solution for object/relational mapping. As a result, application development is simplified and data access becomes both simpler and faster.
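MySQL Cluster also exposes a memcached-compatible NoSQL API alongside SQL. The Python sketch below assumes such an endpoint has been configured for the cluster on localhost:11211 and uses the third-party pymemcache client; the host, port and key names are illustrative assumptions, not part of the original text.

# Sketch of key-value access to MySQL Cluster through its memcached-compatible
# API, using the third-party pymemcache client. It assumes a memcached endpoint
# backed by the cluster is configured on localhost:11211; host, port and keys
# are illustrative assumptions.
from pymemcache.client.base import Client

kv = Client(('localhost', 11211))

# Simple key-value writes and reads bypass SQL parsing entirely,
# while the same data remains queryable through regular SQL.
kv.set('customer:42:name', 'Alice')
print(kv.get('customer:42:name'))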

 
