Databases Have Become an Inevitable Part of IT: Computer Science Essay

Database systems have been in use since the 1960s and have evolved greatly over the last five decades. Relational database concepts were introduced in the 1970s, and the RDBMS was born with such strong advantages and usability that it has endured for almost 40 years. In the 1980s, structured query languages were introduced, which further enriched the traditional database system by making it possible to retrieve useful data in seconds with a two-line query. More recently, the internet has been used to empower databases in the form of distributed database systems.

http://quickbase.intuit.com/articles/timeline-of-database-history

Current Trends

Databases have become an inevitable part of the IT industry, with significance in every application model. Almost all applications have a separate data layer that governs how data is stored and retrieved, and almost every language provides a mechanism to access the database. The scope of the IT industry is expanding with new technologies like mobile computing, and new types of databases are being introduced frequently. Storage capacity, until recently a limiting issue, has largely been addressed by cloud technologies. These new trends also introduce new challenges for traditional database systems: very large volumes of data, dynamically created data, storage issues, retrieval problems, and so on.

Merits


The main advantages of a database system are that it provides the ACID properties and allows concurrency. The database designer takes care of redundancy control and data integrity by applying normalization techniques. Data sharing and transaction control are added advantages of an RDBMS. Data security is also provided to some extent: there are built-in encryption facilities to protect data, and the backup and recovery subsystems provided by the DBMS help to recover from data loss caused by hardware failure. Structured query languages provide easy retrieval and easy management of the database, which also supports multiple views for different users.

http://my.safaribooksonline.com/book/databases/9788131731925/database-system/ch01lev1sec2

Demerits

Database design is the most important part of the system, and it is difficult to produce a design that delivers all the advantages mentioned above; the process is complex and hard to understand. Sometimes, after normalization, the cost of retrieval increases. Security is very limited, and database servers are costly to manage. A single server failure can badly affect the entire business. The large amounts of data generated on a regular basis are difficult to manage through a traditional system, and there is still limited support for some types of data, such as media files.

http://www.cl500.net/pros_cons.html

http://smallbusiness.chron.com/disadvantages-business-databases-33951.html

In Depth Problem Review

Storage

Ever-increasing data has always been a challenge for the IT industry. This data is often in unstructured formats, and traditional database systems are not capable of storing such large amounts of unstructured data. As volume increases, it becomes difficult to structure, design, index and retrieve the data. Traditional database systems also use physical servers to store data, which can lead to a single point of failure, and maintaining those servers is costly. Recovery is also complicated and time-consuming for a traditional database system.

Computation

Performance

Normalization often affects performance. A highly normalized database contains a large number of tables, with many primary and foreign keys created to relate these tables to each other. Multiple joins are then needed to retrieve a record and its related data, and queries containing multiple joins deteriorate performance. Updates and deletes also take many reads and writes. The designer should consider all of these factors while designing the database.

http://www.ehow.com/info_8072774_advantages-disadvantages-normalizing-database.html

Scalability

In the traditional database model, data structures are defined when the table is created. For data whose size is hard to predict, especially text, this is awkward. If you allocate more length than the data needs, the extra space goes to waste; if you allocate less, some systems will silently truncate the value to the part that fits in that length rather than raise an error. You also have to be very specific with the data type: if you try to store a float value in an integer column, and another field is calculated from that field, all of the derived data can be affected. Traditional databases also tend to prioritize performance over this kind of flexibility.
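The silent-truncation behaviour described above can be sketched as follows. This is a hypothetical simulation of a fixed-width CHAR(10) column, not the behaviour of any particular database product:

```python
# Hypothetical sketch: how a fixed-length CHAR(10) column can silently
# truncate oversized values in some traditional databases.
def store_char(value: str, length: int = 10) -> str:
    """Simulate a fixed-width column: pad short values, truncate long ones."""
    return value[:length].ljust(length)

stored = store_char("a short note that is too long")
assert len(stored) == 10
assert stored == "a short no"   # the rest of the data is silently lost
```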

Availability


As mentioned before, data is stored in database servers. Large companies have their own data stores located worldwide, and to increase performance the data is split and stored in different locations. Tasks such as daily backups are run to protect this data, but if the data is lost for any reason (natural calamity, fire, flood, etc.), the application will be down while the data is restored, which takes time.

Introduction to Big Data

What is Big Data?

Big data is a term used to describe the exponential growth, availability, reliability and use of structured, semi-structured and unstructured data. There are four dimensions to Big Data:

Volume: Data is generated from various sources and collected in massive amounts. Social websites, forums and transactional data retained for later use are generated in terabytes and petabytes. We need to store these data in a meaningful way to extract value from them.

Velocity: Velocity is not just about producing data faster; it also means processing data faster to meet requirements. RFID tags, for example, generate loads of data and demand a system that can both produce and process huge volumes quickly. Many organizations find it difficult to deal with that much data while improving on velocity.

http://www.sas.com/big-data/index.html

Variety: Today, organizations collect or generate data from many sources, such as traditional hierarchical databases created by OLAP systems and users. There are also unstructured and semi-structured data such as email, videos, audio, transactional data, forums, text documents and meter-collected data. Most of this data is not numeric, but it is still used in making decisions.

Veracity: As the variety and number of sources grow, it becomes difficult for decision makers to trust the information they use for analysis. Ensuring trust in Big Data is therefore a challenge.

2.2 Hadoop

Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. It was derived from Google's MapReduce and Google File System (GFS) papers. It is written in the Java programming language and supports applications that run reliably on large clusters. Hadoop implements MapReduce and uses the Hadoop Distributed File System (HDFS). It is considered reliable and available because both MapReduce and HDFS are designed to handle node failures, keeping data available all the time.

Some of the advantages of Hadoop are as follows:

Cheap and fast

Scales to large amounts of storage and computation

Flexible with any type of data and with many programming languages

http://en.wikipedia.org/wiki/File:Hadoop_1.png

2.2.1 MapReduce:

It was developed for processing large amounts of raw data, for example, crawled documents or web request logs. This data is distributed across thousands of machines in order to be processed faster. The distribution enables parallel processing, with each machine computing the same problem on a different part of the data set. MapReduce is an abstraction that allows engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance.

The map function takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function.

The reduce function accepts an intermediate key I and a set of values for that key, and merges these values together to form a possibly smaller set of values.
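The map and reduce roles described above can be sketched with the classic word-count example. This is a single-process, illustrative sketch; a real MapReduce run distributes these steps across machines:

```python
from collections import defaultdict

# Minimal sketch of the Map and Reduce phases for word counting.
def map_fn(document: str):
    # Map: emit an intermediate (key, value) pair for each word.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: merge all values associated with one intermediate key.
    return (key, sum(values))

documents = ["the cat sat", "the dog sat"]

# Shuffle: group intermediate values by key, as the library would.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):
        groups[key].append(value)

counts = dict(reduce_fn(k, vs) for k, vs in groups.items())
assert counts == {"the": 2, "cat": 1, "sat": 2, "dog": 1}
```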

http://code.google.com/edu/parallel/mapreduce-tutorial.html

The MapReduce library in the user program first shards the input files into M pieces, typically 16 to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.

One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.


A worker who is assigned a map task reads the contents of the corresponding input shard. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.

The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns to the user code.

After successful completion, the output of the MapReduce execution is available in the R output files.

2.2.2 Hadoop Distributed File System (HDFS):

The Hadoop Distributed File System is a portable, scalable, distributed file system written in Java for the Hadoop framework. An HDFS cluster consists of a single namenode plus a cluster of datanodes. Each datanode serves blocks of data over the network using a block protocol specific to HDFS. Since the file system is on the network, it uses the TCP/IP layer for communication, and clients use RPC to communicate with the nodes. HDFS stores large files, in blocks typically of 64 MB, across multiple machines. It achieves reliability by replicating the data across multiple hosts, and therefore does not need RAID on the host server. With the default replication value of 3, data is stored on three nodes: two on the same rack and one on a different rack. Data rebalancing, moving copies of data, and keeping replication high are achieved through communication between the nodes.
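The replica-placement rule above (replication factor 3, two copies on one rack and one on another) can be sketched as follows. This is a hypothetical simplification; real HDFS placement also considers the writer's location and node load:

```python
# Hypothetical sketch of HDFS-style replica placement: two replicas
# on one rack, a third on a different rack, for a replication factor of 3.
def place_replicas(racks: dict) -> list:
    """racks maps rack name -> list of datanodes; returns 3 chosen nodes."""
    rack_names = sorted(racks)
    first, second = rack_names[0], rack_names[1]
    # Two replicas on the first rack, the third on a different rack.
    return racks[first][:2] + racks[second][:1]

racks = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
assert place_replicas(racks) == ["dn1", "dn2", "dn4"]
```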

HDFS has high-availability capabilities: the main metadata server can be failed over, manually or automatically, to a backup in the event of failure. The file system has a Secondary Namenode, which connects to the Primary Namenode to build snapshots of the Primary Namenode's directory information. These snapshots are stored in local or remote directories and are then used to restart a failed primary namenode, which eliminates replaying the entire log of file system actions. Because the namenode is the single point for the storage and management of metadata, it can become a bottleneck when serving huge numbers of small files. HDFS Federation addresses this by serving multiple namespaces through separate namenodes.

A key advantage of HDFS is the communication between the job tracker and the task trackers about data location. Knowing where the data lives, the job tracker assigns map or reduce jobs to task trackers accordingly. Say node P has data (l, m, n) and node Q has data (a, b, c): the job tracker will assign node Q to run map or reduce tasks on a, b, c, and node P will be assigned tasks on l, m, n. This helps reduce unwanted traffic on the network.
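The locality-aware assignment in the example above can be sketched like this. The block-to-node mapping is invented purely for illustration:

```python
# Hypothetical sketch of locality-aware scheduling: the job tracker
# assigns each task to the node that already holds its input block.
block_locations = {"l": "P", "m": "P", "n": "P", "a": "Q", "b": "Q", "c": "Q"}

def assign_tasks(blocks: list) -> dict:
    """Map each input block to the node storing it, avoiding network copies."""
    assignments = {}
    for block in blocks:
        assignments[block] = block_locations[block]
    return assignments

tasks = assign_tasks(["a", "l", "c"])
assert tasks == {"a": "Q", "l": "P", "c": "Q"}
```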

2.3 NoSQL Ecosystem

NoSQL refers to non-relational database management systems that differ from traditional relational database management systems in significant ways. NoSQL systems are designed for distributed data stores that require large-scale data storage; they are schema-less and scale horizontally. Relational databases rely upon hard-and-fast, structured rules to govern transactions, encoded in the ACID model, which requires that the database always preserve atomicity, consistency, isolation and durability in each transaction. NoSQL databases instead follow the BASE model, which offers three looser guarantees: basic availability, soft state and eventual consistency.

The term NoSQL was coined by Carlo Strozzi in 1998 for his open-source, lightweight database that had no SQL interface. Later, in 2009, Eric Evans, a Rackspace employee, reused the term for databases that are non-relational and distributed and do not conform to atomicity, consistency, isolation and durability. In the same year, NoSQL was discussed extensively at the "no:sql(east)" conference held in Atlanta, USA, and it subsequently saw unprecedented growth.

Two primary reasons to consider NoSQL are: to handle data access at sizes and performance levels that demand a cluster; and to improve the productivity of application development by using a more convenient data-interaction style. The common characteristics of NoSQL systems are:

Not using the relational model

Running well on clusters

Open-source

Built for 21st century web estates

Schema-less

Each NoSQL solution uses a different data model which can be put in four widely used categories in the NoSQL Ecosystem: key-value, document, column-family and graph. Of these the first three share a common characteristic of their data models called aggregate orientation. Next we briefly describe each of these data models.

2.3.1 Document Oriented

The main concept of a document oriented database is the notion of a "document". The database stores and retrieves documents which encapsulate and encode data in some standard formats or encodings like XML, JSON, BSON, and so on. These documents are self-describing, hierarchical tree data structures and can offer different ways of organizing and grouping documents:

Collections

Tags

Non-visible Metadata

Directory Hierarchies

Documents are addressed in the database via a unique key which represents the document. Also, beyond a simple key-document lookup, the database offers an API or query language that allows retrieval of documents based on their content.
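The two access paths described above, key-document lookup and content-based querying, can be sketched with an in-memory document store. The documents and field names here are invented for illustration:

```python
import json

# Minimal in-memory sketch of a document store: lookup by unique key,
# plus a content-based query over JSON-style documents.
documents = {
    "doc1": {"title": "NoSQL intro", "tags": ["nosql", "db"]},
    "doc2": {"title": "Graph stores", "tags": ["graph"]},
}

def find_by_key(key):
    # Simple key-document lookup.
    return documents.get(key)

def find_by_content(field, value):
    # Query on document content, not just the key.
    return [k for k, d in documents.items() if value in d.get(field, [])]

assert find_by_key("doc1")["title"] == "NoSQL intro"
assert find_by_content("tags", "graph") == ["doc2"]
assert json.dumps(documents["doc2"])  # documents serialize to JSON
```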

2.3.1.1 Merits

Intuitive data structure

Simple "natural" modeling of requests with flexible query functions

Can act as a central data store for event storage, especially when the data captured by the events keeps changing.

With no predefined schemas, they work well in content management systems or blogging platforms.

Can store data for real-time analytics; since parts of the document can be updated, it is easy to store page views and new metrics can be added without schema changes.

Provides a flexible schema for e-commerce applications, with the ability to evolve data models without expensive database refactoring or data migration.

2.3.1.2 Demerits

Higher hardware demands, because queries are more dynamic and partly run without prepared data.

Redundant storage of data (denormalization) in favor of higher performance.

Not suitable for atomic cross-document operations.

Since the data is saved as an aggregate, if the design of an aggregate is constantly changing, aggregates have to be saved at the lowest level of granularity. In this case, document databases may not work.

2.3.1.3 Case Study - MongoDB

MongoDB is an open-source document-oriented database system developed by 10gen. It stores structured data as JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. The language support includes Java, JavaScript, Python, PHP, Ruby and it also supports sharding via configurable data fields. Each MongoDB instance has multiple databases, and each database can have multiple collections. When a document is stored, we have to choose which database and collection this document belongs in.

Consistency in a MongoDB database is configured by using replica sets and choosing to wait for the writes to be replicated to a given number of slaves. Transactions at the single-document level are atomic: a write either succeeds or fails. Transactions involving more than one operation are not possible, with a few exceptions. MongoDB implements replication, providing high availability using replica sets. In a replica set, there are two or more nodes participating in an asynchronous master-slave replication. MongoDB has a query language which is expressed via JSON and has a variety of constructs that can be combined to create a MongoDB query. With MongoDB, we can query the data inside the document without having to retrieve the whole document by its key and then introspect it. Scaling in MongoDB is achieved through sharding: the data is split by a certain field and then moved to different Mongo nodes. The data is dynamically moved between nodes to ensure that shards are always balanced. We can add more nodes to the cluster and increase the number of writable nodes, enabling horizontal scaling for writes.

2.3.2 Key-value

A key-value store is a simple hash table, primarily used when all access to the database is via primary key. Key-value stores allow the application to store its data in a schema-less way. The data could be stored in a datatype of a programming language or an object. The following types exist: Eventually-consistent key-value store, hierarchical key-value store, hosted services, key-value chain in RAM, ordered key-value stores, multivalue databases, tuple store and so on.

Key-value stores are the simplest NoSQL data stores to use from an API perspective. The client can get or put the value for a key, or delete a key from the data store. The value is a blob that is stored without the store knowing what is inside; it is the responsibility of the application to understand what is stored.
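The get/put/delete API described above can be sketched with a tiny in-memory store. Serializing values to bytes mimics the "opaque blob" semantics, where only the application knows what the value means:

```python
import pickle

# Sketch of a key-value store API: get, put, delete. The value is an
# opaque blob; the application decides what the bytes mean.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = pickle.dumps(value)  # stored as an opaque blob

    def get(self, key):
        blob = self._data.get(key)
        return pickle.loads(blob) if blob is not None else None

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("session:42", {"user": "alice", "cart": ["book"]})
assert store.get("session:42")["user"] == "alice"
store.delete("session:42")
assert store.get("session:42") is None
```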

2.3.2.1 Merits

High and predictable performance.

Simple data model.

Clear separation of storage from application logic (a consequence of the lack of a query language).

Suitable for storing session information.

User profiles, product profiles, preferences can be easily stored.

Best suited for shopping cart data and other E-commerce applications.

Can be scaled easily since they always use primary-key access.

2.3.2.2 Demerits

Limited range of functions

High development effort for more complex applications

Not the best solution when relationships between different sets of data are required.

Not suited for multi operation transactions.

There is no way to inspect the value on the database side.

Since operations are limited to one key at a time, there is no way to operate upon multiple keys at the same time.

2.3.2.3 Case Study - Azure Table Storage

For structured forms of storage, Windows Azure provides structured key-value pairs stored in entities known as Tables. The table storage uses a NoSQL model based on key-value pairs for querying structured data that is not in a typical database. A table is a bag of typed properties that represents an entity in the application domain. Data stored in Azure tables is partitioned horizontally and distributed across storage nodes for optimized access.

Every table has a property called the Partition Key, which defines how data in the table is partitioned across storage nodes - rows that have the same partition key are stored in a partition. In addition, tables can also define Row Keys which are unique within a partition and optimize access to a row within a partition. When present, the pair {partition key, row key} uniquely identifies a row in a table. The access to the Table service is through REST APIs.
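The role of the partition key in routing rows to storage nodes can be sketched as follows. The node names and hashing scheme are invented for illustration; Azure's actual partitioning strategy is internal to the service:

```python
import hashlib

# Hypothetical sketch: routing rows to storage nodes by partition key,
# with {partition key, row key} uniquely identifying a row.
NODES = ["node0", "node1", "node2"]

def node_for(partition_key: str) -> str:
    # Hash the partition key so all rows in a partition land together.
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# All rows sharing a partition key are served by the same node.
assert node_for("customers-EU") == node_for("customers-EU")

table = {("customers-EU", "row1"): {"name": "alice"}}
assert table[("customers-EU", "row1")]["name"] == "alice"
```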

2.3.3 Column Store

Column-family databases store data in column-families as rows that have many columns associated with a row key. These stores allow storing data with key mapped to values, and values grouped into multiple column families, each column family being a map of data. Column-families are groups of related data that is often accessed together.

The column-family model is a two-level aggregate structure. As with key-value stores, the first key is often described as a row identifier, picking up the aggregate of interest. The difference with column-family structures is that this row aggregate is itself formed of a map of more detailed values. These second-level values are referred to as columns. The model allows accessing the row as a whole, as well as operations that pick out a particular column.

2.3.3.1 Merits

Designed for performance.

Native support for persistent views, compared to a plain key-value store.

Sharding: Distribution of data to various servers through hashing.

More efficient than row-oriented systems during aggregation of a few columns from many rows.

Column-family databases with their ability to store any data structures are great for storing event information.

Allows storing blog entries with tags, categories, links, and trackbacks in different columns.

Can be used to count and categorize visitors of a page in a web application to calculate analytics.

Provides a functionality of expiring columns: columns which, after a given time, are deleted automatically. This can be useful in providing demo access to users or showing ad banners on a website for a specific time.

2.3.3.2 Demerits

Limited query options for data

High maintenance effort when changing existing data, because all lists must be updated.

Less efficient than row-oriented systems when accessing many columns of a single row.

Not suitable for systems that require ACID transactions for reads and writes.

Not good for early prototypes or initial tech spikes as the schema change required is very expensive.

2.3.3.3 Case Study - Cassandra

A column is the basic unit of storage in Cassandra. A Cassandra column consists of a name-value pair where the name behaves as the key. Each of these key-value pairs is a single column and is stored with a timestamp value which is used to expire data, resolve write conflicts, deal with stale data, and other things. A row is a collection of columns attached to a key; a collection of similar rows makes a column family. Each column family can be compared to a container of rows in an RDBMS table, where the key identifies the row and the row consists of multiple columns. The difference is that various rows do not need to have the same columns, and columns can be added to any row at any time without having to add them to other rows.
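The use of column timestamps to resolve write conflicts can be sketched with a last-write-wins rule. This is a hypothetical simplification of Cassandra's conflict resolution:

```python
# Hypothetical sketch of Cassandra-style columns: each name-value pair
# carries a timestamp, and the newest write wins on conflict.
row = {}

def write_column(name, value, timestamp):
    current = row.get(name)
    if current is None or timestamp > current[1]:
        row[name] = (value, timestamp)   # newer timestamp replaces stale data

write_column("email", "old@example.com", timestamp=100)
write_column("email", "new@example.com", timestamp=200)
write_column("email", "stale@example.com", timestamp=150)  # ignored: older

assert row["email"][0] == "new@example.com"
```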

By design Cassandra is highly available, since there is no master in the cluster and every node is a peer. A write operation in Cassandra is considered successful once it is written to the commit log and to an in-memory structure known as a memtable. While a node is down, the data that was supposed to be stored by that node is handed off to other nodes; as the node comes back online, the changes made to the data are handed back to it. This technique, known as hinted handoff, allows for faster restoration of failed nodes. In Cassandra, a write is atomic at the row level, which means inserting or updating columns for a given row key will be treated as a single write and will either succeed or fail. Cassandra has a query language that supports SQL-like commands, known as the Cassandra Query Language (CQL); we can use CQL commands to create a column family. Scaling in Cassandra is done by adding more nodes. As no single node is a master, adding nodes to the cluster improves its capacity to support more writes and reads. This allows for maximum uptime, as the cluster keeps serving requests from clients while new nodes are being added.

2.3.4 Graph

Graph databases allow storing entities and the relationships between them. Entities are also known as nodes, which have properties; relationships are known as edges, which can also have properties. Edges have directional significance, and nodes are organized by relationships, which allows finding interesting patterns between the nodes. The organization of the graph lets the data be stored once and then interpreted in different ways based on relationships.

Relationships are first-class citizens in graph databases; most of the value of graph databases is derived from the relationships. Relationships don't only have a type, a start node, and an end node, but can have properties of their own. Using these properties, we can add intelligence to the relationship, for example, when two nodes became friends, the distance between the nodes, or what aspects the nodes share. These properties on the relationships can then be used to query the graph.
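Querying a graph through relationship properties, as described above, can be sketched with plain data structures. The node names and the "since" property are invented for illustration:

```python
# Sketch of a property graph: nodes with properties, directed edges
# that carry a type and their own properties.
nodes = {"alice": {"age": 30}, "bob": {"age": 32}}
edges = [
    {"from": "alice", "to": "bob", "type": "FRIEND", "since": 2009},
]

def friends_since(person, year):
    # Query the graph using a property stored on the relationship itself.
    return [e["to"] for e in edges
            if e["from"] == person and e["type"] == "FRIEND"
            and e["since"] <= year]

assert friends_since("alice", 2013) == ["bob"]
assert friends_since("alice", 2008) == []
```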

2.3.4.1 Merits

Very compact modeling of networked data.

High performance efficiency.

Can be deployed and used very effectively in social networking.

Excellent choice for routing, dispatch and location-based services.

As nodes and relationships are created in the system, they can be used to make recommendation engines.

They can be used to search for patterns in relationships to detect fraud in transactions.

2.3.4.2 Demerits

Not appropriate when an update is required on all or a subset of entities.

Some databases may be unable to handle lots of data, especially in global graph operations (those involving the whole graph).

Sharding is difficult as graph databases are not aggregate-oriented.

2.3.4.3 Case Study - Neo4j

Neo4j is an open-source graph database, implemented in Java. It is described as an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. Neo4j is ACID compliant and easily embedded in individual applications.

In Neo4J, a graph is created by making two nodes and then establishing a relationship. Graph databases ensure consistency through transactions. They do not allow dangling relationships: The start node and end node always have to exist, and nodes can only be deleted if they don't have any relationships attached to them. Neo4J achieves high availability by providing for replicated slaves. Neo4j is supported by query languages such as Gremlin (Groovy based traversing language) and Cypher (declarative graph query language). There are three ways to scale graph databases:

Adding enough RAM to the server so that the working set of nodes and relationships is held entirely in memory.

Improve the read scaling of the database by adding more slaves with read-only access to the data, with all the writes going to the master. 

Sharding the data from the application side using domain-specific knowledge.

Solution Approach

NoSQL Methods to MySQL

Problem Addressed

The ever-increasing performance demands of web-based services have generated significant interest in providing NoSQL access methods to MySQL - enabling users to maintain all of the advantages of their existing relational database infrastructure, while providing blazing fast performance for simple queries, using an API to complement regular SQL access to their data.

There are a number of attributes of MySQL Cluster that make it ideal for lots of applications that are considering NoSQL data stores. Scaling out capacity and performance on commodity hardware, in-memory real-time performance (especially for simple access patterns), flexible schemas… sound familiar? In addition, MySQL Cluster adds transactional consistency and durability. In case that's not enough, you can also simultaneously combine various NoSQL APIs with full-featured SQL - all working on the same data set. 

MySQL Java APIs

- Persistent classes

- Relationships

- Joins in queries

- Lazy loading

- Table and index creation from object model

By eliminating data transformations via SQL, users get lower data access latency and higher throughput. In addition, Java developers have a more natural programming method to directly manage their data, with a complete, feature-rich solution for Object/Relational Mapping. As a result, the development of Java applications is simplified with faster development cycles resulting in accelerated time to market for new services.

MySQL Cluster offers multiple NoSQL APIs alongside Java:

- Memcached for a persistent, high performance, write-scalable Key/Value store,

- HTTP/REST via an Apache module

- C++ via the NDB API for the lowest absolute latency.

Developers can use SQL as well as NoSQL APIs for access to the same data set via multiple query patterns - from simple primary key lookups or inserts to complex cross-shard JOINs using Adaptive Query Localization.

Marrying NoSQL and SQL access to an ACID-compliant database offers developers a number of benefits. MySQL Cluster's distributed, shared-nothing architecture with auto-sharding and real time performance makes it a great fit for workloads requiring high volume OLTP. Users also get the added flexibility of being able to run real-time analytics across the same OLTP data set for real-time business insight.

http://dev.mysql.com/tech-resources/articles/nosql-to-mysql-with-memcached.html

http://www.clusterdb.com/mysql-cluster/scalabale-persistent-ha-nosql-memcache-storage-using-mysql-cluster/

https://blogs.oracle.com/MySQL/entry/nosql_java_api_for_mysql

http://www.infoq.com/articles/nosql-data-security-virtual-panel

Challenges

InfoQ: What are the advantages and disadvantages of using NoSQL solutions compared to relational databases (RDBMS) when it comes to security?

BGU: Since most application developers using NoSQL datastores usually place the security handling in the middleware, most NoSQL databases don't provide any support for enforcing it explicitly in the database. This is not as huge an advantage of the older relational model as it might seem - the ability to enforce security in the database level is usually ignored. In most applications that I've seen, the security layer in the database is not utilized because the developer wants to use connection pooling, and re-authenticating a connection after it was already established is not supported by most databases (ORACLE supports it via "Proxy Authentication"). NoSQL solutions are usually more cluster oriented, which is an advantage in speed and availability, but a disadvantage in security. The problem here is more that the clustering aspect of NoSQL databases isn't as robust or grown-up as it should be.

However, when accessing the DBMS directly and not via an application, RDBMS has a huge advantage over NoSQL due to the availability of built-in security measures.

John: NoSQL databases are in general less complex than their traditional RDBMS counterparts. This lack of complexity is a benefit when it comes to security. Most RDBMS come with a huge number of features and extensions that an attacker could use to elevate privilege or further compromise the host. Two examples of this relate to stored procedures:

1) Extended stored procedures - these provide functionality that allows interaction with the host file system or network. Not only are many of these dangerous in themselves (e.g. xp_cmdshell on SQL Server allows admin users to execute operating system commands) but they have also had a high number of security problems such as buffer overflows.

2) Stored procedures that run as definer - RDBMS such as Oracle and SQL Server allow standard SQL stored procedures to run under a different (typically higher) user privilege. This is analogous to a "setuid" program in Linux. There have been many privilege escalation vulnerabilities in stored procedures due to SQL injection vulnerabilities.
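The SQL-injection class of vulnerability behind many of these stored-procedure escalations is easiest to see with plain string concatenation. The sketch below uses sqlite3 purely for illustration — the Oracle/SQL Server definer-rights mechanics discussed above are not modeled — and the table is made up.

```python
import sqlite3

# Demonstration of the injection pattern: attacker-controlled input
# concatenated into SQL versus passed as a bound parameter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (user TEXT, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 200)])

malicious = "alice' OR '1'='1"

# Vulnerable: the predicate becomes user = 'alice' OR '1'='1',
# which matches every row.
unsafe = conn.execute(
    "SELECT * FROM accounts WHERE user = '" + malicious + "'").fetchall()

# Safe: the bound parameter is treated as a literal string, matching nothing.
safe = conn.execute(
    "SELECT * FROM accounts WHERE user = ?", (malicious,)).fetchall()

print(len(unsafe), len(safe))  # 2 0
```

Inside a definer-rights stored procedure, the vulnerable form runs with the procedure owner's (higher) privileges, which is what turns an injection bug into privilege escalation.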

One disadvantage of NoSQL solutions is their maturity compared with established RDBMS such as Oracle, SQL Server, MySQL and DB2. With the RDBMS, despite a checkered security past discussed later, the various types of attack vector are well understood and have been for several years. NoSQL databases are still emerging and it is possible that whole new classes of security issue will be discovered.

Emil: For supporting authentication and authorization, graphs are clearly superior to RDBMS/KV stores because security policies can be interconnected with data. Users may belong to many groups, and have individual access rights to various levels of systems. Computing the aggregate authorisation is hard even with trees like LDAP, but in a graph it's straightforward.
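Emil's point — that aggregate authorization is straightforward in a graph — can be sketched as a traversal over a membership graph. The group names, grants, and edge structure below are hypothetical; the idea is simply that effective permissions are the union of grants reachable from a principal.

```python
from collections import deque

# Membership edges point from a principal to the groups it belongs to;
# permissions attach to any node. All names here are invented.
member_of = {
    "alice":       ["engineering", "oncall"],
    "engineering": ["staff"],
    "oncall":      [],
    "staff":       [],
}
grants = {
    "engineering": {"repo:read"},
    "oncall":      {"pager:ack"},
    "staff":       {"wiki:read"},
}

def effective_permissions(principal):
    # Breadth-first walk over memberships, unioning grants along the way.
    seen, perms = set(), set()
    queue = deque([principal])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        perms |= grants.get(node, set())
        queue.extend(member_of.get(node, []))
    return perms

print(effective_permissions("alice"))
```

Because membership is just edges, nested and overlapping groups cost nothing extra — the same traversal handles arbitrarily deep hierarchies that are awkward to express over flat tables or key-value pairs.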

MongoDB & Hadoop

Problem Addressed: integrating MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig and Streaming, analytics and ETL can be performed on large datasets with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers get a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds get a new way to work with ETL, using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can likewise write native Mongo MapReduce jobs through the Hadoop Streaming interfaces.

MongoDB and Hadoop are a powerful combination and can be used together to deliver complex analytics and data processing for data stored in MongoDB. 

Applications have complex needs.

MongoDB is an ideal operational database for big data.

MongoDB map-reduce performs parallel processing.

Aggregation is a primary use of the MongoDB/MapReduce combination.

The aggregation framework is optimized for aggregate queries.

Real-time aggregation is similar to SQL GROUP BY.

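The map-reduce aggregation pattern listed above can be simulated in pure Python. Note this only mirrors the shape of the computation — MongoDB's real mapReduce command takes JavaScript map and reduce functions — and the orders collection is made up.

```python
from collections import defaultdict

# A toy "collection" of documents.
orders = [
    {"city": "Pune",   "total": 10},
    {"city": "Mumbai", "total": 25},
    {"city": "Pune",   "total": 5},
]

def map_fn(doc):
    # emit(key, value), as a Mongo map function would
    yield doc["city"], doc["total"]

def reduce_fn(key, values):
    # combine all values emitted for one key
    return sum(values)

def map_reduce(docs):
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(key, vals) for key, vals in groups.items()}

print(map_reduce(orders))  # equivalent to SELECT city, SUM(total) ... GROUP BY city
```

In a sharded deployment the map phase runs on each shard in parallel and only the per-key value lists cross the network, which is where the parallel-processing benefit comes from.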

Challenges

JavaScript is not the best language for writing Map Reduce jobs.

It is limited in external data-processing libraries.

It adds load to the data store.

Auto-sharding is not always reliable.

Community contribution is relatively low.

Global write locks hurt performance.

http://no-fucking-idea.com/blog/2012/04/01/using-map-reduce-with-mongodb/

http://www.slideshare.net/spf13/mongodb-and-hadoop

http://www.10gen.com/events/mongodb-plus-hadoop

http://www.scoop.it/t/hadoop-and-mahout/p/1550779280/challenges-with-mongodb

Cassandra & Hadoop

Problem Addressed

Performant OLTP plus powerful OLAP.

Less need to shuttle data between storage systems.

Data locality for processing.

Scales with the cluster.

Analytics load can be separated into a virtual DC.
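The "analytics in a virtual DC" idea above amounts to placing a separate replica set in each logical datacenter, so analytics jobs read their own replicas without touching the real-time nodes. The sketch below only mimics the shape of Cassandra's NetworkTopologyStrategy; the node names, replication counts, and hash-based ring walk are all illustrative.

```python
# Two logical datacenters, each with its own ring of nodes (names invented).
datacenters = {
    "realtime":  ["rt1", "rt2", "rt3"],
    "analytics": ["an1", "an2"],
}

def place_replicas(key, replication=None):
    # Per-DC replica counts, in the spirit of NetworkTopologyStrategy.
    replication = replication or {"realtime": 2, "analytics": 1}
    placement = {}
    for dc, count in replication.items():
        ring = datacenters[dc]
        start = hash(key) % len(ring)
        # take `count` consecutive nodes walking the per-DC ring
        placement[dc] = [ring[(start + i) % len(ring)] for i in range(count)]
    return placement

p = place_replicas("sensor-42")
print(p)  # every key gets replicas in BOTH datacenters
```

Because every key lands in both DCs, a heavy Hadoop scan against the analytics replicas never competes for disk or cache with the real-time serving nodes.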

Cassandra has been traditionally used by Web 2.0 companies that require a fast and scalable way to store simple data sets, while Hadoop has been used for analyzing vast amounts of data across many servers.

Typically, running heavy analytics against live databases has been frowned upon, because it could slow responsiveness of the database. For this distribution, however, DataStax is taking advantage of Cassandra's ability to be distributed across multiple nodes.

In this setup, the data can be replicated, whereupon one copy would be kept with the transactional servers and another copy of the data could be placed on servers that would be subjected to analytics.

We can implement Hadoop and Cassandra on the same cluster.

This means that time-based and real-time applications (real-time being a strength of Cassandra) can run under Cassandra, while batch-based analytics and queries that do not require a timestamp can run on Hadoop.

In practice, in this environment, Cassandra replaces HDFS under the covers but this is invisible to the developer.

We can reassign nodes (dynamically where appropriate) between the Cassandra and Hadoop environments as appropriate for the workload.

The other major upside is that using Cassandra removes the single points of failure that are associated with HDFS, namely the NameNode and JobTracker.

Column-family input and output formats.

Pig and Hive drivers.

http://www.infoq.com/presentations/Hadoop-and-Cassandra-Sitting-in-a-Tree

http://www.slideshare.net/jeromatron/cassandrahadoop-integration#btnNext

http://www.bloorresearch.com/blog/IM-Blog/2012/1/cassandra-hadoop.html

http://www.pcworld.com/article/223230/article.html

https://mercury.smu.edu.sg/rsrchpubupload/17574/SMUOpenCirrus2010.pdf

Challenges

Cassandra replication settings are done at the node level with configuration files.

Although we did not achieve the level of performance needed for interactive queries over weeks or months of taxi data, we believe that there are several ways the performance can be improved further. In particular, the combination of more RAM and more effective caching strategies could yield dramatic benefits. For interactive applications, we expect that Cassandra's support for multi-threaded queries could also help deliver speed and scalability.

Cassandra tends to be more sensitive to network performance than Hadoop - even with physically local storage - since Cassandra replicas do not have the ability to execute computing tasks locally as with Hadoop, meaning that tasks requiring a large amount of data may need to transfer this data over the network in order to operate on it.

We believe that a commercially successful cloud computing service must be robust and flexible enough to deliver high performance under a variety of provisioning scenarios and application loads.

Azure Table Storage & Hadoop

Problem Addressed

Broader access to Hadoop through simplified deployment and programmability. Microsoft has simplified setup and deployment of Hadoop, making it possible to set up and configure Hadoop on Windows Azure in a few hours instead of days. Since the service is hosted on Windows Azure, customers only download a package that includes the Hive Add-in and Hive ODBC Driver. In addition, Microsoft has introduced new JavaScript libraries to make JavaScript a first-class programming language in Hadoop. Through this library, programmers can easily write MapReduce programs in JavaScript and run these jobs from simple web browsers. These improvements reduce the barrier to entry by enabling customers to easily deploy and explore Hadoop on Windows.

Breakthrough insights through integration with Microsoft Excel and BI tools. This preview ships with a new Hive Add-in for Excel that enables users to interact with data in Hadoop from Excel. With the Hive Add-in, customers can issue Hive queries to pull and analyze unstructured data from Hadoop in the familiar Excel environment. Second, the preview includes a Hive ODBC Driver that integrates Hadoop with Microsoft BI tools. This driver enables customers to integrate and analyze unstructured data from Hadoop using award-winning Microsoft BI tools such as PowerPivot and PowerView. As a result, customers can gain insight on all their data, including unstructured data stored in Hadoop.

Elasticity, thanks to Windows Azure. This preview of the Hadoop based service runs on Windows Azure, offering an elastic and scalable platform for distributed storage and compute.

The Hadoop on Windows Azure beta shows several interesting strengths, including:

Setup is easy using the intuitive Metro-style Web portal.

You get flexible language choices for running MapReduce jobs and data queries. You can run MapReduce jobs using Java, C#, Pig or JavaScript, and queries can be executed using Hive (HiveQL).

You can use your existing skills if you're familiar with Hadoop technologies. This implementation is compliant with Apache Hadoop snapshot 0.203+.

There are a variety of connectivity options, including an ODBC driver (SQL Server/Excel), RDP and other clients, as well as connectivity to other cloud data stores from Microsoft (Windows Azure Blobs, the Windows Azure Data Market) and others (Amazon Web Services S3 buckets).

Challenges

This makes HDFS well suited for cases when data is appended at the end of a file, but not well suited for cases when data needs to be located and/or updated in the middle of a file. With indexing technologies like HBase or Impala, data access becomes somewhat easier because keys can be indexed, but the inability to index into values (secondary indexes) only allows for primitive query execution.
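The key-index-only limitation can be illustrated with a small sketch: a lookup by key is a single probe, while any predicate over the values forces a full scan. The data and access counters below are invented for illustration.

```python
# A key-indexed store with no secondary index on values (toy data).
store = {f"row{i}": {"color": "red" if i % 3 == 0 else "blue"}
         for i in range(9)}

def get_by_key(key):
    # one probe: the key index answers this directly
    return store.get(key)

def get_by_value(predicate):
    # no secondary index: every row must be examined
    scanned, hits = 0, []
    for key, value in store.items():
        scanned += 1
        if predicate(value):
            hits.append(key)
    return hits, scanned

hits, scanned = get_by_value(lambda v: v["color"] == "red")
print(len(hits), scanned)  # 3 matches, but all 9 rows scanned
```

At HBase scale the same asymmetry holds: row-key reads are cheap, while value predicates degenerate into full-table scans unless a secondary index is maintained by the application.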

There are, however, many unknowns in the version of Hadoop on Windows Azure that will be publicly released:

The current release is a private beta only; there is little information on a roadmap and planned release features.

Pricing hasn't been announced.

During the beta, there's a limit to the size of files you can upload, and Microsoft included a disclaimer that "the beta is for testing features, not for testing production-level data loads." So it's unclear what the release-version performance will be like.

http://oakleafblog.blogspot.com/2012/01/introducing-apache-hadoop-services-for.html

http://msdn.microsoft.com/en-us/magazine/jj190805.aspx

http://cloudcomputing.sys-con.com/node/2455411

Neo4J & Hadoop

Problem Addressed

The point to understand about graph databases, especially when it comes to analytics, is that the more nodes you have in your graph, the richer the environment becomes and the more information you can get out of it.

Hadoop is good for data crunching, but the end results in flat files don't present well to the customer; it is also hard to visualize network data in Excel.

Neo4J is perfect for working with our networked data. We use it a lot when visualizing our different sets of data.

So we prepare our dataset with Hadoop and import it into Neo4J, the graph database, to be able to query and visualize the data.

We have a lot of different ways we want to look at our dataset so we tend to create a new extract of the data with some new properties to look at every few days.

This blog is about how we combined Hadoop and Neo4J, and describes the phases we went through in our search for the optimal solution.

The use of a graph database allows for ad hoc querying and visualization, which has proven very valuable when working with domain experts to identify interesting patterns and paths. Using Hadoop again for the heavy lifting, we can do traversals against the graph without having to limit the number of features (attributes) of each node or edge used for traversal. The combination of both can be a very productive workflow for network analysis.

Neo4j, for example, supports ACID-compliant transactions and XA-compliant two-phase commit. So Neo4j might be better equated with a NewSQL database, except that it can also handle significant query processing.

http://www.neotechnology.com/2012/06/serious-network-analysis-using-hadoop-and-neo4j/

http://www.neotechnology.com/2012/05/bloor-research-graph-databases-and-nosql-2-of-5/

http://blog.xebia.com/2012/11/13/combining-neo4j-and-hadoop-part-i/

http://dbmsmusings.blogspot.com/2011/07/hadoops-tremendous-inefficiency-on.html

Challenges

Hadoop, by default, hash partitions data across nodes. In practice (e.g., in the SHARD paper) this results in data for each vertex in the graph being randomly distributed across the cluster (dependent on the result of a hash function applied to the vertex identifier). Therefore, data that is close to each other in the graph can end up very far away from each other in the cluster, spread out across many different physical machines. For graph operations such as sub-graph pattern matching, this is wildly suboptimal. For these types of operations, the graph is traversed by passing through neighbors of vertexes; it is hugely beneficial if these neighbors are stored physically near each other (ideally on the same physical machine). When using hash partitioning, since there is no connection between graph locality and physical locality, a large amount of network traffic is required for each hop in the query pattern being matched (on the order of one MapReduce job per graph hop), which results in severe inefficiency. Using a clustering algorithm to graph partition data across nodes in the Hadoop cluster (instead of using hash partitioning) is a big win.
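The hash-versus-graph-partitioning argument above can be made concrete by counting cut edges (edges whose endpoints land in different partitions, and which therefore cost network traffic per hop). The six-vertex graph below — two tight communities joined by one bridge — is invented for illustration.

```python
# Two communities {0,1,2} and {3,4,5}, joined by the single bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

def cut_edges(assign):
    # count edges whose endpoints fall in different partitions
    return sum(1 for u, v in edges if assign(u) != assign(v))

hash_cut = cut_edges(lambda v: v % 2)    # hash partitioning: vertex id mod 2
graph_cut = cut_edges(lambda v: v // 3)  # community-aware: {0,1,2} vs {3,4,5}

print(hash_cut, graph_cut)  # the hash partition cuts far more edges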

Hadoop, by default, has a very simple replication algorithm, where all data is generally replicated a fixed number of times (e.g. 3 times) across the cluster. Treating all data equally when it comes to replication is quite inefficient. If data is graph partitioned across a cluster, the data that is on the border of any particular partition is far more important to replicate than the data that is internal to a partition and already has all of its neighbors stored locally. This is because vertexes that are on the border of a partition might have several of their neighbors stored on different physical machines. For the same reasons why it is a good idea to graph partition data to keep graph neighbors local, it is a good idea to replicate data on the edges of partitions so that vertexes are stored on the same physical machine as their neighbors. Hence, allowing different data to be replicated at different factors can further improve system efficiency.
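The selective-replication idea above — replicate border vertices more heavily than interior ones — can be sketched as follows. The graph, partition, and replication factors are invented for illustration.

```python
# A path graph split into partitions A = {0,1,2} and B = {3,4,5};
# only edge (2, 3) crosses the cut.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

def border_vertices():
    # a vertex is "border" if any of its edges crosses partitions
    border = set()
    for u, v in edges:
        if partition[u] != partition[v]:
            border.update((u, v))
    return border

def replication_factor(vertex, base=1, extra=2):
    # border vertices get extra replicas so their neighbors are local somewhere
    return base + (extra if vertex in border_vertices() else 0)

print(sorted(border_vertices()))                  # endpoints of the cut edge
print(replication_factor(2), replication_factor(0))
```

Interior vertices already have all neighbors on their own machine, so replicating them uniformly (as HDFS does with its fixed factor) wastes storage that could instead buy locality at the partition boundaries.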

Hadoop, by default, stores data on a distributed file system (HDFS) or a sparse NoSQL store (HBase). Neither of these data stores are optimized for graph data. HDFS is optimized for unstructured data, and HBase for semi-structured data. But there has been significant research in the database community on creating optimized data stores for graph-structured data. Using a suboptimal store for the graph data is another source of tremendous inefficiency. By replacing the physical storage system with graph-optimized storage, but keeping the rest of the system intact (similar to the theme of the HadoopDB project), it is possible to greatly increase the efficiency of the system.