The amount of data being captured is growing exponentially. Consider the number of likes and shares on Facebook, tweets and retweets on Twitter, +1's and shares on Google+, as well as financial data, banking transactions and health care policies from millions of users across the world.
Facebook has more than 15 Petabyte of data, eBay has more than 5 Petabyte, and other companies such as Google and Yahoo hold data in a similar range and keep generating more at a rapid pace.
Note: 1 Petabyte = 1024 Terabyte, 1 Terabyte = 1024 Gigabyte.
Distributed Systems before Big Data
Data was stored in a centralized Storage Area Network (SAN). When processing needed to be done on that data, it had to be copied to the processing nodes, which works fine for a relatively limited amount of data. The bottleneck was not the computing power of the processing nodes but the time taken to transfer the data to them.
Some numbers to look at
Typical disk data transfer rate: 75 Megabyte/sec
Time taken to transfer 100 Gigabyte of data to the processor: about 22 minutes
Processing takes even longer, as most servers do not have 100 Gigabyte of RAM
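The 22-minute figure can be checked with simple arithmetic: it corresponds to moving about 100 Gigabyte at 75 Megabyte/sec. The sketch below is only a back-of-the-envelope check; the function name `transfer_minutes` is made up for illustration, and it uses the same 1024-based units as the note above.

```python
# Back-of-the-envelope transfer time, using the figures quoted above.
MB_PER_GB = 1024  # same binary convention as the Petabyte/Terabyte note


def transfer_minutes(size_gb: float, rate_mb_per_sec: float) -> float:
    """Minutes needed to move size_gb Gigabyte at rate_mb_per_sec Megabyte/sec."""
    seconds = size_gb * MB_PER_GB / rate_mb_per_sec
    return seconds / 60


if __name__ == "__main__":
    # 100 Gigabyte at 75 Megabyte/sec
    print(f"{transfer_minutes(100, 75):.1f} minutes")
```

Doubling the data size doubles the transfer time, which is why moving the data to the processors stops being practical as the data grows.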
A new approach was needed to reduce latency in data transfer and to satisfy the following needs.
Component Failure Support
Consider a large-scale system of more than 1000 machines, each with at least 10 disks mounted. The mean time to failure of a disk is approximately 3 years, so with more than 10,000 disks in total, roughly 9 or 10 of them fail every day on average. The system should therefore be capable of handling failures.
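The failure estimate above follows directly from the number of disks and the mean time to failure. A rough sketch of that calculation is given here; it assumes failures are independent and spread evenly over time, which is a simplification, and the function name is only illustrative.

```python
def expected_disk_failures_per_day(machines: int, disks_per_machine: int,
                                   mttf_years: float) -> float:
    """Average number of disk failures per day across the whole cluster."""
    total_disks = machines * disks_per_machine
    mttf_days = mttf_years * 365
    # Each disk fails on average once every mttf_days days.
    return total_disks / mttf_days


if __name__ == "__main__":
    # 1000 machines, 10 disks each, 3-year mean time to failure
    print(round(expected_disk_failures_per_day(1000, 10, 3), 1))
```

With the numbers from the text this comes out at roughly nine disk failures per day, so at this scale failure is a daily event rather than an exception.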
If a component fails and then recovers and comes back online, or if it is replaced, it should be able to rejoin the system without requiring the whole system to be restarted.
Component failure during execution should not affect the outcome of the system. The end user should not notice any inconsistency.
Adding load to the system should result in a graceful decline in performance of individual jobs. Increasing resources should support a proportional increase in load capacity.
Many organizations, such as Cloudera, Teradata, SAS, EMC and IBM, work in this domain. Companies like Google and Amazon have their own products on their clouds. Facebook and Yahoo are investing in Apache Hadoop-based software and services, which offer a data platform that lets enterprises and organizations look at all their data, structured as well as unstructured, and ask bigger questions of it. Let us take Hadoop as an example to understand Big Data better.
Hadoop is an open source implementation of Google's MapReduce. It consists of two parts: a large-scale distributed file system called HDFS (Hadoop Distributed File System), i.e., disks on every server plus the software infrastructure to spread your data out among all of them; and MapReduce, which pushes the code to the data on the servers and runs it in parallel on all of them.
With Hadoop, no data is too big. And in today's hyper-connected world where people and businesses are creating more and more data every day, Hadoop's capability to grow virtually without limits means businesses and organizations can now unlock potential value from all their data.
Hadoop is based on GFS (Google File System) and Google's MapReduce. Together they take a radically new approach to the problems of distributed computing by addressing reliability and scalability.
The core concept is to distribute the data across the processing nodes as it is initially stored in the system. Individual nodes can then work on data local to them, which ensures that no data transfer is needed for the initial phase of processing.
The core concept can be better understood with the overview of Hadoop provided below.
Hadoop: Very High Level Overview
The data is broken down into blocks of 64 MB or 128 MB.
Map tasks (the first step in MapReduce) apply the processing to be done to each block. These tasks do not run in real time; they run on raw data to produce useful information that can later help improve performance in real-time scenarios.
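To make the block idea concrete, here is a small sketch that computes how a file of a given size would be cut into fixed-size blocks. It illustrates the splitting arithmetic only and is not how HDFS is implemented; `block_ranges` and `BLOCK_SIZE` are hypothetical names.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, one of the block sizes mentioned above


def block_ranges(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) pairs, in bytes, that cover the whole file."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


if __name__ == "__main__":
    # A 200 MB file needs three full 64 MB blocks plus one 8 MB block.
    for offset, length in block_ranges(200 * 1024 * 1024):
        print(offset, length)
```

Only the last block can be shorter than the block size; every other block is full.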
A master program coordinates the scheduling. The master decides which machine should run each Map task, picking one of the machines that holds a replica of the block's data.
With this architecture, many processing nodes work in parallel, each on a small block of data.
If a node fails, the master will detect the failure and reassign the work to a different node on the system that contains the replicated data. Restarting a task does not require communication with nodes working on other portions of data. If a failed node recovers, it can be automatically added back to the system and assigned new tasks.
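The idea of many nodes each working on its own block can be sketched with a word count, a common teaching example for MapReduce. This is a single-machine illustration under stated assumptions: threads stand in for processing nodes, short strings stand in for blocks, and the function names are invented for the example. The merge at the end plays the role of the Reduce step, the second half of MapReduce.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_task(block: str) -> Counter:
    """Map step: count the words in one block, independently of other blocks."""
    return Counter(block.split())


def reduce_counts(partials) -> Counter:
    """Reduce step: merge the per-block counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(blocks, workers: int = 4) -> Counter:
    # Threads stand in for separate processing nodes in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_task, blocks))
    return reduce_counts(partials)


if __name__ == "__main__":
    blocks = ["big data big", "data hadoop", "big hadoop hadoop"]
    print(word_count(blocks))
```

Each call to `map_task` needs nothing but its own block, which is what allows the blocks to be processed on different machines without coordination.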
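The scheduling and recovery behaviour described above can be sketched as a toy master. It is a simplified model, not Hadoop's actual scheduler: it assumes the master already knows which nodes hold a replica of each block, it simply takes the first live replica holder, and all names (`pick_node`, `assign_tasks`, `handle_failure`) are hypothetical.

```python
def pick_node(block_id, replicas, alive):
    """Return a live node that holds a replica of block_id, or None."""
    for node in replicas[block_id]:
        if node in alive:
            return node
    return None


def assign_tasks(replicas, alive):
    """Assign each block's Map task to a live node holding that block."""
    return {block: pick_node(block, replicas, alive) for block in replicas}


def handle_failure(assignment, replicas, alive, failed_node):
    """Reassign only the tasks that were running on failed_node."""
    alive = alive - {failed_node}
    new_assignment = dict(assignment)
    for block, node in assignment.items():
        if node == failed_node:
            new_assignment[block] = pick_node(block, replicas, alive)
    return new_assignment, alive


if __name__ == "__main__":
    replicas = {"b1": ["n1", "n2"], "b2": ["n2", "n3"], "b3": ["n3", "n1"]}
    alive = {"n1", "n2", "n3"}
    assignment = assign_tasks(replicas, alive)
    assignment, alive = handle_failure(assignment, replicas, alive, "n2")
    print(assignment)
```

Note that tasks on healthy nodes are left untouched when a node fails; only the work of the failed node moves to another replica holder.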
If a node appears to be running slowly, the master can redundantly execute another instance of the same task and the result from the first node to finish the task will be used.
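This "run it twice and take the first answer" idea can be sketched as follows. It is an illustration only: threads and `time.sleep` stand in for nodes of different speed, the node names are invented, and unlike a real cluster the slower copy is not cancelled here but simply ignored.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_task(node: str, delay: float, block: str):
    """Pretend to process a block on a node; delay models the node's speed."""
    time.sleep(delay)
    return node, block.upper()


def speculative_run(block: str, primary_delay: float, backup_delay: float):
    """Run the same task on two nodes and keep whichever result arrives first."""
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [
        pool.submit(run_task, "slow-node", primary_delay, block),
        pool.submit(run_task, "backup-node", backup_delay, block),
    ]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the straggler
    return next(iter(done)).result()


if __name__ == "__main__":
    print(speculative_run("some block of data", 0.5, 0.01))
```

Because both instances compute the same result from the same block, it does not matter which one wins; the duplicate only costs some extra machine time.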
In a Hadoop system, if the master fails then the whole Hadoop cluster is unavailable. Hadoop is used as a batch processing system, so it is usually not a problem if processing stops for a few minutes and is then restarted. In real-time systems, however, this is not acceptable. Companies like Cloudera are putting a lot of effort into increasing the availability of the master.
Programming Languages and Hadoop
Hadoop was developed in Java, but the processing can be written in any language: PHP, Perl, Python, Ruby, etc. One only needs to find an algorithm that divides the problem into subsets and works in parallel on the blocks that are stored in HDFS.
Some open source projects are designed to add functionality and capabilities to Hadoop. For instance, Hadoop doesn't support SQL files, so there are tools that help import SQL data into Hadoop and export it back again. These tools are listed at http://hadoop.apache.org/
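One common route for non-Java languages is Hadoop Streaming, which feeds input lines to a script on standard input and reads tab-separated key/value lines back from its standard output. The sketch below shows what a word-count mapper and reducer could look like in that style. It is written as plain functions so it can be tried without a cluster; `run_locally` is a hypothetical helper that imitates the sort between the two steps, and how the scripts would be submitted to a real cluster is left out.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Imitate map, sort and reduce in a single process for testing."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out in run_locally(["big data", "big hadoop"]):
        print(out)
```

The mapper looks at one line at a time and the reducer at one key at a time, which is the kind of decomposition into independent subsets that the paragraph above asks for.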
Uses of Big Data Analytics
ETL Process in Data Warehousing - Speed up ETL processing
Graph Analysis - To suggest a Friend or a Connection in social networking
Recommendations - Product recommendations in eCommerce
Hadoop on cloud
The vast majority of Hadoop users run it in their own data centers, but the right choice always depends on the need. The tradeoff is the amount of time Hadoop needs to be running. If the system needs Hadoop 24x7, the best option is to build your own infrastructure; if Hadoop will run only for short periods, cloud services can be used.