The Need for MapReduce Technologies


Over the next pages I will explain the MapReduce technique in general, what it does and what it does not, and give some examples of the most common tools using this algorithm. I will also give a brief introduction to databases and parallel databases, and finally compare them to find out whether there are any major improvements and differences between these implementations.

In conclusion I will try to identify whether there is any actual need to use MapReduce instead of parallel databases, and whether there is a need for more extended research in the upcoming future.

\section{MapReduce and Databases}\label{sec:compare}

\subsection{How the idea of MapReduce came up}\label{sec:how the idea of mapReduce came up}

Over the last years, big companies like Google and many researchers have implemented hundreds of computations that process large amounts of data, such as web requests, in order to produce derived documents like inverted indices, the number of web pages crawled per host, the most frequent queries per day, representations of web documents and more. Most of these computations are conceptually straightforward. The problem is the amount of data, which is usually so large that the computations have to be distributed across thousands of machines in order to finish in a reasonable time. There was a need to deal with issues like the distribution of data, how to handle failures and how to pipeline the computations.\cite{Dean04mapreduce:simplified} For these problems MapReduce was proposed as a solution.

\subsection{What MapReduce is and does}\label{sec:what mapReduce is and does}

First of all, let's point out that MapReduce is not a technology but an algorithm. Wikipedia \cite{algorithm:Online} defines an algorithm as ``an effective method for solving a problem using a finite sequence of instructions.'' In MapReduce's case, the problem being solved is the processing and analysis of very large data sets. The solution is a parallelized divide-and-conquer approach and works like this. First, you split the problem up into small, manageable chunks. Second, you fan out each chunk in parallel to individual work units (the mappers). Third, you take the individual results from each unit and recombine them into the final result (the reducers). In SQL parlance, it is conceptually like doing a select with an aggregate and a group by.

MapReduce, as defined in the original paper\cite{mapreduce:Online}, is a programming model for processing large data sets. The name comes from Google's implementation of the model. MapReduce was developed in order to distribute computation across different clusters. The inspiration for the model came from two commonly used functions in functional programming, map and reduce. Libraries implementing the algorithm exist in a variety of programming languages, with popular implementations such as the open-source Apache Hadoop.

MapReduce is a framework that uses a large number of nodes to process parallelizable problems over huge datasets. The nodes can be on the same local network, in which case they are called a cluster, or spread across geographically distributed systems, in which case they form a grid. In order to reduce data transmission, it takes advantage of data locality, processing data near or on the storage assets. The outcome of this process is data stored either in databases or in file systems.

The process is divided into two steps, the Map and Reduce phases:

In the ``Map'' step, the input is divided into smaller sub-problems by the master node, which then distributes them to worker nodes. A tree structure is produced as the workers in their turn may subdivide the given problems further. All the answers are passed back to the master node and the ``Reduce'' step starts. In this phase, the collected answers are combined in order to produce the final answer to the original problem.

All Map and Reduce operations can be processed in parallel by distribution. For the mappers, the parallelism is in practice limited by the number of CPUs and the independence of the data sources. All mapper outputs that share the same key are collected at the same reducer at the same time. An implementation like this might seem inefficient, but it can be applied to very large datasets that single servers cannot process, and it provides a degree of fault tolerance, using the parallelism to reschedule failed operations as long as their input is still available.
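The grouping of mapper outputs by key (the so-called shuffle) can be sketched in a few lines of Python. This is an illustrative single-process model of the idea, not how a real distributed framework implements it:

```python
from collections import defaultdict

def shuffle(mapper_outputs):
    """Group every (key, value) pair emitted by the mappers by key,
    so that each reducer receives all values for one key together."""
    groups = defaultdict(list)
    for key, value in mapper_outputs:
        groups[key].append(value)
    return dict(groups)

# Pairs emitted by different mappers; pairs sharing a key end up together.
pairs = [("x", 1), ("y", 4), ("x", 2)]
grouped = shuffle(pairs)
# grouped == {"x": [1, 2], "y": [4]}
```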

A figure of the process, taken from Wikipedia\cite{mapreduce:Online}, is given in order to illustrate the algorithm visually:







MapReduce \cite{Dean04mapreduce:simplified} is the programming model implemented by Hadoop for processing large datasets that arise from many different real-world tasks. MapReduce works by breaking the processing into two phases, the map phase and the reduce phase, with key/value pairs as input and output; the user-programmer implements these two functions (map and reduce) to perform the computations needed.

In order to allow the programmer to express the computation, the map and reduce phases communicate through key/value pairs.

First, the MapReduce\cite{Dean04mapreduce:simplified} library splits the input data into M parts and distributes them in parallel to different machines, as shown in the next Figure:







All the data is converted into key/value pairs by the framework, in order to serve as input to the map function. The outcome of this phase is sorted by key and becomes the input of the reduce function.

map (k1, v1) $\rightarrow$ list(k2, v2)

reduce (k2, list(v2)) $\rightarrow$ list(v2)

In the reduce phase the key/value pairs are sorted and the output of each reducer is stored in output files in a job directory. In this way, programs are automatically parallelized and can be executed on large clusters. Even programmers without experience can easily utilize the resources of the distributed system, as it takes care of handling machine failures, managing the inter-machine communication and partitioning the input data. Only simple code has to be written by the developer in order to perform the operations. An example of a simple counting operation is given below:
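The counting operation can be sketched as runnable Python. This is a single-process illustration of the programming model, not actual Hadoop code:

```python
from collections import defaultdict

def map_wordcount(_, line):
    # Map: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_wordcount(word, counts):
    # Reduce: sum all partial counts collected for one word.
    return (word, sum(counts))

def mapreduce(lines, mapper, reducer):
    # Shuffle: group every emitted value by its key,
    # then hand each key's value list to the reducer.
    groups = defaultdict(list)
    for i, line in enumerate(lines):
        for key, value in mapper(i, line):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

result = mapreduce(["the cat", "the dog"], map_wordcount, reduce_wordcount)
# result == {"the": 2, "cat": 1, "dog": 1}
```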









In the following figure there is a graphical presentation of the example, showing the split of the original dataset across the mappers, then the flow to the reducers and finally the final result.

MapReduce data flow for the word count example:







\subsection{Parallel Databases}\label{sec:parallel dbms}

Parallel databases\cite{Dewitt92paralleldatabase} started as exotic hardware and over the years evolved into software parallel data-flow architectures without the need for any shared hardware. This software aims to improve the speed of processing relational database queries.

Since the late 1980s, database systems\cite{Pavlo09acomparison} have been capable of running on clusters, performing SQL operations on relational databases without the end-user noticing that the storage was spread over multiple machines. Many of these systems build on the pioneering research of the Gamma \cite{Dewitt86gamma-} and Grace \cite{Fushimi:1986:OSS:645913.671448} parallel DBMS projects.

In such operations the user only needs to specify the request in a high-level language and does not need to be aware of the indexing or join techniques. The parallelization of the processing is achievable because the tables are distributed over the nodes of the cluster and the SQL commands are translated by an optimizer into a query plan.

In a parallel DBMS\cite{Dewitt85multiprocessorhash-based}, SQL commands are executed in three phases. As an example, consider a table T1 whose records we want to filter given a predicate, followed by a join with a table T2 and an aggregate computation on the final output. In the first step, the filter sub-query is performed in parallel, in a similar way to a Map function. The following step is the use of a parallel join algorithm, selected based on the size of the dataset and the tables. Finally, the outputs are collected and the final result of the given query is computed.
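The three phases can be illustrated with a small Python simulation. This is illustrative only; a real parallel DBMS executes these steps across nodes under the control of a query optimizer, and the table contents here are invented for the example:

```python
# Invented toy tables: T1 holds sales amounts, T2 holds regions per id.
t1 = [{"id": 1, "amount": 10}, {"id": 2, "amount": 99}, {"id": 3, "amount": 7}]
t2 = [{"id": 1, "region": "EU"}, {"id": 3, "region": "US"}]

# Phase 1: filter T1 by a predicate (runs independently per node, like a Map).
filtered = [r for r in t1 if r["amount"] < 50]

# Phase 2: hash join with T2 on the shared key "id".
index = {r["id"]: r for r in t2}
joined = [{**r, **index[r["id"]]} for r in filtered if r["id"] in index]

# Phase 3: collect and aggregate the joined rows (like the final Reduce).
total = sum(r["amount"] for r in joined)
# total == 17 (rows with id 1 and 3 survive the filter and the join)
```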

The distribution of the filtered T1 and of table T2 using hash functions, in order to perform the join operation, is similar to the MapReduce functions. When a node has all the necessary data, it performs its own sub-calculations and produces an outcome; the different outcomes then have to be collected and combined to provide the overall final result. These two techniques seem to have a lot in common in how they process large data, but there are notable differences in performance and structure that I will try to identify in the next section.


There are already a few well-known parallel DBMS implementations. Some of them are:


\begin{itemize}
\item Teradata's DBC
\item Tandem's NonStop SQL
\item Bubba
\item EDS
\item Gamma
\item Grace
\item Prisma
\end{itemize}


\subsection{Why MapReduce and not a parallel DBMS instead}\label{sec:why map reduce and not use a parallel DBMS instead}

With either parallel DBMSs or MapReduce, you can provide without much difficulty a parallel, high-level programming environment, and in both it is possible to code any parallel processing task, either as MapReduce jobs or as database queries. Even if these seem quite similar, on large-scale datasets they have a lot of differences. For example, the indexing process, the programming models and the way the data is distributed are different. They do not have the same query execution strategies, and for MapReduce the data can be in any arbitrary format, whereas in parallel databases the data has to conform to a well-defined schema.


As mentioned in a previous section, MapReduce is only an algorithm. In order to use it and benefit from it, tools implementing the technique were needed. After the publication of the algorithm, and especially after the revelation that Google had already been using it for a few years with its own software, there was a strong need for a good, trustworthy, open-source implementation, so that new developers could contribute and eventually improve the use of MapReduce. Several implementations were made, but the most famous and widely used is Hadoop. Over the next lines, I will try to give a summary of this software, explaining its use in general and its extensions towards databases.

\subsection{Hadoop what it is and what it does - connection with MR}\label{sec:Hadoop what it is and what it does - connection with MR}

Hadoop is a top-level project of the Apache Software Foundation \cite{hadoop:Online}. Hadoop is the answer to the need for numerous computations that process large amounts of data to be completed in a reasonable amount of time. The Hadoop open-source project supplies a powerful framework for the development of highly scalable distributed computing applications. Hadoop allows the developer to focus on the application logic, leaving aside the processing details, as the framework is fully responsible for the balanced distribution of data and processing across the nodes of a cluster, for handling possible failures and for other similar issues. The programming model that Hadoop uses is called MapReduce, a distributed data processing model and execution environment that runs on large clusters of commodity machines.

Hadoop includes a number of subprojects:


\begin{itemize}
\item HDFS: a distributed file system that provides high-throughput access to application data.
\item HBase: a distributed column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
\item Avro: a data serialization system for efficient, cross-language RPC and persistent data storage.
\item Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which is translated by the runtime engine to MapReduce jobs) for querying the data.
\item Pig: a dataflow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
\item ZooKeeper: a distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
\item Chukwa: a distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports.
\item Mahout: a scalable machine learning and data mining library, implementing algorithms for clustering, classification and batch-based collaborative filtering.
\end{itemize}


Hadoop also provides its own set of data types that are optimized for network serialization and correspond to the familiar Java built-in data types. Of course, the user can define custom data types if necessary. The data types that are used as keys need to implement the WritableComparable interface, and the data types that are used as values need to implement the Writable interface, which WritableComparable extends.

The Writable interface declares the methods that are used for serialization and deserialization of objects, and WritableComparable additionally declares the methods used for comparing keys.
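The contract of these interfaces can be mimicked in Python to show the idea. The real interfaces are Java, and the binary layout below is an illustrative choice, not Hadoop's actual wire format:

```python
import struct

class IntWritableLike:
    """Illustrative Python analogue of Hadoop's IntWritable: a value type
    that can serialize itself, deserialize itself and compare to a peer."""

    def __init__(self, value=0):
        self.value = value

    def write(self):
        # Serialization: 4-byte big-endian int (an assumed layout).
        return struct.pack(">i", self.value)

    def read_fields(self, data):
        # Deserialization: rebuild the value from the bytes.
        self.value = struct.unpack(">i", data)[0]

    def compare_to(self, other):
        # The Comparable part, needed when the type is used as a key.
        return (self.value > other.value) - (self.value < other.value)
```

A round trip through write/read_fields recovers the original value, which is exactly what the framework relies on when shipping keys and values between nodes.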

The most common Hadoop data types are:


\begin{itemize}
\item Text: equivalent to String.
\item IntWritable: equivalent to Integer.
\item LongWritable: equivalent to Long.
\item FloatWritable: equivalent to Float.
\item DoubleWritable: equivalent to Double.
\item BooleanWritable: equivalent to Boolean.
\end{itemize}


\subsection{Hadoop database}\label{sec:hadoop database}

HadoopDB \cite{Abouzeid_hadoopdb:an} extends the Hadoop framework (see Fig. below) by providing the following four components:







1. Database Connector

This component extends the InputFormat class and is used as an interface between the database systems on the nodes and the TaskTrackers. Each Connector executes an SQL query supplied by the MapReduce job and returns the results as key/value pairs. Since each database requires different query optimizations, connectors for MySQL and PostgreSQL have already been implemented. To the framework, the databases are simply sources of data, similar to HDFS data blocks.

2. Catalog

In order to perform the operations, the system must be provided with specific information such as the metadata, the data partitioning properties, the connection parameters, the driver class and the credentials. The Catalog is responsible for maintaining this information about the provided databases.

3. Data Loader

The Data Loader consists of two sub-components, the Global Hasher and the Local Hasher, and is responsible for partitioning the data given a partition key, breaking each node's data apart into chunks and finally loading the databases with the chunks. Both sub-components use hash functions that differ from Hadoop's defaults in order to ensure better load balancing when executing MapReduce jobs.
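The partitioning idea behind the Global Hasher can be sketched as follows. This is illustrative Python; HadoopDB's actual hash functions are, as noted above, deliberately different from the defaults:

```python
def partition(records, key, n_nodes):
    """Assign each record to one of n_nodes chunks by hashing its
    partition key (illustrative sketch, not HadoopDB's real hasher)."""
    chunks = [[] for _ in range(n_nodes)]
    for rec in records:
        chunks[hash(rec[key]) % n_nodes].append(rec)
    return chunks

# Every record lands in exactly one chunk; records sharing a key value
# always land in the same chunk, which is what makes joins local.
rows = [{"cust_id": i, "amount": i * 10} for i in range(12)]
chunks = partition(rows, "cust_id", 4)
```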

4. SQL to MapReduce to SQL (SMS) Planner

HadoopDB provides a parallel database front-end that enables data analysts to issue SQL queries. The SMS planner extends Hive \cite{hadoopIssues}. Hive is responsible for converting SQL queries into MapReduce jobs that connect tables stored as files in HDFS. Most of the processing occurs in the Map phase when a query involves multiple tables that are collocated, so the join operations can be pushed entirely into the database layer.

In order to understand how Hive operates, here is an example of how it creates an executable MapReduce job.

Consider the following query:

SELECT YEAR(saleDate), SUM(revenue)

FROM sales GROUP BY YEAR(saleDate);

Hive processes the above SQL query in a series of phases:

(1) The parser transforms the query into an Abstract Syntax Tree.

(2) The Semantic Analyser connects to Hive's internal catalogue, the MetaStore, to retrieve the schema of the sales table.

(3) The logical plan generator then creates a DAG of relational operators, the query plan.

(4) The optimizer restructures the query plan to create a more optimized plan.

(5) Finally, the physical plan generator converts the logical query plan into a physical plan executable by one or more MapReduce jobs.

(6) Each DAG enclosed within a MapReduce job is serialized into an XML plan.
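For intuition, the same GROUP BY query can be expressed directly as map and reduce functions. This is an illustrative Python sketch of the equivalent computation, not the plan Hive actually generates, and the sales records are invented for the example:

```python
from collections import defaultdict

def map_sale(sale):
    # Map: emit (year, revenue) for each sale record;
    # the year is the first four characters of an ISO date string.
    return (sale["saleDate"][:4], sale["revenue"])

def reduce_year(year, revenues):
    # Reduce: sum the revenue of all sales that share a year.
    return (year, sum(revenues))

sales = [{"saleDate": "2011-03-01", "revenue": 5},
         {"saleDate": "2011-07-12", "revenue": 7},
         {"saleDate": "2012-01-09", "revenue": 3}]

groups = defaultdict(list)
for year, revenue in map(map_sale, sales):
    groups[year].append(revenue)
result = dict(reduce_year(y, vs) for y, vs in groups.items())
# result == {"2011": 12, "2012": 3}
```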

The following figure gives a graphical representation of the example:

If the sales table is partitioned by YEAR(saleDate), the planner produces the query plan in Fig. 2(b): this plan pushes the entire query processing logic into the database layer, and only a Map task is required to write the results into an HDFS file. Otherwise, SMS produces the query plan in Fig. 2(c), in which the database layer partially aggregates the data, eliminating the selection and group-by operators used in the Map phase of the Hive-generated query plan (Fig. 2(a)). The final aggregation step in the Reduce phase of the MapReduce job, however, is still required in order to merge the partial results from each node. For join queries, Hive assumes that tables are not collocated.







\subsection{Hadoop and HadoopDB}\label{sec:hadoop and hadoopDB}

HadoopDB\cite{Abouzeid_hadoopdb:an} and Hadoop are two different implementations that are in no way trying to replace each other. The two systems can be used in different situations, or even coexist, in order to provide developers with the appropriate tool depending on the task and the dataset. Performance benchmarks showed that using an efficient database layer cuts down the time needed to process complex queries, and that HadoopDB takes advantage of the fault tolerance of Hadoop-style systems and their ability to run in heterogeneous environments.

\section{Comparing different implementations}

In order to evaluate fault tolerance, the ability to run in a heterogeneous environment, the flexibility of the query interface and the overall performance, we compare Hadoop, HadoopDB and two commercial parallel DBMSs, Vertica and DBMS-X. \cite{Abouzeid_hadoopdb:an}


{\bf Vertica} is a parallel database system (founded in 2005) \cite{vertica} based on the C-Store research project \cite{Daniel05c-store:a}. Vertica is a column store, which means that each attribute of each table is stored (and accessed) separately, a technique that has been proven to improve performance for read-mostly workloads. Vertica offers a ``cloud'' edition, which was used for the experiments in \cite{Abouzeid_hadoopdb:an}. Vertica was also used in the performance study of previous work \cite{Pavlo09acomparison} on the same benchmark, so it was configured identically to those previous experiments. The Vertica configuration is therefore as follows: all data is compressed, and Vertica operates on the compressed data directly; primary indexes are implemented by sorting the table on the indexed attribute; none of Vertica's default configuration parameters were changed.


{\bf DBMS-X} is the same commercial parallel row-oriented database as was used for the benchmark in \cite{Pavlo09acomparison}. Since at the time of the VLDB submission this DBMS did not offer a cloud edition, the authors did not run experiments for it on EC2. However, since the Vertica numbers were consistently 10-15 per cent slower on EC2 than on the Wisconsin cluster presented in \cite{Pavlo09acomparison} (an expected result, since the virtualization layer is known to introduce a performance overhead), the DBMS-X numbers from \cite{Pavlo09acomparison} are reproduced in the figures as a best-case performance estimate for DBMS-X if it were to be run on EC2.

\subsection{Benchmark outcomes}\label{sec:benchmark}

The outcome of the benchmark as reported in \cite{Abouzeid_hadoopdb:an} is as follows:

The performance of HadoopDB does not reach the performance of parallel database systems, mostly because of the relative immaturity of Hadoop and Hive, which are fairly new projects, and the lack of data compression in PostgreSQL. As these tools develop and evolve over the years, the performance of HadoopDB is expected to increase as well. However, its fault tolerance and its ability to operate in heterogeneous environments are similar to, or even better than, those of parallel DBMSs. The ability of HadoopDB to directly incorporate Hadoop and open-source DBMS software makes it particularly flexible and extensible for performing data analysis at large scales, and it can be characterized as a hybrid: it approaches the performance of a DBMS while still yielding the fault tolerance and flexibility of MapReduce systems.


In this paper we have examined the MapReduce algorithm, some tools that implement it, and a comparison with similar tools and techniques that developers and companies have been using up to today. As mentioned in the introduction, the scope of this paper was to identify whether there is any actual need for the new algorithm, to present its pros and cons, and finally to identify where and when it performs better or worse than the ordinary databases most companies have been working with.

As is clear from the research and the papers cited, MapReduce is not only a trend; it can actually benefit the database and data-mining areas of IT if it is used properly.

MapReduce is advantageous because it is easy to use, with local optimization, load balancing and fault tolerance. Moreover, a lot of different problems can be expressed as MapReduce computations, and it enables scaling of applications across large clusters of machines.

As with every new discovery, developers tried to convert almost everything to the new technique, but this turned out not to be the right way, since every implementation has not only pros but also cons.

In general, MapReduce splits a specific job into parts in order to parallelise the procedure. First of all, for this to be beneficial, we should provide a large dataset rather than a small one. In fact, the bigger the dataset to process, the better the algorithm performs in comparison with serial ones. Compared with ordinary databases, if we need a precise operation, such as ``find a specific name in the given dataset and output the contact number'', databases perform much better than MapReduce implementations. But if we need general and fast overall processing, such as ``count how many times the word informatics is written in the given dataset'', the MapReduce algorithm performs much better, as splitting the dataset does not affect the outcome.

Hadoop is a successful implementation because it is an open-source software framework, provides reliability and data motion to applications, and is widely accepted by the developer community.

The best way to benefit from MapReduce is to combine it with parallel databases using distributed computations, rather than comparing them with each other. Big companies have already started moving in this direction. Microsoft, for example, published a paper in 2012 on a new scripting language (SCOPE)\cite{DBLP:journals/vldb/ZhouBWLCS12}, which is highly extensible for massive data analysis on large clusters. It is already being used on tens of thousands of machines at Microsoft, serving a variety of Microsoft on-line services. Examples like this seem to point in the right direction for development.

In conclusion, I can say that MapReduce is not going to replace parallel databases in the near future, but if we use the technique in the right way, there are a lot of benefits and improvements that it can provide. It is arguably the best-known algorithm for Big Data processing, and more research can be done either on the front-end tools or on the actual speed of the algorithm. Hopefully this paper has provided you with the basic information about MapReduce and databases as they stand today.