Cache Manager to Reduce the Workload of MapReduce Framework

3324 words (13 pages) Essay

9th Apr 2018 Computer Science Reference this


Disclaimer: This work has been submitted by a university student. This is not an example of the work produced by our Essay Writing Service. You can view samples of our professional work here.

Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of

Provision of Cache Manager to Reduce the Workload of MapReduce Framework for Bigdata application

Ms.S.Rengalakshmi, Mr.S.Alaudeen Basha

Abstract: The term big-data refers to the large-scale distributed data processing applications that operate on large amounts of data. MapReduce and Apache’s Hadoop of Google, are the essential software systems for big-data applications. A large amount of intermediate data are generated by MapReduce framework. After the completion of the task this abundant information is thrown away .So MapReduce is unable to utilize them. In this approach, we propose provision of cache manager to reduce the workload of MapReduce framework along with the idea of data filter method for big-data applications. In provision of cache manager, tasks submit their intermediate results to the cache manager. A task checks the cache manager before executing the actual computing work. A cache description scheme and a cache request and reply protocol are designed. It is expected that provision of cache manager to reduce the workload of MapReduce will improve the completion time of MapReduce jobs.

Key words: big-data; MapReduce; Hadoop; Caching.

I. Introduction

With the evolution of information technology, enormous expanses of data have become increasingly obtainable at outstanding volumes. Amount of data being gathered today is so much that, 90% of the data in the world nowadays has been created in the last two years [1]. The Internet impart a resource for compiling extensive amounts of data, Such data have many sources including large business enterprises, social networking, social media, telecommunications, scientific activities, data from traditional sources like forms, surveys and government organizations, and research institutions [2].

The term Big Data refers to 3 v’s as volume, variety, velocity and veracity. This provides the functionalities of Apprehend, analysis, storage, sharing, transfer and visualization [3].For analyzing unstructured and structured data, Hadoop Distributed File System (HDFS) and Mapreduce paradigm provides a Parallelization and distributed processing.

Huge amount data is complex and difficult to process using on-hand database management tools, desktop statistics, database management systems or traditional data processing applications and visualization packages. The traditional method in data processing had only smaller amount of data and has very slow processing [4].

A big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data composed of billions to trillions of records of millions of people—all from different sources (e.g. Web, sales, customer center for communication, social media. The data is loosely structured and most of the data are not in a complete manner and not easily accessible[5]. The challenges include capturing of data, analysis for the requirement, searching the data, sharing, storage of data and privacy violations.

The trend to larger data sets is due to the additional information derivable from analysis of a single large set of data which are related to one another, as matched to distinguish smaller sets with the same total density of data, expressing correlations to be found to “identify business routines”[10].Scientists regularly find constraints because of large data sets in areas, including meteorology, genomics. The limitations also affect Internet search, financial transactions and information related business trends. Data sets develop in size in fraction because they are increasingly accumulated by ubiquitous information-sensing devices relating mobility. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.

MapReduce is useful in a wide range of applications,such as distributed pattern-based searching technique, sorting in a distributed system, web link-graph reversal, Singular Value Decomposition, web access log stats, index construction in an inverted manner, document clustering , machine learning, and machine translation in statistics. Moreover, the MapReduce model has been adapted to several computing environments. Google’s index of the World Wide Web is regenerated using MapReduce. Early stages of ad hoc programs that updates the index and various analyses can be executedis replaced by MapReduce. Google has moved on to technologies such as Percolator, Flume and MillWheel that provides the operation of streaming and updates instead of batch processing, to allow integrating “live” search results without rebuilding the complete index. Stable input data and output results of MapReduce are stored in a distributed file system. The ephemeral data is stored on local disk and retrieved by the reducers remotely.

In 2001,Big data defined by industry analyst Doug Laney (currently with Gartner) as the three Vs : namevolume, velocity and variety [11]. Big data can be characterized by well-known 3Vs: the extreme density of data, the various types of data and the swiftness at which the data must be processed.

II. Literature survey

Minimization of execution time in data processing of MapReduce jobs has been described by Abhishek Verma, Ludmila Cherkasova, Roy H. Campbell [6]. This is to buldge their MapReduce clusters utilization to reduce their cost and to optimize the Mapreduce jobs execution on the Cluster. Subset of production workloads developed by unstructured information that consists of MapReduce jobs without dependency and the order in which these jobs are performed can have good impact on their inclusive completion time and the cluster resource utilization is recognized. Application of the classic Johnson algorithm that was meant for developing an optimal two-stage job schedule for identifying the shortest path in directed weighted graph has been allowed. Performance of the constructed schedule via unquantifiable set of simulations over a various workloads and cluster-size dependent.

L. Popa, M. Budiu, Y. Yu, and M. Isard [7]: Based on append-only, partitioned datasets, many large-scale (cloud) computations will operate. In these circumstances, two incremental computation frameworks to reuse prior work in these can be shown: (1) reusing similar computations already performed on data partitions, and (2) computing just on the newly appended data and merging the new and previous results. Advantage: Similar Computation is used and partial results can be cached and reused.

Machine learning algorithm on Hadoop at the core of data analysis, is described by Asha T, Shravanthi U.M, Nagashree N, Monika M [1] . Machine Learning Algorithms are recursive and sequential and the accuracy of Machine Learning Algorithms depend on size of the data where, considerable the data more accurate is the result. Reliable framework for Machine Learning is to work for bigdata has made these algorithms to disable their ability to reach the fullest possible. Machine Learning Algorithms need data to be stored in single place because of its recursive nature. MapRedure is the general and technique for parallel programming of a large class of machine learning algorithms for multicore processors. To achieve speedup in the multi-core system this is used.

P. Scheuermann, G. Weikum, and P. Zabback [9] I_O parallelism can be exploited in two ways by Parallel disk systems namely inter-request and intra-request parallelism. There are some main issues in performance tuning of such systems.They are: striping and load balancing. Load balancing is performed by allocation and dynamic redistributions of the data when access patterns change. Our system uses simple but heuristics that incur only little overhead.

D. Peng and F. Dabek [12] an index of the web is considered as documents can be crawled. It needs a continuous transformation of a large repository of existing documents when new documents arrive.Due to these tasks, databases do not meet the the requirements of storage or throughput of these tasks: Huge amount of data(in petabytes) can be stored by Google’s indexing system and processes billions of millions updates per day on wide number of machines. Small updates cannot be processed individually by MapReduce and other batch-processing systems because of their dependency on generating large batches for efficiency. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the similar number of data documents averagely per day, happens during the reduction of the average age of documents in Google search which is resulted by 50%.

Utilization of the big data application in Hadoop clouds is described by Weiyi Shang, Zhen Ming Jiang, Hadi Hemmati, Bram Adams, Ahmed E. Hassan, Patrick Martin[13]. To analyze huge parallel processing frameworks, Big Data Analytics Applications is used. These applications build up them using a little model of data in a pseudo-cloud environment. Afterwards, they arrange the applications in a largescale cloud situation with notably more processing organize and larger input data. Runtime analysis and debugging of such applications in the deployment stage cannot be easily addressed by usual monitoring and debugging approaches. This approach drastically reduces the verification effort when verifying the deployment of BDA Apps in the cloud.

Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica [14] MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on clusters of commodity base. These systems are built around an model which is acyclic in data flow which is very less suitable for other applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple operations which is parallel. This encompasses many machine learning algorithms which are iterative. A framework cnamed Spark which ropes these applications and retains the scalability and tolerantes fault of MapReduce has been proposed. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs).

An RDD is a read-only collection of objects which are partitioned across a set of machines. It can be rebuilt if a partition is lost. Spark is able to outperform Hadoop in iterative machine learning jobs and can be used to interactively query around and above 35 GB dataset with sub-second response time. This paper presents an approach cluster computing framework named Spark, which supports working sets while providing similar scalability and fault tolerance properties to MapReduce

III. Proposed method

An Objective of proposed System is to the underutilization of CPU processes, the growing importance of MapReduce performance and to establish an efficient data analysis framework for handling the large data Drift in the workloads from enterprise through the exploration of data handling mechanism like parallel database such as Hadoop.


Figure 1: Provision of Cache Manager

III.A.Provision of Dataset To Map Phase :

Cache refers to the intermediate data that is produced by worker nodes/processes during the execution of a Map Reduce task. A piece of cached data is stored in a Distributed File System (DFS). The content of a cache item is described by the original data and the operations applied. A cache item is explained by a 2-tuple: Origin, Operation. The name of a file is denoted by Origin in the DFS. Linear list of available operations performed on the Origin file is denoted by Operaion. Example, consider in the word count application, each mapper node or process emits a list of word, counting tuples that record the count of each word in the file that the mapper processes. Cache manager stores this list to a file. This file becomes a cache item. Here, item refers to white-space-separated character strings. Note that the new line character is also considered as one of the whitespaces, so item precisely captures the word in a text file and item count directly corresponds to the word count operation performed on the data file. The input data are get selected by the user in the cloud. The input files are splitted. And then that is given as the input to the map phase. The input to the map phase are very important. These input are processed by the map phase.

III.B.Analyze in Cache Manager:

Mapper and reducer nodes/processes record cache items into their local storage space. On the completion of these operations , the cache items are directed towards the cache manager, which acts like an inter-mediator in the publish/subscribe model. Then recording of the description and the file name of the cache item in the DFS is performed by cache manager. The cache item should be placed on the same machine as the worker process that generates it. So data locality will be improved by this requirement. The cache manager maintains a copy of the mapping between the cache descriptions and the file names of the cache items in its main memory to accelerate queries. Permanently to avoid the data loss, it also flushes the mapping file into the disk periodically. Before beginning the processing of an input data file, the cache manager is contacted by a worker node/process. The file name and the operations are send by the worker process that it plans to apply to the file to the cache manager. Upon receiving this message, the cache manager compares it with the stored mapping data. If an exact match to a cache item is found, i.e., its origin is the same as the file name of the request and its operations are the same as the proposed operations that will be performed on the data file, then a reply containing the tentative description of the cache item is sent by the cache manager to the worker process.On receiving the tentative description,the worker node will fetch the cache item. For processing further, the worker has to send the file to the next-stage worker processes. The mapper has to inform the cache manager that it already processed the input file splits for this job. These results are then reported by the cache manager to the next phase reducers. If the cache service is not utilized by the reducers then the output in the map phase can be directly shuffled to form the input for the reducers. Otherwise, a more complex process is performed to get the required cache items. If the proposed operations are different from the cache items in the manager’s records, there are situations where the origin of the cache item is the same as the requested file, and the operations of the cache item are a strict subset of the proposed operations. On applying some additional operations on the subset item, the item is obtained. This fact is the concept of a strict super set. For example, an item count operation is a strict subset operation of an item count followed by a selection operation. This fact means that if the system have a cache item for the first operation, then the selection operation can be included, that guarantees the correctness of the operation. To perform a previous operation on this new input data is troublesome in conventional MapReduce, because MapReduce does not have the tools for readily expressing such incremental operations. Either the operation has to be performed again on the new input data, or the developers of application need to manually cache the stored intermediate data and pick them up in the incremental processing. Application developers have the ability to express their intentions and operations by using cache description and to request intermediate results through the dispatching service of the cache manager.The request is transferred to the cache manager. The request is analyzed in the cache manager. If the data is present in the cache manager means then that is transferred to the map phase. If the data is not present in the cache manager means then there is no response to the map phase.


Map reduce framework generates large amount of intermediate data. But, this framework is unable to use the intermediate data. This system stores the task intermediate data in the cache manager. It uses the intermediate data in the cache manager before executing the actual computing work.It can eliminate all the duplicate tasks in incremental Map Reduce jobs.

V. Future work In the current system the data are not deleted at certain time period. It decreases the efficiency of the memory. The cache manager stores the intermediate files. In future, these intermediate files can be deleted based on time period will be proposed. New datasets can be saved. So the memory management of the proposed system can be highly improved.

VI. References

[1] Asha, T., U. M. Shravanthi, N. Nagashree, and M. Monika. “Building Machine Learning Algorithms on Hadoop for Bigdata.” International Journal of Engineering and Technology 3, no. 2 (2013).

[2] Begoli, Edmon, and James Horey. “Design Principles for Effective Knowledge Discovery from Big Data.” In Software Architecture (WICSA) and European Conference on Software Architecture (ECSA), 2012 Joint Working IEEE/IFIP Conference on, pp. 215-218. IEEE, 2012.

[3] Zhang, Junbo, Jian-Syuan Wong, Tianrui Li, and Yi Pan. “A comparison of parallel large-scale knowledge acquisition using rough set theory ondifferent MapReduce runtime systems.” International Journal of Approximate Reasoning (2013)

[4] Vaidya, Madhavi. “Parallel Processing of cluster by Map Reduce.” International Journal of Distributed & Parallel Systems 3, no. 1 (2012).

[5] Apache HBase. Available at

[6] Verma, Abhishek, Ludmila Cherkasova, and R. Campbell. “Orchestrating an Ensemble of MapReduce Jobs for Minimizing Their Makespan.” (2013): 1-1.

[7] L. Popa, M. Budiu, Y. Yu, and M. Isard, Dryadinc:Reusing work in large-scale computations, in Proc. ofHotCloud’09, Berkeley, CA, USA, 2009

[8] T. Karagiannis, C. Gkantsidis, D. Narayanan, and A.Rowstron, Hermes: Clustering users in large-scale e-mailservices, in Proc. of SoCC ’10, New York, NY, USA, 2010.

[9] P. Scheuermann, G. Weikum, and P. Zabback, Datapartitioning and load balancing in parallel disk systems,The VLDB Journal, vol. 7, no. 1, pp. 48-66, 1998.

[10] Parmeshwari P. Sabnis, Chaitali A.Laulkar , “SURVEY OF MAPREDUCE OPTIMIZATION METHODS”, ISSN (Print): 2319- 2526, Volume -3, Issue -1, 2014

[11] Puneet Singh Duggal ,Sanchita Paul ,“ Big Data Analysis:Challenges and Solutions”, International Conference on Cloud, Big Data and Trust 2013, Nov 13-15, RGPV

[12] D. Peng and F. Dabek, Largescale incremental processingusing distributed transactions and notifications, in Proc. ofOSDI’ 2010, Berkeley, CA, USA, 2010

[13] Shvachko, Konstantin, Hairong Kuang, Sanjay Radia, and Robert Chansler. “The hadoop distributed file system.” In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pp. 1-10. IEEE, 2010.

[14] “Spark: Cluster Computing withWorking Sets “Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica University of California, Berkeley

Cite This Work

To export a reference to this article please select a referencing stye below:

Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.
Reference Copied to Clipboard.

Related Services

View all

DMCA / Removal Request

If you are the original writer of this essay and no longer wish to have your work published on the website then please:

Related Lectures

Study for free with our range of university lectures!