This report discusses video compression using the Hadoop and MapReduce frameworks. Hadoop and MapReduce are relatively new frameworks that support distributed computing across multiple nodes on a cloud network. They can process data stored on large distributed file systems such as the Hadoop Distributed File System (HDFS) or the Google File System (GFS). MapReduce is a functional programming paradigm well suited to speeding up highly parallelizable processes such as video compression. The purpose of the project is to leverage the computing power of cloud resources to compress high-definition raw videos in parallel and stream the compressed output over the network. The system requires building a private cloud using OpenNebula software and setting up the Hadoop framework on it. This project aims to find the optimal number of nodes for a compression job by examining factors such as performance, communication overhead and load balancing. The key advantages of this system are that it is highly parallelizable, reduces overhead on large-scale computations and provides fault tolerance.
Table of Contents
Chapter 1. Project Overview
Proposed Areas of Study and Academic Contribution
Current State of the Art
Chapter 2. Project Architecture
Architecture Subsystems
Hadoop Distributed File System (HDFS)
Client Node
JobTracker Node
Input Splits
TaskTracker Node
In recent years, demand has grown steadily for HDTV broadcasting and for the storage of high-definition videos. Increased security concerns have also led to more surveillance systems in public locations. Sources of massive video data span video conferencing, surveillance systems, HD video on demand, HDTV broadcasting, video telephony and more. These videos have to be compressed for storage and distribution, and compressed video transmission can reduce bandwidth requirements significantly.
H.264 is a highly efficient video compression technology featuring inter-picture prediction and variable block size motion compensation. It provides higher quality for high-motion videos at a lower bit rate than other technologies such as VP8 and MPEG-2. This project aims to accelerate H.264 video compression using the distributed computing power of the Hadoop framework. Hadoop is powered by the highly parallelizable MapReduce programming model.
This project uses the MapReduce model to run the video compression process in parallel on a cluster of nodes in a cloud. MapReduce can perform large-scale data processing and analysis in parallel by executing different modules of an application on different nodes.
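The map/shuffle/reduce pattern described above can be illustrated with a minimal sketch in plain Python. This is a toy model of the paradigm, not Hadoop's actual API; the "compression" is stood in for by a trivial transform, and all names are illustrative:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(chunk_id, frames):
    # Map phase: emit (key, value) pairs, one per frame.
    # A real job would run H.264 compression here; we just lowercase.
    for frame in frames:
        yield chunk_id, frame.lower()

def reduce_fn(chunk_id, values):
    # Reduce phase: reassemble the processed frames for one chunk.
    return chunk_id, "".join(values)

def run_mapreduce(inputs):
    # Shuffle/sort: collect intermediate pairs, group by key, reduce.
    intermediate = [pair for cid, frames in inputs for pair in map_fn(cid, frames)]
    intermediate.sort(key=itemgetter(0))
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

result = run_mapreduce([(0, ["AB", "CD"]), (1, ["EF"])])
# result == [(0, "abcd"), (1, "ef")]
```

In a real Hadoop deployment the map calls run on different nodes and the framework performs the shuffle over the network; the sequential loop here only models the data flow.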
Proposed Areas of Study and Academic Contribution
The following table summarizes the areas of study needed to implement this project.

Areas of study
H.264 codec implementation
Hadoop and MapReduce configuration
Hadoop is an open-source technology, and many projects use it to process high volumes of data. Developers can add custom features and patches to Hadoop as needed to improve the framework, and organizations can customize Hadoop to their needs without any licensing issues. Hadoop is an emerging technology that pairs well with cloud and virtualization techniques and attracts a great deal of attention among developers. H.264 is an efficient compression codec, and a deeper understanding of its algorithm is a good learning opportunity for students.
Current State of the Art
The demand for high-definition video storage and distribution is increasing day by day, and video compression technologies are evolving rapidly. Newer inter-frame compression technologies, which use motion compensation, provide better efficiency than intra-frame compression techniques. For example, the MPEG-4 compression technique delivers a 50% greater reduction in bit rate at the same quality than the older MPEG-2 standard. Another open-source video codec, VP8, released by Google, has proven efficient for low-motion video compression, while an advanced MPEG-4 format called H.264 is superior for high-motion video compression. H.264 is used in many consumer products, real-time video conferencing, YouTube videos and the iTunes store, and it improves the quality of video recording in surveillance systems.
Hadoop provides a shared storage system using HDFS and a highly parallelizable processing mechanism using MapReduce. Hadoop, an open-source implementation of ideas published by Google, is used by Yahoo!, Facebook, LinkedIn, the New York Times, Last.fm and many other organizations. The music community website Last.fm uses Hadoop for data processing, taking advantage of its scalability, low cost, customizability, easy learning curve and redundant backup. Facebook runs the second-largest Hadoop cluster, with 2,400 cores and 2 PB of data, and uses Hadoop to produce data summaries over large user datasets to power site features. The Yahoo! WebMap program is a Hadoop application that runs on more than 10,000 cores to derive data for Yahoo! search.
Figure 1 - Architecture
For this project, a Hadoop framework is built on a private cloud set up using Eucalyptus. The Hadoop Distributed File System (HDFS), the client node, the JobTracker node, input splits and the TaskTracker nodes are the main components of this architecture. HDFS stores the very large input video files and the compressed output video files. The client node presents the task to be accomplished, submitting it as a MapReduce job to the JobTracker node. The JobTracker oversees all submitted jobs, schedules them to run on different worker nodes and provides them with input for processing. The input file is divided into a number of splits according to the configuration provided with the Hadoop setup. The TaskTracker nodes run the actual task to be performed, which for this project is compression. Once compression is complete, the generated output is stored back in HDFS.
Hadoop Distributed File System (HDFS)
HDFS is a distributed filesystem provided by the Apache Hadoop framework. Like other distributed filesystems, HDFS is designed to store very large files across a cluster of computers connected over a network. Its streaming data access and support for large data sets make HDFS ideal for this project: very large high-definition raw video files are used as input and stored on HDFS, and the compressed output video files, which are still relatively large, are written back to it. HDFS follows a master-slave architecture. The master node, known as the NameNode, is a single node that administers the filesystem namespace: it stores the filesystem tree and maintains the metadata for all files and directories in the tree. There are one or more slave (worker) nodes, called DataNodes, which actually store the files. Files are divided into fixed-length blocks, and the blocks are stored across a group of one or more DataNodes connected as a cluster.
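The block-placement scheme described above can be sketched with a toy model: a file is cut into fixed-length blocks and a NameNode-style metadata map records which DataNode holds each block. This is a simplification of real HDFS (which uses 64 MB default blocks, replication and rack awareness; the round-robin placement and node names here are illustrative assumptions):

```python
BLOCK_SIZE = 4  # real HDFS defaults to 64 MB; 4 bytes here for illustration

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Cut the file into fixed-length blocks; the last one may be shorter.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes):
    # NameNode-style metadata: block index -> DataNode holding that block.
    # (Toy round-robin placement; real HDFS also replicates each block.)
    return {i: datanodes[i % len(datanodes)] for i in range(len(blocks))}

blocks = split_into_blocks(b"RAWVIDEOFRAMES")
placement = place_blocks(blocks, ["datanode1", "datanode2", "datanode3"])
```

The key idea the sketch captures is that the NameNode stores only this metadata mapping, while the block bytes themselves live on the DataNodes.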
The client node is the one that submits a job to be done. This job is called a MapReduce job and is the unit of work that a client needs performed. A MapReduce job consists of the input data, the MapReduce program and the configuration information.
The MapReduce program submits the job to the JobClient. The JobClient requests a new job ID and checks whether an output directory has been specified. It then slices the input into splits. Once the input splits are computed, the JobClient copies the configuration file, the splits and other resources into a directory accessible to the JobTracker node.
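The split computation the JobClient performs can be sketched as follows. This is a simplified model, not Hadoop's `InputFormat` machinery: each split is represented as an (offset, length) pair over the input file:

```python
def compute_splits(file_size, split_size):
    # JobClient-style split computation: walk the file in split_size
    # strides; the final split covers whatever bytes remain.
    return [(offset, min(split_size, file_size - offset))
            for offset in range(0, file_size, split_size)]

splits = compute_splits(file_size=10, split_size=4)
# splits == [(0, 4), (4, 4), (8, 2)]
```

Each (offset, length) pair later becomes the input of exactly one map task, which is why the number of splits determines the number of map tasks.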
The JobTracker is the node that controls all jobs submitted by client applications. Jobs are divided into two types of tasks: map tasks and reduce tasks. When a MapReduce job is ready for execution, it is placed in a queue; the JobTracker then schedules the queued jobs to run on TaskTrackers. The JobTracker creates as many map tasks as there are input splits and assigns one split to each map task.
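The one-map-task-per-split scheduling rule can be sketched as a toy scheduler. Real JobTracker scheduling also considers data locality and TaskTracker capacity; the FIFO queue and round-robin assignment below are simplifying assumptions:

```python
from collections import deque

def schedule(splits, tasktrackers):
    # One map task per input split, queued FIFO and assigned
    # round-robin across the available TaskTrackers.
    queue = deque(f"map-task-{i}" for i in range(len(splits)))
    assignment = {}
    slot = 0
    while queue:
        assignment[queue.popleft()] = tasktrackers[slot % len(tasktrackers)]
        slot += 1
    return assignment

plan = schedule([(0, 4), (4, 4), (8, 2)], ["tt1", "tt2"])
# plan == {"map-task-0": "tt1", "map-task-1": "tt2", "map-task-2": "tt1"}
```

With three splits and two TaskTrackers, one node receives two map tasks; the load-balancing question the project investigates is essentially how this ratio of splits to nodes affects total compression time.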
Input splits are fixed-size pieces computed by dividing the input files; the JobClient is responsible for slicing the input file into splits. In this project, the high-definition input video is divided into four parts, and the corresponding part of each frame is assigned to the same map task.
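The frame-partitioning rule (part k of every frame goes to map task k) can be sketched with a toy model in which each frame is a flat sequence of pixels. The frame representation and equal four-way cut are illustrative assumptions:

```python
def partition_frames(frames, parts=4):
    # Cut every frame into `parts` equal slices; slice k of each frame
    # is collected for map task k, so each task compresses the same
    # spatial region across the whole video.
    tasks = [[] for _ in range(parts)]
    for frame in frames:
        step = len(frame) // parts
        for k in range(parts):
            tasks[k].append(frame[k * step:(k + 1) * step])
    return tasks

frames = ["abcdefgh", "ijklmnop"]   # two toy "frames" of 8 pixels each
tasks = partition_frames(frames)
# tasks[0] == ["ab", "ij"]: part 0 of every frame goes to map task 0
```

Keeping the same region of every frame on one task matters for H.264, because motion compensation needs the corresponding region of neighbouring frames to predict from.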
TaskTrackers are the nodes that run tasks, both map and reduce. They are responsible for reporting the progress of each job to the JobTracker: each TaskTracker sends periodic heartbeats to the JobTracker to report that it is alive, and the heartbeat also informs the JobTracker when the TaskTracker is ready for a new task. Map tasks run the video compression algorithm on the input splits and are designed to run in parallel across multiple nodes in a cluster. The outputs of the map functions are stored temporarily on local disk, then merged and passed to the reduce task. The reduce task runs the reduce function, which combines the map outputs to build the final compressed file. This compressed output file is then written to HDFS.
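The final merge step can be sketched as follows: the reduce function receives the compressed chunk produced by each map task, keyed by task index, and concatenates them in order to rebuild the complete file. This is a toy model of the reduce stage; real H.264 output would also need container headers, which are omitted here:

```python
def reduce_merge(map_outputs):
    # map_outputs: {task_index: compressed chunk bytes}.
    # Concatenating in task order reconstructs the full compressed file,
    # regardless of the order in which the map tasks finished.
    return b"".join(map_outputs[k] for k in sorted(map_outputs))

final = reduce_merge({2: b"CC", 0: b"AA", 1: b"BB"})
# final == b"AABBCC"
```

Sorting by task index is what makes the result deterministic even though map tasks complete in arbitrary order on different nodes.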