Effective Data Locality Aware Task Scheduling


The data locality aware scheduling method gives priority to data locality. By improving data locality, it improves MapReduce performance.

The data locality aware scheduling method works as follows. When it receives a request from a node, it first tries to schedule a task whose input data is already present on the requesting node. If no such task is found, it selects the task whose input data is nearest to the requesting node, and then decides whether to reserve the task for the node that stores the input data or to schedule it on the requesting node. Reserving the task improves data locality but incurs a runtime overhead: the waiting time until the task can be scheduled on the node that holds the input data. This waiting time can be longer than the overhead of the latter option, namely the transmission time needed to copy the input data to the requesting node. Moreover, when a node has been assigned more than one task, the existing framework gives no priority among them. The approach described here allocates the task with the maximum required data, computed after recognizing the network locations and sizes of the input data. This ultimately reduces scheduling delay and enhances system utilization. A sketch of this selection rule appears below.
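To make the selection rule concrete, here is a minimal sketch of choosing the task with the maximum required data on a node. The Task class and its localInputBytes field are hypothetical illustrations (they are not part of Hadoop's API); the sketch only shows the idea of ranking runnable tasks by how much of their input already resides on the requesting node.

    // Hypothetical sketch: among the tasks runnable for a requesting node, pick
    // the one with the largest amount of input data already stored on that node.
    import java.util.List;

    class MaxDataTaskSelector {

        static class Task {
            final String id;
            final long localInputBytes; // input bytes already on the requesting node

            Task(String id, long localInputBytes) {
                this.id = id;
                this.localInputBytes = localInputBytes;
            }
        }

        // Returns the task with the maximum required data, or null if the list is empty.
        static Task selectMaxDataTask(List<Task> runnableTasks) {
            Task best = null;
            for (Task t : runnableTasks) {
                if (best == null || t.localInputBytes > best.localInputBytes) {
                    best = t;
                }
            }
            return best;
        }
    }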

Section 1 introduces the data locality aware scheduling method.

Section 2 reviews related work on scheduling methods.

Section 3 explains the design, input, and output. Section 4 presents the results and discussion, and Section 5 concludes the work.

2. Related Work

This section analyzes and compares various existing scheduling methods for the Hadoop MapReduce framework.

1. FIFO, Fair, and Capacity scheduling methods

FIFO is the default scheduler and is used with Hadoop without any extra configuration. The Fair scheduler was designed by Facebook; it assigns resources to jobs such that all jobs get, on average, an equal share of resources over time. The Capacity scheduler was designed by Yahoo; it is a pluggable scheduler for Hadoop that provides a way to share large clusters, based on the capacity of the resources.

2. LATE (Longest Approximate Time to End) scheduling method

The LATE scheduler speculatively executes tasks. A task may progress slowly for reasons such as high CPU load on the node or slow background processes, and all tasks must finish before the entire job completes. The scheduler therefore detects a slow-running task and launches an equivalent backup task, which is termed speculative execution. If the backup copy completes first, overall job performance improves. Speculative execution is an optimization, not a feature that ensures the reliability of jobs. The default implementation of speculative execution relies implicitly on certain assumptions: (a) uniform task progress on nodes, and (b) uniform computation at all nodes. The time-to-end estimate that LATE uses is sketched below.
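A minimal sketch of the estimate LATE uses to rank candidate tasks, following the standard formulation from the LATE paper: the progress rate is the progress score divided by the elapsed time, and the estimated time to end is the remaining progress divided by that rate. The class and method names here are illustrative, not Hadoop APIs.

    // Sketch of LATE's time-to-end estimate for a running task.
    class LateEstimator {

        // progressScore in [0, 1]; elapsedSeconds since the task started.
        static double estimatedTimeToEnd(double progressScore, double elapsedSeconds) {
            if (progressScore <= 0.0) {
                return Double.MAX_VALUE; // no progress yet: treat as slowest
            }
            double progressRate = progressScore / elapsedSeconds;
            return (1.0 - progressScore) / progressRate;
        }
    }

    // Tasks with the longest estimated time to end are the best candidates
    // for launching a speculative backup copy.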

3. Delay scheduling method

Delay scheduling is a technique for achieving both locality and fairness in scheduling. The Fair scheduler is designed to allocate a fair share of capacity to all users, and two locality problems arise when fair sharing is followed: head-of-line scheduling and sticky slots. The delay scheduling method overcomes these problems and achieves better data locality than fair scheduling alone; its core rule is sketched below.
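A minimal sketch of the delay scheduling rule, assuming a hypothetical Job interface with a per-job skip counter: when the head-of-line job cannot launch a node-local task, it is skipped for up to a fixed number of scheduling opportunities before being allowed to run non-locally. None of these names come from Hadoop's actual API.

    // Hypothetical sketch of the core delay scheduling rule.
    class DelaySchedulingPolicy {

        interface Task { }

        interface Job {
            Task findNodeLocalTask(String node); // null if no local task available
            Task findAnyTask();                  // any runnable task, locality ignored
            int getSkipCount();
            void incrementSkipCount();
            void resetSkipCount();
        }

        private static final int MAX_SKIPS = 3; // opportunities to wait for locality

        static Task pickTask(Job job, String requestingNode) {
            Task local = job.findNodeLocalTask(requestingNode);
            if (local != null) {
                job.resetSkipCount(); // locality achieved
                return local;
            }
            if (job.getSkipCount() < MAX_SKIPS) {
                job.incrementSkipCount(); // wait: a later request may offer locality
                return null;              // launch nothing for this job this round
            }
            job.resetSkipCount();
            return job.findAnyTask();     // give up on locality to preserve fairness
        }
    }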

4. Dynamic priority scheduling method

The Dynamic Priority scheduling method supports dynamic capacity distribution: capacity is distributed among users concurrently, based on their priorities. Automated capacity allocation and redistribution is supported in a regulated task-slot resource market. This method lets users acquire Map or Reduce slots on a proportional-share basis per time unit. These time units can be configured and are called allocation intervals. The proportional-share idea is sketched below.
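As an illustration of the proportional-share idea only (not the scheduler's actual implementation), the hypothetical sketch below divides a cluster's task slots among users in proportion to their spending rates for one allocation interval.

    // Hypothetical sketch: proportional-share slot allocation per interval.
    import java.util.HashMap;
    import java.util.Map;

    class ProportionalShareAllocator {

        // spendingRates: user -> budget spent per allocation interval.
        static Map<String, Integer> allocate(Map<String, Double> spendingRates, int totalSlots) {
            double totalSpending = 0.0;
            for (double rate : spendingRates.values()) {
                totalSpending += rate;
            }
            Map<String, Integer> slots = new HashMap<>();
            if (totalSpending <= 0.0) {
                return slots; // nobody is bidding this interval
            }
            for (Map.Entry<String, Double> e : spendingRates.entrySet()) {
                // Each user's share of slots is proportional to their spending rate.
                slots.put(e.getKey(), (int) Math.floor(totalSlots * e.getValue() / totalSpending));
            }
            return slots;
        }
    }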

5. Deadline constraint scheduling method

The Deadline Constraint scheduler addresses the issue of deadlines while also focusing on increasing system utilization. It handles deadline requirements in Hadoop-based data processing with a job execution cost model that considers parameters such as map and reduce runtimes, input data sizes, and data distribution. When jobs have different deadlines, the scheduler assigns a different number of tasks to each TaskTracker and ensures that the specified deadline is met. A simplified feasibility check in this spirit is sketched below.
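The sketch below is a deliberately simplified feasibility check, not the published cost model; it only illustrates the idea of estimating job runtime from map and reduce costs and comparing the estimate against a deadline. All names and the wave-based runtime formula are assumptions made for illustration.

    // Hypothetical, simplified sketch of a deadline feasibility check.
    class DeadlineCheck {

        static boolean canMeetDeadline(int numMaps, double avgMapSeconds,
                                       int numReduces, double avgReduceSeconds,
                                       int mapSlots, int reduceSlots,
                                       double deadlineSeconds) {
            // Assume waves of map tasks run first, then waves of reduce tasks.
            double mapTime = Math.ceil((double) numMaps / mapSlots) * avgMapSeconds;
            double reduceTime = Math.ceil((double) numReduces / reduceSlots) * avgReduceSeconds;
            return mapTime + reduceTime <= deadlineSeconds;
        }
    }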

6. Comparison of existing scheduling methods

FIFO: the main drawback is starvation of small jobs when resources are being utilized by large jobs.

Fair: unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. There are limits on the number of concurrently running jobs in each job pool. If too many jobs are submitted, the follow-up jobs wait in the scheduling queue until previous jobs complete and release task slots.

LATE: if bugs cause a task to hang or slow down, speculative execution is not a solution, since the same bugs are likely to affect the speculative task as well. Such bugs should be fixed so that the task does not hang or slow down.

3. Programmer's design

The data locality aware scheduling method is designed to improve the performance of the Hadoop MapReduce system.

As described in Section 1, when this method receives a request from a node, it schedules a task whose input data is already present on the requesting node. If no such task is found, it selects the task whose input data is nearest to the requesting node and then decides whether to reserve the task for the node that stores the input data or to schedule it on the requesting node. Reserving the task improves data locality but incurs a runtime overhead: the waiting time until the task can be scheduled on the node holding the input data. This waiting time can be longer than the overhead of the latter option, namely the transmission time needed to copy the input data to the requesting node.

The JobTracker receives jobs from the job queue, divides each job into multiple tasks, and assigns the tasks to TaskTrackers using the scheduler. The data locality scheduling method is used for scheduling tasks to TaskTracker nodes. The method performs the following functions:

Accepts a request from a requesting node.

Schedules the task.

Either reserves the task for the node storing the input data, or schedules the task on the requesting node by transferring the input data to it.

The objective of this method is to make a tradeoff between waiting time and transmission time when a task is scheduled to a node, thereby obtaining the optimal task execution time. A minimal sketch of this tradeoff follows.
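The sketch below illustrates the core decision under the assumptions of the description above: the waiting time is the remaining time of tasks currently executing on the data-holding node, and the transmission time is the input size divided by the available network bandwidth. All class and method names are hypothetical illustrations, not Hadoop APIs.

    // Hypothetical sketch of the waiting-time vs. transmission-time tradeoff.
    class LocalityTradeoff {

        enum Decision { RESERVE_FOR_DATA_NODE, RUN_ON_REQUESTING_NODE }

        // waitingSeconds: remaining time of tasks executing on the node holding the data.
        // inputBytes: size of the task's input data.
        // bandwidthBytesPerSec: available bandwidth between the two nodes.
        static Decision decide(double waitingSeconds, long inputBytes,
                               double bandwidthBytesPerSec) {
            double transmissionSeconds = inputBytes / bandwidthBytesPerSec;
            if (waitingSeconds < transmissionSeconds) {
                // Cheaper to wait for the data-local slot than to copy the data.
                return Decision.RESERVE_FOR_DATA_NODE;
            }
            // Cheaper to ship the data to the requesting node and run there.
            return Decision.RUN_ON_REQUESTING_NODE;
        }
    }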

3.1. Mathematical Model

Let S be the system.

1. Identify the inputs as T and N.

S = {L, T, Te, Tl, N, M, v, P, ...}

L = {g | g is a list of tasks}

T = {i | i is a task to be assigned to a node}

Te = {i | i is the execution time of a node}

Tl = {i | i is the remaining time of a node}

N = {r | r is a requesting node}

M = {q | q is a receiving node}

v = {t | t is the running speed}

P = {s | s is the progress}

2. Identify the output as O.

S = {L, T, Te, Tl, N, M, v, P, O, ...}

O = {On}

On = {o | o is the node to which the task is assigned}

3. Identify the waiting time as W.

W = {w | w is the waiting time of a node}

4. Identify the transmission time as Tm.

Tm = {m | m is the transmission time of a node}

5. Identify the failure cases as F.

S = {L, T, Te, Tl, N, M, v, P, W, Tm, O, ...}

Failure occurs when O = {}, or when O = {p | p is a case in which the waiting time is greater than the transmission time}.

6. Identify the success case (terminating case) as e.

S = {L, T, Te, Tl, N, M, v, P, W, Tm, O, e, ...}

Success is defined as O = {p | p is a case in which the waiting time is less than the transmission time}.

7. Identify the initial conditions as S0.

S = {L, T, Te, Tl, N, M, v, P, W, Tm, O, e, S0}

Mathematical Representation

Let S be the system:

S = {L, T, Te, Tl, N, M, v, P, W, Tm, O, e, S0}

where

T = set of tasks to be assigned

Te = set of execution times

Tl = set of remaining times of running tasks

N = set of requesting nodes

M = set of receiving nodes

v = set of speeds required for task processing

P = set of progress values of running tasks

W = set of waiting times

Tm = set of transmission times

O = set of outputs


3.2. Dynamic Programming and Serialization

The main objective of the data locality aware scheduling method is to improve the performance of the MapReduce framework. As described earlier, its operation starts when it receives a request from a requesting node: the method schedules a task whose input data is already present on the requesting node, and if no such task is found, it selects the task whose input data is nearest to the requesting node and then decides whether to reserve the task for the node that stores the input data or to schedule it on the requesting node.

Figure 1: Mathematical model, Venn diagram, and description

Figure 2 shows the divide and conquer model. In Figure 2, the JobTracker receives jobs from the job queue, divides each job into multiple tasks, and assigns the tasks to TaskTrackers using the scheduler. The data locality scheduling method is used for scheduling tasks to TaskTracker nodes.

Figure 2: Divide and conquer model

3.3. Data Independence and Data Flow Architecture

This scheduling method has two main modules. Figure 3 shows the data flow diagram for the data locality aware scheduling algorithm.

Module 1: Waiting time estimation

In the first module, the waiting time and the transmission time are calculated, and based on the waiting time the task is scheduled by the JobTracker. In a MapReduce framework, a node can send a request for a new task if it has disk space available for that task; since the space used by a completed task can be reused, this requirement is assumed to always be satisfied. A node sends a request whenever it completes a task, so successive tasks have to wait for scheduling until the node completes its currently executing tasks. The waiting time can therefore be measured by the remaining time needed to complete the executing tasks.

Figure 3: Data flow diagram

Module 2: Task selection

In the second module, a task is selected for a particular slot by comparing the transmission time and the waiting time. When the method receives a request from a requesting node, it selects a task as follows. If a task whose input data exists on the requesting node is available, that task is scheduled to the requesting node. If no such task is present, the method selects one from the second-level tasks, and then from the third-level tasks, and estimates the waiting time of the selected task. If the waiting time is shorter than the transmission time, the method reserves the task for the node with the input data and selects another task for the requesting node. A sketch of this multi-level selection follows.
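A minimal sketch of this multi-level selection, assuming hypothetical helper methods that return node-local, second-level (for example, rack-local), and third-level (remote) candidate tasks. None of these names come from Hadoop's API; the reservation step mirrors the rule in the paragraph above.

    // Hypothetical sketch of multi-level task selection with reservation.
    class LevelSelector {

        interface Task {
            double waitingSeconds();      // remaining time on the data-holding node
            double transmissionSeconds(); // time to copy input to the requesting node
            void reserveForDataNode();    // keep the task for the node holding its data
        }

        interface TaskPool {
            Task nodeLocalTask(String node);   // input data on the requesting node
            Task secondLevelTask(String node); // e.g. data on the same rack
            Task thirdLevelTask(String node);  // data elsewhere in the cluster
        }

        static Task select(TaskPool pool, String requestingNode) {
            Task local = pool.nodeLocalTask(requestingNode);
            if (local != null) {
                return local; // best case: data is already local
            }
            // Examine the second-level candidate, then the third-level candidate.
            Task[] candidates = { pool.secondLevelTask(requestingNode),
                                  pool.thirdLevelTask(requestingNode) };
            for (Task candidate : candidates) {
                if (candidate == null) {
                    continue;
                }
                if (candidate.waitingSeconds() < candidate.transmissionSeconds()) {
                    candidate.reserveForDataNode(); // cheaper to wait than to copy
                } else {
                    return candidate; // copy the data and run on the requesting node
                }
            }
            return null; // no suitable task this round
        }
    }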

3.4. Multiplexer Logic

In the Hadoop MapReduce framework, the JobTracker receives a job from the job queue, divides it into multiple tasks, and assigns each task to a TaskTracker using the scheduling method. In the data locality aware scheduling method, when a request arrives from a requesting node, the scheduler first checks whether the input data is present on that node; if so, it assigns the task to that node. If the input data is not present, it searches for the node that has the input data and, having found it, decides where the task should run. To select the appropriate node, it calculates the waiting time and the transmission time and compares them. If the waiting time is shorter than the transmission time, the method reserves the task for the node storing the input data.

The waiting time is calculated using the following relations:

v = P / Te

Tl = (1 - P) / v

where P is the progress of a running task, Te is the time the task has been executing, v is its running speed, and Tl is its remaining time. For tasks with multiple phases, v and Tl are calculated per phase, and the estimate f(X, Y) for the i-th phase represents the remaining time of the task. The remaining time of the tasks running on a node represents the waiting time of the tasks whose input data are stored on that node. A task is then selected for the node using the task selection criteria, as sketched below.
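A minimal sketch of this estimate: given a task's progress P and the time Te it has been executing, the running speed is v = P / Te and the remaining time is Tl = (1 - P) / v. The waiting time for a node is taken here as the sum of the remaining times of its running tasks; the text does not state whether the method sums or takes a minimum, so that aggregation is an assumption.

    // Sketch of waiting-time estimation from task progress (names are illustrative).
    class WaitingTimeEstimator {

        // Remaining time of one task: v = P / Te, Tl = (1 - P) / v.
        static double remainingSeconds(double progress, double executedSeconds) {
            if (progress <= 0.0) {
                return Double.MAX_VALUE; // no measurable progress yet
            }
            double speed = progress / executedSeconds; // v = P / Te
            return (1.0 - progress) / speed;           // Tl = (1 - P) / v
        }

        // Waiting time of a node, assumed here to be the total remaining time
        // of the tasks currently running on it.
        static double nodeWaitingSeconds(double[] progresses, double[] executedSeconds) {
            double total = 0.0;
            for (int i = 0; i < progresses.length; i++) {
                total += remainingSeconds(progresses[i], executedSeconds[i]);
            }
            return total;
        }
    }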

3.5. Turing Machine

The state diagram for the data locality aware scheduling method is shown in Figure 4. In the first state, an input job is received by the JobTracker from the job queue; in the next state, the job is divided into multiple tasks; and in the following state, a task is assigned to a TaskTracker by the scheduler. The scheduler then checks for the presence of the input data on the receiving node; if the data is present, it assigns the task, otherwise it moves to the next state. In that state, the waiting and transmission times are estimated and compared, and if the waiting time is shorter than the transmission time, the task is assigned to the node holding the input data.

Figure 4: State diagram

4. Results and Discussion

As a result of this scheduling method, the performance of the MapReduce framework is improved.

To estimate the results, the default Hadoop scheduling method and the proposed scheduling method are deployed on the same Linux cluster. To evaluate the performance of the MapReduce framework, the following evaluation criteria are selected:

1. The number of map tasks not scheduled to the nodes holding their input data.

2. The normalized execution time.

3. The response time of jobs.

The effectiveness of the scheduling method is measured by comparing these criteria for jobs running in the two Hadoop environments. It is desirable that the system have fewer map tasks that are not scheduled to the nodes holding their input data, since this reduces data transmission over the network; it is also desirable to obtain shorter normalized execution times and response times, since this means the jobs were executed more effectively and efficiently. Finally, the difference between the submission time and the completion time of each job is calculated; this gives the response time of the job. Comparing the data locality aware scheduling method against the default Hadoop scheduling method, a shorter response time is expected. The computation of these metrics is sketched below.
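A small sketch of the two timing metrics, under the usual definitions assumed here: response time is completion time minus submission time, and normalized execution time divides a job's execution time under the proposed scheduler by its time under the default scheduler.

    // Illustrative computation of the evaluation metrics.
    class Metrics {

        // Response time: completion time minus submission time.
        static double responseSeconds(double submissionTime, double completionTime) {
            return completionTime - submissionTime;
        }

        // Normalized execution time: proposed scheduler's time relative to the
        // default scheduler's time for the same job (< 1.0 means an improvement).
        static double normalizedExecutionTime(double proposedSeconds, double defaultSeconds) {
            return proposedSeconds / defaultSeconds;
        }
    }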

5. Conclusion

This method improves the data locality of Hadoop MapReduce, and this results in a performance improvement, reducing both the normalized execution time and the response time of jobs. As future scope, more focus can be placed on techniques that transmit data sets in Hadoop environments more effectively. The locality aware model could also be extended for load balancing and resource allocation across places while avoiding counterproductive stealing.
