This report allows us to understand different aspects of affinity scheduling and the performance of the scheduling compared to the existing scheduling clauses available in OpenMP. Loop scheduling uses different clauses like static, dynamic, guided, auto etc. Each option has a different structure when compared to performance. This was clear in the last report where all the directives were compared to each other. Each of the loop scheduling techniques has its positives and negatives. In these options load balancing is an important concept which has to be made clear. In load balancing, it is clear that the processors are continuously busy without being idle. In OpenMP directives, the load balancing is taken care of by using the parameter 'chunk' where the loop is divided into a number of these chunks. Each chunk of the loop is then distributed among the threads equally. Here, OpenMP assumes that the iterations take almost the same time to complete which may not be the case.
This is where the concept of affinity scheduling comes into picture. It is assumed that the iterations are executed at different speeds and consume different amounts of time for processing. The thread which finishes its iteration before other thread is idle and has to wait for all other threads to complete because of its explicit barrier. Hence, affinity scheduling algorithm was used and the performance of two loops given was observed and compared with the OpenMP scheduling directives. During implementation of the algorithm, special attention was given on inter-thread synchronisation. The loops are made parallel using OpenMP directives and run in parallel. As this algorithm uses shared memory programming, special care is taken to synchronise the threads. The threads can access the shared variables, which undergo concurrent reads and writes. Concurrent writes yield unwanted results, hence it was necessary to use synchronisation options like critical regions and locks to prevent them.
This report gives us a clear understanding of using affinity scheduling for unbalanced loads and the advantages and disadvantages of using affinity scheduling. It also helps us understand the performance of this scheduling option as compared to the OpenMP scheduling options like static, dynamic, guided and auto.
Before using affinity scheduling in this report, the behaviour of all the scheduling options was observed and the load balancing for each of the option was noted. The OpenMP clauses that were studied were static, dynamic with chunk size, guided and auto. The different load balancing techniques for each of these options were noted to be as follows:
In this clause, the iterations are divided into chunks of size N/p where N is the number of iterations and p is the number of threads. The loops are divided in round robin fashion. It can also be used for uneven load distribution but is not very efficient.
The working of dynamic scheduling is similar as compared to static but the only difference being that the distribution of chunks among threads is not known i.e. the chunks are distributed between threads dynamically. The thread which finishes its iterations first is given the next chunk for processing dynamically.
This option is convenient when the load for a particular loop is not known previously but it can affect the performance of the loop because of data locality.
The chunk size of guided scheduling clause is not predefined as for dynamic scheduling. It starts with the execution of the largest chunk and the chunk size decreases gradually till it becomes the least which is generally specified by the user. Guided is a good load balancing clause but it can cause performance overhead when the first chunk is largest and can cause wastage of time for the first chunk itself.
The scheduling of Auto clause is runtime i.e. it is free to choose the assignment of chunks between the threads. Auto can prove to be a good load balancing clause with less overhead as compared to the other load balancing algorithms.
Apart from these OpenMP clauses there are other load balancing algorithms which give better results than the ones mentioned above. But each of these algorithms have their advantages and disadvantages. One of such a kind of algorithm is the affinity scheduling algorithm. The working of affinity scheduling algorithm is as below:
The Parallel loop in affinity scheduling algorithm is divided into a number of local chunks. Each thread continues to execute its local chunks until the local chunk of a particular thread is exhausted. The thread which finishes its iteration can be thought of as the 'seeking' thread. Instead of staying idle, the seeking thread starts searching all the other threads for their local chunks. It checks each of the remaining iterations of the thread and finds the most loaded thread from the team of threads. After finding the most loaded thread the seeking thread starts stealing the 1/p iterations of the most loaded thread. At the same time, the most loaded thread which can be thought of as the 'giving' thread has (initial - stolen chunk) of iterations now left for execution. This behaviour of all the seeking thread continues until all the local chunks assigned to every thread are finished.
The main advantage of this algorithm is that it shows great efficiency in uneven load balances and shared memory programming. The implementation of this algorithm has to be done manually unlike the other OpenMP clauses.
The main idea behind the implementation of this algorithm was to create a common pool of information for all the threads from where the threads gather information about each other. The basic implementation of this algorithm was done by considering two common arrays of the size of the number of threads created. One of the arrays stored the remaining iterations of each of the threads. The threads were given a common set of iterations to be executed. Both the arrays are shared by the threads and serve as a common pool of information for the threads. One of the array stores the flags of each thread. The flag was set to 1 when the loop begins its execution and was reset to 0 when the loop finished its execution. This was done so as to ensure that which threads have flag set to 1 and are still continuing their execution and which of these threads have finished their execution. The array total which stores remaining iterations of the thread reduces its iteration every time it finishes one of its iteration.
The remaining iterations of each threads is known to one another. Hence after a thread finishes with its local set of iterations it checks if any flag of the thread is still 1. After this the seeking thread gets the id of that thread. This id is stored in the variable of most loaded thread, the seeking thread then gathers information on the remaining set of local iterations of the most loaded thread. This thread then steals the (1/p)th chunk of the remaining set of iterations of the giving thread. The giving thread then reduces its number of iterations from its local set of iterations to (remaining iterations - 1/p iterations) every time the chunk is stolen from the remaining chunk. The seeking thread then continues its execution on the stolen chunk of iterations and the giving thread processes the remaining number of local iterations in the total array.
Since the idea of common pool was used to implement this algorithm, it was found that synchronisation was of utmost importance. All the threads access the common pool of resources concurrently giving rise to concurrent access of resources. The concurrent writes lead to unwanted changes to the results. For example, the seeking thread executes the remaining chunk of iterations from the giving thread and computes a particular value and then writes the result . Whereas simultaneously the giving thread also computes some value from the remaining set of iterations in the total array and writes the result. Due to this the seeking thread overwrites the result many times yielding an unwanted result in the end.
The solution to this problem can be obtained through proper use of synchronisation techniques available in OpenMP. There are different Synchronisation techniques available which are as follows:
The critical region can be set to the variables which undergo concurrent updates. Due to this there will be only one thread entering the critical region and updating the variable. The disadvantage of using critical region is that since only one thread can enter the critical region at a time, the other threads would have to wait for their turn to enter the critical region which can affect the performance of the program. Although critical region in implemented in the section where the seeking threads simultaneously try to access the shared variables and try to update them concurrently.
Locks are a better way of synchronising threads to prevent access to shared variables. By using locks a threads can update their own variable regardless of access of other variables. Locks are a way of synchronising threads without affecting the performance of the threads.
By the use of synchronisation techniques the threads are prevented from concurrent access to the list of shared variables. Critical section is necessary when there are a number of seeking threads. As a result, these threads will simultaneously try to gain access to the giving thread which will cause a problem. Hence critical section was used to avoid the consequences. The critical section allows only one of the seeking threads to gain access to the giving thread. Until the seeking thread gets access and finishes with the critical section the other seeking threads can wait for their chance to arrive. Although there might be a chance that the performance of the program to be degraded as the other threads have to be idle till the seeking thread gets hold of the giving thread and its iterations.
Another aspect of synchronisation in the code implemented comes when the variable suma and sumc is calculated. The iterations sum up the values in the arrays a as well as c and print the sum in suma and sumc. Thus since the threads update these values concurrently, there is a possibility of concurrent writes to these arrays and thus get an undesired values. Thus it is necessary to create locks on these variables which will help prevent concurrent reads and writes to these variables.
The results obtained after running the code on different number of threads were observed to be varying. Loop1 was observed to be running for less time as compared to loop2. The reason for this might be the structure of each loop. The structure of loop2 was more complex than that of loop1. The observations of each loop are listed below in tabular form.
Number of Threads
Total time for 100 reps for loop1
Total time for 100 reps for loop2
Table 1: Number of threads and total time for loop1 and loop2
From the set of observations above, we can see that the total time for execution of 100 reps for loop1 reduces as the number of threads increases, but from the instant 6 threads are used, the time for 100 reps becomes almost constant till 16 threads used. But as for loop2, the total time for execution of 100 reps goes on reducing as the number of threads increase.
The graph for Table1 is given below:
Figure 1: Graph showing comparison of total time taken for loop1 and loop2 using different number of threads
As for the comparison of results between the best built in OpenMP schedule, we get the following set of observations for loop1 and loop2 respectively.
Number of threads
Total time for 100 reps using (static,1)
Total time for 100 reps using affinity scheduling
Table 2: Results of Number of threads and total time required for static and affinity scheduling
From the above set of results obtained we can infer that the built in loop scheduling algorithm for loop1 performs almost well till the number of threads were 4 but from that point on, the affinity scheduling algorithm performs better as compared to static and remains constant until it the number of threads reach 16. As for static the performance is slower as compared to affinity scheduling. This can be observed perfectly in the following graph.
Figure 2: Graph showing comparison between total time taken for loop1 using the OpenMP built in scheduling and affinity scheduling
Number of threads
Total time for 100 reps using (dynamic,8)
Total time for 100 reps using affinity scheduling
Table 3: Result of comparison between total time taken for different number of threads for dynamic and affinity scheduling
From the above set of observations we can notice that the built in scheduling option performs better than that of the affinity scheduling option. That must be the case because of the complexity of loop2. We can observe that the time reduces as the number of threads increases for both dynamic as well as affinity scheduling, but total time taken for affinity scheduling is more than that taken for dynamic scheduling clause. Hence affinity scheduling for loop2 is not as better as dynamic scheduling for loop2. Also since the predetermined chunks for dynamic is 8 , the threads seem to finish execution faster as compared to affinity, where it is allowed to take 1/p of the remaining iterations from the giving thread. This can be clearly observed using the following graph.
Figure 3: Comparison for loop2 using built in scheduling and affinity scheduling
From the above set of observations, the following set of conclusions have been inferred:
The affinity scheduling algorithm can perform better in some circumstances where the loops are less complicated and a lesser number of shared variables are utilized.
The more the number of shared variables, the more synchronisation is required and hence it can affect the performance of the program
Overuse of synchronisation and critical sections deteriorate the performance of the code in parallel as the threads have to wait for their turn.
In some cases the built in scheduling clauses perform better than that of the affinity scheduling algorithm.
The affinity scheduling performs better than static scheduling where the number of chunks are predetermined and allocated to the threads in round robin fashion. This is because in static a thread has to wait for other threads to complete execution because of the explicit barrier. But that is not the case in affinity scheduling, as the idle threads can steal chunk of other threads resulting in a better performance.
In case of loop2, in dynamic scheduling as the threads are assigned chunks dynamically on becoming idle, it performs better as the number of chunks is predetermined in dynamic. But in affinity it can only take 1/p chunks from the giving thread which keeps it at a loss.
Further improvements in the code can be done by using a different data structure rather than an array. The best possible method is by using queues. The algorithm is implemented in a basic form in this report but they can be enhanced by doing further work.