This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
Because processor speed is increasing at a much faster rate than the memory speed, computer architects have started to use memory hierarchies with one or more levels of cache memory. Caches take advantage of data locality in programs. Data locality is the property that references to the same memory location or adjacent locations are reused within a short period of time.
Caches also have an impact on programming, programmers substantially enhance performance by using style that ensures more memory references are handled by cache. Scientific programmers expend considerable effort at improving locality by structuring loops so that the innermost loop iterates over the elements of a column, which are stored consecutively in Fortran. This task is time consuming, tedious and error-prone. Instead, achieving good data locality should be the responsibility of the compiler. By placing the burden on the compiler, programmers can get good uniprocessor performance even if they originally wrote their program for a vector or parallel machine. In addition, programs will be more portable because programmers will be able to achieve good performance without making machine-dependent source-level transformations.
2. Data Locality
While parallelizing programs, two different notions of data locality are to be considered. Temporal locality occurs when the same data is used several times within a short period of time. Spatial locality occurs when different data elements that are located near to each other are used within a short period of time. An important form of spatial locality occurs when all the elements that appear on one cache line are used together. The reason is that as soon as one element from a cache line is needed, all the elements in the same line are brought to the cache and will probably still be there if they are used soon. The effect of this spatial locality is that cache misses are minimized resulting in an increase in speedup of the program.
Consider the following example in which the first loop finds the differences, the second finds the squares.
Figure 2.1.1 Program to find the squares of the differences (a) without loop fusion (b) with loop fusion
Given that there are two programs that perform the same computation, the fused example in figure 2.1.1(b) performs better that the one in figure 2.1.1(a), because it has better locality. Each difference is squared immediately, as soon as it is produced. In contrast, the example in figure 2.1.1(a) fetches z[i] once, and writes it twice. Moreover, if the size of array is larger than the cache, z[i] needs to be fetched from the memory the second time it is used in this example. Thus, this code can run significantly slower.
Consider another example that sets all the elements of an array to 0.
Figure 2.1.2 Sequential code for zeroing an array
Figure 2.1.2(a) and (b) sweeps through the array column by column and row by row respectively. In terms of spatial locality, it is preferable to zero an array row by row since all the words of in a cache line are zeroed consecutively.
These two examples illustrate several important characteristics associated with numeric applications operating on arrays:
Array code often has many parallelizable loops
When loops have parallelism, their iterations can be executed in arbitrary order, they can be reordered to improve data locality drastically
As we create large units of parallel computation that are independent of each other, executing these serially tends to produce good data locality
2.2 Matrix Multiplication - an example
Matrix multiplication, a familiar algorithm will be elaborated in this section to learn how rewriting the code can help improving data locality, that is processors are able to do their work with far less communication. The following figure shows the basic matrix multiplication algorithm that multiplies matrices X and Y, and stores the result in Z.
Figure 2.2.1 Basic Matrix Multiplication Algorithm
Though this matrix multiplication algorithm uses the same data, it suffers from poor data locality. A reuse of data results in a cache hit, only if the reuse happens soon enough, before the data is displaced from the cache. In this case, n2 multiply add operations separate the reuse of same data element in matrix Y, so the locality is poor. In fact, n operations separate the reuse of same cache line in Y. In addition, on a multiprocessor, reuse may result in a cache hit only if the data is reused by the same processor.
To optimize this algorithm so that it improves the data locality, one way is to change the layout of the data structure. For example, storing the Y in column-major order would help improve the reuse of cache lines of matrix Y. The applicability of this approach is limited because same matrix is normally used in different operations. For example, if Y played the role of X in another matrix multiplication, then it would suffer from being stored in column-major order, since the first matrix in a multiplication is better stored in row major order.
n Another way to improve data locality is to implement Blocking technique. This technique
Figure 2.2.2 A matrix divided into blocks of size B
changes the execution order of the instructions. Instead of computing it row or column at a time, divide the matrix into submatrices or block (figure 2.2.2), and orders the operations so that an entire block is used over a short period of time. If B is picked properly, we can reduce cache miss significantly. One should choose B such that, one block from each of the matrices fits into a cache line.
Figure 2.2.3 Matrix Multiplication with Blocking
Figure 2.2.3 shows the matrix multiplication with blocking. The outer three loops, are always incremented by B, therefore always mark the upper left of blocks. The three innermost loops with X and Y make all possible contributions to the block of Z.
3. Data Reuse
The two kinds of information that are useful for locality optimization and parallelization are Data Reuse and Data Dependence. For locality optimization, we wish to identify sets of iterations that access the same data or same cache line. For correctness of parallelization and locality loop transformation, we must identify all the data dependencies in the code. Both of these concepts are too related to each other. Whenever we identify iterations that reuse the same data, there are data dependencies between them and whenever there is data dependency, the same data is reused. However, not all reuse can be exploited in locality optimization. For example, consider the following piece of code,
Figure 3.1 Sample Code
the loop in the figure 3.1 reads locations 5,8,11,14,17,… and writes locations 3,10,17,24,... . The read and write iterations access the same elements 17, 38, 59 and so on. However, this reuse occurs rarely and cannot be exploited easily.
3.1 Types of Reuse
We need to distinguish between the access as an instruction in a program, e.g., x = z[i,j], from the execution of this instruction many times, as we execute the loop nest. The statement is referred to as static access, while the various iterations of the statement as we execute its loop nest are called dynamic access.
Reuse can be classified as self versus group. If iterations using the same data come from the same static access, we refer to reuse as self reuse and if iterations using the same data come from different static access, we refer to reuse as group reuse. The reuse is temporal if the same exact location is referenced and is spatial if the same cache line is referenced.
3.1.1 Self Temporal Reuse
There can be substantial savings in memory by exploiting self reuse. If the data referenced by a static access has 'k' dimensions and the access is nested in a loop 'd' deep, for some d>k, then the same data can be reused n(d-k). For example, 3-deep nested loop accesses one column of an array, then there is a potential saving accesses of n2 accesses. The dimensionality of the access corresponds to rank of the coefficient matrix in access, and we can find which iterations refer to the same location by finding the null space of the matrix.
220.127.116.11 Rank of a Matrix
Rank of a matrix F is the largest number of columns or rows of F that are linearly independent. A set of vectors are linearly independent if none of the vectors can be written as a linear combination of finitely many other vectors in the set. For example, the following matrix
1 2 3
5 7 9
4 5 6
2 1 0
Figure 18.104.22.168.1 Matrix
has a rank equal 2 since the second row is a sum of first and third one and the fourth row is the third row minus twice the first row.
22.214.171.124 Null Space of a Matrix
A reference in 'd' loop nest with 'r' rank accesses O(nr) in O(nd) iterations, so on an average O(nd-r) should refer to the same array element. Suppose an access in this loop nest is represented by matrix-vector combination F and f. Let i and i' be two iterations that refer to the same array element. Then Fi+f = Fi'+f.
Rearranging we get, F (i -i') = 0
The set of all solutions to the equation Fv = 0 is called the null space of F. If F is fully ranked i.e. if d = r, then the null space of F consists of only the null vector . In that case, iterations in a loop nest all refer to different data. In general, dimension of the null space, also called as nullity is d - r. If d>r, then for each element there is a (d-r) dimensional space of iterations that access that element. For example, the matrix in figure 126.96.36.199.1 has rank =2 and loop depth = 3 giving a value of 1 for nullity.
3.1.2 Self Spatial Reuse
The analysis of the spatial reuse depends on the data layout of the matrix. We assume the row major order here. Two array elements share the same cache line if and only if they share the same row in a two dimensional array. More generally, in an array of d dimension, we take array elements to share a cache line if they differ only in the last dimension. The trick to discover and taking advantage of the self spatial reuse is to drop the last row from the coefficient matrix. If this resulting truncated matrix has rank that is less than the depth of the loop nest, we can assure for spatial locality. For example, consider the following,
0 0 0 0
1 0 1 0
Figure 188.8.131.52 (a) Matrix (b) Truncated Matrix
The rank of the truncated matrix is 1 and the loop depth is 2 , we can say that there is definitely an opportunity for spatial reuse.
3.1.3 Group Reuse
We compute group sharing only among accesses in a loop sharing the same coefficient matrix. Given two dynamic accesses Fi1 + f1, Fi2 +f2, reuse of the same data requires that,
Fi1+f1 = Fi2+f2
Or F(i1-i2) = f2- f1
Suppose v is one solution to this equation. If w is any vector in the null space of Fi, w+v is also a solution. Consider the following example,
Figure 184.108.40.206 2-deep loop nest
Although there exist no temporal reuse for either access, the z[i,j] and z[i-1,j] access almost the same set of array elements. That is, there is group temporal reuse, because the data read by access z[i-1,j] is the same as data written by z[i,j].
4. Locality Optimization
The performance of a processor is highly sensitive to its cache behavior. Misses in the cache can take tens of clock cycles, so high cache miss rates can lead to poor processor performance. In the context of multiprocessor with a common memory bus, contention on the bus can add to the penalty of poor data locality. There are three techniques to improve the data locality in uniprocessor ad multiprocessor systems:
We improve the temporal locality of the computed results by trying to use the results as soon as they are generated. We do so by dividing a computation into independent partitions and executing all the dependent operations in each partition close together.
Array contraction reduces the dimensions of an array and reduces the number of memory locations accessed. We can apply array contraction if only one location of the array is used at a given time.
We improve the spatial locality of the computed results and the temporal and spatial locality of read only data. Instead of executing each partition one after the other; we interleave a number of the partitions so that reuses among partitions occur close together.
4.1 Temporal Locality of Computed Data
Obtain the dependent partitions together and execute these partitions serially improves the temporal locality of the computed data. The code in figure 4.1.1(b) has two outer loops that iterate through the independent partitions serially. This code has improved temporal locality, since computed results are used immediately in the same iteration. Thus even if the goal is to optimize for sequential execution, it is profitable to use parallelization to find these related operations and place them together.
Figure 4.1.1 (a) Code excerpt for a multigrid algorithm
Figure 4.1.1 (b) Code excerpt for a multigrid algorithm after partition
4.2 Array Contraction
This optimization provides another example of the tradeoff between storage and parallelism. Using more registers leads to instruction-level parallelism, using more memory leads to loop level parallelism. In the figure 4.1.1(b) above, each iteration of the inner loop produces and consumes different element of AP, AM, T and a row of D. If these arrays are not used outside the code excerpt, the iterations can serially reuse the same data storage, instead of putting the values in different element and rows. Doing array contraction here increases the speed of the program because it reads and writes less data. As less storage is used, less parallelism is available. Iterations in the figure 4.2.1 share data dependency and no longer can be executed in parallel. To parallelize the code on P processors, we can expand each of the scalar variables by a factor of p, and have each processor access its own private copy. Thus the amount by which the storage is expanded is directly correlated to the amount of parallelism exploited.
Figure 4.2.1 Code excerpt for a multigrid algorithm after partition and after array contraction
4.3 Partition Interleaving
Different partitions in a loop often read the same data or read and write the same cache lines. Optimizing reuse across partitions is really important. The following section focuses on different techniques to optimize reuse across partitions.
4.3.1 Reuse in Innermost Blocks
Data can be found in the cache if it is reused within a small number of iterations. If the innermost loop has large or unknown bound, only reuse across iterations of the innermost loop translates into a locality benefit. Blocking creates inner loops with small known bounds, allowing reuse within and across entire blocks of computation to be exploited. Thus, blocking has the effect of capitalizing on more dimensions of reuse. In figure 2.2.3, the block size B is chosen to be small, so that all cache lines read and written within the block of computation fits into the cache line.
4.3.2 Interleaving Inner Loops in a Parallel Loop
When an outer parallelizable code contains an inner loop, to exploit reuse across iterations of the outer loop, we interleave the executions of a fixed number of instances of the inner loop. Creating two dimensional inner blocks, this transformation reduces the distance between reuse of consecutive iterations of the outer loop.
Figure 220.127.116.11 Interleaving four instances of the inner loop
4.3.3 Interleaving Statements in a Parallel Loop
Consider a parallelizable loop with sequence of statements S1, S2...Sm. If some of these statements are loop themselves, statements from consecutive iterations may still be separated by many operations. We can exploit reuse between iterations by interleaving their executions as shown in the figure 18.104.22.168.
Figure 22.214.171.124 The statement interleaving transformation
This paper presents a comprehensive approach to improving data locality and highlights the important optimization techniques that can be used to improve the data locality, thereby increasing the cache hits and improving the performance of the processor.