# Cache Oblivious Matrix Multiplication Algorithm Computer Science Essay

**Published:** **Last Edited:**

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Hardware implements cache as a block of memory for temporary storage of data likely to be used again. Cache Oblivious Algorithm are typically analyzed using an idealized model of cache, but it is much easier to analyze than a real cache memory. In this paper an efficient technique is proposed to manage cache memory. The new technique uses block recursive structure of two types only. The algorithm is tested on famous problem of matrix multiplication. It avoids jumps and cache misses are reduced to the order of N3/L√M.

Keywords: Cache Memory; Cache oblivious; Cache hit; Cache miss; Tag; Sequential Processing.

## Introduction

Cache memory is a high-speed, relatively small memory that represents a critical point in computer systems, since it eliminates the gab between a central processing unit (CPU) and main memory speed. It has an access time smaller than that for main memory. So, the average access time is decreased in a great deal. Every time the CPU generates a reference for a specific word, the cache memory will be accessed to get that word if it is there; otherwise, the main memory is accessed to retrieve the line containing that word.

A Cache Oblivious Algorithm is designed to exploit the CPU cache without having the size for cache or the length of the cache lines as an explicit parameter. Cache Oblivious was conceived by Charles E. Leiserson as early as 1996 and first published by Harald Prokkop in his master thesis at Massachusetts Institute of Technology in 1999.

The Cache Oblivious Algorithm is a simple and elegant model to design algorithm that perform well in hierarchical memory models ubiquitous on present hardware platforms. This model was first formulated in 1999 (M. Frigo Charles et al.1999) and since then is the topic of research.

Optimal cache oblivious algorithm is widely used in the Cooley-Tukey Fast Fourier Transform (FFT) algorithm, matrix multiplication, sorting, matrix transposition and several other problems (P.Kumar, 2007). Typically a cache oblivious algorithm works by a recursive divide and conquer methodology, where the problems are divided into smaller and smaller sub-problems. The resource uses can be optimally used by using recursive block structure algorithms. They are simple and portable. They use cache effectively because once the sub-problem fits into cache, its smaller sub problems can be solved into cache and with no cache misses (M. Frigo et al 1999).

Matrix multiplication is one of the most studied computation problem (D. Coppersmith et al 1990). Consider the two matrices of size N - N, Matrix X = ( Xi, Xj), and Y = ( Yi, Yj). The product Z of these two matrices is given by

N

Zij = Σ Xik, . Y kj ……………………………..(1)

K

We can also assume that 99% of matrix multiplication resulting algorithms are similar to algorithm 1. But this algorithm involves a lot of memory jumps and makes bad use of cache memory. The cache misses in this algorithm are of high order.

Algorithm 1 Multiplication of two n-by-n matrices

for i from 1 to N do

for j from 1 to N do

C[ i, j ] := 0

for k from 1 to N to

C[ i, j ] := C[ i, j ] + A[ i, k ] * B [ k,j] ;

enddo;

enddo;

enddo;

To improve cache performance, the temporal and spatial locality of the access to the linearized matrix elements has to be improved. Most linear algebra libraries, like implementations of ATLAS (R. Clint et al 2001), therefore use techniques like loop blocking, and loop unrolling [kazushige et.al ]. A lot of fine tuning is required to reach optimal cache efficiency on a given hardware, and very often the tuning has to be repeated from scratch for a new machine. Recently, techniques have become popular that are based on a recursive block matrix multiplication [Gustavson F.G. et.al. 1999]. They automatically achieve the desired blocking of the main loop, and the tedious fine tuning is restricted to the basic block matrix operations. Such algorithms are called cache oblivious [Erik D. Dernaine et.al. 2002], emphasizing that they are inherently able to exploit a present cache hierarchy, but do not need to know about the exact structure of the cache.

Several approaches have been presented that use an element ordering based on the peano curve [Bader M. et al 2004]. The peano curve results from a recursive construction idea. It totally avoids jump in the access to all matrices involved and shows optimal spatial locality but still this problem is open for research and modifications can be performed.

## Sequential Access Processing

In this paper, we present an approach that uses a sequential access processing. The sequential access (see fig 1) technique that converts the two dimensions into single. It is very easy to manipulate different matrix transformation operations.

Figure 1: Sequential Access Processing

However, our presented scheme totally avoids jumps in access to all matrices involved and reduces cache miss. It also uses only two types of recursive blocks.

We will demonstrate the general idea of sequential access based algorithm for 3-by-3 matrices. Initially we will convert a matrix of two dimensions into a single one. Algorithm 2 gives conversion algorithm.

Algorithm 2: Conversion of two dimensions into single dimension

counter =0;

for ( i = 0; i < n; i++)

for ( j = 0; j < n; j++)

cin>>a[ i, j];

d[counter] = a[ i , j]

Now it is easy to access the matrix element levels. It gives better element access from the cache memory.

Lets consider the multiplication of 3-by-3 matrices. The elements matrices are combined as single matrices.

a0 a5 a6

a1 a4 a7

a2 a3 a8

b0 b5 b6

b1 b4 b7

b2 b3 b8

c0 c5 c6

c1 c4 c7

c2 c3 c8

a b c … (2)

The elements Ck of matrix C are computed as sum of products of a and b.

Ck = Σ ai, . b,j ………………………………………….(3)

I, ,j

Where each set Ck contains multiplication of both pair elements. Figure 2 gives the graph representation of operations of 3-by-3 matrix multiplication. The nodes of graph are triples (i, j, k). In figure 2, two nodes are connected if the difference between two indices of nodes is not larger than one. Observe that all the nodes are connected and there are no jumps at all. It can be traversed in forward or backward direction.

Figure 2: Graph representation of operations of 3-by-3 matrix multiplication

For example, the element C0= {(a0, b0), (a5, b1), (a6, b2)} and that of C1={(a1, b0), (a4, b1), (a7, b2)}.

The multiplications scheme presented can be easily extended to multiplication of 5-by-5 , 7-by-7 and so on. It can be used on any matrix multiplication as long as the matrix dimensions are odd numbers. It is necessary to use a block recursive approach. In case of a large matrix, the matrix can be divided into 3-by-3 recursive blocks as shown in figure 3.

Figure 3: 3-by-3 Recursive blocks

Recursive blocks for 9-by-9 matrices are shown in figure 4. Observe that we have used only two types P and Q recursive blocks for the complete 9-by-9 matrix. This is the best possible way to divide the 9-by-9 matrix.

Figure 4: Recursive blocks for 9-by-9 matrix

Also, observe that the range of indices within a matrix block is contiguous. So, it fulfills the basic requirement of recursive block and avoids the jumping of blocks.

## Recursive Sequential Multiplication

Now we will show the use of recursive blocks P and Q in case of 9-by-9 matrix multiplication. The two 9-by-9 matrices and their resultant matrix is given by equation (4).

Pa0 Qa5 Pa6

Pa1 Qa4 Pa7

Pa2 Qa3 Pa8

Pb0 Qb5 Pb6

Pb1 Q b4 Pb7

Pb2 Q b3 Pb8

Pc0 Qc5 Pc6

= Pc1 Qc4 Pc7

Pc2 Qc3 Pc8

………(4)

We can write the equations for elements of resultant matrix as given in equation (5).

Pc0= Pa0 Pb0 + Qa5 Pb1 + Pa6 Pb2

Qc5= Pa0 Qb5 + Qa5 Qb4 + Pa6 Qb3

Qc3= Pa2 Qb5 + Qa3 Qb4 + Pa8 Qb3 ……………………………………………(5)

Similarly we can write equations for Pc1, Pc2, Qc4, Pc6, Pc7, Pc8 . If we consider the ordering of the matrix blocks , there are exactly four types of block multiplications as given in equation (6).

## +

P P . P

## +

Q P . Q

## +

P Q. P

## +

Q Q. Q

…………………………………………………………….(6)

## +

P P . P means P = P + P.P

Thus we have a closed system of four multiplication schemes. Observe that we have got only four multiplication schemes which are much less than other works (M. Bader et al 2004).

We need to compute matrix operations for all five multiplication schemes. The matrix operations for P=P+P.P can be same as that given in figure 2. The matrix operations for Q=Q+Q.Q is given in figure 5.

Figure 5: Matrix operations for Q = Q + Q.Q

After carefully examining all matrix operations we can prepare the table for sequential access operations for all matrix multiplication schemes as given in table 1.

Table 1: Sequential operations of all matrix multiplications

## +

C A . B

C

A

B

## +

P P . P

## +

## +

## +

## +

Q P . Q

## +

## +

## +

## +

P Q . P

## +

## +

## -

## +

Q Q . Q

## +

## +

## -

+ sign indicates that block scheme is executed in forward direction and - sign indicates that the block scheme is executed in reverse direction.

Algorithm 3: Implementation of Sequential Access Recursive Scheme

Mult(int SA, int SB, int SC, int dim)

## {

If (dim==1)

## {

C[c] = C[c]+A[a].B[b]

## }

else

{ Mult (SA, SB, SC, dim/3); a += SA; c += SC

Mult (SA, SB, SC, dim/3); a += SA; c += SC

Mult (SA, SB, SC, dim/3); a += SA; b +=SB

Mult (SA, -SB, SC, dim/3); a += SA; c - = SC

Mult (SA,-SB, SC, dim/3); a += SA; c - = SC

Mult (SA, -SB, SC, dim/3); a += SA; b += SB

Mult (SA, SB, SC, dim/3); a += SA; c += SC

Mult (SA, SB, SC, dim/3); a += SA; c += SC

Mult (SA, SB, SC, dim/3); c += SC; b += SB

Mult (SA, SB, SC, dim/3); a -= SA; c += SC

Mult (SA SB, SC, dim/3); a -= SA; c += SC

Mult (SA, SB, SC, dim/3); a -= SA; b += SB

Mult (SA, -SB, SC, dim/3); a -= SA; c- = SC

Mult (SA,-SB, SC, dim/3); a -= SA; c - =SC

Mult (SA, -SB, SC, dim/3); a-= SA; b += SB

Mult (SA, SB, SC, dim/3); a -= SA; c += S C

Mult (SA, SB, SC, dim/3); a -= SA; c+= SC

Mult (SA, SB, SC, dim/3);; b+-= SB; c += SC

Mult (SA, SB, SC, dim/3); a += SA; c += SC

Mult (SA ,SB, SC, dim/3); a += SA; c += SC

Mult (SA, SB, SC, dim/3); a += SA; b += SB

Mult (SA, -SB,SC, dim/3); a += SA; c - =SC

Mult (SA,-SB, SC, dim/3); a += SA; c - =SC

Mult (SA, -SB, -SC, dim/3); a += SA; b += SB

Mult (SA, SB, SC, dim/3); a += SA; c += SC

Mult (SA, SB, SC, dim/3); a += SA; c += SC

Mult (SA, SB, SC, dim/3);

## }

## }

## Implementation

The sequential access recursive scheme can be implemented as given in algorithm 3. This algorithm takes four parameters SA, SB, SC and dim. SA, SB, and SC indicates sequential access scheme. These parameters can take only +1 or -1 value. Parameter dim specifies the current matrix size. A, B, and C are actual matrices and their indices are a, b and c. Matrix C is the resultant matrix. The name of the recursive function is Mult. The algorithm can be easily implemented by writing appropriate program in C or C++.

## Data Access Locality and Cache Efficiency

During 3-by-3 block multiplication, the algorithm will perform 27 operations. Normally, for multiplication we require access of n2 elements for n3 operations (M.Bader et al 2004). But in this algorithm we need to access only 9 elements of all the three matrices. That is, we are accessing only k2 elements during k3 operations. So, we are getting k2/k3, that is, k 2/3 ratio. Hence, data access locality is of the order of k 2/3.

The ideal cache model assumes a computer consisting of a local cache of limited size, and unlimited external memory. The cache consists of M words that are organized as cache lines of L words each. The replacement strategy is assumed to be ideal in the sense that the cache can foresee the future. Hence, if a cache line has to be removed from the cache, it will always be the one that is used farthest away in the future.

For a matrix multiplication of two N - N matrices, N being a power of three, out of N - N block, only n - n block is processed and will remain in cache. This block is also reused in next multiplication. Hence, only two blocks will be required to be transferred. Therefore, the number of cache lines transfers are given by T(n) = 2 n2/L.

Therefore, for N - N, T(N) = (N/n)3 T(n) which is of the order of (N3/L√M).

## Conclusion

We present here the problem of optimization of cache memory done by implementation of optimal oblivious matrix multiplication. Our algorithm uses only two types of recursive blocks. All the elements are accessed sequentially and there are no jumps at all. The number of cache miss are of the order of Ò ( N3/L √M).