This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
Introduction and motivation: GPU is a highly parallel, multithreaded, multicore processor with the great computational power and high memory bandwidth. The GPU architecture was driven by the market, demanding a real-time and high-definition 3D graphics. The main difference between CPU and GPU is that GPU contains more transistors devoted to computations then data caching and flow control. The advances of Graphic Processing Units (GPU) technology and the introduction of CUDA programming model facilitate developing new solutions for sparse and dense linear algebra solvers. The project deals with implementation and investigation of Matrix Multiply and Matrix Transpose using CUDA on GPU.
Matrix Multiplication is a fundamental building block for scientific computation. While solving linear equations and finding eigenvalues, the workload is dominated by matrix multiplication. As scientific computing is an important application domain in GPU computing, optimizing matrix multiplication on GPU is the key to achieve high performance in this domain. Having this in mind we are trying to optimize matrix multiplication using different techniques like coalescing, tiling, prefetching, avoiding memory bank conflicts, and loop unrolling. Later we analyze how efficiently the memory is used and performance is improved.
Matrix Transpose is one important linear algebra procedure that has deep impact in various computational science and engineering applications. Several implementations have been proposed in the literature to enhance the computation speed. However, several factors hinder the expected performance of large matrix transpose on GPU devices. I try to optimize matrix transpose by solving issues of memory usage like coalescing data transfers to and from global memory, shared memory bank conflicts and padding and evaluate the performance.
Matrix operations involve large number of memory accesses. My project concentrates on improving these matrix operations on a GPU with optimal time.
Related work: Data parallel programming languages are considered an intermediate approach between automatic parallelization efforts and explicit parallel programming models such as CUDA and OPENMP to support parallel computing. Initially Fortran 90 was widely used language and influenced following data parallel languages. Similar to array assignments in Fortran 90 is the lock step execution of each single instruction in threads executing simultaneously on a streaming multiprocessor in CUDA programming model. Recently interest in GPGPU programming has increased which is basically used for graphics hardware.
There has been lot of work on studying and exploiting the parallelism existent in matrix multiplication. In the year 1988, University of Texas, Austin presented their work on the analysis and performance on a class of parallel matrix algorithms in "Analysis of a Class of Parallel Matrix Multiplication Algorithms" by Gunnels., et all. They implemented the panel-axpy and the panel-dot based algorithms which allowed for high performance to be obtained across a range of matrix sizes. "A new parallel Algorithm on Distributed-Memory Concurrent Computers" by Choi., et all presented another parallel matrix multiplication algorithm called DIMMA, for block cyclic data distribution on distributed memory concurrent computer systems. DIMMA was a scalable matrix multiplication algorithm which used a pipelined broadcasting scheme to overlap computation and communication effectively to obtain maximum throughput from the sequential BLAS routine regardless of the block size. Similarly, a lot of research is going in exploiting data- level parallelism in matrix transpose as well. "Linear-time Matrix Transpose Algorithms Using Vector Register File with Diagonal Registers" by Bedros Hanounik., et all presented algorithms that exploit diagonal register properties to achieve a linear-time execution of MT operation using vector processor that supports diagonal registers. In the year 1993, a paper named "parallel matrix transpose algorithms on distributed memory" by choi, et all proposed algorithms on distributed memory concurrent processors.
Algorithm and implementation of the applications:
Matrix multiplication optimization:
Matrix multiplication: This document discusses aspects of CUDA application performance related to efficient use of GPU memories and data management as applied to a matrix multiplication. The main issue is the optimization of matrix multiplication using shared memory, coalescing data transfers in global memory, dividing the large matrix into smaller matrices and computing the product (tiling), prefetching the data of sub matrix and finally improving the performance using CUBLAS.
NaÃ¯ve GPU implementation:
In this implementation on GPU one thread is assigned to compute one element of the final matric C. Each thread loads one row of matrix A and one column of matrix B from global memory, do the inner product, and store the result back to matrix C in the global memory. As shown in the figure the memory footprint of one thread on global memory where matrix A, B and C are stored.
Tiled matrix multiplication is mainly used for improving the computation to memory ratio. In this approach the matrix is divided into a number of tiles and then each tile is computed. Each thread block computes one tile of the matric C and each thread in that particular thread block computes one element of that tile. A tile refers to the working set of data operated on by a block, grid, warp, etc. When output tile size is increased, the data reuse in shared memory also increases. It is important to monitor the size of the tile so that optimized output is obtained.
When a matrix multiplication is performed, The GPU kernel computes in multiple iterations. In each iteration, one tile of matrix A and one tile of matrix B are loaded from global memory to shared memory, computation is performed, and temporal result of matrix C is stored in a register. After all the iteration is done, the thread block stores one tile of matrix C into global memory.
Accessing data in the global memory is critical to the performance of a CUDA application. Memory coalescing techniques are used to move data efficiently from global memory into shared memory and registers. Global memory is implemented with dynamic random access memories (DRAMs).Two dimensional arrays in C/C++ are row-major. In the above tiled implementation, multiple global accesses have to be carried out to load one column of matrix B. during computation the input data is loaded from global memory and is transferred to shared memory. Once the output has been computed in shared memory it is again transferred to the global memory. An obvious solution is to transpose matrix B by CPU before offloading it to GPU memory.
Avoiding Bank Conflicts:
Shared memory has been used in all our implementations starting from the Tiled approach. Shared memory has banks that are organized such that they are interleaved. The GeForce GTX 480 device has a 16KB of shared memory per Streaming Multiprocessor. The Shared memory is divided into 16 banks where each bank can service one address per cycle. The number of simultaneous accesses is equal to the number of banks available. Multiple simultaneous accesses to the same memory bank lead to bank conflicts i.e. A bank conflict occurs if two or more threads access any bytes within different 32-bit words belonging to the same bank. I have implemented a coalesced matrix multiplication without bank conflicts.
The kernel is computation bound. Because the inner product consumes most of the time, it is important to make sure this part is efficient. The inner product in CUDA takes two instructions in the binary for one line of code (i.e. two source operands). This accrues the pressure on the current GPU architecture that allows only one source operand from the shared memory. A better solution is to perform outer product instead of inner product. For outer product matrix A can be stored in shared memory and matrix B, C are stored in registers as it does not require sharing of matrices B and C. Therefore, each thread only stores one element of matrix B and one column of the tile of matrix C in the register. The computation to memory ratio of the outer product is the same as the inner product.
Prefetching is to fetch data in advance. I have implemented prefetching along with loop unrolling to fetch data for the next iteration. In this implementation we unroll loops to facilitate code scheduling in the compiler or explicitly insert prefetching code. Prefetching for matrix multiplication is achieved by initiating long-latency global loads into an additional local variable register long before the variable is used. This optimization category is primarily the jurisdiction of the instruction schedulers of the compiler and runtime. Usually when a loop occurs same condition is execution over and over until the loop terminates. This increases the execution time. Loop unroll technique reduces the number of branches that are to be executed and the number of execution times of the terminating condition. "Pragma" has been used to unroll the outer loops as the inner loops are automatically unrolled by NVCC.
CUBLAS is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA driver. CUBLAS allows access to the computational resources of NVIDIA's GPUs. Applications use the CUBLAS library is to create matrix and vector objects in GPU memory space, fill them with data, call a sequence of CUBLAS functions, and, finally, upload the results from GPU memory space back to the host. I have implemented matrix multiplication using CUBLAS helper routines. I used cublasSgemm() [BLAS1 function]which computes the product of matrix A and matrix B, multiplies the result by scalar alpha, and adds the sum to the product of matrix C and scalar beta. Alpha and beta are singleâ€precision scalars. I have implemented CUBLAS routines for squared and non-squared matrices.
To exactly determine the effect of CUBLAS I have compared execution times of both the CPU and GPU using simple CUDA with CUBLAS execution time.
Matrix Transpose optimization:
Matrix transpose: This section discusses aspects of CUDA application performance related to efficient use of GPU memories and data management as applied to a matrix transpose. Matrix transpose is optimized so that it uses shared memory with no bank conflicts. It is then compared with naÃ¯ve matrix transpose. This particular approach increases the execution speed by 10x times. Shared memory deals with on-chip bank conflicts.
The code performs certain tasks like data allocation and transfer between host and device, the launching and timing of several kernels, result validation, and the de-allocation of host and device memory.
This is the most basic form of a transpose. odata and idata are the float pointers to the input and output matrices, width and height are the dimensions of matrix X and matrix Y, and nreps determines how many times the loop over data movement between matrices is performed. Two indices index_in and index_out are used to calculate the value of the matrix element. Each thread executing the kernel transposes elements from one column of the input matrix to their transposed locations in one row of the output matrix.
Shared memory transpose:
The shared memory transpose has three optimizations made: (1) coalescing memory access (2) tiling (3) padding.
Un-coalesced global memory access is avoided by reading the data into shared memory so that each half warp accesses noncontiguous locations in shared memory and writes contiguous data to odata. each element in a block should be accessed by different threads. A __synchthreads() call is used to ensure that all reads from idata to shared memory have completed before writes from shared memory to odata commence.
Tiling is a method in which the larger matrix is divided into smaller matrices and the computation is performed using data level parallelism. It is important to divide the larger matrix into appropriate number of tiles for optimal utilization of streaming Multi- processors.
Shared memory of 16kB is divided into 16 banks each consisting of a 32-bit word. Simultaneous access of the same bank can cause bank conflicts. To avoid this we introduce an additional column into the matrix. This is called padding as an extra column is padded. The shared memory array is sized to (BLOCK_DIM+1)*BLOCK_DIM. This pads each row of the 2D block in shared memory so that bank conflicts do not occur when threads address the array column-wise.
Every application has been executed for matrix size ranging from 16 x 16 to 1024 x 1024. I have experimented with block sizes 4 x 4 to 32 x 32. I used timing functions of the cutil library to measure and compare the output performance of all the applications for CPU and GPU execution times. The performance of each of the GPU based application has been compared with other GPU based applications and also the CPU serial implementation. It is found that CPU implementation was better for smaller matrices than GPU implementation. This is due to the fact that for smaller input sizes, the memory transfer from CPU to GPU becomes an overhead and parallel nature of threads cannot be optimally utilized.
The matrix multiplication applications have been implemented on GeForce GTX 480 NVIDIA GPU Card. It is a CUDA Capability 2.0 device. The device has 480 scalar cores, 15 Streaming Multiprocessors each constituting 32 cores, Global memory of 1535 MB, shared memory per block of 49KB, total number registers per block equal to 32768. The warp size is 32, and the maximum number of threads per block is 1024.
Matrix multiplication implementation:
The Speed up obtained by performing the code in the GPU is demonstrated in the Line graph above. From the graph it can be inferred that the GPU operation certainly lead to a very large speed up as the size of the Matrices increases. Therefore using GPU for large computations is more desirable.
However, multiple global memory accesses increases the computation time and degrades the performance. The amount of computation is 2 x M x N x K flop (2 x (width) ^ 3 for square matrices) and the amount of global memory access is 2 x M x N x K word (2 x (width) ^3 for square matrices). Moreover, the computation to memory ratio is approximately ¼ (flop/byte). Therefore, the naÃ¯ve implementation is bandwidth bounded. Due to the above constraints naÃ¯ve matrix multiplication is not a feasible solution.
Unlike the naÃ¯ve implementation, the tiled implementation uses shared memory in streaming multiprocessors of the GPU. Using the shared memory speeds up the operation as it reduces the number of global memory accesses. The tiled implementation was done using two different block sizes (16 and 32) in order to measure the performance of the GPU devices.
Below graph shows the tiled implementation for block size 16. One thread block loads one tile of input matrix A and B from the Global memory to the Shared memory per iteration. Results are loaded in a temporary register. Once the iterations are complete, one tile of C is stored back into the global memory. Threads cooperate to load a block of A & B into on-chip shared memory.
The tile size must be appropriate to the matrix size such that all the SMs have equal amount of operations to perform. Below graph is a comparison between tiled and naÃ¯ve implementation. It can be inferred from the graph that the execution time decreases for tiled implementation.
Al though the execution time decreases the numbers of computations remain the same as naÃ¯ve implementation. This is because the tiled implementation still suffers from DRAM memory transactions due to non-coalesced memory.
From the above graph we can see that coalesced approach has an improved performance when compared to tiled approach. This is due to the reduced number of accesses to the global memory.
Avoiding Bank Conflicts:
As shown in the graph coalesced matrix multiplication without bank conflicts has a better GPU performance. This is be implemented only when each of the threads in the half warp access a different bank of shared memory.
Prefetching the tile of data:
In this algorithm a tile of data is prefetched while other tile of data is being executed. At the end of the data is swapped using prefetch pointers. Loop unrolling is also used to further improve the performance.
Below is the graph of performance between the earlier approach and the prefetch approach for different block sizes.
Unrolling improves performance by reducing the number of branches encountered and providing more sequential code that can be executed faster.
The number of useful floating point operations are improved as the sharing of matrices B and C are not required. Each thread stores one element of matrix B and one column of C in the register.
Comparison of CPU and GPU performance:
Below graph shows the execution times of CPU and naÃ¯ve GPU implementation. There is a tremendous improvement in the performance. The GPU processes data faster compared to the CPU due to its multi-core structure.