# Investigating the Computation Performance of GPU

This report describes the features of the OpenCL framework and uses the matrix multiplication problem to investigate the computation performance of a GPU. The resulting implementation achieves a speedup of up to 100 times. Conclusions are drawn from the test results, which reflect the effects of using shared memory, different data types, and different work-group sizes.

Introduction

Motivation

GPUs are many-core processors capable of handling tremendous computation and data throughput. Despite this power, the GPU was in the past used only for computer graphics and was difficult to program. Today, driven by market demand, it has evolved into a general-purpose parallel processor which, with the support of accessible programming interfaces and industry-standard languages, can tackle any problem that can be solved by stream processing. [1] For its cost, the GPU is probably the most powerful computational hardware available today, so its use is becoming more and more popular. The practice is called General-Purpose Computation on Graphics Processing Units, or GPGPU. [2]

OpenCL (Open Computing Language) is one of the GPU programming interfaces available on the market, and among them only OpenCL is an open standard and platform independent. As its name suggests, it is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. It provides parallel computing using task parallelism and data parallelism. It includes a language based on C99 for writing kernels, together with APIs for defining and controlling the platform. [3] There are also C++ bindings for easier programming.

As a result, migrating algorithms to the GPU is not as difficult as it used to be. Nonetheless, not every algorithm can be migrated. Even if an algorithm can be executed on the GPU, it may not be optimized, in which case the effort is wasted. We must choose an algorithm that can be parallelized and redesign it specifically for the GPU in order to achieve a worthwhile speedup. Problems that fit the output data decomposition model usually map fairly easily to data-parallel hardware. [7]

Matrix multiplication is a suitable example of such a problem. It illustrates how the data-parallel approach of OpenCL obtains speedups on GPUs. [6]

Objective

In this report, the matrix multiplication problem is investigated in order to derive optimization strategies for GPU programming. The algorithm is redesigned to take advantage of the shared memory and is tested with different data types and work-group sizes.

Project Outline

To introduce the OpenCL framework and present the differences between GPU programming and CPU programming.

To describe the matrix multiplication problem and the GPU solution to it.

To investigate the computation efficiency on GPU with different features. Results are illustrated and discussed.

To conclude how acceleration on GPU can be optimized and summarize the optimization strategies.

Background and Related Work

GPU

Today's GPU is a highly parallel, multithreaded, many-core processor with high computational power and high memory bandwidth. The GPU is specialized for computation, so more of its transistors are devoted to data processing than to flow control and data caching, as illustrated in Figure 1. Because of this, GPU programming is quite different from typical CPU programming and requires special interfaces, which are introduced in the next section. It is important to know the GPU hardware architecture, since it determines how a program should be written and optimized. [6]

## Figure 1

The GPU used in this study is the NVIDIA GTX280. It consists of 30 streaming multiprocessors. Each streaming multiprocessor contains 8 stream processors, 1 double-precision unit, 2 special function units, 16 KB of shared memory and 64 KB of registers. [4, 5] The hardware architecture of the NVIDIA GTX280 is illustrated in Figure 2.

## Figure 2. Illustration of the hardware architecture of NVIDIA GTX280

OpenCL

OpenCL allows parallel computing and is platform independent. It includes a language based on C99 and a set of C++ bindings. It offers portable accelerated code, which is its main advantage over the other interfaces. It contains a set of functions that allow programmers to interact with the CPU host and the GPU device. The functions can be divided into four categories: Platform, Execution, Memory and Programming. The whole model can be summarized as follows.

First, a context is created, which contains all the information and data required to execute the OpenCL program.

Then, memory objects are created, which can be moved on and off the GPU devices.

Command queues allow the CPU host to initiate operations to be carried out by the GPU device.

Finally, kernels contain the code that the GPU devices execute. [7]

In this study, OpenCL is used as the programming interface.

Matrix Multiplication

The matrix multiplication problem is

$$P = AB$$

where the dimension of A is m x n and that of B is p x m, written as width x height. The dimensions must follow this rule: the width of A equals the height of B, as illustrated in Figure 3. The dimension of the result matrix P is thus p x n, and its elements are obtained by

$$P_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}$$

where $P_{ij}$ is the element of the result matrix P in row i and column j.

## Figure 3

In this report, only square matrices were investigated.

The code sample below shows the algorithm for solving the matrix multiplication problem. This is the algorithm usually used to solve the problem sequentially on a CPU.

    for (int h = 0; h < matrix_dimension; h++) {
        for (int k = 0; k < matrix_dimension; k++) {
            P[h][k] = 0;
            for (int m = 0; m < matrix_dimension; m++)
                P[h][k] = P[h][k] + A[h][m] * B[m][k];
        }
    }

The above-mentioned algorithm was modified and was then used to investigate the computation performance on GPU.

Overall Design

In order to execute the matrix multiplication on the GPU and obtain a speedup, the algorithm has to be redesigned to fit the GPU architecture. It is also modified for each test.

The following code sample is a simple implementation of matrix multiplication using OpenCL. [7]

    __kernel void Multiply(
        __global float* c, int Wa, int Wb,
        __global float* a, __global float* b) {
        // get global position in Y direction
        int row = get_global_id(1);
        // get global position in X direction
        int col = get_global_id(0);
        float sum = 0.0f;
        // calculate result of one element
        for (int i = 0; i < Wa; i++) {
            sum += a[row*Wa+i] * b[i*Wb+col];
        }
        c[row*Wb+col] = sum;
    }

In this case, each work-item reads one row of A and one column of B and computes the corresponding element of C. This is straightforward.
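To make the mapping from work-items to output elements concrete, the kernel above can be emulated on the CPU in plain C. This is only a sketch under stated assumptions: `multiply_work_item` and `run_ndrange` are hypothetical names, `Ha` (the height of A) is a parameter introduced here, and the two nested loops stand in for the `get_global_id(1)` and `get_global_id(0)` values that the OpenCL runtime would hand to each work-item in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the simple kernel for ONE work-item. The global IDs are passed
   in explicitly; on a GPU they would come from get_global_id(). */
static void multiply_work_item(float *c, int Wa, int Wb,
                               const float *a, const float *b,
                               int row, int col)
{
    float sum = 0.0f;
    for (int i = 0; i < Wa; i++) {
        sum += a[row * Wa + i] * b[i * Wb + col];
    }
    c[row * Wb + col] = sum;
}

/* Sequential stand-in for launching a 2D range of Wb x Ha work-items.
   Ha is the height of A (hypothetical parameter name). */
static void run_ndrange(float *c, int Ha, int Wa, int Wb,
                        const float *a, const float *b)
{
    assert(c != NULL && a != NULL && b != NULL);
    for (int row = 0; row < Ha; row++) {        /* plays get_global_id(1) */
        for (int col = 0; col < Wb; col++) {    /* plays get_global_id(0) */
            multiply_work_item(c, Wa, Wb, a, b, row, col);
        }
    }
}
```

On the GPU the two loops disappear: every (row, col) pair is handled by its own work-item, which is why the problem maps so directly to data-parallel hardware.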

Another implementation of matrix multiplication using OpenCL makes use of the shared memory. [6]

In this implementation, each work-group computes one square sub-matrix Csub of C and each work-item within the work-group computes one element of Csub, as illustrated in Figure 4.

To fit the code to the GPU, the multiplier and multiplicand matrices are divided into square sub-matrices whose dimension equals the work-group size. Csub is then evaluated by summing the products of these square sub-matrices. Each product is obtained by first copying the two square sub-matrices from global memory to shared memory, with each work-item loading one element of each sub-matrix. Then each work-item computes one element of the product and accumulates it into its running result. Once all products have been accumulated, the results are written back to global memory for the host to access. [6]
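The steps described above can be sketched on the CPU in plain C. This is not the kernel used in this study, only an illustration under stated assumptions: the names `TILE` and `tiled_multiply` are hypothetical, the matrices are square and row-major with a dimension that is a multiple of `TILE`, the local arrays `As` and `Bs` stand in for shared memory, and the loops over `ty` and `tx` stand in for the work-items of one work-group.

```c
#include <assert.h>

#define TILE 4  /* hypothetical tile width; plays the role of the work-group dimension */

/* Computes c = a * b for square n x n row-major matrices, one tile of c at a time. */
static void tiled_multiply(float *c, const float *a, const float *b, int n)
{
    assert(n > 0 && n % TILE == 0);

    for (int by = 0; by < n; by += TILE) {          /* one "work-group" per tile of c */
        for (int bx = 0; bx < n; bx += TILE) {
            float acc[TILE][TILE] = {{0.0f}};       /* one accumulator per "work-item" */

            for (int t = 0; t < n; t += TILE) {     /* walk the tiles of a and b */
                float As[TILE][TILE];               /* stands in for shared memory */
                float Bs[TILE][TILE];

                /* Each "work-item" (ty, tx) loads one element of each tile. */
                for (int ty = 0; ty < TILE; ty++) {
                    for (int tx = 0; tx < TILE; tx++) {
                        As[ty][tx] = a[(by + ty) * n + (t + tx)];
                        Bs[ty][tx] = b[(t + ty) * n + (bx + tx)];
                    }
                }

                /* On a GPU a work-group barrier would be needed here. */
                for (int ty = 0; ty < TILE; ty++) {
                    for (int tx = 0; tx < TILE; tx++) {
                        for (int k = 0; k < TILE; k++) {
                            acc[ty][tx] += As[ty][k] * Bs[k][tx];
                        }
                    }
                }
            }

            /* Write the finished tile back to "global memory". */
            for (int ty = 0; ty < TILE; ty++) {
                for (int tx = 0; tx < TILE; tx++) {
                    c[(by + ty) * n + (bx + tx)] = acc[ty][tx];
                }
            }
        }
    }
}
```

The point of the structure is that every element copied into `As` or `Bs` is reused `TILE` times before the next pair of tiles is fetched.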

In this way, the fast shared memory saves a lot of global memory bandwidth, since A is read only (B.width / block_size) times from global memory and B is read (A.height / block_size) times. In the original simple algorithm, A is read B.width times from global memory and B is read A.height times. The improvement is remarkable.
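As a worked example of these read counts, assume square N x N matrices and a tile width b. The symbols and the values N = 1024 and b = 16 are chosen here for illustration only and are not taken from the tests. Counting every element fetched from global memory and ignoring any caching:

```latex
\begin{align*}
\text{simple kernel:}\quad & N^2 \cdot 2N = 2N^3 = 2 \cdot 1024^3 = 2{,}147{,}483{,}648 \ \text{loads} \\
\text{tiled kernel:}\quad  & \frac{2N^3}{b} = \frac{2 \cdot 1024^3}{16} = 134{,}217{,}728 \ \text{loads}
\end{align*}
```

Under these assumptions the global memory traffic drops by a factor of b, which is 16 in this example.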

## Figure 4

Testing and Evaluation

In the following, several tests and results are presented concerning the computation performance on GPU.

The computation efficiency is measured in GFLOPS (gigaflops). FLOPS stands for floating-point operations per second, and one GFLOPS equals one billion floating-point operations per second. This unit is commonly used because it is a good representation of performance.
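A GFLOPS figure for matrix multiplication could be derived from a measured run time roughly as follows. This is a minimal sketch: `matmul_gflops` is a hypothetical helper, and it assumes the usual operation count of 2 * n^3 for an n x n multiplication (one multiply and one add per inner-loop iteration), which is a convention adopted here and not something stated in the test setup.

```c
#include <assert.h>

/* Returns the GFLOPS achieved by an n x n times n x n multiplication
   that took `seconds` to run. Assumes 2 * n^3 floating-point operations. */
static double matmul_gflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    double ops = 2.0 * (double)n * (double)n * (double)n;
    return ops / seconds / 1e9;
}
```

Halving the run time for the same matrix size doubles the reported GFLOPS, so the unit allows runs with different matrix sizes to be compared on one scale.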

Speedup on GPU

The computation efficiency on GPU is compared with that on CPU.

Result

## Figure 4

Analysis and Discussion

As shown in Figure 4, the speedup on GPU is remarkable, reaching up to 100 times. The acceleration depends mainly on the data type used. Apparently some data types benefit greatly from the hardware resources, which leads to a substantial computation enhancement.

Copying data to and from the GPU can be costly, so it cannot be ignored. However, as the graph shows, the acceleration is still significant when data copying is taken into account, and the ranking of the data types is unaffected. This is because the data copying time is constant, so its effect on the GPU speedup is small.
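The two ways of reporting the speedup, with and without data copying, differ only in whether the transfer time is added to the GPU side. A small sketch, where `gpu_speedup` and its parameter names are hypothetical and no measured values from this study are assumed:

```c
#include <assert.h>

/* Speedup of the GPU over the CPU for one run.
   copy_s is the time spent copying data to and from the device;
   pass 0.0 to obtain the kernel-only speedup. */
static double gpu_speedup(double cpu_s, double kernel_s, double copy_s)
{
    assert(cpu_s > 0.0 && kernel_s > 0.0 && copy_s >= 0.0);
    return cpu_s / (kernel_s + copy_s);
}
```

Because the copy time only enlarges the denominator, the speedup including copying can never exceed the kernel-only speedup for the same run.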

Computation efficiency on GPU with different data types

This test is performed with 3 common data types: integer, float, and double.

Result

## Figure 5

Analysis and Discussion

As shown in Figure 5, the best efficiency is obtained with float and the worst with double. The difference is significant. Hence, selecting a suitable data type is an important issue in GPU programming. If accuracy is not critical, float is the best choice.

Computation efficiency on GPU with shared memory

The GPU has two different types of memory: global memory and local memory. Shared memory is one form of local memory. The two algorithms discussed in the previous section are executed on the GPU to see whether shared memory brings any improvement.

Result

## Figure 6

Analysis and Discussion

The graph shows that shared memory is very fast and greatly enhances the acceleration. This is because shared memory is on-chip, closer to the work-groups, so it is much faster than global memory.

Computation efficiency on GPU with different work-group size

We normally define the work-group size at the beginning of the code.

    // Thread block size
    #define BLOCK_SIZE 16

In this test, different sizes are tested to see whether they affect the acceleration.

Result

## Figure 7

Analysis and Discussion

As shown in the graph, the optimal work-group size is 16. This is because the number of shared memory banks is 16. [8] When the work-group size is 16, multiple accesses to the same memory bank are avoided and bank conflicts are unlikely, so execution is faster.
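The bank arithmetic behind this argument can be sketched in C. The sketch assumes that successive 32-bit words of shared memory are assigned to successive banks and that 16 work-items access shared memory together. Both are assumptions about the hardware made for illustration, `max_bank_load` and `stride` are hypothetical names, and broadcast reads of a single word are not modelled.

```c
#include <assert.h>

#define NUM_BANKS 16  /* number of shared memory banks stated in the text */

/* Work-item t reads the 32-bit word at index t * stride.
   Returns the largest number of work-items that fall into the same bank;
   1 means the access is free of bank conflicts. */
static int max_bank_load(int threads, int stride)
{
    int load[NUM_BANKS] = {0};
    int worst = 0;

    assert(threads > 0 && stride > 0);
    for (int t = 0; t < threads; t++) {
        int bank = (t * stride) % NUM_BANKS;   /* assumed word-to-bank mapping */
        load[bank]++;
        if (load[bank] > worst) {
            worst = load[bank];
        }
    }
    return worst;
}
```

With 16 work-items reading consecutive words (a stride of 1), each work-item lands in its own bank, while a stride of 16 sends all of them to the same bank.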

Conclusion and Future Work

Difficulties

OpenCL is a powerful interface, yet it is very new and very few resources are available. When difficulties arise while migrating code to the GPU using OpenCL, it is hard to find an answer.

Conclusion

From the analysis and discussion above, it can be concluded that the following points are important for optimizing a GPU program.

Maximize accesses to shared memory and minimize accesses to global memory.

Use the optimal work-group size of 16.

Select a suitable data type. If accuracy is not critical, float is the best choice.