# Necessity For Gpu Acceleration Computer Science Essay

**Published:** **Last Edited:**

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

The development of GPU accelerated Neighborhood Dependent Component Feature Learning super resolution algorithm started with the MATLAB version obtained through collaboration between High Performance Computing Lab and Dr. Vijayan Asari at the University of Dayton. The MATLAB implementation was functionally sound but its performance fell well below than par. As such CPU only implementation written in high performance language C++ using Open Source Computer Vision library (OpenCV) was developed, allowing for critical performance profiling. Profiling results shows the bottlenecks that would affect the performance, as well as certain sections that would benefit from the GPU implementation. Once the OpenCV implementation was completed, development of GPU accelerated Super resolution algorithm was undertaken.

## Flow of algorithm:

The principle of NDCFL algorithm proposed by Moinul is shown in Figure 1. The low resolution color image is split into its red, green and blue components. For each color, the Fourier phase component was calculated to estimate the covariance matrix (steering) which is in turn substituted in the second order kernel function (SKR2) to estimate the unknown pixel. Once the unknown pixels are estimated, the red, green and blue components are merged to form a high resolution image of given order.

## Necessity for GPU acceleration:

In order to get good effect, for each pixel a 9X9 overlapping patch was created. There are so many such patches in an image and so the process of calculating covariance between low resolution patch and high resolution patch have to be executed for number of times. In each stage, many computationally intensive mathematical problems have to be solved until the exit condition is met. NDCFL algorithm takes 610.10 seconds to get 2360X2320 image from a 590X580 image. Even though the resolution of input image is not very large, it takes a lot of time. Therefore, it is necessary to accelerate the algorithm.

In the algorithm, an image is divided into three components and there is no data relation which brings thread level parallelism. Inside each thread the computation process is the same for each pixel and there is no dependency between computations. So, the algorithm is suitable for parallel computation.

Input Image

Steering

SKR2

Deconvolution

Steering

SKR2

Deconvolution

Steering

SKR2

Deconvolution

Red

Green

Blue

Phase calculation

Phase calculation

Phase calculation

Output Image

Figure 1: Program flow of NDCFL algorithm

## Platform of GPU:

Due to specialized rendering hardware, Graphical Processing Units (GPGPUs) are becoming more efficient for computationally intensive and highly parallel applications. GPU has increased bandwidth over CPU because majority of the transistors on the GPU are dedicated to data processing.

The Tesla C2070 is a Fermi based NVIDIA range of GPU, it features up to 448 Compute Unified Device Architecture (CUDA) cores. It has 14 streaming multiprocessor (SM) units and each unit contains 32 CUDA cores which makes a total of 448 CUDA cores. Once the kernel is launched by the host, single thread of a kernel is called work item and few threads are grouped into workgroups. A CUDA core takes care of a work item and a SM takes care of workgroup. The clock speed of GPU is 1.15GHz. It consists of 5375MB global memory which can be used for read/write by all threads in the kernel, 64KB constant memory and 49KB shared memory which is common for threads within a workgroup

## Design of GPU accelerated algorithm:

The computational complexity of each stage of the OpenCV implementation of NDCFL algorithm is measured to get a comprehensive idea of the system workload distribution. The benchmarking was done on Intel Xeon 650 processor which has a clock speed of 2.66 GHz, memory size 288 GB and maximum memory bandwidth of 32GB/s. Figure 2 shows the profiling of building blocks of the algorithm.

Figure 2: Complexity profiling of building blocks of NDCFL algorithm

For 590X580 image, the SKR2 stage was consuming 78 percent (473.492 sec) and the Steering stage was consuming 19 percent (115.247 sec) of the total execution time i.e. 610.10 sec. Going by the principle of Amdahl's law, we selected these two stages as candidates to be sped up.

## GPU Kernel:

## Steering Stage:

The processing is done for each pixel with the number of lightweight threads the CUDA device is able to launch for a kernel. An image is considered to be a 2 dimensional matrix and is split into several rectangular sections called blocks. For simplicity we divide the image into blocks of equal size. If the image has (X,Y) pixels and if the kernels thread size is (16,16), (X/16,Y/16) blocks are launched to get one thread per pixel. If (X,Y) are not factor of block dimensions i.e. not a factor of 16, during runtime we could get a segmentation error or a clipped output because we dint assign enough threads for processing the image. So, we use the following steps to calculate the number of blocks required for processing the image.

dim3 grid((height/BLOCK_DIM+1),(width/BLOCK_DIM+1));

dim3 threads(BLOCK_DIM, BLOCK_DIM);

The following lines are the most important lines in the kernel, each thread takes its index within its block as well as the index of its block within the grid and it translates into a unique (x,y) index within the image. This unique (x,y) index is used to calculate the location of data they need from the original matrix and to write data into the resultant matrix.

int x = blockDim.x * blockIdx.x + threadIdx.x;

int y = blockDim.y * blockIdx.y + threadIdx.y;

int offset = y + x*width;

Each pixel creates an overlapping patch of size 9-9, calculate real components and imaginary components of Fourier phase angle and normalize them to unity, calculate four variances (4-1 vector) taking all 9 pixels along horizontal, vertical and two diagonal directions about the center pixel, estimate four variances (4-1 vector) taking only the rth pixels along the same (respective) directions and finally calculates the covariance (2x2 matrix). In OpenCV implementation, the above process is carried out for each pixel by making use of for statement and there is no data relation in the above process. So, the above process is passed on to the GPU, to be processed in parallel. The process of steering kernel function is shown in next part.

## SKR2 Function:

As the low resolution image is processed pixel by pixel and the algorithm to find the high resolution image from low resolution counterpart is the same and there is no data relation. In the OpenCV implementation for an image size of 590X580, set of statements inside for loop are executed for 1.3 million times serially. By posting this for loop as GPU kernel, we can assign 1.3 million work item for 1.3 million tasks.

The statements inside for loop are very long and complex. The size of variables in each thread is very high and so a condition arises with memory allocation. Allocating and using memory rationally is very important in order to get good performance. GPU consist of three types of memory- global, shared/local and private memory. The shared and private memories are faster than the global memory but it is limited. Therefore the private memory is assigned to variables which is accessed often and takes less space and the global memory is assigned to variables which takes more space and accessed less often. The possibility of using shared memory in the SKR2 function is very meagre because of the huge size of variables used, most of the float array used in this kernel are allocated in global memory.

The frequent use of global memory reduces the speedup and its a bottleneck if there is not enough parallelism in the function. But we were able to achieve considerable speedup because of huge parallelism associated with the SKR2 function. Since it took 77 percent of the total runtime in the CPU only version any speedup will make a huge difference in the overall runtime.

## Multicore implementation:

A low resolution image is split into 3 channels (red, green, blue) and all processing explained earlier are done for each channel serially one after the other. GPU version of NDCFL algorithm just makes use of one CPU and one GPU. The hardware used is a cluster with 24 CPUs and 4 GPUs. This allows the GPU implementation of NDCFL algorithm to be modified to support multiple CPUs and multiple GPUs, so that all three channels can be processed in parallel, since there is no dependency between them. This is made feasible by making use of POSIX threads - a thread is created for each channel and it is posted to an individual core (i.e.) 3 channels to 3 different CPU cores, each CPU core in turn assigns a GPU for executing its GPU kernel. The total hardware utility for implementing the multicore version of NDCFL algorithm is 3 CPU core and 3 GPU cores.

## RESULTS:

After the completion of multicore and multiple GPU version of Neighborhood Dependent Component Feature Learning (super-resolution) algorithm, profiling results were collected to examine the performance of the algorithm. The result shows that the GPU accelerated version of algorithm provides higher performance than the original implementation. For comparison purpose, selected performance figures for the host version are included. Profiling was carried out on a PC with on Intel Xeon 650 processor, 192GB of DDR3 RAM and TESLA C2070 GPU.

For each high level processing function executed in GPU, large performance increases were observed. Steering function benefits the most from GPU because of high parallelism involved and less host-device memory transfer. Even though, SKR2 function achieves lesser speedup compared to steering function because of large number of memory transfer between host and device, it does have a huge impact on the overall runtime and speedup. Considering the complexity of the algorithm, the speedup achieved by using GPU is substantial. Table 1 shows the profiling results for 590x580 image and super resolution factor of 4.

Table 1: Profiling results of NDCFL algorithm

The approach taken was to enhance the algorithm using the CPU and the GPU. Certain SIMD functions execute really fast on the CPU, so only the functions that took a lot of CPU time was transferred to the GPU. The accelerated GPU kernel functions replaced the CPU functions and the resulting output was accurate and the quality of image was never compromised.

Figure 3: Performance analysis of NDCFL algorithm.

Figure 3 illustrates the performance analysis of NDCFL algorithm for various image sizes varying from 80X86 to 590X580. As noticed, as the image size goes larger the performance of the algorithm keeps increasing, compared to the CPU. Table 2 gives finer details of the execution time and speedup for various images.

Table 2: Execution time and Speedup of NDCFL algorithm

For a super-resolution factor of 4, input image of size 80X86 will give an output image of size 320X336 even though the size of image is not so big 4x speedup was achieved and for 590X580 image GPU implementation was 5.39 times faster than its CPU only counterpart.

Figure 4: Speedup gained by NDCFL Algorithm

## Steering Kernel:

Steering kernel consists of three parts - i) variance calculation, ii) Singlular Value Decomposition (SVD) and iii) Covariance matrix estimation. Table 3 shows the profiling results of the steering kernel for an upscaling factor of 4. Variance and covariance calculation benefitted the most from GPU, because of the use of shared memory instead of global memory. Variables used in variance and covariance estimation are allocated in shared memory since they were used often and smaller in size. Speedup of around 60x and 214x were achieved respectively. Due to the high complexity of the kernel and heavy use of global memory the observed performance of SVD function was not in par with other two processing stages.

Table 3: Profiling results of Steering function.

Initially when the steering function was implemented in CUDA, we had to copy a huge data structure from host to device and three data structures from device to host, to process the SVD function as it was implemented on the host. In order to minimize the number of data structures copied in and out of the device, SVD function is then implemented within the CUDA kernel so that the number of data structure copied to and from the device is minimized to one. This greatly increased the performance of the overall steering function and speedup of 10.81x is achieved over OpenCV implementation of NDCFL algorithm.

## SKR2 Function:

The Second Order Kernel Regression (SKR2) function, a critical component of the NDCFL algorithm was computed on the GPU using CUDA. This function was the most time consuming part in the OpenCV version of the algorithm. The SKR2 kernel is very complex and the variables used are very large data structures and so there is not enough shared memory available for the work groups. All the variables used are allocated in the global memory and since the access time of global memory is slow, the performance of the algorithm reduces. Even then a substantial speedup was achieved because of highly parallel nature of the function. Table 4 shows the profiling results of SKR2 function for a 590X580 image and super-resolution factor of 4.

Table 4: Profiling results of SKR2 Function.

The SKR2 kernel is divided into three sub functions for better understanding and comparison - i) estimation of weight matrix, ii) regression kernel evaluation and iii) estimation of unknown pixel. The performance of the regression kernel evaluation stage is low because it involves statements that require frequent access to global memory to find the inverse of matrix using gauss elimination method. As mentioned earlier, the access time of global memory is a bottleneck for GPU's performance. As noticed, the memory transfer time between host and the device and vice versa was considerable. The higher bandwidth of the GPU architecture ensures these memory transfers do not affect the performance of the algorithm

## Multicore:

The multicore and multiple GPU implementation of NDCFL algorithm was designed. Since 3 channels of color image are processed in parallel in three different cores, the execution time of the algorithm should reduce by 1/3. But in practical 3 times faster speed is not possible to achieve because of the data transfer time between core and between GPUs. The execution time of various images starting from 80X86 to 590X580 was noted down for original Matlab version and the multicore version. Table 5 shows the finer details about the execution time and speedup.

Table 5: Execution time and Speedup of multicore NDCFL algorithm over MATLAB

As noticed, after a certain image size execution time taken by MATLAB starts to increase exponentially, whereas in the multicore version the execution time is more stable and follows the usual speedup pattern of GPU. For 590X580 image, the multicore version 0f NDCFL algorithm is 51 times faster than its original implementation. Figure gives the plot of the image sizes versus the speedup attainted on the 3CPU+3GPU configuration over MATLAB.