# Accelerating Code On Gpus Using Cuda Computer Science Essay

**Published:** **Last Edited:**

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

The introduction of compute unified device architecture programming model simplified the programmability of graphics processing units allowing to utilize their data-parallel, multithreaded-cores for remarkable speed ups. This has lead to a dramatic increase in the use of GPUs for science and engineering applications. While the number of cores on GPUs is relatively large compared to the multi-core CPUs, achieving its maximum theoretical performance requires understanding of the architecture and the memory organization in addition to modifying the algorithms to suitably use them. This report summarizes the contributions of the three papers and briefly discusses potential future work. Particularly, Che et al. [2008] discusses a subset of applications from Berkeley's dwarf taxonomy and optimization techniques used to achieve maximum speed up on the NVIDIA GeForce GTX 260. Stone et al. [2008] present an advanced magnetic resonance imaging (MRI) reconstruction algorithm achieving significant speedup on a NVIDIA Quadro FX 5600 while exhibiting lower error than conventional reconstruction algorithms. Li et al. [2013] present two different k-Means algorithms, one each for low dimensional datasets and high dimensional datasets, and describe the optimizations techniques used to improve the speedup compared to the state-of-the-art algorithms for GPUs. The different approaches to optimizing the algorithms used in Che et al. [2008], Stone et al. [2008] and Li et al. [2013] are described and shortcomings of one over other are briefly discussed and potential improvements and directions for future research are proposed.

## Introduction

In the past decade, the limitations of single-core processors such as the size of transistors, power consumption, and instruction level parallelism have lead to adapting multi-core processors. Although the multi-core architecture existed in the form of graphics processing units (GPUs) much before the development of multi-core processors, their use for general purpose applications was limited by the absence of arithmetic and logical functional units and also due to difficulty in programming using APIs like OpenGL and Cg. The introduction of Compute Unified Device Architecture (CUDA) by nVidia for their newer generation of GPUs allowed for easier programmability and paved way for wide adaptation by scientific and engineering applications community. Furthermore, the CUDA programming model is an extension to the high-level programming languages like C, C++ and FORTRAN which makes it learning relatively easy. In spite of the advantages of CUDA programming model, it is required to understand the underlying architecture of the GPUs and make optimizations to the algorithms to utilize the full computational potential of hundreds of processing cores, and high memory bandwidth of GPUs. In this report, the optimizations and changes required to be made to different classes of algorithms are examined.

## Overview of the GPU architecture

Traditionally, the graphics processing units (GPUs) were specialized integrated circuits designed to generate images for display. The processing cores mainly consisted of two fixed function pipelines - vertex processors to execute vertex shader programs and pixel fragment processors to execute pixel shader programs (Lindholm et al. [2008]). The potential to compute large number of graphics operations in parallel and the limitations of single core microprocessors led to the evolution in design of GPU cores to perform general computations like arithmetic and logical operations. The unification of graphics and computing architecture came to be known as unified device architecture. With the introduction of Unified Device Architecture and CUDA programming model by NVIDIA, the programmability of GPUs simplified and their use for general scientific and engineering applications increased dramatically.

The first of NVIDIA's unified device architecture is the Tesla architecture which is available in a scalable family of GPUs for laptops, desktops, workstations and servers. Figure 1 shows a block diagram of Tesla C1060 GPU with 30 streaming multiprocessors (SM) each consisting of 8 processing elements, called stream processors (SP). To maximize the number of processing elements that can be accommodated within the GPU die, these 8 SPs operate in SIMD fashion under the control of a single instruction sequencer. The threads in a thread block (up to 512) are time-sliced onto these 8 SPs in groups of 32 called warps. Each warp of 32 threads operates in lockstep and these 32 threads are quad-pumped on the 8 SPs. Multithreading is then achieved through a hardware thread scheduler in each SM. Every cycle this scheduler selects the next warp to execute. Divergent threads are handled using hardware masking until they reconverge. Different warps in a thread block need not operate in lockstep, but if threads within a warp follow divergent paths, only threads on the same path can be executed simultaneously. In the worst case, if all 32 threads in a warp follow different paths without reconverging effectively resulting in a sequential execution of the threads across the warp - a 32x penalty will be incurred. Unlike vector forms of SIMD, Tesla's architecture preserves a scalar programming model, like the Illiac or Maspar architectures; for correctness the programmer need not be aware of the SIMD nature of the hardware, although optimizing to minimize SIMD divergence will certainly benefit performance.

C:\Users\Ramaprasad\Documents\Personal\Univ\PhD\Written exam\Figures in the report\Fig1-1.tif

Figure 1: Tesla unified graphics and computing GPU architecture. SM: streaming multiprocessor; SP: streaming processor; SFU: special functional unit. Araya-Polo et al. [2011].

When a kernel is launched, the driver notifies the GPU's work distributor of the kernel's starting PC and its grid configuration. As soon as an SM has sufficient thread and per-block shared memory (PBSM) resources to accommodate a new thread block, a hardware scheduler randomly assigns a new thread block and the SM's hardware controller initializes the state for all threads (up to 512) in that thread block.

The Tesla architecture is designed to support workloads with relatively little temporal data locality and only very localized data reuse. As a consequence, it does not provide large hardware caches which are shared among multiple cores, as is the case on modern CPUs. In fact, there is no cache in the conventional sense: variables that do not fit in a thread's register file are spilled to global memory. Instead, in addition to the PBSM, each SM has two small, private data caches, both of which only hold read-only data: the texture cache and the constant cache. (The name texture comes from 3D graphics, where images which are mapped onto polygons are called textures.) Data structures must be explicitly allocated into the PBSM, constant, and texture memory spaces.

The texture cache allows arbitrary access patterns at full performance. It is useful for achieving maximum performance on coalesced access patterns with arbitrary offsets. The constant cache is optimized for broadcasting values to all PEs in an SM and performance degrades linearly if PEs request multiple addresses in a given cycle. This limitation makes it primarily useful for small data structures which are accessed in a uniform manner by many threads in a warp.

## Performance and optimization study of general purpose applications - Che et al. [2008]

Che et al. [2008] evaluated algorithms from five dwarves, each of which capture a body of related problems. The dwarves are a benchmark to test the effectiveness of parallel architectures. The applications from the dwarves were implemented in CUDA to be run on GPU and OpenMP to be run on a multi-core CPU. They limited their test to five dwarves: Structured Grid, Unstructured Grid, Combinational Logic, Dynamic Programming, and Dense Linear Algebra. They observed different parallelism and data-sharing characteristics, and thus take advantage of the GPU's parallel computing resources to different degrees. The level of programmer effort required to achieve satisfactory performance also varies widely across different applications. The application access patterns range from bit-level parallelism to row-level parallelism. The optimizations, modifications done on applications and some of the limitations of GPU architecture for further performance improvement are discussed here.

Structured grid applications are at the core of many scientific computations. Some notable examples include Lattice Boltzmann hydrodynamics and Cactus. In these applications, the computation is regionally divided into sub-blocks with high spatial locality, and updating an individual data element depends on a number of neighboring elements. A major challenge of these applications comes from dealing with the elements that lie at the boundary between two sub-blocks. One of the structure grid application examined by Che et al. [2008] is Speckle Reducing Anisotropic Diffusion (SRAD).

SRAD is a diffusion method for ultrasonic and radar imaging applications based on partial differential equations. It is used to remove locally correlated noise, known as speckles, without destroying important image features. The CUDA implementation of SRAD is composed of three kernels. In each grid update step, the first kernel performs a reduction in order to calculate a reference value based on the mean and variance of a user specified image region which defines the speckle. Using this value, the second kernel updates each data element using the values of its cardinal neighbors. The third kernel updates each data element of the result grid of the second kernel using the element's north and west neighbors. The application iterates over these three kernels, with more iterations producing an increasingly smooth image.

For this application, CUDA's domain-based programming model is a good match. The whole computational domain can be thought of as a 2D matrix which is indexed using conceptual 2D block and thread indices. Also, since the application involves computation over sets of neighboring points, it can take advantage of the on-chip shared memory. While accessing global memory takes 400-600 cycles, accessing shared memory is as fast as accessing a register, assuming that no bank conflicts occur. Thus, for improved performance, the application prefetches data into shared memory before starting the computation.

For a 2048 Ã- 2048 input, the CUDA version achieves a 17Ã- speedup over the single-threaded CPU version and a 5Ã- speedup over the four-threaded CPU version. For this application, the development of the CUDA version was found to be more difficult than the development of the OpenMP version. This was chiefly due to the need to explicitly move data and deal with the GPU's heterogeneous memory model in the CUDA version, whereas the OpenMP version was simply an extension of the single-threaded CPU version using compiler pragmas to parallelize the for loops.

In structured grid applications, the connectivity of each data element is implicitly defined by its location in the grid. In unstructured grid applications, however, connectivity of neighboring points must be made explicit; thus, updating a point typically involves first determining all of its neighboring points. Che et al. [2008] chose Back Propagation from the structured grid dwarf, which is a machine-learning algorithm that trains the weights of connecting nodes on a layered neural network. The application is comprised of two phases: the Forward Phase, in which the activations are propagated from the input to the output layer, and the Backward Phase, in which the error between the observed and requested values in the output layer is propagated backwards to adjust the weights and bias values. In each layer, the processing of all the nodes can be done in parallel.

The data structure required by back propagation was implemented in a way that can take advantage of CUDA's domain-based programming model. For each two adjacent layers, a weight matrix was created, with each data element representing the weight of the corresponding column of the input layer and corresponding row in the output layer. In this way, the whole computational domain was partitioned in a straightforward manner.

For a network containing 65,536 input nodes, their CUDA implementation achieved more than a 8Ã- speedup over the single-threaded CPU version and approximately a 3.5Ã- speedup over the four-threaded CPU version. However, efficiently implementing multithreaded parallel reductions is non-intuitive for programmers who are accustomed to sequential programming. For such programmers - the vast majority of them - the OpenMP reduction is much simpler since it is a relatively straightforward extension of the sequential CPU reduction using the OpenMP reduction pragma.

The combinational logic dwarf encompasses applications implemented with bit-level logic functions. Applications in this dwarf exhibit massive bit-level parallelism. DES is a classic encryption method using bit-wise operations. The DES algorithm encrypts and decrypts data in 64-bit blocks using a 64-bit key. It takes groups of 64-bit blocks of plaintext as input and produces groups of 64-bit blocks of ciphertext as output by performing a set of bit-level permutations, substitutions, and iterations.

While developing this application, they found that the entire DES algorithm is too large to fit in a single CUDA kernel. When compiling such a large kernel, the compiler runs out of registers to allocate. To reduce the register allocation pressure and allow the program to compile, the large kernel was divided into several smaller ones. However, because the data in shared memory is not persistent across different kernels, dividing the kernel results in the extra overhead of flushing data to global memory in one kernel and reading the data into shared memory again in the subsequent kernel.

The improved CUDA version significantly outperforms both of the CPU versions due to the massive parallelism that is exploited among different blocks. For an input size of 218, the CUDA version achieves a speedup of more than 37Ã- over the single-threaded CPU version and more than 12Ã- over the four-threaded CPU version.

A dynamic programming application solves an optimization problem by storing and reusing the results of its subproblem solutions. Needleman-Wunsch is a nonlinear global optimization method for DNA sequence alignments. The potential pairs of sequences are organized in a 2D matrix. In the first step, the algorithm fills the matrix from top left to bottom right, step-by-step. The optimum alignment is the pathway through the array with maximum score, where the score is the value of the maximum weighted path ending at that cell. Thus, the value of each data element depends on the values of its northwest-, north- and west-adjacent elements. In the second step, the maximum path is traced backward to deduce the optimal alignment. Their CUDA implementation only parallelizes the first step, the construction of the matrix. Naively, only a strip of diagonal elements can be processed in parallel. By assigning one thread to each data element, their first implementation exhibited poor performance because of the high overhead of accessing global memory. For a 4096 Ã- 4096 input, the CUDA version can achieve a 2.9Ã- speedup over the single-threaded CPU version. This limited speedup is due to the limited parallelism and the fact that this application is not inherently computationally intensive. Interestingly, unlike other applications, the four-threaded CPU version outperforms the CUDA version.

Data mining algorithms are classified into clustering, classification, and association rule mining, among others. Many data mining algorithms show a high degree of task parallelism or data parallelism. K-means is a clustering algorithm used extensively in data mining and elsewhere, important primarily for its simplicity. Our reference OpenMP implementation of k-means is provided by the Minebench suite. In k-means, a data object is comprised of several values, called features. By dividing a cluster of data objects into k sub-clusters, k-means represents all the data objects by the mean values or centroids of their respective sub-clusters. In a basic implementation, the initial cluster center for each sub-cluster is randomly chosen or derived from some heuristic. In each iteration, the algorithm associates each data object with its nearest center, based on some chosen distance metric. The new centroids are calculated by taking the mean of all the data objects within each sub-cluster respectively. The algorithm iterates until no data objects move from one sub-cluster to another.

In their CUDA implementation, the clusters of data objects are partitioned into thread blocks, with each thread associated with one data object. The task of searching for the nearest centroid to each data object is completely independent, as is the task of computing the new centroid for each cluster. Both of these steps can take advantage of the massive parallelism offered by the GPU. For a dataset with 819,200 elements, the CUDA version achieves a 72Ã- speedup compared to the single-threaded CPU version and a 35Ã- speedup compared to the four-threaded CPU version. Because the CPU is ultimately responsible for calculating new cluster centroids, the memory transfer overhead is relatively large. In each iteration, the data is copied to the GPU in order to compute the membership of each data object and perform part of the reduction, and then the partial reduction data is copied back to the CPU to complete the reduction.

## Anatomically constrained MRI reconstruction - Stone et al. [2008]

Stone et al. [2008] demonstrated the performance improvement obtained in magnetic resonance image (MRI) reconstruction algorithm in addition to the decrease in percent error of the reconstruction by almost three times compared to the conventional reconstruction techniques. They achieve a 21 times speedup of their reconstruction algorithm on NVIDIA's Quadro FX 5600 over a quad-code CPU. A particular point of interest is how different optimizations on the code, taking into account the GPU architecture, achieve significant speedups. The details of their reconstruction algorithm, its implementation using CUDA, and different types of optimizations on the algorithm done to improve its performance are discussed.

The Quadro FX 5600 used in their study is a graphics card with a G80 GPU. Some of the distinguishing features of G80 GPU include 16 streaming multiprocessors (SMs) each containing 8 streaming processors (SPs) and 8192 registers shared among all the threads assigned to it running in a SIMD (Single Instruction Multiple Data) fashion. Each processor core has an arithmetic unit and two special functional units (SFUs) to perform complex operations such as trigonometry functions with low latency. The Quadro has a 1.5 GB off-chip global memory and a 64 kB off-chip constant memory. In addition, each SM had 8 kB constant memory cache and 16 kB shared memory. The SFUs and the constant memory cache are particularly useful in MRI reconstruction algorithms and their utilization for accelerating the code is discussed in a later section. The peak theoretical performance that can be achieved on a G80 GPU is 388.8 GFLOPS and the global memory provides a peak bandwidth of 76.8 GB/s.

Magnetic resonance imaging is used for various applications particularly for non-invasive medical imaging and diagnosis of soft tissues of the body such as brain, heart and other organs. The two main phases in MR imaging are scanning and reconstruction. In the scan phase, the scanner samples data in spatial-frequency or Fourier transform domain along a predefined trajectory. The sampled data are used to generate 2D or 3D image in the reconstruction phase.

C:\Users\Ramaprasad\Documents\Personal\Univ\PhD\Written exam\Figures in the report\Fig2-2.tif

Figure: MRI sampling techniques. a) The scanner samples k-space on a Cartesian grid and reconstructs image via FFT. b) The scanner samples on non-Cartesian grid, then interpolates the data unto a uniform grid and reconstructs image using FFT. c) The scanner samples on non-Cartesian grid and directly reconstructs image using advanced reconstruction algorithm of Stone et al. [2008].

There are two different sampling trajectories used in MRI - sampling k-space on a uniform Cartesian grid and sampling in a non-Cartesian (spiral) trajectory. The advantage of sampling on Cartesian grid is that the samples obtained can be directly used to reconstruct image using fast Fourier transform (FFT) and so is computationally efficient. However, the scanning is slow and is sensitive to imaging artifacts. On the other hand, sampling in a non-Cartesian grid is faster and is less sensitive to imaging artifacts caused by non-ideal experimental conditions. Furthermore, it can accurately model image physics using statistically optimal image reconstructions by incorporating prior information in the form of anatomical constraints. Although, statistical image reconstruction is computationally expensive, they enable short scan times to yield high quality images.

The anatomically constrained reconstruction algorithm of Haldar et al. [2008] was implemented that finds solution to the following quasi-Bayesian estimation problem:

(1)

Here, is a vector containing voxel values for the reconstructed image, F is a matrix that models the imaging process, d is a vector of data samples, and W is a matrix that can incorporate prior information such as anatomical constraints which are derived from scans of a patient which reveal location of anatomical structures. The analytical solution to the reconstructed image is

(2)

The size of the matrix makes the direct inversion impractical for high resolution reconstructions. Instead a conjugate gradient (CG) algorithm is used to solve the matrix inversion. The three steps of computations in the advanced MRI reconstruction algorithm involves computing the in the first step followed by computing and finally using CG to iteratively solve for the matrix inversion of equation (2).

## Algorithm implementation and optimizations:

The first step in computing , which depends only on the scan trajectory, involves FFT operations of the voxel basis function. Since depends only on the scan trajectory and image size, it can be computed before scan occurs. In the second step, the vector is computed which depends on the scan trajectory and the scan data. Finally, the CG linear solver is used to reconstruct image iteratively. Because the first step can be performed before a scan, the second and third steps form the critical path in the reconstruction algorithm. Particularly, there is potential to accelerate computing on GPU because it contains substantial data-parallelism. The vector is defined as

(3)

In the above equation, km denotes the location of mth sample with M k-space sampling locations, xn denotes coordinates of the nth voxel with N voxel coordinates, and is the Fourier transform of the voxel basis function. Although the above equation requires using sin and cos functions and each nth element of depends on all the M k-space samples, the data-parallelism is observed in the independence between computing each of the N elements of . However the two major bottlenecks are the ratio of floating-point operations to memory access is observed to be 3:1 and the ratio of number of floating-point arithmetic to number of floating-point trigonometry operations is only 13:2.

In their CUDA code implementation, Stone et al. [2008] overcome the memory bottleneck due to small ratio of FP operations to memory access by tiling. The scan data is divided into many tiles each containing a subset of sample points. For each tile, CPU loads the corresponding subset of sample points directly into the GPUs constant memory enabling each CUDA thread to compute a partial sum for a single element of by iterating over all the sample points in the tile. This optimization was observed to significantly increase the ratio of FP operations to memory accesses. To reduce the stalls due to long-latency trigonometric operations, the special functional units (SFUs) of G80 were made use of. The trigonometric operations can be executed as low-latency instructions on SFUs by invoking the use_fast_math compiler option. There was no particular optimization done on the implementation of preconditioned conjugate gradient (PCG) algorithm except for implementing the FFT and IFFT operations using NVIDIA's CUDA CUFFT library and BLAS and sparse BLAS operations in CUDA.

The effects of Quadro's architectural features on performance and quality of reconstruction were quantified by implementing seven different versions of the algorithm for which are discussed briefly. GPU.Base is the base verion implementation without any optimizations to the CUDA code and is observed to be significantly slower than CPU.SP, the optimized, single-precision, quad-core implementation of . In GPU.Base, the inner loops are not unrolled. Because GPU.Base leverages neither the constant memory nor the shared memory, memory bandwidth and latency are significant performance bottlenecks. With a significant ratio of FP operations to memory accesses, and other performance bottlenecks, the kernel actually achieves only 7.0 GFLOPS, less than half of the 16.8 GFLOPS achieved by CPU.SP.

Relative to GPU.Base, GPU.RegAlloc decreases the time required to compute from 53.6 min to 34.2 min. In short, register allocating the voxel data increases the computation intensity (the ratio of FP operations to off-chip memory accesses) from 3:1 to 5:1. This substantial reduction in required off-chip memory bandwidth translates into increased performance.

By changing the layout of the scan data in global memory, GPU.Layout achieves an additional speedup of 16% over GPU.RegAlloc. The underlying causes of GPU.Layout's improved performance relative to GPU.RegAlloc are difficult to identify. GPU.Layout does require less overhead to marshal the data into struct-of-arrays format than GPU.RegAlloc requires to marshal the data into array-of-structs format, which may account for a small fraction of the improvement in performance.

GPU.ConstMem achieves speedup of 6.4x over GPU.Layout by placing each tile's scan data in constant memory rather than global memory. GPU.ConstMem therefore benefits from each SM's 8 kB constant memory cache. At 4.6 min and 82.8 GFLOPS, this version of is 4.9X faster than the optimized CPU version.

GPU.FastTrig achieves acceleration of nearly 4x over GPU.ConstMem by using the special functional units (SFUs) to compute each trigonometric operation as a single operation in hardware. When compiled without the use_fast_math compiler option, the algorithm uses implementations of sin and cos provided by an NVIDIA math library. Assuming that the library computes sin and cos using a five-element Taylor series, the trigonometric operations require 13 and 12 floating-point operations, respectively. By contrast, when compiled with the use_fast_math option, each sin or cos computation executes as a single floating-point operation on an SFU. The SFU achieves low latency at the expense of some accuracy.

C:\Users\Ramaprasad\Documents\Personal\Univ\PhD\Written exam\Figures in the report\Fig2-8.tif

Figure: Performance comparison of the seven versions of the algorithm to compute .

To determine the impact of experiment-driven code transformations, Stone et al. [2008] conducted an exhaustive search that varied the number of threads per block from 32 to 512 (by increments of 32), the tiling factor from 32 to 2048 (by powers of 2), and the loop unrolling factor from 1 to 8 (by powers of 2). All the previous configurations (GPU.Base - GPU.FastTrig) performed no loop unrolling and set both the number of threads per block and the tiling factor to 256. The exhaustive, experiment-driven search selected 128 threads per block, a tiling factor of 512, and a loop unrolling factor of 8. This configuration was implemented in GPU.Tune which increased the algorithm's performance by 47%, with the runtime decreasing to 49 s and the throughput increasing to 184 GFLOPS.

The final version GPU.Multi was not an optimization but demonstrated the use of multiple GPUs to achieve speedup. The voxels are divided into four distinct subsets, with one of four Quadros computing for each subset. This implementation decreases the time required to compute to 14.5 s and increases the throughput to over 600 GFLOPS. The acceleration is slightly sub-linear because the overheads (I/O, data marshaling, etc.) represent a significant fraction the time required to compute . With 's runtime reduced to just 14.5 s, Amdahl's law is beginning to assert itself.

## K-means algorithm and optimizations - Li et al. [ 2013]

Clustering is an unsupervised learning method in which the data is partitioned into clusters while satisfying an objective function. One of the simplest and popular methods of clustering is K-means algorithm and is used in various applications such as image analysis, data mining, bioinformatics and pattern recognition. Although the algorithm is computationally fast for small dimensional dataset, the running time increases significantly with dimensionality and also the size of the dataset. Fortunately, the algorithm is parallelizable and several implementations have been developed for cluster based computers, multi-core and multi-processor shared memory architectures and for GPUs. Because of the large number of processing cores available on GPUs compared to cluster computers and shared memory systems, Li et al. [2013] explore ways to accelerate K-means algorithm on GPUs and propose different optimizations for low-dimensional and high-dimensional datasets to utilize GPUs architectural features.

The GPU used is the NVIDIA GTX280 with 30 SMs each containing 8 SPs totaling 240 processor cores. The main features of the GTX280 used are the on-chip registers to accelerate low-dimensional and on-chip shared memory together with on-chip registers to accelerate high-dimensional K-means algorithm.

In the K-means algorithm, the aim is to cluster N d-dimensional data points {xi; 1<i<N} into K clusters {Cj; 1<j<K, K<N} by minimizing the within cluster sum of squares (WCSS)

(4)

In the above equation, is the mean of data points (also the centroid) of cluster Cj, and represents the Euclidean distance. In the algorithm, the first step of iteration involves assigning cluster memberships to data points with clusters having least distance to its centroid and in the second step the centroid for each cluster is computed based on the data membership. Before the first iteration, the centroids are initialized either randomly or deterministically. The two steps are repeated until there is no change in centroids or a predefined number of iterations are reached.

Each of the steps in an iteration of the above algorithm can potentially be parallelized however consecutive iterations cannot be parallelized since the centroids or data memberships depend from the values in previous iteration. The running time of the algorithm is O(NKd+NK+Nd). For small dimensional datasets, the running time can be approximated as O(NK) and for large dimensional datasets as O(NKd) ignoring other terms due to negligible contribution. [3] proposed different parallelization and optimization techniques for low and high dimensional data and compare the running times with three other state of the art GPU implementations: UV_k-Means, GPUMiner, and HP_k-Means. The design and optimizations along with the comparison of results are briefly discussed.

The first step in parallelization is assigning computation of data membership of each data point to a thread. A simple implementation would create as many number of threads as the data points which get loaded into on-chip registers to compute data membership. For low dimensional data, the data points can be accommodated onto the on-chip registers however for higher dimensional data the data points gets stored into shared memory which increases reading latency thus decreasing performance. To overcome this drawback, [3] use tiling. The computation of data membership can be viewed as a matrix multiplication of data[N][d] and centroid[d][K] to give the distance matrix dist[N][K]. A data point, n is assigned to cluster, k corresponding to the column with minimum value in the nth row of the dist matrix. This observation of matrix multiplication is made use of to compute data membership. In the tiling method, the data is loaded from global memory into shared memory tile by tile efficiently by adopting coalescing which accesses sixteen consecutive addresses for the thread in a half-warp to avoid bank conflict.

The second step in parallelization of the K-means algorithm is in the computation of the centroids. If the data membership is stored in an array cluster[n] where the nth entry is the cluster number to which the nth data point belongs to, computing new centroid for a cluster is to take the mean of all the data points belonging to that cluster. The computational complexity is O(Nd + Kd) and is difficult to parallelize. This is because if each data point is assigned to a thread, it will generate write conflict when adding the data belonging to particular cluster. On the other hand, if we assign computation of each centroid to a thread, the computing power of the GPU is barely utilized. [3] used the "divide and conquer" strategy in which the N clustered data points are divided into N/M groups, where M is a multiple of number of multiprocessors, and temporary centroids are computed. The temporary centroids, N' are further divided into N'/M groups and N'/M centroids are computed. The second step is repeated until the number of temporary centroids is less than M, upon which the final K centroids are computed by the CPU. Because of dividing data/centroids into groups, the number of write conflicts is decreased thus increasing the performance.

The optimizations discussed above are for low and high dimensional data with the assumption that the number of data points, N is within the limit that entire data set can be stored on GPUs on-board global memory. For large datasets which cannot fit in single GPU's memory, [3] propose the use of divide-and-merge method as follows: load the data set in chunks, compute temporary results and merge them into final results.

## Summary, drawbacks and potential improvements of selected literature

An interesting question that arises with the advent of GPU having hundreds of processing cores is if they will make the traditional CPU obsolete. I believe the answer is no. The reason is twofold - Firstly, the GPU processing cores are designed to perform a subset of arithmetic and logical operations. Second, the GPUs are designed as co-processors and the main task is initiated by CPU with a subtask dispatched to GPU. The data transfer time from CPU to GPU is still significant compared to computation time and hence the full computational performance of GPU can be utilized only for large data sets. For these reasons, the GPU will not make a CPU obsolete. However the technology is converging as observed in the latest trend in mobile and hand-held computers where the CPU and GPUs are present on the same die.

The speedup reported in the selected publications and in general by researchers do not report the data transfer time between CPU and GPU but report only the computation time on GPU and the resulting speedup achieved. In many cases the transfer time is so significant that the use of GPU for small datasets is computationally less advantageous than using CPU alone (Gregg and Hazelwood, [2011]). Since any application is initiated on a CPU, I believe it is always desirable to report the total application time and speedup improvement with use of GPU compared to without using GPU.

While some research compares performance improvement of their parallel algorithm implementation on GPU in terms of the order of speedup achieved over serial implementation on CPU, the ideal comparison is to compare the theoretical performance achievable in GFLOPS and performance in GFLOPS achieved for both GPUs and CPUs used in their study. This is because, the performance in GFLOPS is independent of architectural features such as number of cores, operating frequency.

In their parallelization of k-means algorithm for low dimensional data, [3] mention that they observed block size of 256 to give better performance based on experimentation with block sizes 32, 64, and 128. This is because the maximum number of threads allowed in a block in CUDA version 2.3 is 256 and was increased to 1024 in later versions. Thus, the code has to be optimized for each version of CUDA.

The limited on-board and on-chip memory available on GPUs limits them for applications dealing with very large datasets. This can be overcome by either algorithmic techniques such as 'divide and conquer' or by concurrent kernel launches and simultaneous data transfer and execution. However, modifying algorithms increases development time and the approach of transferring data from CPU to GPU while GPU is performing operations on other data may produce desirable speed up on a case by case basis.

To ease the burden of development of optimized parallel code and to support GPUs from different vendors, another open source programming platform known as OpenACC (Open Accelerator) is being developed in the lines of OpenMP for shared memory and MPI for distributed memory architectures. The OpenACC development is still in beta stage and is offered only by compilers from PGI (The Portland Group).