GPUs in General-Purpose Computing


Over the past few years there has been a marked increase in the computational power of the GPU. Modern GPUs have become extremely flexible and powerful processors. They provide tremendous memory bandwidth and computational power, and they now also support double-precision floating-point operations. The rapid increase in the programmability and computational power of the GPU has attracted a research community. High-level languages have emerged to make use of the resources of the GPU, and researchers have achieved remarkable speedups over the CPU on some non-graphics problems. In this report we describe the background of GPU hardware architecture, the evolution of the GPU from a discrete design to a unified design, the characteristics of the GPU, the applicability of these characteristics to general-purpose computing, the software environments and languages available for GPU programming, algorithms and applications, and the future of GPU computing.






I would like to thank Mr. K.P.M.K. Silva and Dr. H.L. Premarathne, whose encouragement, guidance and support helped me greatly in accomplishing this task.


Finally, I offer my regards to everyone who helped me in any respect during the completion of the project.











\begin{tabular}{ l l }
CUDA & Compute Unified Device Architecture \\[0.5ex]
GPGPU & General-Purpose Computation on Graphics Processing Units \\[0.5ex]
GPU & Graphics Processing Unit \\[0.5ex]
SIMD & Single Instruction Multiple Data \\[0.5ex]
HPC & High-Performance Computing \\
\end{tabular}






\chapter{Introduction}

GPUs are specialised processors that were traditionally designed for a particular class of applications. Today GPUs are increasingly visible in scientific applications and high-performance computing. In June 2010 the world's fastest GPU-based supercomputer was announced. It is the highest-ranked GPU supercomputer built so far, and it has the highest theoretical peak performance in the world. It uses NVIDIA Tesla C2050 GPUs to achieve this computational power.

GPUs have now evolved into general-purpose programmable processors with great computational power and a high level of parallelism. The growth in the computational power of CPUs follows Moore's law. GPUs, however, outpace Moore's law and deliver roughly a $1000\times$ increase in computational power per decade.
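As a rough check on these growth figures, the following Python sketch converts a growth factor per decade into the equivalent factor per year. The 18-month doubling period used for Moore's law is an assumption made only for illustration; the text does not state one.

```python
def annual_factor(decade_factor):
    """Yearly growth factor equivalent to a given growth factor per decade."""
    return decade_factor ** (1.0 / 10.0)


def decade_factor_from_doubling(months_per_doubling):
    """Growth over ten years if performance doubles every given number of months."""
    return 2.0 ** (120.0 / months_per_doubling)


# Figure from the text: about 1000x per decade for GPUs.
gpu_annual = annual_factor(1000.0)

# Assumption for illustration: Moore's law read as a doubling every 18 months.
cpu_decade = decade_factor_from_doubling(18.0)

print(f"GPU growth per year: about {gpu_annual:.2f}x")
print(f"Growth per decade under the assumed doubling period: about {cpu_decade:.0f}x")
```

Under these assumptions, a thousandfold increase per decade corresponds to performance roughly doubling every year.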


GPUs are designed for a class of applications with the following characteristics.




\begin{itemize}

\item Computational requirements are large:

Rendering requires hundreds of operations per pixel, and billions of pixels are computed per second. The GPU must therefore deliver enormous computational performance to satisfy these demands.

\item Parallelism is substantial:

Vertex processing and fragment processing need a high level of parallelism. They are carried out on blocks of elements in the GPU.

\item Throughput is more important than latency:

The graphics pipeline is fairly deep, and thousands of primitives are in the pipeline at any given time. The design therefore emphasises throughput and hides latency.

\end{itemize}


Applications with the above characteristics can be successfully implemented on the GPU.

Chapter 2 discusses the graphics pipeline and the evolution of the GPU architecture: the basic steps of the graphics pipeline and the move from a discrete design to a unified design. Chapter 3 discusses the GPU programming model, the characteristics of the GPU, and the applicability of these characteristics to general-purpose computing. Chapter 4 discusses software environments for GPU computing, including the abilities and limitations of the different languages available. Chapter 5 discusses algorithms and applications: the basic algorithms that can be applied on the GPU and current application scenarios. Finally, Chapter 6 discusses future trends in GPU computing.



\chapter{GPU Architecture}

The GPU is a processor with huge computational resources. Until a few years ago the GPU was a fixed-function processor built only for graphics operations. The trend now is to extend the GPU into a parallel programmable processor on which real-world programs can be implemented.

\section{Graphics pipeline}

The graphics pipeline accepts a three-dimensional scene (typically described as triangles) as input and produces a 2D raster image as output. The pipeline consists of the following steps.





\begin{figure}[h]
\centering
\caption{Graphics pipeline}
\end{figure}





\begin{itemize}

\item Vertex processing:

\begin{description}
\item[Transformation:] Vertices are transformed and assembled into primitives (triangles) in this stage.
\item[Lighting:] Color values are calculated for all vertices depending on the properties of these primitives.
\item[Projection:] The 3D scene is converted into a 2D image space.
\end{description}

In vertex processing each vertex is processed individually. A typical scene has hundreds to thousands of vertices, and processing them applies the same instructions to different data. GPUs are stream processors that can run thousands of threads simultaneously. Since the same function is applied to different data sets, GPUs are well suited to this stage.

\item Rasterization:

Rasterization determines which screen-space pixel locations are covered by each triangle. Each triangle generates a fragment at each pixel location that it covers. Many triangles can overlap at any pixel location, so the final color of a pixel may have to be computed from several fragments.

\item Fragment processing:

Each fragment is shaded to determine its final color. Fragments are processed in parallel, and each output is written to one pixel location. GPUs are therefore as well suited to this stage as to vertex processing.

\end{itemize}


\section{Evolution of GPU architecture}

In the past, vertex processing and fragment processing were fixed-function pipelines and could not be programmed. The introduction of the shader model made GPUs programmable. The first shader model was introduced with DirectX 8 and OpenGL 1.5.

GPUs with fixed-function pipelines could not handle complex lighting and shading effects, so GPU architects began to replace the vertex-processing and fragment-processing stages with programmable units. User-defined programs could then be run on each vertex and fragment. Over time these vertex and fragment programs became fully featured, with rich instruction sets.

Historically, fragment processing and vertex processing were separate stages with separate instruction sets. The introduction of the unified shader model harmonized these instruction sets. This was a major innovation in the GPU industry, and it made GPUs more programmable and more efficient.





\begin{figure}[h]
\centering
\caption{Discrete design vs.\ unified design}
\end{figure}


After the introduction of the unified shader model, NVIDIA introduced the unified shader architecture. With the unified shader architecture, vertex processing and fragment processing become threads that run different programs on the same stream processors.

The right-hand side of Figure 2.2 depicts the design of the unified shader architecture. In the unified design there are no longer separate vertex, geometry, and pixel shader units; there are only shader cores. The same shader core is used for the different shader tasks (vertex shading, pixel shading and geometry shading). This design increases performance and efficiency by keeping the GPU's resources fully used.


\chapter{GPU Programming Model}

GPUs have programmable units that follow the Single Instruction Multiple Data (SIMD) model. Multiple data elements are processed at the same time using threads. Each data element is independent of the others, and the same program is applied to every element. Elements can read data from shared global memory (a gather), and modern GPUs also support writing data back to arbitrary locations (a scatter). In GPU operations the elements are grouped into blocks, and these blocks are processed by applying the same program code. Processing a large amount of data in one step delivers higher throughput.
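The gather, scatter and block-wise processing described above can be sketched in plain Python. This is a sequential emulation for illustration only: the function names are hypothetical, and on a real GPU the elements of a block would be processed in parallel.

```python
def gather(source, indices):
    """Gather: output element i reads source[indices[i]]."""
    return [source[j] for j in indices]


def scatter(values, indices, size, fill=0):
    """Scatter: input element i is written to output[indices[i]]."""
    output = [fill] * size
    for value, j in zip(values, indices):
        output[j] = value
    return output


def process_in_blocks(data, kernel, block_size):
    """Apply the same kernel to every element, one block of elements at a time.

    The inner loop stands in for the threads of a block, which a GPU
    would run simultaneously.
    """
    result = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        result.extend(kernel(x) for x in block)
    return result


squares = process_in_blocks([1, 2, 3, 4, 5], lambda x: x * x, block_size=2)
print(squares)
```

Because the kernel never looks at any element other than its own, the blocks could be processed in any order, which is what makes the model suitable for parallel hardware.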

The stream processing programming model suits a throughput-oriented processor architecture. Because there are no data dependencies between elements, there is less need for cache memory. Caches can therefore be kept small, which frees transistors for computation. In a CPU a high proportion of the transistors is used for caches, whereas in a GPU a high proportion is used for arithmetic logic units (ALUs). This is why GPUs deliver such high computational power.

In the past, GPGPU programs used the graphics APIs directly and had to be structured in terms of the graphics pipeline. Writing general-purpose programs against graphics APIs was a very complicated task, and the programmer needed a thorough knowledge of those APIs.

Modern programming environments provide a direct, non-graphics interface to the hardware and its programmable units.

The structure of the newer programming model is as follows.

The programmer defines the input domain as a grid, and the value of each grid element is computed by a thread. Threads perform math operations together with gathers and scatters: a gather reads from memory and a scatter writes to memory. In the newer programming model the same memory buffer can be used for both scatter and gather operations, which allows a more flexible programming style.
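A minimal Python sketch of this grid-of-threads model follows. It emulates a one-dimensional launch sequentially; the launch helper, the kernel signature and the choice of a scaled vector addition as the workload are all assumptions made for illustration and are not taken from any particular GPU API.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Run the kernel once per thread of a 1D grid (sequential emulation)."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def saxpy_kernel(block_idx, block_dim, thread_idx, a, x, y, n):
    """One thread: gather x[i] and y[i], compute, scatter the result into y[i]."""
    i = block_idx * block_dim + thread_idx  # global element index of this thread
    if i < n:                               # threads past the end do nothing
        y[i] = a * x[i] + y[i]              # y is both read and written


def run_saxpy(a, x, y, block_dim=4):
    """Cover all n elements with enough blocks and run the kernel in place."""
    n = len(x)
    grid_dim = (n + block_dim - 1) // block_dim  # round up to whole blocks
    launch(saxpy_kernel, grid_dim, block_dim, a, x, y, n)
    return y


print(run_saxpy(2.0, [1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]))
```

The buffer y is used for both the gather and the scatter, which mirrors the flexibility described above.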

This programming model removes the restrictions on data communication between elements and between kernels, and it provides direct access to the programmable units of the GPU. GPU programming has therefore become a simpler and easier task.

Because GPU programming has become easier, high-level programming languages have emerged for it, and these languages allow programmers to take full advantage of the GPU's powerful hardware.

\chapter{Software Environment}

In the past, GPGPU programming was done using graphics APIs. Initially people used fixed-function, graphics-specific units for GPGPU operations, but these units were a poor match for conventional programming languages. GPUs then gradually evolved into processors with programmable vertex and fragment units. With DirectX 9, higher-level shader programming became possible and several languages emerged, such as Cg, HLSL, and GLSL. Those languages, however, still express computation in terms of graphics primitives (such as textures and fragments).

Although GPUs were powerful processing resources, there was no suitable abstraction of the hardware. Programmers needed a higher-level language that was free of graphics terms.

A language called Brook was designed to abstract away the graphics terms by treating data as streams and computations as kernels. Brook provides stream access functionality. In Brook, kernels are mapped to fragment shader code and streams are mapped to textures. Data is transferred to and from the GPU via explicit read/write calls, which translate into texture updates and framebuffer readbacks.

Several languages emerged after Brook; Microsoft Accelerator, RapidMind and PeakStream are examples. All of these are third-party languages that were designed with the support of the GPU vendors. GPU vendors now also have their own GPU programming systems. ATI created the CTM (Close To Metal) programming interface for GPGPU computing, but with the introduction of OpenCL, ATI switched from CTM to OpenCL. NVIDIA introduced C for CUDA for its GPUs. Today OpenCL, NVIDIA CUDA and Microsoft DirectCompute are the main competitors in the GPGPU industry.

\subsection*{NVIDIA CUDA Architecture}

CUDA is a hardware and software architecture for performing computations on NVIDIA GPUs as data-parallel computing devices, without the need to map them to a graphics API. CUDA delivers two levels of parallelism: data parallelism and multithreading. The CUDA architecture has a memory hierarchy with multiple levels: per-thread registers, fast shared memory between the threads in a block, board memory, and host memory. A major design goal of CUDA is to run the parallel parts of a program on the GPU and the serial parts of the application on the CPU.






\begin{figure}[h]
\centering
\caption{NVIDIA CUDA architecture}
\end{figure}


The CUDA software development environment supports two different programming interfaces: a device-level programming interface and a language integration programming interface (see Figure 4.1).

The device-level programming interface provides three different ways of writing applications.

Using DirectX Compute, kernels can be written in HLSL. Using the OpenCL driver, kernels can be written in a C-like language called ``OpenCL C''. Using the CUDA driver API, applications can be written to access the CUDA driver directly. This offers a great deal of control over the hardware, but it is more complex than the other approaches because it deals with assembly code (PTX assembly).

The language integration interface lets programmers write applications for both the GPU and the CPU with more flexibility.

This interface is implemented on the ``C runtime for CUDA'', and developers use a small set of extensions to indicate which compute functions should be performed on the GPU instead of the CPU.

This approach helps to solve the problems of running real-world applications on GPUs. With the CUDA 3.0 API, developers can use Visual Studio as a development tool, since the CUDA 3 SDK added support for integration with Visual Studio.


\chapter{Algorithms and applications}

\section{Basic primitives}

There are four basic primitives when dealing with GPUs.

Scatter and gather: a gather reads from memory locations and a scatter writes to memory locations. Current GPU architectures allow unlimited reads and writes to arbitrary locations in memory.

Map: applies an operation to each element of a collection. The same operation is applied to many elements in parallel. Each pixel's fragment program fetches the element data, performs the operation, and writes the result to the output pixel.

Reduce: reduces a collection of values to a single value. A fragment program reads values from multiple texture locations, computes their sum, and writes it to an output pixel of another, smaller texture. The result is then bound as the input of the same fragment shader, and the operation is repeated until the output contains a single pixel.

Scan: takes an array $A$ of elements and returns an array $B$ of the same length. Each element $B[i]$ is the reduction of the sub-array $A[1 \ldots i]$.

GPU-based algorithms and applications make use of these primitives.
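The map, reduce and scan primitives can be illustrated with a short Python sketch. The function names are hypothetical and the code runs sequentially. The reduce follows the repeated-halving scheme described above and assumes an associative operator; the scan is a simple sequential reference rather than a parallel scan algorithm.

```python
def gpu_map(op, data):
    """Map: apply the same operation to every element independently."""
    return [op(x) for x in data]


def gpu_reduce(op, data):
    """Reduce by repeated passes, each combining neighbouring pairs.

    Every pass roughly halves the number of values, like the shrinking
    output texture described in the text. Assumes op is associative.
    """
    if not data:
        raise ValueError("cannot reduce an empty collection")
    values = list(data)
    while len(values) > 1:
        combined = [op(values[i], values[i + 1])
                    for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # odd length: carry the last value over
            combined.append(values[-1])
        values = combined
    return values[0]


def inclusive_scan(op, data):
    """Scan: element i of the result is the reduction of data[0..i]."""
    result = []
    for x in data:
        result.append(x if not result else op(result[-1], x))
    return result


add = lambda a, b: a + b
print(gpu_map(lambda x: 2 * x, [1, 2, 3]))
print(gpu_reduce(add, [1, 2, 3, 4, 5]))
print(inclusive_scan(add, [1, 2, 3, 4]))
```

Within one pass of the reduce, every pair is independent of the others, so a GPU can combine all pairs of a pass at the same time.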


Researchers have demonstrated many algorithms for which GPUs can be used effectively. Some of them are described below.

%GPU Researchers has rediscovered, adapted, and improved sorting algorithms.

Sorting: Bitonic merge sort is a parallel sorting algorithm that can be optimized for GPU computing using simple texture-mapping operations and effective use of memory bandwidth. GPUSort, GPU-Quicksort and Radix-Merge sort are some of the sorting algorithms to which GPU acceleration has been applied.
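To show why bitonic merge sort maps well onto parallel hardware, here is a minimal sequential Python sketch of the bitonic sorting network. It assumes the input length is a power of two and does not reproduce the texture-mapping details of any of the GPU implementations named above; the function name is hypothetical.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network.

    For a fixed (k, j) pair, every compare-and-swap in the inner loop
    touches a different pair of elements, so a GPU could run them all
    in parallel.
    """
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(values)
    k = 2
    while k <= n:                  # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:              # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:    # handle each pair once
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```

The sequence of comparisons depends only on the input length, not on the data, which is what lets the network be expressed as a fixed series of data-parallel passes.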

Search and database queries: Researchers have implemented several forms of GPU-based search algorithms (e.g.\ the Smith-Waterman algorithm and the k-nearest-neighbor algorithm). High-performance database operations have been implemented using fast GPU-based searching and sorting algorithms.

Differential equations: GPUs have been used to solve partial differential equations (e.g.\ the Navier-Stokes equations for incompressible fluid flow). The ``Modern Taylor Series Method'' (MTSM), a non-traditional method for solving differential equations, has also been implemented on GPUs.

Linear algebra: Linear algebra on graphics hardware is built upon efficient vector and matrix operations. Vector-vector and matrix-vector operations are implemented as fragment programs on the GPU. Using these operations, implicit solvers for systems of algebraic equations can be implemented.
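The way a solver can be assembled from vector-vector and matrix-vector operations can be sketched in Python. Jacobi iteration is used here only as a simple stand-in; the text does not name a specific solver, and the function names are hypothetical. The code is sequential, but each row of the matrix-vector product is independent and could be computed by its own thread.

```python
def dot(u, v):
    """Vector-vector operation: inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def mat_vec(matrix, vector):
    """Matrix-vector operation: one independent dot product per row."""
    return [dot(row, vector) for row in matrix]


def jacobi_step(matrix, rhs, x):
    """One Jacobi update: x_i <- x_i + (b_i - (A x)_i) / a_ii."""
    ax = mat_vec(matrix, x)
    return [x[i] + (rhs[i] - ax[i]) / matrix[i][i] for i in range(len(x))]


def jacobi_solve(matrix, rhs, iterations=60):
    """Repeat the update a fixed number of times, starting from zero.

    Convergence is assumed here; it holds, for example, for strictly
    diagonally dominant matrices.
    """
    x = [0.0] * len(rhs)
    for _ in range(iterations):
        x = jacobi_step(matrix, rhs, x)
    return x


# Small diagonally dominant system: 4x + y = 1, 2x + 3y = 2.
print(jacobi_solve([[4.0, 1.0], [2.0, 3.0]], [1.0, 2.0]))
```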

\section{Application Scenarios}

The major application scenario for GPU computing is scientific computing, which needs large computational power. GPUs are well suited to problems with high arithmetic intensity and data parallelism.


Tianhe-I is a supercomputer built by China that combines Intel Xeon processors with 5120 GPUs (ATI Radeon HD 4870 X2). It ranks as the world's 5th fastest supercomputer in the TOP500 list. Tianhe-I achieved 563 teraflops in tests, but in theory it is capable of delivering 1.206 petaflops. It is used for petroleum exploration and aircraft simulation. GPUs also provide cost and energy efficiency when building supercomputers.

CSIRO's supercomputer cluster is another GPU-based supercomputer, built in Australia, and it delivers over 256 teraflops. It has 64 NVIDIA Tesla S1070 units (each unit has 4 Tesla GPUs, for a total of 256 GPUs). It supports research areas such as computational biology, climate and weather, multi-scale modeling, computational fluid dynamics, computational chemistry, astronomy and astrophysics, computational imaging and visualization, advanced materials modeling, and computational geosciences. Several projects are currently under way on this supercomputer:

\begin{itemize}
\item high-content analysis of nerve cell images for medical research and drug discovery,
\item deconvolving (un-blurring) 3D images from astronomy, microscopy and medical imaging,
\item reconstructing large 3D computer tomography (CT) images from medicine and materials science and developing specialised software to run on the Australian Synchrotron,
\item quantifying uncertainty in complex environmental models,
\item simulating biomechanical processes in high resolution, such as the movements of a person swimming.
\end{itemize}

GPU programming concepts can also be applied in distributed grid computing. One example is the Folding@home project, whose goal is to understand protein folding, misfolding, and related diseases.



In addition to scientific computing, GPUs are already used in several other areas.

GPUs have been used successfully to accelerate digital video processing and image processing. The ``Badaboom'' video encoder and ``MediaCoder'' can encode H.264 video using NVIDIA CUDA technology. This approach delivers about a $20\times$ performance increase over CPU-based video encoding. Major video-editing packages such as ``CyberLink PowerDirector'' and ``Adobe Premiere'' now use GPU acceleration for video processing, and major image-editing packages such as ``Adobe Photoshop'' use GPU acceleration as well.

%``vReveal'' is a new video editing software and it uses NVIDIA CUDA technology to accelerate video processing and they say it can deliver up to 5x performance increase with GPU acceleration. Some other video editing software developers like ``TMPG express'', ``virtualdub'' and ``NERO vision'' have already added GPU accelerated video processing capabilities with their softwares.

%GPU acceleration has already been used in video decoding also. As an example %"core AVC" is the currently fastest h264 video decoder and it is capable of %decoding video with NVIDIA CUDA technology.

Researchers are using GPUs to optimize database query processing. PostgreSQL is an example of a DBMS that has been used in such research.

GPUs are also used in cryptography. Brute-force password recovery is a very computation-intensive process, so it can benefit from the computational power of the GPU. For example, ``ElcomSoft Distributed Password Recovery'' is a GPU-accelerated password recovery tool, and its developers say that GPU acceleration reduces password recovery time by a factor of 50. ``Parallel Password Recovery'' and ``OctoPass'' are other tools that use GPU acceleration. With the advent of GPU-based password recovery, common password policies (such as a minimum password length of 8) need to be revised.
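The effect of such a speedup on password policy can be estimated with simple arithmetic, sketched below in Python. The alphabet size and the baseline guessing rate are hypothetical values chosen only for illustration; the factor of 50 is the figure quoted above.

```python
def keyspace(alphabet_size, length):
    """Number of distinct passwords of exactly the given length."""
    return alphabet_size ** length


def worst_case_seconds(alphabet_size, length, guesses_per_second):
    """Time needed to try every password of the given length."""
    return keyspace(alphabet_size, length) / guesses_per_second


CPU_RATE = 1_000_000   # hypothetical baseline: guesses per second on a CPU
GPU_SPEEDUP = 50       # speedup factor quoted in the text
ALPHABET = 26          # hypothetical: lowercase letters only

cpu_hours = worst_case_seconds(ALPHABET, 8, CPU_RATE) / 3600
gpu_hours = worst_case_seconds(ALPHABET, 8, CPU_RATE * GPU_SPEEDUP) / 3600
print(f"8 characters, CPU: about {cpu_hours:.0f} hours")
print(f"8 characters, GPU: about {gpu_hours:.1f} hours")

# Each extra character multiplies the search space by the alphabet size.
print(keyspace(ALPHABET, 9) // keyspace(ALPHABET, 8))
```

Under these assumptions, one additional character offsets roughly half of a 50-fold speedup, and two additional characters more than offset it, which is why longer minimum lengths are the usual response.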

Physics simulation is another area in which GPUs can be used. Physics simulation has very high data parallelism: there are thousands of collisions to resolve in every frame, and each collision requires a large number of operations. The highly data-parallel nature of the GPU suits this workload. In the past, dedicated cards were used for physics processing; it is now possible to use graphics cards for physics simulation.

\chapter{Conclusion and Future Directions}

Over the last few years GPU computing has become a widely discussed topic. Many researchers have demonstrated the capability of the GPU for general-purpose computing. In 2007 NVIDIA released the CUDA architecture, which reduced the difficulty of programming GPUs for non-graphics computation. CUDA gave great flexibility in using GPUs for general-purpose computing, and this flexibility generated strong momentum in the GPGPU industry. NVIDIA's next-generation CUDA architecture is called Fermi, and according to its developers ``it makes GPU and CPU co-processing pervasive by addressing the full-spectrum of computing applications''.

In December 2008, OpenCL 1.0 was released by the Khronos Group, and the OpenCL specification is still under development. OpenCL is a cross-platform standard, so both ATI and NVIDIA GPUs can be programmed with it. OpenCL provides features similar to those of CUDA and has attracted considerable interest.

GPU computing has gained an important place in scientific computing and high-performance computing (HPC). In June 2010 the fastest GPU-based supercomputer (with the highest theoretical peak performance in the world) was announced, and in the future many supercomputers will be GPU-based.



In the upcoming years we expect to see several changes in GPU computing.

Recently both NVIDIA and ATI added double-precision support to their GPU hardware. Double-precision support has attracted considerable interest in scientific computing. However, the double-precision performance of GPUs is still limited, so GPU developers are continuing to improve the hardware's double-precision support.

The PCI Express bus has been a bottleneck for many GPU applications. The latest PCI Express 2 bus delivers more bandwidth and overcomes this issue to some extent. The maximum bandwidth of PCI Express 2 is 16 GB/s, and today's CPUs have much higher memory bandwidth than this, so CPU-GPU bandwidth needs to improve further in the future.
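A back-of-the-envelope Python sketch shows why the bus can dominate. It uses the 16 GB/s figure quoted above as an ideal upper bound; the timing model (transfer time plus compute time, with no overlap) and the function names are simplifying assumptions made for illustration.

```python
PCIE2_BYTES_PER_SECOND = 16e9  # maximum bandwidth quoted in the text


def transfer_seconds(num_bytes, bandwidth=PCIE2_BYTES_PER_SECOND):
    """Ideal time to move num_bytes across the bus, ignoring latency."""
    return num_bytes / bandwidth


def offload_pays_off(cpu_seconds, gpu_compute_seconds, bytes_moved,
                     bandwidth=PCIE2_BYTES_PER_SECOND):
    """True if GPU compute time plus bus transfer time beats the CPU time.

    Assumes transfers and computation do not overlap.
    """
    gpu_total = gpu_compute_seconds + transfer_seconds(bytes_moved, bandwidth)
    return gpu_total < cpu_seconds


# Moving 1.6 GB costs about 0.1 s on the bus before any computation happens.
print(transfer_seconds(1.6e9))
print(offload_pays_off(cpu_seconds=1.0, gpu_compute_seconds=0.1, bytes_moved=1.6e9))
print(offload_pays_off(cpu_seconds=0.15, gpu_compute_seconds=0.1, bytes_moved=1.6e9))
```

In the last case the GPU computes faster than the CPU, yet the transfer cost makes offloading slower overall, which is the bottleneck the paragraph above describes.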

AMD is planning to place the GPU and the CPU together on the same die, called an APU (accelerated processing unit). The AMD Fusion APUs are expected to launch in 2011. APIs such as OpenCL have already shown that programs can use both the CPU and the GPU, so this is a promising trend.

In GPU computing there is still little cooperation between GPUs. Windows Display Driver Model 2.1 will help to make use of the multitasking capabilities of GPUs with good memory management.


