Graphics Processing Units Computer Science Essay



A GPU (graphics processing unit) is a single-chip processor designed for 3D applications. It applies lighting effects and transforms objects each time a 3D scene is redrawn, lifting that burden from the CPU and freeing up cycles for other work. In a personal computer, the GPU can sit on a video card or on the motherboard.

A GPU contains dedicated hardware for the mathematical operations used in graphics rendering. Because it implements a number of graphics primitive operations directly, running them on the GPU is much faster than drawing to the screen with the host CPU, which is why GPUs are essential for 3D games.

A GPU can handle graphics workloads far too heavy for a CPU alone.

The term GPU was first introduced with the Nvidia GeForce 256 in 1999, which at the time could process a minimum of 10 million polygons per second and deliver 480 million pixels per second. Its hardware transform and lighting (T&L) engine enhances photorealism and creates realistic images on the screen.

CUDA is NVIDIA's programming platform for GPUs, which enables parallel computing.


A Graphics Processing Unit (GPU), also known as a visual processing unit, is designed to share and speed up the operations, commands, and applications executed by the CPU. A GPU can carry out such tasks much faster and more efficiently than a general-purpose microprocessor. Supercomputers, game consoles, personal computers, workstations, and mobile phones all use GPUs to produce high-quality graphics and to manipulate data quickly. GPUs are not limited to 3D graphics; they are also used to implement algorithms and solve mathematical problems.

TIANHE-1A, the fastest supercomputer in China, contains 14,336 Xeon X5670 processors and 7,168 Nvidia Tesla M2050 GPUs. Tianhe-1A runs the Linux operating system, and the total memory size of the system is 262 terabytes.



GPUs are fast because they have hundreds of cores built in, which lets them process high-quality 3D graphics. Groups of two or more GPUs form clusters, which are needed to build supercomputers. A GPU is designed for parallel programming, in which each instruction is dispatched to a different processor at the same time for fast execution; it is therefore also called a multithreaded processing unit. "Multithreaded" means that more than one instruction can be processed and executed at the same instant through parallel programming. CPUs have a handful of sequential cores, whereas GPUs consist of hundreds of parallel cores, and this parallelism makes GPUs much faster for such workloads. Modern GPUs run tens of thousands of threads concurrently.


Inner View of CPU:

A Central Processing Unit (CPU) is the brain of every computer. It executes, solves, manages, and controls all the calculations, instructions, and input/output devices of the computer. A CPU is composed of several components; the main ones are the control unit, the arithmetic and logic unit (ALU), and the cache memory, which work together with read-only memory (ROM) and random access memory (RAM). Cache memory is a small, fast memory that sits between the processor and the main memory of the system, and it plays a very important role in CPU performance. The processor first sends a request to save or retrieve data to the cache. If the data is not found in the cache, the event is called a cache miss; if it is found, it is a cache hit. A cache miss causes the processor to forward the request on to main memory. The larger the cache, the faster the computer tends to run. In a CPU, a small number of caches must carry all of the computer's data and instructions, and this limits the CPU's performance on highly parallel workloads.

Inner View of GPU:

A Graphical Processing Unit (GPU) boosts the speed and performance of the computer by computing instructions in parallel with the CPU's processor. A CPU has one control unit and one cache serving a few ALUs, whereas a GPU has many cores, each equipped with its own control unit, cache memory, and several ALUs. A single instruction or calculation is solved by breaking it into small pieces and assigning those pieces to the respective cores of the GPU; the results from the cores are then combined to produce the solution at very high speed. This type of processing is called parallel processing. (Fig 1.2) shows the skeleton view of a CPU and a GPU.

A GPU has more transistors than a CPU. Programmers use NVIDIA's CUDA, the main language for GPU programming. GPUs can also be used in the production of heterogeneous systems, in which a GPU is combined with a CPU; this type of system serves both as a programmable graphics processor and as a scalable parallel computing platform.

Two or more GPUs can be combined to solve many large calculations in seconds, providing real-time visual interaction with computed objects through graphics, images, and videos.

Modern computers and laptops come with built-in GPUs, but these are much slower than those built onto video cards. The GPU fetches graphics tasks from the CPU, manipulates them efficiently, and sends the results back to the CPU. A GPU performs the 3D operations that enable it to produce high-end graphics for 3D games and 3D rendering.

The first GPU, introduced by NVIDIA in 1999, was built as a single-chip processor. It offered a notably large leap in 3D gaming performance: the GeForce 256 supported hardware 3D graphics and provided up to a 50% or greater improvement in frame rate. Later, in 2002, ATI Technologies introduced the Radeon 9700 and rebranded the graphics processing unit (GPU) as a visual processing unit (VPU).

GPU Architecture/Pipeline:

Modern GPUs are driven by programs written in C/C++ syntax. The 3D images are compiled and produced by passing through a particular pipeline. The stages, and the data each one hands to the next, are:

3D Application / Program
-> API commands
CPU driver (GPU data/commands)
-> vertex index stream
GPU Front End
-> pre-transformed vertices
Programmable Vertex Processor
-> transformed vertices
Primitive Assembly
-> assembled polygons, lines & points
Rasterization & Interpolation
-> pixel location stream, pre-transformed fragments
Programmable Fragment Processor
-> transformed fragments
Raster Operations
-> pixel updates
Frame Buffer

(FIG 1.3)


3D Application / Program:

In this first stage of the GPU pipeline, the user supplies an application or program, written in C/C++ syntax, to the CPU. The CPU compiles and executes it. If an error or bug is detected during compilation or execution, the application is halted and closed by the CPU. Otherwise, once the application compiles and executes successfully, Application Program Interface (API) commands are passed on to the CPU's driver.

An example of such an application's syntax (OpenGL immediate mode) is given below:




glBegin (GL_TRIANGLES);

glTexCoord2f (2,1); glVertex3f (1,2,1);

glTexCoord2f (3,6); glVertex3f (-3,-3,1);

glTexCoord2f (2,2); glVertex3f (4,-2,3);

glEnd( );



The CPU's driver converts the Application Program Interface (API) commands into GPU data and commands. The input buffer passes the user's program to the driver, which translates the application commands into GPU program commands that are valid for further execution and processing. The GPU data and commands are then transferred to the front end of the GPU.



(FIG 1.4)

GPU Front End:

The GPU front end receives the data and commands from the CPU's driver and manipulates the API commands, organizing the application's data into a sequence so that the vertices of different images can be produced and processed easily. The data is then sent to the Programmable Vertex Processor, which converts pre-transformed vertices into transformed vertices. PCI Express (Peripheral Component Interconnect Express) is also used at this stage.



(FIG 1.5)

Programmable Vertex Processor:

The programmable vertex processor is programmed to create and organize the vertices of the application. Every shape in the application program is composed of many different vertices, and a specific program constructs and positions each vertex used to make the required shapes. These vertices are then sent through the vertex processor, where texture and shader programs generate the graphics attributes of each vertex. The transformed vertices are then sent to the Primitive Assembly stage of the GPU.

(FIG 1.6)

Primitive Assembly:

Primitive Assembly arranges and assembles the generated vertices into points, lines, and polygons. This phase of the GPU builds the application's geometric shapes from those points, lines, and polygons; the shapes are still only an abstract design of the given application. The stage simply combines separate vertices into skeletal geometric shapes, triangles, or other primitives, linking one element to another to make the required images. The resulting material is then sent to the Rasterization and Interpolation stage.

(FIG 1.7)

Rasterizer and Interpolation:

The Rasterizer and Interpolation stage receives the assembled lines, points, and triangles from the primitive assembler. It determines which pixels lie inside each assembled shape with the help of barycentric coordinates. Barycentric coordinates express the position of a point as weights assigned to the vertices of a shape, as if masses were placed at those vertices; the system is closely related to homogeneous coordinates. Interpolation then estimates attribute values at each pixel from the values stored at the vertices of the primitive.

(FIG 1.8)

Programmable Fragment Processor:

The Programmable Fragment Processor receives the shapes and interpolated values from the Rasterizer and Interpolation stage. This part of the GPU shades the fragments of the image using the triangles and functions produced in the previous stage. To build the whole image of the application, many small fragments of the picture are formed first; the fragment program constructs and designs each one, and the fragments are then shaded, textured, and filled with their designated colors. The fully transformed fragments are sent on to the Raster Operations stage.

(FIG 1.9)

Raster Operations:

Raster Operations assembles the fragments to produce the final image of the application. This stage also checks the frame buffer: if the finished image uses more pixels or texture detail than the frame buffer can hold, Raster Operations reduces the resolution and texture so that the frame buffer can support the image.

(FIG 2.0)

Frame Buffer:

The Frame Buffer gives the final touches to the image and signals the I/O devices to display the finished 3D image of the application.

(FIG 2.1)



Developers first tried to do parallel computing with the help of the GPU. In the first phase they were limited to fixed hardware functions such as buffering and rasterization, but once programmable shaders appeared they were able to accelerate matrix calculations. This approach was named GPGPU (General-Purpose computing on GPUs).


CUDA (Compute Unified Device Architecture) is a programming platform for GPUs.

CUDA was first introduced in November 2006. It is a combined software and hardware computing architecture: through CUDA we gain access to GPU instructions and control of video memory for parallel computing. CUDA is used specifically for parallel computing on NVIDIA GPUs, and its language is based on simple C. Through CUDA libraries such as cuFFT and cuBLAS we can solve common numerical problems, and the CUDA language lets us optimize the data transfer rate between GPUs and CPUs.

CUDA can run on both 32-bit and 64-bit operating systems, and it is supported on Windows, Linux, and even Mac OS X.

CUDA also requires tools to implement the given instructions; the NVCC compiler is used to compile CUDA code.

(FIG 2.2)

This is a view of the processing stages in a graphics pipeline. Triangles are first generated by the geometry unit and move to the next phase, where pixels are generated by the raster unit and displayed on the screen.

In this example the two vectors X and Y are added, and the result is shown on the screen. The pixel shader calculates the color of each pixel, and the figure is rasterized. The data is first read by the program, the computation is carried out on it, and the result is written to the output buffer.

In this example of vector addition, the pixel shader can only express the formulas in a C-like syntax. The CUDA application interface is based on standard C. CUDA gives access to 16 KB of shared memory, which increases the transfer rate between system and video memory.


CUDA is the present language used for GPUs. The old GPGPU method did not use vertex shader units in the earlier non-unified architectures: input data was stored in textures, output was written to the screen buffer, and the hardware's features were not fully accessible. The new way of GPU computing does not go through graphics APIs. CUDA exposes 16 KB of shared memory that can be accessed by thread blocks; it is used for linear algebra and image-processing filters, allows caching of the most frequently used data for higher performance, provides an optimized data exchange rate between CPU and GPU, and also offers an assembler-level interface for access from different programming languages.

CUDA consists of two APIs

1). High Level API

2). Low level API

(FIG 2.3)

The High-Level API (the CUDA runtime API) sits on top of the Low-Level API (the CUDA driver API). The two cannot be used at the same time: an application works through one or the other, not both in parallel. The runtime API's instructions are translated into simpler instructions and processed by the CUDA driver API.


(FIG 2.4)

A GPU consists of several clusters. Each cluster has a large block of texture fetch units and two to three streaming multiprocessors, each of which has eight computing units and two super function units. The SIMD (Single Instruction, Multiple Data) principle is used for instruction execution, with a group of 32 threads working at the same time on processors arranged in parallel. A notable feature of the GPU is its power efficiency: it has a higher transfer and processing rate than the CPU. The processors work according to the tasks given to the GPU. If a game is running at a low 800x600 resolution, the load on the processors is light; at 1680x1050 the load on the GPU roughly doubles, and its processors must work harder to show the required graphics on the display.

This execution method is called Single Instruction, Multiple Threads (SIMT). A total of 16 KB of shared memory is available to each multiprocessor and is used to exchange data between the threads of a single block. Multiprocessors can also access video memory, but that involves high latencies and worse throughput. A multiprocessor does not match a multi-core CPU: it is designed to run up to 32 warps of such operations, with the hardware selecting which warps execute on each cycle. A CPU core executes one program at a time, but a GPU can execute more than one program at a time and can process thousands of threads simultaneously. Present GPUs can run 3D graphics at full resolution. GPUs are high-latency, high-throughput processors.

High throughput means the processors must handle millions of pixels in a single frame, and GPUs use stream processing to achieve it. They are designed to solve problems that tolerate high latencies, so less cache is required: instead of a large cache, GPUs dedicate more of their transistor area to computational horsepower.

(FIG 2.4)

The above diagram shows the Compiler Stages of CUDA application


CUDA programming makes GPUs more useful than they already are. It brings high-performance computing within easier reach and adds many new features, including shared memory, thread synchronization, double precision, and integer operations. CUDA is an excellent method for increasing the effective performance of GPUs.

Any programmer can use CUDA for parallel computing, and if an algorithm maps well onto parallel execution the results can be surprising. Parallel computing raises the performance of the GPU: for such workloads CPUs cannot compete with GPUs, while GPUs cannot do everything CPUs do. Developers are working constantly to make GPUs stronger. CPUs are not as fast as GPUs at parallel work, and GPUs are not as capable at fast serial processing as CPUs. GPUs are now moving toward CPU-like flexibility, and CPUs, in turn, are also trying to become more parallel.