CUDA Brook Parallel




This report presents an overview of GPUs and of the CUDA and Brook+ programming models for parallel computing.

GPU computing provides a fully programmable, massively multithreaded chip with up to 128 scalar processor cores, capable of delivering hundreds of billions of operations per second. Researchers across many scientific and engineering disciplines are using this platform to accelerate important computations by up to two orders of magnitude.

CUDA is a scalable parallel programming model and software environment for parallel programming. By using small extensions to the C language, CUDA allows programmers to focus on writing efficient parallel algorithms without the burden of learning a multitude of new programming constructs. Brook+ differs slightly from CUDA in its details but serves the same purpose; the main difference is simply that CUDA comes from NVIDIA while Brook+ comes from AMD. Both are aimed at making programs perform well on parallel architectures.

Finally, a comparison between CUDA and Brook+ is made in order to bring out their characteristics.

Sample code is attached at the end of this report for a further understanding of how code runs a multitasking mechanism based on concurrency theory.

  • Introduction to GPUs

GPU stands for graphics processing unit; it is the unit that does all the processing on a graphics (video) card, and its function is similar to that of the CPU, the central processing unit of a computer. The difference between a CPU and a GPU is that the GPU performs only the calculations required for graphics and video, whereas the CPU runs calculations for many more system processes than graphics alone. There are two main manufacturers of graphics cards and GPUs: NVIDIA and ATI.

  • Parallelism

There are two types of parallelism: task parallelism and data parallelism. Task parallelism is a form of parallelisation of computer code across multiple processors in parallel computing environments; it is based on distributing execution processes (threads) across different parallel computing nodes. The processes are largely independent, with little communication between them, and this style is relatively easy to use. In contrast, data parallelism is based on dividing the data across the parallel computing nodes: the same computation is executed on a large amount of data, and there are no dependencies between data elements at each step of the computation. Data parallelism can saturate many ALUs, but it often requires a redesign of traditional algorithms.[4]

GPUs are designed for highly parallel tasks such as rendering. They process independent vertices and fragments; temporary registers are zeroed during processing, there is no shared or static data, and there are no read-modify-write buffers, so there is no communication between vertices or fragments. GPUs suit data-parallel processing because their architectures are ALU-heavy, as described in the section above, so a great deal of computing power can be gained. In addition, GPU memory systems are designed to stream data, and linear access patterns can be pre-fetched.[2]

  • GPU Architecture

A state-of-the-art GPU has 8 vertex processors and 24 pixel processors (NVIDIA GeForce 7800 GTX), although, depending on their cost, GPUs ship with a variety of vertex/pixel processor counts. Different GPUs are 100% software-compatible with each other and interchangeable without re-coding: the number of vertex/pixel processors does not affect what the user can do on the GPU, only how fast it can be done. Recently GPUs have become programmable in high-level languages such as CUDA and Brook. As an example, the NVIDIA GeForce 8800 is composed of 128 stream processors, each running at a frequency of 1350 MHz.

Figure 1.0: NVidia GeForce 7800 GTX architecture

ATI, on the other hand, chose a different architecture for the 2900 XT. Instead of using SIMD like NVIDIA, it uses a 5-way MIMD design, which requires the five instructions issued together to be independent of each other. A Radeon HD 2900 can support up to 320 simple operations, and this architecture operates at a frequency of 742 MHz.


General-purpose computing on graphics processing units (GPGPU) is a trend based on using GPUs, instead of the CPU, to perform computations for a variety of tasks. The inclusion of higher-precision arithmetic in the rendering pipelines, together with programmable stages, allows software developers to use GPUs for non-graphics operations. Because GPUs have a parallel architecture built around stream-processing approaches, many real-time and concurrent computing problems can be solved effectively. Although GPGPU exploits parallel processing, it still has limitations: programming the GPU is complicated and difficult; the memory interface is limited; and a graphics API must be used. An application programming interface (API) is a set of declarations of the functions of an operating system, library or service whose main purpose is to support requests from computer programs. In addition, bandwidth between the GPU and CPU is limited, and in certain conditions a frame-buffer read can cause a pipeline flush, so large and frequent communication with the CPU must be avoided.[17]

  • Introduction of CUDA

Mainstream processor chips now run as parallel systems. CUDA (Compute Unified Device Architecture) is a scalable parallel programming model and a software environment for parallel computing. CUDA was developed by NVIDIA and works with all NVIDIA GPUs from the G8X series onwards.[4]

CUDA is a technology that uses minimal extensions to the C/C++ environment to code algorithms for execution on the graphics processing unit (GPU), including solutions to the most complex compute-intensive challenges.[5] Through its multitasking mechanism, the operating system can manage access to the GPU by several CUDA programs running concurrently. CUDA supports both gather and scatter operations; neither reads nor writes are cached.[14] CUDA rests on three key parallel abstractions: a hierarchy of concurrent threads, a shared-memory model for cooperating threads, and lightweight synchronisation primitives.[5] These guide the user to divide a problem into sub-problems that are solved in parallel, with threads cooperating to solve each sub-problem. Thus a CUDA program can execute on any number of processor cores.[5]

  • Hierarchy of Concurrent thread

CUDA allows the programmer to define kernels; only one kernel runs at a time. On the GPU a kernel runs as a grid of blocks, each block consisting of threads. When a kernel is called, it is executed N times in parallel by N different CUDA threads, as opposed to only once like a regular C function. Each thread executing the kernel has its own unique thread ID. Threads are grouped into thread blocks, and threads within the same block can cooperate by sharing data through shared memory. A parallel kernel may consist of many threads, all of which execute the same sequential program within a block.[6] Multiple blocks are then arranged into a grid of thread blocks. Thread blocks execute independently, either in series or in parallel.

  • Shared Memory Model for cooperating thread

Each thread in a thread block has its own private memory; each thread block has a shared memory visible to all threads in the block; and all threads can access a global memory.

  • Lightweight Synchronization Primitives

When threads access the same shared memory simultaneously, data hazards can arise. To avoid this, the programmer can set a synchronisation point in the kernel, a barrier at which all threads must wait until every thread has reached it, by calling an intrinsic C function, __syncthreads().[6]

  • Language Extension

There are four types of language extension to C:

Function type qualifiers that specify whether a function executes, and can be called, on the host or on the device, such as __device__, __global__ and __host__.

Variable type qualifiers that specify the memory location of a variable on the device, such as __device__, __constant__ and __shared__.

A new directive that specifies how a kernel is executed on the device from the host, for example: foo<<<GridDim, BlockDim>>>(...)

Four built-in variables that specify the grid and block dimensions and the block and thread indices.[6]

  • Limitation

Two major limitations lie in the hardware: the small size of the shared memory and the lack of basic synchronisation methods.

To make cooperation between threads efficient, the synchronisation points must be lightweight and the cooperating threads must reside on the same processor core. The memory available in a processor core therefore limits the number of threads in a block.[6]

  • Brook+ and stream processing

Brook is a compiler and runtime implementation of the Brook stream programming language for graphics hardware. It is an extension of standard ANSI C, developed to integrate the concepts of data-parallel computing and arithmetic intensity. It is a cross-platform language, capable of running on Linux and Windows, on both ATI and NVIDIA hardware. Brook+ is AMD's implementation of the Brook GPU specification on AMD's compute abstraction layer, with some enhancements. The main objective of these languages is to simplify the programming of the GPU: the user does not need to understand graphics to program it.

Stream processing was developed as a new paradigm to maximise the potential and efficiency of parallel computing. It is built from two parts:

Stream - a series of objects that can be processed in parallel, each undergoing the same computation.

Kernel - a function performed on the whole stream, which behaves like a “for-each” loop.

Starting from a set of data called a stream, a series of kernel functions is applied to each element of the stream. Uniform streaming, where one kernel function is used for all elements in the stream, is the most common pattern. Kernel functions are usually pipelined, and local on-chip memory is reused to minimise external memory bandwidth.

In Brook the basic data type is the stream, and the most important kind of function is the kernel, which takes one or more input streams and produces an output stream. These two key elements make the Brook language particularly powerful for stream processing.[18]

  • Streams

A stream is a series of data operated on in parallel. In Brook+, streams are declared with an angle-bracket syntax similar to that of arrays. Each stream consists of elements, which may be floats, Cg vector types such as float2, float3 and float4, or structures composed of these native types. For example, a stream of rays can be defined as:

typedef struct ray_t {
    float3 o;
    float3 d;
    float tmax;
} Ray;

Ray r<100>;

Streams are similar to C arrays; the difference is that access to stream data is restricted to kernels and to the streamRead and streamWrite operators, which transfer data between memory and streams.[19]

  • Kernels

Brook+ kernels are special functions, declared with the keyword 'kernel', that operate on streams. Calling a kernel on a stream performs an implicit loop over the elements of the stream, invoking the body of the kernel for each element.

An example kernel is shown:

kernel void saxpy (float a, float4 x<>, float4 y<>,
                   out float4 result<>) {
    result = a*x + y;
}

void main (void) {
    float a;
    float4 X[100], Y[100], Result[100];
    float4 x<100>, y<100>, result<100>;

    ... initialize a, X, Y ...

    streamRead(x, X);            // copy data from mem to stream
    streamRead(y, Y);
    saxpy(a, x, y, result);      // execute kernel on all elements
    streamWrite(result, Result); // copy data from stream to mem [20]
}

  • Double precision floating point

The initial hardware platform is the FireStream 9170 board, which includes 2 GB of onboard GDDR3 memory and a single ATI Radeon 3870 GPU. It is the first AMD GPU on the market to support double-precision floating point. The chip achieves 500 peak gigaflops of single-precision or 102 peak gigaflops of double-precision performance. By incorporating 320 stream processors, the chip can process many data operations in parallel. Since the GPU is implemented on a 55 nm process technology, the whole board consumes less than 100 watts, which yields an impressive 5 gigaflops/watt of single-precision performance (1 gigaflop/watt for double precision). The ability to offer double-precision floating point while constraining power consumption is a big deal in the GPU computing space. Double precision, even at relatively low peak performance, gives the GPU a more complete story in technical computing.[21]

  • Limitation of brook+

Brook+ has some limitations in its files: the Brook+ header files can sometimes interfere with normal C definitions, producing undesirable output. For this reason the Brook+ kernel definitions were placed in headers while the main body of the code was written in C++, which prevented errors due to Brook+/C compiler interaction.

  • Comparison of CUDA and Brook+

CUDA and Brook+ come from different vendors, NVIDIA and AMD respectively, and each API is vendor-dependent. Both provide the developer with a high-level API (application programming interface). CUDA offers two APIs, a high-level runtime API and a lower-level driver API, while the Brook+ API supports double-precision floating point and performs cleanly in its target areas.[8]

Although CUDA and Brook+ are both C-based extensions in the spirit of Brook for GPUs, and share some common elements, they are incompatible and not portable between each other: code written for one cannot be compiled for a GPU from the other vendor, whether AMD or NVIDIA.[7]

Brook+ supports Windows XP, Linux and Vista, in both 32-bit and 64-bit versions, whereas CUDA supports only Windows XP and Linux and does not support 64 bits. Moreover, CUDA works only with single precision while Brook+ works with both single and double precision.[9]

A CUDA executable requires two dynamic libraries, the CUDA runtime library (cudart) and the CUDA core library (cuda), and CUDA uses nvcc as its compiler.[12] Brook+, by contrast, uses the BRCC compiler to convert source files (.br) into .cpp files using BRT, the Brook RunTime library. The Brook RunTime library is a class library that presents a generic interface for the compiler to use.[13]

Brook+ (AMD) and CUDA (NVIDIA) both support a common standard: each provides OpenCL, an open compute language, to the public.[10]

For CUDA the programmer needs to specify the size and shape of the data, and the block size indicates the sub-area of the original grid that will be assigned to a set of stream processors sharing local memory. The GPU's stream processors are grouped into multiprocessors, and every multiprocessor has its own registers and local memory. The block size must be chosen so that a multiprocessor has enough resources to execute a block at a time. Blocks are then divided into warps, groups of threads that execute simultaneously.[11]

Compared on a physical level, AMD's (Brook+) latest video card, the ATI X1950 XTX, consists of 384 million transistors,[15] which pales in comparison to the 754 million transistors of NVIDIA's (CUDA) video card, the 8800 GT.[16]

  • Conclusion

The research in this report details the GPU, the programming capabilities of CUDA and Brook+, and the way they are used in concurrent programming.

CUDA and Brook+ are programming models very well suited to exposing the parallel capability of GPUs. Both are extensions to the C language for stream programming, aimed at providing a standard programming style, based on C, that programmers can quickly become familiar with.

Furthermore, when CUDA and Brook+ are compared, the analysis suggests that Brook+ is user-friendly, stable, reliable, and adaptable to future change.

As a future improvement, the limited shared memory per processor core should be addressed, increasing features and performance without sacrificing more cores.




[3] http://www.gpgpu.org/asplos2008









[12] Eva Suci, Department of Informatics, University of Bergen, Norway. Multicore Seminar, December 10, 2007.










Appendix (example code: thread-pool multitasking mechanism, Windows C)

#define _WIN32_WINNT 0x0500

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

struct dataBlock {
    int blockNum;
    int bytesRead;
};

int addToQueue( struct dataBlock *pdblk );

long delayMillisecs;
long activeItems = 1L;
HANDLE hHandle;
BOOL atEOF;

#define BLK_SIZE 5000

int main ( int argc, char *argv[] )
{
    FILE *infile;
    char *inbuf;
    int bytes_read = 0;
    long bytes_read_total = 0;
    long block_number = 0;
    struct dataBlock *dblk;

    if ( argc != 3 )
    {
        printf (
            "Usage: threadQwithPool file delay-duration(ms)\n" );
        return ( -1 );
    }

    infile = fopen ( argv[1], "r+b" );
    if ( infile == NULL )
    {
        printf ( "Error opening %s\n", argv[1] );
        return ( -1 );
    }

    inbuf = (char*) malloc ( BLK_SIZE+1 );
    if ( inbuf == NULL )
    {
        printf ( "Could not allocate read buffer\n" );
        return ( -1 );
    }

    delayMillisecs = atol( argv[2] );

    // hHandle is an event that is set to unsignalled.
    // When the number of executing threads == 0, the last
    // thread will signal this event. This signal tells this
    // main line that it can exit (see WaitForSingleObject()
    // call below).
    hHandle = CreateEvent( NULL, TRUE, FALSE, NULL );

    // now start reading (so, becoming the producer thread)
    bytes_read = (int) fread ( inbuf, 1, BLK_SIZE, infile );
    if ( bytes_read < BLK_SIZE )
    {
        printf (
            "Need a file longer than %d bytes\n", BLK_SIZE );
        return ( -1 );
    }

    bytes_read_total += bytes_read;
    dblk = (struct dataBlock *) malloc(
        sizeof( struct dataBlock ));
    dblk->blockNum = ++block_number;
    dblk->bytesRead = bytes_read_total;
    addToQueue( dblk );

    while ( !feof ( infile ))
    {
        bytes_read = (int) fread ( inbuf, 1, BLK_SIZE, infile );
        bytes_read_total += bytes_read;
        dblk = (struct dataBlock *) malloc(
            sizeof( struct dataBlock ));
        dblk->blockNum = ++block_number;
        dblk->bytesRead = bytes_read_total;
        addToQueue( dblk );
        //Sleep( 50 ); // if you want to, simulate I/O delay
    }

    printf ( "\nRead a total of %d bytes\n",
        (int) bytes_read_total );

    // cannot exit right away or you'll kill the threads
    // executing in the thread pool, so wait to give pool time to
    // finish all processing. We wait for the event handle to
    // be signaled (no threads left) or for 5 seconds (that is,
    // 5000 milliseconds), whichever comes first.
    InterlockedDecrement( &activeItems );
    if ( activeItems != 0 )
    {
        WaitForSingleObject( hHandle, 5000 );
    }
    return ( 0 );
}

/*
 * The consumer thread. Pauses for command-line specified number
 * of milliseconds and then prints out the block number.
 */
DWORD CALLBACK ProcessData( void* pv )
{
    struct dataBlock *pdblk = (struct dataBlock *) pv;
    Sleep( delayMillisecs );
    printf( "processing block %d\n", pdblk->blockNum );
    // now, decrement the number of active work items
    InterlockedDecrement( &activeItems );
    if ( activeItems == 0 )
    {
        // if this is the last thread out, then signal the
        // event, so that the main line knows it can exit.
        printf( "activeItems = 0, signaling main thread\n" );
        SetEvent( hHandle );
    }
    return( 0 );
}

/*
 * Add items to the queue.
 */
int addToQueue( struct dataBlock *pdblk )
{
    // interlocked increments increase a long integer as a
    // single atomic operation, so they're safe for use by
    // multiple threads.
    InterlockedIncrement( &activeItems );
    BOOL ret = QueueUserWorkItem(
        ProcessData,
        (PVOID) pdblk,
        WT_EXECUTELONGFUNCTION );
        //WT_EXECUTEDEFAULT ); << don't use, despite docs.
    if ( ! ret )
    {
        printf(
            "Error occurred with blk %d, bytes read: %d\n",
            pdblk->blockNum, pdblk->bytesRead );
    }
    return( 0 );
}

(original source from:

Appendix (example code for CUDA-Matrix Multiplication)

// Device multiplication function called by Mul()
// Compute C = A * B
// wA is the width of A
// wB is the width of B
__global__ void Muld(float* A, float* B, int wA, int wB, float* C)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;
    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;
    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;
    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;
    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;
    // The element of the block sub-matrix that is computed
    // by the thread
    float Csub = 0;
    // Loop over all the sub-matrices of A and B required to
    // compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {
        // Shared memory for the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        // Shared memory for the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
        // Synchronize to make sure the matrices are loaded
        __syncthreads();
        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];
        // Synchronize to make sure the computation is done before
        // loading the next pair of sub-matrices in the next iteration
        __syncthreads();
    }
    // Write the block sub-matrix to global memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}

(Original source from :

// includes, system
#include <iostream>

// Required to include CUDA vector types
#include <vector_types.h>
#include "cutil.h"

#define BLOCK_DIM 16 // tile size; thread blocks are BLOCK_DIM x BLOCK_DIM

__global__ void
transpose( float *out, float *in, int width, int height ) {
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];    // allocate shared memory
    unsigned int xBlock = blockDim.x * blockIdx.x;  // set up indexing
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;
    if ( xIndex < width && yIndex < height ) {      // check that we are in the
        unsigned int index_in = width * yIndex + xIndex; // domain, calculate
        unsigned int index_block = threadIdx.y * BLOCK_DIM + threadIdx.x; // more indices
        block[index_block] = in[index_in];          // write to shared memory
        index_transpose = threadIdx.x * BLOCK_DIM + threadIdx.y; // calculate output
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x; // indices
    }
    __syncthreads(); // barrier: wait until the whole tile is in shared memory
    if ( xIndex < width && yIndex < height ) {
        out[index_out] = block[index_transpose];    // write to global memory
    }
}

(original source from

// includes, system
#include <iostream>

// Required to include CUDA vector types
#include <vector_types.h>
#include "cutil.h"

// declaration, forward
extern "C" void runTest(const int argc, const char** argv,
                        char* data, int2* data_int2, unsigned int len);

// Program main
int main(int argc, char** argv)
{
    // input data
    int len = 16;
    // the data has some zero padding at the end so that the size is a multiple of
    // four; this simplifies the processing as each thread can process four
    // elements (which is necessary to avoid bank conflicts) but no branching is
    // necessary to avoid out-of-bounds reads
    char str[] = { 82, 111, 118, 118, 121, 42, 97, 121, 124, 118, 110, 56,
                   10, 10, 10, 10 };
    // Use int2 to show that CUDA vector types can be used in cpp code
    int2 i2[16];
    for ( int i = 0; i < len; i++ )
    {
        i2[i].x = str[i];
        i2[i].y = 10;
    }
    // run the device part of the program
    runTest(argc, (const char**)argv, str, i2, len);
    std::cout << str << std::endl;
    for ( int i = 0; i < len; i++ )
    {
        std::cout << (char)(i2[i].x);
    }
    std::cout << std::endl;
    CUT_EXIT(argc, argv);
}