Massively parallel computation




Massively parallel computation can be achieved using Graphics Processing Units (GPUs) and the Compute Unified Device Architecture (CUDA). In this research, we present how high performance can be achieved for traversing various graphs using CUDA, and we show how such a program executes on a Graphics Processing Unit. The first phase covers an overview of the hardware architecture and the CUDA programming model. The later phase discusses the Depth First Search algorithm in both its recursive and non-recursive forms, and includes a comparison of the two.


Depth first search is a technique which accomplishes more than simply traversing graphs. It is used in many problem solving techniques as well as in game problems, and in some Artificial Intelligence problems such as expert systems and theorem proving. The main advantage of depth first search is that it requires very little memory. Many algorithms have been implemented for sequential depth first search, but there are very few for parallel depth first search. Parallel depth first search requires the nodes to be distributed among several processors, so nodes on one processor may need access to nodes held by another processor, which makes it difficult to implement. The main goal of this research is to find how optimally the non-recursive depth first search executes on a graphics device. We have used the GeForce GTX 260 graphics card by Nvidia.

Graphics Processing Unit

The majority of software programs are traditionally written as sequential code, which executes step by step on a single processor, so not much speed can be achieved. As technology advances, users become accustomed to ever newer technologies and want much higher performance than they accepted in the past. Fulfilling this requirement calls for parallel programming, in which multiple threads execute simultaneously. The concept of parallel programming is not new: parallel programs have long been developed, but they traditionally ran on very large, very expensive computers, so parallel applications were limited to those machines.

Graphics Processing Unit (GPU) as parallel computers

A Graphics Processing Unit consists of many-core processors and is used to achieve high floating point performance. The GPU is much faster than general purpose microprocessors: floating point calculation is roughly 10 times faster on a GPU than on multi-core CPUs. This large performance difference between many-core GPUs and general purpose multi-core CPUs comes from their design. The CPU is optimized for sequential code; it executes instructions from a single thread of execution in parallel where possible while maintaining the appearance of sequential execution. To reduce data and instruction access latencies, large cache memories are provided in multi-core CPUs. The new general purpose multi-core microprocessors are designed for high sequential code performance.

The GPU is mainly used by the gaming industry, because it improves graphics performance for video games; its high performance floating point computation capability is very useful for rendering video frames. The optimization criterion for the Graphics Processing Unit is the simultaneous execution of large numbers of threads. The hardware design of a GPU is such that it takes advantage of a large number of threads, which divide the work and execute it in parallel. Cache memories are used to control bandwidth so that multiple threads do not all need to access DRAM. In the hardware design of a GPU, much of the chip area is devoted to floating point calculation.

The GPU is mainly designed for high computation, so it may not perform some tasks which are optimally performed by the CPU. Consequently, some of the applications may require both CPUs and GPUs. To execute a sequential part, the CPU is required and to execute a parallel part, the GPU is required. The CUDA programming model is used for this purpose because it supports both CPU and GPU executions of an application.

Architecture of CUDA capable GPU

A GPU of this generation consists of 16 highly threaded streaming multiprocessors. Each streaming multiprocessor consists of 8 streaming processors, so the total number of processors is 16*8 = 128. Each streaming processor has a multiply-add unit and an additional multiply unit, all running at a speed of 1.35 GHz. Each GPU currently has 1.5 GB of DRAM, which is used as a buffer for graphics.

One important characteristic of the GPU is performance: current GPUs are roughly 10 times faster than any currently available microprocessor.

For our project, we have used the NVIDIA GeForce GTX 260.

Why do we need more speed?

For certain applications, high performance is necessary, and to achieve it we need parallelism in the application so that the execution time can be reduced. Parallelism has been developed for applications such as video and audio coding and manipulation; it is also used to achieve high definition graphics for applications such as High Definition TV, where multiple threads work simultaneously. Another example is the gaming industry: most games are sets of scenes, and each scene can be divided among different threads to achieve parallelism. All applications requiring parallelism involve large amounts of data to be processed as well as a large amount of computation on that data.

CUDA Programming Model

The computing system consists of a central processing unit such as an AMD or Intel microprocessor and various devices. One type of device is a graphics card, which consists of massively parallel processors that achieve high performance computation. The many-core processors on a graphics card include a large number of arithmetic execution units, which makes computation faster compared to the CPU. Many applications have both sequential and parallel parts, so for such applications the parallel code can be executed on the GPU while the sequential code runs on the CPU. Such a configuration has everything we require: data parallelism together with massive arithmetic computation. Data parallelism is the most important property that can be exploited using CUDA.

Data parallelism using CUDA

Many applications in the real world use CUDA to process massive amounts of data, which would take a long execution time on single processor machines. Data parallelism allows the same arithmetic operations to be safely performed on many elements of a data structure simultaneously. Using CUDA, massive amounts of data can be accessed in parallel, which improves both performance and execution time.

CUDA Program Structure

A CUDA program executes in two phases: on the central processing unit and on the graphics processing unit. The parts of the code which have no parallelism, or very little, are executed on the host, which is the CPU. The parts which involve massive parallel computation are executed on the device, which is the GPU. A CUDA program combines both host and device code into a single program. NVIDIA's compiler, NVCC, separates host code from device code. The host code is ANSI C and is compiled with the host's standard C compiler. The device code is an extension of ANSI C with additional keywords that support parallel coding; it also includes the functions that invoke the kernel and the data structures related to the kernel. The device code is compiled by NVCC and executed on the GPU.

The kernel functions generate a large number of threads, mainly to achieve data parallelism. These threads take very few cycles to generate and schedule compared to CPU threads, which take thousands of clock cycles to schedule.

When the kernel is invoked, execution moves to the device, where hundreds of threads run simultaneously to achieve data parallelism. The threads are organized into blocks, and the collection of blocks is called a grid.

Device Memories and Data Transfer

The host and device have their own separate memory spaces. Devices come with a limited amount of dynamic random access memory. To execute a kernel on the device, the programmer needs to allocate memory on the device; after allocation, data can be transferred from host to device memory. After the device performs a calculation, it needs to write the result back to the host, so memory must be allocated on the host as well. The CUDA application programming interface provides functions to allocate device memory and to copy data between host and device.

Basic concepts for parallel programming and parallel algorithms

Basic parallel computation involves dividing a problem in such a way that its sub-problems can be solved on different threads at the same time. The main goal of parallel computation is to solve a particular problem in less time and in an efficient way, so that the solution can be applied to large amounts of data without hurting the performance of the application.

The main step is to find the concurrency within a particular problem; once the concurrency is found, it is easy to understand the problem and to divide it. Other steps to consider are identifying and managing data dependencies. There are also many performance criteria to consider: for instance, if the load is divided in an imbalanced or improper way, performance suffers.

Main Part

Simple Depth First Search

When addressing problems like path finding or gaming problems, graph search methods are very useful. In directed graphs, the solution is found by searching from the initial node to the end node. The first step is to expand the initial node by generating its successors. At every level, the most recently generated node is expanded until the destination node or a leaf node is found; then the search backtracks, exploring nodes which have not been visited yet. The time and space complexity of Depth First Search differs on the basis of which nodes are chosen for expansion, and varies according to the application in which it is used. Since DFS traverses the entire graph, the running time is proportional to the sum of the total number of nodes and the total number of edges, meaning the time complexity of DFS using an adjacency list (whether recursive or not) is O(|V|+|E|).

There are many drawbacks to using sequential DFS in some applications. For example, DFS is used in some Artificial Intelligence search problems where the graph to be searched is either too large to traverse or effectively infinite; in either situation DFS may not terminate, so the search is performed only for a limited number of nodes up to a certain depth. If the graph is too large, it may be very difficult to store the whole graph in memory, as memory size is limited. There is more than one way to implement DFS: it can be implemented recursively or iteratively. The Nvidia GPU does not support recursion, so a parallel implementation of DFS must use the iterative, non-recursive method.

Correctness of Depth First Search

The Depth First Search graph traversal visits every node of a graph, traversing all the successors of each node. We have used an adjacency matrix to implement Depth First Search, so visiting each node is O(n), where n is the number of graph nodes, and traversing all the successors of all the nodes is O(n²). The total is therefore O(n + n²) = O(n²).
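A small C++ sketch (an illustrative version, not the project's actual code) makes the O(n²) cost visible: for every visited node the algorithm must scan a full row of the adjacency matrix.

```cpp
#include <vector>

// DFS over an adjacency matrix. Scanning one matrix row per visited node
// costs O(n), so traversing all n nodes costs O(n^2) overall.
void dfs_matrix(const std::vector<std::vector<int>>& adj, int v,
                std::vector<bool>& visited, std::vector<int>& order) {
    visited[v] = true;
    order.push_back(v);
    for (int w = 0; w < static_cast<int>(adj.size()); ++w)  // O(n) row scan
        if (adj[v][w] && !visited[w])
            dfs_matrix(adj, w, visited, order);
}
```

For example, on the 4-node undirected graph with edges 0-1, 0-2 and 1-3, a traversal starting at node 0 visits the nodes in the order 0, 1, 3, 2.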

A Parallel Depth First Search

A parallel depth first search can be formulated by sharing the task among multiple processors. The basic idea is that if there is more than one processor, a large graph can be searched more quickly by dividing it into independent parts, which are searched at the same time, one on each processor. Each processor searches its part of the graph in depth first fashion. When a processor has finished its part, it tries to communicate with the other processors to obtain unsearched work. When the destination node is found the search is complete, and all the processors quit. If the graph is finite and has no solution, all the processors eventually become idle and terminate the search.
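A highly simplified CPU-side sketch of this idea, using C++ threads in place of processors, is shown below. Here each worker searches one subtree of the root, and an atomic flag per node ensures that no node is searched twice; this is only one possible decomposition, chosen for illustration, and omits the work-stealing communication described above.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// A worker claims a node atomically; if another worker got there first,
// it gives up on that subtree. Otherwise it continues depth first.
void dfs_worker(const std::vector<std::vector<int>>& adj, int v,
                std::vector<std::atomic<bool>>& claimed) {
    bool expected = false;
    if (!claimed[v].compare_exchange_strong(expected, true))
        return;  // already searched by another worker
    for (int w : adj[v])
        dfs_worker(adj, w, claimed);
}

// Divide the graph at the root: one worker per successor of the root.
std::vector<bool> parallel_dfs(const std::vector<std::vector<int>>& adj,
                               int root) {
    std::vector<std::atomic<bool>> claimed(adj.size());
    for (auto& c : claimed) c.store(false);
    claimed[root].store(true);

    std::vector<std::thread> workers;
    for (int child : adj[root])
        workers.emplace_back(dfs_worker, std::cref(adj), child,
                             std::ref(claimed));
    for (auto& w : workers) w.join();

    std::vector<bool> visited(adj.size());
    for (std::size_t i = 0; i < adj.size(); ++i)
        visited[i] = claimed[i].load();
    return visited;
}
```

The atomic compare-and-exchange stands in for the inter-processor communication: it is what prevents two workers from duplicating work when their subtrees overlap.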

Non Recursive Depth First Search

Initially, the stack is loaded with the source node, which is input by the user. Next the algorithm pops a node from the stack, decrementing the stack top by 1 (so for a one-element stack it again becomes -1).

If the popped node has not been visited, it is visited and its visited flag is set to true.

Then all of the node's unvisited successors are pushed onto the stack, and the process repeats until the stack is empty.
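The steps above can be sketched in C++ as follows (an illustrative CPU-side version, not the CUDA implementation; the graph is given as an adjacency list here for brevity):

```cpp
#include <stack>
#include <vector>

// Iterative DFS: push the source, then repeatedly pop a node, mark it
// visited, and push its unvisited successors onto the stack.
std::vector<int> dfs_iterative(const std::vector<std::vector<int>>& adj,
                               int source) {
    std::vector<bool> visited(adj.size(), false);
    std::vector<int> order;
    std::stack<int> st;
    st.push(source);

    while (!st.empty()) {
        int v = st.top();
        st.pop();
        if (visited[v]) continue;  // a node may be pushed more than once
        visited[v] = true;
        order.push_back(v);
        for (int w : adj[v])
            if (!visited[w])
                st.push(w);
    }
    return order;
}
```

Note that because the stack reverses the order of the pushed successors, the iterative version may visit siblings in the opposite order from the recursive version.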

Recursive Depth First Search

A Recursive Depth First Search does not require an explicit stack; the recursion uses the call stack implicitly. The user inputs the first node from which to start the traversal, and that node becomes the current node. The graph is then traversed along any edge incident to the current node. If the edge leads to an already visited node, that edge is skipped. If the edge leads to an unvisited vertex, that vertex is visited and becomes the current node. When the current node has no unexplored edges left, the search backtracks to the previously visited node. This continues until the goal node is found or every reachable node has been visited.


For all edges e which are adjacent to v, do the following:

Let w be the endpoint of e opposite v. If w is unvisited, then mark w as visited and recursively call DFS(w); else mark e as a back edge.
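A C++ rendering of this recursion (an illustrative sketch using an adjacency list, without the back-edge bookkeeping) looks like this:

```cpp
#include <vector>

// Recursive DFS: visit v, then recurse into each unvisited neighbour.
// Edges leading to already visited nodes are the "back edges" of the
// pseudocode; they are simply skipped here.
void dfs_recursive(const std::vector<std::vector<int>>& adj, int v,
                   std::vector<bool>& visited, std::vector<int>& order) {
    visited[v] = true;
    order.push_back(v);
    for (int w : adj[v])
        if (!visited[w])
            dfs_recursive(adj, w, visited, order);
}
```

Each activation record of `dfs_recursive` plays the role that an explicit stack entry plays in the non-recursive version.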

Experimental Results

Comparison between Recursive DFS and Non Recursive DFS

In our experiments, execution of the recursive depth first search was much faster than the non-recursive depth first search.

Conclusion and Future work

Today, parallel programming is used by many applications requiring massive high performance computation. Graphs with massive numbers of vertices are used to develop various applications in the gaming industry and other Artificial Intelligence related fields. We implemented the sequential part of a non-recursive depth first search and generated significant results.

To achieve more speed, a non-recursive depth first search can be implemented on GPUs using CUDA: vertices are assigned to threads, and execution takes place in such a way that each thread performs the traversal on different nodes simultaneously.

One limitation of a GPU is that each block has a very limited amount of memory, so it is tedious to store a whole stack of nodes in that memory. Another limitation is that a thread in one block cannot communicate with a thread in another block. CUDA also limits some kinds of scientific calculation, since a GPU is mainly designed to enhance graphics. In the future, greater use of GPUs will make performance much better than that of single core CPUs.

Works Cited

  • Harish, Pawan, and P. J. Narayanan. "Accelerating Large Graph Algorithms on the GPU Using CUDA." In High Performance Computing, vol. 4873, 197-208. Springer, 2007.
  • Kirk, David, and NVIDIA Corporation. NVIDIA CUDA Programming Guide 2.3. 2008.
  • Mehlhorn, Kurt, Stefan Naher, and Peter Sanders. Engineering DFS-Based Graph Algorithms. October 1, 2007.
  • Freeman, John. Parallel Algorithms for Depth-First Search. University of Pennsylvania.
  • Varman, Peter, and Kshitij Doshi. Improved Parallel Algorithms for the Depth-First Search and Monotone Circuit Value Problems.
  • Rao, V. Nageshwar, and Vipin Kumar. Parallel Depth First Search. Department of Computer Sciences, University of Texas at Austin.
  • Hussein, Mohamed, Amitabh Varshney, and Larry Davis. On Implementing Graph Cuts on CUDA. Department of Computer Science, University of Maryland.
  • Wang, J. S., B. F. Wang, and C. H. Peng. "Recognizing Depth-First-Search Trees in Parallel." In 9th International Parallel Processing Symposium.