String Algorithm on GPGPU



General-purpose computing on graphics processing units (GPGPU) is the use of the GPU to handle computations in applications that are traditionally performed by the central processing unit (CPU). It involves using the GPU alongside the CPU with the aim of accelerating general-purpose scientific and engineering applications. A CPU consists of a few cores optimized for serial processing, whereas a GPU contains hundreds of smaller, highly efficient cores designed for parallel performance. Hence, the compute-intensive portions of an application that can be done in parallel are handled by the GPU, while the remaining serial portion of the code is executed by the CPU. This leads to applications that run considerably faster.

This project focuses on using GPGPU to develop a string-manipulating algorithm. The result is then compared with serial programming on the CPU to verify its advantage over a CPU-only implementation.

Current Trend

Computing technology began with single-core processors, which could perform only one operation at a time. Instruction pipelining was then introduced to reduce the processor's cycle time and increase instruction throughput. Starting in 2005, CPU manufacturers began offering processors with more than one core. In recent years, 3-, 4-, 6- and 8-core CPUs have been developed. Leading CPU manufacturers have also announced plans for CPUs with even more cores, showing that parallel computing is the current trend.

The first GPUs served to support line drawing and area filling. The blitter, a type of stream processor, accelerated the movement, manipulation and combination of multiple arbitrary bitmaps. GPUs were also equipped with a coprocessor that had its own instruction set, allowing graphics operations to be invoked without CPU intervention. In the 1990s, GPUs began supporting 2D GUI acceleration, and various application programming interfaces (APIs) were created. By the mid-1990s, GPUs had begun supporting 3D graphics. NVIDIA became the first GPU company to produce a chip capable of programmable shading, which allows each pixel and each geometric vertex to be processed by a program before it is projected onto the screen. Continuous development in the early 2000s gave pixel and vertex shaders the ability to implement looping and lengthy floating-point math, and the flexibility of the GPU became increasingly comparable to that of the CPU. In recent years, GPUs have been equipped with generic stream processing units, allowing them to evolve into more generalized computing devices. GPGPU has since been applied in applications that require high arithmetic throughput, and has shown better performance than conventional CPUs.


Problem Statement

In the case of string search, large datasets may be processed using the computer clusters in supercomputers. A computer cluster consists of several loosely coupled computers that work together. However, using supercomputers to process datasets is costly. Hence, an alternative is needed so that large datasets can be processed efficiently yet cost-effectively.

The rise of GPU computing offers such an alternative. Due to the high arithmetic throughput of GPUs, applications running on them are less computationally expensive, aside from being more cost-effective than traditional CPU-based processing.

Motivation and Proposed Solution

This project is motivated by the need to increase efficiency in processing large datasets, specifically strings. In recent years, the computing industry has been concentrating on parallel computing. With the introduction of NVIDIA's CUDA architecture for GPUs in 2006, GPUs that excel at general computation became available. The GPU offers high throughput with little overhead between threads. By performing uniform operations on independent data, massive data parallelism can be achieved in an application. This provides an opportunity for data processing to be completed in less time. In addition, a GPU costs less and consumes less power than a conventional system of similar performance, and GPUs are available in almost every computer nowadays.


The research in this thesis uses the Bloom filter as the string-searching algorithm. The algorithm is implemented using GPGPU programming, whereby the parts of the code that perform uniform operations on independent data run on the GPU. The performance of a CPU-only implementation is compared with that of the combined CPU and GPU implementation. The goal of this project is to successfully implement the aforementioned string algorithm using GPGPU programming.

Scope of Research

Generally, this project consists of three parts:

The understanding of NVIDIA's CUDA architecture on GPU and the programming language used (CUDA C)

The current implementation of Bloom Filter on CPU

The implementation of Bloom Filter on CPU and GPU

Report Structure

The remainder of this report is organized as follows:

Chapter 2 presents background information on the CUDA architecture, the CUDA C programming language, applications of CUDA, common searching algorithms, and the Bloom filter, the algorithm chosen for this project.

Chapter 3 describes the implementation: the setup of the development environment, the overall program flow, the chosen hash algorithm, and the generation and checking of the Bloom filter bit table.

The last chapter presents the overall conclusion of the report, leading to some suggestions for future work.



This chapter presents background information on the architecture of the Graphics Processing Unit used, the language used in programming as well as the algorithm for string searching. The CUDA architecture is explained in Section 2.2. Section 2.3 describes the CUDA C language which was introduced by NVIDIA to facilitate general-purpose programming on GPUs. Examples of applications of CUDA are presented in Section 2.4. Different searching algorithms are discussed in Section 2.5 while Section 2.6 describes Bloom Filter, the chosen algorithm in this project.


CUDA, the abbreviation for Compute Unified Device Architecture, is a parallel programming paradigm that was released in November 2006 by NVIDIA. Apart from being used to develop software for graphics processing, this architecture alleviates many of the limitations that previously prevented general-purpose computation from being performed legitimately by graphics processor.

Architecture of CUDA-capable GPU

Previous generations of GPUs partitioned computing resources into vertex and pixel shaders, whereas the CUDA architecture includes a unified shader pipeline. This enables each and every ALU on the chip to be marshalled by a program intending to perform general-purpose computations. The ALUs were also built to comply with IEEE requirements for single-precision floating-point arithmetic, and their instruction set is tailored for general computation rather than specifically for graphics. Apart from having arbitrary read and write access to memory, the execution units on the GPU have access to shared memory, a software-managed cache. These features allow the GPU to excel at general-purpose computation in addition to performing its traditional graphical tasks.

Figure 2.1 Architecture of a CUDA-capable GPU

Figure 2.1 shows the architecture of a CUDA-capable GPU. The GPU is organized into an array of highly threaded streaming multiprocessors (SMs). In Figure 2.1, two SMs form a building block, although different generations of CUDA GPUs may have a different number of SMs per building block. Each SM has a number of streaming processors (SPs) that share control logic and an instruction cache. The graphics double data rate (GDDR) DRAM, referred to as global memory in Figure 2.1, differs from the DRAM on a CPU motherboard. For computing applications, it functions as very-high-bandwidth, off-chip memory, though with more latency than typical system memory. If the application is massively parallel, the higher bandwidth makes up for the longer latency.

CUDA Programming Model

Figure 2.2 Heterogeneous architecture

The CPU, known as the host, executes functions (the serial code shown in Figure 2.2). The parallel portion of the application is executed on the device (GPU) as a kernel. The kernel is executed as an array of threads, in parallel. All threads execute the same code, but may take different paths. Each thread can be identified by its ID, which is used to select input or output data and to make control decisions. Threads are grouped into blocks, which in turn are grouped into a grid; hence, the kernel is executed as a grid of blocks of threads. The grid and each block may be one-, two- or three-dimensional. The architecture is illustrated in Figure 2.3.

Figure 2.3 CUDA compute grid, block and thread architecture

Threads within a block may need to cooperate for memory accesses and result sharing. Shared memory is accessible only by threads within the same block. This "within a block" restriction permits scalability: when the number of threads is large, fast communication between all threads is not feasible. Each block, however, is executed independently, and blocks may be distributed across an arbitrary number of multiprocessors.

CUDA Memory Model

The host and device memories are separate entities, and there is no coherence between them. Hence, in order to execute a kernel on a device, manual data synchronization is required. This is done by allocating memory on the device and then transferring the relevant data from host memory to the allocated device memory. After device execution, the resultant data must be transferred back from device memory to host memory.

Figure 2.4 CUDA device memory model

As shown in Figure 2.4, the global memory is the memory through which the host code can transfer data to and from the device. This memory is relatively slow as it is not cached. Constant memory allows read-only access by device code. API functions help programmers manage data in these memories. Each thread has its own set of registers and local memory; local variables of kernel functions are usually allocated in local memory. Shared memory is shared among threads in the same block, whereas global memory is shared by all blocks. Shared memory, also known as the parallel data cache (PDC), is fast compared to global memory. All these memories (excluding constant memory) may be both read and written.

CUDA Execution Model

As mentioned in Section 2.2.2, the main (serial) program runs on the host while certain code regions run on the device. The execution configuration is written as <<< blocksPerGrid, threadsPerBlock >>>. The values of these two variables must not exceed the maximum sizes allowed by the device.

After a block has been assigned to a streaming multiprocessor (SM), it will be further divided into 32-thread units known as warps. All threads of a warp are scheduled together for execution, in other words, they share the same program counter. When an instruction executed by the threads in a warp requires the result from a previous long-latency operation, the warp will not be selected for execution. Another resident warp which has its operands ready will then be selected for execution. This mechanism is known as latency hiding. If there is more than one warp ready for execution, one warp will be selected for execution based on the priority mechanism. Warp scheduling hides the long waiting time of warp instructions by executing instructions from other warps. As a result, the overall execution throughput will not be slowed down due to long-latency operations.

In summary of the execution model: each thread is executed by a core, each block is executed on a multiprocessor, several concurrent blocks may reside on one multiprocessor, and each kernel is executed on a device.


Prior to the introduction of CUDA C, OpenGL and DirectX were the only means to interact with a GPU. Data had to be stored in graphics textures and computations had to be written in shading languages which are special graphics-only programming languages. In 2007, NVIDIA added a relatively small number of keywords into industry-standard C for the sake of harnessing some of the special features of the CUDA Architecture, and introduced a compiler for this language, CUDA C. This has also made CUDA C the first language specifically designed by a GPU company to facilitate general-purpose computing on GPUs. A specialized hardware driver is also provided so that CUDA Architecture's massive computational power may be exploited. With all these improvements by NVIDIA, programmers no longer need to disguise their computations as graphic problems, nor do they require knowledge of OpenGL and DirectX.

Device Function

The number of blocks per grid and the number of threads per block must be specified when calling a kernel function. Kernel functions have return type void and carry the qualifier __global__ to indicate that the function is called from host code but executed on the device. Device functions are handled by NVIDIA's compiler, whereas host functions are handled by the standard host compiler. The input arguments of a kernel function must point to device memory.

Basic Device Memory Management

To allocate device memory: cudaMalloc()

Example:

float *dev_a;
cudaMalloc( (void**)&dev_a, sizeof(float) );

To copy from host memory to device memory and vice versa: cudaMemcpy()

Example (host to device):

cudaMemcpy( dev_a, &host_a, sizeof(float), cudaMemcpyHostToDevice );

Example (device to host):

cudaMemcpy( &host_a, dev_a, sizeof(float), cudaMemcpyDeviceToHost );

To free device memory: cudaFree()

Example:

cudaFree( dev_a );

Launching Parallel Kernels

Launching with parallel blocks

Example: kernel<<<N, 1>>>(dev_a, dev_b);

kernel() will be executed N times in parallel; N copies of kernel() are launched.

blockIdx.x can be used to index the arrays, whereby each block handles different indices.

Launching with parallel threads within a block

Example: kernel<<<1, N>>>(dev_a, dev_b);

kernel() will be executed N times in parallel; N copies of kernel() are launched.

threadIdx.x is used to index the arrays.

Launching with parallel threads and blocks

Example: kernel<<<M, N>>>(dev_a, dev_b);

M*N copies of kernel() are launched.

The unique array index for each entry is given by:

index = threadIdx.x + blockIdx.x * blockDim.x

blockDim.x refers to the number of threads per block.

Shared Memory

Parallel threads have a mechanism to communicate. Threads within a block share an extremely fast, on-chip memory known as shared memory. Shared memory is declared with the keyword __shared__. Data is shared among threads in a block. It is not visible to threads in other blocks running in parallel.

Thread Synchronization

Parallel threads also have a mechanism for synchronization. Synchronization is needed to prevent data hazards such as read-before-write hazards. The function used to synchronize threads is __syncthreads(). Threads in a block wait until all threads have reached the __syncthreads() instruction before proceeding to the next instruction. An important point to note is that threads are only synchronized within a block.

Applications of CUDA

Various industries and applications have enjoyed a great deal of success by building applications in CUDA C, with performance improvements often of several orders of magnitude. The following are a few ways in which the CUDA architecture and CUDA C have been put to successful use.

Oil Resource Exploration

Oil prices in the global market continue to rise steeply, hence the urgent need to find new sources of energy so that prices can be moderated. However, exploring and drilling deep wells is very costly, and can reach hundreds of millions of dollars; in many cases, there is only one chance to succeed in drilling. To minimize the risks, 3D seismic imaging technology is now being used for oil exploration. SeismicCity, in Houston, Texas, is one of the companies providing such technology, which allows for a detailed look at potential drilling sites. However, imaging complex geological areas often involves terabytes of complex data.

With the introduction of faster and more accurate codes by SeismicCity, the algorithms brought a 10-fold increase in computational intensity. Such algorithms would require a very large CPU setup, which is impractical. Running CUDA on an NVIDIA Tesla server system solves this problem. Oil majors such as Schlumberger, Petrobras and Chevron are also using Tesla GPUs for seismic processing. The massively parallel computing architecture offers a significant performance increase over a CPU configuration. With NVIDIA's GPU computing technology, SeismicCity is now able to move to the next generation of algorithms faster and more economically.

With the dramatic increase in speed and accuracy of the depth imaging technology by SeismicCity, selection of new drilling locations is now much faster than before. Drillers are now able to proceed confidently in areas that were previously obscured by unclear seismic data.

Design Industry

Conventionally, fashion designers had to physically produce cloth samples of the actual clothing line for prototyping and to show to potential investors. This consumes a lot of time as well as cost, causing wastage. Israel-based OptiTex Ltd introduced 3-D technology which modernizes this process by allowing designers to simulate the look and movement of clothing designs on virtual models. This allows them to review, refine and measure the fabric before physically producing the samples.

By using GPU for computation, the developers managed to obtain a 10-fold performance increase in addition to removing bottlenecks that occurred in the CPU environment. The typical product development time to market takes 190 days, but with the reconstructed OptiTex 3D solutions (using CUDA) it takes only 35 days.

With the promising performance of the GPU computing solution, production costs and design cycle time have been greatly reduced. Users in various industries, such as apparel and automotive, enjoy reduced manufacturing expenses by virtually eliminating wasted fabric, power usage and manpower.

Weather and Climate

Weather is a prime factor affecting our daily lives, and even survival. Common weather phenomena are rain, wind, snow and fog. Natural disasters like tornadoes, typhoons and hurricanes may be less common, but their impact on human lives is undeniably huge. Early prediction of weather events is necessary so that people have more time to prepare for what is bound to come.

At the National Center for Atmospheric Research (NCAR) in the United States, a team of scientists has developed sophisticated forecasting models for immediate, short-term and long-term weather conditions. The most widely used model is the Weather Research and Forecasting Model (WRF). It is utilized by the National Weather Service (US), the Air Force Weather Agency (US), foreign weather services and commercial weather forecasting companies. There has been an exponential increase in the processor power available to these models over the past 50 years. Despite the large-scale parallelism in these weather models, the increase in performance has come from processor speed rather than from increased parallelism. With the development of NCAR's climate and weather models from Terascale to Petascale class applications, conventional computer clusters and the addition of more CPUs are no longer effective for speed improvement. This problem is especially severe in applications that involve a real-time component or other time-to-solution constraints.

To improve the overall forecasting speed and accuracy of the models, the NCAR technical staff collaborated with researchers at the University of Colorado at Boulder and adopted NVIDIA GPU computing solutions. A 10-fold speed improvement was observed for the Microphysics component after porting it to CUDA. Microphysics makes up less than 1 percent of the model source code, but because it is a computationally intensive component of WRF, its conversion to CUDA resulted in a 20 percent speed improvement for the overall model.

With the continuous improvement of WRF, the many agencies that rely on it are now able to produce the much-needed forecasts, and people are better able to prepare for these weather events.

Searching Algorithms

A search algorithm is used to find an element with specified properties among a collection of elements. Computer systems usually store large amounts of data, from which individual records are retrieved or located based on a certain search criterion. A search may also be used solely to check for the existence of the desired element.

Linear Sequential Search

This algorithm is very simple: the search starts from the beginning of the array and walks to the end, checking each element for a match. If a match is found, the search stops; otherwise, the search continues until a match is found or the end of the array is reached.

Apart from arrays, this algorithm can be implemented using linked lists. The "next" pointer points to the next element in the list, and the algorithm follows this pointer until the query is found, or until a NULL pointer indicates that the query is not in the list. The advantage of a linked list over an array is that the former allows easy addition of entries and removal of existing entries.

This algorithm is useful when the dataset is limited in size, as it is straightforward. It also proves useful when the data to be searched changes regularly. Its weakness is that it consumes too much time as the dataset grows, because it is an O(n) algorithm.

Figure 2.5 Linear sequential search for "Amy"

Binary Search

To utilize the binary search algorithm, the data must already be sorted. To begin, the position of the middle element is calculated, and this element is compared with the query. If the query is smaller in value, the search continues on the elements from the beginning to the middle; otherwise, it proceeds on the elements after the middle to the end. The time complexity of binary search is O(log n).

Figure 2.6 Binary search for "Dan"

When the data is not yet sorted, a binary search tree comes in handy. A binary search tree is either empty, or every node contains a key satisfying two conditions: the key in the left child of a node is less than the key in its parent, and the key in the right child is greater than the key in its parent. To search for an element, the data at the root node is first inspected for a match. If it does not match, the branches of the tree are searched recursively: the left subtree if the query is smaller, the right subtree if it is greater. A null tree indicates that the query is not found. If the data is inserted in already-sorted order, the binary tree becomes unbalanced and degenerates into a linked list, making the algorithm inefficient. Hence, this algorithm is only efficient if we have prior knowledge of the dataset.

Hash Tables

A hash table is used to implement an associative array, a structure that maps keys to values. A hash function converts a key into a number small enough to be practical as a table index. To build a simple hash table, each key in the database is passed to the hash function; the value returned becomes the hash table index at which the position of the key in the database is stored. To search for a query, the query is passed to the hash function to obtain the hash table index, and the element stored at that index gives the location of the key in the database. The required information can then be retrieved from the database entry. A collision occurs when more than one key produces the same index. Collisions can be resolved by replacement, open addressing or chaining.

A hash table proves to be a good choice when the keys are sparsely distributed, or when the primary concern is the speed of access to the data. However, sufficient memory is needed to store the hash table, which limits the dataset size, since a larger hash table is needed to cater for a larger dataset.

Figure 2.7 Using a hash table to search for Dan's data

Bloom Filter

The Bloom filter was introduced in 1970 by Burton Howard Bloom. It is a memory-efficient data structure that rapidly indicates whether an element is present in a set. False positives are possible, but false negatives are not. Elements can be added to the set, but not removed. The probability of false positives increases as more elements are added to the set.

Algorithm Description

A Bloom filter is implemented as an array of m bits, all initially set to 0. k different hash functions are defined; each maps a set element to one of the m array positions. The hash functions must give uniformly random results, so that the array is filled up uniformly and randomly.

The Bloom filter supports two basic operations: add and query. To add an element to the filter, the element is passed to each of the k hash functions, generating k array indices, and the corresponding bits are set to 1. To test whether an element is in the set, in other words to query for it, the element is again passed to each of the hash functions to obtain array indices. If any of the bits at these positions is 0, the element is definitely not in the set. If all of the bits at these positions are 1, then either the element is indeed in the set, or the bits were set to 1 by chance when other elements were inserted, causing a false positive.

Removal of an element from the Bloom filter is impossible: if the bits corresponding to the element to be removed were reset to 0, there is no guarantee that those bits do not carry information for other keys. If one of those bits were reset, another key that happens to map to the same bit would give a negative result when queried. Hence, the only way to remove a key from a Bloom filter is to reconstruct the filter without the unwanted key.

False Positive Probability

Assuming the hash functions are independent and uniformly distributed, each array position is selected with equal probability.

Let k represent the number of hash functions,

m the number of bits in the array,

n the number of inserted elements, and

fpp the probability of a false positive.

The probability that a certain bit is set to 1 by one of the hash functions during element insertion is 1/m. Hence:

Probability of a bit not being set to 1 by a certain hash function: 1 - 1/m (Equation 2.1)

Probability of a bit not being set to 1 by any of the k hash functions: (1 - 1/m)^k (Equation 2.2)

Probability of a bit not being set to 1 after inserting n elements: (1 - 1/m)^(kn) (Equation 2.3)

Probability of a bit being set to 1 after inserting n elements: 1 - (1 - 1/m)^(kn) (Equation 2.4)

A false positive occurs when all k bits corresponding to a query happen to be set to 1, so that the algorithm erroneously claims that the element is in the set. The probability of all k corresponding bits being set to 1 is

fpp = (1 - (1 - 1/m)^(kn))^k (Equation 2.5)

From Equation 2.5, it can be concluded that the probability of false positives increases when m decreases or when n increases.

Figure 2.8 shows an empty Bloom filter (all bits set to 0).

Figure 2.8 An empty Bloom filter

Three hash functions are performed on the first element.

The corresponding bits (bits 2, 10 and 11) in the Bloom filter are set to 1 (coloured)

Figure 2.9 The Bloom filter after adding the first element

The hash functions are performed on a second element.

The corresponding bits (bits 0, 7 and 10) are then set to 1.

One of the bit positions (bit 10) is coincidentally the same as for the first element.

Figure 2.10 The Bloom filter after adding the second element

Hash functions are performed on a query (which matches neither the first nor the second element).

The bits at positions 2, 7 and 11 are checked. Since the bits at these locations are set, the filter indicates that the query is present in the dataset. However, the query is actually not part of the dataset, as it differs from the first and second elements. Hence, a false positive has occurred.

Figure 2.11 False positive in a Bloom filter


The Bloom filter shows a significant advantage in space saving compared with algorithms based on arrays, linked lists, binary search trees or hash tables. Most of these structures store the data items themselves, so the number of bits required varies with the type of data stored; linked lists additionally need overhead for pointers. A Bloom filter, on the other hand, requires a fixed number of bits per element regardless of the size of the elements. However, this space saving comes at a price: the possibility of false positives and the inability to remove elements once they are inserted. These constraints are usually acceptable, making the Bloom filter a preferred option when the dataset is large and space is an issue.

A unique property of the Bloom filter is that the time taken to add an element, or to check whether an element is in the set, is a fixed constant, O(k), regardless of the number of elements already inserted. The Bloom filter also has an advantage over hash tables: the k lookups are independent of each other and can therefore be parallelized.

The Bloom filter also proves handy when privacy of data is preferred. In peer-to-peer networks, Bloom filters can be used for bandwidth-efficient communication of datasets: network users can share sets of content while having their privacy protected, because a Bloom filter can only be read meaningfully by users holding similar content.

Existing CUDA String Algorithm

The simplest algorithm for string search is the brute-force algorithm. It tries to match the pattern (query) against the text (input word list) by scanning the text from left to right. In the sequential form, a single thread conducts the search and prints to the console when a match is found. In the CUDA version, N threads conduct the same search; the difference is that each of the N threads attempts to find a match in parallel, and when a thread finds one, the data structure storing the found indices is updated. The drawback of the brute-force algorithm is its lack of speed: even in the CUDA implementation, the text must be scanned from beginning to end. Hence, this algorithm is more suitable when simplicity has higher priority than speed.

Another algorithm that has been implemented in CUDA is Quick Search, a variant of the popular Boyer-Moore algorithm. A "bad character shift" table is precomputed for the pattern (query) before searching is performed. The main disadvantage is the preprocessing time and the space required: if the pattern and/or the text is long, more time and space are needed.

The Bloom filter was chosen as the algorithm to implement in this project because it requires a fixed number of bits per element regardless of element size; in other words, less space is required than for the abovementioned algorithms when the input file is large. In addition, the time taken to check for an element is independent of the number of elements in the set, and the k lookups are independent of each other and can therefore be parallelized, providing an opportunity to implement the algorithm in CUDA.



Since this program was implemented on Microsoft Windows, the chosen development software was Microsoft Visual Studio 2010. Before implementing the project with GPGPU, the development environment had to be set up. The CUDA Toolkit, consisting of the NVIDIA GPU compiler, math libraries and debugging tools, was first downloaded from the NVIDIA Developer Zone website and installed. The SDK code samples and the developer drivers were also downloaded and installed, and the necessary settings were applied to the Microsoft Visual Studio project.

The project was implemented as a combination of code running on the CPU and on the GPU. The compute-intensive part that can be parallelized was implemented on the GPU; a serial version of the same part was also included in the program for performance comparison. The flow of the entire program is described in Section 3.2. Section 3.3 describes the hash algorithm that was implemented and the reason for selecting it. Bit-table generation for the Bloom filter is explained in Section 3.4, while the method for checking the bit table is explored in the subsequent section.

Program Flow

Figure 3. Flowchart for the entire program

The program begins by loading the input word file into CPU memory. If the word file is not found and fails to load, the console reports the failure and the program terminates. The word file is then pre-processed before the Bloom filter is instantiated, and each key is inserted into the filter by setting bits according to the hash functions performed. After all keys have been inserted, the input query file is loaded into CPU memory; as with the word file, failure to load it terminates the program. The queries are pre-processed before the filter is checked for them. If a query is in the word file, a message stating its presence is printed on the console. After all queries have been searched and the results printed, the program terminates.

Data Pre-processing

Data transfer from CPU (host) memory to GPU (device) memory incurs overhead proportional to the data size, plus an additional constant overhead each time a transfer is invoked. It is therefore desirable to invoke fewer copy operations. Pre-processing is required so that the data is laid out in a format that reduces the number of independent data transfer operations.

Three operations are typically involved: allocating contiguous space (on the host) for the input strings, preparing and filling index tables that record the starting position and size of each string, and copying the strings into the allocated space.
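The three operations can be sketched in C++ as follows. The structure and names (FlatWordList, flatten) are invented for this illustration; the project's actual layout may differ.

```cpp
#include <string>
#include <vector>

// Flatten a list of words into one contiguous character buffer plus
// index tables (start offset and length per word), so the whole data
// set can be copied host-to-device in a single transfer.
struct FlatWordList {
    std::vector<char>        chars;    // all words stored back to back
    std::vector<std::size_t> offsets;  // start of each word in 'chars'
    std::vector<std::size_t> lengths;  // length of each word
};

FlatWordList flatten(const std::vector<std::string>& words)
{
    FlatWordList flat;
    for (const std::string& w : words) {
        flat.offsets.push_back(flat.chars.size());
        flat.lengths.push_back(w.size());
        flat.chars.insert(flat.chars.end(), w.begin(), w.end());
    }
    return flat;
}
```

A single copy of `chars` (plus the two small index tables) then replaces one transfer per word.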

Figure 3. Pre-processing of input text file

Filter Insertion

Figure 3. Flow for filling up Bloom Filter

In the serial CPU implementation, the iterator is incremented after a single hash function is performed and one bit of the Bloom filter is set to 1. After k hash functions have been performed and the Bloom filter has been set k times, one string element (word) is considered inserted. All string elements are inserted into the Bloom filter in this manner.

In the GPU implementation, each thread performs a hash function on the element, computes the table index and sets the bit table; threads within the same block operate on the same string element. The hash functions and index calculations for all elements are thus performed in parallel. Setting the bit table is performed with atomic operations, because different threads may access the same memory location at the same time. Since atomic operations complete without interference from other threads, race conditions are avoided.
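The role of the atomic OR can be illustrated on the CPU with std::atomic::fetch_or and one std::thread per update. This is a host-side analogue of the CUDA kernel's atomicOr, not the project's kernel itself; all names here are invented for the sketch.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

// Concurrently set bits of a shared bit table.  fetch_or is an
// indivisible read-modify-write, so two threads that target the same
// byte cannot lose each other's update (no race condition).
void set_bits_concurrently(std::vector<std::atomic<std::uint8_t>>& table,
                           const std::vector<std::pair<std::size_t, int>>& work)
{
    std::vector<std::thread> threads;
    for (const auto& w : work) {
        const std::size_t byte_index = w.first;
        const int bit = w.second;
        threads.emplace_back([&table, byte_index, bit] {
            // Atomic OR of one bit into the table byte.
            table[byte_index].fetch_or(static_cast<std::uint8_t>(1u << bit));
        });
    }
    for (std::thread& t : threads) t.join();
}
```

Because OR is commutative and the operation is atomic, the final table contents are the same for every thread interleaving.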

Filter Searching

Figure 3. Flow to check if query is in input list

To check for the existence of a query in the input word list, a process similar to filter insertion is performed. A hash value is computed and the table index is calculated, then the bit at that position is checked. If the bit is not set ("0"), the query is not part of the input word list. If it is set ("1"), the same routine is repeated until all k bits have been checked and verified as set, which indicates that either the query is indeed present in the input word list or a false positive has occurred.

AP Hash Function

The chosen hash algorithm is the AP hash function, a hybrid rotative and additive hash algorithm produced by Arash Partow. The algorithm resembles a simple Linear Feedback Shift Register (LFSR): when the feedback function of an LFSR is chosen appropriately, its output sequence of bits appears random. By using the AP hash algorithm, the calculated hash values (and hence bit-table indices) are effectively random, so collisions can be reduced.

Hash Algorithm

Table 3: AP Hash Algorithm (pseudocode)




Initialize loop to 0 and the iterator to point to the first character of the word

initial hash = one of the elements of the salt array

While word_length >= 8:
    Set m to the 4 characters pointed to by the iterator; advance the iterator
    Set n to the next 4 characters; advance the iterator
    hash ^= (hash << 7) ^ (m * (hash >> 3)) ^ ~((hash << 11) + (n ^ (hash >> 5)))
    Reduce word_length by 8

While word_length >= 4:
    Set p to the 4 characters pointed to by the iterator
    If (loop & 0x01):
        hash ^= (hash << 7) ^ (p * (hash >> 3))
    Else:
        hash ^= ~((hash << 11) + (p ^ (hash >> 5)))
    Increment loop; reduce word_length by 4; advance the iterator

While word_length >= 2:
    Set x to the 2 characters pointed to by the iterator
    If (loop & 0x01):
        hash ^= (hash << 7) ^ (x * (hash >> 3))
    Else:
        hash ^= ~((hash << 11) + (x ^ (hash >> 5)))
    Increment loop; reduce word_length by 2; advance the iterator

If word_length == 1:
    hash += (character at iterator ^ (hash * 0xA5A5A5A5)) + loop

Return hash
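For reference, the block-based scheme in Table 3 can be sketched in C++. This sketch is modeled on the hash_ap() routine of Arash Partow's open-source Bloom filter library; the mixing constants and tail handling follow that library and are assumptions about the project's actual code.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// AP-style block hash: consume the word 8 bytes at a time, then handle
// 4-byte, 2-byte and 1-byte tails, alternating two mixing steps.
std::uint32_t ap_hash(const std::string& key, std::uint32_t hash /* salt */)
{
    const char* itr = key.data();
    std::size_t remaining = key.size();
    unsigned int loop = 0;

    auto load32 = [](const char* p) {
        std::uint32_t v; std::memcpy(&v, p, 4); return v;  // alias-safe load
    };
    auto load16 = [](const char* p) {
        std::uint16_t v; std::memcpy(&v, p, 2); return v;
    };

    while (remaining >= 8) {           // two 4-byte words per iteration
        std::uint32_t m = load32(itr); itr += 4;
        std::uint32_t n = load32(itr); itr += 4;
        hash ^= (hash << 7) ^ (m * (hash >> 3)) ^
                (~((hash << 11) + (n ^ (hash >> 5))));
        remaining -= 8;
    }
    while (remaining >= 4) {           // 4-byte tail
        std::uint32_t p = load32(itr);
        if (loop & 0x01) hash ^= (hash << 7) ^ (p * (hash >> 3));
        else             hash ^= ~((hash << 11) + (p ^ (hash >> 5)));
        ++loop; itr += 4; remaining -= 4;
    }
    while (remaining >= 2) {           // 2-byte tail
        std::uint16_t x = load16(itr);
        if (loop & 0x01) hash ^= (hash << 7) ^ (x * (hash >> 3));
        else             hash ^= ~((hash << 11) + (x ^ (hash >> 5)));
        ++loop; itr += 2; remaining -= 2;
    }
    if (remaining == 1) {              // final single byte
        hash += (static_cast<unsigned char>(*itr) ^ (hash * 0xA5A5A5A5u)) + loop;
    }
    return hash;
}
```

Each hash function of the Bloom filter is obtained by calling this routine with a different salt.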

Bit-table Generation

Bit-table generation refers to inserting the elements (strings) into the Bloom filter by setting certain bits of the bit table to "1".

A bit_mask array of size 8 is used in the program; each element of the array is an 8-bit value with a single distinct bit set. The bit_mask array allows an individual bit within a byte to be set or tested, so that two elements which map to the same byte of the table but to different bit positions do not overwrite each other.

Steps to generate bit-table:

1. Perform a hash function on the element and obtain the returned hash value

2. Calculate the variables bit_index and bit based on the following formula:

bit_index = hash % BF_table_size_

bit = bit_index % bits_per_char

3. The value at index bit_index / bits_per_char in the bit table is OR-ed with one of the elements of bit_mask (selected by the calculated bit value)

4. Steps 1-3 are repeated k number of times for insertion of 1 element.


Originally, bit_table[20] stores 0x00

1. element "above" hashes to byte index 20 with bit = 0;

Hence, bit_table[20] |= bit_mask[0]

0x00 |= 0x01

bit_table[20] now stores 0x01

2. element "beyond" hashes to byte index 20 with bit = 5;

Hence, bit_table[20] |= bit_mask[5]

0x01 |= 0x20

bit_table[20] now stores 0x21
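The insertion steps and the worked example above can be sketched in C++ as follows. The hash here is a deliberate toy stand-in for the project's AP hash, and all names (bloom_insert, toy_hash) are invented for the illustration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// bit_mask[i] has only bit i set, for addressing bits within a byte.
static const std::uint8_t bit_mask[8] = {0x01, 0x02, 0x04, 0x08,
                                         0x10, 0x20, 0x40, 0x80};
static const std::size_t bits_per_char = 8;

// Placeholder hash only; the project uses the AP hash instead.
static std::uint32_t toy_hash(const std::string& key, std::uint32_t salt)
{
    std::uint32_t h = salt;
    for (unsigned char c : key) h = h * 31u + c;
    return h;
}

// Insert one element: each of the k salts yields one bit to set.
void bloom_insert(std::vector<std::uint8_t>& bit_table,
                  const std::string& key,
                  const std::vector<std::uint32_t>& salts)
{
    const std::size_t table_size_bits = bit_table.size() * bits_per_char;
    for (std::uint32_t salt : salts) {                  // k = salts.size()
        std::size_t bit_index = toy_hash(key, salt) % table_size_bits;
        std::size_t bit       = bit_index % bits_per_char;
        bit_table[bit_index / bits_per_char] |= bit_mask[bit];
    }
}
```

The whole word list is inserted by calling bloom_insert once per word.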

Bit-table Checking

Bit-table checking refers to querying the Bloom filter for the elements (strings) by checking whether certain bits of the bit table are set to "1".

The same bit_mask array is used as in the case of bit-table generation.

Steps to check bit-table:

1. Perform a hash function on the query and obtain the returned hash value

2. Calculate the variables bit_index and bit based on the following formula:

bit_index = hash % BF_table_size_

bit = bit_index % bits_per_char

3. The value at index bit_index / bits_per_char in the bit table is AND-ed with one of the elements of bit_mask (selected by the calculated bit value)

4. If the bit_mask element is retrieved (the AND result is non-zero), steps 1-3 are repeated for the next hash function. Otherwise, the query is confirmed to be non-existent in the word-list.

5. If k hash functions have been performed and all respective bits have been checked, then either the query is present in the word-list or a false positive has occurred.



bit_table[20] stores 0x21 after all elements of the word-list have been inserted.

1. query "beyond" hashes to byte index 20 with bit = 5;

Hence, bit_table[20] & bit_mask[5]

0x21 & 0x20 = 0x20

bit_mask[5] is retrieved, so this check passes; the query is reported found once all k such checks pass.

2. query "tiger" hashes to byte index 20 with bit = 7;

Hence, bit_table[20] & bit_mask[7]

0x21 & 0x80 = 0x00

bit_mask[7] is not retrieved, hence the query is not found
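The checking steps can be sketched in C++ as below, paired with a matching insert so the sketch is self-contained. As before, the hash is a toy stand-in for the AP hash and all names are invented for the illustration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

static const std::uint8_t bit_mask[8] = {0x01, 0x02, 0x04, 0x08,
                                         0x10, 0x20, 0x40, 0x80};
static const std::size_t bits_per_char = 8;

// Placeholder hash only; the project uses the AP hash instead.
static std::uint32_t toy_hash(const std::string& key, std::uint32_t salt)
{
    std::uint32_t h = salt;
    for (unsigned char c : key) h = h * 31u + c;
    return h;
}

void bloom_insert(std::vector<std::uint8_t>& table, const std::string& key,
                  const std::vector<std::uint32_t>& salts)
{
    const std::size_t table_bits = table.size() * bits_per_char;
    for (std::uint32_t salt : salts) {
        std::size_t bit_index = toy_hash(key, salt) % table_bits;
        table[bit_index / bits_per_char] |= bit_mask[bit_index % bits_per_char];
    }
}

// Check a query: any clear bit proves the key was never inserted
// (a Bloom filter has no false negatives).
bool bloom_contains(const std::vector<std::uint8_t>& table,
                    const std::string& key,
                    const std::vector<std::uint32_t>& salts)
{
    const std::size_t table_bits = table.size() * bits_per_char;
    for (std::uint32_t salt : salts) {
        std::size_t bit_index = toy_hash(key, salt) % table_bits;
        if ((table[bit_index / bits_per_char] &
             bit_mask[bit_index % bits_per_char]) == 0)
            return false;       // definitely not in the word list
    }
    return true;  // all k bits set: present, or a false positive
}
```

A return value of true is not a guarantee of membership; it is "present or false positive", exactly as in step 5 above.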

Performance Timer


Serial Insertion

Serial Searching

Cuda Insertion

Serial and CUDA comparison


