The aim of this paper is to explore the use of the GPU for image processing applications, focusing mainly on processes offered by ArcGIS Image Server. The costliest stage of these processes is the arithmetic and floating-point work performed at the pixel or raster-block level. In this project, we propose using the GPU to parallelize the computations in each algorithmic pass over an image and so achieve real-time image processing. By transforming the image processing algorithm into stages of a modified rendering pipeline, we hope to achieve performance several times that of the same algorithm implemented solely on a high-end CPU.
Keywords: Image Processing, Graphics Processors (GPUs), CUDA
Real-time processing of imagery data will be very important in the near future. Surveys of disaster scenarios, mass events, or even military setups require real-time image processing systems. The current processing framework cannot handle input and generate output on the fly. This paper shows how common processing algorithms can be modified for GPU implementation and compares their running times on the GPU and on a CPU. While developing such algorithms, we ensure that one algorithmic pass has no dependencies on another, so that the passes can be computed simultaneously.
As such, much of this computation can be done in parallel, as there is a large degree of independence within the streams of data. The first sections of the paper discuss the reasons for switching to GPU computing. The subsequent sections discuss Nvidia's CUDA, an architecture for parallel computing. The final sections present the performance studies and results of GPU image processing.
2. CPU vs GPU
Image sampling processes involve many floating-point operations and make intensive use of memory bandwidth.
The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation, which is exactly what graphics rendering is about, and is therefore designed such that more transistors are devoted to data processing rather than data caching and flow control.
More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations with high arithmetic intensity. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control; and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.
Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 2D and 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image applications such as post-processing of rendered images, encoding and decoding, image scaling, and pattern recognition can map image blocks and pixels to parallel processing threads.
Recently, GPU clusters have become mainstream, allowing users to scale computational power by running several GPUs in tandem.
3. CUDA: a General-Purpose Parallel Computing Architecture
The CUDA architecture and programming model expose more of the flexibility of GPU hardware. The Compute Unified Device Architecture (CUDA) allows programmers to write C programs that no longer require knowledge of, or dependence on, the graphics pipeline. Programmers can now concentrate on parallelism without worrying about low-level details. CUDA is supported on Nvidia GeForce 8 Series and later cards.
CUDA brings several significant upgrades that allow the full power of the GPU to be harnessed. Unlike traditional GPU APIs, CUDA exposes a fast shared memory region that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture look-ups. Perhaps the most important feature of CUDA is its automatic thread manager, which takes care of thread handling. Application programmers therefore do not need to write threaded code explicitly, which also removes many of the opportunities for deadlock. Scalability is another issue that CUDA addresses: the hardware is free to schedule thread blocks on any processor.
The architecture of a GPU is very different from that of a CPU. A CPU relies on high clock frequencies for performance, whereas a GPU relies on its massive number of cores. Although these cores may be slower than CPU cores, their parallel processing capabilities give them an edge.
8 cores, a register set and shared cache memory.
CUDA uses on-board device memory for all computation. Data must first be transferred from the host (CPU) over the host-device memory bus to device memory before any operations can be performed on it.
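The transfer step described above can be sketched with the CUDA runtime API. This is a minimal illustration rather than the code used in this study: the helper names `ok` and `roundTrip` are hypothetical, and the buffer is simply copied to the device and straight back so that only the memory traffic is shown.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical helper: report a failed CUDA call and signal failure to the caller.
static bool ok(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical helper: copies a host buffer into device memory and back again.
// A real process would launch its kernels between the two copies.
bool roundTrip(const std::vector<float>& in, std::vector<float>& out)
{
    const size_t bytes = in.size() * sizeof(float);
    float* d_buf = nullptr;
    out.assign(in.size(), 0.0f);

    // Allocate device memory; nothing can run on the GPU until the data is there.
    if (!ok(cudaMalloc((void**)&d_buf, bytes), "cudaMalloc")) return false;

    // Host -> device, then device -> host, over the host-device bus.
    bool good =
        ok(cudaMemcpy(d_buf, in.data(), bytes, cudaMemcpyHostToDevice), "host-to-device copy") &&
        ok(cudaMemcpy(out.data(), d_buf, bytes, cudaMemcpyDeviceToHost), "device-to-host copy");

    cudaFree(d_buf);
    return good;
}
```

Both copies cross the host-device bus, so in practice it pays to move a large buffer once and do as much work on it as possible before copying the result back.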
Apart from the standard DRAM, the device also has cache memory exclusive to each streaming processor, which can be used for fast, register-level access.
5. IMAGE PROCESSING USING CUDA
Image processing involves analyzing 2D arrays of color values (1D or 3D). Most image processing algorithms are inherently parallel and involve a great deal of calculation at the pixel level. Image processing is also memory intensive and involves many redundant memory look-ups. Such a scenario maps well to GPUs.
Raster level ArcGIS Image Server processes can be either Radiometric or Geometric. The former involves changing pixel values but not the number of pixels or where they are placed; for example, Convolution Filter or Stretching. Geometric processes refer to the process of placing pixels in their correct positions on the ground; for example, Warp.
Significant performance improvements can be achieved using the GPU as a co-processor.
Most processes take image input row-wise. The usual CPU code buffers the row pixels and then modifies them one at a time. A row can be expected to have more than 4000 pixels. With CUDA, all the pixels of a row can be worked on simultaneously. The thread manager initializes a matrix of threads, each of which represents one algorithmic pass. If needed, the threads can be synchronized explicitly. The gain in efficiency can be substantial: although the CPU may have a higher clock frequency than the GPU, the parallel paradigm more than compensates for the lower clock, and a significant performance increase is observed.
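As an illustration of the row-wise scheme, the sketch below applies a linear stretch to one buffered row with one thread per pixel. It is a minimal example under stated assumptions, not the Image Server implementation: the names `stretchRowKernel` and `stretchRow`, the 256-thread block size, and the choice of a linear stretch from [lo, hi] to [0, 255] are all illustrative.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread stretches one pixel of the row from [lo, hi] to [0, 255].
__global__ void stretchRowKernel(const unsigned char* in, unsigned char* out,
                                 int n, float lo, float hi)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                        // the last block may overhang the row
    float v = (in[i] - lo) * 255.0f / (hi - lo);
    v = fminf(fmaxf(v, 0.0f), 255.0f);         // clamp to the 8-bit range
    out[i] = (unsigned char)(v + 0.5f);        // round to nearest
}

// Hypothetical host helper: stretches one buffered row. Requires hi > lo.
bool stretchRow(const std::vector<unsigned char>& row,
                std::vector<unsigned char>& result, float lo, float hi)
{
    const int n = (int)row.size();
    result.assign(n, 0);
    if (n == 0 || hi <= lo) return false;

    unsigned char* d_in = nullptr;
    unsigned char* d_out = nullptr;
    cudaError_t err = cudaMalloc((void**)&d_in, n);
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_out, n);
    if (err == cudaSuccess) err = cudaMemcpy(d_in, row.data(), n, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 256;                          // assumed block size
        const int blocks = (n + threads - 1) / threads;   // enough blocks to cover the row
        stretchRowKernel<<<blocks, threads>>>(d_in, d_out, n, lo, hi);
        err = cudaGetLastError();                         // catches launch failures
    }
    // This copy also waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(result.data(), d_out, n, cudaMemcpyDeviceToHost);

    cudaFree(d_in);     // cudaFree on a null pointer is a no-op
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

For a 4000-pixel row this launches 16 blocks of 256 threads; the bounds check in the kernel discards the 96 surplus threads in the final block.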
The algorithm can be further optimized by using texture or shared memory instead of global memory. CUDA facilitates sharing of data between threads within a thread block. If the process involves a pixel and its neighbors, then that entire block can be read into the shared memory cache, thus avoiding redundant fetches from global memory and increasing performance.
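The shared-memory optimization can be sketched for the simplest neighborhood process, a three-pixel average along a row. This is an assumption-laden illustration rather than a process taken from Image Server: the names `boxFilterRowKernel` and `boxFilterRow`, the tile size of 256, and the clamp-to-edge border handling are choices made for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 256   // threads per block and pixels per tile (assumed value)

// Averages each pixel with its left and right neighbours. The block first
// stages its tile plus a one-pixel halo on each side in shared memory, so
// each global value is fetched once per block rather than once per use.
__global__ void boxFilterRowKernel(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2];
    int g = blockIdx.x * TILE + threadIdx.x;   // global pixel index
    int s = threadIdx.x + 1;                   // index inside the tile (slot 0 is the left halo)

    // Reads are clamped to the row, so border pixels reuse the edge value.
    tile[s] = in[min(g, n - 1)];
    if (threadIdx.x == 0) {
        tile[0]        = in[max(g - 1, 0)];          // left halo
        tile[TILE + 1] = in[min(g + TILE, n - 1)];   // right halo
    }
    __syncthreads();   // the tile must be complete before any thread reads it

    if (g < n)
        out[g] = (tile[s - 1] + tile[s] + tile[s + 1]) / 3.0f;
}

// Hypothetical host helper: filters one row and returns false on any CUDA error.
bool boxFilterRow(const std::vector<float>& row, std::vector<float>& result)
{
    const int n = (int)row.size();
    result.assign(n, 0.0f);
    if (n == 0) return false;
    const size_t bytes = n * sizeof(float);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaError_t err = cudaMalloc((void**)&d_in, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_out, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_in, row.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int blocks = (n + TILE - 1) / TILE;
        boxFilterRowKernel<<<blocks, TILE>>>(d_in, d_out, n);   // block size must equal TILE
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

Without the shared tile, every thread would issue three global reads; with it, a block of 256 threads issues 258. The saving grows with the size of the neighborhood, which is why the technique matters more for wider convolution filters.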
5.1 SULTAN PROCESS
The Sultan process is one of many raster processes that involve pixel-level manipulation. It takes a six-band 8-bit image and uses Sultan's formula to produce a three-band 8-bit image. The result is a classified image that shows rock formations called ophiolites on coastlines. Applying Sultan's formula to each pixel involves considerable overhead. The process can be made efficient if ported to the GPU using the CUDA API. This involves buffering the pixel values from the image, transferring them to device memory, and then carrying out the calculations on all the values in the buffer simultaneously using an array of threads. The memory copy does incur read and write overheads, but these become negligible when the volume of calculation is large.
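The port described above might look like the following sketch. Several details are assumptions, because this paper does not state them: the band ratios used here (output 1 = B5/B6 x 100, output 2 = B5/B1 x 100, output 3 = (B3/B4) x (B5/B4) x 100) are the form commonly quoted for the Sultan function and should be checked against the Image Server documentation; the bands are assumed to be stored as consecutive planes; a zero denominator is assumed to produce 0; and the names `toByte`, `sultanKernel`, and `sultan` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Scale a ratio by 100 and clamp it into the 8-bit range.
__device__ unsigned char toByte(float ratio)
{
    float v = fminf(fmaxf(ratio * 100.0f, 0.0f), 255.0f);
    return (unsigned char)(v + 0.5f);
}

// One thread per pixel. Input holds six planes of n pixels each, output holds
// three planes (ASSUMED layout). b1..b6 are the 1-based band names.
__global__ void sultanKernel(const unsigned char* in, unsigned char* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float b1 = in[0 * n + i], b3 = in[2 * n + i], b4 = in[3 * n + i];
    float b5 = in[4 * n + i], b6 = in[5 * n + i];

    // ASSUMED band ratios; a zero denominator yields 0 (also an assumption).
    out[0 * n + i] = (b6 > 0.0f) ? toByte(b5 / b6) : 0;
    out[1 * n + i] = (b1 > 0.0f) ? toByte(b5 / b1) : 0;
    out[2 * n + i] = (b4 > 0.0f) ? toByte((b3 / b4) * (b5 / b4)) : 0;
}

// Hypothetical host helper: bands must contain 6 * n values; rgb receives 3 * n.
bool sultan(const std::vector<unsigned char>& bands, int n, std::vector<unsigned char>& rgb)
{
    rgb.assign(n > 0 ? 3 * (size_t)n : 0, 0);
    if (n <= 0 || bands.size() != 6 * (size_t)n) return false;

    unsigned char* d_in = nullptr;
    unsigned char* d_out = nullptr;
    cudaError_t err = cudaMalloc((void**)&d_in, bands.size());
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_out, rgb.size());
    if (err == cudaSuccess)
        err = cudaMemcpy(d_in, bands.data(), bands.size(), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 256;
        sultanKernel<<<(n + threads - 1) / threads, threads>>>(d_in, d_out, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(rgb.data(), d_out, rgb.size(), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

Each pixel costs six byte reads, three byte writes, and a handful of divisions, so the two memory copies are only amortized when the buffer passed to `sultan` is large.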
The results of this study demonstrate that certain classes of image processing algorithms can benefit from implementation on the GPU. The process of transforming the problem into a modified rendering pipeline has been established, and the results are promising. We expect that developments in hardware will address current limitations in GPU performance.