Mingpu: A Minimum Gpu Library For Computer Vision: Pavel Babenko and Mubarak Shah
Mingpu: A Minimum Gpu Library For Computer Vision: Pavel Babenko and Mubarak Shah
Pavel Babenko and Mubarak Shah University of Central Florida Computer Vision Lab, School of Electrical Engineering & Computer Science, University of Central Florida, Orlando, FL Phone: (407) 823-4733. Fax: (407) 823-0594
E-mail: pavelb@cs.ucf.edu Abstract. In computer vision it is becoming popular to implement algorithms in whole or in part on a Graphics Processing Unit (GPU), due to the superior speed GPUs can offer compared to CPUs. In this paper, we present GPU implementations of two well known computer vision algorithms  Lukas-Kanade optical flow and optimized normalized cross-correlation, as well as a homography transformation between two 3D views. We also present a GPU library, MinGPU, which contains, as minimal as possible, all of the necessary functions to convert an existing CPU code to GPU. We provide timing charts and show, in particular, that while our MinGPU implementations of optical flow algorithm perform few hundreds times faster than CPU, our MinGPU implementation of homography transformations perform approximately 600 times faster than its C++ CPU implementation and more than 7,500 times faster than its MatLab CPU implementation.
Keywords. GPU, Computer Vision, optical flow, normalized cross-correlation, homography transformation.
1. Introduction
Our paper makes two contributions to computer vision. First, we have implemented three popular computer vision methods on GPU. For each of these methods, we present timing and we provide a detailed timing comparison of GPU homography transformation to its non-GPU implementations. Second, we have created a small C++ class library, MinGPU. We intentionally designed the library and interfaces to be as minimal as possible. MinGPU provides simple interfaces which can be used to load a 2D array into the GPU and perform operations on it. All GPU and OpenGL related code is encapsulated in the library; therefore users of this library need not to know any details on how GPU works. Because GPU programming is currently not that simple, this library can facilitate an introduction to GPU world to researchers who have never used the GPU before. The library works with both nVidia and ATI families of graphics cards and is configurable. It is available online on our web site at [25]. This paper is organized as follows: in Introduction, we present general information on the current state of art and the evolution of GPUs, their architecture and design. In the Section 2, we give a brief introduction of our C++ class library for the GPU - MinGPU. The details on MinGPU
implementation can be found in Appendix A. In the Section 2 we also show how to implement few common computer vision algorithms with MinGPU, like image derivatives, Gaussian smoothing and image pyramids. Sections 3 through 5 contain the details on MinGPU implementations of three computer vision methods as well as speed comparisons and discussions. Three methods are: Lukas-Kanade variant of optical flow, normalized cross-correlation (used, for example, in MACH filters), and homography transformation between two 3D views. The Appendix B contains code listings of all programs mentioned in Sections 2 to 5.
1.1
Graphics processors
Most of todays graphics cards from the largest GPU makers, nVidia and ATI, contain two processors, the vertex processor and the fragment processor. Vertex processor All graphics cards operate on 3D polygons. Every polygon is defined by 3D coordinates of its vertex points. For example, every triangle is defined by three vertex points given by (x, y, z). The camera point is invariably set to be (0, 0, 0). When the camera moves, the coordinates of all polygon points must be recomputed to reflect a change in the camera position. This operation is performed by the so-called vertex processor. Vertex processors are specially designed to perform this operation and thus are able to highly optimize the speed of coordinate transformations. After coordinates are recomputed, the vertex processor determines which polygons are visible from the current viewpoint. Fragment processor After the vertex processor re-computes all of the vertex coordinates, the job of the fragment processor is to cover the visible parts of the polygons with textures. The fragment processor does this with a help of the so-called shader programs. Originally, in the computer graphics world, the purpose of the shader programs was to add graphics effects to textures like shades (hence comes the name), but now this feature is being inherited by the general-purpose GPU users. Few years ago all shader programs were written in the assembly language. However, as graphics hardware evolved and became capable of executing much larger shader programs, a need for a specially designed language became evident. Most contemporary shader programs are C-like programs written in Cg language. Cg language was created by nVidia; its name translates as C for graphics. nVidia supplies good quality manuals and examples on Cg, which can be found in Cg Toolkit [9]. The most important difference between contemporary CPUs and GPUs is that GPUs run programs concurrently, being in fact SIMD-type processors. The programs are executed for every output texture pixel independently. Thus, if the GPU has 8 instruction pipelines in the fragment processor, it is able to execute the same program on up to 8 texture pixels simultaneously. Contemporary fragment processors have 16-128 instruction pipelines. While both the vertex and fragment processors are programmable, we are more interested in the fragment processor because it is specifically designed to work with textures which, in our case, can be viewed as 2D data arrays. On the other hand, vertex processor is optimized to work with pixels. Therefore, all algorithms in this paper are designed to work on a fragment processor.
1.2
About 5-7 years ago, graphics processors were not generally suitable for nongraphics use, because of severe design restrictions. Thus, prior to 2003 when nVidia introduced its GeForce series FX cards, shader programs were limited to less than 256 static assembly instructions. They did not support conditional statements like if and for, there were typically no more than four
pipelines, clock rate was below 300MHz, and the amount of installed memory usually did not exceed 32-64MB. However, an increase in the capabilities of the graphics processors in later years tended to exceed Moores Law. On an average, every year sees more than 2.4 times increase in the performance rate [28]. In April 2004, nVidia introduced its series 6000 of GeForce graphics cards [21], which supported a new shader model 3.0 which had much better programming capabilities than 5000 series cards. The series 6000 supported 65,535 static assembly instructions (due to that it became possible to put average-sized programs on the GPU) as well as up to 16 pipelines. A year later nVidia introduced its GeForce 7000 series line which is now a standard video card in most desktop computers. By that time graphics cards became well suitable for the general-purpose operations which gave rise to the new community of general-purpose GPU users (GPGPU). This community maintains a website [16] which has lots of useful code and examples for those who want to use GPU for non-graphics programming. The 8000s, the most recent series of nVidia cards as to date, has seen the light in mid-2006. The 8000s features up to 128 pipelines, with a pipeline clock rate up to 1.5GHz, and on-board memory of up to 384MB. Apart from a startling increase in the productivity, it brought a major change in the GPU philosophy. Both vertex and fragment processors are now combined into one stream processor [31], which is able to perform a set of scalar operations. This design ensures that neither of processors is idle at any time. It is also the most expensive series of graphics cards ever, whose costs are currently comparable to a desktop computer. In the late 2006, both graphics cards manufacturers, nVidia and AMD (who purchased ATI in 2006), claimed that they will directly support the general purpose computing in newer products. By 2007, both nVidia and AMD devised new software technologies which specifically target GPGPU developers. These are nVidias CUDA [30] and ATI CTM technologies [2] which we discuss in Section 1.6. In addition, in June 2007 nVidia released their first graphics chip oriented to GPU computing only, Tesla C870. Tesla is a powerful graphics station with no video output based on GeForce 8800 technology. Different versions of Tesla contain either 1, 2 or 4 GeForce 8800-series GPUs. Each GPU is equipped with 1.5GB of memory and works at an increased clock rate. The competitor, AMD, is also keeping up. Their latest hardware, ATI Radeon HD 2900 XT of series R600 contains 320 shader pipelines, which work at 742 MHz clock rate. For this video card, memory bus width is 512 bits, which is significantly more than the 384 bit bus width in competitive nVidia series 8800. A good introduction to scientific computing with GPU can be found at [15, 19]. Our MinGPU library is designed to work on nVidia 7000 series cards as well as on ATI Radeon cards series R400 (ATI X800) and later versions. We have designed and tested our library on nVidia 7300LE card, which is the most basic card in 7000 series, and now is in widespread use. Some functionality may work on nVidia 6000 series cards as well as on former ATI cards.
1.3
At the time of writing, the typical upscale desktop computer is equipped with Intel Core 2 Duo processor working at 2.4GHz. Lets roughly compare a productivity of this processor to a productivity of nVidia 7300LE (light edition) GPU, which is a ordinary graphics card installed in the same desktop. The clock rate of 7000 series nVidia GPUs lies in 400-700MHz range. 7300LE runs at 450MHz. There are 4-24 pipelines in fragment processor in 7000 series [29]. We also take into account that nVidia GPU pipelines can typically process two pixels in one clock cycle, and they process each of pixels 4 color channels simultaneously. Each pixel is represented as 4-float number (RGBA). Thus, if we pack our array so that each of RGBA floats assumes a data value (more on it in Section 2) we gain an additional 4 times speedup.
After we multiply all speedups, nVidia 7300LE works as a processor with a virtual 57Ghz rate, due to parallelism; this rate is already 6 times higher than that of Intel Core 2 Duo 2.4GHz CPU. However, this is not all. In todays computers the onboard installed memory (DRAM) tends to be hundreds of times slower than memory installed on both CPU caches. Therefore, program execution time greatly depends on the cache hit rate, which is typically about 90-98% for the modern computers. There are two considerations with respect to caches in the GPUs. When the fragment processor processes texture, it cannot write into a position other than the position of a current pixel. Thus, for current pixel, (x, y), the shader program can write an output only to a position (x, y) in the output texture. This is a very drastic limitation of the GPUs - every program can only modify the value at the location of a pixel it currently processes. However, there is a flip side to this limitation it is very good for the cache write optimization, because the memory write address is always known. Thus, it is possible to attain 100% cache write hit rate for GPU. Second, because the same program is run for every pixel in the texture, it often results in a predictable pattern of memory reads. Unless you are using the conditional statements in your GPU program memory reads have a perfectly predictable pattern. Therefore, it is natural to expect that cache read hit rate will be higher for the GPUs than for CPUs. Lets make a rough estimate of a marginal case when GPU cache hit rate for both reads and writes is equal to 100%. We make the following assumptions: DRAM memory is 500 times slower than cache memory, for both CPU and GPU an access to the cache takes 2 clocks on the average, and CPU cache hit rate is 90%. These are all reasonable assumptions for commodity computers today. In this example, the GPU with 100% cache hit rate will work about 26 times faster than the CPU with 90% cache hit rate. We have found that the modest GPU card installed in our computer can work 32-624 times faster than the latest CPU. And if we opt to install the latest nVidia 8800 GPU (1.5GHz, 128 pipelines) we gain an additional 26 times speedup. On the other hand, it is not possible to run a conventional processor at more than 4GHz clock speed due to physical constraints [32]. The only theoretical workaround to this limit lies in the use of different base technologies like quantum technology, which at the time of writing is years ahead. On the other hand, researchers in computer vision know well that many vision algorithms are not currently able to run in real-time due to their high computational complexity. Therefore, we feel that the one possible way to make them realtime in observable future is the use of parallel technology, such as present in the graphics processors today.
1.4
GPU limitations
All GPUs suffer from two very drastic design limitations. The first limitation is, ironically, the very fact that GPU computations are done in parallel. The algorithm must be ready to work in multi-thread mode in order to work on the GPU, which is not the feature of many algorithms. In the GPU every pixel in 2D texture is processed independently from others. It may not be possible to know the sequence in which pixels are processed therefore there is no way to pass any data between pixels while they are processed. For example, lets consider a popular connected components algorithm which is used in many vision algorithms today, like Canny edge detector. There exist two versions of a connected components algorithm: recursive and non-recursive. A recursive version cannot be implemented on the parallel processor, because recursiveness implies the knowledge of the order in which pixels are being processed which we dont have. A non-recursive version of the connected components algorithm makes use of a running variable which contains the largest region label currently assigned. There is no way, as it was stated above, to maintain this counter on the parallel processor. There exist some
parallel versions of connected components [17], however, those versions use binary structures like graphs and trees which are hard to implement on the GPU. As of today, we do not know of any successful implementation. We have already mentioned the second limitation in the previous subsection. In the GPU, when one processes a pixel at (x, y) location one can only write the result into (x, y) location of the output texture. There may be more than one output texture (as many as 4-24 for 7000 series and up to 48 in 8000 series). We can also consider the fact that every pixel is comprised of 4 floats (RGBA value), so for the 7300LE card one can write 16 values for every pixel processed, but they cannot be written into any other than (x, y) location in the output textures. For example, lets consider building a color histogram of the grayscale image. Every pixel in the input image may take values in 0 to 255 range which makes for 256 bins in the color histogram. For every pixel processed, we must increment one of 256 bins, which, due to the above limitation is obviously not possible to do on the GPU. This is a very simple algorithm yet it is not possible to implement it on the GPU. One source [33] came up with an approach which computes approximate color histogram on small set of input values. As many other authors suggest, the most promising algorithms to implement on the GPU are filter-like algorithms, which process every pixel independently of others. Among good examples are Gaussian smoothing, convolutions, image pyramids, geometric transformations, image de-noising, crosscorrelation as well as many other algorithms.
1.5
GPU bottleneck
The contemporary computer architecture features one bottleneck with respect to the GPU which we cannot avoid mentioning. This is a transfer rate between main (CPU) memory and GPU memory. All latest computers are equipped with PCI Express memory bus, which is the best memory bus to date in desktop computers. This memory bus has a full duplex transfer rate of 250MB/s for every lane (500MB/s for PCI Express 2.0 released in January 2007). There may be up to 32 serial lines, however it is natural to expect that commodity computers are equipped with less than that. In practice, we measured main memory to GPU memory bus transfer rate of our new DELL XPS 410 desktop to be about 860MB/s. At this speed it would take about 4-5 ms to transfer an array of 1 million 32-bit floating point numbers (1k x 1k image) from the CPU to GPU. Simple operations (addition, for example) over the same array in the CPU (Core 2 Duo 2.4MHz) would take about 4 ms. This means that for the simple array operations time required to transfer an array from CPU to GPU is comparable to the time required for applying an operation in the CPU. An example of this bottleneck is given in the beginning of Section 2; we would like to add here that some older GPU cards feature slower GPU to CPU transfer rate than CPU to GPU.
1.6
CUDA technology
In the end of 2006, along with 8800 series GeForce cards nVidia introduced CUDA (Compute Unified Device Architecture) technology which is a blend of software and hardware technologies. CUDA works only with 8800 series nVidia cards. CUDA specifically targets GPGPU users; it includes C compiler that compiles existing C programs into binary codes executable on nVidia graphics processor. All user needs to do is to supply compiler an existing C code after making some minimal changes [30]. CUDA bypasses OpenGL completely, which gives considerable speedup and also makes programs OpenGLindependent. However, being a high-level technology CUDA is not usually capable of performing the entire computation in the GPU. As mentioned in the previous subsections few algorithms can be executed entirely on the GPU. Thus, as a consequence, it is probable that some of the C code will be executed on the
CPU and some on the GPU; the execution speed will be a function of the memory bus speed discussed above. CUDA users cannot control how the compiled C code is executed. On contrary, in our paper we use relatively lowlevel methods of accessing graphics processor which gives us full control of how our code is executed on the GPU. Obviously, only by directly controlling the graphics processor one can get the most of it. We tested the homography transformation algorithm (described in Section 5) with CUDA on nVidia GeForce 8800 graphics card. We compared performances of homographies Cg GPU code and homographies C CPU code. We compiled homographies C CPU code with CUDA and ran. We found that programs run directly on GPU about 10 times faster than the programs compiled and run with CUDA which speaks in favor of our approach. AMD recently introduced a new technology called CTM, which is similar to CUDA. CTM works with AMD Stream graphics processor introduced in November 2006. It features 48 shader pipelines, up to 1GB of memory and PCI Express 16 lane support. Also, AMD recently unveiled plans to integrate CPU and GPU in new graphics hardware. The embedded CPU will locally handle all sequential code which is not possible to execute on GPU, thus avoiding memory bottleneck.
1.7
Related work
Not much work has been reported on general-purpose computation on GPU before 2000. Since 2002 there is a lot of interest in GPU. A comprehensive list of papers is available at [7]. In 2001, Rumpf and Strzodka described a method to implement level sets on GPU [35]. In 2002, Yang and Welch studied image segmentation and smoothing on graphics hardware [40]. There were several papers published in 2003-2004, many of them appeared in computer graphics conferences rather than computer vision. Thus, in 2003 Moreland and Angel published a paper in SIGGGRAPH conference on how to implement Fast Fourier transform on GPU [28]. In the same year, Colantoni et al. described color space conversions, PCA and diffusion filtering on GPU [6]. In 2004, GPU Gems 2 book dedicated a chapter to computer vision on GPU [11]. It discussed correcting radial distortions, Canny edge detector (partly), tracking hands and image panoramas on GPU. The same year, Yang and Pollefeys published a CVPR paper on implementation of real-time stereo on GPU [39]. Ohmer and Maire implemented a face recognition system using Kernel PCA and SVM [34]. Labatut et al. implemented level set-based multi-view stereo [22]. Well-known SIFT algorithm was implemented in 2007 by Heymann et al. [18]. Among other implemented methods were [7]: Generalized Hough Transform, skeletons, depth estimation in stereo, motion estimation, 3D convolutions, contour tracking, segmentation and feature tracking and matching. However, since pre-2005 papers used previous generations of graphics hardware, their contribution is often rather outdated. MinGPU is a library which incapsulates all graphics-related routines from the end user, which is especially useful for people who do not have particular skills or desire to delve into low-level details of GPU programming. MinGPU can be used as a basic block of any GPGPU application, it helps to implement any algorithm on GPU. Besides our library, there exists another library which is dedicated to computer vision on GPU, OpenVIDIA [12, 33]. This is an open source library which is a collection of many useful vision algorithms, like edge and corner detection, feature-based corner and object tracking, image enhancement/preprocessing routines and so on. The complete algorithm list can be found at [33]. The other issue of interest about OpenVIDIA is that it maintains a web page which lists many recent computer vision related papers [7]. However, this library is not built upon a reusable core. Another notable effort is a BrookGPU library [4]. BrookGPU was conceived as an extension to C language which supports stream programming and data-parallel constructs. It includes a compiler and run-time support libraries. This project is not under development since 2004. The latest v0.4 release of BrookGPU can be
downloaded from [5]. Yet another open source GPU library is Sh [36]. Sh is a GPU metaprogramming library designed in C++. On contrary to BrookGPU, it is still being developed; the latest version v0.8 was produced in 2006. However, both these medium-sized open-source projects feature no documentation and are complex to a user without advanced knowledge of graphics processors and C++ programming.
2. Software interface
In computer vision, we often encounter the situations when we need to process every point of a 2D array (usually an image) with a double loop like this for (row = 0; row < MaxRow; row ++) { for (col = 0; col < MaxCol; col ++) { // do something } } Any 2D array can be represented as a GPU texture. If we do not need to carry over any information between the points of this array, we can implement the inner part of this double loop as a shader (Cg) program. The array is uploaded into the GPU as texture. Then, the shader program is run and the output texture downloaded back into main memory. In this section we introduce the smallest possible library, MinGPU, which can implement the above mentioned code on a GPU. We attempt to convert the CPU code into GPU in the simplest possible manner. In the Section 2.2, we present an implementation of this double loop in MinGPU. The rest of the section is dedicated to MinGPU examples based on simple vision algorithms. This section uses a learn-by-example technique. We progress from easy towards more elaborate examples. We show some simple examples: taking image derivatives, applying Gaussian smoothing, and computing image pyramids. In the following sections, we present GPU implementation of a few popular vision algorithms.
for (row = 0; row < MaxRow; row ++) { for (col = 0; col < MaxCol; col ++) { Array[row][col] ++; } } All code listings are given in the Appendix B. The code which implements Hello, World on MinGPU is given in Listing 1. This as well as the most other listings contains two pieces: a C++ program and a Cg program. Lets first look at C++ program. It matches a scheme we discussed above in a straightforward way: we create both array and Cg program, copy array to a GPU, set program parameters and run it. As for Cg program, of course, we wont be able to cover the entire Cg language here, which you could find in a Cg user manual [8], but at least we can highlight some key points. First, our first program contains just one function, main, which is its entry point. There are two input parameters to this function, parameter coords of type float2 and parameter texture of type samplerRECT. Parameter coords is bound to pre-defined name TEXCOORD0, which contains (x, y) coordinates of the currently processed pixel. Pre-defined names are set automatically by the hardware, we do not need to provide values for these. On the other hand, parameter texture is a regular parameter which we must initialize with a call to the SetParameter function. It is our input array; one we want to increment by 1. The standard Cg function texRECT fetches a value of texture array at coords coordinates. In this simplified example, we used the same Cg texture as both an input and an output array. We store intermediate value in the result variable which is of standard 4-float RGBA type. The assignment string result = y + 1 increments each of four float values in result by 1. In this way, every float in result would contain the same value. Cg functions return one 4-float RGBA value which is the generated value for the current (x, y) texture pixel. When we download this value to the CPU, in luminance mode (discussed in the following subsection) OpenGL drivers take a mean of 4 RGBA floats to produce a single luminance value for each texture pixel. In color mode all 4 RGBA channels are returned. Note that all Cg programs run on every point in an output, not an input array. These two arrays can be of different sizes.
all current ATI cards a color mode is required. nVidia also fully supports a color mode. For list of color modes supported by different cards, see tables in [13]. In a luminance mode, every float in the quad holds the same value of one input array cell. In a color mode, MinGPU replicates every input value four times, so that each float in a quad contains the same value. Luminance and color modes are compatible on the level of a C++ code and on the level of a Cg program. In MinGPU, a color mode is specified on per-texture basis. The luminance mode is the default; textures can be created in the color mode by setting bMode to enRGBA in a call to Array::Create.
gl (i, j ) =
m =2 n =2
w(m, n) g
l 1
(2i + m, 2 j + n).
In the above equation, gl is an image at pyramid level l, matrix w contains the weight mask, and i, j are the indices for the images columns and rows, respectively. A C++ code and a Cg program for REDUCE operation are given in Listing 8. In this example, the input and output array sizes do not match. Because each pyramid reduction effectively reduces the image size by half in both dimensions, the output array side is half as long as the input array side. The important question is, how do we determine the values for array elements outside of the array boundaries? In two previous examples we accessed such elements. In all of the examples in this paper, elements located outside of the array are assigned the value of the nearest element in the array. This behavior is hard-coded by a call to OpenGL function glTexParameteri during array creation. There are no other options available, so, if we need to assign some predefined value like 0 to elements lying outside of the array area, we have to pad our array with 0s.
10
E =  (uI x + vI y + It ) 2 .
Its partial derivatives are then computed as,
11
E = 2 I x (uI x + vI y + It ) = 0. u
E = 2 I y (uI x + vI y + It ) = 0. v
After we compute the partial derivatives, we build an observation matrix (we omitted formula transformations here):
I x2 I x I y
I I I
x y 2 y
u I x I t = . v I y I t
Which we abbreviate as Au = B . In this matrix equation, (u, v) denotes an optical flow. The other remaining parameters are the pre-computed image derivatives Ix, Iy, and Iz. We compute (u, v) for every image pixel by taking the inverse of matrix A and multiplying it by matrix B.
12
Every image loaded into the GPU was used twice, once for time t and once for time t + 1. Thus, we obviously use a single load twice use scheme, which results in a relatively modest speedup. Therefore, after we account for memory transfers, the speedup of our GPU (series 7300) versus our CPU is estimated as 60% on the average. Loaded into the latest series 8800 GPU, this program would perform at about 400 frames per second.
4. 2D Normalized Cross-correlation
In this section, we discuss an implementation of a scaled version of the normalized correlation filters. Many methods of correlation pattern recognition [10, 14, 37], like OT-MACH filters [27, 38], are based upon the normalized cross-correlation (fast implementations of normalized cross-correlation can be found in [3, 23]). The limitation of the correlation filters is that they cannot deal with template scaling. We try to overcome this limitation by scaling a filter in the pre-defined [min, max] range with a step S. This results in a set of filters which are sequentially cross-correlated with an image. The confidence value of a match is obtained by taking the maximum value from the output of all of the filters at the location. The sample application for template scaling is shown on Fig. 2. There, however, may be another case when we need to apply a set rather than one correlation filter to an image: quite often it happens that the matching template can assume slightly different forms (Fig. 3), in which case it is reasonable to use a set of templates instead of one. In this second case, all filters are of the same spatial size, but contain somewhat different templates. Formally, the normalized cross-correlation  (u , v) between an image and a filter at every point (u, v) in an image is computed according to the formula [23]:
(u , v) =
( f ( x, y ) f u ,v ) 2 x , y (t ( x u , y v ) t )2 x, y
x, y
( f ( x, y ) f u ,v )(t ( x u, y v) t )
f ( x, y ) is an input image of Mx x My size, f u ,v is an average of the input image over a filter area, t is a square filter of N x N size and t is an mean of a filter. The sums x, y are computed under the window containing a filter t .
In this formula,
1. Calculate f u ,v for every image point. 2. Calculate normalized value of f: f ( x, y )  f u ,v for every image point. 3. Calculate
4.
x, y
13
point.
For differently sized filters all steps are the same, but steps 1-3 are also included under the loop 4, because sums in steps 1-3 are calculated over filter area which is variable in case of differently sized filters. The C++ program which implements all steps above is not given in Appendix B, we omit it for brevity. Listing 11 shows Cg programs which calculate averages over an input image (steps 1, 3, omitting trivial subtraction in step 2). Averages and sums over filter area are similar to averages over image so Cg programs for those averages are also omitted here. Listing 12 shows a Cg program which calculates the main formula (step 4iv). All other listings can be found in a source code [25]. Double for loops in Listing 11 are of some interest because of hidden overflow potential they may generate. According to Cg specification, basic integer type int is a 32-bit type. However, for loops in Listing 11 do not accept any values over 127 for integer counters, thus prohibiting use of filters larger than 127 by 127. This peculiar feature has an unknown origin and possibly pertains to 8-bit limitation of assembly format of for instruction. It is possible to use floats in place of integers as loop counters in Cg. In this case, there is no 128 (byte) limit, but yet another problem arises for program 1 in Listing 11: filter sizes of more than 110 fail if we use float loop counters. No error messages are generated in both cases, so it is not possible to identify the certain cause of errors; however the second problem likely lies in the limited length of Cg assembly programs. Cg programs in all contemporary GPUs, including series 8000, are limited to 65,535 assembly instructions, either dynamic or static. While the static size of the assembly program 1 in Listing 11 is tiny, its dynamic size (the number of executed instructions) may exceed this limit. In both overflow cases, the calculations went completely wrong with no error messages generated. Because a Cg loop can not be based on program parameters, all filter sizes must be hard-coded in both C++ and Cg programs. The size bounds of filters L, H, as well as a size step S are thus fixed for every executable module. For C++ program, a possible coding solution is to use a C macro, while Cg does not support macro or name definitions.
14
and screen input/output operations and auxiliary computations but including memory bus input/output for the GPU. The GPU finished one filter in 0.620s and 7 filters in 1.660s, while normxcorr2 finished 7 filters in 0.575s. This gives a good example that even in single load multiple use scheme, the GPU does not necessarily give an advantage over the CPU - everything depends on the particular algorithm. On the other hand, if we had installed the latest series 8800 video card, which has over 26 times in performance compared to our series 7300 video card, then the GPU would have had about 9 times advantage over the CPU. Also, if filter size is decreased the GPU quickly becomes on the par with the CPU. Thus, the algorithm must be particularly suitable for the GPU in order to get its performance improved greatly. The next section gives an example of such algorithm.
5. Homography Transformations
In this section, we present another example of an algorithm which works according to a single array load  multiple use scheme. This algorithm uses homography transformations. We have implemented this algorithm in Matlab (two versions), C++, and MinGPU. In the results section, we present execution times for both CPU and GPU, and show that the GPU Cg implementation works about 600 times faster than CPU C++ implementation, and about 7,700 times faster than the fastest CPU Matlab implementation. MRI imagery provides a 3D snapshot of a human body. Thousands of body slices stacked together give a snapshot of the inner human structure. In a somewhat similar idea, a method on how to fuse multi-view silhouettes to reconstruct the visual hull of the 3D object slice by slice in the image-plane has recently been proposed [20]. This method approximates an object by a set of planes (slices) parallel to some reference plane. These slices are related by plane homographies in different views. In this section we present a GPU implementation of this algorithm. The algorithm cannot handle concave objects, however, for many other objects like human bodies it works reasonably well. We provide an outline of the idea here, while transformation formulas and detailed explanation are found in [20]. They take images of a scene from different viewpoints, with uncalibrated cameras, with an aim to recover a 3D representation of the objects in a scene. The ground plane homographies between different viewpoints are estimated using SIFT features and RANSAC algorithm. Vanishing points are computed by detecting vertical lines in a scene and then finding their intersection with RANSAC algorithm. In this way the geometry of the scene is learned. To create a stack of object slices, they use the fact that warping the silhouettes of an object from the image plane to the plane in the scene using a planar homography is equivalent to projecting a visual hull of the object on the plane. Foreground silhouettes from different views are warped to a reference view using the homographic transformation. Fig. 4c contains an example of what transformed silhouette of original image may look like. After transformations are applied, visual hull intersection is found by overlapping transformed silhouettes. This operation is repeated for every slice because slices reside on different parallel planes and therefore different transformations are required between image and slice planes. The following two examples illustrate this method. In the first example, we have a single camera and a single object on a rotating pad (Fig. 4 a-b). The pad rotates at a constant speed and the camera captures views at equal intervals of time, which gives us views of the same figure from every 6 degrees, resulting in a total of 60 views. We create 100 slices of this figure. For every slice and every view, we apply a homography transformation to the slice plane (Fig. 4c), and then we overlap (or fuse) the transformation results. Results for slices 0, 35, 75, 99 are given in Fig. 4d  g. The second example contains a view of an outside scene (Fig. 4 h - i). There are four cameras in the scene, and their relative positions are not known.
15
Some people are present in the scene and we have captured pictures of those people from all four cameras. We find scene geometry as described above. Then, by applying the homography transformations, we construct a 3D sliced representation of these people (Fig. 4j). The results are coarser than ones in the first example, because here we are using just four views as compared to 60 in the first example. The more views we have, the better the outline of the resulting slice.
16
If we increase or decrease the number of views or slices, the execution time increases or decreases correspondingly, which means that the time spent has a linear dependence on the number of views and slices. However, tests have shown that the time starts increasing exponentially if the image size exceeds some threshold. This threshold seems to vary depending only on the video card you use, so it is a hardware threshold. For our nVidia 7300LE graphics card, we found it at approximately 6MB (Fig. 5). This threshold roughly corresponds to the size of the installed GPU memory cache. If the total amount of data accessed by Cg program exceeds this threshold, processing time grows according to exponential law (Fig. 5).
6. Conclusions
We have created a small library, MinGPU, which proved to be a very practical tool in our research. Using MinGPU, we have implemented and tested some computer vision algorithms on graphics processor (GPU), and have found that GPUs are indeed very useful for computer vision researchers. Not only that they can offer a considerable speed up, they make possible for many algorithms to move to real-time domain. We have to point that due to inherent CPU limitations real-time execution is often not feasible on CPU even in remote future. Todays GPUs are in fact SIMD-type (Sinlge Instruction Multiple Data) processors which already work hundreds of times faster than CPUs and their productivity continues to rise astoundingly quickly. Thus it is not hard to conclude that boom of GPGPU computing is bound to happen in coming years. Acknowledgements We wish to express our gratitude for all the help we received in writing of this paper. We would like to thank Andrew Miller for his useful advices and help on GPU programming. The dataset and method of homography transformations is a courtesy of Saad M. Khan. UAV IR dataset is a courtesy of VACE CLEAR evaluation.
References
[1]. Adelson, E. H., Anderson, C. H., Bergen, J. R., Burt, P. J., Ogden, J. M.: Pyramid methods in image processing, RCA Engineer, 29(6), 33-41 (1984). [2]. ATI CTM Guide. AMD Technical Reference Manual (2006). [3]. Briechle, K., Hanebeck, U.D.: Template Matching Using Fast Normalized Cross Correlation. Proceedings of SPIE, 4387, (2001). [4]. Buck, I. et al.: Brook for GPUs: Stream Computing on Graphics Hardware. ACM Transactions on Graphics, 23(3), 777-786 (2004). [5]. BrookGPU source code: http://brook.sourceforge.net [6]. Colantoni, P., Boukala, N., Da Rugna, J.: Fast and accurate Color Image Processing using 3D Graphics Cards. Vision, Modeling and Visualization conference (VMV), (2003).
17
[7]. Computer Vision on GPU, paper list: http://openvidia.sourceforge.net/papers.shtml [8]. Cg Toolkit Users Manual. (2005). [9]. Cg Toolkit: http://developer.nvidia.com/object/cg_toolkit.html [10]. Duda, R. O., Hart, P. E., Stork, D. G.: Pattern Classification (2nd edition), WileyInterscience (2000). [11]. Fung, J.: GPU Gems 2, chapter Computer Vision on the GPU, 649-666. Addison Wesley (2005). [12]. Fung, J., Mann, S., Aimone, C.: OpenVidia: Parallel GPU Computer Vision, Proceedings of ACM Multimedia, 849-852 (2005). [13]. Gddeke, D.: Online tutorial on GPGPU. http://www.mathematik.unidortmund.de/~goeddeke/gpgpu/tutorial.html [14]. Gonzalez, R. C., Woods, R. E.: Digital Image Processing, Reading, Massachusetts: AddisonWesley (1992). [15]. GPGPU: General Purpose Computation on Graphics Hardware. ACM SIGGRAPH Course (2004). [16]. GPGPU community: www.gpgpu.org [17]. Han, Y., Wagner, R. A.: An Efficient and Fast Parallel-Connected Component Algorithm. Journal of the ACM 37:3, 626-642 (1990). [18]. Heymann, S., Mller, K., Smolic, A., Frhlich, B., Wiegand, T.: SIFT Implementation and Optimization for General-Purpose GPU. International Conference on Computer Graphics, Visualization and Computer Vision (WSCG), (2007). [19]. Introduction to Image Processing on GPU. nVidia Technical brief (2005). [20]. Khan, S. M., Yan, P., Shah, M.: A Homographic Framework for the Fusion of Multi-view Silhouettes, to appear in ICCV 2007. [21]. Kilgariff, E., Fernando, R.: GPU Gems 2, chapter The GeForce 6 Series GPU Architecture, 471-492. Addison Wesley (2005). [22]. Labatut, P., Keriven, R., Pons, J-P.: Fast Level Set Multi-View Stereo on Graphics Hardware. 3rd International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), 774-781, (2006). [23]. Lewis, J.P.: Fast Normalized Cross-Correlation. Vision Interface, 120-123, (1995). [24]. Lucas, B. D., Kanade, T.: An iterative image registration technique with an application to stereo vision. Proceedings of the International Joint Conference on Artificial Intelligence, 674-679 (1981). [25]. MinGPU source: www.cs.ucf.edu\~vision\MinGPU [26]. MinGPU Google group: http://groups.google.com/group/MinGPU [27]. Mahalanobis, A., Vijaya Kumar, B. V. K., Song, S., Sims, S. R. F., Epperson, J. F.: Unconstrained correlation filters. Applied Optics 33 (17), 3751-3759 (1994). [28]. Moreland, K., Angel, E.: The FFT on GPU. SIGGRAPH 2003, 112-119 (2003).
18
[29]. NVIDIA(R) GeForce 7950 GX2 Technical Specifications. [30]. NVIDIA CUDA Compute Unified Device Architecture, Programming Guide. [31]. NVIDIA GeForce 8800 GPU Architecture Overview. nVidia Technical brief (2006). [32]. Meindl, J. D.: Low Power Microelectronics: Retrospect and Prospect. Proceedings of the IEEE 83(4), 619 635 (1995). [33]. OpenVIDIA GPU computer vision library. http://openvidia.sourceforge.net [34]. Ohmer, J., Maire, F., Brown, R.: Implementation of Kernel Methods on the GPU. Proceedings 8th International Conference on Digital Image Computing: Techniques and Applications (DICTA), 543-550, (2005). [35]. Rumpf, M., Strzodka, R.: Level Set Segmentation in Graphics Hardware. Proceedings of IEEE International Conference on Image Processing 3, 11031106 (2001). [36]. Sh GPU metaprogramming library: http://www.libsh.org/ [37]. Vijaya Kumar, B.V.K., Mahalanobis, A., and Juday, R. D.: Correlation Pattern Recognition. Cambridge University Press (2005). [38]. Vijaya Kumar, B. V. K., Carlson, D. W., Mahalanobis, A.: Optimal trade-off synthetic discriminant function filters for arbitrary devices. Optic Letters 19 (19), 1556-1558 (1994). [39]. Yang, R., Pollefeys, M.: Multi-Resolution Real-Time Stereo on Commodity Graphics Hardware. IEEE Computer Vision and Pattern Recognition, 211-218, (2003). [40]. Yang, R., Welch, G.: Fast image segmentation and smoothing using commodity graphics hardware. Journal of Graphics Tools, 7(4):91-100, (2002).
19
20
Most functions in Cg programs have some input parameters. For instance, we have to specify some input parameters for the entry function before we execute a program. Parameters can be values, arrays, matrices, or textures. Enumerator nType specifies the type and number format of input parameter, string szName contains its Cg program name and pValue holds the parameter value. Besides parameters we set with SetParameter method, functions may have some parameters which refer to pre-defined names [8]. For example, parameter which refers to name TEXCOORD0 will automatically receive the row and column values for the currently processed pixel. bool Program::Run(Array *output) This method executes a Cg program on the GPU. Array output accepts the output of this program; it is filled as a result of program execution. The program is run separately for every cell in array output. A new value is generated for every cell in this array. All methods in classes Array and Program return true if successful and false otherwise.
21
Listing 2 (Hello, World reduced): Array.CopyToGPU(); Program.SetParameter(enTexture, "texture", (void *) Array.Id()); Program.Run(&Array); Array.CopyFromGPU(); Listing 3 (image derivative in x, y direction): Output.Create(NULL, cols, rows, Luminance); Array.CopyToGPU(); Program.SetParameter(enTexture, "T", (void *) Array.Id()); Program.SetParameter(enMatrixf, "K", (void *) Kernel.Id()); Program.Run(&Output); Output.CopyFromGPU(); float4 Derivative3x3 ( float2 C : TEXCOORD0, samplerRECT T, uniform float3x3 K) : COLOR { float4 result = 0; for (int row = 0; row <= 2; row ++) { for (int col = 0; col <= 2; col ++) { result = result + K[row][col] * texRECT(T, C + float2(col - 1, row - 1)); } } return result; } Listing 4 (image derivative in t direction): float4 DerivativeT3x3 ( float2 C : TEXCOORD0, samplerRECT T1, samplerRECT T2, uniform float3x3 K) : COLOR { float4 result = 0; for (int row = 0; row <= 2; row ++) { for (int col = 0; col <= 2; col ++) { float4 p1 = texRECT(T1, C + float2(col - 1, row - 1)); float4 p2 = texRECT(T2, C + float2(col - 1, row - 1)); result = result + K[row][col] * p2 - K[row][col] * p1; } } return result; } float4 DerivativeT1x1 ( float2 C : TEXCOORD0, samplerRECT T1, samplerRECT T2) : COLOR { return texRECT(T2, C + float2(col, row)) - texRECT(T1, C + float2(col, row)); } Listing 5 (image derivative, unfolded loops): float4 Derivative3x3 ( float2 C : TEXCOORD0, samplerRECT T, uniform float3x3 K) : COLOR
22
{ float4 result = 0; result = result + K[0][0] result = result + K[0][1] result = result + K[2][1] result = result + K[2][2] return result; * texRECT(T, C + float2(-1, -1)); * texRECT(T, C + float2(-1, 0)); * texRECT(T, C + float2(1, 0)); * texRECT(T, C + float2(1, 1));
Listing 6 (Gaussian smoothing): Array.CopyToGPU(); Gaussian.CopyToGPU(); Program.SetParameter(enTexture, "T", (void *) Array.Id()); Program.SetParameter(enArrayf, "G", (void *) Gaussian.Id()); Program.Run(&Output); Output.CopyFromGPU(); float4 Gaussian5x5 ( float2 C : TEXCOORD0, samplerRECT T, uniform float G[5][5]) : COLOR { float4 result = 0; for (int row = 0; row <= 4; row ++) { for (int col = 0; col <= 4; col ++) { result = result + G[row][col] * texRECT(T, C + float2(col-2, row-2)); } } return result; } Listing 7 (Gaussian smoothing with the use of convolution primitive): Out.Create(NULL, cols, rows, Luminance); Array.CopyToGPU(); float K[3][3] = {{-1, 0, 1}, {-1, 0, 1}, {-1, 0, 1}}; {dx derivative kernel} _convolve(In, Out, (float *) &K, 3); Out.CopyFromGPU(); Listing 8 (pyramid REDUCE operation): Array.CopyToGPU(); Mask.CopyToGPU(); Program.SetParameter(enTexture, "T", (void *) Array.Id()); Program.SetParameter(enArrayf, "Mask", (void *) Mask.Id()); Program.Run(&Output); Output.CopyFromGPU(); float4 ReducePyramid ( float2 C : TEXCOORD0, samplerRECT T, uniform float Mask[5][5]) : COLOR { float4 result = 0; for (int row = 0; row <= 4; row ++) { for (int col = 0; col <= 4; col ++) {
23
result = result + Mask[row][col] * texRECT(T, float2(2 * C.x + col - 2, 2 * C.y + row - 2)); } } return result; } Listing 9 (Lukas-Kanade optical flow): Array[t].CopyToGPU(); Array[t + 1].CopyToGPU(); { Take x, y, t derivatives from both input images, and average them } float Kx[3][3] = {{0, 0, 0}, {0, -1, 1}, {0, 0, 0}}; fRet &= _convolve(Array[t], Ix, (float *) &Kx, 3); fRet &= _convolve(Array[t + 1], Itmp, (float *) &Kx, 3); fRet &= _average(Itmp, Ix); float Ky[3][3] = {{0, 0, 0}, {0, -1, 0}, {0, 1, 0}}; fRet &= _convolve(Array[t], Iy, (float *) &Ky, 3); fRet &= _convolve(Array[t+1], Itmp, (float *) &Ky, 3); fRet &= _average(Itmp, Iy); fRet &= _subtract(Array[t + 1], Array[t], It); { Calculate sums in 5x5 area } Program Program; fRet &= Program.Create("opticalflow.cg", "Sum2"); fRet &= Program.SetParameter(Program::enTexture, "I", (void *) Ix.Id()); fRet &= Program.Run(&SIx2); { do the same for all other 4 sums } { calculate [u, v] from A*[u, v] = B matrix equation } fRet &= Program.Create("opticalflow.cg", "OpticalFlow"); fRet &= Program.SetParameter(Program::enTexture, "SIx2", (void *) SIx2.Id()); fRet &= Program.SetParameter(Program::enTexture, "SIy2", (void *) SIy2.Id()); fRet &= Program.SetParameter(Program::enTexture, "SIxIy", (void *) SIxIy.Id()); fRet &= Program.SetParameter(Program::enTexture, "SIxIt", (void *) SIxIt.Id()); fRet &= Program.SetParameter(Program::enTexture, "SIyIt", (void *) SIyIt.Id()); fRet &= Program.Run(&result); fRet &= result.CopyFromGPU(); float4 OpticalFlow ( float2 C : TEXCOORD0, samplerRECT SIx2, samplerRECT SIy2, samplerRECT SIxIy, samplerRECT SIxIt, samplerRECT SIyIt) : COLOR { float2x2 A; A[0][0]= texRECT(SIx2,C).r; A[0][1]= texRECT(SIxIy,C).r; A[1][0]= texRECT(SIxIy,C).r; A[1][1]= texRECT(SIy2,C).r; float2 B; B[0] = - texRECT(SIxIt, C).r; B[1] = - texRECT(SIyIt, C).r;
24
float2x2 invA = 1/determinant(A) * float2x2(A[1][1], -A[0][1], -A[1][0], A[0] [0]); float2 uv = mul(A, B); float4 result; result.r = uv[0]; result.g = uv[1]; return result; } Listing 10 (image sum (squares)): float4 Sum2 ( float2 C : TEXCOORD0, samplerRECT I) : COLOR { float4 val; float4 result = 0; for (int row = -2; row <= 2; row ++) { for (int col = -2; col <= 2; col ++) { val = texRECT(I, C + float2(col, row)); result = result + val * val; } } return result; } Listing 11 (normalized cross-correlation): { computing the image average under the 77 by 77 shifting window } float4 favg ( float2 C : TEXCOORD0, samplerRECT I) : COLOR { float4 result = 0; for (int row = 0; row < 77; row ++) { for (int col = 0; col < 77; col ++) { result = result + texRECT(I, float2(C.x + col, C.y + row)); } } return result / (77.0 * 77.0); } { computing the squared sum under the 77 by 77 shifting window } float4 Snorm2 ( float2 C : TEXCOORD0, samplerRECT fnorm) : COLOR { float4 val = 0; float4 result = 0; for (int row = 0; row < 77; row ++) { for (int col = 0; col < 77; col ++) { val = texRECT(fnorm, float2(C.x + col, C.y + row)); result = result + val * val; } } return result; }
25
Listing 12 (main formula in normalized cross-correlation): float4 Correlation ( float2 C : TEXCOORD0, samplerRECT fnorm, samplerRECT tnorm, samplerRECT Snorm2, samplerRECT Tnorm2) : COLOR { float4 numerator = 0; float4 denominator = sqrt(texRECT(Snorm2, C) * texRECT(Tnorm2, float2(0, 0))); for (int row = 0; row < 77; row ++) { for (int col = 0; col < 77; col ++) { numerator = numerator + texRECT(fnorm, float2(C.x + col, C.y + row)) * texRECT(tnorm, float2(col, row)); } } return abs(numerator / denominator); } Listing 13 (homography transformation): {Initialize script. Set script parameters that wont change later} Script Script; Script.Create(strScript_Homography, "main"); Script.Select(); Script.SetParameter(enFloat, "ratio", (void *) &fRatio); Script.SetParameter(enFloat, "crop_x1", (void *) &crop_x1); Script.SetParameter(enFloat, "crop_y1", (void *) &crop_y1); {For every slice} for (int slice = 1; slice <= SLICES; slice++) { {Create and load empty output texture} Texture Accumulator; Accumulator.Create(fbuf, crop_x, crop_y); Accumulator.CopyToGPU(); {Bind script to this texture} Script.SetParameter(enTexture, "result", (void *) & Accumulator) {For every view} for (int i = 1; i <= VIEWS; i++) { {set script parameters - view and homography matrix} Script.SetParameter(enTexture, "view", (void *) &View); Script.SetParameter(enMatrixf, "H", &HMatrix); {run the script} Script.Run(&Accumulator); } {recover results from the script} Accumulator.CopyFromGPU(); } {Cg Script} float4 main ( {current coordinates in cropped region} float2 coords : TEXCOORD0, {3x3 homography transformation matrix} uniform float3x3 H, {(x, y) gives offset of cropped region in target image} uniform float ratio, uniform float x, uniform float y,
26
{view refers to input view image, result refers to accumulator image} uniform samplerRECT view, uniform samplerRECT result) : COLOR {Input coordinates, coord, contain coordinates in cropped region, 200x180} {To do proper transformation, we have to convert coords to target image} float3 C = float3(coords[0] * ratio + x, coords[1] * ratio + y, 1.0); {the next three lines do homography transformation} float k = H[2][0] * C[0] + H[2][1] * C[1] + H[2][2] * C[2]; float A = (H[0][0] * C[0] + H[0][1] * C[1] + H[0][2] * C[2]) / k; float B = (H[1][0] * C[0] + H[1][1] * C[1] + H[1][2] * C[2]) / k; {now take already accumulated value from the result } float4 x = texRECT(result, coords); { and add a new value to it} float4 y = texRECT(view, float2(A, B)); return x + y;
27
Fig. 1. The results of Lukas-Kanade optical flow algorithm implemented on GPU. The first row shows the original images, the second row shows the computed optical flow.
Fig. 2. This figure illustrates a possible application for GPU scaled correlation filters. UAV IR night video has occasional inclusions of people of unknown scale which must be detected. The video is pre-processed with thresholds and Harris corner detector, which leaves about 30-150 possible people candidates in every frame. The exact scale of people is unknown because aircraft altitude keeps changing; however, there exists a limiting range for the scales, both upper and lower. People are being detected by applying a set of differently sized correlation filters to every possible position. Resulting detections are then tracked to ensure their consistency.
Fig. 3. The correlation filters implemented on GPU. Railroad track video is made with a camera fixed over tracks, so the scale of objects in a field of view does not change. Our purpose is to detect and count rectangular rail fastener clips which hold the rail in place. However, besides being spatially rotated for up to 30 degrees clips can be of slightly different geometric/visual appearances by manufacturers design. In a set of filters we put a filter for every possible filter appearance. Filters are created using OT-MACH algorithm with a rotation step of 3 degrees, and then applied sequentially to an image.
28
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
(i)
(j)
Fig. 4. The results of implementation of visual hull intersection method for reconstructing object model using plane homographies implemented on GPU. (a-b) solid figure on a rotating pad; (c) its homography transformation; (d-g) resulting slices: feet, legs, lower torso, upper torso; (h-i) two views; (j) their slice at feet level.
Time Per Slice MATLAB (for loops) MATLAB (built-in functions) C++ CPU GPU 10.5 min 2 min 35 s 12 s 0.02 s
Total Time Hours Hours about 25 min 3.5 s (including 1.5s file read)
29
Fig. 5. Time versus image size for visual hull intersection using homography transformations.
30