Sorting data in two clock cycles

Image recognition can be a tricky and math-intensive problem. This research uncovered a way to do much of the work using an FPGA . . . and just the right sort of algorithm.

Sorting data efficiently is a problem that many mathematicians have investigated, trying to minimize the complexity of their algorithms. The need for efficiency grows with the number of elements or the number of repetitions in the data sort. You can find a large number of published sorting algorithms in the literature1 or by searching the Web, but it's difficult to decide which algorithm best suits a specific problem. Each algorithm has its advantages and disadvantages.

A sorting algorithm's efficiency depends mostly on the system architecture of the hardware the algorithm will run on and the number of elements to be sorted. This article compares two completely different system architectures, mainframes and FPGAs, and shows that on FPGAs you can implement very efficient sorting algorithms at a much finer granularity. Our results will be useful to hardware developers who want to take advantage of parallelism that's been mainly the purview of mainframes for decades. We use a system that recognizes images of traffic signs to illustrate some examples.

The research
The architectural aspect of our research was conducted through the Competence Network for Technical, Scientific High-Performance Computing in Bavaria.2 The project, entitled “Exploration of the Possibilities for the Direct Synthesis of Concurrent C Programs on High-performance Computers in FPGAs,” was conducted through the Georg-Simon-Ohm University of Applied Sciences in Nuremberg.3 It examined which methods of converting mainframe programs to run on FPGAs were possible, the level of effort required for conversion, and the overall quality of the converted program.

We used a mainframe computer as a development platform and for simulation. The purpose of this research was to convert concurrent mainframe software running on several processors into prototypical hardware designs on FPGAs. Both platforms could run software in parallel, and we had several libraries that supported development of parallel software for them.

The choice of the programming model for mainframes depends on the system architecture, with memory distribution one of the main aspects. Distributed-memory machines have different programming models than shared-memory machines. Independent processes running on distributed-memory machines communicate with each other by sending messages. Processes on shared-memory machines communicate by accessing the same (shared) memory.

FPGAs have similarities with both architectures. There are ways to access the same internal or external memory and to communicate between processes by sending messages. We decided to follow the shared-memory programming model of mainframes because it was most similar to the way FPGAs access the same memory. Furthermore, the shared-memory model has less communication overhead than distributed-memory machines, and the minimal overhead made it easier to forecast the performance of the target hardware. We used the OpenMP C and C++ library as the development environment for the mainframe.4 For the FPGA we used the Handel-C language, which was developed at the University of Oxford and commercialized by Celoxica.5,6,7 The underlying C syntax of both tools gave us a good basis for comparison. Both programming environments offer extensions for parallel processing.

From mainframe to FPGA
The idea was to run the same application on both the mainframe and the FPGA. Reusing the code shortens our development time and that of real-world programmers. Running the whole application on an FPGA is something of a new concept for many application areas, particularly in image processing, which involves a lot of data. In this context FPGAs have mostly been used just to preprocess the captured data.

In addition to efficiency, parallelism was an important factor in enhancing performance. Parallel processing constructs might also be reused in the same way that pipelining mechanisms are reused. The possibility of providing both mechanisms presupposes separate memory access.

If more than one process has write-access to the same memory at the same time, it's impossible to predict the results. Therefore, it's important to ensure that only one process has write-access to a given memory. If the system provides more than one memory, it's possible to process in parallel, as long as the processes are not data-dependent (for example, when output data of process 1 is input data for process 2). The two memories must also be separately addressable. If you have two memories with two separate address buses, you can address two memory cells at the same time.

Parallel computing means dividing a problem into smaller chunks and processing them in parallel with a group of tasks. Then the chunks are put together after each task has finished its work.

Figure 1: Pipelining in an assembly line

Pipelining can be compared with an assembly line in car manufacturing. Many assembly groups work in parallel, but each on a different stage of the sequence, as shown in Figure 1. The latency of this parallel execution model is the same as for the sequential execution model, but the throughput increases linearly with the number of working groups. Once the pipeline is filled, a car is completely assembled every clock period.
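The latency/throughput argument can be sketched with a tiny timing model in C. These are the standard pipeline cycle-count formulas, not measurements from the project; the item and stage counts are illustrative:

```c
/* Back-of-the-envelope timing model for an assembly line / pipeline:
 * n items pass through s stages, each stage taking one clock period. */

unsigned sequential_cycles(unsigned n, unsigned s) {
    return n * s;           /* one item must finish before the next starts */
}

unsigned pipelined_cycles(unsigned n, unsigned s) {
    return s + n - 1;       /* fill the pipeline once, then one item per period */
}
```

For 100 items and 5 stages, sequential execution takes 500 cycles while the pipeline takes 104; the first item still needs all 5 stages either way, which is the unchanged latency the text mentions.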

These ideas show that it's best to speed up an application using both mechanisms. The comparison was helpful to provide portability. But because FPGAs and mainframes handle parallel processes differently, it's important to think twice about algorithm design. The theory of parallelism is the same for both, but the decisive difference between them lies in their communication model and task management. Parallelism on high-performance computers needs more overhead for these functions because parallel processes are started at run time, as are the division of the problem into smaller chunks and their distribution to other processors.8

That's why most parallel sort algorithms are useful predominantly for large amounts of data. To be efficient, each task needs to process on its own. The efficiency of an algorithm can be described in terms of its complexity.9 An algorithm's complexity describes the amount of “hardware use” or the process time as a function of problem size. A hardware-efficient algorithm should therefore have low complexity. But what about small amounts of data? What if only a few elements need to be sorted in as little time as possible? These questions make it clear that in our case, execution time was essential. We had to handle a repeating problem: at regular intervals the camera sensor took an image that had to be classified in as short a time as possible. The faster the camera image could be classified, the faster the system could react. A higher rate of hardware use might be acceptable if it minimized processing time. On high-performance computers adding parallelism might not be worth the benefit because of the communication overhead it requires. Sometimes it's better to process sequentially.

This is the main difference between computers and chips like FPGAs. The communication overhead could be (almost) totally eliminated because of the FPGA's quick one-cycle access to its internal memory. The sorting problem can then be divided into chunks with smaller granularity. In addition to this, some parallel algorithms are useful only for FPGAs. High-performance computers are generally best for solving big and varying problems using parallelism. They are designed for scalability. In contrast, chip devices are best for solving small and repetitive problems using parallelism. Next we'll concentrate on algorithms for FPGAs with some examples using Handel-C constructs.

Traffic sign recognition
The question of solving small problems efficiently was also an essential objective of our project. A system to recognize traffic signs was built around an FPGA as an evaluation platform. This device has two memories that can be accessed over two separate buses, which makes it possible to switch from one memory to the other in two clock cycles, much as with internal memory. This allowed us to use the pipelining techniques described above. Data taken from the image sensor is stored in one of the two memories, the next image in the other, and so on. New data from the image sensor is stored in one memory while several processes compute on the data in the other. These processes directly compute the desired information from stored image data; no other system resources are involved in recognizing traffic signs. The system sends out information about identified traffic signs over a serial or CAN interface. We designed our “Intelligent Camera,” pictured in Figure 2, to demonstrate the portability of code.10
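The alternating use of the two memories is classic double buffering. A minimal sequential sketch in C of the bookkeeping (the buffer size, type names, and functions are illustrative, not from the project; in hardware the capture and the processing run in parallel over the two buses):

```c
#include <stdint.h>

#define IMG_SIZE 64  /* illustrative; a real camera frame is far larger */

typedef struct {
    uint8_t mem[2][IMG_SIZE];  /* the two separately addressable memories */
    int     write_idx;         /* which memory the sensor fills this frame */
} frame_buffers;

/* The sensor stores new data in one memory while the recognition
 * processes read the previous frame from the other memory. */
uint8_t *capture_target(frame_buffers *fb) {
    return fb->mem[fb->write_idx];
}

const uint8_t *process_source(const frame_buffers *fb) {
    return fb->mem[fb->write_idx ^ 1];
}

/* At the end of each frame the two memories swap roles. */
void end_of_frame(frame_buffers *fb) {
    fb->write_idx ^= 1;
}
```

Because the writer and the reader never touch the same memory in the same frame, the single-writer rule from the previous section is satisfied without any locking.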

Figure 2: The “Intelligent Camera”; scale in cm

The recognition algorithm uses standard image-processing operators.11,12 The whole application processes images taken from the image sensor in three steps. The first encapsulates the preprocessing work. The second encapsulates all algorithms that extract information from the color-classified picture. The output of this step is information about the properties of regions, such as their form, size, color, and position relative to other regions. The last step compares that information with models of traffic signs we provided during development.

We want to focus on the first step because it sometimes consumes a large part of the overall runtime. In this step, the color depth is reduced. For traffic signs, only some color information (red, blue, yellow) and so-called noncolor information (black or white) is relevant. Each pixel comes from the image sensor with a color depth of 24 bits (RGB coded). After reduction the color depth is just three bits. This speeds up the application because we have to handle less data. Afterwards, it's helpful to enhance image quality using filter operators.
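One simple way to picture the 24-bit-to-3-bit reduction is a bit per channel. The threshold and the mapping below are purely illustrative assumptions; the project's actual color classifier is not described in this article:

```c
#include <stdint.h>

#define THRESH 128  /* hypothetical cutoff; a real classifier is more refined */

/* Reduce a 24-bit RGB pixel to 3 bits: one bit per channel,
 * set when that channel is "strong". Black maps to 0b000,
 * white to 0b111, saturated red to 0b100, and so on. */
uint8_t reduce_color(uint8_t r, uint8_t g, uint8_t b) {
    return (uint8_t)(((r >= THRESH) << 2) |
                     ((g >= THRESH) << 1) |
                      (b >= THRESH));
}
```

Whatever the exact mapping, the payoff is the same as in the text: every later stage handles 3 bits per pixel instead of 24.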

Some of these operators work on every pixel and its nearest neighbors. The median operator, for example, improves the quality of the image by reducing noise. Using it on the whole image makes it necessary to sort nine data values for each pixel, as shown in Figure 3. That means that nine data values have to be sorted perhaps a million times or more.
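As a sequential reference for what must happen at every pixel, here is the median of a 3x3 neighborhood in plain C. The insertion sort is just one conventional way to do it in software; it is exactly this per-pixel sort that the two-cycle hardware algorithm replaces:

```c
#include <string.h>

/* Median of the nine values of a 3x3 neighborhood: sort a copy
 * and take the middle element. Fine in software, but far too
 * slow to repeat a million times per frame in hardware. */
unsigned char median9(const unsigned char v[9]) {
    unsigned char s[9];
    memcpy(s, v, 9);
    for (int i = 1; i < 9; i++) {       /* simple insertion sort */
        unsigned char key = s[i];
        int j = i - 1;
        while (j >= 0 && s[j] > key) {
            s[j + 1] = s[j];
            j--;
        }
        s[j + 1] = key;
    }
    return s[4];                        /* middle of nine sorted values */
}
```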

Figure 3: Use of median filter

To tame this seemingly intractable problem, we used an algorithm that takes only two clock cycles to sort data, regardless of size. Recall that Handel-C tries to convert each line of source code into hardware that executes in a single clock cycle. Each line of code therefore should not be too complex or the system might not meet outside timing constraints.

The whole process of sorting n elements could be divided into several tasks. The question is, which of these tasks can be processed in parallel with the others? For starters, they must be completely independent of data from any other tasks. No task's output should be required as input data for another. And as mentioned above, it's important to inspect the number of elements before choosing an algorithm. It's not worth bothering with a complex algorithm when there are only two elements to be sorted. In most cases a procedure like this, shown in Listing 1, would be fine.

Listing 1: Simple way of sorting two elements

if(x<=y) {
  data[0]=x;
  data[1]=y;
} else {
  data[0]=y;
  data[1]=x;
}

This procedure will become more complex with more elements, however. For three elements, six branches are needed, as shown in Listing 2.

Listing 2: Complexity of hardware use is n!

if((x<=y) & (x<=z) & (y<=z)) {
  data[0]=x;
  data[1]=y;
  data[2]=z;
} else if((x<=y) & (x<=z) & (y>z)) {
  data[0]=x;
  data[1]=z;
  data[2]=y;
} else if ...

These constructs become badly organized as more elements need to be sorted. But it's interesting how FPGAs can process this algorithm. Listing 3 uses Handel-C par constructs.

Listing 3: Sorting three elements in only one clock cycle

if( (x<=y) & (x<=z) & (y<=z) ) {
  par {
    data[0]=x;
    data[1]=y;
    data[2]=z;
  }
} else if((x<=y) & (x<=z) & (y>z)) {
  par {
    data[0]=x;
    data[1]=z;
    data[2]=y;
  }
} else if ...

A par construct instructs the synthesis tool to process the statements in parallel. Therefore, this whole example is executed in only one clock cycle. The problem, however, is that the complexity of hardware use for n elements is n!. For the median filter sorting nine elements, that works out to 362,880 branches. The condition of each decision will be very complex, and it's likely that timing constraints could not be met. This makes the algorithm useless for more than three elements. But this procedure shows that it's possible to sort data in constant time, independent of size.

There must be a way to get the same results with less hardware use. Indeed, only a small amount of the data needed by the median filter actually has to be sorted, but it's too complex to compare every possible combination. A feasible step is to treat every element in parallel, as shown in Figure 4. Then the complexity will still be linear. Splitting every task into more tasks is also practicable. These tasks should determine the position of each element, which makes it possible to put the elements into order.

Figure 4: Idea of sorting some more elements

Figure 5: Convention to prevent conflicts of positioning elements

To find out an element's position, each element must be compared with each of the other elements. This implies n-1 comparisons for each of n elements. In this case the hardware complexity is n*(n-1). For nine elements there will be 72 comparisons of two elements. This is a clear improvement over the first procedure, and the timing constraints would not be at risk. To prevent calculating the same position for two elements with the same value, we adopted the following convention, shown in Figure 5:

  • When comparing an element on the left side, use the <= operator.
  • When comparing an element on the right side, use the < operator.
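A sequential C model of this position computation may make the convention concrete (the FPGA performs all of these comparisons at once; here they run in a loop). Because ties are broken by index, two equal values can never claim the same position:

```c
#define SIZE 9

/* Sequential model of the position computation: an element's final
 * position is the number of elements that must come before it.
 * Per the convention above, <= is used toward one side and < toward
 * the other, so equal values still get distinct positions. */
void position_sort(const int u[SIZE], int s[SIZE]) {
    for (int i = 0; i < SIZE; i++) {
        int pos = 0;                        /* the "sum of digits" */
        for (int j = 0; j < SIZE; j++) {    /* n-1 real comparisons */
            if (j == i)
                continue;                   /* comparing with itself is superfluous */
            if (j < i ? u[j] <= u[i] : u[j] < u[i])
                pos++;                      /* element j sorts before element i */
        }
        s[pos] = u[i];                      /* scatter into sorted order */
    }
}
```

For SIZE of 9 this is exactly the 72 pairwise comparisons counted above; the hardware version simply evaluates them all in the same clock cycle.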

To prevent race conditions on the same variable, the result of each comparison has to be stored separately. These results are binary values, so we declare an array of binary values to store all the results. The sum of digits of this array determines the desired position of the element. Figure 6 shows an illustration. The highlighted value is compared with all other values. If a value is less than the highlighted value (here, 6), a flag of 1 is set in the result array; otherwise the flag is cleared to 0.

Figure 6: Reducing a problem of sorting to a problem of determining the “sum of digits”

All comparisons will be done at the same time. Now it's time to present an implementation of this procedure, shown in Listing 4.

Listing 4: Implementation of replication sort

1  par (element=0; element<SIZE; element++) {
2    {
3      par (element2=0; element2<SIZE-1; element2++) {
4        ifselect (element>element2) {
5          if(uList[element] > uList[element2])
6            comp[element][element2] = 1;
7        } else ifselect (element<=element2) {
8          if(uList[element] >= uList[element2+1])
9            comp[element][element2] = 1;
10        }
11      }
12
13      position[element] = SUM_OF_DIGITS(comp[element]);
14      sList[position[element]]=uList[element];
15    }
16  }

This example uses an unsorted list as input data (variable uList). The output is stored in a sorted list (variable sList). A par statement such as the one in line 1 replicates all included statements. We'll call this a par replication. These statements will be processed in parallel; in this case, there is a parallel branch for every element. In line 3, a second par replication is used. It includes the comparisons with all other elements. This doubled replication performs 72 comparisons at the same time when SIZE has the value of 9. That's why we call the algorithm replication sort. The result of each comparison is stored in an array of binary values (variable comp). An array of nine values is needed for each element, which is why a two-dimensional array is used in this example. The statement in line 13 computes the sum of digits. As mentioned above, this sum represents the position of each element for bringing them into order.

With this procedure we reduced the problem of sorting to a problem of computing the sum of digits. There are several ways to compute it. Depending on the timing constraints of the desired design, it might be possible to sum all data values in one clock cycle like this:

#define SUM_OF_DIGITS(data) (data[0] + data[1] + data[2] + …)

If the number of elements is too large, it's possible that external timing constraints might not be met. In that case, another way to compute the sum of digits uses a lookup table. For nine elements, use a table with 256 entries. Only eight bits are needed because comparing an element with itself is superfluous. The lookup table stores the sum of digits for every possible value. Then the determination could be done by accessing this table in only one clock cycle. This procedure has the disadvantage of wasting some memory. The data type for coding nine possible positions requires four bits, so the amount of memory is 128 bytes.

For a greater number of elements, the amount of memory needed might be unacceptable. In that case, the sum of digits can be determined by accessing the table twice or more with several parts of the data type. For example, to determine the sum of digits of a 16-bit value with the 8-bit table above, make two accesses to the table, first with the high-order 8 bits and then with the low-order 8 bits. The sum of both results is the desired value. So it's possible to combine both procedures and still meet your timing constraints.
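Both sum-of-digits variants are easy to model in C. The sketch below builds the 256-entry table (a ROM on the FPGA) and then composes a 16-bit lookup from two 8-bit accesses as described above; the function names are illustrative:

```c
#include <stdint.h>

static uint8_t sum_table[256];  /* sum of digits for every 8-bit value */

/* Fill the table once; on the FPGA this would be a precomputed ROM.
 * Each entry holds the number of 1-bits in its index. */
void build_sum_table(void) {
    for (int v = 0; v < 256; v++) {
        int s = 0;
        for (int b = 0; b < 8; b++)
            s += (v >> b) & 1;
        sum_table[v] = (uint8_t)s;
    }
}

/* Sum of digits of a 16-bit flag word via two table accesses:
 * high-order byte first, then low-order byte, as in the text. */
int sum_of_digits16(uint16_t flags) {
    return sum_table[flags >> 8] + sum_table[flags & 0xFF];
}
```

The 4-bit-wide, 256-entry table matches the 128 bytes quoted in the text; splitting wider words only costs one extra access per additional byte.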

In our particular application, it was possible to combine lines 13 and 14. That's why the whole process of sorting nine elements could be done in only two clock cycles.

Beyond two clocks
The usefulness of an algorithm depends on the underlying system architecture and the number of elements to sort. Parallel sorting will speed up the application, but it would need more computing power. This article shows a simple way to sort small amounts of data in only two clock cycles using an FPGA. This solution reduced the problem from a sorting problem into simply summing digits. If it's not acceptable to spend so much computing power, there are ways to sort data in three or more clock cycles with less hardware required.

Stefan May is an electrical engineer at Audi in Ingolstadt, Germany, where he analyzes and designs end-of-line test systems for electronic devices. He was involved in the project “Exploration of the Possibilities for the Direct Synthesis of Concurrent C Programs on High-performance Computers in FPGAs” during completion of his master's thesis at the University of Applied Sciences in Nuremberg between 2003 and 2004. You can reach him at .

Prof. Dr. Peter Urbanek had 10 years of industrial experience in the development of LAN testers at Siemens AG in Nuremberg, Germany, before his appointment in 1994 to the Georg-Simon-Ohm University of Applied Sciences in Nuremberg. His teaching areas are digital and microcomputer technology. He is also a director for the Bavarian Universities of Applied Sciences in the Competence Network for Technical, Scientific High-Performance Computing. You can reach him at .


  1. Leighton, Frank Thomson. Introduction to Parallel Algorithms and Architectures — Arrays, Trees, Hypercubes . San Mateo, California: Morgan Kaufmann Publishers, Inc., 1992.
  2. Project homepage of the Competence Network for Technical, Scientific High-Performance Computing in Bavaria, October 16, 2004.
  3. Urbanek, Peter and Stefan May. Final KONWIHR Research Report: “Exploration of the Possibilities for the Direct Synthesis of Concurrent C Programs on High-performance Computers in FPGAs,” Berlin/Heidelberg: Springer, expected in Spring 2005.
  4. OpenMP Architecture Review Board, “OpenMP C and C++ Application Program Interface Version 2.0,” 2002.
  5. Bowen, Matthew, et al. Handel-C Language Reference Manual Version 2.1 , Embedded Solutions Limited, 2001.
  6. Celoxica Limited. “Handel-C Application Note— Optimization,” 2001.
  7. Stöcklein, Thomas. “Programmiersprache Handel-C zur effektiven Entwicklung von synthetisierbaren Systemen und Algorithmen,” diploma thesis at the department of communications and precisions engineering of the University of Applied Sciences in Nuremberg, summer term 2000.
  8. Rauber, Thomas and Gundula Rünger. Parallele und verteilte Programmierung , Berlin/Heidelberg: Springer, 2000.
  9. Reß, Harald and Günter Viebeck. Datenstrukturen und Algorithmen— Objektorientiertes Programmieren in C++ , München/Wien: Carl Hanser, 2000.
  10. Project site of the “Intelligent Camera,” Institut für Elektronische Systeme (ELSYS), October 16, 2004.
  11. Ritter, Werner: Automatische Verkehrszeichenerkennung , Koblenz: Fölbach Verlag, 1997.
  12. Zimmer, Wolf D. and Eberhard Bonz. Objektorientierte Bildverarbeitung— Datenstrukturen und Anwendungen in C++ , München/Wien: Carl Hanser, 1996.
