Doing Hartley Smartly - Embedded.com

Doing Hartley Smartly



To read original PDF of the print article, click here.

Doing Hartley Smartly

Robert Scott

Think the fast Fourier transform is fast enough?Think again. Meet the fast Hartley transform.

Transforming a series of samples taken in the time domain into a frequency domain representation is useful in a wide variety of signal processing applications. This includes direct analysis of the frequency spectrum as well as operations like convolutions, which are used in digital filtering. The usual tool for performing this transformation is the Fourier transform, or to be more specific, the fast Fourier transform (FFT). But for most applications, there is an even faster method called the Hartley transform. This article reviews some published data on the fast Hartley transform (FHT) and then develops an optimized implementation of this algorithm.

A typical FFT application starts with a series of samples taken in the time domain. For example, suppose you want to determine the vibration characteristics of an automobile engine. You could attach an accelerometer to the engine block and sample the output of the accelerometer at regular intervals. Suppose you take 4,096 samples over a one-second period. These 4,096 samples would then be transformed, using an FFT, into an array of 4,096 elements where each element specifies the vibrational energy at a particular frequency. The frequencies represented in the FFT array run from 0Hz to 4,095Hz in steps of 1Hz. So if you needed to determine the total energy in the frequency band from 200Hz to 300Hz, you could add up the amplitudes of FFT entries 200 through 300. Although this example is described using the FFT, the FHT works just as well.

When I first saw an article about the Hartley transform, I must admit that I didn't give it serious consideration. The formulae looked similar to the ones for the Fourier transform, except that the imaginary, i, was missing. I thought there must be some catch. But then, years later, I ran across that article again and decided to delve into it to see what the story was. I was so impressed with what I learned that I immediately wrote a little freeware utility for general aviation enthusiasts-a PC program to measure the RPM of an airplane engine based on the sound of that engine. The program (available at www.wwnet.net/~rscott/rpmsound.html) uses the Hartley transform to find the peaks in the frequency spectrum. The pattern and location of the frequency peaks determine the RPM.

Hartley transform defined
The Hartley transform was first proposed in 1942 by Ralph Hartley. This is the same fellow, by the way, who invented the well-known Hartley oscillator circuit. Just as in the Fourier transform, the Hartley transform starts with a sequence of samples in the time domain. Let X(t) for t = 0 … N–1 be such a sequence. The Hartley transform of this sequence is another sequence, H(f) for f = 0 … N–1 given by:

If, for some reason, you want to undo this transform, the formula is almost identical, except for the absence of the 1/N factor:

The first thing that made me suspicious about these formulae was that the sum of the cosine and the sine of an angle is just another phase-shifted sine function (shifted by 45 degrees, to be exact). What could be so magical about correlating against a sine function that was shifted by an arbitrary amount? Well, the answer is “nothing.” In fact, it is possible to define a whole class of reversible transforms based on other phase-shifted sine functions. But the specific sine function used in the Hartley transform exhibits a certain symmetry between the transform and its inverse. And when it comes to implementing the transform, we will see that we end up taking pure sines and cosines anyway and not their sums.

Comparisons with the FFT
In a 1988 article in Byte,1 Mark O'Neill gave formulae similar to the ones shown above; but note the typo in his Equation 5, which I have corrected. In addition to introducing the Hartley transform, O'Neill went on to develop the fast Hartley transform, based on a 1984 article by Ronald Bracewell in Proceedings of the IEEE.2While the Fourier transform works in the complete generality of complex-valued sequences, the Hartley transform only works with real-valued sequences. This should not be a problem for most real-world applications, which are limited to real-valued time domain samples anyway. To show that nothing is lost in using the Hartley transform, O'Neill gave the following formulae for the real and imaginary parts of the Fourier transform in terms of the Hartley transform:

It turns out that using the Hartley transform to compute the Fourier transform is faster than computing the Fourier transform directly. This is an important use of the Hartley transform because the Fourier transform has immediate natural interpretation while, as far as I know, the Hartley transform does not. Each complex value in the Fourier transform represents the amplitude and phase of a particular frequency component in the frequency spectrum of a time domain sequence. Often only the amplitude is interesting, so applications will then form the array of magnitudes of the complex-valued Fourier series. The implementation of the Hartley transform described here provides for a direct calculation of this array of amplitudes.

Note that Equation 2 leads to out-of-range indices when f = 0, since there is no H(N). The interpretation of the indices must, therefore, be extended modulo-N. Therefore, H(N) takes the value H(0). And, in fact, this modulo-N interpretation of the indices is supported by the formulae that define the Hartley transform, which give the same values if you add N to the indices.

If you have ever used the Fourier transform to process real-valued data, you may have been aware of a sense of wasted effort. To begin with, you probably stuffed your real-valued sequence into the real parts of an array of complex numbers and zeroed out the imaginary parts. Having all those zeroes using up space but providing no information always bothered me. Then, when the Fourier transform was done, the resulting array of complex numbers had a lot of redundancy. In particular, you never had to look past the midpoint of the array (the so-called Nyquist frequency) because the frequency domain data in the second half of the array was a kind of mirror image of the first half. The complex-valued F(N–f) is just the complex conjugate of F(f). (This is only true for the Fourier transform of real-valued sequences. In general, the Fourier transform has no such redundancy.) When I saw how the power of the Fourier transform was being “wasted” on real-valued sequences, I finally understood how something like the Hartley transform could be more efficient.

Butterflies
Now let's get down to business and see how the Hartley transform can be implemented efficiently as the FHT. The development is similar to that of the FFT, which is not presented here but can be found in any textbook on digital signal processing.

Figure 1: A single FHT butterfly

The heart of the algorithm is a computation whose data-flow diagram looks like a butterfly. But unlike the symmetric butterfly of the FFT, the FHT butterfly, as shown in Figure 1, is a bit ugly. It has a vestigial third wing.

From now on we will be assuming that N is a power of two. Since N is even, the time sequence, X(t), which appears in the definition of the Hartley transform (Equation 1), can be divided into two intertwined sequences, X0 and X1, given by:

That is, sequence X0 consists of the even-indexed values from sequence X and sequence X1 consists of the odd-indexed values. As before, we will consider all indices to be interpreted modulo-N. But since X0 and X1 are defined in terms of 2t, they actually repeat modulo N/2. So consider X0 and X1 as sequences of length N/2 and let H0 and H1 be their Hartley transforms. Let M stand for N/2. Then, by the definition of the Hartley transform in Equation 1 we have:

I leave it as an exercise for the reader to verify that:

(If you are reading the Byte article, note that this equation is incorrectly shown with Ns instead of N.) It took me about half an hour to verify this equation with paper and pencil, using only the trig identities:

If you want to develop an appreciation of the basis of the Hartley transform, I recommend performing the exercise.Keeping in mind that the Hartley transforms on the right side of Equation 3 repeat modulo N/2, we can recognize the following symmetry:

Equations 3 and 4 taken together define the complete FHT butterfly computation, as diagrammed in Figure 1. These two equations express the Hartley transform of length N in terms of two smaller (half-length) Hartley transforms. This is the beginning of a recursive process. At each step in the recursion we recognize that all of the Hartley transforms from the previous step can be further broken down into Hartley transforms of half the length. The process continues until we have only Hartley transforms of length 1, which can be trivially computed from Equation 1.

Figure 2: FHT data flow

At this point I will abandon mathematical rigor and proceed by example alone. Suppose N = 16. Figure 2 shows a data-flow diagram that illustrates the recursive decomposition of a Hartley transform of length 16. Although the data-flow diagram is computed from left to right, we shall analyze it from right to left. On the far right, the numbers 0 to 15 represent the values of H(0) through H(15), which is the Hartley transform of the original sequence X(0) through X(15).

If you let f run from 0 to 7 in Equations 3 and 4, you can see how the 16 values of H can be computed from the two smaller Hartley transforms, H0 and H1, each of length 8. The right-most column of data-flow lines in Figure 2 illustrates this process as the overlay of eight instances of the butterfly diagram from Figure 1. The lines have been omitted where the corresponding trig factor is zero. In the column of numbers to the left of the final butterfly diagrams, the two sets of indices, 0 through 7, represent the value of H0 and H1 with H0 on top. For example, the first three instances of Equations 3 and 4 in this column of butterflies is:

Note that the indices of H0 and H1 have been reduced modulo 8, since they are Hartley transforms of length 8.

The decomposition then continues with the next column to the left, showing how H0 and H1 are each represented in terms of Hartley transforms of length 4. H0 is decomposed into H00 and H01, while H1 is decomposed into H10 and H11. The numbers 0 through 3 are indices into these smaller Hartley transforms.

The recursion process continues all the way back to the first column, where we would have 16 Hartley transforms of length 1. Equation 1 shows that a Hartley transform of a time sequence of length 1 is just the original time sequence itself. So the first column represents values from the original time sequence X(t). The numbers in that column are the indices into the original time sequence, X(t). But how did they get into that strange order? To answer this question we must determine exactly which time sequence is transformed into each of the Hartley transforms in Figure 2. Starting again at the right, we know that the full Hartley transform is derived from the original time sequence X(0) through X(15) in normal order. Going back one step in the recursion and recalling how the smaller Hartley transforms in Equations 3 and 4 were defined, we see that H1 is the Hartley transform of the sequence X(0), X(2), …, X(14) and H1 is the Hartley transform of the sequence X(1), X(3), … , X(15). Purely as a computational convenience, we place the time sequence for H1 after the time sequence for H0. We have effectively sorted the sequence, X, according to the least-significant bit of its indices.In going from Hartley transforms of length 8 to Hartley transforms of length 4, we further refine the sort on X based on the next-least-significant bit of its indices. That is:

H00 is based on {X(0), X(4), X(8), X(12)}H01 is based on {X(2), X(6), X(10), X(14)}H10 is based on {X(1), X(5), X(9), X(13)}H11 is based on {X(3), X(7), X(11), X(15)}

At each stage in this progressive sort, the results of the previous sort are preserved. Therefore when we get back to the beginning of the data-flow diagram, we have the original sequence, X, listed in an order where the significance of the bits in the indices has been reversed. This is called bit-reversed order, and it is just as applicable to the FFT as it is to the FHT. Thus, to perform the FHT, we must first reorder the sequence, X, into bit-reversed order. Then we perform the butterfly computations as they are diagramed in Figure 2.

Implementation
If you are familiar with the structure of the FFT, you will recognize similarities in the FHT. One important difference is that you cannot do the FHT “in place.” Computing the transform “in place” means that your initial array of time domain values is used to hold all the intermediate calculations and the resulting transform ends up in the same array. This is possible with the FFT because the butterfly operations all have two inputs and two outputs which are in the same array location as the inputs. You can load your time domain sequence into an array and then perform butterfly calculations two elements at a time. The reason this is not possible with the FHT is that the butterfly operation takes three inputs to produce two outputs. If the outputs are immediately stored in the original array, then an input to some other butterfly operation will get overwritten before it is used.

When we calculate the FHT, we must have two arrays. Each round of calculations uses one array for input (the FromArray in the code) and the other array for output (the ToArray in the code). Which array the final transform ends up in depends on whether you had to do an even or an odd number of butterfly calculation rounds. In any case, this inability to do a transform in place contradicts the opening claim of O'Neill's article that only half the memory is needed for the FHT. It is true that the FHT uses only real-valued arrays instead of the complex-valued arrays needed by the FFT, and so the arrays use half as much memory. But you need two of those arrays with the FHT and only one array with the FFT, therefore the memory usage is a wash.

In general, each round of calculations requires N/2 evaluations of Equations 3 and 4. Each such evaluation involves a sine, a cosine, two multiplications, and two additions. To make things quick, the trig values can be in pre-computed lookup tables. Additionally, we recognize that in each butterfly calculation, the terms in Equation 3 are the same as the terms in Equation 4, except for a difference in signs. Referring to the example of N=16, this allows us to calculate terms H(0) through H(7) and get H(8) through H(15) almost for free. This also means that we only need values of sine and cosine for angles from zero to just under 180 degrees. In fact, I choose to implement the smallest possible trig lookup table, namely, the values of sin(x) in the first quadrant (zero to 90 degrees). The values of sine in the second quadrant as well as the values of cosine in both quadrants will come from this table using the trig identities:

Thus to perform a Hartley transform of length N, we only need to pre-compute the N/4 values of sin(x) in the first quadrant. In order to simplify accessing this table, I have separate code for the first and second quadrants. The third and fourth quadrants are done as a byproduct using the symmetry noted in Equation 4.

Let's jump right to the heart of the implementation. The inner-most loop is essentially:

while(in first quadrant){  From++; To++; FromRetrograde--;  SinPtr += ButterflyBlocks;  CosPtr -= ButterflyBlocks;  CSTerm = *CosPtr *         From[ButterflyWidth]      + *SinPtr * FromRetrograde[0];  To[0] = From[0] + CSTerm;  To[ButterflyWidth] = From[0] -       CSTerm;}
Listing 1: Working implementation of FHT

// Fast Hartley Transform - optimized for speed#include typedef float DataType;void InitTrigTable(DataType *TrigPtr, int nBits)// Pre-compute sine function for first quadrant only{  unsigned int dataNdiv4,i;  DataType RadianIncrement;  dataNdiv4 = 1 << (nBits-2);  RadianIncrement = 3.141592653589793238 / 2 / dataNdiv4;  for(i=0; i<#60;dataNdiv4; i++)    TrigPtr[i] = sin( RadianIncrement * i );}DataType *FastHartleyTransform( DataType *DataArray,                 DataType *WorkArray,                 DataType *TrigPtr,                 int nBits ) // DataArray[0...N] are to be transformed. Results will be in either // DataArray[] or WorkArray[]. This function returns a pointer to // whichever array it is. TrigPtr is a pointer to a pre-computed // table of sines for angles in the first quadrant. If there are // dataN elements in the first two arrays, then there are dataN/4 // elements in TrigPtr[], and nBits is the log base 2 of dataN. // dataN must be a power of 2 and be at least 16.{  DataType *FromArray,*ToArray,*SwapTemp;  DataType *From,*To,*FromRetrograde;  DataType *To_ArrayEnd,*To_QuadrantEnd;  DataType *SinPtr,*CosPtr;  unsigned int ButterflyBlocks,ButterflyWidth;  unsigned int dataN,baseN,TrigArraySize;  unsigned int RegularBase,ReversedBase;  dataN = 1 <#60;<#60; nBits;  TrigArraySize = dataN/4;  From = DataArray;  To = WorkArray;  // Reorder the From[] array and place it in the To[] array...  // according to bit-reversed indices.  baseN = dataN/16;  for(RegularBase=0; RegularBase<#60;baseN; RegularBase++)  {  unsigned int RemainingRegularBase;  unsigned int Regular,Reversed;  int BitsToMove,nibble;   static const unsigned char bitrev16[] =    {0,8,4,12,2,10,6,14,1,9,5,13,3,11,7,15};  ReversedBase = 0;  RemainingRegularBase = RegularBase;  BitsToMove = nBits - 4;  while(BitsToMove)  {   if(BitsToMove >#62;= 4)   {    ReversedBase = (ReversedBase<<4)     + bitrev16[RemainingRegularBase & 15];    RemainingRegularBase >>= 4;    BitsToMove -= 4;   }   else //(less than 4 bits to move...)   {    ReversedBase = (ReversedBase<<1)     + (RemainingRegularBase & 1);    RemainingRegularBase >>= 1;    BitsToMove--;   }  }  ReversedBase <<= 4;  Regular = RegularBase;  for(nibble=0; nibble<16; nibble++)  {   Reversed = ReversedBase + bitrev16[nibble];   if(Regular <= Reversed)   {    To[ Regular ] = From[ Reversed ];    To[ Reversed ] = From[ Regular ];   }   Regular += baseN;  } }// Perform the first two rounds of FHT at once From = WorkArray; To = DataArray; To_ArrayEnd = To + dataN; while(To != To_ArrayEnd) {   DataType a,b,c,d;  To[0] = (a=From[0] + From[1]) + (b=From[2] + From[3]);  To[1] = (c=From[0] - From[1]) + (d=From[2] - From[3]);  To[2] = a - b;  To[3] = c - d;  To += 4;  From += 4; }// Start general butterfly calculations with round 3... FromArray = DataArray; ToArray = WorkArray; ButterflyWidth = 4;     // first two rounds are done... ButterflyBlocks = dataN/8; // these parameters are set for round 3. while( ButterflyBlocks ) {  From = FromArray;  To = ToArray;  To_ArrayEnd = ToArray + dataN;  while(To != To_ArrayEnd)  { // This while loop does a single butterfly block    DataType CSTerm;     // Special 0 degrees case...   To[0]  = From[0] + From[ButterflyWidth];   To[ButterflyWidth] = From[0] - From[ButterflyWidth];     // ..then the rest of the first quadrant...   To_QuadrantEnd = To + ButterflyWidth/2;   SinPtr = TrigPtr;   CosPtr = TrigPtr + TrigArraySize;  // beyond end of trig array   FromRetrograde = From + ButterflyWidth + ButterflyWidth;   while(1)   {    From++; To++; FromRetrograde--;    if( To == To_QuadrantEnd ) break;    SinPtr += ButterflyBlocks;    CosPtr -= ButterflyBlocks;    CSTerm = *CosPtr * From[ButterflyWidth] +            *SinPtr * FromRetrograde[0];    To[0]  = From[0] + CSTerm;    To[ButterflyWidth] = From[0] - CSTerm;   }     // Special 90 degrees case...   To[0]  = From[0] + FromRetrograde[0];   To[ButterflyWidth] = From[0] - FromRetrograde[0];     // ..then the rest of the second quadrant...   To_QuadrantEnd = To + ButterflyWidth/2;   while(1)   {    From++; To++; FromRetrograde--;    if( To == To_QuadrantEnd ) break;    CSTerm = *SinPtr * FromRetrograde[0] -            *CosPtr * From[ButterflyWidth];    To[0]  = From[0] + CSTerm;    To[ButterflyWidth] = From[0] - CSTerm;    SinPtr -= ButterflyBlocks;    CosPtr += ButterflyBlocks;   }   From += ButterflyWidth;   To += ButterflyWidth;  }  ButterflyWidth <<= 1;  ButterflyBlocks >>= 1;  SwapTemp = FromArray;  // Swap “from” and “to” arrays  FromArray = ToArray;  ToArray = SwapTemp; }   // end of while( ButterflyBlocks ) loop return FromArray;}void PowerSpectrum(DataType *power, DataType *dht, int nBits) // converts Hartley transform dht[0...N-1] into power spectrum // power[0...N/2]. The power spectrum array is one entry longer than // half as long as the DHT because it is symmetrical after the // mid-frequency anyway. It is OK to call with power=dht and use // only the first N/2+1 elements of dht on return.{  unsigned int N,i,NminusI; N = 1 << nBits; for(i=0; i <= N/2; i++) {  NminusI = (i==0)? 0 : (N-i);  power[i] =   sqrt( (dht[i] * dht[i] + dht[NminusI] * dht[NminusI]) / 2 ); }}void ScaleHartleyTransform(DataType *data, int nBits){  unsigned int i,N;  DataType Nrecip; N = 1<		

This has been extracted and simplified from Listing 1. The actual implementation of Equation 3 is in the calculation for To[0]. The calculation for To[ButterflyWidth] shows the use of the symmetry of Equation 4. As you can see, CSTerm only needs to be calculated once for each pair of Hartley transform values. Both SinPtr and CosPtr point into the pre-computed sine lookup table, but they run in opposite directions to give the sine and cosine. The values for ButterflyWidth and ButterflyBlocks change for each round.

ButterflyWidth is the index offset between the two outputs of the butterfly. It starts at four for round three and doubles in each round. ButterflyBlocks is the number of Hartley transforms on the output side of each round. It is cut in half after each round, ending up at one for the last round.

I should point out here that the code actually computes double the Hartley transform at each row because we ignore the factor of 1/2 in Equations 3 and 4. This makes our final result too large by a factor of N. In many applications this error is irrelevant since only relative values are used. But if correct scaling is required, a separate scaling function, ScaleHartleyTransform(), is provided in the code. Without calling this function you actually get the reverse Hartley transform.

You will notice that two of the three terms in Equation 3 involve indices that increment as the target index increments. But the third term in that equation uses a backwards-running index. This has been called the “retrograde index” in the literature. It is what I call the ugly third wing of the FHT butterfly. It is represented by FromRetrograde in the code.

Bit reversal and special cases
The transform code starts by permuting the input array according to the bit reversal pattern described earlier. This could take a lot of time if done in a brute-force fashion, peeling off one bit at a time. I have chosen to do it four bits at a time by using a pre-computed table of 16 bit-reversed nibbles. I generate all N bit-reversed pairs and, for each pair I interchange and transfer the input array values to the other array if the pair of indices is increasing or equal. If I did not check the relative magnitude of the paired indices, we would be transferring most values twice, which would not be bad but would waste some time. Note that the bit reversal could have been done in-place, but the other work array was available so I used that instead. Because of the assumptions made in this optimized bit reversal, the implementation is limited to transforms of sequences whose length is at least 16.

The first two rounds of the FHT are so simple that I do them all at once as a special case. They involve only degenerate trig values (0 or 1). Also, the first computation of every set of butterflies is special because it has only two terms and no trig lookups. I have, therefore, structured the code to compute the butterflies for zero degrees and 90 degrees as special cases.

Interface detailsIn order to use the FHT code, the calling program must first generate a trig table of the appropriate size. The function InitTrigTable() is provided for just that purpose. Thereafter the client code may call FastHartleyTransform() by providing the appropriate parameters. Notice that the client code is responsible for allocating all the arrays, and that the resulting transform may be in either the original input array or the work array. Also note that no static variables are used, so this implementation is re-entrant. The return value of FastHartleyTransform() is a pointer to the correct result array.

As noted earlier, unlike the Fourier transform, the Hartley transform has no natural interpretation as a frequency spectrum. The most natural use for the Hartley transform is as a means to develop the Fourier transform, as shown by Equation 2. The magnitude of the first half of the Fourier transform is calculated by the PowerSpectrum() function.

Performance
Classically, FFT performance has been evaluated by counting the number of multiplications, additions, and subtractions that are involved. In these terms, the FHT does very well. If we disregard the simplicity of the first two rounds, we have each round requiring N multiplications and 2N additions or subtractions. The number of rounds is log2N. By comparison, the FFT requires 2N multiplications and 7N/2 additions or subtractions in each round. In both cases these numbers assume that no redundant operations are performed.But this may not tell the whole story. On modern processors with floating point capability, the time required to do arithmetic may not dominate the overall running time. Time spent in computing indices and other “overhead” operations may actually take more time than all the multiplications, especially in some of the demonstration programs that are given as examples of a Hartley implementation (as in O'Neill1 ).

The implementation in Listing 1 is meant as a real

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.