Develop FFT apps on low-power MCUs

Paul Holden

October 19, 2005

Paul HoldenOctober 19, 2005

Now that low-power microcontrollers are starting to include peripherals that were formerly the reserve of larger microprocessors, ASICs, or DSPs, you've got new opportunities to compute complex algorithms at low power levels. This article describes a Fast Fourier Transform (FFT) application developed using a low-power microcontroller that includes a single-cycle hardware multiplier. The application computes, in real-time, the spectrum of an input voltage (VIN in Figure 1). To accomplish this, an analog-to-digital converter (ADC) samples VIN and transfers the sample data to the microcontroller. The microcontroller then performs a 256-point FFT on the samples to obtain the spectrum of the input voltage. For testing purposes, the microcontroller calculates the magnitude of the spectrum and transfers the results to a PC where they are displayed (again in real-time).

Figure 1: FFT application-system overview

The FFT application's firmware is written in C for a 16-bit low-power microcontroller from the Dallas/Maxim Semiconductor's MAXQ2000 family.

To determine the spectrum of the sampled input signal, we need to compute the Discrete Fourier Transform (DFT) of the input samples. The DFT is defined as:


where N is the number of samples, X(k) is the spectrum, and x(n) represents the input samples. Expanding this summation using Euler's identity and separating the input samples and spectrum into their real and imaginary components yields the following equations:



The second term in the summation of Equations 2 and 3 disappears because the input samples consist entirely of real numbers. Assuming we have N samples, computing Equations 2 and 3 directly requires 2N2 multiplications and 2N(N-1) additions. Therefore, our DFT with 256 input samples would require a whopping 131,072 multiplications and 130,560 additions. Enter the FFT!

Many different FFT algorithms exist. The common radix-2 algorithm, used in this implementation, continuously decomposes the DFT into two smaller DFTs. For this to be possible, N must be a power of 2. The steps involved in the radix-2 FFT algorithm can be summarized with the butterfly computations illustrated in Figure 2. Observing these butterfly computations, we can determine that the radix-2 algorithm requires only (N/2)log2N multiplications and Nlog2N additions. The values of WN used in Figure 2 are commonly known as the "twiddle factors" and can be computed before the algorithm is performed.

Figure 2: Butterfly computation used to perform an FFT for N=8

In Figure 2, the input to the FFT is shown in a peculiar order. It's the original order with the indices bit-reversed. Therefore, computing the radix-2 FFT algorithm with N=8 requires the input data to be reordered from 0 (000b), 1 (001b), . . . to 0 (000b), 4 (100b), . . . .

The FFT output appears in the correct order. Figure 2 also reveals that the results of the individual butterfly computations are the only data required for the next stage of the FFT. Because the computations are done "in place," old values can replace new values and only 2N variables are required to compute an FFT with N samples (2N variables are required because each value has a real and an imaginary part).

When the FFT is complete, the results are in complex notation. Equations 4 and 5 convert the results into polar notation:



The DSP literature includes many optimizations for the DFT/FFT algorithm described above to make it faster and smaller. One of the most important optimizations (and also perhaps the easiest to implement) arises from the realization that the DFT of a real-valued signal (with even N) is symmetric around X(N/2), therefore:


Implementation issues
Writing code to solve a DFT isn't simple and the many realities about low-power microcontrollers further complicate writing DFT algorithms for these devices. For example, these microcontrollers generally have:

Limited memory. The microcontroller chosen has just 2KB of RAM. Knowing that the algorithm requires 2N 16-bit variables for FFT data, our microcontroller can't perform FFTs for values of N greater than 512. Even this is unrealistic because other parts of the firmware will also require a few bytes of RAM. For our implementation, we therefore limit N to 256. Using 16-bits variables to represent the real and imaginary parts of every value, 1,024 bytes of RAM will be required for FFT data.

Limited speed. Low-power microcontrollers, despite having a high MIPS/mA rating, still require some optimization to reduce the number of instructions required to run the FFT. Thankfully, the C compiler used for this application includes a number of optimization levels and settings. Careful use of the chip's hardware multiplier will also allow the code to be optimized to an acceptable level.

No floating-point capability. The microcontroller we've chosen doesn't have floating-point capability, typical of devices from low-power product lines. Fixed-point arithmetic is therefore required for all computations. To represent fractional numbers, the firmware will use signed Q8.7 notation. The firmware therefore assumes:

• Bits 0 to 6 represent the fractional part of every number

• Bits 7 to 14 represent the integer part of every number

• Bit 15 holds the two's complement sign bit

This format has no effect on additions and subtractions but care must be taken during multiplications to align the numbers to Q8.7 format. For example of Q8.7 multiplication, see Equation 7.


The Q8.7 format to be consistent also accommodates the largest number the FFT algorithm may encounter while providing the highest accuracy. For example, our ADC provides signed 8-bit samples in two's complement format. If our input is a DC voltage with maximum amplitude (+127 for signed 8-bit samples), the spectrum would be entirely contained in X(0) and equal to 32,512 in Q8.7 notation. This number conveniently fits into a single signed 16-bit value.

The firmware
The following sections describe the firmware that computes a radix-2 FFT. When the samples are read from the ADC, they are stored in the x_n_re array. This array represents the real values of x(n). The imaginary values, initialized to zero before the FFT begins, are stored in the x_n_im array. When the FFT is complete, the spectrum results have replaced the original sampled values and are stored in x_n_re and x_n_im.

Gathering samples
The FFT algorithm assumes that the samples are taken at a fixed sampling frequency. Although beyond the scope of this article, writing code to gather samples for an FFT can cause problems if not done carefully. For example, jitter in the sampling frequency will cause errors in the FFT results and should be minimized.

Any loop in the source code that samples an ADC and includes a decision statement can potentially cause jitter in the sampling frequency. For example, our system reads signed 8-bit samples from an ADC and stores them in an array of 16-bit variables. Two pseudo-code algorithms for performing this ADC read and store function are shown in Listing 1. The first method, presented in Algorithm 1, will cause jitter in the sampling rate because a negative sample requires more time to read and store than a positive sample. Interrupts must also be disabled to ensure that they do not disrupt the sampling code.

Listing 1: ADC sampling pseudo-code

// sample[] is an array of 16-bit variables
for i = 0 to (N-1)
   doADCSampleConversion()                 // Instruct ADC to sample Vin
   sample[i] = read8BitSampleFromADC()    // Read 8-bit sample from ADC

   if (sample[i] & 0x0080)                // If the 8-bit sample was negative
     sample[i] = sample[i] + 0xFF00       // Make the 16-bit word negative

// sample[] is an array of 16-bit variables
for i = 0 to (N-1)
   doADCSampleConversion()               // Instruct ADC to sample Vin
   sample[i] = read8BitSampleFromADC()  // Read 8-bit sample from ADC

for i = 0 to (N-1)
   if (sample[i] & 0x0080)           // If the 8-bit sample was negative
     sample[i] = sample[i] + 0xFF00 // Make the 16-bit word negative

Trigonometric look-up tables
Rather than calculate the value of cosine or sine, the FFT algorithm uses look-up tables (LUTs). The declarations for the sine and cosine LUTs are given in Listing 2. Comments within the firmware include source code for the program that automatically generates these tables. Both LUTs have N/2 elements because the indices of the "twiddle factors" vary from 0 to N/2-1 (see Figure 2).

Listing 2: LUTs for sine and cosine functions

const int cosLUT[N/2] = {+128,+127,+127, ... ,-127,-127,-127};
const int sinLUT[N/2] = {+0 ,+3 , +6, ... ,+9 , +6, +3};

The arrays containing these LUTs are declared as const, forcing the compiler to store them in code space instead of data space. Doing this is important due to the RAM limitations of the microcontroller. Because the LUT values must be in Q8.7 notation, they correspond to the actual cosine and sine values multiplied by 27.

Bit reversal
The bit reversal order (where N is known) can be calculated at run time, indexed using a look-up table or written directly with an unrolled loop. Each method has its trade-offs in size and execution speed. This project performs bit reversal using an unrolled loop, which results in longer source code but faster execution. The code in Listing 3 shows the implementation of this unrolled loop. Comments in the actual firmware include source code to automatically generate this unrolled loop.

Listing 3: Bit reversal using an unrolled loop with N=256

i=x_n_re[ 1]; x_n_re[ 1]=x_n_re[128]; x_n_re[128]=i;
i=x_n_re[ 2]; x_n_re[ 2]=x_n_re[ 64]; x_n_re[ 64]=i;
i=x_n_re[ 3]; x_n_re[ 3]=x_n_re[192]; x_n_re[192]=i;
i=x_n_re[ 4]; x_n_re[ 4]=x_n_re[ 32]; x_n_re[ 32]=i;
...i=x_n_re[207]; x_n_re[207]=x_n_re[243]; x_n_re[243]=i;
i=x_n_re[215]; x_n_re[215]=x_n_re[235]; x_n_re[235]=i;
i=x_n_re[223]; x_n_re[223]=x_n_re[251]; x_n_re[251]=i;
i=x_n_re[239]; x_n_re[239]=x_n_re[247]; x_n_re[247]=i;

The radix-2 FFT algorithm
After we've reordered the samples using bit reversal, we can compute the FFT. The firmware for this implementation of the radix-2 FFT performs the butterfly computations (shown in Figure 2) with three main loops. The outside loop counts through the log2N stages of the FFT computation. The inner loops perform the individual butterfly computations of each stage.

The heart of the FFT algorithm is the short block of code that performs each butterfly computation. This block, shown in Listing 4, is unfortunately the only nonportable firmware in this implementation. The macros MUL_1 and MUL_2 use the microcontroller's hardware multiplier to perform multiplications in a single instruction cycle. These macros can be replaced with equivalents for the particular microcontroller you're using.

Listing 4: Butterfly Computation in C

/* (1) Macro MUL_1(A,B,C): C=A*B   (result in Q8.7)*/
/* (2) Macro MUL_2(A,C) : C=A*last_B (result in Q8.7)*/

x_n_re[b] = x_n_re[a]-resultMulReCos+resultMulImSin;
x_n_im[b] = x_n_im[a]-resultMulReSin-resultMulImCos;
x_n_re[a] = x_n_re[a]+resultMulReCos-resultMulImSin;
x_n_im[a] = x_n_im[a]+resultMulReSin+resultMulImCos;

Complex-to-polar conversion
To determine the magnitude for the spectrum of VIN, we must convert the complex values of X(k) into polar notation. The firmware that implements this conversion is shown in Listing 5. The magnitude values replace the original results of the FFT that are no longer needed by the firmware.

Listing 5: Converting FFT results from complex notation to polar notation

const unsigned char magnLUT[16][16] = { {0x00,0x10,0x20, ... ,0xd0,0xe0,0xf0}, {0x10,0x16,0x23, ... ,0xd0,0xe0,0xf0}, ... {0xe0,0xe0,0xe2, ... ,0xff,0xff,0xff}, {0xf0,0xf0,0xf2, ... ,0xff,0xff,0xff} }; ... ... /* Compute x_n_re=abs(x_n_re) and x_n_im=abs(x_n_im) */ ... ... x_n_re[0] = magnLUT[x_n_re[0]>>11][0];

for(i=1; i>11][x_n_im[i]>>11];

x_n_re[N_DIV_2] = magnLUT[x_n_re[N_DIV_2]>>11][0];

A two-dimensional LUT determines the magnitude instead of the computation from Equation 4. The first index is the four MSBs (most significant bits) of the real part of the spectrum while the second index is the four MSBs of the imaginary part of the spectrum. To obtain these four MSBs the signed 16-bit values are right-shifted 11 times. Before the real and imaginary parts of the spectrum can be used as indices, they are replaced by their absolute values (therefore the sign bit will be zero).

Because we know from Equation 6 that the spectrum is symmetric with respect to X(N/2), only the magnitudes of the first N/2+1 spectrum values are converted to polar notation. Also, it can be shown that the imaginary parts of X(0) and X(N/2) are always zero for real-valued input samples. These two magnitudes are therefore calculated separately. Comments in the firmware include source code used to automatically generate the LUT for the magnitude of X(k).

Hamming or Hann Windows
The firmware for this project includes LUTs (in Q8.7 format) to apply either a Hamming window or a Hann window to the input samples. Windowing is useful to reduce spectral "leakage" that can result from truncating x(n) in the time domain. The equation for the Hamming window function is shown in Equation 8 while the equation for the Hann window function is shown in Equation 9.



Listing 6 shows the code for these functions. Again, comments in the actual firmware include source code to automatically generate the LUTs for these windowing functions.

Listing 6: LUTs for Hamming and Hann Windows

const char hammingLUT[N] = {+10, +10, +10, ... ,+10, +10, +10};
const char hannLUT[N]    = { +0,  +0,  +0, ... , +0,  +0,  +0};
   for(i=0; i<256; i++)="" {="" mul_1(x_n_re[i],hamminglut[i],x_n_re[i]);="" }="" #endif="">

   for(i=0; i<256; i++)="" {="" mul_1(x_n_re[i],hannlut[i]),x_n_re[i]);="" }="" #endif="">

Testing the results
To test the result of the FFT application, the firmware uploads the magnitude of X(k) to a PC using the microcontroller's UART port. FFTgraph, a Windows application written specifically to read these magnitude values from the PC's serial port and included with the firmware for this project, graphs the magnitude of the calculated spectrum in real-time. Figure 3 shows FFTgraph's results for four different input signals with the microcontroller sampling the input voltage at 200kbps:

a) 4.3V DC signal
b) 50KHz sine wave
c) 70KHz sine wave
d) 6.25KHz square wave

Figure 3: Spectra calculated by a low-power microcontroller

What's next?
You can spend an unlimited amount of time optimizing and configuring this FFT implementation. Although I chose to use the radix-2 algorithm, others algorithms exist that can dramatically reduce the number of additions and multiplications required. Many optimizations not presented in this article also exist for increasing the speed of an FFT. For example, with real-valued input samples, the imaginary part of the input samples is always zero and only the first half of the spectrum is significant. Using this information, the first and last stages of the FFT can be optimized for faster execution (but more program space may be required).

The algorithm presented here is, however, a good starting point for an FFT algorithm written specifically for a low-power microcontroller. For more information and implementation details I encourage you to review the heavily commented firmware for this application.

Paul Holden is a member of the technical staff at Maxim Integrated Products. You can reach him at

Cooley, J. W. and J. W. Tukey. "An Algorithm for the Machine Computation of Complex Fourier Series," Mathematics Computation, Vol. 19, pp 297-301, 1965.

Lemieux, Joe. "Fixed-point math in C", Embedded Systems Programming, October, 2003.

Proakis, John G. and Manolakism Dimitris G. Digital Signal Processing Principles, Algorithms, and Applications, 3rd Edition. Prentice Hall, 1996.

Smith, Steven W. The Scientist and Engineer's Guide to Digital Signal Processing 2nd Edition. California Technical Publishing, 1999.

Loading comments...