Perhaps the most fundamental building block of all communication equipment is the sine wave generator. We have ample knowledge of how to generate sine waves using analog electronic hardware. In modern electronics, however, more and more functions once performed by analog circuitry are now performed by digital signal processors or field-programmable gate arrays. Those two types of ICs are the main approaches for developing and implementing signal processing: either the programmable DSP or the ASIC/FPGA.
Lately, high-performance DSPs are using concepts, such as execution pipelining and parallelism of operations, which were unknown to their ancestor the CPU but routine to the FPGA world. DSPs are starting to resemble FPGAs more and more, and although some fundamental differences remain, this hybridization makes it possible to use algorithms originally developed for one environment in the other.
For instance, you can use a very efficient algorithm for computing the sine for an arbitrary phase in a programmable DSP. The method is not new in itself but adapts the original ideas applied to the design of direct digitalfrequency synthesizer (DDFS) ICs over two decades ago.
It's important to highlight at this point a crucial difference between algorithms used to compute the sine of a random input phase and those that generate a sine wave. While a sine computation algorithm can be easily converted into a sine wave generator, by providing to its input a monotonically increasing phase sequence, it's absolutely impractical, if not impossible, to use a sine wave generator to calculate the sine of a random phase input.
We'll differentiate these algorithms as sine computation and sine generation with the understanding that a sine computation is a more general method and hence can be easily converted into the latter without significant additional complexity.
In this article, we'll review the most common methods for the computing and generating a sine wave on a programmable DSP, as well as provide a brief introduction to direct digital frequency synthesis (DDFS). We'll then describe a new approach to DDFS based on an algorithm created by combining lookup tables and trigonometric identities. The approach lends itself well to implementation on DSPs such as the Texas Instruments' C6x.
Sine generation in a programmable DSP
Implementing a high-speed sine-wave synthesizer in a programmable DSP is hardly a trivial task; it often requires a careful trade-off between memory and implementation complexity. The problem arises because the standard algorithms used to calculate Sin(x ), as well as any other transcendental function of x , are based on the Taylor series expansion of the function in question, which is a time-consuming procedure. Hence, for a given finite number of CPU cycles, only a limited number of results can be rendered in real time. Several alternative algorithms have been developed to boost the performance of the sine-wave generation. Table 1 presents some of the most popular sine generation and computation algorithms used in a DSP.
The Taylor series expansion of Sin(x ) is shown in Equation 1. Implementing this equation in a high-level language like C is straightforward. One advantage of the method is that the desired precision of the final result can be controlled by choosing the number of terms to include in the calculation, more terms render more precision at the expense of additional computations. The series converges rather fast in the interval and remarkably slow from [–π, π] hence, given the same desired precision, there are two implementation choices:
(1) exploiting the symmetry relations in Equation 2 and performing the expansion only in the interval with fewer terms,
(2) using more terms (see Table 2 ) applying the expansion in interval [–π, π]. Since both choices require extra computations, relative advantages should be judged on a case by case basis.
Table 2 shows maximum error as a function of the number of terms used in the expansion for each of the intervals and the maximum number of bits that can be used in order to keep the error less than one least significant bit.
The lookup table approach symbolizes the opposite to the Taylor Expansion, in the sense that no calculations are performed in real time. The values of the sine function are precalculated at evenly spaced phase points in the intervals [0,2π] and stored in memory during the initialization phase of the algorithm. In this way, a one-to-one mapping is established between each phase value and the address in memory where the table values are stored. The computation of the sine wave in real time entails approximating the input phase to one of the table indexes and retrieving the value stored at that index/address. The phase to index approximation constitutes one of the main sources of error of the algorithm. This error is in the order of the inverse of the table size, which is almost universally chosen to be a power of two.
Thisalgorithm is the fastest sine computation algorithm executing in only one instruction–the memory-read from the index corresponding to the input phase. Its drawback is that the table size becomes impractically large even for modest error goals; for instance an error of the order of 2–12 requires a table of 4,096 elements totaling 16 kbytes for 32-bit elements. Obviously this size can be reduced by a factor of four by exploiting the symmetry relations for Sin and Cos stated in Equation 2 but at the expense of the additional complexity of the algorithm needed to check the phase subinterval prior to lookup table addressing. The amount of overhead associated with this additional complexity more than doubles the execution time of the algorithm, making it a costly proposition. Another way to decrease the table size is to interpolate between the end points of the subinterval where the phase resides; this technique, however, again suffers the aforementioned performance drawbacks.
The recursive iteration algorithm is based in the application of the known trigonometric identities Equation 3 in a recursive manner that is achieved by making α[k +1]=α[k ]+Δα and rewriting Equation 3 as in Equation 4. The algorithm is suited only for generation and not computation of the sine function. Another drawback is that, due to the recursive nature of the method, numerical errors accumulate with time producing fluctuations in the output value of the sine wave. A solution to this problem is proposed in John Edward's article.1
The resonator implementation takes advantage of the fact that an infinite impulse response (IIR) filter with a pair of complex poles situated in the unit circle generates a constant-amplitude sine wave if excited with an impulse function. The frequency of the oscillation depends of the location of the poles in Z plane. The transfer function and time domain recursive equation for the IIR filter are shown in Equation 5 along with the IIR block diagram in Figure 1 . The IIR has a very efficient implementation in terms of speed and memory usage. Its main drawback is being a sine generation instead of a sine computation algorithm. It also exhibits the same sensibility to amplitude due to the accumulation of numerical errors with time than the recursive iteration.
And initial conditions: y (-1)=0 y (–2)=0 x (0)=1
Sine computation in direct digital frequency synthesizers
Direct digital frequency synthesizer (DDFS) ICs were developed over 20 years ago to digitally generate a sine and/or cosine wave with high frequency resolution and low distortions as a response to the increase demands in the communication field. A popular DDFS architecture originally designed by Tierney consists of a phase accumulator and a phase to amplitude converter as shown in Figure 2 .2 At every clock cycle, the phase accumulator is incremented by a programmable quantity F r L-1 bits long. The total phase is then used as input to the second block where the sine and/or cosine are calculated. The frequency of the synthesized sine wave F 0 is calculated according to Equation 6.
If the analog representation of the sine is desired the last block is followed by a digital-to-analog converter.
The phase-to-amplitude converter block consists of an algorithm that computes an approximate value to the sine of the input phase. The number of bits W used as input to the phase to amplitude converter is usually less than L . The width reduction (truncation of the least significant bits) of the phase-accumulator output is necessary in order to simplify the phase to amplitude converter. This truncation, along with imprecision in the phase-to-amplitude converter, constitute the two biggest source of errors of the DDFS.
The effect of the truncation in the spectral purity of the generated sine wave was shown in Nicholas et al.,3, 4 to be described by Equation 7, which represents the value of the maximum spectral spurious component of the DDFS normalized such that 2W represents a sine of amplitude 1. According to Equation 7, the maximum spurious component in the generated spectrum is related to the truncation length B in a quasi exponential form and could grow up to 3.92 dB higher than 2–W if the greatest common divisor of <Fr, 2B > = 2B –1 , situation that can be avoided by selecting Fr and 2B relatively prime to each other. Since the main use of the DDFS is the generation of spurious-free sine waves, this performance parameter is of paramount importance. Coincidentally, a closely related term has been coined by the A/D manufacturers: spurious-free dynamic range (SFDR) . This term is calculated according to Equation 8 and will be used in the rest of the paper.
The holy grail of DDFS
The phase-to-amplitude converter is the most complex function of the DDFS and has been the target of extensive optimization efforts. In reviewing the methods used in the design of phase-to-amplitude converter of the DDFS, we can expected to find possible candidates for implementation in a DSP. The goal is to find an implementation that could be translated to a programmable DSP, specifically a high-performance TI C6x DSP, and compare its performance with the algorithms previously discussed.
As a reference for comparison, let's chose the lookup-table algorithm because this was the fastest technique analyzed so far. We can assume that while executing in a single instruction, the algorithm simultaneously computes both the sine and cosine of the input phase, since this function is of widespread use in communications. Both the input and outputs are quantized in 16-bit fixed point format.
Say the size for the uncompressed table is 4,096 32-bit elements (16 kbytes total), with each 32-bit table element containing two contiguous 16-bit groups, the sine and cosine for each of the 4,096 angles from 0 to 4,095/4,096 2π. With the phase quantized in 12 bits the SFDR is, according to Equation 8, approximately 70 dB. To compare performance between algorithms, let's consider only the number of instructions in the loop kernel, effectively ignoring the effect of the loop's prologue and epilogue. (See Texas Instruments' documentation for an indepth discussion of C6x assembly and optimization.5, 6) The loops are fully pipelined and noninterruptible.
Several implementations of phase-to-amplitude converters use techniques similar to those previously discussed for DSP. Langlois,7 for instance, proposed a linear approximation with interpolation in the interval from . Palomaki and Niitylahti8 have used different methods based on recursive iteration, Taylor and Chebyshev series expansions, the last performing at around 70 dB of SFDR for a series with only five terms and CORDIC algorithms, but none of these are suited for DSP implementation due to their iterative nature.
In a seminal paper published in 1984, Sunderland proposed the novel idea of combining lookup tables and trigonometric identities as a means of reducing the table size.9 The method works as follows: instead of using a single lookup table, the phase is split into several components, and trigonometric identities are used to assemble the desired result. The most straightforward of such identities are precisely Equation 3. Once the input phase has been quantized as an integer between 0 and 4,095, this value is split into two groups of bits of U upper bit and L lower bits, each representing the angles and respectively. This corresponds to the graph shown in Figure 3 .
The value of sin(α), cos(α), sin(β), and cos(β) are then retrieved from four tables, two tables of size 2U and two tables of size 2L elements. These values are used to compute the final sine and cosine using Equation 3.
Sunderland didn't apply the method exactly as described, but instead to avoid multiplications altogether (a very expensive proposition back in 1984), settled for splitting the phase in three groups and using Equation 9 as an approximation of Equation 3.
At first glance, the original idea suggested by Sunderland's technique seems to lend itself well for implementation in a programmable DSP. It succinctly proposes a reasonable trade-off: a sizeable decrease in the lookup table size at the expense of a modest increase in the number of calculations. The following reasons clarify why Sunderland's original works well with the specific TI C6x DSP family architecture.
1. Sunderland's original requires, according to Equation 3, two multiplication and two additions per every output value: a C6X DSP has a total of eight execution units, among them two adders and two multipliers, hence it's not a stretch to expect that all operations could be performed in a single clock cycle.
2. The lookup table has a small footprint: in order to achieve maximum throughput, the lookup table must be stored in (precious) internal DSP memory, hence small size is highly desired.
3. Conditional instructions are absent: per every input-phase value, the same operations are executed in a serial fashion. In other words, memory lookup is followed by the operations corresponding to Equation 3, which is very well suited for C6x architecture since any conditional operation inside a pipelined loop will force it to be at least six instructions long (see TI's TMS320C6000 Programmer's Guide 5).
At this point, we should analyze the implementation of this algorithm for 16 signed short integer format in both input and outputs. The C code for the algorithm is shown in Listing 1 with the corresponding generated assembly code. Note that the loop executes in three instructions compared with one for the lookup table. Furthermore, if the loop is unrolled by a factor of two, it will execute in five instructions generating two pairs of sine/cosine outputs per iteration. This level of performance compares very well with the lookup table implementation; although still 2.5 times slower, it uses 1/32 of the memory. Note as well that while the execution speed remains constant with the table sizes, the saving in memory use continues to growth at a rate of . This algorithm can be easily modified to operate as a sine/cosine generator. As a generator, there is no need to input an array with phase values; just input the constant phase increment Fr and let a persistent memory variable maintain the phase between calls. A 32 bits Int should suffice even for the most demanding frequency-precision requirements. In this type of application, Equation 7 must be taken into account.
One source of error in this implementation is the truncation of the least significant bit in the calculation of Equation 3. However, since, for the chosen table size, this error is relatively small compared with the error in the phase approximation, I ignored the error. If the application calls for an SFDR such that the uncompressed table size needs to be more than 14 bits, the truncation error must then be considered and it might be necessary to perform rounding of the results at the expense of lower throughput.
Note that the same kind of performance (three cycles/loop without unrolling and five cycles/loop with unroll of two), can be obtained for a floating-point implementation of the same algorithm, using a C67x DSP. Although floating-point DSPs are slower than fixed-point DSPs, in a floating-point unit all numerical errors are truly negligible.
One further modification to the algorithmwould be to divide the phase in three groups of bits U , M , and L corresponding to three different angles and performing the Equation 3 in a recursive fashion as follows:
This implementation is capable of reducing the table footprint much more than what is possible with the Sunderland's original algorithm, but Equation 10 requires twice the number of operations and suffers from accumulative truncation errors in the fixed-point implementation that degrade its performance. For these reasons, it's only practical in floating-point implementations where design constraints (such as an SFDR) call for the use of huge table sizes.
The algorithmhas been successfully deployed in the field. Finally, some of the initialization functions used in the algorithm have been omitted due to space constraints. For those interested in the full version of the algorithm or in either the floating-point or the 32-bit fixed-point versions, please send an email to firstname.lastname@example.org.
Carlos Abascal is signal processing and communications engineer in the Government Communications Division of Harris Corp. He has extensive experience in embedded microprocessor and DSP software development and has expertise in the theory and the implementation of linear and nonlinear adaptive predistortion for RF power amplifiers, digital modems, and software-defined radios in DSPs. He may be contacted at .
2. Tierney, C.M. Rader, and B. Gold, “A digital frequency synthesizer,” IEEE Transactions on Audio and Electroacoustics , vol. Au-19, No. 1. March 1971, pp. 48-57.
3. Nicholas III, H.T., H. Samueli, and B. Kim. “The optimization of direct digital frequency synthesizer performance in the presence of finite word length effects,” Proceedings of the 42nd Annual Frequency Control Symposium , 1988, pp. 357-363.
4. Nicholas III, H. T. and H. Samueli. “An analysis of the output spectrum of direct digital frequency synthesizers in the presence of phase-accumulator truncation,” in Proc. 41st Annual Control Symposium USERACOM (Ft. Monmouth, NJ), May 1987, pp. 495-502.
5. Texas Instruments Inc. “TMS320C6000 Programmer's Guide,” SPRU198i, Texas Instruments Inc., March 2006.
6. Texas Instruments Inc.. “TMS320C67x/C67x+ DSP CPU and Instruction Set Reference Guide,” SPRU733, Texas Instruments Inc., May 2005.
7. Langlois, J.M.P. and D. Al-Khalili. “Hardware optimized direct digital frequency synthesizer architecture with 60 dBc spectral purity,” 2002 IEEE International Symposium on Circuits and Systems (ISCAS 2002), May 2002.
8. Palomaki , K.I. and J. Niitylahti. “Direct digital frequency synthesizer architecture based on Chebyshev approximation,” Proceedings of the 34th Asilomar Conference on Signals, Systems and Computer s, Oct. 29th–Nov. 1st., 2000, pp. 1639-1643.
9. Sunderland, D. A. et al, “CMOS/SOS frequency synthesizer LSI circuit for spread spectrum communications,” IEEE J. Solid State Circuits , vol. SC-19, No 4, Aug. 1984, pp 497-505.
Kua , Francis. “Generation of a Sine Wave Using a TMS320C54x Digital Signal Processor,” TI Application Report SPRA819 –Texas Instruments Inc. May 2002
Essenwanger, K. A. and V. S. Reinhardt. “Sine Output DDSs A Survey of the State of the Art,” Proceedings of the 1998 Frequency Control Symposium (Pasadena) , May 27-29, 1998, pp. 370-378.