# Speeding up the CORDIC algorithm with a DSP

Digital signal processors (DSPs) crunch the numbers for applications that require fast analog-to-digital-to-analog conversion, such as software-defined radio and radar. If you weren't using a DSP, you might use the CORDIC algorithm to perform similar calculations on your IC. The strength of the CORDIC algorithm is its ability to solve vector rotation without using a multiplier.

Despite CORDIC's well-documented properties, you won't often find it implemented on a DSP. CORDIC was conceived 49 years ago, when the cost of multiplier hardware was prohibitive, whereas DSPs are typically equipped with multipliers. But what happens when you mix the two? When a CORDIC algorithm is implemented on a DSP, can the multipliers improve CORDIC's performance, or must they sit idle?

We're going to answer this question positively by proposing a novel way to map the CORDIC algorithm onto DSP hardware. This new method uses the DSP's MAC (multiply/accumulate) units and eliminates the conditionals, thus preserving the machine pipelining.

CORDIC, an acronym for COordinate Rotation DIgital Computer, is a class of shift-add algorithms that rotate a vector in a plane. CORDIC has become a commonly used method in memory- and CPU-constrained embedded systems because it's a simple and efficient way to calculate the hyperbolic and trigonometric functions found in every scientific calculator.

Most often the algorithm is used when no hardware multiplier is available, such as in microcontrollers and FPGAs, because the only operations it requires are addition, subtraction, bitshift, and table lookup.

CORDIC is widely used due to its simplicity and its property of relatively fast convergence. It has many applications, including computing trigonometric functions and converting Cartesian coordinates to polar coordinates (and vice versa).

Basically the CORDIC algorithm chooses special angles of rotation such that it can perform the rotation operations by simple shifts and additions, rather than the multiply functions that are required in the general case. Thus, design teams can use the CORDIC algorithm instead of hardware multipliers, which require a higher gate count and cost more to build.

Before we get into the details of how to map the CORDIC algorithm onto a DSP engine, we'll briefly review the CORDIC algorithm that calculates the magnitude and angle of a vector from its Cartesian coordinates. We'll then describe the implementation of the algorithm on a DSP processor, which uses addition and shift operations only.

First, however, let's review the traditional approach to implementing CORDIC.

Let **v** be a vector with Cartesian coordinates $(x, y)$. To simplify the description, we consider only the right half plane of a unit circle; in other words, we assume $1 > x > 0$ and $1 > y$.

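The objective is to find the magnitude $r$ and the angle $\varphi$ of **v**:

$$r = \sqrt{x^2 + y^2} \tag{1.1}$$

$$\varphi = \arctan\left(\frac{y}{x}\right) \tag{1.2}$$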

If we can somehow rotate vector $\mathbf{v} = (x, y)$ to $\mathbf{v}_e = (x_e, y_e)$ such that $y_e = 0$, the magnitude $r$ will be the $x$-coordinate $x_e$ and the angle $\varphi$ will be the rotated angle. This rotation is actually achieved by a number of successive rotations (called *subrotations*). For each subrotation, the angle of rotation is chosen such that:

• The computation can be accomplished by addition and shift operations only (avoiding the use of multiplication).

• The set of subrotations will drive vector **v** to the $x$-axis, in other words, making coordinate $y$ equal to 0; this can be guaranteed if the current rotation is half of the previous one.

Mathematically, the CORDIC algorithm can be described as follows:

Initially let:

$$x_0 = x$$

$$y_0 = y$$

$$\varphi_0 = 0$$

The first operation rotates $\mathbf{v}_0 = (x_0, y_0)$ by $\alpha_0 = 45^\circ$ to get $\mathbf{v}_1 = (x_1, y_1)$:

$$x_1 = x_0 \cos(\alpha_0) - d_0\, y_0 \sin(\alpha_0) \tag{1.3}$$

$$y_1 = y_0 \cos(\alpha_0) + d_0\, x_0 \sin(\alpha_0) \tag{1.4}$$

$$\varphi_1 = \varphi_0 - d_0\, \alpha_0 \tag{1.5}$$

where $d_0$ is the direction of rotation:

$$d_0 = 1 \quad \text{if } y_0 < 0 \tag{1.6}$$

$$d_0 = -1 \quad \text{if } y_0 \ge 0 \tag{1.7}$$

Equations 1.3 and 1.4 are the same as:

$$x_1 = \cos(\alpha_0)\,[x_0 - d_0\, y_0 \tan(\alpha_0)] \tag{1.8}$$

$$y_1 = \cos(\alpha_0)\,[y_0 + d_0\, x_0 \tan(\alpha_0)] \tag{1.9}$$

In general, the $i$-th rotation is:

$$x_{i+1} = \cos(\alpha_i)\,[x_i - d_i\, y_i \tan(\alpha_i)] \tag{1.10}$$

$$y_{i+1} = \cos(\alpha_i)\,[y_i + d_i\, x_i \tan(\alpha_i)] \tag{1.11}$$

$$\varphi_{i+1} = \varphi_i - d_i\, \alpha_i \tag{1.12}$$

where $d_i$ is the direction of rotation:

$$d_i = 1 \quad \text{if } y_i < 0 \quad \text{(to rotate upwards)} \tag{1.13}$$

$$d_i = -1 \quad \text{if } y_i \ge 0 \quad \text{(to rotate downwards)} \tag{1.14}$$

for $i = 0, 1, \ldots$

If we choose $\alpha_i$ such that $\tan(\alpha_i) = 2^{-i}$, Equations 1.10 through 1.12 become:

$$x_{i+1} = K_i\,[x_i - d_i\, y_i\, 2^{-i}] \tag{1.15}$$

$$y_{i+1} = K_i\,[y_i + d_i\, x_i\, 2^{-i}] \tag{1.16}$$

$$\varphi_{i+1} = \varphi_i - d_i\, \alpha_i \tag{1.17}$$

where:

$$K_i = \cos(\arctan(2^{-i})) \tag{1.18}$$

$$\alpha_i = \arctan(2^{-i}) \tag{1.19}$$

Since the effect of the constant $K_i$ can be removed later (see Equation 1.23), the CORDIC algorithm in Equations 1.15 through 1.17 can be implemented as:

$$x_{i+1} = x_i - d_i\, y_i\, 2^{-i} \tag{1.20}$$

$$y_{i+1} = y_i + d_i\, x_i\, 2^{-i} \tag{1.21}$$

$$\varphi_{i+1} = \varphi_i - d_i\, \alpha_i \tag{1.22}$$

where $d_i$ and $\alpha_i$ are defined as in Equations 1.13, 1.14, and 1.19.

Suppose that the desired precision is achieved after $N$ rotations. The magnitude $r$ and angle $\varphi$ can then be found approximately by:

$$r \approx K\, x_N \tag{1.23}$$

$$\varphi \approx \varphi_N \tag{1.24}$$

where $K = K_{N-1} \cdots K_0 = \cos(\arctan(2^{-(N-1)})) \cdots \cos(\arctan(2^{-0}))$, which can be precomputed precisely.

A typical pseudocode for CORDIC in Equations 1.20 through 1.24 is:

If $y_i < 0$ (1.25):

$$x_{i+1} = x_i - y_i' \tag{1.26}$$

$$y_{i+1} = y_i + x_i' \tag{1.27}$$

$$\varphi_{i+1} = \varphi_i - \alpha_i \tag{1.28}$$

If $y_i \ge 0$ (1.29):

$$x_{i+1} = x_i + y_i' \tag{1.30}$$

$$y_{i+1} = y_i - x_i' \tag{1.31}$$

$$\varphi_{i+1} = \varphi_i + \alpha_i \tag{1.32}$$

where:

$$x_i' = x_i\, 2^{-i} = x_i \mathbin{>>} i \tag{1.33}$$

$$y_i' = y_i\, 2^{-i} = y_i \mathbin{>>} i \tag{1.34}$$

for $i = 0, 1, \ldots$ The notation "$z \mathbin{>>} i$" means that the bits in $z$ are shifted down by $i$ bits (the shift in this document is always an arithmetic shift, not a logical shift).

**Figure 1** shows an illustration of the CORDIC algorithm. The pseudo code in Equations 1.25 through 1.32 is simple and requires only addition and shift operations. Since it doesn't require the multiply function, the hardware implementation is also simple, and for this reason the CORDIC algorithm is widely implemented in ASICs (application-specific integrated circuits) and FPGAs (field-programmable gate arrays).

**New CORDIC implementation for modern architectures**

It's been a widely held belief that DSP processors can't improve the performance of the CORDIC algorithm. At first glance, this seems plausible, since the major CORDIC advantage in most implementations is its avoidance of multiply functions. However, if a multiplier is available, can it improve the performance of the CORDIC algorithm? We propose a reformulation of CORDIC that takes advantage of the DSP's hardware architecture.

There is another significant consideration when implementing CORDIC on a DSP architecture: the pipeline. To increase performance, modern processors are usually pipelined, often with as many as 10 stages. This design lets the processor work on several instructions at once, each in a different stage of execution. A *pipeline* is efficient as long as the code doesn't break the instruction flow with branches. The traditional CORDIC algorithm requires conditional execution (see the pseudocode in Equations 1.25 through 1.32), which will typically break the pipeline and result in stalls. Hence, the CORDIC algorithm may perform poorly on a modern pipelined architecture.

Our new approach gets around this difficulty by mostly using multiplication units in a DSP processor; it also eliminates the need for conditional executions. We begin by reformulating the algorithm such that it's more suitable for using multipliers.

The iteration formulas for $x_{i+1}$, $y_{i+1}$, and $\varphi_{i+1}$ (see Equations 1.25 through 1.32) may be written as:

$$x_{i+1} = f_i\,(x_i - y_i') + (1 - f_i)\,(x_i + y_i') \tag{2.1}$$

$$y_{i+1} = f_i\,(y_i + x_i') + (1 - f_i)\,(y_i - x_i') \tag{2.2}$$

$$\varphi_{i+1} = f_i\,(\varphi_i - \alpha_i) + (1 - f_i)\,(\varphi_i + \alpha_i) \tag{2.3}$$

where:

$$f_i = 1 \quad \text{if } y_i < 0 \tag{2.4}$$

$$f_i = 0 \quad \text{if } y_i \ge 0 \tag{2.5}$$

$$x_i' = x_i\, 2^{-i} = x_i \mathbin{>>} i \tag{2.6}$$

$$y_i' = y_i\, 2^{-i} = y_i \mathbin{>>} i \tag{2.7}$$

The iteration in Equation 2.1 may be written as:

$$
\begin{aligned}
x_{i+1} &= f_i\,(-2 y_i') + (x_i + y_i') \\
&= 2\left[-f_i\, y_i' + x_i/2 + y_i'/2\right] \\
&= 2\left[(-f_i + 1/2)\, y_i' + x_i/2\right] \\
&= 2\left[(-f_i + 1/2)'\, y_i + x_i/2\right]
\end{aligned} \tag{2.8}
$$

where:

$$(-f_i + 1/2)' = (-f_i + 1/2)\, 2^{-i} = (-f_i + 1/2) \mathbin{>>} i \tag{2.9}$$

Similarly, the iterations in Equations 2.2 and 2.3 may be written as:

$$y_{i+1} = 2\left[-(-f_i + 1/2)'\, x_i + y_i/2\right] \tag{2.10}$$

$$\varphi_{i+1} = 2\left[(-f_i + 1/2)\, \alpha_i + \varphi_i/2\right] \tag{2.11}$$
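The intermediate steps for Equations 2.10 and 2.11 are not spelled out in the text. Assuming the same expansion that was used for Equation 2.8, they would run as follows:

$$
\begin{aligned}
y_{i+1} &= f_i\,(y_i + x_i') + (1 - f_i)\,(y_i - x_i') \\
&= f_i\,(2 x_i') + (y_i - x_i') \\
&= 2\left[-(-f_i + 1/2)\, x_i' + y_i/2\right] \\
&= 2\left[-(-f_i + 1/2)'\, x_i + y_i/2\right]
\end{aligned}
$$

$$
\begin{aligned}
\varphi_{i+1} &= f_i\,(\varphi_i - \alpha_i) + (1 - f_i)\,(\varphi_i + \alpha_i) \\
&= f_i\,(-2 \alpha_i) + (\varphi_i + \alpha_i) \\
&= 2\left[(-f_i + 1/2)\, \alpha_i + \varphi_i/2\right]
\end{aligned}
$$

In each case the $f_i$ terms multiplying the unshifted quantity cancel, which leaves a single product with the flag.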

Collecting the above together, we have the algorithm to implement:

$$x_{i+1} = 2\left[(-f_i + 1/2)'\, y_i + x_i/2\right] \tag{2.12}$$

$$y_{i+1} = 2\left[-(-f_i + 1/2)'\, x_i + y_i/2\right] \tag{2.13}$$

$$\varphi_{i+1} = 2\left[(-f_i + 1/2)\, \alpha_i + \varphi_i/2\right] \tag{2.14}$$

where:

$$f_i = 1 \quad \text{if } y_i < 0 \tag{2.15}$$

$$f_i = 0 \quad \text{if } y_i \ge 0 \tag{2.16}$$

$$(-f_i + 1/2)' = (-f_i + 1/2)\, 2^{-i} = (-f_i + 1/2) \mathbin{>>} i \tag{2.17}$$

The advantages of Equations 2.12 through 2.17 over Equations 1.25 through 1.32 are as follows:

1. The execution of Equations 2.12 through 2.17 is unconditional; hence, it gets around the difficulty of a broken pipeline, which was the case when implementing Equations 1.25 through 1.32.

2. The shift operation in Equations 2.12 through 2.17 is done only once per iteration (to get the modified flag $(-f_i + 1/2)'$), while Equations 1.25 through 1.32 need two shift operations (to get $x_i'$ and $y_i'$). The new formulation therefore saves one operation.

3. The flag has been chosen to have two possible values, $1/2$ or $-1/2$. In 1.15 fractional format, these values are represented in hexadecimal notation as 0x4000 and 0xC000 respectively. Because of this particular choice of values, the flag constants do not lose precision when they are shifted down.

4. The new CORDIC formulation does not shift down the $x_i$ and $y_i$ coordinate values during the iteration process. Thus, no precision is lost on the original $(x, y)$ coordinate values.

5. The product of the flag $(-f_i + 1/2)'$ and the $x_i$ or $y_i$ coordinate value is stored in a 40-bit accumulator (see Equations 2.8 and 2.10).

6. As a result of improvements 3, 4, and 5, the new CORDIC formulation gains about 0.5 bit of precision.
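Point 3 can be checked with a few lines of C. The sketch below is only an illustration with a hypothetical helper name: it tests whether scaling a 16-bit value down by $2^i$ and back up reproduces the original. It uses integer division rather than `>>`, because right-shifting a negative value is implementation-defined in C; for values that divide exactly, the two give the same result.

```c
#include <stdint.h>

/* Hypothetical check: returns 1 if scaling `value` down by 2^i
 * (0 <= i <= 14) discards no bits, i.e. scaling back up by 2^i
 * reproduces the original value. */
int shift_is_exact(int16_t value, int i)
{
    int32_t scaled = value / (1 << i);      /* scale down by 2^i */
    return scaled * (1 << i) == value;      /* did any bits get lost? */
}
```

For example, `shift_is_exact(0x4000, 14)` and `shift_is_exact(-16384, 14)` both return 1, since 0x4000 and 0xC000 (that is, -16384) each carry a single significant bit. A typical coordinate value such as 0x1235 already loses bits at `i = 3`, so `shift_is_exact(0x1235, 3)` returns 0.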

The formulation in Equations 2.12 through 2.17 is particularly suitable for implementation on a pipelined DSP architecture. When implemented on a Blackfin BF533 DSP from Analog Devices, each iteration took four cycles, while the traditional implementation in Equations 1.25 through 1.32 took seven cycles per iteration. For example, in a software radio application, we need to implement the CORDIC algorithm for two channels at 240 kHz, with 13 iterations (subrotations). Our new method will consume 2 × 240,000 × 13 × 4 ≈ 25 mips (million instructions per second) versus 2 × 240,000 × 13 × 7 ≈ 44 mips for the traditional method.

This represents a saving of 43% in terms of mips. In addition, for the same number of iterations, the results of the new method in Equations 2.12 through 2.17 have, on average, a half bit more precision than those of the traditional method in Equations 1.25 through 1.32.
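Written out in full, using only the figures quoted above, the arithmetic is:

$$
\begin{aligned}
\text{new method:} \quad & 2 \times 240{,}000 \times 13 \times 4 = 24{,}960{,}000 \approx 25 \text{ mips} \\
\text{traditional method:} \quad & 2 \times 240{,}000 \times 13 \times 7 = 43{,}680{,}000 \approx 44 \text{ mips} \\
\text{saving:} \quad & 1 - \frac{4}{7} \approx 0.43
\end{aligned}
$$

Because the channel count, rate, and iteration count are the same in both cases, the saving depends only on the ratio of cycles per iteration.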

**C & Assembly Code for CORDIC**

In this section, we present three code samples:

• **Listing 1:** C code implementing original CORDIC.

• **Listing 2:** Blackfin assembly code implementing reformulated CORDIC.

• **Listing 3:** Blackfin assembly code implementing original CORDIC.

The code in **Listing 1** and the complete code in **Listing 2** can be compiled and executed using ADI's VisualDSP++ tools. Note that the code in **Listing 2** is an excerpt; the complete code is available at www.embedded.com/code/2008code.htm. The code assumes that $|x|, |y| < 1$ in 1.15 format; in other words, $x$ and $y$ are in the range **[0x8000, 0x7fff]**. The output phase has 32 bits in 1.31 format; in other words, $-\pi$ is represented by **0x80000000** and $\pi$ by **0x7fffffff**. The assembly code may have some special requirements, which are noted in its comments.


The code in **Listing 3** is only a segment that shows how the original CORDIC may be implemented in ADI assembly. In **Listing 3**, we show the code for one iteration only (seven lines of code in total, costing seven cycles). **ITER_NUM** = $i$, where $i = 0, \ldots, 14$, is the iteration number. It should be in a register, which may be loaded from memory; we leave it as it is for clarity. The register r2 holds $\varphi_{i+1}$, $i = 0, 1, \ldots, 14$, with r2 = $\varphi_0 = 0$ initially (see Equation 1.32). Besides its slow performance, this implementation has less precision due to the line **r1 = r0>>> ITER_NUM (v)**, which shifts the $x_i$ and $y_i$ coordinates down, losing precision.

**George Pan** is a software engineer in the general purpose DSP Division of Analog Devices. He earned his degree in electrical engineering from the University of Connecticut.

**Fabian Lis** is the Automotive Digital Radio engineering manager within the DSPS automotive product line at ADI. Fabian holds a BSEE from Technion Israel Institute of Technology and an MSEE from Tel Aviv University.
