# Reducing Doppler Filtering Processing in STAP Implementations

The number of operations required to compute Space Time AdaptiveProcessing (STAP) solutions is one of the main factors requiringthe use of high-performance clusters for radar array processing. Weneed to introduce several heuristic optimizations to reduce clusterprice and physical size, hence reducing the number of processorsand the size of the cluster.

All pulsed radar STAP approaches require Doppler filteringduring the computation. Depending on approach, the Dopplerfiltering can be performed either with pulse compression or doneseparately. We consider the case where these two computations areseparate. In beamspace post-Doppler and beamspace pre-Dopplerapproaches, the Doppler filtering process is performed afterreduction in beamspace, leaving a small number of elements on whichto perform the operation. If we partition the data cube along thepulse-range dimension, and have a sufficient number of processors,each processor is responsible for an entire (or a part of)pulse-range plane. More importantly, each processor has data thatspans all the staggers per individual time sample. Using this setupand an important property of the Fast-Fourier Transform (FFT), wecan introduce substantial savings into the process. Over the entiredata cube the number of floating point operations (FLOPS) for thatphase is reduced by up to 40%.

In this article, we consider the pulsed Doppler processing,where the data is collected by several elements of the radar array.In particular, we examined Doppler filtering in beamspace pre- andpost-Doppler approaches. The input data to a radar processor is athree-dimensional data cube of dimensions L, M, and N. Thedimension L is the range gate dimension, M represents the number ofpulses, and N represents the number of elements. For the sake ofcalculations, we assign these dimensions some realisticvaluesbefore the Doppler filtering phase, let L=500, M=20,and N=50. Each data point in the data cube is a complexpoint representing real and imaginary (I and Q) components of onerange bin.

The physical machine that would be considered for the task ofdata processing is a multiprocessor machine consisting mostly ofDigital Signal Processors (DSPs). The reason for using DSPs,besides their FFT processing capabilities, is that the programmercan manage the DSP's cache. This important property enables us to introduce aheuristic optimization.

We use a property of the decimation-in-time FFTcircuitwhen the FFT over some vector is computed, the odd andeven points are grouped and separate intermediate results arecomputed and then joined to obtain the result. The key observationis that intermediate results are independent of each other.

In order to use the reduction of the operations in Dopplerfiltering, the data must be partitioned in the following fashion.Each processor receives the data from all pulses, from a particularbeam, and at some time sample. This data encloses all the staggers,meaning that the data has not been broken up into individualstaggers; notion of a stagger is explained in the HeuristicOptimization section of this article. Each processor can then breakup its data into individual staggers. Notice that the stagger dataoverlaps from one stagger to the next in such a way that the firsteven point of a first stagger is a first odd point of the nextstagger, hence pointing to the FFT property. It is important forthe data to be small enough to fit into the cache, otherwise, thesavings will not have a significant impact.

The basic savings from one stagger to another give us thefollowing percent savings:

However, it is usually the case that there are more then twostaggers; hence, the savings can be extended over the consecutivestaggers.

The FFT is a process that takes an input data in time domain andtransfers it into the frequency domain. Data needs to be in thefrequency domain before Doppler processing takes place.

You can evaluate the FFT using a parallel Cooley-Tukey FFTalgorithm. The decimation-in-time algorithm decomposes x[n] intosuccessively smaller subsequences. The algorithm we would use isRadix-2 Decimation-in-Time and can be represented by:

The equivalent parallel FFT circuit allows us to compute K/2independent operations up to log K-1 times, where the depth of thecircuit is log (K). The independent operations are performed in twogroupsthe first group includes only the even points and thesecond group only the odd points of the time-domain vector.**Figure 1** shows an implementation of the 16-point FFTcircuit.

**Figure 1:** Flow graph of 16-point FFT based ondecimation-in-time

Doppler processing is implemented by applying a discrete Fouriertransform of length K across M pulses of the preprocessed data fora given range cell and channel. K is the length of the stagger andthe number of staggers is equal to the number of possible shifts ofK across M. In our case, K=16, M=20; hence, we have fivestaggers. Therefore, the stagger is simply a different wayof looking at the data. At this point the data is still in the formof the data cube and it has not been broken up into individualstaggers. The data further requires frequency-domain weighting.

Each stagger is a small data cube of dimensions: 16 pulses, over50 beams, and 500 range gates long. Therefore, in the data cubebefore the Doppler filtering, the first stagger starts at pulse 1and ends at pulse 16, the second stagger starts at pulse 2 and endsat pulse 17, and so on.

In order for us to exploit odd/even point FFT and DSPproperties, we have to map data to the processors in a specificway. Each processor has to receive data from all staggers at theparticular range gate and beam. This means that the data has to bepartitioned in the [pulse x range gate] plane.

Each processor receives one or more of these vectors. Since wehave 50 x 500 vectors in the [beam x range gate] plane, we needabout 25,000 processors, if each would only receive a singlevector. However, assume that we have less then 25,000 processorsand that each processor will receive just enough vectors so thatthe vectors would fit in the cache of each processor. There are twoimportant observations: the data sent to each processor fits intothe cache and, at this point, the data cube was not broken up intoindividual staggers, hence the data to transfer remains small.

**Figure 2:** Stagger data overlap

Data from the first stagger overlaps with the second stagger byexactly K-1 points. Consider staggers numbers one and two. Firstpoint of the overlap between the two vectors is point one ofstagger one, which is point zero of the second stagger (**Figure2** ). The same pattern is repeated up to point K of stagger one,which is point K-1 of stagger two. Therefore, the odd points ofstagger one are the even points of stagger two. Hence, when wecompute the FFT of stagger one, we can save the intermediateresults and reuse them for the computation of stagger two aseven-point results (**Figure 3** ).

**Figure 3:** Odd/even point savings in consecutivestaggers

For simplicity, let us assume that the multiplication, addition,and subtraction take the same amount of time. With this assumption,we have 5K log K floating-point operations (FLOPS) per FFT. For thecase of a 16-point FFT, the total number of floating pointoperations is 320 and the floating-point operations in the gray boxof **Figure 1** (8-point FFT) requires 120 FLOPS. Additionalsavings from any stagger to a consecutive stagger extend by usingthe intermediate results of points 2 and 10, which translate topoints 1 and 9, to the next stagger, points 4 and 12 (3, 11), and 6and 14 (4, 13). This gives us the 170 FLOPS, the minimum number ofoperations required to compute the next stagger (**Figure 4** ).Computation of stagger two costs us only (5K log K – 5(K/2) log(K/2) – 5*3(K/8) log (K/8)), or (320-120-30)=170 FLOPS, or 53% ofthe required number of operations. Of course, the savings will varywith the size of K and the number of staggers, since more or lessintermediate results could be saved.

**Figure 4:** Extended savings over consecutivestaggers

Another significant block of savings comes three staggers down,where intermediate results from the 4-point FFT can be used.Therefore, the last two staggers require only 150 operations(**Figure 4** ).

To compute five 16-point staggers without savings would cost1600 floating-point operations. Using parallel FFT techniques, weneed only 960 floating-point operations. This is a 40% improvement,for the entire data cube, since we need to perform 24 instead of 40Mflops.

DSP processing gives the programmer a valuable benefit, thecapability to control content of cache, certain memory locationscan be locked/unlocked and forced to remain in or be flushed out ofcache. Using DSP and odd/even point FFT properties, youcan save the intermediate results and reuse them to reduce thenumber of operations. The costs of remapping the data were nottaken into consideration, however. If possible, this optimizationcould bring substantial improvement in performance, hence possiblyreducing the size of the cluster or speeding up the process.