# Beyond Regression: Line-Fitting Algorithms for Exceptional Cases – Part 2

My previous article in this series on line-fitting algorithms, dealing specifically with minimax (*see Part 1*), came from working with an engineer who thought minimax might do better than regression in his particular application. When I provided him with a minimax fit, the engineer still wasn’t satisfied. Only then did I ask to see the data – which is what I should have done in the first place!

The engineer’s data came from sensor measurements of the angle of rotation for a rotating device at equal time increments. The data itself was very close to linear, so to see what was going on I plotted deviations of the data from the regression line. The plot in **Figure 1 below** shows clearly that on top of the linear trend there’s an obvious periodic “wobble”, caused by mechanical irregularities in the rotating device.

Figure 1: Plot of data minus regression line

Wobbles like this in the data cause a systematic bias in the slope of the regression line, as shown in **Figure 2 below** .

Figure 2: Effect of sinusoidal wobble on regression line

**Averaging out the wobbles**

The question is: how can we subtract out the wobbles and get an accurate estimate of the underlying linear trend? Generally, the answer involves some kind of averaging or filtering to remove the wobbles before fitting the line.

For instance, if the data are evenly spaced the wobble in Figure 2 can be made to disappear if we average each data point from the first half of the interval with the corresponding data point from the second half of the interval as follows:

`data_average(n) = 0.5 * [data_point(n) + data_point(n + N/2)],`

where N is the total number of data points. The resulting trend line (shown in **Figure 3 below** ) has the same slope as the underlying trend line in Figure 2, but the periodic wobble is gone.

Figure 3: Two-point averages of wobbly data from Figure 2
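This two-point averaging is easy to verify numerically. The sketch below (my own illustration, using NumPy on synthetic data: a linear trend plus one complete sine wobble over the interval) shows that the naive regression slope is biased, while regression on the averaged data recovers the true slope:

```python
import numpy as np

N = 1000
n = np.arange(N)
true_slope = 0.003
# synthetic data: linear trend plus one complete sine wobble over the interval
data = true_slope * n + 0.5 * np.sin(2 * np.pi * n / N)

# naive regression over all points: biased by the wobble
naive_slope = np.polyfit(n, data, 1)[0]

# average each point in the first half with its partner N/2 points later;
# the wobble values cancel in pairs, since sin(t + pi) = -sin(t)
half = N // 2
averaged = 0.5 * (data[:half] + data[half:])
avg_slope = np.polyfit(np.arange(half), averaged, 1)[0]
```

With these numbers the naive slope comes out noticeably below the true 0.003, while the slope of the averaged data matches it almost exactly.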

This averaging of data points requires evenly-spaced data. However, even if the data is not evenly-spaced, the effect of the wobble can still be removed by averaging the slope estimates from the first and second halves of the data interval. Thus no interpolation of points is necessary, even for unevenly-spaced data.
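Here is a sketch of that slope-averaging approach (again an illustration with made-up data, this time unevenly spaced on a jittered grid): fit a separate regression line to each half of the interval and average the two slopes. No interpolation is needed:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
# unevenly spaced sample points on [0, 1] (a jittered grid)
x = np.sort((np.arange(N) + rng.random(N)) / N)
true_slope = 2.0
# linear trend plus one complete sine wobble over the interval
y = true_slope * x + 0.5 * np.sin(2 * np.pi * x)

# naive regression over the whole interval: biased by the wobble
naive_slope = np.polyfit(x, y, 1)[0]

# average the slope estimates from the first and second halves of the interval
first, second = x < 0.5, x >= 0.5
slope_avg = 0.5 * (np.polyfit(x[first], y[first], 1)[0]
                   + np.polyfit(x[second], y[second], 1)[0])
```

The half-interval slope estimates of the wobble cancel, so `slope_avg` lands very close to the true slope even though the naive estimate is badly biased.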

It is also possible that there are multiple wobbles present with different periods. Multiple wobbles can be eliminated simultaneously by properly choosing the number of subintervals for averaging.

**Figure 4 below** , for instance, shows data subject to two superimposed wobbles: one with period equal to the entire data interval, and one with four complete wobbles over the data interval. As shown in Figure 4, it’s possible to make both wobbles disappear by averaging the slope estimates for three equal subintervals:

`slope_average = [ (slope for first L/3) + (slope for second L/3) + (slope for third L/3) ] / 3,`

where L is the total data interval.

Figure 4: Three subinterval average of wobbly data
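The subinterval slope-averaging can be packaged as a small helper. The sketch below (illustrative code, not from the original application) builds data with the two wobbles just described – indices 1 and 4 – and shows that averaging the slopes of S = 3 equal subintervals removes both:

```python
import numpy as np

def subinterval_slope_average(x, y, S):
    """Average the regression slopes from S equal subintervals of [x.min(), x.max()]."""
    edges = np.linspace(x.min(), x.max(), S + 1)
    slopes = []
    for a, b in zip(edges[:-1], edges[1:]):
        mask = (x >= a) & (x <= b)
        slopes.append(np.polyfit(x[mask], y[mask], 1)[0])
    return np.mean(slopes)

N = 3000
x = np.linspace(0.0, 1.0, N)
true_slope = 2.0
# wobble with 1 period over the interval plus wobble with 4 periods
y = true_slope * x + 0.3 * np.sin(2 * np.pi * x) + 0.2 * np.sin(2 * np.pi * 4 * x)

naive_slope = np.polyfit(x, y, 1)[0]           # biased by both wobbles
s3_slope = subinterval_slope_average(x, y, 3)  # neither 1 nor 4 is a multiple of 3
```

Since neither wobble index is a multiple of 3, both cancel in the three-subinterval average, and `s3_slope` recovers the underlying trend.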

The averaging, however, comes at a cost. Two-subinterval averaging will preserve a wobble of index 2 (that is, two complete periods per data interval), as shown in **Figure 5 below** ; and its contribution to the slope error will actually be increased, because each slope is now estimated over a shorter interval.

Figure 5: Amplification of wobble error via averaging

In general, averaging across S subintervals has the following properties:

* it eliminates wobbles whose periods do not evenly divide the subinterval length, thus eliminating their contribution to the error in the linear trend estimate;

* it preserves wobbles whose periods evenly divide the subinterval length, thus increasing their contribution to the error in the linear trend estimate.

**Which average to use?**

In order to determine the best averaging for a particular situation, we need to characterize the relative amplitudes of noise wobbles of different periods. We can do this by plotting the power spectrum of the deviations from the corrected regression line (the power spectrum is the absolute square of the FFT).

The index in the power spectrum corresponds to the number of noise-wobble periods per data interval. (The power spectrum index runs from 0, but the index 0 value – which is just the average of the data – is excluded because it does not affect the slope.) **Figure 6 below** gives the power spectrum in decibels (a log scale) in order to show values covering several orders of magnitude.

Figure 6: Analysis of residual power components (versus period index)

**Figure 6 above** shows that the index 1 component far outweighs the others: the 20-decibel increase over the next-largest component means its power is roughly 100 times larger. This is to be expected from Figure 1.

The index 4 power is also high compared to the neighboring indices (2, 3, 5, …) – the 6 dB difference between indices 3 and 4 means the index 4 power is about 4 times larger. More data should be examined to determine whether the large signal at index 4 is characteristic of the system, or whether it is due to chance.
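A spectrum like Figure 6 takes only a few lines to produce. The sketch below uses synthetic data standing in for the sensor measurements (a large index-1 wobble, a smaller index-4 wobble, and a little broadband noise), and removes the known linear trend before taking the spectrum; in real use one would subtract the corrected regression line instead:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1024
n = np.arange(N)
# synthetic data: linear trend, a large index-1 wobble, a smaller index-4
# wobble, and a little broadband measurement noise
data = (0.002 * n + 1.0 * np.sin(2 * np.pi * n / N)
        + 0.3 * np.sin(2 * np.pi * 4 * n / N) + rng.normal(0.0, 0.01, N))

residual = data - 0.002 * n             # deviations from the corrected trend line
power = np.abs(np.fft.rfft(residual)) ** 2
power_db = 10.0 * np.log10(power[1:])   # index 0 (the mean) is excluded
```

Plotting `power_db` against index reproduces the qualitative shape of Figure 6: index 1 dominant, index 4 well above its neighbors, and a low noise floor elsewhere.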

Based on this data, we might choose to do a two-subinterval average (which eliminates the index 1 and 3 wobbles but amplifies the index 2 and 4 wobbles); or we might choose a three-subinterval average (which eliminates the wobbles for indices 1, 2, and 4, but amplifies the index 3 wobble).

**In-depth analysis**

In order to determine the best averaging alternative, a more in-depth mathematical analysis is required. We can calculate the effect of different sine and cosine wobbles on the slope estimate using Fourier series.

Since we have closely-spaced data, we can approximate it with a continuous function f(x). For a function f(x) on the data interval [0, L] whose average value is 0, the slope of the regression line is given by

`slope = (12 / L^3) * Int{ (x - L/2) * f(x) },`

where **Int** {…} represents the integral over the variable x from 0 to L. For sines and cosines with q periods per data interval, we have

`slope of sin(2 pi q x / L) = -6 / (pi q L);   slope of cos(2 pi q x / L) = 0.`
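For a unit-amplitude sine wobble with q periods over [0, L], the regression-line slope works out to -6/(pi q L), while a cosine wobble contributes zero slope. A quick numerical check of both claims, using dense sampling to approximate the continuous case:

```python
import numpy as np

L = 1.0
x = np.linspace(0.0, L, 20001)  # dense grid approximating continuous data

# regression slopes of pure sine and cosine wobbles with q periods over [0, L]
sin_slopes = {q: np.polyfit(x, np.sin(2 * np.pi * q * x / L), 1)[0] for q in (1, 2, 3, 5)}
cos_slopes = {q: np.polyfit(x, np.cos(2 * np.pi * q * x / L), 1)[0] for q in (1, 2, 3, 5)}
```

The sine slopes agree with -6/(pi q L) to several decimal places, and the cosine slopes are zero to numerical precision.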

If the system experiences a linear combination of sine and cosine wobbles of the form:

`f(x) = sum over q of [ a_q cos(2 pi q x / L) + b_q sin(2 pi q x / L) ],`

then the error in the slope estimate due to these wobbles is:

`slope_error = -(6 / (pi L)) * sum over q of (b_q / q).`

Notice that only the sine wobbles contribute to the slope error: the cosine wobbles do not contribute. The coefficients b_q will in general be independent random variables with zero mean: in this case, the variance of the slope estimate is the sum of the variances due to each individual sine term:

`Var(slope_error) = (36 / (pi^2 L^2)) * sum over q of Var(b_q) / q^2.`

A similar analysis can be used to find the variance of the slope estimate obtained from averaging over S subintervals (only indices q that are multiples of S survive, with their slope-error amplitudes amplified by S^2):

`Var_S(slope_error) = (36 S^4 / (pi^2 L^2)) * sum over q = S, 2S, 3S, … of Var(b_q) / q^2.`

A good (low-variance) estimate can be obtained by finding the value of S that minimizes the variance of the slope estimate.
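Given estimates of the wobble variances Var(b_q), the search over S is short. The sketch below uses made-up variance values, chosen only to mimic the qualitative shape of Figure 6 (index 1 huge, index 4 elevated, index 5 tiny; higher indices assumed negligible), together with the fact that a surviving wobble's slope-error amplitude is amplified by S², i.e. its variance by S⁴:

```python
import numpy as np

# hypothetical wobble variances Var(b_q) for indices q = 1..5, chosen to
# mimic Figure 6 qualitatively; higher indices are assumed negligible
var_b = {1: 100.0, 2: 1.0, 3: 0.5, 4: 4.0, 5: 0.01}
L = 1.0

def slope_variance(S, qmax=5):
    """Variance of the S-subinterval slope estimate: only wobble indices that
    are multiples of S survive, with variance amplified by S**4."""
    return sum((36.0 * S**4 / (np.pi**2 * L**2)) * var_b[q] / q**2
               for q in range(S, qmax + 1, S))

variances = {S: slope_variance(S) for S in range(1, 6)}
best_S = min(variances, key=variances.get)
```

With these (made-up) numbers the ordering comes out as in the discussion of Table 1: S=1 is worst, then S=4, then S=2 and S=3, with S=5 giving much the tightest estimate.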

For example, we suppose that the power spectrum shown in Figure 6 is characteristic of the system, and that half the power (on average) for each index goes into sine and half into cosine. Then we obtain the variances for the slope estimate shown in **Table 1 below** :

Table 1: Variance of slope estimates for different data averaging schemes

The variance for S=1 is largest because of the relatively huge wobble with period equal to L. Next in decreasing order is S=4, followed by S=2, 3, and 5. Because of the very low residual power at index 5 (see Figure 6), the slope estimate from averaging over 5 subintervals is very tight.

However, before making use of this result practically, it should be verified that a low index 5 power is truly characteristic of the system, and not just an artifact of this particular data set.

**Further challenges in fitting angle data – Angle aliasing**

The foregoing discussion has treated one common difficulty encountered when using a linear fit on angle data, namely periodic noise. Another potential issue arises from angle aliasing, which occurs when the angle is obtained from position data (i.e., from cosine and/or sine values).

If only one coordinate is used to determine the angle, then there is a +/-kπ ambiguity in the angle's value. Even if both coordinates are used, there is still a +/-2kπ ambiguity, which can cause significant problems if the data is noisy. Part 3 in this series will deal with this issue.

To read **Part 1** , go to **Minimax line fitting**

*Chris Thron is currently assistant professor of mathematics at Texas A&M University Central Texas, and does consulting in algorithm design, numerical analysis, system analysis and simulation, and statistical analysis. Previously Chris was with Freescale Semiconductor, doing R&D in cellular baseband processing, amplifier predistortion, internet security, and semiconductor device performance. He has six U.S. patents granted plus three published applications. His web page is www.tarleton.edu/faculty/thron/, and he can be reached at thron@tarleton.edu.*