For as long as I can remember, the term “sensor fusion” has always been the holy grail of data processing—a mythical unicorn mentioned frequently but (more often than not) in a confusing context. So, what is sensor fusion? The answer is not straightforward because this term can mean different things to different people. In this article, we will cover the general concept of sensor fusion, look at the historical perspective, and examine two detailed cases—inertial sensor fusion and image fusion.
Sensor fusion is one aspect of a larger field called data fusion, which encompasses various ways of combining data (i.e. raw signals from sensors) to either a) refine the quality of information (i.e. analyzed data) received from a single source, or b) derive new information which would not have been available otherwise. In the context of data fusion, a sensor simply acts as a source of data to be fused to produce information.
In everyday life, one could consider calculating monthly expenses to be a form of sensor fusion. Let’s say your data comes from several credit cards. The credit card company would become a sensor since fusing the data means assembling negative and positive values from various sources into a single table to create a unified picture and also derive a new piece of information—the monthly total. To take it further, we could derive other pieces of information from our credit card sensor fusion, such as a histogram that would categorize expenses by their type, or perhaps show a distribution of expenses across the month.
Although it is considered to be a modern (and even trendy) concept, sensor fusion has existed since ancient times in various forms. Sea captains traveling great distances relied on a navigation technique known as “dead reckoning,” an approach that required the combination of a chronometer and a compass with speed estimation to derive a ship’s latitude and longitude. And even before the chronometer was invented, a sextant was used to determine a ship's position based on the angles between celestial bodies. Captains would fuse their sensor inputs to plot the course and find their way long before the first computer gave its first tick.
Fast-forwarding to a much nearer past, early color movies used a technique where every frame was exposed on two black-and-white strips, each behind a different color filter (one red, one green). As early as 1906, a technique was invented to unite those images by applying complementary tints, thus creating the colored picture: another form of sensor fusion.
Now let’s make a swooshing sound and quickly zoom into the modern age. A boy is out in a field, holding an iPhone. From time to time, he touches the screen with his finger. If we follow his gaze, we can see a small contraption hovering a few feet up in the air, making a buzzing sound. You and I, being creatures of the modern age, can easily recognize the small drone as it is being controlled by the boy via Wi-Fi. However—and here’s the wonder—the boy’s control over the drone is very imprecise, and his adjustments are crude and loosely timed. So, how does the drone keep a steady position in the air? How does it hover? How does it move in such a beautiful straight line? The answer is both simple and complicated at the same time, and it has to do with inertial sensor fusion.
Here’s a riddle: what do a dreidel, a kid’s toy, a fighter jet, and Put and Take (a gambling game from the First World War) all have in common? The answer is a spinning top—an object that follows the physical law of conservation of angular momentum. Simply put, this means that a body rotating around its central axis strives to maintain its orientation along that axis. This is the same principle used in throwing an American football, and also the reason why gun barrels are rifled: spinning bullets maintain their trajectory more precisely.
A gyroscope (or gyro) is a kind of spinning top combined with a measurement device used to measure angular changes; while aircraft instruments from not so long ago used actual spinning tops, modern versions may include a vibrating nano-block or even just a beam of light going round and round inside an optic fiber.
The gyro operates on the following principle: assuming the gyro rotates around its axis (i.e. the “main axis”), any angular change around an axis orthogonal to the main axis (i.e. the “input axis”) will produce a proportional change around a third orthogonal axis (i.e. the “output axis”). By observing any change around the expected output axis, we can measure the amount of change that happened around the input axis of the gyro (angle θ). The derivative of those changes over the time domain produces a measurement of angular velocity (ω), and a second derivative produces angular acceleration (α).
Inside a drone, there are at least three tiny gyroscopes, set to measure any angular change along the orthogonal axes. All three of them are most likely contained within a single package that’s the size of the tip of your pinky; this was made possible by the introduction of micro-electro-mechanical systems (MEMS) technology, which enabled the production of unbelievably tiny components, thus reducing the size of the gyroscope. Why is it important, you ask? Quite simple: this is what allows us to keep the drone stable.
We can think of a naive model where we integrate each axis along the time domain, thus obtaining a single angle for each gyro. Knowing three orthogonal angles provides us with full three-dimensional orientation. Tracking this orientation can help us keep the drone stable by detecting when the orientation begins to change in an unwanted direction and fixing it by adjusting power distribution in the motors to get back to the orientation we want to maintain.
Assuming θ to be an orientation angle, ωθ to be the angular velocity around the corresponding axis, and t to be time, in our simple model

θ(t) = ∫₀ᵗ ωθ(τ) dτ
therefore making our “tracking model” the good old motion equation

x(t) = x₀ + v·t
or in our case (since we deal with angles)

θ(t) = θ₀ + ωθ·t
This model is very simplistic and does not account for second-order effects. As one example of a serious problem that this model introduces, the gyro is not a precise instrument, so each and every sample we read from it contains a tiny error. The error is negligible in itself, but when something negligible gets integrated, bad things happen. Assuming the angular velocity remains constant over a small period of time, the equation now becomes:

θ(t) = θ₀ + ∫₀ᵗ (ωθ + e(τ)) dτ = θ₀ + ωθ·t + ∫₀ᵗ e(τ) dτ
In other (human) words, we integrate the error function e(t) or (simply put) accumulate the error over time! In this way, a small error becomes a big error and is responsible for a property of a gyro known as a “random walk.”
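To see how integrating a tiny per-sample error turns into a random walk, here is a minimal simulation sketch; the sample rate, bias, and noise figures are illustrative values chosen for the demonstration, not real gyro specifications:

```python
import random

random.seed(42)    # make the illustration repeatable

DT = 0.01          # sample interval in seconds (an assumed 100 Hz rate)
BIAS = 0.01        # constant sensor bias in deg/s (illustrative value)
NOISE_STD = 0.05   # per-sample noise in deg/s (illustrative value)

angle = 0.0        # integrated orientation estimate, in degrees
for _ in range(60 * 100):  # one minute of samples while holding perfectly still
    measured_rate = BIAS + random.gauss(0, NOISE_STD)  # the true rate is zero
    angle += measured_rate * DT                        # naive integration
print(f"estimated angle after one minute: {angle:.3f} degrees")
```

The simulated drone never moves, yet the integrated angle drifts steadily away from zero; with real hardware, the bias itself also wanders, which makes the drift even harder to predict.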
To recap, a gyro can be a very precise mechanism over a short period of time, but becomes increasingly unreliable as time goes by. Some methods can be used to overcome this: one (which I will only mention briefly, because it deviates from our subject) may include using raw angular velocity without integrating it to simply try and keep the angular velocity at zero across all axes, thus ensuring a stable drone. Another (and more interesting) method involves compensating for the random walk periodically by introducing a secondary source of information—one that is less precise but more stable over long periods of time, or (in other words) relies on sensor fusion.
Depending on the sophistication of the drone—I’ll pretend the boy in our story got his hands on a very serious piece of equipment for his amusement, so I hope you don’t mind—we can easily think of at least two very stable sources of information which could be harnessed here. One is our good old friend gravity: measured in the form of a three-dimensional vector, it is a shiny arrow that always points downwards with an approximate magnitude of 9.8 meters per second squared. The other is Earth’s magnetic field, which (unlike gravity) tends to shift over time, but at such a slow rate that (for our purposes) we can consider it stationary; this second friend is essentially the red and blue arrow of a compass pointing to magnetic north. (To be fair, relying on a compass is fairly difficult because of its sensitivity to any sort of magnetic disturbance or to the presence of ferromagnetic materials.) Harnessing one or both of these sources gives us a reference against which to reduce the accumulated error. The device which measures linear acceleration is known as an accelerometer. Our little drone will have three of those as well, one for each Cartesian axis.
Now, assuming that, for a stable horizontal position, the x and y axes would read zero acceleration and the z axis would read 9.8 meters per second squared downwards [Earth’s gravitational acceleration, or (0, 0, -9.8) in vector form], we can calculate the angular deviation from that value whenever the accelerometer assumes a different orientation. If we were to use the magnetic compass to supplement our measurement system, we would read the angle at which our drone is oriented towards magnetic north and find the angular deviation from that initial orientation whenever the drone starts to drift.
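As a sketch of that deviation calculation: the angle between the measured acceleration vector and the nominal rest vector (0, 0, -9.8) tells us how far from level the drone is tilted. The function name and the sample readings below are invented for illustration, and the calculation assumes the drone is not otherwise accelerating:

```python
import math

def tilt_from_gravity(ax, ay, az):
    """Angle (in radians) between the measured acceleration vector and the
    nominal rest vector (0, 0, -9.8), i.e. how far the drone is from level.
    Hypothetical helper; assumes the accelerometer sees only gravity."""
    dot = -9.8 * az                                    # dot product with (0, 0, -9.8)
    norm = math.sqrt(ax * ax + ay * ay + az * az) * 9.8
    return math.acos(max(-1.0, min(1.0, dot / norm)))  # clamp guards against rounding

level = math.degrees(tilt_from_gravity(0.0, 0.0, -9.8))    # resting flat
tilted = math.degrees(tilt_from_gravity(1.7, 0.0, -9.65))  # leaning to one side
print(f"level: {level:.2f} deg, tilted: {tilted:.2f} deg")
```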
Together, these three inputs can be fused in several different ways—catching moments where angular velocity (the one that is unaffected by summation error) measured by the gyro is approximately zero, solving orientation equations based on accelerometer and compass readings, and finally resetting the nominal orientation of the drone to the freshly calculated angles (the new θ₀, if you will). Of course, this is all a very simplistic approach, but it is enough for our demonstration purposes. (A much more complicated and robust fusion system would be based on the famous Kalman filter—introducing all the sensor readings together with different weights and trying to perform an optimization based on the covariance matrix.)
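For illustration, the simplest practical embodiment of this periodic compensation is a complementary filter, a lightweight cousin of the Kalman filter: integrate the gyro for responsiveness, then let the gravity-derived angle slowly pull the estimate back. The 0.98 weight, the function name, and the sensor values here are assumptions made for the sketch, not values from any real drone:

```python
import math

ALPHA = 0.98  # assumed blend weight: trust the gyro short-term, gravity long-term

def fuse_pitch(prev_pitch, gyro_rate, accel_x, accel_z, dt):
    """One complementary-filter step for the pitch angle, in radians.

    gyro_rate: angular velocity around the pitch axis (rad/s)
    accel_x, accel_z: accelerometer readings (m/s^2); sign conventions vary,
    here "level" means accel_x = 0 and accel_z = +9.8
    """
    gyro_pitch = prev_pitch + gyro_rate * dt    # short-term: integrate (drifts)
    accel_pitch = math.atan2(accel_x, accel_z)  # long-term: tilt implied by gravity
    return ALPHA * gyro_pitch + (1 - ALPHA) * accel_pitch

# A hovering drone: true pitch is zero, but the gyro reports a 0.01 rad/s bias.
pitch = 0.0
for _ in range(1000):  # ten seconds of samples at 100 Hz
    pitch = fuse_pitch(pitch, gyro_rate=0.01, accel_x=0.0, accel_z=9.8, dt=0.01)
print(f"fused pitch after 10 s: {pitch:.4f} rad")
```

With the gyro alone, the bias would integrate to 0.1 rad over these ten seconds; here the gravity reference keeps the estimate bounded near zero.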
One last aspect I would like to address is the importance of precise time measurement for the fusion. Consider the different readings we are getting from our sensors: we take the acceleration and gyro samples (assuming simultaneous readings) and try to calculate pitch angles based on each one of them. In a real-world application, each and every one of them is sampled at a slightly different time, so we take an acceleration sample at time tᵢ and a gyro sample at time tⱼ. Then, we calculate the pitch angles Φᵢ and Φⱼ based on these readings. Now, comparing Φᵢ and Φⱼ would introduce an error directly proportional to the angular velocity and the time difference, or

Φᵢ − Φⱼ ≈ ω·(tᵢ − tⱼ)
Two solutions come to mind; one calls for precise synchronization, such that the samples are truly acquired simultaneously, and this is usually achievable via a hardware solution. The other solution suggests disconnecting the information from “the metal” by running an interpolation window on the measurements, so instead of the real-world value, we have a continually adjusted mathematical model that represents the sensor data. Then, whenever the time comes to use the different sensor readings together, we can take an extrapolation of each of the sensors’ data to the exact same point of time. If the mathematical behavior model of each of the sensors is precise enough, the level of time synchronization we can achieve is very high.
To illustrate the interpolation technique with a very basic model:
- Sensor A has readings 0.4 and 0.5 acquired at times 1 and 2 seconds.
- Sensor B has readings 0.41 and 0.52 acquired at times 1.2 and 1.8 seconds.
If we were to compare the raw samples according to their order of appearance, we would have compared A at time t = 2 with B at time t = 1.8, getting a difference of 0.02.
Now, let’s pretend we decided to use the interpolation, and the chosen behavioral model for the sensors is the linear equation y = ax + b.
To find the value of the sensor B reading at time t = 2 seconds, we would use the two samples to figure out the coefficients:

0.41 = a·1.2 + b
0.52 = a·1.8 + b
Solving this gives us y = 0.183333333x + 0.19.
Now, plugging in the desired extrapolation time t = 2, we get y = 0.556666667, so comparing sensor A and sensor B at time t = 2 gives us a more accurate difference of 0.056666667.
Imagine these were angles in radians. The difference between those two methods of comparison is 0.036666667 radians, which is 2.1 degrees—a major issue for our little drone, which would have been helplessly drifting sideways had we not corrected such a compensation error.
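The whole worked example above fits in a few lines of Python; `linear_fit` is a hypothetical helper name, and the numbers are the ones from the text:

```python
def linear_fit(t1, y1, t2, y2):
    """Fit y = a*x + b through two samples taken at times t1 and t2."""
    a = (y2 - y1) / (t2 - t1)
    b = y1 - a * t1
    return a, b

# Sensor A: readings 0.4 and 0.5 at t = 1 s and t = 2 s
a_at_2 = 0.5
# Sensor B: readings 0.41 and 0.52 at t = 1.2 s and t = 1.8 s
a, b = linear_fit(1.2, 0.41, 1.8, 0.52)
print(f"model for B: y = {a:.9f}x + {b:.2f}")  # y = 0.183333333x + 0.19

b_at_2 = a * 2 + b                # extrapolate B's model to t = 2 s
naive_diff = abs(a_at_2 - 0.52)   # comparing mismatched timestamps
model_diff = abs(a_at_2 - b_at_2) # comparing at the same instant
print(f"naive: {naive_diff:.2f}, extrapolated: {model_diff:.9f}")  # 0.02 vs 0.056666667
```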
To gather all the pieces so far: the drone fuses angular data from three gyros to keep itself stabilized up in the air, and it fuses additional acceleration and magnetic field data to compensate for gyro instability. To achieve this fusion and keep it precise, the drone maintains a mathematical model of the different sensors’ behaviors. This shows how even a simple model of inertial sensor fusion can give interesting and delicate results.
Another form of modern sensor fusion deals with vision, or rather with the different ways in which the same object can be seen. This is called image fusion.
Image fusion can happen in various ways, the most familiar of which can be found on many modern smartphones and cameras. It is usually abbreviated as “HDR” (short for “High Dynamic Range”) and refers to a method for creating an optimal image despite extreme differences in lighting. For instance, let’s pretend we are taking a photograph. If part of the scene has deep shadows and part of the scene is brightly lit, it is very difficult to find the optimal exposure, and due to the limited dynamic range of most consumer cameras available today, we will get one of three equally frustrating results:
- Shadow: By pushing the exposure up, we will be able to see the details in shadow, but the highlights will be overexposed. This usually brings the pixels in this area to saturation, or (even worse) bleeds the overexposure over the edge of the object and creates a halo around it.
- Strong light: By pulling the exposure time down, we will optimize the highlights but underexpose the shadows, making them pitch black.
- Flat or weighted average: Unless we are making an artistic choice, this is usually the preferred method for classical photography because it allows us to make the most out of the situation. By giving some level of detail in shadow and some level of detail in highlights, we arrive at a compromise, but given enough contrast strength in the scene, we will get a grayish picture and usually lose some detail on both extremes of the curve.
This is where HDR comes into the picture. HDR can be a) a more advanced sensor that can “see” a higher range of contrast values (thus having a higher dynamic range), or b) an algorithm that leverages a standard sensor by fusing several exposures. The latter is a more interesting case, because it pertains more to our subject and lets us get more from less specialized equipment.
So what does HDR do? To get an HDR picture, we take the different exposure options similar to the ones discussed earlier and combine them together in a “smart” way. In other words, assuming we only did three exposures, one will have interesting details in shadowed areas, one will have interesting details in highlighted areas, and one will be generic and have details in the areas that are not highlighted or shadowed. Taking the best part of each picture and combining them together will give us a picture which does the impossible—creates a single photo with rich details and colors in all the areas of the picture, and is closer to what our eyes are used to seeing in real life. (This is because the eye itself performs a continuous fusion of images by frequently scanning a scene and reconstructing a picture in the brain.)
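As a toy sketch of that “smart” combination, the snippet below weights each source pixel by how close it is to mid-gray, a much-simplified stand-in for real exposure-fusion heuristics; the function names and pixel values are invented for illustration:

```python
def well_exposedness(p):
    """Weight a pixel value in [0, 1] by its distance from mid-gray;
    a common heuristic: detail survives best near 0.5."""
    return max(0.0, 1.0 - abs(p - 0.5) * 2.0)

def fuse_exposures(exposures):
    """Blend several aligned grayscale exposures pixel by pixel, letting
    each source contribute most where it is best exposed."""
    fused = []
    for pixels in zip(*exposures):
        weights = [well_exposedness(p) + 1e-6 for p in pixels]  # epsilon avoids /0
        total = sum(weights)
        fused.append(sum(w * p for w, p in zip(weights, pixels)) / total)
    return fused

# Three toy "images" of the same four-pixel scene: under-, mid-, and over-exposed.
under = [0.05, 0.10, 0.40, 0.55]
mid   = [0.15, 0.30, 0.70, 0.95]
over  = [0.45, 0.60, 0.98, 1.00]
fused = fuse_exposures([under, mid, over])
print([round(p, 3) for p in fused])
```

Each fused pixel is a weighted average of the three sources, so dark regions borrow from the brighter exposure and vice versa.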
Another technique not unlike HDR imaging is the fusion of images from different imaging sources and (usually) technologies. Think of the way various night vision imaging techniques combine to produce one clear image, or of hyperspectral imaging that allows us to examine different features that are normally unavailable to standard optical sensors.
Let’s say we have several cameras, each sensitive to different wavelengths of the infrared spectrum. We want to expose the same scene with those cameras and then combine the images to obtain maximum details or features in a final composite image. There are, of course, different fusion considerations that depend on the use case. Image fusion intended for machine processing is not the same as image fusion intended for human viewing—it may sound obvious, but it is worth remembering that human brains process images differently than machines do. Our discussion is directed towards machine-processing-oriented fusion, and so our emphasis is on maximizing the number of features present in the image and not the perceived quality. To do that, we need to accomplish three basic tasks:
- Align the different cameras to each other so the fusing of images can happen. This can be done by taking precise mechanical measurements and calculating a four-by-four transformation matrix that can be applied to each of the source images to align and transform them as needed. We will want to calculate a similar matrix for each camera or sensor so they can all be aligned to the same point. These matrices can also take care of perspective correction and, given additional resources, can be obtained dynamically by performing image correlation.
- Calculate the pixel weights that will be used to determine the individual contribution of each source pixel to the combined target pixel. The weights are there to ensure the preservation of high-frequency signals from each of the images (e.g. lines, corners), and can be obtained by subtracting the smoothed image from the original one so that details are accentuated while low-frequency signals are suppressed. This can be achieved in several ways; two common methods are anisotropic diffusion and bilateral filtering, both of which smooth the image without distorting the edges or blurring features (unlike, for example, simple Gaussian smoothing).
- Combine the weighted source pixels into a target image. This way, the destination pixels will contain the maximal amount of features from all the source images.
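The weighting and combining steps can be sketched on one-dimensional “images.” Here, simple box smoothing stands in for the edge-preserving filters mentioned above, alignment is assumed to be already done, and all names and pixel values are invented for the illustration:

```python
def smooth(signal):
    """Simple 3-tap box smoothing (a crude stand-in for anisotropic
    diffusion or bilateral filtering, which preserve edges better)."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out

def detail_weights(signal):
    """High-frequency content: the original minus its smoothed version."""
    return [abs(p - s) for p, s in zip(signal, smooth(signal))]

def fuse(images):
    """For each pixel, keep the source with the strongest local detail.
    Assumes the images are already aligned (the first task in the text)."""
    all_weights = [detail_weights(img) for img in images]
    fused = []
    for i in range(len(images[0])):
        best = max(range(len(images)), key=lambda s: all_weights[s][i])
        fused.append(images[best][i])
    return fused

# The visible camera sees a sharp edge up close; the thermal one, in the distance.
visible = [0.2, 0.9, 0.2, 0.5, 0.5, 0.5]
thermal = [0.4, 0.4, 0.4, 0.1, 0.8, 0.1]
fused = fuse([visible, thermal])
print(fused)  # [0.2, 0.9, 0.2, 0.1, 0.8, 0.1]
```

The fused result keeps the sharp close-range edge from the visible image and the distant detail from the thermal one.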
We will end up with an image that has the most “interesting” information from all the source images combined. In the case of infrared sensing, it might include objects that would have been missed had we used only one camera. On a foggy day, a normal camera will be able to see close objects well (like it always does), but as it tries to see into the distance, it will only register fog, because visible light is diffused by water vapor. A thermal (or infrared) camera, on the other hand, is not normally the optimal way to see things up close, because they might be overexposed or poorly detailed; but on this same foggy day, the thermal camera will see through the fog, providing an image with good detail of distant objects.
Together, these two source images give us a readable picture that would otherwise be completely impossible to get. What’s more, the same thing can be done for video frames, making image fusion possible for motion video as well!
Overall, the idea of sensor fusion is foundational to many of our modern-day technologies, and the more advanced it becomes, the more interesting technologies can emerge in the consumer market. In consumer electronics, the possibilities of sensor fusion are getting a serious boost from machine learning capabilities, driven by improvements in processor technologies and the omnipresence of cloud services that allow deep learning to fuse data at higher abstraction levels. We already see some of the results of these technologies and the effects they have: better cameras, smarter devices, and more areas of technological engagement with our daily lives.
Whenever we capture a pretty picture with a phone or see the screen rotating to react to a hand gesture, it should make us think about the different sensor fusion technologies that worked to help us so casually enjoy these things. In the words of Edgar Allan Poe, “It is by no means an irrational fancy that, in a future existence, we shall look upon what we think our present existence, as a dream.” Given that these technologies were barely imaginable just a decade ago, think about all the possibilities that are yet to come when we are able to do even more with the data available to us using different, perhaps still unthought-of sensors.