A logical method of debugging embedded systems
Debugging is as integral to embedded systems as design; the two can rightly be referred to as two sides of the same coin. Given the recent growth of embedded systems in the IoT space, it is an advantage when engineers are as good at debugging as they are at designing. Embedded systems have become very complex, and the boundary between software and hardware is blurring, so when an issue occurs at the system level it is difficult to find the root cause. As engineers working on embedded systems, we need to understand an issue quickly and trace it to its root cause.
Below are a few tips that can help in developing a logical, analytical approach to solving embedded system issues. This article does not cover the tools used for debugging; rather, it presents a tried and tested method for approaching the problem.
An embedded system consists of hardware, firmware and application software. When an issue is reported, it is often not clear where in the system it originates: it can be due to the hardware, the firmware code or the application software. Below is a generic diagram showing the basics of an embedded system along with the direction of data flow from one system to another.
Listed below is a step-by-step approach to debugging an issue. To illustrate the approach, we use the example of a video display issue seen on an embedded system.
Step 1 – Understand the setup & reproduce the issue correctly
The first and foremost task is to reproduce the issue correctly. Sometimes the issue is seen locally and the engineer can reproduce it easily. At other times it is seen at a remote location or customer site, and the engineer has to rely purely on the available logs to understand the setup and reproduce the issue. In this second case it is very important to understand the setup correctly, because a successful reproduction depends on it. If the engineer fails to do so, the issue may recur at the remote location or customer site: without a correct understanding of the issue, the solution developed is unlikely to be the right one, and fixing the issue then takes multiple iterations.
For our example, let’s consider a video display where noise was seen only on some monitors. Since the devices were installed in a remote location, only the video logs were available for debugging, and from the logs it was difficult to make out the root cause of the noise. Initially we tried playing different clips but were not able to reproduce the issue at our end. We could not tell clearly whether that was due to the video file used or due to the setup. Finally we were able to reproduce the issue by obtaining the exact clip that the customer used, along with one of the monitors.
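One way to make "understand the setup" concrete is to write the remote and local setups down side by side and list every difference before trying to reproduce the issue. The sketch below is only an illustration: the keys, file names and monitor names are hypothetical and do not come from the project described here.

```python
def diff_setups(customer, local):
    """Return {key: (customer_value, local_value)} for every mismatch."""
    keys = sorted(set(customer) | set(local))
    return {
        key: (customer.get(key), local.get(key))
        for key in keys
        if customer.get(key) != local.get(key)
    }


# Hypothetical setup descriptions; every key and value is illustrative.
customer_setup = {
    "clip": "customer_clip.ts",
    "monitor": "monitor_model_A",
    "video_mode": "1080i",
}
local_setup = {
    "clip": "lab_test_clip.ts",
    "monitor": "monitor_model_B",
    "video_mode": "1080i",
}

# Each printed line is a difference that could explain a failed reproduction.
for key, (theirs, ours) in diff_setups(customer_setup, local_setup).items():
    print(f"{key}: customer={theirs!r} local={ours!r}")
```

In the example above, the two differences such a list would have exposed are the clip and the monitor, which is what finally had to be matched.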
Step 2 – Break the issue into smaller issues
Once the issue is correctly reproduced, the next step is to break it into smaller issues. This is very important and is done by understanding the whole data flow. Start by breaking the flow of data at the interface between the application layer and the firmware layer, and then between the firmware layer and the hardware layer. This way each layer can be reviewed and tested independently. We need not stop there: the data flow path can be broken into further sub-levels based on a logical understanding of the system.
For our example, we divided the whole path of video frame data from the application to the hardware layer. We first traced how encoded video data is received and decoded, and how it is then passed to the firmware layer. There we studied how the pointers are assigned for each video frame and how the firmware hands them over to send each frame through the hardware layer. At the hardware layer, we studied the protocol by which a video frame is sent out on the physical lines. Once we understood the whole path, we divided it into logical blocks. One block was the application layer, where the encoded video data is decoded to raw video and stored in video buffers. Another block was the firmware layer, where we checked how the video buffers are given out to the hardware. The final block was the hardware, where we checked how the video data is given out on the actual physical lines.
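The idea of cutting the data path at each layer boundary can be sketched as a chain of stages with a "tap" after every stage, so that the first stage whose output deviates from a known-good reference can be named. This is a toy model under stated assumptions: the three stage functions are stand-ins invented for illustration, not the real decoder, firmware or hardware driver.

```python
def run_with_taps(stages, data):
    """Run data through (name, func) stages, recording each stage's output."""
    taps = {}
    for name, func in stages:
        data = func(data)
        taps[name] = data
    return data, taps


def first_bad_stage(stages, data, expected):
    """Return the name of the first stage whose output differs from expected."""
    _, taps = run_with_taps(stages, data)
    for name, _ in stages:
        if taps[name] != expected[name]:
            return name
    return None


# Stand-in blocks (hypothetical): one per layer of the data path.
def decode(encoded):              # application: encoded data -> raw samples
    return [b ^ 0x55 for b in encoded]

def queue_buffers(raw):           # firmware: hand buffers to the hardware
    return list(raw)

def queue_buffers_faulty(raw):    # same block with a deliberate fault
    return list(reversed(raw))

def drive_lines(buffers):         # hardware: put samples on the lines
    return bytes(buffers)


reference = [("application", decode), ("firmware", queue_buffers),
             ("hardware", drive_lines)]
under_test = [("application", decode), ("firmware", queue_buffers_faulty),
              ("hardware", drive_lines)]

encoded = [1, 2, 3]
_, expected = run_with_taps(reference, encoded)
print(first_bad_stage(under_test, encoded, expected))
```

Stages downstream of a fault also produce wrong output because they receive wrong input, which is why the sketch reports only the first mismatch.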
Step 3 – Fix each of the smaller issues
Once the whole data path is broken into logical blocks for each layer, we need to test each block individually and validate it in the way appropriate to that layer. This helps in finding where the root cause of the issue lies. Sometimes a system issue can be fixed by changing only one layer, whereas at other times it needs changes in more than one layer. By logically breaking up the whole path, we can find out where changes are needed and fix them accordingly.
For our example, at the application layer we validated the data path flow by writing to and reading from a file: the video data buffers generated after decoding were compared with the expected values. At the firmware level, a fixed pattern was used as data input instead of the data coming from the application layer. Here we observed that the video data was given to the firmware layer as fields rather than frames, but the field information (top or bottom) was not passed correctly from the application layer to the firmware layer. We therefore modified the code so that each buffer contains the correct field information.
At the hardware level, these fixed patterns and the corresponding control signals for the interface were validated against the specification using logic analysers. The protocol we were using was BT.1120, and we saw that the protocol timings did not meet the specification. We realized that this was why some monitors worked properly and others did not. Once we brought the protocol in line with the specification, all the monitors worked properly. We also realized that the noise issue was actually a combination of the wrong field information and the wrong protocol timings.
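A fixed pattern is useful because the expected content of every buffer is known in advance, so a wrongly tagged field can be detected automatically. The following is a minimal sketch, assuming a pattern in which every sample of a line equals the line number and the usual convention that the top field holds the first line of the frame. The function names and the buffer layout are hypothetical and are not taken from the firmware discussed here.

```python
TOP, BOTTOM = "top", "bottom"


def make_pattern_frame(height, width):
    """Fixed pattern: every sample in a line equals that line's number."""
    return [[line % 256] * width for line in range(height)]


def split_fields(frame):
    """Split a frame into (tag, lines) buffers; the top field holds line 0."""
    return [(TOP, frame[0::2]), (BOTTOM, frame[1::2])]


def field_tag_is_consistent(tag, lines):
    """With this pattern, top-field lines are even and bottom-field lines odd."""
    parity = 0 if tag == TOP else 1
    return all(line[0] % 2 == parity for line in lines)


fields = split_fields(make_pattern_frame(height=8, width=4))
print([field_tag_is_consistent(tag, lines) for tag, lines in fields])

# Simulate the fault: the same buffers arrive with the tags exchanged.
mistagged = [(BOTTOM, fields[0][1]), (TOP, fields[1][1])]
print([field_tag_is_consistent(tag, lines) for tag, lines in mistagged])
```

A check of this kind can sit at the interface between the application and firmware layers, so that wrong field information is caught before it reaches the hardware.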
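Timing measurements taken from a logic analyser capture can be checked against the specification mechanically rather than by eye. The sketch below shows the shape of such a check only. The parameter names and the limit values are placeholders chosen for illustration; the real figures must be taken from the BT.1120 specification for the video mode in use.

```python
def check_timings(measured, limits):
    """Return (name, value, low, high) for each missing or out-of-range value."""
    violations = []
    for name, (low, high) in limits.items():
        value = measured.get(name)
        if value is None or not (low <= value <= high):
            violations.append((name, value, low, high))
    return violations


# Placeholder limits (assumption): replace with values from the specification.
limits = {
    "samples_per_line": (2200, 2200),
    "lines_per_frame": (1125, 1125),
}

# Hypothetical measurements exported from a logic analyser capture.
measured = {
    "samples_per_line": 2198,
    "lines_per_frame": 1125,
}

for name, value, low, high in check_timings(measured, limits):
    print(f"{name}: measured {value}, allowed {low}..{high}")
```

A monitor that tolerates small deviations will still lock to an out-of-range signal while a stricter one will not, so a check against the written limits is more reliable than testing against whichever monitor happens to be on the bench.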
Step 4 – Negative Testing
Testing is of course a very important aspect of problem solving: we must confirm that the issue is properly fixed and will not come back. It is therefore important to do negative testing along with the usual testing done after fixing an issue. Negative testing means deliberately forcing the issue into the system and validating the system’s response against the designed solution. In other words, if a system gives wrong output for correct input and we have identified a cause, then we should be able to construct a deliberately wrong input that compensates for that cause, so that the same system gives correct output. If this happens, the root cause is correctly identified and the fix is validated.
For our example, we tested in the following way. For the protocol timings, we saw that after making them as per the specification, all the monitors showed the noise issue. Even a slight deviation from the specification changed the behaviour of the monitors. This confirmed that the wrong protocol timings at the hardware layer were the root cause of the differing behaviour between monitors. Next, for a monitor where we were seeing the issue, we knowingly swapped the top and bottom fields at the application layer, and the issue no longer occurred. This confirmed that the incorrect top and bottom field pointers were the root cause of the noise issue. In this way we verified that the solution addressed the root cause and that it included a fix in both the application and hardware layers.
By following the above steps, any issue can be approached in a more structured way that leads to its resolution.
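The field-swap experiment can be expressed as a small negative test. The sketch below is a toy model, not the actual code: it assumes the fault behaves like an exchange of the top and bottom fields somewhere in the path, and all function names are hypothetical. It shows the two checks that together confirm a root cause: a deliberately wrong input cancels the fault in the unfixed path, and the same wrong input produces the symptom in the fixed path.

```python
def swap_fields(pair):
    """Exchange the (top, bottom) fields of one frame."""
    top, bottom = pair
    return (bottom, top)


def path_with_fault(pair):
    """Toy model of the faulty path: fields reach the output exchanged."""
    return swap_fields(pair)


def path_fixed(pair):
    """Toy model of the corrected path: fields pass through unchanged."""
    return pair


def shows_noise(displayed, reference):
    """The symptom appears whenever the displayed fields differ from reference."""
    return displayed != reference


reference = ("top_field", "bottom_field")

# Unfixed path: correct input shows the issue, swapped input hides it.
print(shows_noise(path_with_fault(reference), reference))
print(shows_noise(path_with_fault(swap_fields(reference)), reference))

# Fixed path: correct input is clean, swapped input now shows the issue.
print(shows_noise(path_fixed(reference), reference))
print(shows_noise(path_fixed(swap_fields(reference)), reference))
```

If the swapped input did not hide the issue in the unfixed path, the field information would not be the root cause, or at least not the only one.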
Ayusman Mohanty is a product architect with key focus on building hardware for embedded applications.