Techniques for debugging an asymmetric multi-core application: Part 2

In Part 1 in this series, we covered what an asymmetric multi-core application is and the typical problems that can be encountered in such a system. Now that we have an understanding of those issues, we can cover the tools and methodologies available to us for debugging systems with these problems.
Analyzing the issue
In an asymmetric multi-core scenario, the first step in debugging any issue is to isolate the core that is the source of the issue.
With access limited to the main core's debug interface (a serial port, for example), analyzing the secondary core to find a potential issue there can be a difficult endeavor.
To do so, we must first determine the circumstances under which the issue occurred: we must characterize all incoming and outgoing activities on the secondary core, applying specific techniques depending on the type of issue encountered. Keeping in mind that, in most cases, we must not alter the timing of the system, counters in memory are the optimal means of characterizing inputs and outputs.
In cases where the issue under investigation is timing-related, any change to the code, such as adding counters, could completely alter the behavior of the system; hardware counters typically have a minimal performance impact, so they should be used whenever possible.
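As a simple illustration, the sketch below shows one way input/output counters might be kept in a block of shared memory that the main core can read. The structure layout and the SHARED_DEBUG_BASE address are assumptions made for the example, not part of any particular platform.

```c
/* Minimal sketch of input/output counters kept in memory, assuming a region
 * of shared memory visible to the main core. The layout and the address
 * SHARED_DEBUG_BASE are hypothetical. */
#include <stdint.h>

struct io_counters {
    volatile uint32_t rx_events;   /* events received by the secondary core */
    volatile uint32_t tx_events;   /* events sent back to the main core     */
    volatile uint32_t rx_errors;   /* receive-side errors detected          */
    volatile uint32_t tx_errors;   /* transmit-side errors detected         */
};

/* Hypothetical fixed address agreed between the two cores. */
#define SHARED_DEBUG_BASE ((struct io_counters *)0x80001000u)

/* A single increment keeps the timing impact close to negligible. */
static inline void count_rx_event(void) { SHARED_DEBUG_BASE->rx_events++; }
static inline void count_tx_event(void) { SHARED_DEBUG_BASE->tx_events++; }
```

The main core can then read the same memory block over its own debug interface without disturbing the secondary core.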
Profiling the flow control
Flow control in and out of each core is an essential part of any real-time application. Regulating the flow of data between the different blocks in the system can be done in several ways, but a common method is the use of first-in, first-out (FIFO) queues.
For each FIFO, we must have counters for underflow and overflow; this is not only critical for debugging any issue with multi-core communication but is also essential in any bottleneck identification exercise.
Counters for FIFO read and write events can also be useful in identifying where the data stopped: a discrepancy between the read and write counters indicates that the reader client has somehow stopped processing.
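The sketch below shows what a software FIFO instrumented with these four counters might look like; the depth and element type are assumptions made purely for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* A software FIFO instrumented with the four counters described above. */
#define FIFO_DEPTH 64u

struct dbg_fifo {
    void     *slot[FIFO_DEPTH];
    uint32_t  head;                /* next slot to write */
    uint32_t  tail;                /* next slot to read  */
    /* Debug counters: read from the main core to locate where data stopped. */
    volatile uint32_t writes;
    volatile uint32_t reads;
    volatile uint32_t overflows;   /* producer found the FIFO full  */
    volatile uint32_t underflows;  /* consumer found the FIFO empty */
};

static bool dbg_fifo_put(struct dbg_fifo *f, void *data)
{
    uint32_t next = (f->head + 1u) % FIFO_DEPTH;
    if (next == f->tail) {         /* full: would overwrite unread data */
        f->overflows++;
        return false;
    }
    f->slot[f->head] = data;
    f->head = next;
    f->writes++;
    return true;
}

static bool dbg_fifo_get(struct dbg_fifo *f, void **data)
{
    if (f->head == f->tail) {      /* empty: reader has nothing to process */
        f->underflows++;
        return false;
    }
    *data = f->slot[f->tail];
    f->tail = (f->tail + 1u) % FIFO_DEPTH;
    f->reads++;
    return true;
}
```

Comparing writes against reads on each FIFO quickly narrows down which client in the chain has stalled.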
Profiling the data
In a network application, characterizing the data going through the system can potentially be the most informative technique, especially in "live" systems where we do not have as much control over inputs and outputs as in a test environment. However, it is extremely difficult to put in place without impacting performance if the system does not already perform data inspection under normal circumstances.
Should the system perform data inspection (Quality of Service filtering in a routing application, for example), counters need to be put in place to count each type of data going into and out of each core involved in the system under test (priority levels in a QoS application, for instance).
Should the system not perform any data inspection, debug functionality can be added, with a compile-time switch to turn it on or off. This provides a debug option once it has been determined that the issue is not linked to performance or timing. This can be easily proven by forcing the rate of events well below the maximum throughput; if the issue is still present, it should not be related to timing or performance.
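A minimal sketch of such compile-time switched inspection is shown below, assuming a QoS-style payload; the DEBUG_DATA_PROFILE switch, the priority field offset and the number of priority levels are all hypothetical.

```c
#include <stdint.h>

#define NUM_PRIORITIES 8u
#define PRIO_OFFSET    14u          /* assumed offset of the priority byte */

#ifdef DEBUG_DATA_PROFILE
static volatile uint32_t rx_prio_count[NUM_PRIORITIES];
static volatile uint32_t tx_prio_count[NUM_PRIORITIES];

/* Count each packet by its priority level on the way in and on the way out. */
static inline void profile_rx(const uint8_t *pkt)
{
    rx_prio_count[pkt[PRIO_OFFSET] % NUM_PRIORITIES]++;
}
static inline void profile_tx(const uint8_t *pkt)
{
    tx_prio_count[pkt[PRIO_OFFSET] % NUM_PRIORITIES]++;
}
#else
/* Compiled out entirely so the normal build keeps its original timing. */
#define profile_rx(pkt) ((void)0)
#define profile_tx(pkt) ((void)0)
#endif
```

When the switch is off, the macros expand to nothing, so the shipping build is untouched.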
Another important step in characterizing the data is to count the number of data events entering and leaving the system, as it is quite common for an error to be cyclic in nature; this is typically linked to rollover scenarios.
Profiling the data location
In any multi-core application, data is often passed from one core to another using a pointer to a location in shared memory. It is crucial to be capable of profiling these memory locations as it is quite common for software not to behave correctly when using memory addresses with a specific property.
For example, an error could be caused by software mishandling pointers to byte-aligned addresses; it could also be caused by attempting to access memory at an address beyond a specific range.
Profiling data locations consists mostly of counting the number of accesses to memory addresses with specific properties (the number of times a byte-aligned address was used, the number of times an address beyond a specific range was used) and comparing that with the number of errors observed over the same interval of time.
First, we must separate the data locations used into different categories: separating by alignment is one option, by address range another. Each category represents a single memory address property, and a single address can fit into several categories; for example, a byte-aligned, high-range address.
We can then add a counter for each category; if the categories have been chosen intelligently, we will be able to see that the number of error events equals the number of times addresses from a specific category were used. For instance, if the number of high-range addresses used is the same as the number of errors, the error is caused by using high-range addresses.
It should be noted that if the issue is not performance-related, dumping the data pointers to a debug console will be sufficient to profile the data locations; if the issue is performance-related, the use of counters is highly recommended.
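The sketch below illustrates the counter-based approach, binning each data pointer by alignment and by address range; the HIGH_RANGE_START boundary is an assumed value used only for the example.

```c
#include <stdint.h>

/* Sketch of data-location profiling: each buffer address handed to the
 * secondary core is binned by alignment and by range. */
#define HIGH_RANGE_START 0x40000000u   /* assumed boundary for illustration */

static volatile uint32_t byte_aligned_count;   /* not 4-byte aligned           */
static volatile uint32_t word_aligned_count;   /* 4-byte aligned               */
static volatile uint32_t high_range_count;     /* address >= HIGH_RANGE_START  */
static volatile uint32_t low_range_count;

static void profile_data_location(const void *ptr)
{
    uintptr_t addr = (uintptr_t)ptr;

    if (addr & 0x3u)
        byte_aligned_count++;
    else
        word_aligned_count++;

    if (addr >= HIGH_RANGE_START)
        high_range_count++;
    else
        low_range_count++;
}
```

If one of these counters tracks the error count over the same interval, the offending address property has effectively been identified.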
In the type of applications described in the previous article, and in software in general, issues fall into two broad categories: software implementation defects and error-handling defects. In the first case, one of the software components was implemented incorrectly for a normal-case scenario and is causing errors in other components. In the second case, the software is not reporting and/or recovering correctly from an error that occurred in the system.
When debugging any software issue, it is important to capture any detectable errors in the system and report them immediately. This becomes critically important when dealing with asymmetric multi-core applications where the secondary core internals are not accessible.
Capturing and reporting detectable hardware or software errors (hardware ring overflow, underflow, CRC errors, etc.) as early as possible is essential to facilitate debugging because an unreported error may be causing more detected errors later on in the test.
Capture and reporting of errors can be done in several ways: writing a message to an error log in memory or to a file on disk; or assigning a counter to each error, incrementing it on each occurrence and providing a stat-retrieve function.
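As an example of the second approach, the sketch below keeps one counter per detectable error and provides a stat-retrieve function; the error list and function names are illustrative only.

```c
#include <stdint.h>
#include <stdio.h>

/* One counter per detectable error, plus a stat-retrieve function the main
 * core or a debug console command can call. */
enum dbg_error {
    ERR_RING_OVERFLOW,
    ERR_RING_UNDERFLOW,
    ERR_CRC,
    ERR_COUNT
};

static volatile uint32_t error_stats[ERR_COUNT];

static inline void report_error(enum dbg_error e)
{
    error_stats[e]++;              /* cheap enough to leave in the data path */
}

/* Stat-retrieve function: dump all counters in one place. */
static void dump_error_stats(void)
{
    static const char *const names[ERR_COUNT] = {
        "ring overflow", "ring underflow", "CRC error"
    };
    for (int i = 0; i < ERR_COUNT; i++)
        printf("%-16s : %lu\n", names[i], (unsigned long)error_stats[i]);
}
```

Counters of this kind report the error as soon as it happens, rather than leaving it to surface as a different failure later in the test.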
When using hardware features such as co-processor functions (MAC, hashing coprocessor, pixel shading acceleration, etc.) it is essential to capture any status information returned by the hardware as well as any hardware statistics that may be available.
In the case of a real-time application, it is recommended to keep track of the last X status reports from the hardware, as it is often possible for an error to occur in hardware due to a sequence of events rather than a single event. The number of status reports to retain can vary; tracking the last two is usually sufficient for most systems.
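A minimal sketch of such a status history is shown below, keeping the last two status words as recommended; the status format and function names are assumptions.

```c
#include <stdint.h>

#define STATUS_HISTORY 2u        /* last two status reports, per the recommendation */

static uint32_t status_log[STATUS_HISTORY];
static uint32_t status_idx;      /* total number of statuses recorded so far */

static void record_hw_status(uint32_t status)
{
    status_log[status_idx % STATUS_HISTORY] = status;
    status_idx++;
}

/* On an error, retrieve the retained statuses oldest-first: the sequence of
 * events leading up to the failure, not just the final status. */
static void get_last_statuses(uint32_t out[STATUS_HISTORY])
{
    for (uint32_t i = 0; i < STATUS_HISTORY; i++)
        out[i] = status_log[(status_idx + i) % STATUS_HISTORY];
}
```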
Debugging hardware is a difficult exercise at the best of times. The recommended approach when attempting to prove a hardware issue is to collect as much information as possible on the circumstances surrounding the issue in order to reproduce it in a simulated and controlled environment.
However, it is entirely possible that the hardware data collection exercise will yield sufficient information to prove a hardware fault. Every piece of hardware under scrutiny should be approached as a secondary core from a main core perspective; except that in this case, we do not have the flexibility to have the hardware report any new information on top of what it already does.
If we have a flow control mechanism between software and hardware, we need to profile it, and also the data going in and out of the hardware. Timing and real-time considerations are often critical to these types of applications so it is often the case that the hardware has timing restrictions on its usage.
These restrictions can lead to potential hardware failures that are solely related to timing in the system: for example, the software may be required to service a hardware interrupt within a certain period of time, or it may be required not to write to a register for a given period of time.
These types of restrictions should be identified as early and as clearly as possible. It is also important to clarify the behavior of the hardware should the software fail to comply with one of these restrictions, and how to recover from them.
When debugging such a failure, all these restrictions should be kept in mind and testing should be focused on ruling these out by characterizing the exact hardware interactions at the time of failure.
This could be achieved through the use of carefully placed event counters or timers; or, in the case of more severe errors such as lock-ups, stopping the secondary core and examining its state may yield some additional information as to the current state of the hardware.
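For the interrupt-servicing restriction mentioned above, a sketch of a simple latency check is shown below; read_cycle_counter() stands in for whatever free-running timer the platform provides, and the deadline value is hypothetical.

```c
#include <stdint.h>

#define MAX_SERVICE_CYCLES 5000u             /* assumed deadline for illustration */

extern uint32_t read_cycle_counter(void);    /* platform-specific free-running timer */

static uint32_t irq_timestamp;
static volatile uint32_t worst_latency;
static volatile uint32_t deadline_misses;

static void on_hw_interrupt(void)            /* called on interrupt entry */
{
    irq_timestamp = read_cycle_counter();
}

static void on_hw_serviced(void)             /* called once the hardware is serviced */
{
    uint32_t latency = read_cycle_counter() - irq_timestamp;

    if (latency > worst_latency)
        worst_latency = latency;
    if (latency > MAX_SERVICE_CYCLES)
        deadline_misses++;                   /* correlate with observed hardware failures */
}
```

Comparing deadline_misses against the number of hardware failures observed either rules the timing restriction in or out.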
State determination is critical in the cases where the secondary core is locked up. Complete state determination can only be achieved through the use of hardware debuggers such as JTAG or ICE, or through reproducing the issue in a simulation environment.
In the case of application lock-ups, state determination can be a good starting point to establish the area to be investigated; this is assuming that a hardware debugger is available.
In that case, the user will reproduce the lock-up and then connect the debugger. Intimate knowledge of the system will be necessary to spot why the state is invalid: an infinite loop scenario, a circular dependency, a hardware signal not occurring, an unexpected hardware register value, and so on.
Characterizing code flow
When identical events can take several different paths in the code depending on the timing or configuration at that specific time, then it is important to add counters at critical branch points in the code to determine which path was taken and how many times.
Compared against the total number of events and errors, this can yield some interesting information. For example, in the simplified pipeline shown in Figure 1 below, it is crucial to characterize the number of events requiring anti-aliasing versus those requiring HDR imaging; in this case, counters should be placed at the highlighted red sections.
Figure 1: Simplified graphics pipeline
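As an illustration, the sketch below places counters at the two branch points of a pipeline like the one in Figure 1; the event structure and its needs_aa/needs_hdr flags are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical event descriptor for the pipeline in Figure 1. */
struct gfx_event {
    int needs_aa;      /* event takes the anti-aliasing path */
    int needs_hdr;     /* event takes the HDR imaging path   */
    /* ... actual payload ... */
};

static volatile uint32_t total_events;
static volatile uint32_t aa_path_count;
static volatile uint32_t hdr_path_count;

static void process_event(struct gfx_event *ev)
{
    total_events++;

    if (ev->needs_aa) {
        aa_path_count++;
        /* ... anti-aliasing processing ... */
    }
    if (ev->needs_hdr) {
        hdr_path_count++;
        /* ... HDR imaging processing ... */
    }
    /* ... common processing ... */
}
```

If errors track one of the path counters rather than the total event count, the faulty path is immediately apparent.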
Profiling feature coexistence within a core
A specific example of this is two identical events yielding different results in the system depending on timing, when two features coexist on the same secondary core and share resources.
In one case, the event is processed by one feature with no interference from the second feature. In another case, the second feature will interrupt the first mid-way through the processing then let it resume at a later stage. This could cause data corruption and data drops for the event.
In this case, not only do we need to determine the branch points taken but we also need to track the core's global activity: determining if one feature is interfering with another will be essential.
Careful code examination will then be necessary to determine whether any shared resources are being accessed inappropriately (lack of a necessary lock) or whether incorrect assumptions have been made (assuming a register or configuration has not been changed).
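One simple way to track this kind of interference is sketched below: a flag marks when the first feature is mid-processing, and a counter records how often the second feature runs during that window. Both feature entry points are hypothetical.

```c
#include <stdint.h>

static volatile uint32_t a_in_progress;     /* set while feature A holds the shared data */
static volatile uint32_t a_preempted_by_b;  /* times B ran while A was mid-way through   */

static void feature_a_process(void)
{
    a_in_progress = 1;
    /* ... feature A works on the shared resource ... */
    a_in_progress = 0;
}

static void feature_b_process(void)
{
    if (a_in_progress)
        a_preempted_by_b++;  /* compare this count with the observed error count */
    /* ... feature B works on the shared resource ... */
}
```

If the corruption count matches a_preempted_by_b, the investigation can focus on the resources the two features share.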
Table 1: Analysis matrix
The matrix in Table 1 above provides some indication as to which debugging technique should be given priority when first investigating an issue. The numbered questions below are the ones to be resolved, while the matrix indicates the point in the development process at which each should be considered. This is by no means a definitive matrix and should only be taken as a starting-point guideline when beginning an investigation.
1. Is the flow control configured correctly between the two cores?
2. Did the secondary core boot correctly? Did another feature running on the core lock-up?
3. If there are several paths to a successful completion of the off-loaded task, is one of them too slow?
4. Is another feature on the secondary core using too many resources at a critical time?
5. Is there a recognizable pattern to the corruption?
6. Is the corrupted data located at a recognizable location different from the non-corrupted data?
7. Is an error in the control flow causing the same data to be sent more than once?
8. Is bad use of the hardware causing it to duplicate data? Is there a defect in the hardware causing data duplication for some corner case or error scenario that's going undetected?
9. Is data being dropped due to a control flow error? Underflow? Overflow?
10. Is hardware dropping data due to misuse by software? Is an error scenario going undetected and triggering a hardware error?
11. Has the flow control suffered corruption due to an error (underflow) or the bad handling of a corner case (counter/pointer wrap-around scenario)?
12. Has the code locked-up due to a corner case being mishandled?
13. Does one of the code paths for a successful completion contain an error? Has an error in the system not been handled correctly, causing a lock-up of the core?
14. Is the handling of specific corner cases too slow?
15. Is access to specific data locations much slower than to others (aligned vs. unaligned)?
16. Is another feature in the system monopolizing shared resources sporadically?
In Part 1 in this series, we set the groundwork for a common understanding of what is an asymmetric multi-core system and detailed the typical error scenarios that can occur in such a system. In this article, we examined the set of tools available to a developer for debugging an asymmetric multi-core system.
In Part 3, the last in this series, we will investigate how this set of tools is applied to real-world problems in a series of specific examples covering a range of error scenarios.
To read more about multicore issues, go to "More on Multicores and Multiprocessors."
Julien Carreno is a senior engineer and technical lead within the Digital Enterprise Group at Intel Corp. He is currently the technical lead on a team responsible for delivering VoIP solution software for the next generation of Intel's Embedded Intel Architecture processors. He has worked at Intel for more than three years, specialising in acceleration technology for the Embedded Intel Architecture markets. His areas of expertise are Ethernet, E1/T1 TDM, device drivers, embedded assembler and C development, multi-core application architecture and design.