Techniques for debugging an asymmetric multi-core application: Part 2

In Part 1 of this series, we covered what an asymmetric multi-core application is and the typical problems that can be encountered in such a system. Now that we have an understanding of those issues, we can cover the tools and methodologies available to us to debug systems with these problems.

Analyzing the issue
In an asymmetric multi-core scenario, the first step in debugging any issue is to isolate the core at the source of the issue.

With access limited to the main core's debug interface (a serial port, for example), analyzing the secondary core to find a potential issue there can be a difficult endeavor.

To do so, we must first determine the circumstances under which the issue occurred: we must characterize all incoming and outgoing activities on the secondary core, using specific techniques depending on the type of issue encountered. Keeping in mind that, in most cases, we must not alter the timing in the system, counters in memory are the optimal means of characterizing inputs and outputs.

In cases where the issue under investigation is timing-related, any change to the code, such as adding counters, could completely alter the behavior of the system. Hardware counters typically have a minimal performance impact on the system, so they should be used whenever possible.
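As a minimal sketch of the in-memory counters described above, the secondary core could increment one counter per incoming and outgoing event, and the main core could read them over its own debug interface. The names `core_stats_t`, `stats_event_in`, `stats_event_out` and `stats_in_flight` are hypothetical, and the placement of the structure in shared memory is assumed rather than shown.

```c
#include <stdint.h>

/* Hypothetical statistics block; on a real target it would be placed
 * in memory that both cores can see. */
typedef struct {
    volatile uint32_t events_in;   /* events received by the secondary core */
    volatile uint32_t events_out;  /* events handed back by the secondary core */
} core_stats_t;

/* Secondary core: a single increment on the hot path. */
static inline void stats_event_in(core_stats_t *s)  { s->events_in++; }
static inline void stats_event_out(core_stats_t *s) { s->events_out++; }

/* Main core: number of events currently inside the secondary core.
 * Unsigned subtraction stays correct across a counter rollover. */
static inline uint32_t stats_in_flight(const core_stats_t *s)
{
    return s->events_in - s->events_out;
}
```

Each counter has a single writer, which keeps the cost to one memory increment per event; even that may be too intrusive for a timing-sensitive issue, in which case hardware counters remain the better choice.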

Profiling the flow control
Flow control in and out of each core is an essential part of any real-time application. The flow of data between the different blocks in the system can be regulated in several ways, but a common method is the use of First-In-First-Out (FIFO) queues.

For each FIFO, we must have a counter for underflow and a counter for overflow; this is not only critical for debugging any issue with multi-core communication but is also essential in any bottleneck identification exercise.

Counters for FIFO read and write events can also be useful to identify where the data stopped: a discrepancy between read and write counters will indicate that the reader client has somehow stopped processing.
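The four counters above can be sketched on a simple ring buffer as follows. This is an illustrative single-threaded sketch, not the FIFO of any particular platform: `dbg_fifo_t`, `dbg_fifo_write`, `dbg_fifo_read` and `FIFO_DEPTH` are hypothetical names, and a real inter-core FIFO would also need the memory barriers or hardware support appropriate to the target.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 8u  /* hypothetical depth; must be a power of two */

typedef struct {
    uint32_t slots[FIFO_DEPTH];
    uint32_t head;        /* free-running count of accepted writes */
    uint32_t tail;        /* free-running count of completed reads */
    uint32_t overflows;   /* writes rejected because the FIFO was full */
    uint32_t underflows;  /* reads attempted while the FIFO was empty */
} dbg_fifo_t;

static bool dbg_fifo_write(dbg_fifo_t *f, uint32_t item)
{
    if (f->head - f->tail == FIFO_DEPTH) {
        f->overflows++;
        return false;
    }
    f->slots[f->head % FIFO_DEPTH] = item;
    f->head++;
    return true;
}

static bool dbg_fifo_read(dbg_fifo_t *f, uint32_t *item)
{
    if (f->head == f->tail) {
        f->underflows++;
        return false;
    }
    *item = f->slots[f->tail % FIFO_DEPTH];
    f->tail++;
    return true;
}
```

Because `head` and `tail` are never reset, they double as the write and read event counters: if `head` keeps advancing until the FIFO fills while `tail` stays constant, the reader has stopped processing.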

Profiling the data
In a network application, characterizing the data going through the system can potentially be the most informative technique, especially in “live” systems where we do not have as much control over inputs and outputs as in a test environment. However, it is extremely difficult to put in place without impacting performance if the system does not already perform data inspection under normal circumstances.

Should the system perform data inspection (Quality of Service filtering in a routing application, for example), counters need to be put in place to count each type of data (each priority level in a QoS application) going in and out of each core involved in the system under test.

Should the system not perform any data inspection, debug functionality could be added (with an option switch to turn it on or off at compile time). This provides a debug option once it has been determined that the issue is not linked to performance or timing. This can be easily checked by forcing the rate of events well below the maximum throughput: if the issue is still there, then it should not be related to timing or performance.
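One possible shape for such a compile-time switch is sketched below, with one pair of counters per priority level. The macro `DEBUG_DATA_PROFILE`, the type `data_profile_t` and the value of `QOS_LEVELS` are assumptions made for illustration only.

```c
#include <stdint.h>

#define QOS_LEVELS 4u          /* hypothetical number of priority levels */

#ifndef DEBUG_DATA_PROFILE
#define DEBUG_DATA_PROFILE 1   /* build with -DDEBUG_DATA_PROFILE=0 to compile it out */
#endif

typedef struct {
    uint32_t in[QOS_LEVELS];   /* events received, per priority level */
    uint32_t out[QOS_LEVELS];  /* events sent, per priority level */
} data_profile_t;

#if DEBUG_DATA_PROFILE
#define PROFILE_IN(p, level)  ((p)->in[(level) % QOS_LEVELS]++)
#define PROFILE_OUT(p, level) ((p)->out[(level) % QOS_LEVELS]++)
#else
#define PROFILE_IN(p, level)  ((void)0)
#define PROFILE_OUT(p, level) ((void)0)
#endif
```

With the switch off, the macros expand to nothing, so the release build carries no extra instructions on the data path.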

Another important step in characterizing the data is to count the number of data events coming into and going out of the system, as it is quite common for an error to be cyclic in nature; this will typically be linked to rollover scenarios.

Profiling the data location
In any multi-core application, data is often passed from one core to another using a pointer to a location in shared memory. It is crucial to be able to profile these memory locations, as it is quite common for software not to behave correctly when using memory addresses with a specific property.

For example, an error could be caused by software mishandling pointers to byte-aligned addresses; it could also be caused by an attempt to access memory at an address beyond a specific range.

Profiling data locations consists mostly of counting the number of accesses to memory addresses with specific properties (the number of times we used a byte-aligned address, the number of times we used an address beyond a specific range) and comparing them with the number of errors observed in the same interval of time.

First we must separate the data locations used into different categories: separating by alignment is one option, by address range is another. Each category will represent a single memory address property. A single address can fit into several categories; for example, a byte-aligned high-range address.

We can then add a counter for each category. If the categories have been chosen intelligently, we will be able to see that the number of error events is equal to the number of times we used addresses from a specific category. For instance, if the number of high-range addresses used is the same as the number of errors, the error is most likely caused by using high-range addresses.
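The category counters can be sketched as below. Two assumptions are made purely for illustration: a "byte-aligned" address is taken to mean one that is not on a 4-byte boundary, and the high range is taken to start at the hypothetical `HIGH_RANGE_START`. The names `addr_profile_t` and `addr_profile_record` are likewise invented for this sketch.

```c
#include <stdint.h>

#define HIGH_RANGE_START 0x80000000u  /* hypothetical start of the "high" range */

typedef struct {
    uint32_t total;       /* every address recorded */
    uint32_t unaligned;   /* address not on a 4-byte boundary */
    uint32_t high_range;  /* address at or above HIGH_RANGE_START */
    uint32_t errors;      /* errors reported during the same interval */
} addr_profile_t;

/* One address may fall into several categories at once. */
static void addr_profile_record(addr_profile_t *p, uintptr_t addr)
{
    p->total++;
    if ((addr & 0x3u) != 0u) {
        p->unaligned++;
    }
    if (addr >= HIGH_RANGE_START) {
        p->high_range++;
    }
}
```

After a test run, the category whose count matches `errors` is the first candidate to investigate.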

It should be noted that if the issue is not performance-related, then dumping the data pointers to a debug console will be sufficient to profile the data location; if the issue is performance-related, then the use of counters is highly recommended.

Error capture
In the type of applications described in the previous article, and in software in general, issues fall into two broad categories: software implementation defects and error-related defects. In the first case, one of the software components was implemented incorrectly for a normal-case scenario and is causing errors in other components. In the second case, the software is not reporting and/or recovering correctly from an error that occurred in the system.

When debugging any software issue, it is important to capture any detectable errors in the system and report them immediately. This becomes critically important when dealing with asymmetric multi-core applications where the secondary core internals are not accessible.

Capturing and reporting detectable hardware or software errors (hardware ring overflow, underflow, CRC errors, etc.) as early as possible is essential to facilitate debugging, because an unreported error may cause more detected errors later on in the test.

Capture and reporting of errors can be done in several ways: writing a message to an error log in memory or in a file on disk, or assigning a counter to each error, incrementing it on each occurrence and providing a statistics retrieval function.
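The counter-per-error approach with a retrieval function might look like the following sketch. The error identifiers mirror the examples given earlier (ring overflow, ring underflow, CRC); the names `err_id_t`, `err_report` and `err_stats_get` are hypothetical.

```c
#include <stdint.h>

typedef enum {
    ERR_RING_OVERFLOW,
    ERR_RING_UNDERFLOW,
    ERR_CRC,
    ERR_COUNT            /* number of error types; keep last */
} err_id_t;

static volatile uint32_t g_err_counters[ERR_COUNT];

/* Called at the point of detection, so the error is recorded immediately. */
static inline void err_report(err_id_t id)
{
    if ((unsigned)id < (unsigned)ERR_COUNT) {
        g_err_counters[id]++;
    }
}

/* Statistics retrieval: copies up to n counters into out and
 * returns how many were copied. */
static unsigned err_stats_get(uint32_t *out, unsigned n)
{
    unsigned i;
    if (n > (unsigned)ERR_COUNT) {
        n = (unsigned)ERR_COUNT;
    }
    for (i = 0u; i < n; i++) {
        out[i] = g_err_counters[i];
    }
    return n;
}
```

The main core can call the retrieval function periodically and compare snapshots, which shows which error appeared first rather than only the final totals.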

When using hardware features such as co-processor functions (MAC, hashing coprocessor, pixel shading acceleration, etc.), it is essential to capture any status information returned by the hardware as well as any hardware statistics that may be available.

In the case of a real-time application, it is recommended to keep track of the last X status reports from hardware, as it is often possible for an error to occur in hardware due to a sequence of events rather than just a single event. The number of status reports to keep track of can vary; the recommendation is to track the last two, which is usually sufficient for most systems.
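Keeping the last few status reports can be sketched as a very small ring, shown here with a depth of two as recommended above. `status_history_t`, `status_record` and `status_get` are hypothetical names, and the status word is assumed to fit in 32 bits.

```c
#include <stdint.h>

#define STATUS_HISTORY 2u   /* number of reports kept; use a power of two */

typedef struct {
    uint32_t entries[STATUS_HISTORY];
    uint32_t count;         /* total status reports recorded so far */
} status_history_t;

static void status_record(status_history_t *h, uint32_t hw_status)
{
    h->entries[h->count % STATUS_HISTORY] = hw_status;
    h->count++;
}

/* age 0 is the most recent report, age 1 the one before it.
 * Returns 0 if that report was never recorded or is no longer kept. */
static int status_get(const status_history_t *h, uint32_t age, uint32_t *out)
{
    if (age >= STATUS_HISTORY || age >= h->count) {
        return 0;
    }
    *out = h->entries[(h->count - 1u - age) % STATUS_HISTORY];
    return 1;
}
```

When an error is detected, dumping ages 0 and 1 shows the status that accompanied the failure and the one that preceded it.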

Debugging hardware is a difficult exercise at the best of times. The recommended approach when attempting to prove a hardware issue is to collect as much information as possible on the circumstances surrounding the issue in order to reproduce it in a simulated and controlled environment.

However, it is entirely possible that the hardware data collection exercise will yield sufficient information to prove a hardware fault. Every piece of hardware under scrutiny should be approached as a secondary core from a main core perspective, except that in this case we do not have the flexibility to have the hardware report any new information on top of what it already does.

If we have a flow control mechanism between software and hardware, we need to profile it, along with the data going in and out of the hardware. Timing and real-time considerations are often critical to these types of applications, so it is often the case that the hardware has timing restrictions on its usage.

These restrictions can lead to potential hardware failures that will be solely related to timing in the system: for example, the software may be required to service a hardware interrupt within a period of time, or it may be required not to write to a register for a given period of time.

These types of restrictions should be identified as early and as clearly as possible. It is also important to clarify the behavior of the hardware should the software fail to comply with one of these restrictions, and how to recover from such a failure.

When debugging such a failure, all these restrictions should be kept in mind and testing should be focused on ruling these out by characterizing the exact hardware interactions at the time of failure.

This could be achieved through the use of carefully placed event counters or timers; or, in the case of more severe errors such as lock-ups, stopping the secondary core and examining its state may yield some additional information as to the current state of the hardware.
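As one example of a carefully placed timer, the sketch below checks the "service the interrupt within a period of time" restriction mentioned earlier. It assumes a free-running 32-bit tick counter is available and that its value can be captured when the interrupt is raised and again when it is serviced; `latency_watch_t`, `latency_record` and the deadline value are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t deadline_ticks;  /* hypothetical limit taken from the hardware spec */
    uint32_t max_latency;     /* worst latency observed so far */
    uint32_t violations;      /* number of times the deadline was exceeded */
} latency_watch_t;

/* raised_at and serviced_at are readings of the same free-running tick
 * counter; unsigned subtraction stays correct across one counter wrap. */
static void latency_record(latency_watch_t *w, uint32_t raised_at,
                           uint32_t serviced_at)
{
    uint32_t latency = serviced_at - raised_at;

    if (latency > w->max_latency) {
        w->max_latency = latency;
    }
    if (latency > w->deadline_ticks) {
        w->violations++;
    }
}
```

A non-zero `violations` count at the time of failure points toward the timing restriction, while a `max_latency` well under the deadline helps rule it out.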

State determination
State determination is critical in cases where the secondary core is locked up. Complete state determination can only be achieved through the use of hardware debuggers such as JTAG or ICE, or through reproducing the issue in a simulation environment.

In the case of application lock-ups, state determination can be a good starting point to establish the area to be investigated; this is assuming that a hardware debugger is available.

In that case, the user will reproduce the lock-up and then connect the debugger. Intimate knowledge of the system will be necessary to spot why the state is invalid: infinite loop scenario, circular dependency, hardware signal not happening, or unexpected hardware register value.

Characterizing code flow
When identical events can take several different paths in the code depending on the timing or configuration at that specific time, it is important to add counters at critical branch points in the code to determine which path was taken and how many times.

Compared with the total number of events and errors, this can yield some interesting information. For example, in the simplified pipeline shown in Figure 1 below, it is crucial to characterize the number of events requiring anti-aliasing versus those requiring HDR imaging; in this case, counters should be placed at the sections highlighted in red.

Figure 1: Simplified graphics pipeline
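Branch-point counters for the anti-aliasing versus HDR split could be sketched as follows. The event tag `event_kind_t` and the function `process_event` are hypothetical stand-ins for the real pipeline stage, whose actual processing is omitted.

```c
#include <stdint.h>

typedef struct {
    uint32_t events;    /* every event entering the stage */
    uint32_t path_aa;   /* events that took the anti-aliasing path */
    uint32_t path_hdr;  /* events that took the HDR path */
    uint32_t errors;    /* errors observed at the stage output */
} path_stats_t;

typedef enum { NEEDS_AA, NEEDS_HDR } event_kind_t;  /* hypothetical event tag */

static void process_event(path_stats_t *s, event_kind_t kind)
{
    s->events++;
    if (kind == NEEDS_AA) {
        s->path_aa++;
        /* anti-aliasing work would go here */
    } else {
        s->path_hdr++;
        /* HDR work would go here */
    }
}
```

If `errors` tracks `path_hdr` while `path_aa` keeps growing without errors, the HDR path is the place to look first.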

Profiling feature coexistence within a core
A specific example of this is two identical events yielding different results in the system depending on timing, where two features coexist on the same secondary core and share resources.

In one case, the event is processed by one feature with no interference from the second feature. In another case, the second feature will interrupt the first midway through the processing, then let it resume at a later stage. This could cause data corruption and data drops for the event.

In this case, not only do we need to determine the branch points taken, but we also need to track the core's global activity: determining whether one feature is interfering with another will be essential.

Careful code examination will then be necessary to determine whether any shared resources are inappropriately accessed (lack of a necessary lock) or whether assumptions have been incorrectly made (assuming a register or configuration has not been changed).
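One lightweight way to track the core's global activity is a shared sequence number that every feature bumps when it starts a unit of work; a feature that finds the number changed when it finishes knows that another feature ran in the middle of its processing. This is only a sketch: `g_activity_seq`, `feature_begin` and `feature_end` are hypothetical, and on a real core the increment would need to be atomic or performed with interrupts masked.

```c
#include <stdint.h>

/* Bumped every time any feature on the core starts a unit of work. */
static volatile uint32_t g_activity_seq;

typedef struct {
    uint32_t runs;         /* units of work completed by this feature */
    uint32_t interrupted;  /* runs during which another feature also started */
} feature_stats_t;

/* Call on entry to a feature; returns a token to pass to feature_end(). */
static uint32_t feature_begin(void)
{
    return ++g_activity_seq;
}

static void feature_end(feature_stats_t *st, uint32_t token)
{
    st->runs++;
    if (g_activity_seq != token) {
        st->interrupted++;   /* some other feature began work in between */
    }
}
```

Comparing `interrupted` with the number of corrupted or dropped events indicates whether the interference is a plausible cause before the detailed code examination begins.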

Table 1. Analysis Matrix

Analysis matrix
The matrix in Table 1 above provides some indication as to which debugging technique should be given priority when first investigating an issue. Below are the questions to be resolved, while the matrix indicates the point in the development process at which they should be considered. This is by no means a definitive matrix and should only be taken as a starting guideline when first beginning an investigation.

1. Is the flow control configured correctly between the two cores?
2. Did the secondary core boot correctly? Did another feature running on the core lock up?
3. If there are several paths to a successful completion of the off-loaded task, is one of them too slow?
4. Is another feature on the secondary core using too many resources at a critical time?
5. Is there a recognizable pattern to the corruption?
6. Is the corrupted data located at a recognizable location different from the non-corrupted data?
7. Is an error in the control flow causing the same data to be sent more than once?
8. Is bad use of the hardware causing it to duplicate data? Is there a defect in the hardware causing data duplication for some corner case or error scenario that's going undetected?
9. Is data being dropped due to a control flow error? Underflow? Overflow?
10. Is hardware dropping data due to misuse by software? Is an error scenario going undetected and triggering a hardware error?
11. Has the flow control suffered corruption due to an error (underflow) or the bad handling of a corner case (counter/pointer wrap-around scenario)?
12. Has the code locked up due to a corner case being mishandled?
13. Does one of the code paths for a successful completion contain an error? Has an error in the system not been handled correctly, causing a lock-up of the core?
14. Is the handling of specific corner cases too slow?
15. Is access to specific data locations much slower than other data locations (aligned vs. unaligned)?
16. Is another feature in the system monopolizing shared resources sporadically?

In Part 1 of this series, we set the groundwork for a common understanding of what an asymmetric multi-core system is and detailed the typical error scenarios that can occur in such a system. In this article, we examined the set of tools available to a developer for debugging an asymmetric multi-core system.

In Part 3, the last in this series, we will investigate how this set of tools is applied to real-world problems in a series of specific examples covering a range of error scenarios.

To read more about multicore issues, go to “More on Multicores and Multiprocessors.”

Julien Carreno is a senior engineer and technical lead within the Digital Enterprise Group at Intel Corp. He is currently the technical lead on a team responsible for delivering VoIP solution software for the next generation of Intel's Embedded Intel Architecture processors. He has worked at Intel for more than three years, specialising in acceleration technology for the Embedded Intel Architecture markets. His areas of expertise are Ethernet, E1/T1 TDM, device drivers, embedded assembler and C development, multi-core application architecture and design.
