Techniques for debugging an asymmetric multi-core application: Part 3 - Embedded.com

In two previous articles we covered, first, what an asymmetric multi-core application is and the typical problems encountered, and, second, what tools are available for debugging in these circumstances. Now that we have a full set of tools at our disposal, it is time to examine some specific examples of issues and determine which debugging techniques will be most relevant to each issue for an effective and timely defect resolution.

In the following sections, we will cover a couple of real problems in different asymmetric multi-core systems and how we can go about solving them. These examples should provide an understanding of how to deal with problems identified in a previous article.

Data drops
In the specific example in Figure 1, below, the device under test is an asymmetric multi-core system providing Ethernet inline acceleration. The device is connected to an Ethernet network. A file system server, which acts as an FTP and NFS server, also resides on the same network.

Figure 1: Data drop scenario setup

The asymmetric multi-core system in the device follows the typical inline acceleration model that was detailed in Part 2. Figure 2, below, is a reminder of the overall architecture described there. A critical point to keep in mind is that the communication between cores is achieved through the use of queues residing in a memory location visible to both cores.
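Such a shared-memory queue can be sketched as follows. This is a minimal single-producer/single-consumer ring, with hypothetical names and a hypothetical depth; a real port would add memory barriers appropriate to the cores' memory models and place the structure in a region visible to both cores.

```c
#include <stdint.h>

/* Hypothetical single-producer/single-consumer descriptor queue in
 * shared memory. Indices only ever advance, so each side can poll the
 * other's index without locking (a real implementation would add
 * memory barriers and map the structure into coherent shared memory). */
#define QUEUE_DEPTH 16u               /* must be a power of two */

struct shared_queue {
    volatile uint32_t head;           /* written by producer (main core)   */
    volatile uint32_t tail;           /* written by consumer (accelerator) */
    uint32_t slots[QUEUE_DEPTH];      /* packet/object descriptors         */
};

static int queue_push(struct shared_queue *q, uint32_t desc)
{
    if (q->head - q->tail == QUEUE_DEPTH)
        return -1;                    /* full: would overflow              */
    q->slots[q->head % QUEUE_DEPTH] = desc;
    q->head++;                        /* publish after the slot is written */
    return 0;
}

static int queue_pop(struct shared_queue *q, uint32_t *desc)
{
    if (q->head == q->tail)
        return -1;                    /* empty: would underflow            */
    *desc = q->slots[q->tail % QUEUE_DEPTH];
    q->tail++;
    return 0;
}
```

Note that the push and pop paths report overflow and underflow explicitly; as the scenarios below show, those return codes are exactly what flow-control profiling relies on.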

In this example, it is assumed that performance testing has already been carried out and that performance has been proven not to be an issue. During testing, normal traffic such as pings and FTP functions correctly, except for NFS. The NFS service is timing out due to a communication breakdown between our device (the NFS client) and the NFS server: the NFS packets are being received on the main core, but the NFS client is not accepting them.

Figure 2: Data drop scenario system architecture

Our investigation begins with inspection of the NFS packet headers, which reveals nothing suspicious: header data is as expected and without errors. However, data location examination reveals that all NFS traffic starts on byte-aligned data pointers, whereas all other traffic uses quad-byte-aligned data pointers. Because the problem is not performance-related, data location examination in this scenario is as simple as printing all data locations used to the debug interface.

Further investigation using internal test code reveals that the secondary acceleration core is fully able to pass byte-aligned data packets. This allows us to rule out the secondary core as the source of the problem.

Further inspection of the entire packet leads to the discovery of data corruption on the very last word of the packet; this causes a checksum error on all NFS packets.

The issue is thus determined to be related to the data caching and invalidation functionality on the main core not handling byte-aligned pointers correctly. In this specific example, data profiling is the primary source of information for root-causing the issue.
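The classic fix for this class of bug is to round any invalidate range out to whole cache lines, so a byte-aligned buffer never leaves a partial trailing line (and thus the last word of a packet) stale. A minimal sketch, assuming a 32-byte line size (the real size is implementation-specific):

```c
#include <stdint.h>

#define CACHE_LINE 32u  /* assumed line size for this sketch */

/* Round an arbitrary buffer out to whole cache lines before issuing an
 * invalidate, so a byte-aligned pointer never leaves the last partial
 * line (and thus the final word of the packet) stale. */
static uintptr_t line_floor(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}

static uintptr_t line_ceil(uintptr_t addr)
{
    return (addr + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
}

/* A full invalidate_range(start, len) would then call the core's
 * line-invalidate primitive for every line in
 * [line_floor(start), line_ceil(start + len)). */
```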

To summarize, in the case of data being dropped unexpectedly (as generally described in a previous article and demonstrated above), the first step is to profile the data and data location. It is important to identify any differences between the packets successfully going through the system and those being dropped.

Dumping the data pointers and the data itself to the single debug interface is the most common way of profiling the data in this case; should the issue not be reproducible due to the performance impact of these dumps, an alternative is to add event counters for each type of data identified.

Comparing these counters to the number of drops will also yield valuable information. If no valuable data is collected, consider refining the number of data types that are being counted through the system.
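Per-type counters of this kind can be sketched as below. The traffic types and helper names are hypothetical, chosen to match the scenario above; the point is that incrementing a counter is cheap enough to leave in the data path when full data dumps would perturb timing and hide the defect.

```c
#include <stdint.h>

/* Hypothetical per-traffic-type event counters for the data path. */
enum pkt_type { PKT_PING, PKT_FTP, PKT_NFS, PKT_OTHER, PKT_TYPE_COUNT };

static uint32_t rx_count[PKT_TYPE_COUNT];    /* packets received per type */
static uint32_t drop_count[PKT_TYPE_COUNT];  /* packets dropped per type  */

static void count_rx(enum pkt_type t)   { rx_count[t]++; }
static void count_drop(enum pkt_type t) { drop_count[t]++; }

/* At debug time, compare drops to receives per type: a type whose drop
 * rate is ~100% (as NFS was in this case study) points at a
 * data-dependent bug rather than random loss. */
static int all_dropped(enum pkt_type t)
{
    return rx_count[t] != 0 && rx_count[t] == drop_count[t];
}
```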

It is also important to rule out early on any issue due to lack of performance. At a system level, we can determine whether a core is proving to be a bottleneck by examining the communication mechanism between cores and checking whether underflows or overflows occur at any given time.

Performance at a core level also needs to be examined to check that data is not being dropped within the core due to lack of performance (e.g. internal queuing issues, thread issues in a multi-threaded environment, etc.).

Hardware errors should also be closely monitored as they could be leading to random data drops. Event counters will be crucial, as any correlation between the number and type of hardware errors (in the case of the example above, CRC errors, packet collisions, etc.) and the number of dropped packets needs to be identified immediately, giving precious insight into the possible root cause of the issue.

Non-responsive secondary core
For this specific scenario, we will take the example of a physics simulation application; for simplicity, we will only consider one aspect of physics simulation: collision detection given a set of objects and their trajectories, shown in Figure 3, below.

In our example we have a secondary core that will allow the main core to offload all the collision detection algorithm calculations. The main core provides the list of objects with positions and trajectories; the secondary core will take care of all the calculations and return a list of intersecting bodies.

For the purpose of this example, the main core will communicate with the secondary core using queues implemented in hardware. The list of objects for each collision calculation will be built in the following way: the main core writes a first object with its position and trajectory into the hardware queue, identifying it to the secondary core as the first of a list.

It will then continue writing each object descriptor into the hardware queue until it reaches the last one, which it marks so that the secondary core knows the list is complete and can start the collision detection calculations.

On top of actually carrying out the collision detection algorithm, the responsibility of the secondary core is to internally build the list of objects by extracting entries from the hardware queue and watching the start-of-list and end-of-list markers in order to finalize the list. We will not go into the details of how many lists the collision detection accelerator can maintain concurrently or the maximum number of objects it can handle per list.
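A descriptor format for this protocol might look like the sketch below. The bit layout and names are invented for illustration; the essential point, which becomes the defect later in this case study, is that both cores must agree on exactly which bits mark the list boundaries.

```c
#include <stdint.h>

/* Hypothetical 32-bit object descriptor layout: both cores must agree
 * on exactly which bits mark list boundaries. */
#define DESC_START_OF_LIST (1u << 31)
#define DESC_END_OF_LIST   (1u << 30)
#define DESC_OBJECT_MASK   0x3FFFFFFFu   /* object id / payload bits */

static uint32_t make_desc(uint32_t object, int first, int last)
{
    uint32_t d = object & DESC_OBJECT_MASK;
    if (first) d |= DESC_START_OF_LIST;
    if (last)  d |= DESC_END_OF_LIST;
    return d;
}

/* The accelerator's list builder polls this predicate on each entry it
 * extracts from the hardware queue. */
static int desc_is_last(uint32_t d)
{
    return (d & DESC_END_OF_LIST) != 0;
}
```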

The collision detection accelerator will be provided with a pointer to a list of colliding bodies through another hardware queue by the main core; it will populate this list during collision detection and return its pointer to the main core through another hardware queue.

Figure 3: Collision detection acceleration

In this scenario, we are able to set up the system, but the collision detection accelerator never returns any completed lists of collided bodies following the submission by the main core of a list of objects. The accelerator is non-responsive from the start.

The first step of our investigation is to profile the flow control, which confirms that both the main and the secondary cores are set up correctly; we achieve this by checking the hardware queue state after writing to it. It is determined that the secondary core is seeing the entry in the queue and is reading it into local memory. The data is then profiled.

In this case, we need to examine the actual queue entries as seen by the secondary core; again, these appear to be correct according to the specification for the main core.

At this stage, we need to examine the state of the secondary core at various stages of a collision detection calculation. This is achieved simply through the use of a set of counters incremented each time the core transitions to a specific state. We identify four key stages: waiting for start of list, building list, calculating, and posting result.
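State counters of this kind might be instrumented as follows (names are hypothetical): bump one counter on each state transition, then dump the array periodically. A core stuck in one stage shows later stages' counters frozen across successive dumps.

```c
#include <stdint.h>

/* Hypothetical stage counters for the accelerator, matching the four
 * key stages identified above. */
enum accel_state {
    ST_WAIT_START,   /* waiting for start of list */
    ST_BUILD_LIST,   /* building list             */
    ST_CALCULATE,    /* calculating               */
    ST_POST_RESULT,  /* posting result            */
    ST_COUNT
};

static uint32_t state_entries[ST_COUNT];  /* times each state was entered */
static enum accel_state current_state = ST_WAIT_START;

/* Called at every state transition in the accelerator's main loop. */
static void enter_state(enum accel_state s)
{
    current_state = s;
    state_entries[s]++;
}
```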

These results show that the accelerator is stuck in the building-list stage. Further data profiling leads to identifying the issue in the object descriptor end-of-list marker; the main core and secondary core have a different understanding of how an object is marked as the last member of a list. This leads the accelerator to never detect the end-of-list marker, so it continues building a single list indefinitely.

Profiling the flow control
Following the resolution of the issue detailed above, we discover a second issue related to the secondary core being unresponsive. In this new scenario, the system stalls after a variable number of collision calculations.

For this new investigation, once again we must start with profiling the flow control. We discover that no errors occurred related to the communication between the cores; we have no overflow or underflow errors.

Following the analysis matrix detailed in Part 2, the next task is to examine the state of the secondary core. Due to the complete unresponsiveness of the core, we are forced at this stage to use any debugging tool available that will provide us access to the secondary core internals (a tool developed internally or one commercially available).

This allows us to examine the program counter and determine that the accelerator is locked up in the collision calculation stage during the building of the list of collided objects.

The addition of a simple counter tracing the current number of collisions listed so far leads to identifying the problem: the number of items that the secondary core is attempting to put into the current list is greater than the maximum supported, and it is locked up on the first item outside the allowed range.
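A guarded insert of the kind that would have surfaced this bug early can be sketched as below. The limit and names are hypothetical: rather than hanging on the first item past the supported maximum, the list builder rejects it and an overrun counter makes the condition visible at debug time.

```c
#include <stdint.h>

#define MAX_COLLISIONS 4u  /* assumed per-list limit for this sketch */

/* Hypothetical guarded insert into the collided-bodies list. */
static uint32_t collision_list[MAX_COLLISIONS];
static uint32_t collision_count;   /* items in the current list */
static uint32_t overrun_events;    /* attempts beyond the limit */

static int list_add(uint32_t body_id)
{
    if (collision_count >= MAX_COLLISIONS) {
        overrun_events++;          /* trace the overrun instead of hanging */
        return -1;
    }
    collision_list[collision_count++] = body_id;
    return 0;
}
```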

As demonstrated above, the approach to solving the problem of a secondary core not responding is similar whether the lock-up occurs from start-up or after an unknown event. However, emphasis is placed in different areas.

The first step of the investigation is always to make sure that the secondary core has been configured with the appropriate data for the communications mechanism and flow control. We need to make sure that both cores are using the same set of resources and that they are using the correct resource for each task.

In the case study above we used hardware queues, so we checked that the main core is writing object descriptors to the queue that the accelerator is reading from. Should we use a memory-based mechanism (such as software queues), then we also need to make sure that the two cores are referring to the same area of memory and that they are referring to it in the same way:

* If both cores are using different addressing schemes (virtual addressing, physical addressing, physical offsets, etc.), then we must verify that address pairs are correct.
* If configuration of the flow control on the secondary core is done with the address of a single memory block that is then split internally, we need to make sure that both cores have the same understanding of how this memory block is structured.
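The address-pair check in the first bullet can be sketched as below. The bases are invented for illustration, and the main core is assumed to use a fixed virtual-to-physical offset (a real system would walk the actual MMU mapping); the accelerator is assumed to address shared RAM physically.

```c
#include <stdint.h>

/* Hypothetical address map for the sketch. */
#define MAIN_CORE_VIRT_BASE 0xC0000000u   /* assumed virtual window  */
#define SHARED_RAM_PHYS     0x10000000u   /* assumed physical base   */

/* Main core's view: a fixed offset between virtual and physical. */
static uint32_t main_virt_to_phys(uint32_t virt)
{
    return virt - MAIN_CORE_VIRT_BASE + SHARED_RAM_PHYS;
}

/* Sanity check that a buffer pointer exchanged between cores resolves
 * to the same physical location on both sides. */
static int addresses_agree(uint32_t main_virt, uint32_t accel_phys)
{
    return main_virt_to_phys(main_virt) == accel_phys;
}
```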

In the second scenario, the flow control functions correctly up to a certain point. We must identify the event that has caused the flow control to break down. First we identify whether the event was error-related or not: did an error occur in the system before the flow control breakdown?

Having counters in place to capture occurrences of every standard error scenario is critical for easy debugging. Should an error have occurred, then error recovery should be investigated closely. If an error did not occur, then investigation of the code paths will be necessary to identify whether a single occurrence of a specific code path was the issue.

The issue could be consistently reproducible after a finite number of entries shared between cores, or reproduced for a specific data event. Examining the amount and type of data will also come into play; out-of-bounds accesses as well as counter rollover issues are relatively common problems that need to be catered for.

In both scenarios, state determination is paramount for root-causing the problem at hand and will be the primary source of information leading to a successful resolution.

Conclusions
Throughout this article series, we have listed typical error scenarios in an asymmetric multi-core application and how to approach identifying the root cause of the problem.

Like any other problem-solving exercise, one must first look at the big picture and then narrow the scope bit by bit until the problem is root-caused; this article series is an attempt at providing some guidelines for how to approach particular error scenarios.

All the techniques focus on the usage of available resources, mostly event counters, and how to use these resources in the appropriate locations in the software for each type of error scenario. The two case studies explored above should provide enough guidance for readers to independently identify the type of error witnessed and which techniques to use in resolving that error.

To read Part 1 in this series, go to Defining an asymmetric multicore application.
To read Part 2 in this series, go to Tools and methodologies for multicore debug.

To read more about multicore issues, go to “More on Multicores and Multiprocessors.”

Julien Carreno is a senior engineer and technical lead within the Digital Enterprise Group at Intel Corp. He is currently the technical lead on a team responsible for delivering VoIP solution software for the next generation of Intel's Embedded Intel Architecture processors.

He has worked at Intel for more than three years, specialising in acceleration technology for the Embedded Intel Architecture markets. His areas of expertise are Ethernet, E1/T1 TDM, device drivers, embedded assembler and C development, multi-core application architecture and design.
