CMP EMBEDDED.COM


Techniques for debugging an asymmetric multi-core application: Part 3
Typical multicore debugging problems and how to solve them



Embedded.com
The two previous articles in this series covered, first, what an asymmetric multi-core application is and the typical problems encountered, and, second, the tools available for debugging in these circumstances. Now that we have a full set of tools at our disposal, it is time to examine some specific issues and determine which debugging techniques are most relevant to each for an effective and timely defect resolution.

In the following sections, we will cover a couple of real problems in different asymmetric multi-core systems and how we can go about solving them. These examples should provide an understanding of how to deal with problems identified in a previous article.

Data drops
In the specific example in Figure 1, below, the device under test is an asymmetric multi-core system providing Ethernet inline acceleration. The device is connected to an Ethernet network. A file system server, which acts as an FTP and NFS server, also resides on the same network.

Figure 1: Data drop scenario setup

The asymmetric multi-core system in the device follows the typical inline acceleration model that was detailed in Part 2. Figure 2, below, is a reminder of the overall architecture described there. A critical point to keep in mind is that the communication between cores is achieved through the use of queues residing in a memory location visible to both cores.

In this example, it is assumed that performance testing has already been carried out and that performance has been proven not to be an issue. During testing, normal traffic such as pings and FTP functions correctly, but NFS does not. The NFS service is timing out due to a communication breakdown between our device (the NFS client) and the NFS server: the NFS packets are being received on the main core but the NFS client is not accepting them.

Figure 2: Data drop scenario system architecture

Our investigation begins with inspection of the NFS packet headers, which reveals nothing suspicious: header data is as expected and without errors. However, examining the data locations reveals that all NFS traffic starts on byte-aligned data pointers, whereas all other traffic uses quad-byte-aligned data pointers. Because the problem is not performance-related, data location examination in this scenario is as simple as printing every data location used to the debug interface.
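The data location check described above can be sketched in C as follows. This is a minimal illustration, not the code used in the investigation: the helper names `ptr_alignment` and `trace_data_location` are hypothetical, and `printf` stands in for whatever debug interface the platform offers.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Largest power-of-two alignment, capped at 4 bytes, that the address satisfies. */
static unsigned ptr_alignment(const void *p)
{
    uintptr_t addr = (uintptr_t)p;

    if ((addr & 3u) == 0) {
        return 4;   /* quad-byte aligned */
    }
    if ((addr & 1u) == 0) {
        return 2;   /* half-word aligned */
    }
    return 1;       /* byte aligned only */
}

/* Hypothetical trace hook: print where a buffer lives and how it is aligned. */
static void trace_data_location(const char *traffic, const void *data, size_t len)
{
    printf("%s: data=%p len=%zu align=%u\n",
           traffic, data, len, ptr_alignment(data));
}
```

Calling the trace hook wherever a packet buffer is handed between layers makes a difference such as "NFS is the only traffic with align=1" visible at a glance in the debug log.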

Further investigation using internal test code reveals that the secondary acceleration core is fully able to pass byte-aligned data packets. This allows us to rule out the secondary core being the source of the problem.

Further inspection of the entire packet leads to the discovery of data corruption on the very last word of the packet; this causes a checksum error on all NFS packets.

The issue is thus determined to lie in the data caching and invalidation functionality on the main core, which does not handle byte-aligned pointers correctly. In this specific example, data profiling is the primary source of information for root-causing the issue.
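The article does not show the faulty cache code, so the following C sketch only illustrates one common way in which cache maintenance can mishandle an unaligned buffer: computing the number of cache lines from the length alone, which misses the final line when the start address is not line-aligned. The line size of 32 bytes and the names `line_range` and `lines_covering` are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u   /* assumed line size; the real value is platform-specific */

/* First line address and number of lines that must be invalidated. */
typedef struct {
    uintptr_t first;
    size_t    lines;
} line_range;

/* Compute every cache line touched by the byte range [addr, addr + len). */
static line_range lines_covering(uintptr_t addr, size_t len)
{
    line_range r = { 0, 0 };
    uintptr_t mask = ~(uintptr_t)(CACHE_LINE - 1u);
    uintptr_t start;
    uintptr_t end;

    if (len == 0) {
        return r;
    }
    start = addr & mask;                             /* round start down */
    end = (addr + len + CACHE_LINE - 1u) & mask;     /* round end up */
    r.first = start;
    r.lines = (size_t)((end - start) / CACHE_LINE);
    return r;
}
```

With these assumptions, a 64-byte buffer starting on a line boundary covers two lines, while the same buffer starting one byte later covers three. A routine that always invalidates `len / CACHE_LINE` lines would leave the tail of the second buffer stale, which is consistent with corruption confined to the end of a packet.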

To summarize, in the case of data being dropped unexpectedly (as generally described in a previous article and demonstrated above), the first step is to profile the data and data location. It is important to identify any differences between the packets successfully going through the system and those being dropped.

Dumping the data pointers and the data itself to the single debug interface is the most common way of profiling the data in this case. Should the issue no longer be reproducible because of the performance impact of these dumps, an alternative is to add event counters for each type of data identified.

Comparing these counters to the number of drops will also yield valuable information. If no valuable data is collected, consider refining the number of data types that are being counted through the system.
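As a rough illustration of per-type event counters, the C sketch below classifies each packet by the alignment of its data pointer and counts how many packets of each class were seen and how many were dropped. The classification by alignment, the type names, and the helper functions are all assumptions chosen to match the example above; a real system would count whichever data types the investigation has identified.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical data classes: here, packets grouped by data pointer alignment. */
enum data_class { CLASS_ALIGN4, CLASS_ALIGN2, CLASS_ALIGN1, CLASS_COUNT };

struct class_counters {
    uint32_t seen[CLASS_COUNT];      /* packets observed per class */
    uint32_t dropped[CLASS_COUNT];   /* packets dropped per class  */
};

static enum data_class classify(const void *data)
{
    uintptr_t addr = (uintptr_t)data;

    if ((addr & 3u) == 0) {
        return CLASS_ALIGN4;
    }
    return ((addr & 1u) == 0) ? CLASS_ALIGN2 : CLASS_ALIGN1;
}

/* Cheap enough to leave enabled when full data dumps disturb the timing. */
static void count_packet(struct class_counters *c, const void *data, bool dropped)
{
    enum data_class cls = classify(data);

    c->seen[cls]++;
    if (dropped) {
        c->dropped[cls]++;
    }
}

/* True when at least one packet of the class was seen and all were dropped. */
static bool fully_dropped(const struct class_counters *c, enum data_class cls)
{
    return c->seen[cls] != 0 && c->dropped[cls] == c->seen[cls];
}
```

If no class stands out, splitting the classes further (by protocol, length range, or source queue, for example) is the counter-based equivalent of refining the data types being counted.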

It is also important to rule out early on any issue due to lack of performance. At a system level, we can determine whether a core is proving to be a bottleneck by examining the communication mechanism between cores and checking whether underflows or overflows occur at any given time.
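One way to make overflows and underflows observable is to count them inside the queue itself. The C sketch below is a single-threaded model of an inter-core queue with such counters; the structure, the depth of 8 entries, and the function names are hypothetical, and a real shared-memory queue would additionally need the memory barriers and cache maintenance appropriate to the platform.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_DEPTH 8u   /* assumed depth; must be a power of two */

struct msg_queue {
    uint32_t entries[QUEUE_DEPTH];
    uint32_t head;         /* free-running count of entries written */
    uint32_t tail;         /* free-running count of entries read    */
    uint32_t overflows;    /* writes attempted while the queue was full */
    uint32_t underflows;   /* reads attempted while the queue was empty */
};

/* Producer side: record an overflow instead of silently losing the entry. */
static bool queue_put(struct msg_queue *q, uint32_t value)
{
    if (q->head - q->tail == QUEUE_DEPTH) {
        q->overflows++;
        return false;
    }
    q->entries[q->head % QUEUE_DEPTH] = value;
    q->head++;
    return true;
}

/* Consumer side: record an underflow when there is nothing to read. */
static bool queue_get(struct msg_queue *q, uint32_t *value)
{
    if (q->head == q->tail) {
        q->underflows++;
        return false;
    }
    *value = q->entries[q->tail % QUEUE_DEPTH];
    q->tail++;
    return true;
}
```

A steadily rising overflow count points at a consumer that cannot keep up, whereas a queue that never overflows lets us rule out that core as the bottleneck.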

Performance at a core level also needs to be examined to check that data is not being dropped within the core due to lack of performance (e.g. internal queuing issues, thread issues in a multi-threaded environment, etc.).

Hardware errors should also be closely monitored as they could be leading to random data drops. Event counters will be crucial as any correlation between the number and type of hardware errors (in the case of the example above, CRC errors, packet collision, etc.) and the number of dropped packets needs to be identified immediately, thus giving precious insight into the possible root cause of the issue.

Non-responsive secondary core
For this specific scenario, we will take the example of a physics simulation application; for simplicity, we will only consider one aspect of physics simulation, and that is collision detection given a set of objects and their trajectories, shown in Figure 3, below.

In our example we have a secondary core that will allow the main core to offload all the collision detection algorithm calculations. The main core provides the list of objects with positions and trajectories; the secondary core will take care of all the calculations and return a list of intersecting bodies.

For the purpose of this example, the main core will communicate with the secondary core using queues implemented in hardware. The list of objects for each collision calculation will be built in the following way: the main core writes a first object with its position and trajectory into the hardware queue, identifying it to the secondary core as the first of a list.

It will then continue writing each object descriptor into the hardware queue until it reaches the last one, which it will then mark for the secondary core so it knows the list is complete and the secondary core can start the collision detection calculations.

On top of actually carrying out the collision detection algorithm, the responsibility of the secondary core is to internally build the list of objects by extracting entries from the hardware queue and watching the start of list and end of list markers in order to finalize the list. We will not go into the details of how many lists the collision detection accelerator can maintain concurrently or the maximum number of objects it can handle per list.
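The list-building protocol described above can be sketched in C as follows. The descriptor layout, the flag values, the limit of 16 objects, and the function name are all assumptions made for illustration; the article does not specify the real descriptor format.

```c
#include <stdbool.h>
#include <stdint.h>

#define DESC_FLAG_FIRST 0x1u   /* hypothetical start-of-list marker */
#define DESC_FLAG_LAST  0x2u   /* hypothetical end-of-list marker   */
#define MAX_OBJECTS     16u    /* assumed per-list capacity         */

struct object_desc {
    uint32_t flags;
    float    position[3];
    float    trajectory[3];
};

struct list_builder {
    struct object_desc objects[MAX_OBJECTS];
    uint32_t count;       /* objects stored in the current list */
    uint32_t discarded;   /* objects that did not fit           */
    bool     building;    /* true between FIRST and LAST        */
};

/* Consume one queue entry; returns true when a list has just been completed. */
static bool builder_consume(struct list_builder *b, const struct object_desc *d)
{
    if (d->flags & DESC_FLAG_FIRST) {
        b->count = 0;
        b->building = true;
    }
    if (!b->building) {
        return false;             /* entry outside any list: ignore it */
    }
    if (b->count < MAX_OBJECTS) {
        b->objects[b->count++] = *d;
    } else {
        b->discarded++;           /* keep a record rather than overrun */
    }
    if (d->flags & DESC_FLAG_LAST) {
        b->building = false;
        return true;              /* list complete: calculation can start */
    }
    return false;
}
```

Both cores depend on the same definition of the two markers. Keeping those definitions in a single header shared by both builds is one way to avoid the cores drifting apart.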

The collision detection accelerator will be provided a pointer to a list of colliding bodies through another hardware queue by the main core; it will populate this list during collision detection and return its pointer to the main core through another hardware queue.

Figure 3: Collision detection acceleration

In this scenario, we are able to set up the system, but the collision detection accelerator is never returning any completed lists of collided bodies following the submission by the main core of a list of objects. The accelerator is non-responsive from the start.

The first step of our investigation is to profile the flow control, which confirms that both the main and the secondary cores are set up correctly; we achieve this by checking the hardware queue state after writing to it. It is determined that the secondary core is seeing the entry in the queue and is reading it into local memory. The data is then profiled.

In this case, we need to examine the actual queue entries as seen by the secondary core. Again, these appear to be correct according to the specification for the main core.

At this stage, we need to examine the state of the secondary core at various stages of a collision detection calculation. This is achieved simply through the use of a set of counters incremented each time the core transitions to a specific state. We identify four key stages: waiting for start of list, building list, calculating, and posting result.
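A minimal C sketch of such state counters is shown below, assuming the four stages named above. The type and function names are hypothetical, and on a real accelerator the counters would live wherever the main core or a debugger can read them.

```c
#include <stdint.h>

/* The four key stages of one collision detection calculation. */
enum accel_state {
    ST_WAIT_START,    /* waiting for start of list */
    ST_BUILDING,      /* building list             */
    ST_CALCULATING,   /* calculating               */
    ST_POSTING,       /* posting result            */
    ST_COUNT
};

struct state_trace {
    uint32_t entered[ST_COUNT];   /* times each state has been entered */
    enum accel_state current;
};

/* Call on every state transition. */
static void enter_state(struct state_trace *t, enum accel_state s)
{
    t->entered[s]++;
    t->current = s;
}

/* Furthest stage reached so far; useful for a lock-up that occurs from start-up. */
static enum accel_state furthest_state(const struct state_trace *t)
{
    for (int s = ST_COUNT - 1; s > 0; s--) {
        if (t->entered[s] != 0) {
            return (enum accel_state)s;
        }
    }
    return ST_WAIT_START;
}
```

A trace in which the building counter is non-zero while the calculating counter remains at zero indicates that the core entered the list-building stage and never left it.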

These results show that the accelerator is stuck in the building list stage. Further data profiling leads to identifying the issue in the object descriptor end of list marker; the main core and secondary core have a different understanding of how an object is marked as the last member of a list. This is leading the accelerator to never detect the end of list marker and so it continues building a single list indefinitely.

Profiling the flow control
Following the resolution of the issue detailed above, we discover a second issue related to the secondary core being unresponsive. In this new scenario, the system stalls after a variable number of collision calculations.

For this new investigation, once again we must start with profiling the flow control. We discover that no errors occurred related to the communication between the cores; we have no overflow or underflow errors.

Following the analysis matrix detailed in Part 2, the next task is to examine the state of the secondary core. Due to the complete unresponsiveness of the core, we are forced at this stage to use any debugging tool available that will provide us access to the secondary core internals (a tool developed internally or one commercially available).

This allows us to examine the program counter and determine that the accelerator is locked up in the collision calculation stage during the building of the list of collided objects.

The addition of a simple counter tracing the current number of collisions listed so far leads to identifying that the number of items that the secondary core is attempting to put into the current list is greater than the maximum supported and it is locked-up on the first item outside of the allowed range.
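The kind of counter and bounds check involved can be sketched in C as below. The capacity of 4 entries and all names are assumptions made for the example; the point is that the number of attempted insertions is counted separately from the number stored, so an attempt to exceed the maximum shows up in the counters instead of locking up the core.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_COLLISIONS 4u   /* assumed maximum supported list length */

struct collision {
    uint16_t body_a;
    uint16_t body_b;
};

struct collision_list {
    struct collision items[MAX_COLLISIONS];
    uint32_t count;       /* collisions stored                      */
    uint32_t attempted;   /* debug counter: every insertion attempt */
    uint32_t rejected;    /* debug counter: attempts beyond the max */
};

/* Add a colliding pair; refuses, and counts, anything beyond the maximum. */
static bool collision_list_add(struct collision_list *l, uint16_t a, uint16_t b)
{
    l->attempted++;
    if (l->count >= MAX_COLLISIONS) {
        l->rejected++;
        return false;
    }
    l->items[l->count].body_a = a;
    l->items[l->count].body_b = b;
    l->count++;
    return true;
}
```

Comparing `attempted` with `MAX_COLLISIONS` after a stall shows directly whether the core was trying to write the first item outside the allowed range.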

As demonstrated above, the approach to solving the problem of a secondary core not responding is similar whether the lock-up occurs from start-up or after an unknown event. However, emphasis is placed in different areas.

The first step of the investigation is always to make sure that the secondary core has been configured with the appropriate data for the communications mechanism and flow control. We need to make sure that both cores are using the same set of resources and that they are using the correct resource for each task.

In the case study above we used hardware queues, so we checked that the main core is writing object descriptors to the queue that the accelerator is reading from. Should we use a memory-based mechanism (such as software queues), then we also need to make sure that the two cores are referring to the same area of memory and that they are referring to it in the same way:

* If both cores are using different addressing schemes (virtual addressing, physical addressing, physical offsets, etc.) then we must verify that address pairs are correct.
* If configuration of the flow control on the secondary core is done with the address of a single memory block and then split internally, we need to make sure that both cores have the same understanding of how this memory block is structured.
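Both checks in the list above can be supported in code. The C sketch below, which assumes a C11 compiler, pins down a hypothetical layout for the shared memory block with compile-time assertions and converts between a core's own addresses and offsets into the block, so that only offsets are exchanged between cores. The structure fields and function names are illustrative assumptions, not a description of any particular device.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header at the start of the shared block, built into both cores. */
struct shared_block {
    uint32_t magic;             /* lets each core verify it sees the same block */
    uint32_t layout_version;
    uint32_t cmd_ring_offset;   /* offsets are relative to the block start */
    uint32_t cmd_ring_entries;
    uint32_t rsp_ring_offset;
    uint32_t rsp_ring_entries;
};

/* Fail the build if a compiler lays the header out differently. */
_Static_assert(sizeof(struct shared_block) == 24, "unexpected header size");
_Static_assert(offsetof(struct shared_block, rsp_ring_offset) == 16,
               "unexpected field offset");

/* Convert a local address into a block offset; returns -1 if out of range. */
static int virt_to_offset(const void *base, size_t size, const void *p,
                          uint32_t *offset)
{
    uintptr_t b = (uintptr_t)base;
    uintptr_t a = (uintptr_t)p;

    if (a < b || a - b >= size) {
        return -1;
    }
    *offset = (uint32_t)(a - b);
    return 0;
}

/* Convert a block offset back into a local address; NULL if out of range. */
static void *offset_to_virt(void *base, size_t size, uint32_t offset)
{
    return (offset < size) ? (void *)((unsigned char *)base + offset) : NULL;
}
```

Exchanging offsets rather than raw pointers means each core applies its own base address, whether virtual or physical, and a range check on every conversion catches a mismatched address pair early.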

In the second scenario, the flow control functions correctly up to a certain point. We must identify the event that has caused the flow control to break down. First we identify whether the event was error-related or not: did an error occur in the system before the flow control broke down?

Having counters in place to capture occurrences of every standard error scenario is critical for easy debugging. Should an error have occurred, then error recovery should be investigated closely. If an error did not occur, then investigation of the code paths will be necessary to identify whether a single occurrence of a specific code path was the issue.

The issue could be consistently reproducible after a finite number of entries shared between cores, or reproduced for a specific data event. Examining the amount and type of data will also come into play: out-of-bounds accesses and counter rollovers are relatively common issues that need to be catered for.
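Counter rollover in particular is easy to guard against. The C sketch below shows the usual wraparound-safe arithmetic for free-running 32-bit counters such as producer and consumer indices; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Entries written but not yet read. Unsigned subtraction is performed
 * modulo 2^32, so the result stays correct after `produced` wraps.
 */
static uint32_t entries_pending(uint32_t produced, uint32_t consumed)
{
    return produced - consumed;
}

/*
 * True if sequence number a comes before b, provided the two are less
 * than 2^31 apart. The cast relies on two's complement conversion,
 * which is what mainstream compilers implement.
 */
static bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```

A direct comparison such as `produced > consumed` gives the wrong answer as soon as `produced` wraps to a small value, which is exactly the kind of defect that only appears after a large and seemingly variable number of entries.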

In both scenarios, state determination is paramount for root-causing the problem at hand and will be the primary source of information leading to a successful resolution.

Conclusions
Throughout this article series, we have listed typical error scenarios in an asymmetric multi-core application and how to approach identifying the root-cause of the problem.

Like any other problem solving exercise, one must first look at the big picture and then narrow the scope bit by bit until the problem is root-caused; this article series is an attempt at providing some guidelines for how to approach particular error scenarios.

All the techniques focus on the usage of available resources, mostly event counters, and how to use these resources in the appropriate locations in the software for each type of error scenario. The two case studies explored above should provide enough guidance for people to independently identify the type of error witnessed and what techniques to use in resolving that error.

To read Part 1 in this series, go to Defining an asymmetric multicore application.
To read Part 2 in this series, go to Tools and methodologies for multicore debug.

To read more about multicore issues, go to "More on Multicores and Multiprocessors."

Julien Carreno is a senior engineer and technical lead within the Digital Enterprise Group at Intel Corp. He is currently the technical lead on a team responsible for delivering VoIP solution software for the next generation of Intel's Embedded Intel Architecture processors.

He has worked at Intel for more than three years, specialising in acceleration technology for the Embedded Intel Architecture markets. His areas of expertise are Ethernet, E1/T1 TDM, device drivers, embedded assembler and C development, multi-core application architecture and design.
