By Julien Carreno, Intel Corp.
In two previous articles we covered, first, what an asymmetric
multi-core application is and the typical problems encountered and,
second, what tools are available for debugging in these circumstances.
Now that we have a full set of tools at our disposal, it is time to
examine some specific examples of issues and determine which debugging
techniques will be most relevant to each issue for an effective and
timely defect resolution.
In the following sections, we will cover a couple of real problems
in different asymmetric multi-core systems and how we can go about
solving them. These examples should provide an understanding of how to
deal with problems identified in a previous article.
Data drops
In the specific example in Figure 1,
below, the device under test is an asymmetric multi-core system
providing Ethernet inline acceleration. The device is connected to an
Ethernet network. A file system server, which acts as an FTP and NFS
server, also resides on the same network.
Figure 1: Data drop scenario setup
The asymmetric multi-core system in the device follows the typical
inline acceleration model that was detailed in Part 2. Figure 2, below,
is a reminder of the overall architecture described there. A critical
point to keep in mind is that communication between cores is achieved
through queues residing in a memory location visible to both cores.
In this example, it is assumed that performance testing has already
been carried out and that performance has been shown not to be an issue.
During testing, normal traffic such as pings and FTP functions
correctly, but NFS does not. The NFS service is timing out due to a
communication breakdown between our device (the NFS client) and the NFS
server: the NFS packets are being received on the main core but the NFS
client is not accepting them.
Figure 2: Data drop scenario system architecture
Our investigation begins with inspection of the NFS packet headers,
which reveals nothing suspicious: header data is as expected and
without errors. However, examining the data locations reveals that all
NFS traffic starts on byte-aligned data pointers, whereas all other
traffic uses quad-byte-aligned data pointers. Because the problem is not
performance-related, data location examination in this scenario is as
simple as printing all data locations used to the debug interface.
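A minimal sketch of this kind of data location dump is shown here. The function names, the log format and the cap at 8-byte alignment are assumptions made for illustration; they are not taken from the system described in this article.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Largest power-of-two alignment (capped at 8) that the address satisfies. */
static unsigned buf_alignment(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    unsigned align = 1;

    while (align < 8 && (addr & align) == 0)
        align <<= 1;
    return align;
}

/* Hypothetical debug hook: print where each packet's payload starts. */
static void log_packet_location(const char *proto, const void *data, size_t len)
{
    printf("%-4s data=%p len=%lu align=%u\n",
           proto, (void *)data, (unsigned long)len, buf_alignment(data));
}
```

Calling such a hook for every packet type makes a difference like "NFS payloads report align=1, everything else reports align=4 or more" visible at a glance.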
Further investigation using internal test code reveals that the
secondary acceleration core is fully able to pass byte-aligned data
packets. This allows us to rule out the secondary core being the source
of the problem.
Further inspection of the entire packet leads to the discovery of
data corruption on the very last word of the packet; this causes a
checksum error on all NFS packets.
The issue is thus determined to be that the data caching and
invalidation functionality on the main core does not handle byte-aligned
pointers correctly. In this specific example, data profiling is the
primary source of information for root-causing the issue.
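The faulty code is not shown in this article, so the following is only an illustration of one common way this class of bug can arise: computing the number of cache lines to invalidate from the buffer length alone, ignoring the start offset. The 32-byte line size and both function names are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u /* assumed line size; check the target's manual */

/* Number of cache lines touched by the byte range [addr, addr + len). */
static size_t lines_covering(uintptr_t addr, size_t len)
{
    uintptr_t mask = ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t first, last;

    if (len == 0)
        return 0;
    first = addr & mask;
    last = (addr + len - 1) & mask;
    return (size_t)((last - first) / CACHE_LINE) + 1;
}

/* Length-only calculation: correct only when addr is line-aligned. */
static size_t lines_from_length_only(size_t len)
{
    return (len + CACHE_LINE - 1) / CACHE_LINE;
}
```

For a 64-byte buffer starting one byte past a line boundary, the first function reports three lines and the second only two. The line that is missed holds the tail of the buffer, which would be consistent with corruption confined to the end of a packet.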
To summarize, in the case of data being dropped unexpectedly (as
generally described in a previous article and demonstrated above), the
first step is to profile the data and data location. It is important to
identify any differences between the packets successfully going through
the system and those being dropped.
Dumping the data pointers and the data itself to the single debug
interface is the most common way of profiling the data in this case.
Should the issue not be reproducible due to the performance impact of
these dumps, an alternative is to add event counters for each type of
data identified.
Comparing these counters to the number of drops will also yield
valuable information. If no valuable data is collected, consider
refining the number of data types that are being counted through the
system.
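As a rough sketch of such per-type event counters, the code below keeps one received counter and one dropped counter per traffic type. The set of types and all names are assumptions chosen to match the example above.

```c
#include <stdint.h>

/* Hypothetical traffic classes; refine these if the counters show nothing. */
enum pkt_type { PKT_ICMP, PKT_FTP, PKT_NFS, PKT_OTHER, PKT_TYPE_COUNT };

struct pkt_counters {
    uint32_t received[PKT_TYPE_COUNT];
    uint32_t dropped[PKT_TYPE_COUNT];
};

static void count_rx(struct pkt_counters *c, enum pkt_type t)
{
    c->received[t]++;
}

static void count_drop(struct pkt_counters *c, enum pkt_type t)
{
    c->dropped[t]++;
}

/* Type with the most drops, or PKT_TYPE_COUNT if nothing was dropped. */
static enum pkt_type most_dropped(const struct pkt_counters *c)
{
    enum pkt_type worst = PKT_TYPE_COUNT;
    uint32_t max = 0;
    int t;

    for (t = 0; t < PKT_TYPE_COUNT; t++) {
        if (c->dropped[t] > max) {
            max = c->dropped[t];
            worst = (enum pkt_type)t;
        }
    }
    return worst;
}
```

Incrementing a counter is far cheaper than printing, so this approach is less likely to disturb timing than a full data dump.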
It is also important to rule out early on any issue due to lack of
performance. At a system level, we can determine if a core is proving
to be a bottleneck by examining the communication mechanism between
cores and checking whether underflows or overflows occur at any given
time.
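The sketch below shows one way to instrument an inter-core queue with overflow and underflow counters. It is a single-threaded illustration with assumed names and an assumed depth; a real queue shared between cores would also need the memory barriers or hardware support appropriate to the platform.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_DEPTH 4u /* assumed depth; must be a power of two */

struct core_queue {
    uint32_t entries[QUEUE_DEPTH];
    uint32_t head;       /* total entries written */
    uint32_t tail;       /* total entries read */
    uint32_t overflows;  /* writes attempted while full */
    uint32_t underflows; /* reads attempted while empty */
};

static bool queue_put(struct core_queue *q, uint32_t value)
{
    if (q->head - q->tail == QUEUE_DEPTH) {
        q->overflows++;
        return false;
    }
    q->entries[q->head % QUEUE_DEPTH] = value;
    q->head++;
    return true;
}

static bool queue_get(struct core_queue *q, uint32_t *value)
{
    if (q->head == q->tail) {
        q->underflows++;
        return false;
    }
    *value = q->entries[q->tail % QUEUE_DEPTH];
    q->tail++;
    return true;
}
```

A steadily rising overflow count suggests the consumer cannot keep up, while a queue that is always empty points at the producer.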
Performance at a core level also needs to be examined to check that
data is not being dropped within the core due to lack of performance
(e.g. internal queuing issues, thread issues in a multi-threaded
environment, etc.).
Hardware errors should also be closely monitored as they could be
leading to random data drops. Event counters will be crucial as any
correlation between the number and type of hardware errors (in the case
of the example above, CRC errors, packet collision, etc.) and the
number of dropped packets needs to be identified immediately, thus
giving precious insight into the possible root cause of the issue.
Non-responsive secondary core
For this specific scenario, we will take the example of a physics
simulation application; for simplicity, we will only consider one
aspect of physics simulation, and that is collision detection given a
set of objects and their trajectories, shown in Figure 3, below.
In our example we have a secondary core that will allow the main
core to offload all the collision detection algorithm calculations. The
main core provides the list of objects with positions and trajectories;
the secondary core will take care of all the calculations and return a
list of intersecting bodies.
For the purpose of this example, the main core will communicate with
the secondary core using queues implemented in hardware. The list of
objects for each collision calculation will be built in the following
way: the main core writes a first object with its position and
trajectory into the hardware queue, identifying it to the secondary
core as the first of a list.
It will then continue writing each object descriptor into the
hardware queue until it reaches the last one, which it will then mark
for the secondary core so it knows the list is complete and the
secondary core can start the collision detection calculations.
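To make the list-building scheme concrete, here is a hedged sketch of what an object descriptor with start and end markers could look like. The field layout, the flag values and the function name are all hypothetical; the real descriptor format is defined by the hardware specification.

```c
#include <stddef.h>
#include <stdint.h>

#define DESC_FLAG_FIRST 0x1u /* assumed bit: first object of a list */
#define DESC_FLAG_LAST  0x2u /* assumed bit: last object of a list */

struct obj_desc {
    float position[3];
    float trajectory[3];
    uint32_t flags;
};

/* Main-core side: mark the first and last descriptors before queuing them. */
static void mark_list(struct obj_desc *list, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        list[i].flags = 0;
    if (count == 0)
        return;
    list[0].flags |= DESC_FLAG_FIRST;
    list[count - 1].flags |= DESC_FLAG_LAST;
}
```

A single-object list carries both markers on the same descriptor, which is an edge case worth checking on both cores.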
On top of actually carrying out the collision detection algorithm,
the responsibility of the secondary core is to internally build the
list of objects by extracting entries from the hardware queue and
watching the start of list and end of list markers in order to finalize
the list. We will not go into the details of how many lists the
collision detection accelerator can maintain concurrently or the
maximum number of objects it can handle per list.
The collision detection accelerator will be provided a pointer to a
list of colliding bodies through another hardware queue by the main
core; it will populate this list during collision detection and return
its pointer to the main core through another hardware queue.
Figure 3: Collision detection acceleration
In this scenario, we are able to set up the system, but the
collision detection accelerator is never returning any completed lists
of collided bodies following the submission by the main core of a list
of objects. The accelerator is non-responsive from the start.
The first step of our investigation is to profile the flow control,
which confirms that both the main and the secondary cores are set up
correctly; we achieve this by checking the hardware queue state after
writing to it. It is determined that the secondary core is seeing the
entry in the queue and is reading it into local memory. The data is
then profiled.
In this case, we need to examine the actual queue entries as seen by
the secondary core. Again, these appear to be correct according to the
specification for the main core.
At this stage, we need to examine the state of the secondary core at
various stages of a collision detection calculation. This is achieved
simply through the use of a set of counters incremented each time the
core transitions to a specific state. We identify four key stages:
waiting for start of list, building list, calculating, and posting
result.
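A minimal sketch of such state counters follows. The state names mirror the four stages listed above; the structure, the function names and the assumption that the core always cycles through the stages in order are illustrative only.

```c
#include <stdint.h>

enum accel_state {
    ST_WAIT_START,  /* waiting for start of list */
    ST_BUILD_LIST,  /* building list */
    ST_CALCULATE,   /* calculating */
    ST_POST_RESULT, /* posting result */
    ST_COUNT
};

struct accel_trace {
    uint32_t entered[ST_COUNT]; /* times each state was entered */
};

/* Called by the secondary core on every state transition. */
static void accel_enter(struct accel_trace *t, enum accel_state s)
{
    t->entered[s]++;
}

/* Read on the main core: the first state entered more often than its
 * successor is where the core currently sits (ignores counter rollover). */
static enum accel_state last_state_entered(const struct accel_trace *t)
{
    int s;

    for (s = 0; s < ST_COUNT - 1; s++) {
        if (t->entered[s] > t->entered[s + 1])
            return (enum accel_state)s;
    }
    return ST_POST_RESULT;
}
```

With counters of 1, 1, 0, 0 the helper reports the building list stage, which is the pattern that would be expected for a core that never sees the end of a list.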
These results show that the accelerator is stuck in the building
list stage. Further data profiling leads to identifying the issue in
the object descriptor end of list marker; the main core and secondary
core have a different understanding of how an object is marked as the
last member of a list. This is leading the accelerator to never detect
the end of list marker and so it continues building a single list
indefinitely.
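The effect of such a mismatch can be sketched as follows. The flag values are hypothetical: the producer is assumed to set bit 1 as its end of list marker while the consumer is passed the mask it believes marks the end.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAIN_CORE_LAST_FLAG 0x2u /* bit the main core sets (assumed) */

/* Secondary-core view: scan the queued descriptor flags and report
 * whether an end-of-list marker was ever recognized. */
static bool list_completes(const uint32_t *flags, size_t count,
                           uint32_t consumer_last_mask)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (flags[i] & consumer_last_mask)
            return true;
    }
    return false; /* still "building list" after every entry */
}
```

When both sides agree on the mask the list completes; when the consumer looks for a different bit, it never leaves the building list stage. Sharing one definition of the descriptor format between both code bases is one way to guard against this.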
Profiling the flow control
Following the resolution of the issue detailed above, we discover a
second issue related to the secondary core being unresponsive. In this
new scenario, the system stalls after a variable number of collision
calculations.
For this new investigation, once again we must start with profiling
the flow control. We discover that no errors occurred related to the
communication between the cores; we have no overflow or underflow
errors.
Following the analysis matrix detailed in Part 2, the next task is
to examine the state of the secondary core. Due to the complete
unresponsiveness of the core, we are forced at this stage to use any
debugging tool available that will provide us access to the secondary
core internals (a tool developed internally or one commercially
available).
This allows us to examine the program counter and determine that the
accelerator is locked up in the collision calculation stage during the
building of the list of collided objects.
The addition of a simple counter tracing the current number of
collisions listed so far reveals that the number of items the secondary
core is attempting to put into the current list is greater than the
maximum supported, and that the core is locked up on the first item
outside of the allowed range.
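One defensive pattern for this situation is sketched below: the append routine checks the capacity, counts every attempt and every rejection, and returns an error instead of proceeding past the end of the list. The limit of four entries and all names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_COLLISIONS 4u /* assumed maximum entries per result list */

struct collision {
    uint16_t body_a;
    uint16_t body_b;
};

struct collision_list {
    struct collision items[MAX_COLLISIONS];
    uint32_t count;     /* entries stored */
    uint32_t attempted; /* debug counter: every append attempt */
    uint32_t rejected;  /* debug counter: appends refused when full */
};

static bool collision_append(struct collision_list *l,
                             uint16_t body_a, uint16_t body_b)
{
    l->attempted++;
    if (l->count >= MAX_COLLISIONS) {
        l->rejected++;
        return false; /* caller decides how to report the overflow */
    }
    l->items[l->count].body_a = body_a;
    l->items[l->count].body_b = body_b;
    l->count++;
    return true;
}
```

Comparing the attempted counter against the supported maximum gives the same information that the tracing counter provided in the investigation above.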
As demonstrated above, the approach to solving the problem of a
secondary core not responding is similar whether the lock-up occurs
from start-up or after an unknown event. However, emphasis is placed in
different areas.
The first step of the investigation is always to make sure that the
secondary core has been configured with the appropriate data for the
communications mechanism and flow control. We need to make sure that
both cores are using the same set of resources and that they are using
the correct resource for each task.
In the case study above we used hardware queues, so we checked that
the main core is writing object descriptors to the queue that the
accelerator is reading from. Should we use a memory-based mechanism
(such as software queues) then we also need to make sure that the two
cores are referring to the same area of memory and that they are
referring to it in the same way:
* If both cores are using different addressing schemes (virtual
addressing, physical addressing, physical offsets, etc.) then we must
verify that address pairs are correct.
* If configuration of the flow control on the secondary core is done
with the address of a single memory block and then split internally, we
need to make sure that both cores have the same understanding of how
this memory block is structured.
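Both checks in the list above can be sketched in a few lines. The shared block header, its magic and version values, the linear virtual-to-physical mapping and every name here are assumptions; a real system would take these from its own memory map and interface specification.

```c
#include <stdint.h>

#define SHARED_MAGIC   0x41434d43u /* arbitrary illustrative value */
#define SHARED_VERSION 1u

/* Header at the start of the block, stored as offsets so that both
 * cores can interpret it regardless of their addressing scheme. */
struct shared_block {
    uint32_t magic;
    uint32_t version;
    uint32_t tx_ring_off;  /* byte offset of main-to-secondary ring */
    uint32_t rx_ring_off;  /* byte offset of secondary-to-main ring */
    uint32_t ring_entries; /* 32-bit entries per ring */
};

/* Returns 0 if the layout is self-consistent, a negative code otherwise. */
static int shared_block_check(const struct shared_block *b, uint32_t block_size)
{
    uint32_t ring_bytes;

    if (b->magic != SHARED_MAGIC)
        return -1;
    if (b->version != SHARED_VERSION)
        return -2;
    ring_bytes = b->ring_entries * (uint32_t)sizeof(uint32_t);
    if (b->tx_ring_off < sizeof *b || b->tx_ring_off + ring_bytes > block_size)
        return -3;
    if (b->rx_ring_off < b->tx_ring_off + ring_bytes ||
        b->rx_ring_off + ring_bytes > block_size)
        return -4;
    return 0;
}

/* True if a virtual/physical pair agrees with a linear mapping. */
static int addr_pair_matches(uintptr_t virt, uintptr_t phys,
                             uintptr_t virt_base, uintptr_t phys_base)
{
    return (virt - virt_base) == (phys - phys_base);
}
```

Running the same check on both cores at start-up, and logging the result, turns a silent disagreement about the memory layout into an explicit error.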
In the second scenario, the flow control functions correctly up to a
certain point. We must identify the event that has caused the flow
control to break down. First we identify if the event was error-related
or not: did an error occur in the system before the flow control broke
down?
Having counters in place to capture occurrences of every standard
error scenario is critical for easy debugging. Should an error have
occurred, then error recovery should be investigated closely. If an
error did not occur, then investigation of the code paths will be
necessary to identify whether a single occurrence of a specific code
path was the issue.
The issue could be consistently reproducible after a finite number
of entries shared between cores, or reproduced for a specific data
event. Examining the amount of data and the type of data will also come
into play: out-of-bounds accesses and counter rollover are relatively
common issues that need to be catered for.
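Counter rollover in particular can be handled with unsigned arithmetic, as in this small sketch. The function names are hypothetical, and the ordering helper assumes the two values are less than 2^31 apart and that the target uses two's complement conversion, which is the case on common toolchains.

```c
#include <stdint.h>

/* Events between two samples of a free-running 32-bit counter.
 * Unsigned subtraction wraps modulo 2^32, so one rollover is harmless. */
static uint32_t counter_delta(uint32_t earlier, uint32_t later)
{
    return later - earlier;
}

/* Wrap-safe "a comes before b" for sequence numbers or ring indices. */
static int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}
```

A direct comparison such as `later > earlier` fails as soon as the counter wraps, which can produce a stall that appears only after a large and seemingly variable number of operations.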
In both scenarios, state determination is paramount for root-causing
the problem at hand and will be the primary source of information
leading to a successful resolution.
Conclusions
Throughout this article series, we have listed typical error scenarios
in an asymmetric multi-core application and how to approach identifying
the root-cause of the problem.
Like any other problem solving exercise, one must first look at the
big picture and then narrow the scope bit by bit until the problem is
root-caused; this article series is an attempt at providing some
guidelines for how to approach particular error scenarios.
All the techniques focus on the usage of available resources, mostly
event counters, and how to use these resources in the appropriate
locations in the software for each type of error scenario. The two case
studies explored above should provide enough guidance for people to
independently identify the type of error witnessed and what techniques
to use in resolving that error.
To read Part 1 in this series, go to Defining an asymmetric multicore application.
To read Part 2 in this series, go to Tools and methodologies for multicore debug.
Julien Carreno is a senior
engineer and technical lead within the Digital Enterprise Group at Intel Corp. He is currently the
technical lead on a team responsible for delivering VoIP solution
software for the next generation of Intel's Embedded Intel Architecture
processors.
He has worked at Intel for
more than three years, specialising in acceleration technology for the
Embedded Intel Architecture markets. His areas of expertise are
Ethernet, E1/T1 TDM, device drivers, embedded assembler and C
development, multi-core application architecture and design.