Techniques for debugging an asymmetric multi-core application: Part 1

In a project life cycle, there always comes a time when one must debug a software issue found during testing. Development teams are always looking to proactively find potential defects as early as possible in the development cycle.

Unfortunately, those methods are not 100 percent foolproof, and there will always be a few issues that only appear when testing the full system. This leads to teams reactively debugging issues as they appear.

Because of this, development teams always look for ways to improve their debugging techniques for faster, more effective debugging, as well as better performance characterization and bottleneck identification in real-time applications. Issues related to this subject increase dramatically in the context of a multi-core application where data is being passed from one core to another.

Complexity increases even further when dealing with an asymmetric multi-core scenario where only a single debugging interface may be present for the entire system. In this series, the aim is to provide a clear understanding of typical issues that can occur in an asymmetric multi-core application and to provide a set of tools for effectively debugging these issues.

In order to provide the appropriate level of detail, this topic will be covered in a series of three articles. In this first article, we will establish a common understanding of what an asymmetric multi-core application is and of the typical problems that can be encountered in such a system.

Typical system scenarios
To debug an asymmetric multi-core application effectively, the developer must first clearly understand the system. The purpose of this section is to set a common understanding of what an asymmetric multi-core system is.

For the purpose of this article, this section will cover the specific example of a highly integrated System on a Chip (SoC) with an accelerated Ethernet interface and cryptography acceleration. In this case, we will assume a network routing application with packet encryption enabled.

This type of application can be split into two parts: the network input/output and the packet encryption/decryption. Figure 1 below shows how the two parts of the application are accelerated.

Figure 1: Hardware overview

For the purpose of this article, let's assume the main core is based on a typical Intel architecture. This assumption does not have any major implications for the issues and techniques detailed later on.

The secondary cores, cores 2 and 3, are purpose-built acceleration engines with hardware functionality targeted at a specific type of application. In this case, we have two versions of the acceleration engine, as described below.

Scenario #1. Core 2 is a secondary core whose sole responsibility is to offload low-level data processing for a network interface from the main core. The secondary core will receive packets from the interface (for example, an Ethernet MAC built into the core as a coprocessor), perform a pre-configured set of checks and processing on the data (checksum validation, filtering, VLAN tagging), and then pass the data up to the main core for higher-level processing (protocol stack).

The main core will also pass data to the secondary core for transmission. The secondary core performs a pre-configured set of actions (checksum appending) and sends the data out on the interface. This scenario is typically described as inline acceleration.

Scenario #2. Core 3 is a secondary core whose sole responsibility is to offload specific data encryption/decryption processing functions (tasks) from the main core. The main core will provide data to the secondary core with a description of the task to perform. The secondary core will perform the action and return the result to the main core.

This scenario is typically referred to as look-a-side acceleration. In this scenario, the reason for offloading security processing to core 3 is not only the performance gain of having another core do some of the processing for each packet.

It also has the added benefit of core 3 having hardware acceleration functionality (a hashing coprocessor in this instance), allowing for much faster processing compared to a standard core.
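The look-a-side hand-off described above, where the main core supplies data plus a description of the task and later collects the result, can be pictured with a small descriptor structure. The following is only a sketch: the names `task_desc`, `task_op`, `task_status` and `secondary_core_run` are hypothetical, and the XOR transform is a placeholder standing in for the real cipher, not actual cryptography.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task codes and status values, for illustration only. */
enum task_op { TASK_ENCRYPT = 1, TASK_DECRYPT = 2 };
enum task_status { TASK_PENDING = 0, TASK_DONE = 1, TASK_ERROR = 2 };

/* Descriptor the main core fills in before handing work to core 3. */
struct task_desc {
    enum task_op op;                  /* what the secondary core should do */
    const uint8_t *src;               /* input data */
    uint8_t *dst;                     /* where the result is written */
    size_t len;                       /* number of bytes to process */
    uint8_t key;                      /* placeholder key for the XOR stand-in */
    volatile enum task_status status; /* updated by the secondary core */
};

/* Stand-in for the secondary core. XOR is NOT real cryptography; it only
 * makes the request/result round trip observable in a test. */
static void secondary_core_run(struct task_desc *t)
{
    if (t->src == NULL || t->dst == NULL ||
        (t->op != TASK_ENCRYPT && t->op != TASK_DECRYPT)) {
        t->status = TASK_ERROR;
        return;
    }
    for (size_t i = 0; i < t->len; i++)
        t->dst[i] = t->src[i] ^ t->key;
    t->status = TASK_DONE;
}
```

On real hardware the main core would typically poll the status field or wait for a completion notification rather than call the secondary core as a function; the direct call here is purely a simplification.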

As with any asymmetric multi-core application, the primary objective of both cores 2 and 3 is to free up resources and CPU cycles on the main core for higher-level data processing. Effectively, secondary cores will perform some of the complex functions for the main core, such as hashing and/or encryption (core 3), or provide accelerated input/output (core 2).

An asymmetric multi-core application will always be made up of a main core, the master in the system that will drive the application, and one or more secondary cores, the slaves in the system that perform a limited set of tasks and are driven by either an external signal/interface or by the master core.

For each task, data is passed from one core to another using a shared memory model or using hardware communication mechanisms (a mailbox type of mechanism, for instance).

Typically, only the main core will provide a debug interface. Access to the internals of the secondary cores can vary widely: it can be completely nonexistent, very limited, or can even impede the system's normal operations (access may require halting one of the cores, for example).

Techniques described later in this series will always assume a single debug interface and no access to the secondary cores' internals unless specifically stated. A typical debug interface can be a serial port providing a command line interface or a JTAG interface, for instance.

The debugging guidelines and techniques described in a later part of this series are not confined to the specific application and systems described above. The material presented is highly relevant no matter the architecture of the main and secondary cores.

For instance, we could even argue that, provided there is a main core driving the application and the secondary cores depend on the main core providing them with data to process, all cores could have the same architecture, with the main core being determined at boot time.

In this case, we have a physically symmetric multi-core architecture implementing an asymmetric application, with each core dedicated to a specific task. Further, these techniques could even be applied to systems where the secondary cores are off chip, i.e. a multi-processor configuration.

In the life cycle of a project involving silicon as well as silicon-enabling software, it is often the case that the software is ready before the first version of the silicon is available for use.

Common techniques used to alleviate this are the use of a simulation of the silicon running on a traditional workstation and/or the use of an emulation of the silicon using FPGAs (or some other similar technology).

Both these techniques, although very useful at the start of a project, have the major drawback of being much slower than the real silicon (on the order of 10 times slower for emulation, and one thousand times or more for simulation).

All the debugging techniques listed in this paper will always assume the real silicon to be the starting point of any investigation. Should simulation and/or emulation be relevant to debugging a specific type of issue, it will be specifically stated.

The techniques described hereafter will assume that the acceleration cores are a black box on the real silicon and that access to their internals is very limited or even nonexistent under normal operations.

Problem statement
As stated previously, in an asymmetric multi-core environment, it is often the case that only the main core has a debug interface and that the secondary core(s) are black boxes with no visibility of the internals.

This often raises the issue of attempting to know what is happening in the secondary core, on top of investigating whatever communication mechanism is in place between the cores for passing data back and forth. Debugging tools are often used to connect directly to these secondary cores in some way or form to access internal data that is otherwise inaccessible (VisionICE, JTAG, an internally developed tool, etc.).

However, these tools often have drawbacks, such as impacting the performance of the core while running, causing a change in the behaviour of the system, or having to stop the core to examine its state.

In a real-time application, this type of debugging may not be of any use, as we may need to maintain the behaviour of the system in order to reproduce the issue while debugging. The application will have real-time constraints, which means that any issue under investigation could be related to timing in the system.

In these cases, it is left up to the designer of the application and the tester to incorporate into the application other means of debugging the entire system using the only debug interface, with no impact or minimal impact on performance, thus preserving the real-time behavior of the system during debugging. These techniques are what we will cover in detail at a later stage in the series.

Typical applications: two examples
Detailed in the following section are two possible applications based on an asymmetric multi-core system. Should more detail be required, consulting the documentation on products such as the IXP425 or IXP2350 can be a good starting point.

Inline Ethernet acceleration. The diagram in Figure 2 below gives a simplified overview of a network application with an accelerated network interface. The application design is based on a network processing engine (a purpose-built processor with a radically different architecture from an Intel XScale or Intel architecture core) performing the low-level packet processing for the main Intel XScale core.

As mentioned previously, the specific architecture of each core is incidental to the following material, but it will help in gaining a clear understanding of this specific example. Communication of transmit requests and receive notifications is done through a simple First-In-First-Out queuing mechanism implemented using software-based memory queues.
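A software-based memory queue of this kind is often built as a single-producer, single-consumer ring in shared memory. The sketch below shows one plausible shape for it; the names `msg_queue`, `queue_put` and `queue_get` and the slot count are assumptions made for illustration, not the layout used by any particular product.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SLOTS 8u   /* assumed size; must be a power of two */

/* One direction of traffic, e.g. transmit requests from main core to core 2. */
struct msg_queue {
    volatile uint32_t head;        /* advanced only by the consumer */
    volatile uint32_t tail;        /* advanced only by the producer */
    uint32_t slots[QUEUE_SLOTS];   /* e.g. packet buffer handles */
};

/* Producer side: returns false when the queue is full. */
static bool queue_put(struct msg_queue *q, uint32_t handle)
{
    if (q->tail - q->head == QUEUE_SLOTS)
        return false;
    q->slots[q->tail % QUEUE_SLOTS] = handle;
    /* volatile is not a memory barrier: a real multi-core port would need
     * a barrier (and possibly cache maintenance) before publishing tail. */
    q->tail = q->tail + 1;
    return true;
}

/* Consumer side: returns false when the queue is empty. */
static bool queue_get(struct msg_queue *q, uint32_t *handle)
{
    if (q->head == q->tail)
        return false;
    *handle = q->slots[q->head % QUEUE_SLOTS];
    q->head = q->head + 1;
    return true;
}
```

Because each index is written by only one side, no lock is needed in this arrangement. Both indices live in memory the main core can read, so the queue state (full, empty, stalled) is visible from the main core even when the secondary core itself is a black box.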

Figure 2: Inline acceleration

Look-a-side pixel shading acceleration. The diagram in Figure 3 below gives a simplified overview of a graphics application with look-a-side pixel shading acceleration.

The main core (Intel architecture or Intel XScale) will offload graphics content calculations to the graphics module (a core with multiple hardware thread support). Communication of pixel shading requests and completion notifications is done through the use of hardware rings.

Figure 3: Look-a-side acceleration

Some possible problems
In this section, we will list the different types of issues that can be encountered in an asymmetric multi-core application. This list is by no means exhaustive, and other types of issues are possible. Listed here are just the main, most common issues that can come up in such projects.

How to approach them for an effective and timely defect resolution will be covered in a later part; first, we will lay the groundwork for a common understanding of these typical issues.

Drop in performance. Any real-time application has requirements specifying the performance to be achieved for a specific scenario. Performance is usually specified by the number of events per second that the system can handle. One possible issue that can arise is that the performance drops sharply and unexpectedly during the course of a use case. This drop is recurring: it will happen one or more times every time the test case is run (see Figure 4 below).

The system recovers without any intervention from the user. Below is an example of a system operating at maximum capacity, with approximately 80 percent of incoming events being processed and sent back out; the rest of the events are dropped by the system. You can see clearly that throughput drops on occasion and then recovers.

Figure 4: Recurring performance drop

Another possibility is for performance to drop occasionally on the same scenario. In this case, the drop in performance cannot be reproduced on every instance of the test case and will happen only once (see Figure 5 below). Here as well, the system will recover without any outside intervention.

Figure 5: Occasional performance drop

Yet another scenario involves the throughput dropping (or even stopping altogether) and not recovering until there is an outside intervention, such as the user stopping traffic and then starting it again.

Figure 6: Performance drop requiring outside intervention
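The three drop patterns above can be told apart from a trace of events-per-second samples. The sketch below assumes such a trace is available, for example from a counter sampled once per second on the main core. It counts separate drop episodes and reports whether the trace ended while still in a drop. The names `drop_report` and `find_drops` and the percentage threshold are hypothetical choices for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct drop_report {
    size_t episodes;   /* number of separate drops seen in the trace */
    bool recovered;    /* false if the trace ends while still in a drop */
};

/* A sample counts as a drop when it falls below `percent` of `baseline`. */
static struct drop_report find_drops(const uint32_t *events_per_sec, size_t n,
                                     uint32_t baseline, uint32_t percent)
{
    struct drop_report r = { 0, true };
    uint64_t threshold = (uint64_t)baseline * percent / 100u;
    bool in_drop = false;

    for (size_t i = 0; i < n; i++) {
        bool low = events_per_sec[i] < threshold;
        if (low && !in_drop)
            r.episodes++;          /* entering a new drop episode */
        in_drop = low;
    }
    r.recovered = !in_drop;
    return r;
}
```

Comparing the reports from several runs of the same test case then separates the cases: episodes on every run suggest a recurring drop, an episode on only some runs an occasional one, and `recovered == false` a drop that needs outside intervention.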

Application lock-up. In worst-case scenarios, the application may lock up completely; for example, the secondary core may lock up and become totally unresponsive.

In this scenario, the application was functioning correctly but then unexpectedly stopped, and it does not recover when all activity on the system is stopped and then restarted.

The seriousness of the lock-up can be assessed through the extent of the steps taken by the user to recover the system. It can range from having to reconfigure the application, to stopping and restarting one or more of the secondary cores, or even rebooting the entire system.

Data drops. In a network acceleration scenario, the application may start dropping data while maintaining adequate performance. In this case, the system is not functioning at full capacity and there is still bandwidth available, but the system is dropping data in specific scenarios.

In Figure 7 below, the same system bandwidth is used no matter the packet size (i.e. the same number of bytes is processed per second; only the packet size varies). In specific test cases, a percentage of the data is dropped even though, with slightly different settings, the same throughput can be achieved with no errors.

Figure 7: Percentage of packets processed depending on packet size
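Holding the byte rate constant while varying the packet size changes the number of packets per second considerably, which is one plausible reason why drops can appear only at certain sizes. The helper below is a small worked example; the function name and the figures used are illustrative assumptions and ignore framing overhead such as preamble and inter-frame gap.

```c
#include <assert.h>
#include <stdint.h>

/* Packets per second for a fixed byte rate and a given packet size. */
static uint32_t packets_per_second(uint32_t bytes_per_second,
                                   uint32_t packet_bytes)
{
    if (packet_bytes == 0)
        return 0;                      /* avoid dividing by zero */
    return bytes_per_second / packet_bytes;
}
```

At an assumed 12,500,000 bytes per second (100 Mbit/s), 64-byte packets give 195,312 packets per second while 1,500-byte packets give 8,333, so any fixed per-packet cost on the secondary core is paid more than 20 times as often for the small packets.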

Extra data appearing. While testing secondary core behaviour, it is common to compare actual data with expected data. It is sometimes possible for the actual data to contain all the expected data but also some extra, non-corrupting data. This data can be valid or invalid, but it does not impact the expected data. The extra data can present itself in the form of entirely new extra packets or duplicates of packets transmitted or received.
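A comparison of expected against actual data can be reduced to comparing packet identifiers. The sketch below assumes each test packet carries a unique ID (for instance a sequence number placed in the payload by the test generator) and counts missing, duplicated and entirely new packets. The names `diff_counts`, `count_of` and `compare_ids` are hypothetical, and the quadratic search is only suitable for short test runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct diff_counts {
    size_t missing;      /* expected IDs never seen */
    size_t duplicated;   /* extra copies of expected IDs */
    size_t extra;        /* IDs that were never expected */
};

/* Number of times `id` occurs in `ids`. */
static size_t count_of(const uint32_t *ids, size_t n, uint32_t id)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (ids[i] == id)
            c++;
    return c;
}

/* Assumes the expected IDs are unique. */
static struct diff_counts compare_ids(const uint32_t *expected, size_t n_exp,
                                      const uint32_t *actual, size_t n_act)
{
    struct diff_counts d = { 0, 0, 0 };

    for (size_t i = 0; i < n_exp; i++) {
        size_t seen = count_of(actual, n_act, expected[i]);
        if (seen == 0)
            d.missing++;
        else if (seen > 1)
            d.duplicated += seen - 1;
    }
    for (size_t i = 0; i < n_act; i++)
        if (count_of(expected, n_exp, actual[i]) == 0)
            d.extra++;
    return d;
}
```

A run with `missing == 0` but non-zero `duplicated` or `extra` matches the situation described here: all expected data arrived intact, with something additional alongside it.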

Data corruption. Data corruption is probably the most common of the issues related to any network data processing application or data processing offloading application.

Corruption can take several forms, such as increased or decreased packet length, changed data values causing CRC errors, or corrupted entries in a FIFO queue. Data corruption can be summed up by two scenarios:

* The output data from processing is different from the expected output.
* The data taken out of a storage location is not what was supposed to have been input.

Data corruption can manifest itself at different levels of the application and falls into two broad categories. First, some of these corruptions may happen systematically but may not be detectable under normal functioning. Second, other corruptions may happen occasionally and are detectable immediately through normal path checks (such as CRC checks in a network application).
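As an illustration of a normal-path check, the sketch below computes the standard CRC-32 used by Ethernet (reflected polynomial 0xEDB88320) with a simple bit-by-bit loop and compares it with a stored value. The function names are hypothetical, and in a real system the CRC would normally be generated and checked by the MAC hardware rather than in software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (IEEE 802.3): slow but compact and table-free. */
static uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* True when the data still matches the CRC recorded for it earlier. */
static bool payload_intact(const uint8_t *data, size_t len,
                           uint32_t recorded_crc)
{
    return crc32_ieee(data, len) == recorded_crc;
}
```

Recording a CRC when data enters a queue or buffer and checking it when the data comes back out is one way to make the second scenario above (stored data differing from what was input) detectable from the main core.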

Timing misses. Certain types of applications may have hard deadline requirements whereby the secondary core must be capable of responding to a specific signal within a certain interval of time.

Missing this timing window can have drastic repercussions on the performance and stability of the system. A timing miss may lead to a variety of errors, ranging from an application lock-up to sporadic or systematic data corruption.

A timing miss is not usually linked to a specific defect in the application but more to a lack of performance at a certain point of the data processing. This lack of performance may be due to the application itself or may be related to other applications interfering and, for example, overusing a shared resource.
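If the main core can timestamp both the signal and the secondary core's response, deadline misses can be counted after the fact. The sketch below assumes timestamps taken from a free-running 32-bit tick counter; the function name and the tick units are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count responses that arrived later than `deadline_ticks` after their
 * signal. Unsigned subtraction keeps the latency correct across a single
 * wrap of the 32-bit counter. */
static size_t count_deadline_misses(const uint32_t *signal_ticks,
                                    const uint32_t *response_ticks,
                                    size_t n, uint32_t deadline_ticks)
{
    size_t misses = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t latency = response_ticks[i] - signal_ticks[i];
        if (latency > deadline_ticks)
            misses++;
    }
    return misses;
}
```

Plotting the individual latencies rather than only counting misses can also show whether responses are drifting towards the deadline before the first miss occurs.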

Non-responsive secondary core. A non-responsive core, as opposed to a core lock-up, will present the problem of having a single path from one core to another not functioning.

For instance, any requests using a specific type of communication mechanism might get ignored, or a specific acceleration feature on the secondary core might not be responsive. This type of issue could occur from start-up or after an unknown event which destabilizes the system.

Next in Part 2: Tools and techniques available for debugging multicore applications.

Julien Carreno is a senior engineer and technical lead within the Digital Enterprise Group at Intel Corp. He is currently the technical lead on a team responsible for delivering VoIP solution software for the next generation of Intel's Embedded Intel Architecture processors.

He has worked at Intel for more than three years, specialising in acceleration technology for the Embedded Intel Architecture markets. His areas of expertise are Ethernet, E1/T1 TDM, device drivers, embedded assembler and C development, multi-core application architecture and design.
