How to automate stress tests - Embedded.com

How to automate stress tests

The goal of stress testing is to find system-level interaction problems. Here's a strategy for developing a successful stress-test framework.

Semiconductor testing usually consists primarily of running diagnostic software written to verify each function of the chip. But it's not sufficient to test each function of the chip one by one in isolation.

While these directed tests are a must to ensure that each feature works as claimed, they won't catch most system-level problems, no matter how thorough each test is. This is because such problems occur when multiple functions on the semiconductor device interact in specific ways. The number of possible function interactions greatly exceeds what you can test using a directed-diagnostics-style test. Therefore, good system-level stress testing is needed to exercise multiple functions together in an intense and random fashion to force these design problems to the surface.

In this article, we'll explore the subject of system stress testing for verifying semiconductor device design. Implementing a good system stress test for design validation can be challenging, so I'll describe an example stress-test framework you can use as a basis for creating an expandable and full-featured stress test. You can use the techniques presented here with any semiconductor device or ASIC that's controlled by software.

Stress tests
What is stress testing and how does it apply to semiconductor design? Stress testing consists of exposing a product to much harsher conditions and use than normally occur in the field. The goal is to overwhelm the device in different ways to quickly flush out errors. A mechanical product, for example, may be subjected to harsh environmental conditions such as high pressure, temperature, and vibration. If it survives for a month under this stress, then it should survive much longer during normal usage.

The same concept applies to a chip design. In this case, stress testing consists of executing diagnostic software whose purpose is to get as many functions of the chip as possible operating at one time in a continuous, intense, and random manner that greatly exceeds what occurs through normal use of the chip. These tests are normally run continuously and mostly unattended, including on nights and weekends. Using these tests, you can force lurking system problems to surface quickly, before your product is shipped.

Why is stress testing so important for semiconductor designs? To illustrate, let's assume that it's common for a semiconductor device to be deployed in a variety of electronic systems with varying purposes. In turn, these electronic products will be used by thousands, even millions, of people, each using the product differently. With so much use, problems can be encountered when certain sequences and combinations of functions and events occur on the chip that were never exercised in device testing. These problems usually manifest themselves as occasional lock-ups, data corruption, or other hard-to-reproduce anomalies. If these problems are excessive (or just rare in mission-critical products), your product will have a reputation for unreliability or even instability.

Organizing chaos
During my initial attempt to create a system stress test, my method was to dive in and write software that directly activated all the chip functions I could and have them execute at the same time. I used timer interrupts and had them run random tests while the main program was copying stuff around and starting other chip operations haphazardly, without much upfront planning. As I thought of more ideas to introduce chaotic and random events to the chip, I added them to the code. I quickly realized that although I was finding many problems, they were mostly in my own software!

Test functions in my test interfered with each other by sometimes using the same memory regions or peripheral resources. Or, for example, one test started at random changed the state of a general-purpose I/O that another test was using for a different purpose. The resulting intermittent failures were time consuming to debug.

My next attempt was to develop stress-test software that was extremely organized. Randomly selected test functions took turns running, and interrupts were locked out at many points to avoid software conflicts, along with other safeguards. This conservative coding resulted in great test stability, but it also turned out to be no better at finding chip problems than a directed test, since it ran tests in too predictable and sequential a manner. The purpose of the test, after all, is to test the semiconductor device design, not to have stable test software.

Ideally, we want to avoid both of these extremes. The overall goal of the test is to uncover system-level interaction problems in the chip, and to do that, we must mix things up and cause many different events to occur with random patterns of operation. The trick, however, is to contain this wild side of the test in an organized manner so that you have reliable test code that's expandable and that aids in debugging as much as possible. Meeting these goals takes careful thought and planning. This is where a stress-test framework comes in.

Test framework
The remainder of this article describes the design of a software framework that you can use to build an expandable stress test for a semiconductor device. Although I don't provide the complete framework, what's here should stimulate ideas for building your own stress-test framework.

Software frameworks are used to ease development of specific types of applications by providing a common execution engine and environment for them. An application hooks into the framework using a well-defined interface. The stress-test framework uses this same idea by providing a software engine to handle the stress test's basic execution and common functionality.

The stress-test framework itself doesn't test any chip functionality, but it allows test-event modules that do to be plugged into it. A test-event module is a function that implements a specific semiconductor device test or event, usually using a set of randomly generated test parameters. When executing the stress test, the framework engine will randomly select and launch these test-event modules in a variety of ways.

The framework also provides basic services for test-event modules to use, including functions to reserve system resources, select random test parameters, and log data. Most of the upfront planning and design goes into the framework itself, since the framework handles the complicated and “wild” side of stress testing. Individual test-event modules are then added as needed to expand the stress test. Since the framework handles the common stress-test code, test-event module developers can focus on writing a good test for the targeted chip function. See Figure 1 for a high-level view of the various components of the stress test framework.


Figure 1: Components of a stress-test framework

The random test pool is a central part of the stress test and is where the actual semiconductor device tests reside. This pool is the collection of test-event modules written to initiate events and/or perform a test on some specific feature of the chip. As new test-event modules are developed, you can add them to the pool to be included in the next stress test run. I'll explain the test-event module in more detail later, since these are the actual semiconductor device tests developed for the chip and launched and supported by the framework.

When executing, the stress-test engine randomly selects test-event modules from the pool and runs them. To mix things up more (and thus put more stress on the device design), the framework uses asynchronous slots to randomly select and launch tests at varying time intervals.

A “slot” is implemented using a hardware timer, and you can have as many slots as there are timers on the device or test board. The framework selects a random interval for each slot timer. When a timer expires, its interrupt preempts the currently running test module, and the timer's interrupt handler selects another test module to execute. The engine then selects another random interval for that slot's next firing. In this way, tests execute in an interleaved fashion, producing a variety of execution patterns as well as test execution parallelism, as we'll see later.
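To make the slot idea concrete, here is a minimal sketch that runs on a host rather than on real hardware. Everything in it is an assumption for illustration: the names TimerArm, RandomSlotInterval, and SlotTimerISR, the interval bounds, and the small xorshift random helper are hypothetical, and TestLaunch is only a counting stand-in for the framework's launcher.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLOTS          3      /* assumption: three hardware timers */
#define MIN_INTERVAL_TICKS 10u    /* assumption: shortest slot interval */
#define MAX_INTERVAL_TICKS 1000u  /* assumption: longest slot interval */

uint32_t rngState = 1u;           /* xorshift32 state, must be nonzero */

/* Hypothetical random helper: returns a value in the range 0..n-1 */
uint32_t GetRandomValue(uint32_t n)
{
    assert(n > 0u);
    rngState ^= rngState << 13;
    rngState ^= rngState >> 17;
    rngState ^= rngState << 5;
    return rngState % n;
}

uint32_t armedTicks[NUM_SLOTS];   /* host stand-in for timer reload registers */
unsigned launchCount;             /* number of modules launched so far */

/* Hypothetical HAL call: program timer 'slot' to fire after 'ticks'.
   On a host it only records the value. */
void TimerArm(int slot, uint32_t ticks)
{
    assert(slot >= 0 && slot < NUM_SLOTS);
    armedTicks[slot] = ticks;
}

/* Stand-in for the random test launcher */
void TestLaunch(void)
{
    launchCount++;
}

/* Pick the next random interval for a slot, bounds inclusive */
uint32_t RandomSlotInterval(void)
{
    return MIN_INTERVAL_TICKS +
           GetRandomValue(MAX_INTERVAL_TICKS - MIN_INTERVAL_TICKS + 1u);
}

/* Body of the timer interrupt handler for one slot: launch a randomly
   selected module, then schedule this slot's next firing. */
void SlotTimerISR(int slot)
{
    TestLaunch();
    TimerArm(slot, RandomSlotInterval());
}
```

Re-arming the timer after the module returns keeps a slot from interrupting itself; other slots can still preempt the running module, which is where the interleaving comes from.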

The framework code to select and launch a test is simple since a test-event module is nothing more than a C function. Listing 1 shows a snippet of code for the launcher. In general, the framework keeps an array of C functions, which are the test modules, selects a random index, looks up the function, and executes it.

Listing 1: Random test launcher

typedef void (*TestModuleFunction)(void);

/* Test-event modules, defined elsewhere */
void TE_CacheToggle(void);
void TE_CPUCopy(void);
void TE_DMAm2m(void);
void TE_DMAUart(void);

/* Framework support function, assumed to return a value in 0..n-1 */
int GetRandomValue(int n);

const TestModuleFunction testEventModulePool[] =
{
    TE_CacheToggle,
    TE_CPUCopy,
    TE_DMAm2m,
    TE_DMAUart,
};

void TestLaunch(void)
{
    int TE_index;
    int nModules = sizeof(testEventModulePool)/sizeof(testEventModulePool[0]);

    /* Select a test event module at random, then launch it */
    TE_index = GetRandomValue(nModules);
    testEventModulePool[TE_index]();
}

Test modules
As I explained earlier, a test-event module is a function that performs an action and possibly a verification on a specific feature of the semiconductor device. Although the framework invokes the test module as a C function, the module can call assembly-level code if required. The framework is written so that you can develop new modules and add them to the test pool to become candidates for random selection by the framework during the stress test. The test modules in the following example framework are void C functions that take no arguments. It's up to the module itself, with the help of framework-support functions, to initiate the test and call appropriate functions to log the results.

There are three types of test-event modules:

  1. event only
  2. verification test
  3. delayed verification test

As its name suggests, an event-only module initiates a system event without doing any verification. This could involve changing the system in some way or starting some background functionality. You may wonder what's the benefit in initiating an event without performing any verification. To illustrate, consider a test-event module that simply toggles the processor cache from its current setting. Every time the framework randomly selects this module, the cache is toggled from on to off or vice versa. This adds an extra variable in that other test modules are run with the cache sometimes on and sometimes off.
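An event-only module can be very small. The following sketch shows the shape of a cache-toggle module; CacheEnable and CacheDisable are hypothetical stand-ins for whatever cache-control routines your processor provides, and here they only flip a flag so the code runs on a host.

```c
#include <assert.h>
#include <stdbool.h>

bool cacheEnabled;   /* host stand-in for the cache-control state */

/* Hypothetical HAL calls. A real CacheDisable would typically flush or
   write back dirty lines before turning the cache off. */
void CacheEnable(void)  { cacheEnabled = true;  }
void CacheDisable(void) { cacheEnabled = false; }

/* Event-only test module: changes system state and exits, no verification */
void TE_CacheToggle(void)
{
    if (cacheEnabled)
        CacheDisable();
    else
        CacheEnable();
}
```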

The second type of test-event module, a verification test, performs an action as well as a verification of that action before the test module function exits. For example, the test could fill a randomly selected memory region with a random data pattern and then copy this area using random CPU access widths to a random destination memory region and verify the result. An example of a problem such a test might uncover is a failed write to an 8-bit PCMCIA memory region when immediately preceded by a 16-bit read of system SRAM.
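As an illustration of a verification test, the sketch below copies a block using a random mix of 8-, 16-, and 32-bit accesses and then checks the destination. The function names and the xorshift random helper are assumptions made for this example. The volatile pointer casts are the usual embedded idiom for forcing a particular access width; they step outside strict ISO C aliasing rules, so treat this as a sketch rather than portable code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

uint32_t rngState = 0x2545F491u;  /* xorshift32 state, must be nonzero */

/* Hypothetical random helper: returns a value in the range 0..n-1 */
uint32_t GetRandomValue(uint32_t n)
{
    assert(n > 0u);
    rngState ^= rngState << 13;
    rngState ^= rngState >> 17;
    rngState ^= rngState << 5;
    return rngState % n;
}

/* Copy len bytes using a random mix of access widths. The width is
   narrowed when the addresses are misaligned or the block end is near. */
void RandomWidthCopy(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;

    while (i < len) {
        size_t width = (size_t)1 << GetRandomValue(3);  /* 1, 2, or 4 bytes */
        uintptr_t addrBits = (uintptr_t)(src + i) | (uintptr_t)(dst + i);

        while (width > 1 &&
               (i + width > len || (addrBits & (width - 1)) != 0))
            width >>= 1;

        if (width == 4)
            *(volatile uint32_t *)(dst + i) = *(const volatile uint32_t *)(src + i);
        else if (width == 2)
            *(volatile uint16_t *)(dst + i) = *(const volatile uint16_t *)(src + i);
        else
            dst[i] = src[i];

        i += width;
    }
}

/* Fill, copy, and verify. Returns 0 on pass, -1 on a data mismatch.
   The two regions must not overlap. */
int CopyAndVerify(uint8_t *src, uint8_t *dst, size_t len)
{
    uint8_t seed = (uint8_t)GetRandomValue(256);
    size_t i;

    assert(src != dst);
    for (i = 0; i < len; i++) {
        src[i] = (uint8_t)(seed + i);   /* incrementing pattern */
        dst[i] = (uint8_t)~src[i];      /* make a skipped byte detectable */
    }

    RandomWidthCopy(dst, src, len);

    for (i = 0; i < len; i++) {
        if (dst[i] != (uint8_t)(seed + i))
            return -1;
    }
    return 0;
}
```

In a real module, the source and destination would come from the framework's memory-block reservation functions rather than being passed in.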

The third type of test module, a delayed verification test, is the most powerful, although usually only possible for chip functions that can master the bus (in other words, run in the background of the CPU). This type of test module will start a test and then exit, enabling other test modules to execute while the original test continues to run. In most cases, the test completion is indicated by an interrupt. When the interrupt occurs, the interrupt handler will verify the results of the test.

An example of a delayed verification test might be a direct memory access (DMA) transfer. A test module can be written to select random source and destination memory areas and a DMA channel. Such a test module can fill the source memory with a random value and install a DMA complete interrupt handler that will verify the test results. This interrupt handler, known as a delayed verifier, will have access to data structures that have the details of the test that was initiated and will verify the results of the test when invoked.
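The split between starting a test and verifying it later might be sketched as follows. This is a host-runnable illustration under stated assumptions: DmaTestContext, DmaStartM2M, TE_DMAm2m_Start, and DmaCompleteISR are hypothetical names, and the DMA "hardware" is a stub that copies immediately, whereas a real controller would run in the background and raise an interrupt on completion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_DMA_CHANNELS 8   /* assumption for this sketch */

/* Details of an in-flight test, kept for the delayed verifier */
typedef struct {
    const uint8_t *src;
    uint8_t       *dst;
    size_t         len;
    int            active;
} DmaTestContext;

DmaTestContext dmaCtx[NUM_DMA_CHANNELS];
unsigned passCount;
unsigned failCount;

/* Hypothetical HAL stub: real code would program the controller and return
   at once. Here the copy happens immediately so the sketch runs on a host. */
void DmaStartM2M(int ch, uint8_t *dst, const uint8_t *src, size_t len)
{
    (void)ch;
    memcpy(dst, src, len);
}

/* Test-event module body: fill the source, record the test details,
   start the transfer, and exit without verifying. */
void TE_DMAm2m_Start(int ch, uint8_t *src, uint8_t *dst, size_t len,
                     uint8_t pattern)
{
    assert(ch >= 0 && ch < NUM_DMA_CHANNELS);
    memset(src, pattern, len);
    dmaCtx[ch].src = src;
    dmaCtx[ch].dst = dst;
    dmaCtx[ch].len = len;
    dmaCtx[ch].active = 1;
    DmaStartM2M(ch, dst, src, len);
}

/* Delayed verifier: called from the DMA-complete interrupt for channel ch */
void DmaCompleteISR(int ch)
{
    DmaTestContext *c = &dmaCtx[ch];

    if (!c->active)
        return;                      /* no test pending on this channel */
    if (memcmp(c->src, c->dst, c->len) == 0)
        passCount++;
    else
        failCount++;
    c->active = 0;                   /* test finished */
}
```

The context array is indexed by channel, so each in-flight transfer carries its own source, destination, and length until its completion interrupt arrives.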

An example
To illustrate how stressful one of these tests can be, consider this simple example of a test framework with these four test-event modules:

1) Simple CPU-based memory copy (TE_CPU_COPY):
Fills a random data pattern to a random source address of random block size, then copies the pattern using a random mix of access widths to a randomly selected destination address. After the copy, you'll verify that the data in the destination area contains the previously selected data pattern.

2) DMA Memory-to-Memory copy (TE_DMA_M2M):
Fills a random data pattern to a random source address of random block size then selects a random Memory-to-Memory DMA channel and programs it to copy from the source address to a randomly selected destination address. When the DMA complete interrupt occurs, verify that the two memory regions match.

3) DMA to Peripheral (TE_DMA_M2P):
Fills a random data pattern to a random source address of random block size then selects a random DMA channel and programs it to copy to a random port number of a specific peripheral. When the DMA complete interrupt occurs, verify that the correct data is in the peripheral's FIFO (or otherwise sent to peripheral).

4) Cache Toggle (TE_CACHE):
Toggles the cache from on to off or vice versa and exits. This test module is an event-only test module and doesn't perform verification when run.

Assume that we have three slot timers allocated to the framework as test-event module launchers. We'll say that the main() routine that launches them is in slot 0. A specific run might look something like the one shown in Listing 2.

Listing 2: An example run of a stress-test framework

Test launches TE_CPU_COPY, slot 0
Test launches TE_CACHE, slot 2
Exit TE_CACHE, slot 2
Test launches TE_DMA_M2P (dma ch=2), slot 1
Exit TE_DMA_M2P, slot 1
Test launches TE_DMA_M2M, slot 3
Exit TE_DMA_M2M, slot 3 (dma ch=5)
Test launches TE_DMA_M2M, slot 1
Exit TE_DMA_M2M, slot 1 (dma ch=4)
Exit TE_CPU_COPY, slot 0
Launching TE_CACHE, slot 0
Exiting TE_CACHE, slot 0
Launching TE_DMA_M2M, slot 0 (dma ch=6)
Delayed verification slot 1 TE_DMA_M2P (dma ch=2)
TE_DMA_M2P Slot 1 (dma ch=2) - TEST PASSED
Exiting TE_DMA_M2M, slot 0

The run starts with the memory copy test. While it's executing, the slot 2 timer expires in the middle of the copy. The cache-toggle test-event module is then launched; it toggles the cache state and exits. The copy test is resumed with the cache in the new position. Again, before that test completes, the slot 1 timer fires and a DMA M2P test module is launched. This sets up the random DMA and exits. Now the DMA is occurring in the background and the original copy test is resumed. A slot 3 timer interrupt then occurs and initiates a memory-to-memory DMA transfer using DMA channel 5 and exits. I could go on, but you can see that with this test, you can get all the DMA channels running at one time with the cache toggling at random and the CPU randomly accessing different memory areas throughout the test.

Test parameters
A test-event module will use one or more randomly selected test parameters to carry out the test operation. Examples of test parameters used by tests are source and destination memory addresses, block sizes, data patterns, access widths (such as 8, 16, or 32 bit), DMA channels, and UART ports.

While selecting test event functions at random seems straightforward, how do we select random test parameter values while ensuring these values are valid for the system and do not overlap somehow with other tests such that the tests interfere? For example, if one test selects a random memory address and size, we need to know that (1) the memory is actually there and that (2) another test has not selected a memory region overlapping this one (since one test could overwrite the other test's memory causing a failure). Another example of a contention would be two tests that both randomly select DMA channel 3 to do a transfer.

The framework provides support functions so that test modules can safely select random test parameters. To help you understand, here's an example of the most commonly used test parameter, a randomly selected system memory block. A random memory block will have a start address and a size. Both are random values. The memory block can reside anywhere in the CPU's address space including on SRAM, PCMCIA, and flash memory.

In the example in Listing 3, the array memPool is filled with the start address and size of memory segments present in the system address space, and thus available for random selection by the test. Each entry has a reservation flag (the inUse element). To select a random block, the function first selects a random index into the memory segment array. Then it selects a random start address within the segment as well as a block size. The maximum block size is the end of the memory segment less the selected start address. Once the random block's start and size are selected, the entire memory segment is marked as reserved by setting the reservation flag.

Listing 3: Random memory-block selection

#include <stdint.h>
#include <stdbool.h>

typedef struct
{
    uint32_t  start_addr;
    uint32_t  size;
    bool      inUse;
} memSeg;

memSeg memPool[] =
{
/*    start addr        size                inUse */
    { 0x10000000,0x00040000, 0},
    { 0x10040000,0x00040000, 0},
    { 0x10080000,0x00040000, 0},
    { 0x100c0000,0x00040000, 0},
    { 0x30000000,0x00040000, 0},
    { 0x30040000,0x00040000, 0},
    { 0x30080000,0x00040000, 0},
    { 0x300c0000,0x00040000, 0},
    { 0x30100000,0x00008000, 0},
    { 0x40004000,0x00004000, 0},
    { 0x40008000,0x00004000, 0},
    { 0x4000c000,0x00004000, 0},
    { 0xa0000000,0x00002000, 0},
    { 0xb0000000,0x00008000, 0},
    { 0xb0008000,0x00008000, 0},
    { 0xb0010000,0x00004000, 0}
};

/* MAX_RETRY_COUNT, MAX_BLOCK_SIZE, and GetRandomNumber() (assumed to return
   a value in the range 0..n-1) are provided elsewhere in the framework. */

int SelectAndReserveBlock(unsigned int *addr, unsigned int *size)
{
    int retries = 0; /* limit the number of times we try to get a segment */
    int seg = -1;
    unsigned int max_size;

    while (retries < MAX_RETRY_COUNT)
    {
        /* Select a random memory segment (index into memPool array) */
        seg = GetRandomNumber(sizeof(memPool)/sizeof(memSeg));

        if (memPool[seg].inUse)
        {
            retries++;
            continue;
        }

        /* reserve the segment */
        memPool[seg].inUse = 1;

        /* select random block in segment, and don't let it run past the end
           of the segment. */
        *addr = memPool[seg].start_addr + GetRandomNumber(memPool[seg].size);
        max_size = memPool[seg].size - (*addr - memPool[seg].start_addr);
        if (max_size > MAX_BLOCK_SIZE)
            max_size = MAX_BLOCK_SIZE;
        *size = 1 + GetRandomNumber(max_size); /* 1..max_size, never zero */
        break;
    }

    /* return the segment number so it can be used to free it later,
       or return -1 if we gave up looking for a free segment */
    return (retries >= MAX_RETRY_COUNT) ? -1 : seg;
}

void FreeMemorySegment(int seg)
{
    memPool[seg].inUse = 0;
}

A side effect of this method is that using a random memory block within a memory segment, as defined in the memory segment array, reserves the entire memory segment. Couldn't we run out of memory segments if they're all locked? Yes, and the way around this is to break the memory into smaller segments. We may want to divide up the segments in different ways in order to distribute the probability of selecting memory blocks across different memory devices of varying data widths and types. Note that this scheme is kept simple. We could have made it much more powerful, but the key is simply to have direct control over indicating what memory regions are available.

Managing the selection and reservation of other resources can be even simpler. For example, an array of reservation flags can be maintained to manage DMA channels, where the index into the array represents the channel. You can write a ReserveDmaChannel() function that picks a random DMA channel number, within the range of channels on the chip, whose reservation flag is not set. It then sets the flag to reserve the channel and returns the channel number. When the test is finished, a FreeDmaChannel() can be called to clear the flag.
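One possible shape for these two functions is sketched below. The channel count, the retry limit, and the xorshift random helper are assumptions for the example; the retry-and-give-up approach simply mirrors the memory-segment selection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_DMA_CHANNELS 8    /* assumption: channels on the chip */
#define MAX_RETRY_COUNT  16   /* assumption: tries before giving up */

bool dmaInUse[NUM_DMA_CHANNELS];   /* reservation flag per channel */

uint32_t rngState = 1u;            /* xorshift32 state, must be nonzero */

/* Hypothetical random helper: returns a value in the range 0..n-1 */
uint32_t GetRandomValue(uint32_t n)
{
    assert(n > 0u);
    rngState ^= rngState << 13;
    rngState ^= rngState >> 17;
    rngState ^= rngState << 5;
    return rngState % n;
}

/* Returns a randomly chosen free channel and marks it reserved,
   or -1 if no free channel was found within the retry limit. */
int ReserveDmaChannel(void)
{
    int retries;

    for (retries = 0; retries < MAX_RETRY_COUNT; retries++) {
        int ch = (int)GetRandomValue(NUM_DMA_CHANNELS);

        if (!dmaInUse[ch]) {
            dmaInUse[ch] = true;
            return ch;
        }
    }
    return -1;
}

/* Clears the reservation flag when the test is finished */
void FreeDmaChannel(int ch)
{
    assert(ch >= 0 && ch < NUM_DMA_CHANNELS);
    dmaInUse[ch] = false;
}
```

A test module that gets -1 back can simply exit without running; the next random launch will try again. On a real system, the flag check and set would also need protection from slot interrupts.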

Debug tips
Debugging and isolating problems involving system interactions can be difficult; you should design the stress test framework to assist in this process. The issues you'll encounter will differ depending on whether you're running in a presilicon simulation environment or on a board with real silicon.

The biggest simulation issue is speed. A simulated CPU runs much slower than a real device. The advantage of a simulation environment is that the simulation tool can output a detailed log of any of the design's internal signals. More signals can be logged as you start to zero in on the problem. This would make the debug straightforward except for one issue—reproducing the problem. If the problem was found after running the stress test on 10 machines over a weekend, it would be unacceptable to debug when it takes so long to make the problem happen.

The solution is to run the stress test as a large number of shorter (say, 30-minute) tests. This is done by building different tests, each with a set number of test-event modules to be executed before exiting. In addition, each test should be assigned a unique random number seed so that another test run with that seed will recreate the exact sequence of test events and test parameters every time it's run.
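The reproducibility property depends on every random choice flowing from one seeded generator. The sketch below illustrates that with a small xorshift generator; the generator, SeedRandom, RecordRun, and RunsMatch are all hypothetical names chosen for this example, not part of any particular framework.

```c
#include <assert.h>
#include <stdint.h>

#define TRACE_LEN 32   /* number of selections recorded per run */

uint32_t rngState = 1u;

/* Seed the generator. xorshift32 cannot leave state 0, so map 0 to 1. */
void SeedRandom(uint32_t seed)
{
    rngState = (seed != 0u) ? seed : 1u;
}

/* Hypothetical random helper: returns a value in the range 0..n-1 */
uint32_t GetRandomValue(uint32_t n)
{
    assert(n > 0u);
    rngState ^= rngState << 13;
    rngState ^= rngState >> 17;
    rngState ^= rngState << 5;
    return rngState % n;
}

/* Record the module indices that a run with this seed would select */
void RecordRun(uint32_t seed, uint32_t nModules, uint32_t trace[TRACE_LEN])
{
    int i;

    SeedRandom(seed);
    for (i = 0; i < TRACE_LEN; i++)
        trace[i] = GetRandomValue(nModules);
}

/* Returns 1 if two seeds produce the same selection sequence, else 0 */
int RunsMatch(uint32_t seedA, uint32_t seedB, uint32_t nModules)
{
    uint32_t a[TRACE_LEN];
    uint32_t b[TRACE_LEN];
    int i;

    RecordRun(seedA, nModules, a);
    RecordRun(seedB, nModules, b);
    for (i = 0; i < TRACE_LEN; i++) {
        if (a[i] != b[i])
            return 0;
    }
    return 1;
}
```

The same idea extends to test parameters: if addresses, sizes, channels, and slot intervals are all drawn from this one generator, logging the seed is enough to rebuild the run in a deterministic environment.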

A good execution script can randomly generate seed values, build a test, and launch it. The seed value should be logged with the results. To reproduce a failure, you can rebuild the test (since we probably don't want to keep all test executables around since there can be thousands) with that seed and then reproduce the problem in a short time.

On actual silicon, you can run many more stress tests due to vastly greater execution speed. In theory, it should still take about the same time to find problems, since the more frequently occurring ones should have been caught in presilicon stress testing. Again, the big challenge is to reproduce the problem; you can try the techniques I described for presilicon testing, but they may not work as well since the actual system may not be as deterministic as it is in simulation. In other words, two test runs with the same random test seed may not result in precisely the same run sequence each time. In addition, you won't be able to log internal signal activity and should, therefore, rely on techniques such as logging checkpoint information to memory or a high-speed port, inserting various logic analyzer trigger events, and employing other tricks to help track down the problem details.

No sure formulas for debugging system problems on silicon exist. All I can say is that you should think carefully and plan as if you will get elusive failures, writing the code to supply as many clues to the problem as possible.

In addition, your framework should include the ability to build test runs with certain test modules excluded so that stress testing can go on without stopping at known problems. You may also want to stress only certain chip functions if excluding the rest will allow you to concentrate your execution on these particular functions.

The framework needs to have an organized set of scripts (not provided here) to generate the seeds, build the tests, execute, store test logs, post-process, and alert the team to problems encountered. The process should be completely automated so that you can run tests unattended for long periods of time.

Logging
At a minimum, the stress-test framework should log which test modules are run and any failures that occur. A good log will include additional information such as the random parameters selected and when each test-event module was started, exited, and verified (the latter for delayed verification functions). In this way, you can see how the tests were run in relation to each other. Post-processing software can be written to generate some useful reports to show test coverage.

You should develop the logging functionality such that the stress-test software does the least possible work to log an event. In a simulation environment, HDL code can be written to provide a log peripheral where you can just write a token with a value and the HDL will generate the complete log message. On actual silicon, you could send byte tokens and values to a memory card or out a debug port. The whole point is to keep the logging itself from consuming valuable test execution time and causing undue interference. The task of making the log clearer and prettier can be performed by post-processing software.
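A token log can be as simple as a ring buffer of small fixed-size records that post-processing software later expands into readable messages. The sketch below is one way to do it; the record layout, the token values, and the buffer size are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_ENTRIES 256u   /* assumption: power of two, so wrap is a mask */

/* Hypothetical token values; post-processing maps them to text */
enum {
    LOG_TE_START = 1,
    LOG_TE_EXIT,
    LOG_TE_VERIFY_PASS,
    LOG_TE_VERIFY_FAIL
};

typedef struct {
    uint16_t token;   /* what happened */
    uint16_t slot;    /* which slot it happened in */
    uint32_t value;   /* module id, address, channel, and so on */
} LogEntry;

LogEntry logBuf[LOG_ENTRIES];
uint32_t logHead;     /* total number of entries written so far */

/* Record one event with a handful of stores and no formatting.
   Note: on real hardware the index claim below must be made atomic
   (or done with slot interrupts masked), since slots can preempt
   each other in the middle of a log call. */
void LogToken(uint16_t token, uint16_t slot, uint32_t value)
{
    uint32_t idx = logHead++ & (LOG_ENTRIES - 1u);

    logBuf[idx].token = token;
    logBuf[idx].slot  = slot;
    logBuf[idx].value = value;
}
```

When the buffer wraps, the oldest entries are overwritten, which keeps the most recent history leading up to a failure.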

Other suggestions
Since your framework will involve shared data structures that are accessed from interrupt handlers, you should be careful to protect critical sections in the code where these structures are accessed. For example, consider a test event that has just read the inUse flag of a memory segment and found it free, but before the test sets the flag to reserve the segment, a slot interrupt occurs. The test module that the interrupt starts could end up reserving the same segment as the interrupted test module, since the segment still shows as free. The slot-timer interrupts should be disabled during critical global data accesses like this.
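The reservation race can be closed by making the check and the set of the flag a single critical section. In this sketch, DisableSlotInterrupts and EnableSlotInterrupts are hypothetical stand-ins for your platform's interrupt-masking calls; on a host they only track nesting depth so the example can run.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_SEGMENTS 4        /* assumption for this sketch */

bool segInUse[NUM_SEGMENTS];  /* reservation flag per memory segment */
int  irqDisableDepth;         /* host stand-in for the interrupt mask */

/* Hypothetical HAL calls. Real code would mask and unmask the slot-timer
   interrupts, ideally saving and restoring the previous mask state. */
void DisableSlotInterrupts(void) { irqDisableDepth++; }
void EnableSlotInterrupts(void)  { irqDisableDepth--; }

/* Check and set the flag with slot interrupts masked, so no other test
   module can observe the segment as free in between. Returns true if
   the caller now owns the segment. */
bool TryReserveSegment(int seg)
{
    bool reserved = false;

    assert(seg >= 0 && seg < NUM_SEGMENTS);

    DisableSlotInterrupts();
    if (!segInUse[seg]) {
        segInUse[seg] = true;
        reserved = true;
    }
    EnableSlotInterrupts();

    return reserved;
}
```

Keeping the masked region this short matters: every cycle spent with slot interrupts disabled is a cycle in which the test behaves more like a sequential, directed test.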

When designing a stress-test framework, there are endless possibilities for expansion. One example is the ability to adjust probability weights of test-event modules or even test parameters to fine-tune the stress test. For instance, maybe the cache toggle event is one of 100 test-event modules, but you want it to run much more than 1% of the time.
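Weighted selection needs only a table of weights alongside the module pool. The following sketch separates the pure selection step from the random draw so it can be checked deterministically; the weight values and function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_MODULES 4

/* Hypothetical weights, one per module in the pool. A weight of 0
   excludes a module from the run entirely. */
const uint32_t moduleWeights[NUM_MODULES] = { 1u, 10u, 0u, 5u };

/* Sum of all weights: the size of the range to draw from */
uint32_t TotalWeight(const uint32_t *weights, int n)
{
    uint32_t total = 0u;
    int i;

    for (i = 0; i < n; i++)
        total += weights[i];
    return total;
}

/* Map a draw in the range 0..total-1 to a module index. Each module
   owns a span of the range as wide as its weight. Returns -1 if the
   draw is outside the range. */
int PickByWeight(const uint32_t *weights, int n, uint32_t draw)
{
    int i;

    for (i = 0; i < n; i++) {
        if (draw < weights[i])
            return i;
        draw -= weights[i];
    }
    return -1;
}
```

The launcher would then draw a random value below TotalWeight() and call the module at the index PickByWeight() returns. With the weights shown, module 1 is selected 10 times out of 16 on average, and module 2 never runs.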

Taking the time to plan, design, and implement a solid stress-test framework for your silicon device is a great investment in product quality.

Steve Babin has worked on embedded systems for over 16 years. He's currently a consultant at IBM's Pervasive Computing department developing software for smartphones. He holds a BSEE from Louisiana State University.
