The Zen of Diagnostics - Embedded.com

The Zen of Diagnostics

Give a tech a break. As software engineers, it is our responsibility to give technicians the tools they need to ship the product.

Software engineers rarely concern themselves with how a product is manufactured. Sure, we work like slaves meeting performance goals, but the product must do more than function correctly–it must be producible. As our hardware brethren tune their designs to meet manufacturing, cost, and performance goals, we work in relative isolation–a sort of bed of coals where few dare to tread.

In high school I worked for a while in a machine shop, serving as a helper to highly skilled machinists making parts for the space program. When they discovered I intended to go to college and become an engineer, a veteran warned me to not “design something we can't make.” Now, 20 years later, I have to admit that these words, though casually and freely given, have been more important than most of the engineering courses I struggled through. Designing a fantastic widget is suicidal if it cannot be made and marketed at a profit.

In a typical manufacturing operation, boards are stuffed, tested, assembled into units, and repaired. While the code doesn't impact board stuffing and assembly, it can strongly influence test and repair. Smart designers build products that fit easily into the company's manufacturing operation. We can contribute by writing code that speeds production testing.

All employees work toward common corporate goals, yet each has a different vision of the company's needs and problems. To programmers, testing conjures images of correctness proofs, exhaustive software trials, and code-coverage analysis. Production people have probably never heard of these concepts. They look at testing as the daily routine of ensuring that every unit works correctly before being shipped.

Conventional software testing is a one-time event; once the product is finished, it is ready to be shipped. Product testing goes on every day. Very complex products are tested and repaired by technicians with little formal computer training. As software engineers, it is our responsibility to give technicians the tools they need to ship the product.

Internal diagnostics
Many embedded systems include diagnostics in the product's ROM to give a sort of go or no-go indication without using other test equipment. The unit's own display or status lamps show test results. Internal diagnostics are worthwhile because they enable test technicians to track down problems. They're also an effective marketing tool, giving customers confidence in the product each time they turn it on.

Though internal diagnostics are often viewed as a universal solution to system problems, their true value lies more in giving a crude test of limited system functions. Not that this test isn't valuable. Internal diagnostics can test quite a bit of the unit's I/O, CPU, RAM, and ROM areas. The kernel frequently defies stand-alone testing, because so much of it must be functional for the diagnostics to run. A majority of systems couple the main ROM and RAM close to the processor. As a result, a single address, data, or control-line short can prevent the program from running at all.

It's easy to invest a lot of time coding internal diagnostics that will never provide useful information. They may satisfy vague marketing promises of self-testing capability. Internal diagnostics have limitations but, if carefully designed, they can yield valuable information. If we can apply our engineering expertise to the diagnostic problem, we can carefully analyze the tradeoffs and identify reasonable ways to find at least some common hardware failures.

First, separate your tests into kernel (CPU, RAM, ROM, and decoders) and beyond-kernel (I/O) tests. Then consider the most likely failure modes. Try to design tests that will survive the failure, then identify and report it. Kernel tests will never be very robust, since many hardware glitches will prevent the program from running. But, if carefully designed, they'll really help your friends in production.

What portions of the kernel should be tested? Some programmers have a tendency to test the CPU chip, running an instruction sequence to prove that this essential chip is operating properly (the ubiquitous PC's BIOS CPU tests, for example). But I wonder just how often failures are detected. On the PC, an instruction-test failure makes the code execute a HALT, causing the CPU to look as if it never started.

More extensive error reporting with a defective CPU is a fool's dream. Instruction tests stem from minicomputer days, where hundreds of discrete integrated circuits were needed to implement the CPU. A single chip failure might only shut down a small portion of the processor. Today's highly integrated parts tend to either work or not; partial failures are pretty rare.

Similarly, memory tests have to rely on operating memory–a high-tech oxymoron that makes me question the test's value. Obviously, if the ROM is not functioning, the program won't start, and the diagnostic will not be invoked. If the diagnostic is coded with subroutine calls, a RAM (read: stack) failure will prematurely crash the test without providing any useful information. The object is to design diagnostics so that the tests use as few unproven components as possible.

A carefully engineered RAM test can be quite valuable. In a multiple-ROM system, it (and all the diagnostics) should be stored in the boot ROM, preferably near the reset address. The low-order address lines must operate to run a trivial program, and enough of the upper-order address lines must work to select the boot ROM. In a small system with undecoded address lines, try to write test routines that rely on a minimal number of address wires.

RAM tests usually write out a pattern of alternating ones and zeroes, read the data back, and repeat the test using data that complements the original data. This approach reflects poor problem analysis. Before writing code, put on your hardware hat (or get help from another member of the design team) and consider the most likely failure modes. Tailor the test to identify all or most of these problems. Nine times out of 10, address or data-line shorts will crash the diagnostic. Sometimes RAM is isolated from the kernel by buffers. In this case, post-buffer problems can be found.

One or more signals may be needed to turn each device on. Chip-select complexity can range from a simple undecoded address line to a nightmarish spaghetti of PALs and logic. Regardless, every RAM must receive a proper chip selection to function. Multiple addressing is a variant of the chip-selection problem. If more than one memory device is used, several can sometimes be turned on at once.

Bad chips
Semiconductor vendors do a wonderful job of delivering functioning chips. Occasionally though, a bad chip may slip through (or, through mishandling, one may toggle to the bad state). Device geometry is now so small that it is unusual to see the pattern sensitivity that once plagued DRAMs. Usually the entire chip is bad, making problem identification a lot easier. Frequently, socketed RAM devices are improperly seated, and sometimes a pin bends under the device.

Dynamic RAMs require periodic accesses to keep the memory alive. This refresh signal is generated by external logic or the CPU. Sometimes the refresh circuitry fails. The entire memory array must be refreshed every few milliseconds to stay within the chip's specifications. However, it's surprising just how long DRAMs can remember their data after losing refresh. One to two seconds seems to be the limit.

With these potential problems in mind, we can design a routine to test RAMs. The first criterion is that the test cannot use RAM. Several of the failure modes manifest themselves by the inability to store data. For example, data-bus problems, a bad device, chip-select failures, or an incorrectly inserted pin will usually exhibit a simple read-and-write error. The traditional write and read of a 55, followed by an AA, will find these problems quickly.
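The logic of the 55/AA check can be sketched in a few lines of C. This is only an illustration of the pattern, not the register-only routine the column calls for: compiled C normally keeps its loop counters and return addresses on the stack, which lives in the very RAM under test. The names `fill_and_verify` and `ram_data_test` are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Write `pattern` to every byte, then read the whole block back.
 * Returns the offset of the first byte that failed, or -1 if all held. */
static long fill_and_verify(volatile uint8_t *ram, size_t len, uint8_t pattern)
{
    size_t i;

    for (i = 0; i < len; i++)
        ram[i] = pattern;
    for (i = 0; i < len; i++)
        if (ram[i] != pattern)
            return (long)i;
    return -1;
}

/* 55/AA data test. 0x55 and 0xAA are complements, so between the two
 * passes every bit of every byte is asked to hold both a 0 and a 1.
 * Returns -1 on success, else the offset of the first bad byte. */
long ram_data_test(volatile uint8_t *ram, size_t len)
{
    long bad = fill_and_verify(ram, len, 0x55);

    if (bad < 0)
        bad = fill_and_verify(ram, len, 0xAA);
    return bad;
}
```

On a target, `ram` would be the base of the memory array and a non-negative return value gives the technician a failing offset to start from.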

Writing a pattern of 55s and AAs tests the device's ability to hold data, but doesn't ensure that RAMs are being addressed correctly. For example, a post-buffer address short, an open address line from an incorrectly inserted pin, or chip-select failures causing multiple addressing could pass this test. It's important to run a second routine to isolate these problems.

An addressing test works by writing a unique data value to each location, then reading the memory to see that the value is still stored. An easy-to-compute pattern is the location's low-order address byte. At location 100 store 00, at location 101 store 01, and so on. This pattern isn't unique, since an 8-bit location can only store 256 different values. If we repeat the test using the location's high-order address byte as a data pattern, the entire array (up to 64 kbytes) will be tested uniquely after two passes. The first test ensures that address lines A0 to A7 function correctly; the second checks lines A8 to A15 and the chip-select logic.
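The two-pass addressing test might look like the following in C. It is a sketch under assumptions: `base` is a hypothetical parameter giving the 16-bit bus address of the first byte, the helper names are invented here, and, as with any C version, a real kernel test would be recoded so it needs no stack.

```c
#include <stddef.h>
#include <stdint.h>

/* Data byte stored at bus address `addr`:
 * pass 0 uses the low address byte, pass 1 the high address byte. */
static uint8_t address_pattern(uint16_t addr, int pass)
{
    return (uint8_t)(pass == 0 ? (addr & 0xFF) : (addr >> 8));
}

/* Fill the whole array with its own address bytes, then read it back.
 * The fill must finish before the read-back starts, so that a write
 * through a shorted or open address line has a chance to clobber
 * some other location first.
 * Returns the offset of the first mismatch, or -1 if both passes agree. */
long ram_address_test(volatile uint8_t *ram, size_t len, uint16_t base)
{
    int pass;
    size_t i;

    for (pass = 0; pass < 2; pass++) {
        for (i = 0; i < len; i++)
            ram[i] = address_pattern((uint16_t)(base + i), pass);
        for (i = 0; i < len; i++)
            if (ram[i] != address_pattern((uint16_t)(base + i), pass))
                return (long)i;
    }
    return -1;
}
```

Locations 0100 and 0300 show why two passes are needed: both receive 00 in the first pass, so a fault that overlays them goes unnoticed until the second pass writes 01 to one and 03 to the other.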

Since this test also ensures that the RAM can store data, why use the 55 AA check? At location 0, the addressing diagnostic will write 0 (the low address) and later another 0 (the high address). While addressability will be confirmed, some doubt remains about the RAM's ability to store data. The 55 AA check tests every bit.

The peril of these diagnostics is that the address lines will cycle throughout memory as the test proceeds. If the refresh circuit has failed, the test will most likely keep DRAMs alive. This situation is the worst because the testing process camouflages failures. A simple solution is to add a delay of several seconds after writing a pattern and before doing the read-back. It is important to constrain the test code to a small area, so the CPU's instruction fetches don't create artificial refreshes.
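One way to picture the delayed read-back is the sketch below. Everything named here is an assumption made for illustration: `delay_fn` stands for whatever several-second wait the hardware offers, and `fake_dram`, `fake_wait`, and `fake_refresh_loss` exist only so the idea can be tried on a host machine. Real code would spin in a tight loop of a few bytes, as the text advises, so that instruction fetches do not refresh the part.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*delay_fn)(void);   /* hypothetical: waits several seconds */

/* Write a pattern, leave the memory untouched for the delay, then read
 * it back. With working refresh the data survives; with dead refresh
 * some cells should have decayed by the time we look.
 * Returns the offset of the first decayed byte, or -1 if none. */
long ram_retention_test(volatile uint8_t *ram, size_t len,
                        uint8_t pattern, delay_fn wait)
{
    size_t i;

    for (i = 0; i < len; i++)
        ram[i] = pattern;

    wait();                       /* no accesses to `ram` in here */

    for (i = 0; i < len; i++)
        if (ram[i] != pattern)
            return (long)i;
    return -1;
}

/* Host-side stand-ins, for illustration only. */
static uint8_t fake_dram[8];

static void fake_wait(void)
{
    /* healthy refresh: nothing changes during the delay */
}

static void fake_refresh_loss(void)
{
    fake_dram[5] ^= 0x04;         /* one cell forgets one bit */
}
```

Calling `ram_retention_test(fake_dram, sizeof fake_dram, 0x55, fake_refresh_loss)` reports offset 5, while the same call with `fake_wait` reports no failure.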

Sensitive DRAMs
In the bad old days, small manufacturing defects and alpha particles caused some DRAMs to exhibit pattern-sensitivity problems. Selected cells would not hold a particular byte if a nearby cell held another specific byte. Elaborate tests were devised to isolate these problems. The “walking ones” test burned an enormous amount of computer time and could find really complex pattern failures. Fortunately, this sort of problem just doesn't show up anymore.

The code for a routine that performs all of these tests is available from the Embedded Systems Programming data library on CompuServe (Library 12 of CLMFORUM). This routine is cumbersomely coded in 8088 assembly language, using no RAM at all. Simpler, prettier code using CALLs and RETs would not be dependable, since it would rely on the RAM we're testing.

While it is easy to see some justification for testing the product's RAM, ROM tests are perhaps not as obviously valuable. If the ROM is not working, how can it test itself? As always, a completely dead kernel, one that doesn't even boot, cannot run diagnostics. If the boot ROM at least partially works, the testing is valuable.

In the boot ROM, we can realistically expect to detect only a simple failure, such as a partially programmed device, although with some luck we might find a shorted or open high-order address line. If the line floats or is tied to the wrong level, the diagnostic code will not start.

ROMs are exceptionally tedious to program. Sometimes, technicians unknowingly remove the chip before the programmer is completely finished. If you include a ROM test, be sure to locate it early in the code so it stands a chance of executing even if the ROM is not entirely programmed. It's easier to make an argument for ROM testing in multiple ROM systems. If the boot ROM starts, its diagnostics can test the other ROMs.

While memories can fail in a number of ways, the most common failure occurs as the result of a misinserted pin. If you've spent time troubleshooting electronics, you'll know that it can be awfully hard to tell if all the pins are in the sockets. Other problems cover the usual range of broken circuit-board tracks (such as address, data, and control lines), misprogrammed devices, and nonfunctioning chip-select lines.

One way to test ROMs is to read the devices and compare each byte to a known value. Since this redundancy is impractical, most programmers compute an 8- or 16-bit sum of the ROM data and compare it to a known value. Usually this comparison is adequate, but a number of pathological cases will report incorrect results. For example, a long string of zeroes will always checksum to zero, regardless of the number of items summed.
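The zero-string case is easy to demonstrate. The function below is a minimal 16-bit additive checksum written for this illustration (the name `rom_checksum` is an assumption, not taken from any particular product):

```c
#include <stddef.h>
#include <stdint.h>

/* Plain 16-bit additive checksum: add every byte, let the sum wrap. */
uint16_t rom_checksum(const uint8_t *rom, size_t len)
{
    uint16_t sum = 0;

    while (len--)
        sum = (uint16_t)(sum + *rom++);
    return sum;
}
```

A block of 4 zero bytes and a block of 4,000 zero bytes both return 0, so a checksum alone cannot tell how much of a zero-filled region was actually read.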

Cyclic redundancy checks
A much better approach than a simple checksum is a cyclic redundancy check (CRC). The CRC is the remainder of a polynomial division: a running value is seeded, typically with FFFF, then the polynomial is divided into the input data (in this case the ROM data) one byte at a time, using the remainder at each step as the new seed. While mathematically complicated, the CRC is pretty easy to implement. Its great virtue is that each byte of repeated strings (zeroes, for instance) will yield a different CRC. The CRC is harder to code than a simple checksum, but the code in Listing 1 is a cookbook solution. Once again, it is written in 8088 assembly language to ensure that it uses the minimal number of as-yet-untested CPU resources.

Listing 1: A CRC program:

code segment public
        assume cs:code,ds:data
;
; This routine will compute a CRC of a block of
; data starting at the address in DS:BX with
; the length in CX. Don't try to exceed a
; 64kbyte segment (keep BX+CX <= FFFF).
;
; The CRC will be computed into register DX
; and compared to a value saved in ROM.
;
rom_test:
        mov dx,0ffffh           ; initialize CRC to -1
rom_loop:
        mov al,[bx]             ; get a character to CRC
        inc bx                  ; pt to next value
        xor al,dl               ; compute crc
        mov ah,al               ; save temp result
        shr al,1                ; shift right 4
        shr al,1
        shr al,1
        shr al,1
        xor al,ah               ; xor temp with partial product
        mov ah,al               ; new temp
        shl al,1                ; shift left 4
        shl al,1
        shl al,1
        shl al,1
        xor al,dh               ; combine with high crc
        mov dl,al               ; save low result
        mov al,ah
        shr al,1                ; shift right 3
        shr al,1
        shr al,1
        xor al,dl
        mov dl,al
        mov al,ah               ; get temp back
        shl al,1                ; shift left 5
        shl al,1
        shl al,1
        shl al,1
        shl al,1
        xor al,ah
        mov dh,al               ; high crc result
        dec cx                  ; dec data byte count
        jnz rom_loop            ; loop till all CRCed
        cmp dx,word ptr cs:crc  ; crc match?
        jnz rom_error           ; error if no match
        jmp rom_ok              ; jmp if ok
crc:    dw 0                    ; save crc here
rom_error:                      ; error location - flag an error
rom_ok:                         ; rom crc compare ok
code ends
data segment public
data ends
end
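For readers who find the shift-and-XOR sequence in Listing 1 hard to follow, here is my reading of the same loop in C, with `lo` and `hi` standing in for DL and DH. It is meant for computing reference values on a host machine, not as a replacement for the register-only assembly version. As far as I can tell, the result is the common 16-bit CRC built on the polynomial 1021 hex with an FFFF seed, with its two bytes exchanged; treat that identification as an observation rather than something the listing states.

```c
#include <stddef.h>
#include <stdint.h>

/* Step-for-step transcription of the Listing 1 loop.
 * Returns what the listing leaves in DX (DH in the high byte). */
uint16_t rom_crc(const uint8_t *rom, size_t len)
{
    uint8_t lo = 0xFF;                           /* DL, seeded with FF */
    uint8_t hi = 0xFF;                           /* DH, seeded with FF */

    while (len--) {
        uint8_t t = (uint8_t)(*rom++ ^ lo);      /* xor al,dl          */
        t ^= (uint8_t)(t >> 4);                  /* temp ^ (temp >> 4) */
        /* new DL = (temp << 4) ^ old DH ^ (temp >> 3) */
        lo = (uint8_t)((uint8_t)(t << 4) ^ hi ^ (uint8_t)(t >> 3));
        /* new DH = (temp << 5) ^ temp */
        hi = (uint8_t)((uint8_t)(t << 5) ^ t);
    }
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}
```

Unlike the additive checksum, runs of zeroes of different lengths produce different values here, which is the property the column relies on.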

A CRC or checksum test yields useful information, but is sometimes a nuisance to implement because the correct value must be computed and stored in ROM. No assembler or compiler has a built-in function to build this routine automatically, so you need to run the program under an emulator, record the CRC or checksum the routine computes, and manually patch the resulting value into an absolute ROM location. The only way to automate this patch is to write a short program that tests the linker-output file and patches the result into the ROM file.
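A patch utility along those lines could be built around something like the sketch below. The layout is an assumption made for the example: the CRC covers every byte of the image except the last two, and the result is stored in those last two bytes, low byte first, the way an 8088 `dw` would lay it out. Listing 1 keeps its `crc` word elsewhere, so adjust the offsets to suit, keeping the stored word outside the region being summed. Reading the linker output into memory and writing the ROM file back out are left to the caller, and the function names are invented here.

```c
#include <stddef.h>
#include <stdint.h>

/* Same arithmetic as the Listing 1 loop; returns the value of DX. */
static uint16_t rom_crc(const uint8_t *rom, size_t len)
{
    uint8_t lo = 0xFF, hi = 0xFF;

    while (len--) {
        uint8_t t = (uint8_t)(*rom++ ^ lo);
        t ^= (uint8_t)(t >> 4);
        lo = (uint8_t)((uint8_t)(t << 4) ^ hi ^ (uint8_t)(t >> 3));
        hi = (uint8_t)((uint8_t)(t << 5) ^ t);
    }
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}

/* Compute the CRC of image[0 .. len-3] and store it in the last two
 * bytes (assumed layout). Returns 0 on success, -1 if image too small. */
int patch_rom_crc(uint8_t *image, size_t len)
{
    uint16_t crc;

    if (len < 3)
        return -1;
    crc = rom_crc(image, len - 2);
    image[len - 2] = (uint8_t)(crc & 0xFF);      /* low byte first */
    image[len - 1] = (uint8_t)(crc >> 8);
    return 0;
}

/* What the ROM-resident test would do: recompute and compare.
 * Returns 1 if the stored word matches, 0 otherwise. */
int check_rom_crc(const uint8_t *image, size_t len)
{
    uint16_t stored;

    if (len < 3)
        return 0;
    stored = (uint16_t)(image[len - 2] | ((uint16_t)image[len - 1] << 8));
    return rom_crc(image, len - 2) == stored;
}
```

Run as a post-link step, this removes the emulator session and the hand patch from every build.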

It is best to address the diagnostics problem using engineering thought processes. Examine the system dispassionately and analytically, looking for all possible failure modes. Study the trade-offs. Investigate alternate approaches. Then, implement the best possible solution that does a reasonable job of solving the problem yet doesn't take too much programming time. No one said it would be easy.

Next month we'll look at another aspect of diagnostics--error reporting. We'll also look at writing external diagnostics to help the technician when the system won't boot.

At the time this column was published (June 1990), Jack Ganssle was the president of Softaid Inc. He has been developing embedded microprocessor products since 1973. 
