Mean time between failure made easy - Embedded.com

In addition to size, weight, and power constraints, reliability is a key requirement of many embedded systems. A common measure of a product's reliability is its mean time between failure (MTBF). Being able to accurately measure and report a product's MTBF has advantages: customers are generally willing to pay more for a product they can depend on, and they will buy again from companies that produce reliable products.

But measuring MTBF usually involves adding parts, power, and another point of failure to an already densely populated design. Described here is a software-based method to track time-in-service and MTBF, using resources that already exist in most embedded devices. The additional software doesn't affect the MTBF of your design.

Top-level design
For the purposes of this article, an MTBF task (or, more generically, the task) is defined as any message-based task in any modern embedded operating system. As described here, the task uses a periodic timer and two 64-kbyte, byte-addressed flash sectors to track 31 years of time-in-service (TiS) in one-minute increments, without affecting the flash memory's life. Figure 1 shows the top-level MTBF task design.

An interrupt service routine initiated by the periodic interrupt timer sends a count message to the task once per minute. In response to the message, the task changes one bit of flash, as pointed to by the minute-counter memory pointer (pMCM), from a 1 to a 0.

The task increments pMCM when it clears a full byte to 0x00. When the task increments the pointer past the end of the minute-counter memory (MCM), it performs a year-switch. In this way, the task counts an entire year's worth of minutes before a sector-erase is required.
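The counting step can be sketched in C on a RAM buffer that stands in for flash. This is only an illustration under stated assumptions: flash_program_byte() and count_minute() are hypothetical names, pMCM is modeled as a byte index, and bits are cleared most-significant first, which matches the 0x1F example value used later in the article for three counted minutes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a flash byte-program operation. Programming
 * flash can only change bits from 1 to 0, which AND-ing models here. */
static void flash_program_byte(uint8_t *addr, uint8_t value)
{
    *addr &= value;
}

/* Record one minute of time-in-service.
 * mcm     : start of the minute-counter memory (simulated in RAM)
 * mcm_len : MCM size in bytes (65,520 in the article's design)
 * idx     : plays the role of pMCM, kept as a byte index
 * Returns 1 when the index has moved past the end of the MCM, meaning a
 * year-switch is due; otherwise returns 0. */
static int count_minute(uint8_t *mcm, size_t mcm_len, size_t *idx)
{
    if (*idx >= mcm_len)
        return 1;                       /* nothing left to clear */

    /* Shifting right clears the most-significant remaining 1 bit:
     * 0xFF -> 0x7F -> 0x3F -> ... -> 0x01 -> 0x00. */
    uint8_t next = (uint8_t)(mcm[*idx] >> 1);
    flash_program_byte(&mcm[*idx], next);

    if (next == 0x00)
        (*idx)++;                       /* byte fully cleared: advance pMCM */

    return *idx >= mcm_len;
}
```

Each call touches exactly one byte and never needs an erase, which is what lets a full year pass between sector erases.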

The user interface task sends an update failure information message to the MTBF task when the operator reports a failure using the interface. In response to the message, the task performs a time-of-failure (ToF) switch.

To perform a year-switch or ToF-switch, the task copies the appropriate time and failure information to the inactive flash sector before making it the new active sector and erasing the old active sector. The detailed design specifies the copy-order used to prevent accidentally counting too much time or losing the information altogether.

Time equations and flash map
Figure 2 shows the rationale for using a 64-kbyte flash sector as the basis for this design. The first three equations show the top-level sector utilization. There are 524,160 minutes per year, so (524,160 / 8=) 65,520 MCM bytes are required to count one 364-day year's worth of minutes, with 16 bytes left over.
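The sector arithmetic above can be checked with a few constants. The sketch below is not taken from Figure 2; the constant and function names are hypothetical, and whole_days_that_fit() simply shows that 364 is the largest whole number of days a 64-kbyte sector can hold at one bit per minute.

```c
#include <stdint.h>

enum {
    SECTOR_BYTES     = 64 * 1024,             /* one 64-kbyte flash sector */
    MINUTES_PER_DAY  = 24 * 60,
    DAYS_PER_YEAR    = 364,                   /* 52 weeks of 7 days        */
    MINUTES_PER_YEAR = DAYS_PER_YEAR * MINUTES_PER_DAY,
    MCM_BYTES        = MINUTES_PER_YEAR / 8,  /* one bit per minute        */
    SPARE_BYTES      = SECTOR_BYTES - MCM_BYTES
};

/* Whole days of one-bit-per-minute counting that fit in `bytes` of flash. */
static uint32_t whole_days_that_fit(uint32_t bytes)
{
    return (bytes * 8u) / (uint32_t)MINUTES_PER_DAY;
}
```

A 365-day year would need 65,700 bytes, which is more than the 65,536 bytes in the sector, so a 364-day year is the natural fit.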

Figure 2 also details the equation to convert from minutes to hours in 0.1-hour units. When the user interface reports a failure, the task counts the number of 0 bits in the MCM to determine the minutes of operation in the current year. The task uses the conversion equation to calculate ToF_Hours. Two example conversions highlight the significance of (1) the “plus 3” rounding and (2) the maximum-hours-per-year value. The latter maximum-converted-hours value of 87,360 (0x15540) requires 17 bits to store in flash.
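A minimal sketch of the count-and-convert step follows. It assumes the conversion is (minutes + 3) / 6 in integer arithmetic, which is consistent with the “plus 3” rounding and the 87,360 maximum mentioned above (one 0.1-hour unit is six minutes). The function names are hypothetical, not the author's.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the 0 bits in the first `len` bytes of the MCM.
 * Each 0 bit represents one minute of operation. */
static uint32_t mcm_minutes(const uint8_t *mcm, size_t len)
{
    uint32_t minutes = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = mcm[i];
        unsigned zeros = 8;
        while (b) {                 /* subtract one for every 1 bit */
            zeros -= b & 1u;
            b >>= 1;
        }
        minutes += zeros;
    }
    return minutes;
}

/* Convert minutes to 0.1-hour units. Adding 3 (half of 6) before the
 * integer divide rounds to the nearest unit instead of truncating. */
static uint32_t minutes_to_tenths(uint32_t minutes)
{
    return (minutes + 3u) / 6u;
}
```

For a full 364-day year, minutes_to_tenths(524160) gives 87,360, the largest value ToF_Hours has to hold.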

Figure 3 shows the detailed flash memory map used by the MTBF task. The MCM consumes the majority of the flash sector. Of the remaining 16 bytes, eight are unused and the other eight store the:

Active sector indicator (ASI) – Used to determine which of the two flash sectors is ACTIVE. This 8-bit value contains either 0xFF to indicate an INACTIVE sector, 0x66 to indicate an ACTIVE sector, or 0x00 when the sector transitions from ACTIVE to INACTIVE.

Time-in-service years (TiS_Years) – Incremented when a 364-day year ends. This 5-bit value counts up to 31 years of TiS.

Number of failures (NoF) – Incremented when the user interface reports a failure. This 5-bit value counts up to 31 failures.

Time-of-failure years (ToF_Years) – Copied from TiS_Years when the user interface reports a failure. This 5-bit value matches the size of TiS_Years.

Time-of-failure hours (ToF_Hours) – Used to store the hours of operation in 0.1-hour units, converted from the 0 bits in the MCM. This value is stored in the least-significant 17 bits of a 32-bit value.
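One way to picture the sector is as a C structure. The field sizes below come from the descriptions above, but the field order, the position of the unused bytes, and all identifier names are assumptions, since the exact offsets are defined only in Figure 3.

```c
#include <stddef.h>
#include <stdint.h>

#define MCM_BYTES 65520u   /* one bit per minute for a 364-day year */

/* ASI values from the article. ASI_RETIRED is a hypothetical name for the
 * 0x00 value written when a sector goes from ACTIVE to INACTIVE. */
enum { ASI_INACTIVE = 0xFF, ASI_ACTIVE = 0x66, ASI_RETIRED = 0x00 };

/* Assumed layout of one 64-kbyte sector (order is illustrative only). */
typedef struct {
    uint8_t  mcm[MCM_BYTES];  /* minute-counter memory                  */
    uint8_t  unused[8];       /* spare bytes                            */
    uint32_t tof_hours;       /* 0.1-hour units, low 17 bits used       */
    uint8_t  tis_years;       /* low 5 bits used                        */
    uint8_t  nof;             /* low 5 bits used                        */
    uint8_t  tof_years;       /* low 5 bits used                        */
    uint8_t  asi;             /* active sector indicator                */
} mtbf_sector_t;

_Static_assert(sizeof(mtbf_sector_t) == 65536u,
               "layout must fill exactly one 64-kbyte sector");
```

Note that the three ASI values can be reached in order (0xFF, then 0x66, then 0x00) by clearing bits only, so an ASI change never requires an erase.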

Task initialization
Figure 4 shows the MTBF task-initialization process. The task begins by initializing the timer to count the first minute of TiS. Because an operator can remove power from a product at any time, the task counts the first minute after 30 seconds of operation. This effectively rounds up to the next minute of TiS instead of always truncating.

The task then determines which, if either, of the two flash sectors is active by looking for the active sector indicator. For now, the flow chart represents this process as a cloud; the logic that replaces it appears later in this article, after the sector-switch processes are described. The task sets the pActive and pInactive pointers to the active and inactive sectors, respectively.

After determining the active sector, the task erases the inactive sector. If the active sector does not have the ASI set to ACTIVE, the task initializes the TiS_Years, failure information, and ASI as shown in Figure 2.

The task sets pMCM to the beginning of the MCM and increments it to the first non-zero byte. If it increments past the end of MCM before finding a non-zero byte, it performs a year-switch.
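The pointer search at start-up amounts to skipping fully cleared bytes. A minimal sketch, with pMCM modeled as a byte index and find_pmcm() as a hypothetical name:

```c
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first MCM byte that is not 0x00.
 * A return value equal to mcm_len means every minute of the year has been
 * counted, so the caller should perform a year-switch. */
static size_t find_pmcm(const uint8_t *mcm, size_t mcm_len)
{
    size_t idx = 0;
    while (idx < mcm_len && mcm[idx] == 0x00)
        idx++;
    return idx;
}
```

Because the position is recovered from the flash contents themselves, no separate pointer has to be stored across power cycles.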

Switching years
As indicated in the top-level design, the MTBF task counts TiS until the minute pointer is incremented past the end of the MCM. When this occurs, it is time to increment TiS_Years (from 0 to 1 at the end of the first year). This requires a flash-erase process. To prevent data loss during the sector erase, the task copies the data to the inactive sector before erasing flash.

Careful ordering of the sector switch is required to avoid either counting too much time or losing the TiS altogether due to an ill-timed, asynchronous power-cycle. Figure 5 shows “before” and “during” memory images reflecting each flash sector during the year-switch process.

Figure 5a shows the initial flash contents and pointer locations before a year-switch begins. The pMCM points past the end of the active sector's MCM. Before writing the ASI in the new sector, the task:

• Increments TiS_Years and stores it in the inactive sector.

• Copies NoF, ToF_Years, and ToF_Hours to the inactive sector.

• Sets pMCM to the beginning of the inactive sector's MCM.

Figure 5b shows the flash contents and pointer locations immediately after setting the ASI on the new active sector to ACTIVE. While this condition exists only long enough to perform a single flash write, it represents a special case in determining the active sector at power-up.

Immediately after setting the ASI in the new active sector, the task completes the year-switch by performing the following actions:

• Write 0x00 to the old active sector's ASI.

• Swap the pActive and pInactive pointers.

• Erase the old active sector.
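The ordering of the year-switch steps can be sketched as follows. This is a simulation on RAM structures, not the author's code: sector_info_t holds only the eight bookkeeping bytes (the new sector's MCM simply stays erased), sector_erase() is a hypothetical stand-in for a flash sector erase, and plain stores stand in for flash programming because they land on an erased sector.

```c
#include <stddef.h>
#include <stdint.h>

enum { ASI_INACTIVE = 0xFF, ASI_ACTIVE = 0x66, ASI_RETIRED = 0x00 };

/* Bookkeeping bytes of one sector (field order is an assumption). */
typedef struct {
    uint32_t tof_hours;
    uint8_t  tis_years, nof, tof_years, asi;
} sector_info_t;

/* Hypothetical stand-in for a sector erase: erased flash reads all 1s. */
static void sector_erase(sector_info_t *s)
{
    s->tof_hours = 0xFFFFFFFFu;
    s->tis_years = s->nof = s->tof_years = s->asi = 0xFF;
}

static void year_switch(sector_info_t **active, sector_info_t **inactive,
                        size_t *pmcm_index)
{
    sector_info_t *old_s = *active;
    sector_info_t *new_s = *inactive;

    /* 1. Fill in the new sector while its ASI still reads INACTIVE.
     *    A power loss here leaves the old sector as the only ACTIVE one. */
    new_s->tis_years = (uint8_t)(old_s->tis_years + 1u);
    new_s->nof       = old_s->nof;
    new_s->tof_years = old_s->tof_years;
    new_s->tof_hours = old_s->tof_hours;
    *pmcm_index      = 0;               /* start of the new, erased MCM */

    /* 2. A single write makes the new sector ACTIVE. Until step 3
     *    completes, both sectors read ACTIVE (the Figure 5b case). */
    new_s->asi = ASI_ACTIVE;

    /* 3. Retire the old sector, swap roles, then erase it. */
    old_s->asi = ASI_RETIRED;
    *active    = new_s;
    *inactive  = old_s;
    sector_erase(old_s);
}
```

The key property is that the new sector's data is complete before its ASI is written, so an ill-timed power loss can never expose a half-filled sector as ACTIVE.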

Tracking failures
When the customer returns a unit with a failure, the first thing a technician will do after verifying the failure is command the unit to update the failure information. Upon receiving an update failure information message from the user interface task, the MTBF task begins the ToF-switch process, which differs slightly from the year-switch process. Figure 6 shows “before” and “during” memory images reflecting each flash sector during the ToF-switch process.

Figure 6a shows the initial flash contents and pointer locations before a ToF-switch begins. The pMCM points to the 3,025th byte of the active sector's MCM. This byte contains a value of 0x1F (0b00011111). This represents ((3,024 * 8) + 3=) 24,195 minutes, or 403.3 hours, of TiS. Before writing the ASI in the new sector, the task:

• Copies TiS_Years to the inactive sector's TiS_Years and ToF_Years fields.

• Increments NoF and stores it in the inactive sector.

• Converts the MCM 0 bits to ToF_Hours and stores it in the inactive sector.

• Copies the MCM 0 bits to the inactive sector.

• Sets pMCM to the 3,025th byte of the inactive sector's MCM.

Figure 6b shows the flash contents and pointer locations immediately after writing the ASI on the new active sector. This condition represents another special case in determining the active sector at power-up. Immediately after setting the ASI in the new active sector, the task completes the ToF-switch the same way as the year-switch.
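The ToF-switch differs from the year-switch mainly in what it copies. The sketch below simulates it on RAM structures under several assumptions: the sector layout and all names are hypothetical, pMCM is a byte index, bits are cleared most-significant first, and the hour conversion is (minutes + 3) / 6 in integer arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

#define MCM_BYTES 65520u
enum { ASI_INACTIVE = 0xFF, ASI_ACTIVE = 0x66, ASI_RETIRED = 0x00 };

typedef struct {                 /* assumed sector layout */
    uint8_t  mcm[MCM_BYTES];
    uint8_t  unused[8];
    uint32_t tof_hours;
    uint8_t  tis_years, nof, tof_years, asi;
} mtbf_sector_t;

/* Hypothetical stand-in for a sector erase: every byte reads 0xFF. */
static void sector_erase(mtbf_sector_t *s)
{
    uint8_t *p = (uint8_t *)s;
    for (size_t i = 0; i < sizeof *s; i++)
        p[i] = 0xFF;
}

/* Minutes counted this year: 8 per fully cleared byte before pmcm, plus
 * the leading 0 bits of the partly cleared byte that pmcm points to. */
static uint32_t minutes_so_far(const mtbf_sector_t *s, size_t pmcm)
{
    uint32_t minutes = (uint32_t)pmcm * 8u;
    if (pmcm < MCM_BYTES) {
        uint8_t b = s->mcm[pmcm];
        for (int bit = 7; bit >= 0 && !((b >> bit) & 1u); bit--)
            minutes++;
    }
    return minutes;
}

static void tof_switch(mtbf_sector_t **active, mtbf_sector_t **inactive,
                       size_t pmcm)
{
    mtbf_sector_t *old_s = *active;
    mtbf_sector_t *new_s = *inactive;

    /* Fill in the new sector before touching its ASI. */
    new_s->tis_years = old_s->tis_years;
    new_s->tof_years = old_s->tis_years;
    new_s->nof       = (uint8_t)(old_s->nof + 1u);
    new_s->tof_hours = (minutes_so_far(old_s, pmcm) + 3u) / 6u;
    for (size_t i = 0; i <= pmcm && i < MCM_BYTES; i++)
        new_s->mcm[i] = old_s->mcm[i];   /* carry the counted minutes over */

    /* Same finish as the year-switch; pmcm keeps its offset. */
    new_s->asi = ASI_ACTIVE;
    old_s->asi = ASI_RETIRED;
    *active    = new_s;
    *inactive  = old_s;
    sector_erase(old_s);
}
```

With the Figure 6a values (3,024 cleared bytes followed by 0x1F), the stored ToF_Hours comes out as 4,033, that is, 403.3 hours.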

Active-sector determination
The special cases identified in the year-switch and ToF-switch processes drive the logic that replaces the cloud in Figure 4. Most of the time, there will be only one active sector. However, if a product is power-cycled at exactly the wrong time during a year-switch or ToF-switch process, the MTBF task sees two active sectors at the next power-up and must determine which is correct. Figure 7 shows the logic that determines the active sector.

The task first checks whether both sectors are ACTIVE. If they are not, it simply checks whether Sector 1 is ACTIVE. If it is, the determination is clear. If not, the task assumes that Sector 2 is ACTIVE. The initialization sequence shown in Figure 4 checks this assumption and initializes the sector if it's not ACTIVE.

If the task finds both sectors ACTIVE, it must use TiS_Years and NoF to determine which is the proper active sector. As shown in Figures 5b and 6b, either TiS_Years or NoF will be “one higher” in the new active sector. The task sets pActive to the new active sector and writes 0x00 to the other sector's ASI. Figure 4's initialization sequence later erases that inactive sector.
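The decision just described can be written as a small function. This is a sketch of the described logic, not a transcription of Figure 7; sector_state_t and pick_active_sector() are hypothetical names, and the comparison assumes the counters have not wrapped.

```c
#include <stdint.h>

enum { ASI_INACTIVE = 0xFF, ASI_ACTIVE = 0x66, ASI_RETIRED = 0x00 };

/* The three fields the power-up decision needs from each sector. */
typedef struct {
    uint8_t asi;
    uint8_t tis_years;
    uint8_t nof;
} sector_state_t;

/* Returns 1 or 2: the sector to treat as active at power-up. */
static int pick_active_sector(const sector_state_t *s1,
                              const sector_state_t *s2)
{
    int s1_active = (s1->asi == ASI_ACTIVE);
    int s2_active = (s2->asi == ASI_ACTIVE);

    if (s1_active && s2_active) {
        /* Interrupted switch: the newer sector is exactly one year or
         * one failure ahead of the older one. */
        if (s2->tis_years > s1->tis_years || s2->nof > s1->nof)
            return 2;
        return 1;
    }

    /* Normal case. If Sector 1 is not ACTIVE, Sector 2 is assumed;
     * initialization later verifies that assumption. */
    return s1_active ? 1 : 2;
}
```

When both sectors were found ACTIVE, the caller would then write 0x00 to the losing sector's ASI, as the text describes.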

This logic, combined with the careful sector-switch processes described above, ensures no double-counted years or lost TiS data.

MTBF calculation and user interface
The user interface depends on the product. A menu selection on an LCD, VME or cPCI registers, a TCP/IP socket, a web server, and ASCII messages on a serial bus are all possible interfaces to products hosting the MTBF task. Regardless of the physical interface, the product must allow the operator and repair technician to:

• Query TiS.

• Update the failure information.

• Query the failure information.

• Query the MTBF.

For simplicity, the user interface reports time to the user in easily calculated, constant-unit time measurements. As an example, it reports 364 days as one year (52 weeks × 7 days = 364 days = 1 year).

The following shows sample interface responses presented on the user interface. The user interface reports time values in either raw hours, rounded to the nearest 0.1 hour, or user-comfortable units broken down into years, weeks, days, hours, and minutes. The rationale for using weeks instead of months for user-comfortable units is that every week is seven days long, but the length of a month varies. Example responses are:

TiS=01y05w01d06h25m

NoF=3 Last Failure@ 21,688.5h

MTBF=7229.5 hours.

The task maintains TiS with the TiS_Years and MCM zero-bits. It stores the failure information as ToF_Years and ToF_Hours. The task only converts these values to raw hours or user-comfortable units when requested by the operator. Figure 8 details the conversion process from years and hours to both raw hours and user-comfortable units.
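A sketch of both conversions is shown below. It is not a transcription of Figure 8: the type and function names are hypothetical, and the code relies only on the constant units stated in the article (a 364-day year, seven-day weeks, and 0.1-hour resolution).

```c
#include <stdint.h>

#define MIN_PER_HOUR    60u
#define MIN_PER_DAY     (24u * MIN_PER_HOUR)
#define MIN_PER_WEEK    (7u * MIN_PER_DAY)
#define TENTHS_PER_YEAR 87360u   /* 364 days x 24 h, in 0.1-hour units */

typedef struct {
    unsigned years, weeks, days, hours, minutes;
} tis_units_t;

/* Break TiS_Years plus the minutes counted this year into
 * years, weeks, days, hours, and minutes. */
static tis_units_t to_user_units(unsigned tis_years, uint32_t minutes)
{
    tis_units_t u;
    u.years   = tis_years;
    u.weeks   = minutes / MIN_PER_WEEK;
    minutes  %= MIN_PER_WEEK;
    u.days    = minutes / MIN_PER_DAY;
    minutes  %= MIN_PER_DAY;
    u.hours   = minutes / MIN_PER_HOUR;
    u.minutes = minutes % MIN_PER_HOUR;
    return u;
}

/* Combine years and 0.1-hour units into raw 0.1-hour units. */
static uint32_t to_raw_tenths(unsigned years, uint32_t tenths_this_year)
{
    return (uint32_t)years * TENTHS_PER_YEAR + tenths_this_year;
}
```

For example, one year plus 52,225 counted minutes breaks down to the sample response TiS=01y05w01d06h25m, and two years plus 4,216.5 hours gives the 21,688.5 raw hours shown in the sample failure report.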

Figure 9 shows the MTBF calculation: convert the time of the last failure, as stored in ToF_Years and ToF_Hours, to raw hours and then divide by the number of failures.
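In code, the calculation is a multiply, an add, and a divide. The sketch below works in 0.1-hour units to stay in integer arithmetic; the function name, the truncating division, and the choice to return 0 when no failure has been recorded are assumptions rather than details taken from Figure 9.

```c
#include <stdint.h>

#define TENTHS_PER_YEAR 87360u   /* 364 days x 24 h, in 0.1-hour units */

/* MTBF in 0.1-hour units: time of the last failure divided by the
 * number of failures. Returns 0 when nof is 0 (assumed convention). */
static uint32_t mtbf_tenths(uint8_t tof_years, uint32_t tof_hours_tenths,
                            uint8_t nof)
{
    if (nof == 0)
        return 0;                /* avoid dividing by zero */

    uint32_t raw = (uint32_t)tof_years * TENTHS_PER_YEAR + tof_hours_tenths;
    return raw / nof;
}
```

With the sample values above (last failure at 21,688.5 hours and three failures), mtbf_tenths(2, 42165, 3) yields 72,295, which the interface would print as 7229.5 hours.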

Design scalability
This design easily scales up or down to accommodate different data-bus widths and flash sector sizes. The design scales for data-bus width by changing pMCM to point to the appropriate data size and by adjusting the flash accesses associated with the year-switch and ToF-switch processes accordingly.

The design scales up to use larger flash sectors by lengthening the time between sector swaps. It scales down by limiting the timing resolution to three or six minutes, or by reducing the TiS_Years counted. Scaling the design down to smaller flash sectors requires packing the ASI, TiS_Years, and failure-information bytes to remove the unused bits.
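To see how the resolution trades off against sector size, the MCM size can be computed for a given tick length. This is only illustrative arithmetic; mcm_bytes_for() is a hypothetical helper, and the 32-kbyte and 16-kbyte sector sizes mentioned afterward are examples, not sizes named in the article.

```c
#include <stdint.h>

#define MINUTES_PER_YEAR 524160u   /* 364-day year */

/* MCM bytes needed to count one 364-day year when each bit represents
 * `resolution_min` minutes. resolution_min must be nonzero. */
static uint32_t mcm_bytes_for(uint32_t resolution_min)
{
    uint32_t ticks = MINUTES_PER_YEAR / resolution_min;
    return (ticks + 7u) / 8u;      /* round up to whole bytes */
}
```

At one-minute resolution this gives the 65,520 bytes used in the article. At three minutes it gives 21,840 bytes, which would fit a 32-kbyte sector, and at six minutes 10,920 bytes, which would fit a 16-kbyte sector, each with room left for the bookkeeping bytes.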

The unused bytes and bits also allow for design expansion. As shown in Figure 3, there are 8 bytes and (15 + 3 + 3 + 3 =) 24 bits not used. With proper data packing, possible uses of this data include:

• Storing two additional sets of ToF_Years and ToF_Hours for a failure history.

• Storing minimum/maximum values of another important product specification, such as temperature, operating voltage, or input/output voltage levels.

• Storing the maximum continuous time of operation without a power-cycle.

MTBF and beyond
This article presents just one implementation and use of this design. It shows how to track a product's TiS and MTBF using existing resources without affecting the MTBF of that product.

Aside from providing MTBF information, the TiS counter provides other valuable information to a business. Some potential uses include:

Production-metric tracking – Tracking the time taken to assemble, align, and test an individual product. This metric can be as simple as querying TiS right before the product is delivered to get a single production-time metric, or as detailed as querying TiS at every step in the production process.

Repair-metric tracking – Tracking the time taken to fix a failed product and return it to the customer. Again, this can be as simple or complex as desired.

Fielded-product TiS reporting – Reporting the TiS over a network to a central collection site, managed either by the manufacturer or the customer. This provides insight into when fielded equipment needs calibration or service.

Using the design for production or repair metrics requires significant discipline on the part of production and repair facilities. A technician can easily inflate the production- or repair-time metric by leaving a product on overnight. Additionally, accurate MTBF calculation requires a ToF information update at the time of every repair. Missing just one update invalidates the MTBF data. Even if an organization never uses this design for anything more than a TiS counter, it still provides valuable and interesting insight into just how much use a fielded product receives.

Dan Swiger is an embedded systems software engineer at DRS Signal Solutions, Inc., where he works on technology development for the signal intelligence community. He has a BSEE from West Virginia University Institute of Technology. You may reach him at .
