Using clock margining for system test boundary stability and early failure prediction

Like many of you, I distinctly remember that the PC clone platforms of the 1980s and early '90s included an important button labeled turbo mode. I loved to push the turbo button and watch the display numbers change. Many times the numbers made no sense, but what did it really matter as long as they changed when I pushed the button?

Pushing the turbo button made me feel better, thinking that somehow I was on the edge of computational performance and getting more than my money's worth out of a $2500 desktop system. I also knew that should I ever doubt system stability, I could always return to "normal" mode to ensure the system ran reliably.

Frankly, I never did operate in "normal" mode, and neither did anyone else. A quick walk around the office revealed that everyone had the turbo button set. Of course, the thrill of turbo-mode operation was a two-edged sword: it was continually blamed for system crashes, with the incessant fear of liquefying the CPU down to a blob of molten silicon should the fan ever fail.

The turbo mode of yesterday is often referred to in today's nomenclature as overclocking. Perhaps a new name will heal old wounds. The fundamental concept has not changed: the direction is always toward pushing the envelope of computational speed, trading stability (usability) against instability. When we think of overclocking, we naturally gravitate toward the PC experience.

Yet if we set aside what many consider a hindrance and look at the overclocking experience analytically, is it possible that it can become a tool capable of revealing system weaknesses? Is it possible that through a structured design of experiments, the weakest logic link might be forced to reveal itself?

Further on this point, can overclocking characterize failure such that a more robust system can be designed, one that meets catastrophic failure only upon crossing the unstable threshold? Through structured analysis, is it possible that overclocking will accurately expose the boundary between stability and instability in a system? Are there other hidden treasures in our analysis, such as early failure detection through aging?

If overclocking serves to push a system to the stability edge, then what can be said for the complement of overclocking: what about underclocking? We often think of overclocking as predominantly attacking setup time.

Underclocking, then, might attack its complement in our system, hold time. Implementing the concept of over- or underclocking requires that we have a baseline condition, referred to in this article as the system "nominal" response. The system designer establishes a nominal response where specifications are based on manufacturing stipulations, normally provided through component datasheets.

Overclocking and the total timing budget
While many system designers dismiss overclocking as nothing more than detrimental to system stability, there are hidden benefits that can provide information for determining the actual system Total Timing Budget (TTB) margin and for estimating product end-of-life in the field.

The concept of overclocking is growing into a new term, "clock margining," where guaranteed stability is maintained. This article also explores an implementation technique for clock margining that utilizes programmable clock sources to help converge on the true system stability boundary condition.

Closely tied to an overclocked stability condition, the total timing budget (TTB) describes what the unique system is capable of achieving, as by definition it establishes the system's boundary timing limit of sanity. TTB caters to the whole boundary limit condition and may include not just the effects of an overclocked condition but also those of the underclocked experience.

TTB parameters are largely discovered through empirical analysis and report what a datasheet does not: the actual margin beyond the min and max specifications. By our definition, datasheets set the nominal clocking condition for the system (through min and max specifications) and are inherent in the selection of the devices that make up the whole of the system.

When analyzing TTB in a system, a performance gap (typically in units of hertz), or delta frequency, exists between the nominal condition and the TTB result. System voltage and temperature play into TTB and must be factored in for consistent results.

Clock margining
Another term of interest is clock margining. I like to think of clock margining as a concept that encompasses something much more than an overclocked condition. Clock margining is the process that explores system stability fully around the TTB condition through the exercise of a full regression test. Clock margining gains a complete perspective on the edge of computational sanity, whereas overclocking typically exercises only small fragments of software.

Clock margining establishes a performance gap that can be measured a number of times over the lifetime of the product, thereby establishing gap trends. Such trends can then help estimate system end-of-life. Shortly after the system is manufactured, a reference clock-margin gap test is typically performed to establish a baseline. In practice this is considered to represent the greatest gap, with ever-diminishing gaps occurring over time due to aging.

The age-old question of why a system ages is an interesting one. We accept that imperfections exist, and silicon manufacturing is no exception. Packaging the silicon can play a big role in decreasing lifetime, namely through exposure of the silicon to the outside environment via loss of hermeticity.

From a silicon standpoint, the natural effects of hot-carrier injection and the minute effects of electromigration continue to play into the system, and heat accelerates the aging of silicon. Aging, as a sensitivity parameter, is reflected in the TTB numbers. As for overclocking versus underclocking, my experience shows that the overclocked condition is generally the more stressful and serves as the primary path for gap analysis.

With all this talk about clock margining, how is it achievable? First, understand that most modern systems are composed of a number of clocking sources. In many cases there is much interdependency between clocks, but in other cases independent clocking exists.

Nowadays, the generation of a clocking source is predominantly accomplished through phase-locked loop (PLL) technology. Attention to minimizing noise allows these newer generations of PLLs to rival prior fixed clocking sources in phase-noise performance, coupled with minimized jitter.

PLLs are often used as clock synthesizers with programmable divisors that allow different synthesized clock outputs and different "gear ratios" to exist between various interdependent clocks. A gear ratio is old terminology from the PC base clock but is applicable to any interdependent clocking arrangement.
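
As a first-order sketch (the integer-N topology and divider names here are illustrative assumptions, not any particular device's register map), the gear-ratio arithmetic reduces to a ratio of programmable divisors:

```c
#include <stdint.h>

/* Synthesized output of an assumed integer-N PLL:
 * f_out = f_ref * fb_div / (ref_div * post_div)
 * Divider names are illustrative, not a specific part's registers. */
static uint32_t pll_output_hz(uint32_t f_ref_hz, uint32_t fb_div,
                              uint32_t ref_div, uint32_t post_div)
{
    /* 64-bit intermediate avoids overflow for MHz-range inputs */
    uint64_t f = (uint64_t)f_ref_hz * fb_div;
    return (uint32_t)(f / ((uint64_t)ref_div * post_div));
}

/* Example: a 25 MHz reference with fb_div = 48, ref_div = 1 and
 * post_div = 4 yields a 300 MHz output, a 48:4 "gear ratio". */
```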

Clock margining, to be successful, must have some inherent ability to modify or change frequency. This can be more challenging in practice than originally anticipated, because the performance of the PLL must be fully understood.

Such understanding should encompass not only the achievable target frequency spans of operation, but also phase-noise and jitter performance under the various conditions of feedback programming. The time-domain jitter is important to understand so that consistency from frequency to frequency is maintained (no abrupt discontinuities); otherwise the system stability may be incorrectly analyzed.

Should jitter discontinuities exist, the world doesn't come to an end; rather, notching out specific output frequencies or gear ratios may be necessary. Moreover, you want to ensure that no "glitch" activity is present on the clock output(s) during a frequency change, unless the PLL is set during an interval in which the CPU is insensitive to any form of glitching or runt-pulse activity.

Achieving accurate TTB
The little secret to achieving an accurate TTB is to have in our arsenal a known range of well-behaved frequencies so that we can "gingerly" approach the TTB condition through incremental steps, where the step size may vary as we approach the perceived TTB target.

Large frequency steps consistently lead to smaller TTB gap results. TTB boundary detection requires that we eventually step over (break the system), restart, and back off until we are satisfied with a consistent threshold. There are many undocumented "tricks" to this; the key is consistency and repeatability in establishing the system TTB.
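
A minimal sketch of one such stepping strategy, assuming hypothetical helpers set_clock_hz() and run_regression() whose implementations are system specific; it idealizes the failure path, since in practice a hard crash is recovered through the watchdog reset and non-volatile storage discussed later:

```c
#include <stdint.h>

extern int set_clock_hz(uint32_t hz);   /* hypothetical: program the PLL    */
extern int run_regression(void);        /* hypothetical: 1 = pass, 0 = fail */

/* Walk the clock upward toward the TTB boundary, halving the step
 * each time a regression failure forces a retreat.  Assumes the
 * regression does eventually fail as frequency increases. */
uint32_t find_ttb_hz(uint32_t nominal_hz, uint32_t first_step_hz,
                     uint32_t min_step_hz)
{
    uint32_t last_good = nominal_hz;
    uint32_t step = first_step_hz;

    while (step >= min_step_hz) {
        set_clock_hz(last_good + step);
        if (run_regression())
            last_good += step;          /* still stable: advance       */
        else
            step /= 2;                  /* stepped over: refine step   */
        set_clock_hz(last_good);        /* always return to stability  */
    }
    return last_good;                   /* empirical TTB boundary      */
}
```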

As briefly mentioned, one of the golden-nugget discoveries of performing clock margining is that it can lead to field-based product lifetime estimates. Our analysis for the purposes of this article is simple: we compute the clock-margin gap as the difference between the nominal and the TTB of our product, and we maintain a record of this information for future use.
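
A minimal sketch of that record; the field choices are assumptions, as the article only stipulates that the nominal, the TTB, and the matching environmental conditions be retained:

```c
#include <stdint.h>

/* One clock-margin measurement, kept in non-volatile storage so that
 * later regressions can be compared under matched conditions. */
struct margin_record {
    uint32_t timestamp;     /* seconds since manufacture (assumed epoch) */
    uint32_t nominal_hz;    /* datasheet-derived nominal clock           */
    uint32_t ttb_hz;        /* empirically found stability boundary      */
    int32_t  gap_hz;        /* ttb_hz - nominal_hz                       */
    int16_t  temp_dc;       /* temperature in tenths of a degree C       */
    uint16_t vdd_mv;        /* supply voltage in millivolts              */
};
```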

Over the course of weeks, months, or years, the product (in the field) runs the same regression process to re-compute the gap. My criterion for product end-of-life is when the gap goes to zero or negative. This is not to say that the system fails, but simply that no margin remains, which therefore serves to flag an end-of-life event.

Whatever the business model of support, a gap of zero is generally understood to mean that the useful life is over, which can be a huge informational advantage for systems expected to perform 24/7. As illustrated in Figure 1 below, prediction of a zero gap is based on historical gap information. Estimation of end-of-product-life is performed through simple linear extrapolation or by means such as non-linear curve analysis.

Figure 1 System Lifetime Estimates
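
For the linear-extrapolation variant, one possible sketch fits a least-squares line to the recorded gap history (using the hypothetical margin_record from the earlier sketch) and solves for its zero crossing:

```c
#include <stddef.h>

/* Fit gap = slope * t + intercept through the history (an array of
 * the struct margin_record defined earlier) and return the
 * extrapolated time at which the gap reaches zero, i.e. the estimated
 * end-of-life.  Returns a negative value if the trend is flat or
 * improving, meaning no end-of-life is predicted. */
double estimate_eol_time(const struct margin_record *h, size_t n)
{
    double st = 0.0, sg = 0.0, stt = 0.0, stg = 0.0;

    for (size_t i = 0; i < n; i++) {
        double t = (double)h[i].timestamp;
        double g = (double)h[i].gap_hz;
        st += t; sg += g; stt += t * t; stg += t * g;
    }

    double denom = (double)n * stt - st * st;
    if (n < 2 || denom == 0.0)
        return -1.0;                      /* not enough spread in time */

    double slope = ((double)n * stg - st * sg) / denom;
    double intercept = (sg - slope * st) / (double)n;

    if (slope >= 0.0)
        return -1.0;                      /* gap is not shrinking      */
    return -intercept / slope;            /* time at which gap == 0    */
}
```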

Accuracy in clock margining is most consistent when external factors such as temperature and voltage are known, recorded, and matched during future regression testing. While Figure 1 above shows a positive clock-margin gap representative of an overclocked condition, the same principle can apply to an underclocked condition; in that case, however, the gap is not usually as wide and remains more constant. I assign the computational value of the underclocked margin gap as a positive value.

Clock margining is implemented through modification of the synthesized PLL clock source or sources. Figure 2 below outlines a simple first-order approach to the PLL process that encompasses the margining technique. One of the best ways to manage the system is through use of a watchdog timer, where successful completion of a regression test leads to the software resetting the timer, as opposed to a watchdog timeout due to system failure.

The iterative process continues to loop and run regression patterns, with the PLL frequency content stored and the process continued until failure. As mentioned, approaching the TTB limit is an exercise in understanding PLL parameters and system step-size sensitivity, with the notion that the step size generally decreases as the theoretical TTB boundary condition is approached.

Thus, the loop is expected to cycle a number of times, with the last known successful test parameters extracted once the system fails the regression test. The last known success is therefore considered the TTB boundary condition.

Figure 2 Clock margin process
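
A first-order sketch of the Figure 2 flow; the watchdog and non-volatile storage services, like the clock and regression helpers reused from the earlier sketches, are hypothetical placeholders for platform-specific code:

```c
#include <stdint.h>

extern int  set_clock_hz(uint32_t hz);        /* as in earlier sketch  */
extern int  run_regression(void);             /* 1 = pass, 0 = fail    */
extern void wdt_enable(uint32_t timeout_ms);  /* hypothetical watchdog */
extern void wdt_kick(void);                   /* restart the timeout   */
extern int  boot_was_wdt_reset(void);         /* reset-cause flag      */
extern void nv_store_hz(uint32_t hz);         /* persist across resets */
extern uint32_t nv_load_hz(void);             /* nominal on first run  */

void clock_margin_pass(uint32_t step_hz)
{
    uint32_t last_good = nv_load_hz();

    if (boot_was_wdt_reset()) {
        /* The previous trial broke the system: the stored frequency is
         * the last known success and hence the TTB boundary candidate. */
        set_clock_hz(last_good);
        return;
    }

    wdt_enable(5000);                         /* regression must finish */
    for (;;) {
        set_clock_hz(last_good + step_hz);
        if (!run_regression())
            break;                            /* soft failure detected  */
        wdt_kick();                           /* pass: reset the timer  */
        last_good += step_hz;
        nv_store_hz(last_good);               /* record last known good */
    }
    set_clock_hz(last_good);                  /* back off to last good  */
}
```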

Conclusions
The process of determining the clock-margin gap is beneficial to understanding full system exploitation through the TTB boundary. Clock margining can play an interesting role in ferreting out the system's weakest link for further analysis, and can serve to refine and create a matched system in which catastrophic failure occurs only once the TTB boundary is exceeded.

Careful conditioning of the system allows the TTB to play the role of establishing a clock-margin gap that can help estimate product end-of-life from the field. To make this all possible, the programmable PLL sits at the heart of clock margining; careful analysis and understanding of its performance parameters are critical to ensure that the TTB boundary condition is truly representative of frequency compression and not an artifact of excessive jitter.

As the discussion of the PLL programming process illustrates, the use of hardware timers and non-volatile storage elements allows easier management in determining the TTB boundary condition. A managed clock-margining approach can lead to effective, cost-saving field analysis for product end-of-life determination.

David Green is Advanced Technology Business Development Manager at Cypress Semiconductor.
