
Time to re-evaluate Windows CE?

Jack Crenshaw, January 25, 2005

Either Windows CE has a serious bug or some programmers just never learn. Old-timer Jack Crenshaw saw this coming a long time ago.

On simplicity:

  • There's an infinite number of ways to make a simple problem seem difficult; only a handful to make a difficult problem seem simple—Crenshaw's Law #1
  • KISS (Keep it Simple, Sydney)—Crenshaw's Law #2
  • Simplify, and add lightness—Race car designer George Miller (oft attributed to others)
Why do I feel it necessary to remind you of these seemingly obvious platitudes? Because people stubbornly insist on breaking them. The story I'm going to tell you this month is a case in point. But before we go there, I want to share yet another quote. Unfortunately, I've lost the quote and the author, so I'll have to just paraphrase for now:
  • I've seen good, small programs and I've seen good, large programs. But I've never seen a good, large program that didn't begin life as a good, small one.
Words to live by.

Good ideas, bad ideas
It probably won't surprise most of you to learn that I've been writing software for a long time now (I wrote my first program in 1954 for an IBM 650). I've seen programming standards and ideas come and go, and I've seen basic principles such as information hiding, structured programming, and modular programming come to stay. Mostly, I like the ones that stayed, because they all tend to agree with my way of doing things anyway.

As standards change and evolve, what depresses me most is that I find myself fighting battles I thought had been fought and won decades ago. Back in the '60s, I tended to write programs with a lot of general-purpose subroutines. I had, for example, packages to do vector and matrix operations, so an expression like:

y = Ax + Bu
(Equation 1)

was a simple matter of four function calls. At the time, I took a lot of flack for this coding style; the systems guys beat on me for wasting computer time with all those calls. My response, of course, was "Better its time than mine."

Eventually, the notion of reusable software became much more accepted, and the efficiency nuts had to admit that the overhead of a function call was not a burden too great to bear.

Yet, as recently as last week, noticing that someone was writing vector expressions in scalar form, I asked him why he didn't just call a subroutine. His answer? "Too inefficient."

Sigh.

Back in my Fortran days, circa 1963, I used to see—and want to hurl over—Fortran programs with seven pages of COMMON statements before every function, even functions that were stubs. In one extreme case, the chief programmer had carefully gone through a working program and pulled every single variable—even loop counters like I, J, and K—into COMMON blocks. When I asked him why he would do such a thing, he replied proudly, "You never know when you might want to print something out."

Sigh.

As recently as 2000, I came across a Fortran program whose designer had what he regarded as a superior idea: Put every variable in a single common array called X. In his program, X(1) might be time, X(2) might be pi, and so forth. Every operation on every variable in the program was really an operation on an element of X. When I asked him why he would do such a thing, he replied proudly, "You never know when you might want to print something out."

Sigh.

This view always reminds me of the Dr. Seuss book Too Many Daves. A woman had 23 sons and named them all Dave. Through the rest of the book, Dr. Seuss gives us imaginative names that she could have used. "She could have named one Buffalo Bill, and one Biffalo Buff." Then he closes with:

"But she didn't. And now it's too late."

It's too late for that program, too, which now stands proudly atop the scrap heap of bad ideas that seemed good at the time.

In the "good old days" we fought battles over such things as how deeply one should nest functions, when and how to pass parameters, and how much modularity was a good choice. Eventually, the good guys (always including myself, of course) prevailed and were allowed to write software in the way we thought it should have been done all along.

Each time we "won" such a debate, I breathed a little sigh of relief and thought to myself, "Well, I'm glad that's settled."

Little did I know that I was going to end up fighting those same battles, over and over, every 10 years or so.

Why is this? Why do these bad ideas keep rising up, like the phoenix from the ashes?

The reason: We're getting whole generations of new programmers who never heard about the old battles, and who have ideas they believe are wonderful ones. At this point, I've resigned myself to the notion that I'm always going to be spending a significant percentage of my career—say 30 to 40%—trying to convince new hires that their hot new idea came and went back in '72.

If you're one of the embedded programmers younger than I (and who isn't?), I have only one bit of advice: Learn from the past, lest you end up repeating it.

Ticking away
Now we get to the meat of this month's column. Recently, Editor in Chief Jim Turley received an e-mail from Richard M. Smith, Computer Bytes Man (www.computerbytesman.com). Richard was concerned about a failure in software that uses the Microsoft Windows function GetTickCount(), part of the standard Win32 API, including Windows CE. He claims that this function can cause a crash, and he gives references that seem to support it. Links to his references are in the sidebar.

GetTickCount( ) May Get You in Hot Water

On his website www.computerbytesman.com, Richard Smith has the following links relating to problems the Windows call GetTickCount( ) may be causing.

Article "Lost Radio Contact Leaves Pilots On Their Own,"
Linda Geppert, IEEE Spectrum Online.

Microsoft support documents:

  • The Rpcss.exe process consumes 60 percent of CPU time and performance is affected
  • Computer Hangs After 49.7 Days
  • Print Spooler Stops Scheduling Print Jobs
  • SNMP SysUpTime Counter Resets After 49.7 Days

Boiled down to its essence, Microsoft has a clock counter that is bumped every millisecond. It's a 32-bit unsigned integer. A little math will show that this integer can stand to be incremented 4,294,967,296 times before it rolls back around to zero. That's equivalent to:

4,294,967.296 s = 71,582.78827 min = 1,193.046471 hr = 49.7103 days
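
If you'd rather let the machine do the arithmetic, a minimal C sketch (nothing Windows-specific about it) reproduces those figures:

#include <stdio.h>

int main(void)
{
    /* A 32-bit millisecond counter wraps after 2^32 counts. */
    double s = 4294967296.0 / 1000.0;
    printf("%.3f s = %.5f min = %.6f hr = %.4f days\n",
           s, s / 60.0, s / 3600.0, s / 86400.0);
    return 0;
}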

Now, incrementing a counter doesn't sound at first to be anything worth worrying about. At first blush, I was ready to write Richard off as someone who had a fundamental misunderstanding about how computer integers work. But if you check Richard's first URL, you'll find that this fact caused a near-disaster. Airliners in the Palmdale, California area were flying around each other with no communications with the ground controllers, thanks to a voice switching and control system that shut down—or, rather, failed to do so—because a counter wrapped around through zero. Scary, and potentially fatal, stuff. Something else besides counting has to be going on.

Before we go any further, let's just consider the notion of counting with integers and the way they're supposed to be used.

As all of you undoubtedly know, any integer variable has a limited range. If it is an unsigned integer n bits long, it has a modulus of 2^n and can hold integer values from 0 through 2^n - 1. If it's signed, it can hold values from -2^(n-1) through 2^(n-1) - 1. To give a specific example, an 8-bit integer can hold unsigned values from 0 through 255, or signed ones from -128 through +127.

Now, consider what happens when an integer "overflows." I'll use hex to show what happens:

... fc, fd, fe, ff, 00, 01, ...
There's no "overflow." The counter just blithely wraps around, just like the odometer on your car. The proper model for an integer variable is not a yardstick, which has a beginning and an end, as in Figure 1, but a clock, which has neither, as in Figure 2. The circular model of an integer "number line" is entirely appropriate. Suppose we start at noon (00 o'clock) and move clockwise around the circle. The hand passes 01, 02, 03, and so on, and finally arrives at the 3 o'clock position, denoted by 40. Continuing, we pass 6 o'clock (80) and 9 o'clock (c0), then approach noon again with fd, fe, ff, 00. Nothing remarkable happens at this transition.


Figure 1: Number line; the wrong model

Oh, sure, if you look at the CPU's internal status flags as the clock counter is incremented, you'd see the overflow flag set as the counter rolls through zero. If we were writing our software in assembly language, and we bothered to look at the status flags at all, we could detect the condition. But typically we don't, and we expect that counter to go right through zero with nary a bobble.
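
You can watch this odometer behavior from portable C, no status flags required. Here's a minimal sketch, using an 8-bit counter so the wrap comes quickly:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t count = 0xfc;                   /* start just below the wrap */
    for (int i = 0; i < 6; i++) {
        printf("%02x\n", (unsigned)count);  /* prints fc fd fe ff 00 01 */
        count++;                            /* rolls from ff to 00, no fuss */
    }
    return 0;
}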

Measuring intervals
We've been using real-time counters like this in embedded systems since the beginning of time. They're in the Mariner spacecraft, which was last reset in 1976.

So what's the problem? What could be going on that could cause a crash of the kind Richard Smith reports? To find out, we need to look at ways the counter might be used.


Figure 2: A clock that repeats indefinitely

Typically, we use the counter to measure the time between events. When some event occurs, we store away the time. That is, we store the current value of the clock counter. When another event occurs—perhaps a recurrence of the same one—we compare the two times. In Figure 2, I've shown a number of event pairs, each one occurring at a different place on our circular clock. In each case, the red event is the more recent one, the blue the earlier one. The time between events is:

Δt = t_red - t_blue
(Equation 2)

I'm going to assert, right here, that this Δt will be correct regardless of which part of the clock we happen to be on, even if the pair happens to straddle a key point like noon or 6 o'clock. I've given some examples in Figure 3. In this figure, I'll show each number in its hex, signed decimal, and unsigned decimal forms. As you can see, the interval comes out correctly in all cases.


Figure 3: Doing the arithmetic

It certainly seems that we can always get the right time interval between two events, whether the clock wraps around or not. So as far as interval timing goes, the "overflow" is a non-problem.
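
A minimal C sketch of that arithmetic (the timestamps are made up, chosen to straddle the wrap through zero) shows the subtraction coming out right:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t blue = 0xf0;               /* earlier event, just before the wrap */
    uint8_t red  = 0x10;               /* later event, just after the wrap */
    uint8_t dt   = (uint8_t)(red - blue);
    printf("dt = %02x hex = %u ticks\n", (unsigned)dt, (unsigned)dt);
    return 0;                          /* prints: dt = 20 hex = 32 ticks */
}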

Sorting events
Although the computation of intervals seems to be perfectly okay using the 32-bit clock counter, we must examine another application. In all the cases I've looked at so far, and all the cases of red and blue arrows in Figure 2, we assumed that we already knew which event came first. So the order of subtraction can never be an issue.

Suppose, on the other hand, that we simply time-tag events as they occur, and we later want to sort them into chronological order. This means comparing each time to some other time and determining if the result is positive or negative.

The problem in this case is that a clock, after all, is just a clock. If you show me photos of two clocks, one of which shows 11:30, and another 9:20, I can't say for sure which photo was taken first. I don't have enough data. Does the 11:30 mean AM or PM? How about the 9:20? Were the photos even taken on the same day?

The fact is, if your only tool for comparing things is a clock, you can't possibly answer the question. A clock should only be used if the time interval is small compared to the cycle time (12 hrs, for most clocks). For other cases, you need a clock and calendar.

Take another look at Figure 2, and imagine two clock hands, each moving at the same speed, and diametrically opposed to each other. Which hand is in front?

Clearly, we can't tell. Either answer could be correct.

So what does this tell us? I could go through some more math as I did in Figure 3, but I don't believe it's necessary. The answer is simple enough: You can only compare events for time order if the time between them is less than half the modulus; in our case, 24.85 days. For intervals larger than that, you're using the wrong tool. You should have used a date/time stamp instead of just a time stamp, and you should have used the clock/calendar chip built into almost all computers these days, instead of a real-time clock counter. Instead of a time stamp from a counter, you need something like a Julian date.
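
When sorting is legitimate, the safe test is the sign of the difference, not a direct comparison of the raw counts. Here's a minimal sketch, again with 8-bit timestamps, where half the modulus is 128 ticks:

#include <stdio.h>
#include <stdint.h>

/* True if a precedes b; valid only when the true separation
   between the events is less than half the modulus (128 here). */
static int earlier(uint8_t a, uint8_t b)
{
    return (int8_t)(a - b) < 0;
}

int main(void)
{
    printf("%d\n", earlier(0xf0, 0x10));  /* 1: f0 precedes 10 across the wrap */
    printf("%d\n", earlier(0x10, 0xf0));  /* 0: and not the other way around */
    return 0;
}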

So who's at fault?
So far, all we've learned is that the notion of overflow in an integer that's being incremented is a non-problem. This is true whether the integer is treated as signed or unsigned. Though there's a limit to the time interval we can reasonably expect to resolve, that limit is large, and in most cases we should never have a problem. The ambiguity in a computed interval is liable to be something like: either 27 minutes, or negative 49 days, 16 hours, 35 minutes, and 47.296 seconds. Usually, that choice is not hard to make.

So if there seems to be no problem mathematically, where is the problem? I'd like to say that there isn't one, but the Palmdale computers did crash. The Microsoft systems do seem to start eating CPU time. Something is going on; we simply don't know what. Yet.

I'm going to stay on top of this story and will let you know what I discover. Much as I love to bash Microsoft, I'm not yet prepared to place all the blame on their doorstep. But I can tell you this: Somewhere in there, someone didn't follow the KISS principle; someone took a ridiculously easy problem—one we had solved 40 years ago—and managed to make it hard. I don't yet know who the culprit is, but I see only a few possibilities. There's someone out there who either:

  • doesn't understand that integers that are always incremented will eventually roll back to zero, or
  • understands that but thinks it's a problem that requires extra code to deal with, or
  • doesn't understand the difference between a clock and a calendar.
Each possibility is worth yet one more deep sigh.

Jack Crenshaw is a senior software engineer at Spectrum-Astro and the author of Math Toolkit for Real-Time Programming, from CMP Books. He holds a PhD in physics from Auburn University. E-mail him at jcrens@earthlink.net.

Reader Response


Been there, done that. Where a tick counter can bite you is in the way you code a determination of whether a minimum interval has expired. Ignoring the roll-over, if you want to wait until 10 ticks from now, you could do this:

DWORD wait = GetTickCount() + 10;
while (GetTickCount() < wait) {
    do_stuff();
}

Sounds reasonable and intuitive. Except the roll-over can cause two problems. Problem 1 is when you get the starting count just before it rolls over, and your wait value is less than "now." The while loop never happens because the wait appears to have already expired, which is probably bad (else why did you code the delay in the first place?). Problem 2 is when you get the starting count 11 ticks before rollover, and you are too busy to actually call GetTickCount on that single tick when the tick count is at its maximum. Sorry, you missed it, but you'll have another chance to end the loop in 49.7 days.

I once coded delay loops in the intuitive and obvious way above... sometimes deciding I could live with the dangers, other times building in additional code to watch for the boundary conditions. I continued doing this for some years before I first saw how to do the delay safely:

DWORD start = GetTickCount();
while ((GetTickCount() - start) < 10) {
    do_stuff();
}

The first time I saw this in someone else's code I wondered why they took such a goofy approach. But then I realized it never fails! The subtraction is done first, and as the article pointed out, computing the duration is never a problem. Now that we're testing for a duration to have passed rather than an alarm point being reached, the roll-over problem is gone. This has nothing to do with Microsoft or anyone else's tick counter; it's just using it in a way that doesn't have scary boundary conditions... and if you don't think it through carefully, you can misuse the counter. I know I did.

Mark Bereit

The author replies:
Thanks very much for your comments, Mark. I see that the "problem" is only a problem if one tries to compare absolute clock values. As long as we stick to intervals, we're OK.

Jack


Don't quote me, but I believe that the Windows timer bug in the ATC center did not directly cause air-to-ground (AG) comm to be lost. The ATC equipment monitoring/control system lost status polling because of the timer glitch and turned all equipment status on the monitor screen red. AG comm would still have worked without the monitoring and control subsystem. But the system was immediately cut over to a backup system that was not configured properly, and that resulted in the loss of AG comm. Had the switchover to the backup not happened without first checking whether the primary system had actually lost comm, or had the backup system been configured properly, comm would not have been lost.

John Matchett
Harris
Staff Engineer

The author replies:
It's true, and I think I said that, or at least tried to. The problem occurred because a function that was supposed to reset the computer every 30 days failed. And, as you suggest, when the backup computer took over, it also failed. The counter problem did not cause the error, but people worrying about how to avoid it did.

Jack


The real Y2K problem will happen in the year 2038. It's the same overflow problem, and it crashes applications and the system kernel.

Keith
System General Corp.
Assistant Manager

The author replies:
Keith, thanks for your thoughts, but I don't think the problems are the same. The Y2K problem occurred because people used the tried-and-true method of storing only the last two digits of the year, instead of all four.

I suppose you could make the case that, if the integer counter were long enough, there wouldn't be a problem. That's certainly true. A 64-bit integer would not roll over for 584 million years, which is probably long enough. Still, the problem should not even be an issue if the code is properly written. I'd hate to see someone use the brute-force method of a longer word length, just to cover up bad programming.

Jack


Excellent article, as ever. Thought you might be interested to know that flak is an acronym of FliegerAbwehrKanone, or aircraft defense cannon.

Best wishes!
Nick Hewish


I don't see how this differs from the infamous Y2K "bug" other than the frequency of rollover. The more things change, the more they stay the same.

David Kelly


I've spent too much time wondering if my code (and other people's) handled overflows correctly. In C++, we can save the unwary from cutting themselves, and ourselves from squinting at code, by using a lightweight Interval class like this:

class Interval {
private:
    DWORD initial_;
public:
    inline Interval()
        : initial_(GetTickCount()) {
    }
    inline DWORD value() const {
        return GetTickCount() - initial_;
    }
};

// Usage:
Interval i;
while (i.value() < 10) {
    do_stuff();
}

Dominic Herity
Red Plain Technology


Your article is great. You point to the fact that ignorance hurts: the ignorance of rookie programmers, but also the ignorance of managers who insist on making decisions in fields that are not theirs. What are the implications for embedded systems, in which there are programmers with little or no background in either electronics or control theory, and engineers with no more programming training than LabVIEW or BASIC?

Best regards,
Julian Suarez
LIC-Ciencias, ULA, Venezuela


I was recently burned by a GetTickCount implementation that had been ported from an earlier system. Access to the tick counter must be atomic. In this system, the tick counter was only 16 bits. In the original code, access to the tick counter appeared to be atomic, and on a 16-bit microprocessor it was. However, the system was ported to an 8-bit microprocessor, so simply reading the tick counter was no longer an atomic operation. The access had to be protected by disabling interrupts, so that the tick interrupt did not update the counter in the middle of the operation.
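
The fix Bob describes might look something like this sketch; the interrupt-control macros are hypothetical stand-ins for whatever your compiler provides:

#include <stdint.h>

/* Stand-ins for the real intrinsics on your part
   (e.g., CMSIS __disable_irq()/__enable_irq() on ARM). */
#define disable_interrupts()
#define enable_interrupts()

volatile uint16_t g_ticks;             /* incremented by the tick ISR */

uint16_t read_ticks(void)
{
    uint16_t t;
    disable_interrupts();              /* hold off the tick ISR...            */
    t = g_ticks;                       /* ...two bus cycles on an 8-bit micro */
    enable_interrupts();
    return t;
}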

Further hiding this problem was the fact that the original code appeared to run without problems. The code it had been ported from had always run with interrupts disabled. Thus, in the original system, the tick counter access was atomic; but when the code was reused with interrupts enabled, problems began to occur. In fact, the bug existed in the system for two years, until I started to look at why certain periodic events would (very rarely) not run at the correct period.

Perhaps all this suggests that, rather than using big integers (32 bits or more), tick counters should be implemented with small numbers to bring out the rollover problems more frequently. A 16-bit millisecond counter will roll over every 65.5 seconds; an 8-bit tick counter, every 0.25 seconds. Can you live with these periods in your time scale?

If you are reviewing code, make sure to ask what happens when the clock rolls over, since this could take a long time to test. Better yet, write a function or macro to check for a timer expiration (if your OS doesn't have one) and use that. Always use it. The extra overhead is cheap compared to the effort to find the problems caused by an error in the timer checking.
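
Such a helper might look like the following sketch; the macro name is made up, and it assumes a 1-millisecond DWORD tick source like GetTickCount():

#include <windows.h>

extern void do_stuff(void);

/* True once 'interval' ms have elapsed since 'start'. The unsigned
   subtraction keeps it correct across the 49.7-day rollover. */
#define TICKS_ELAPSED(start, interval) \
    ((DWORD)(GetTickCount() - (start)) >= (DWORD)(interval))

void wait_ten_ticks(void)
{
    DWORD start = GetTickCount();
    while (!TICKS_ELAPSED(start, 10)) {
        do_stuff();
    }
}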

As for Windows, checking the references on the Microsoft website, it appears the problem is not in the kernel of the OS but in users at higher levels: RPC services, VTDapi95, the print spooler, and the SNMP system-uptime counter (evidently your network should never stay up for more than 49.7 days!). Searching for "49.7 days" reveals other problems with mishandling of the system tick timer. The most telling is http://support.microsoft.com/default.aspx?scid=kb;en-us;843561, a problem where all your timers appear to expire simultaneously.

Sadly, the Microsoft developers must not be paying attention to their own documentation: the 49.7-day limitation appears in many places on the Microsoft website, even in the GetTickCount documentation itself.

- Bob Bailey


DWORD wait = GetTickCount() + 10;
while (GetTickCount() < wait) {
    do_stuff();
}

can be done as:

DWORD wait = GetTickCount() + 10;
while ((int)(GetTickCount() - wait) < 0) {
    do_stuff();
}

and it will work. If two event times are within 1/2 the modulus of each other, and their (modulo-reduced) times are A and B, the reliable way to compare them is to check the sign bit of (A-B), which is not the same as comparing A with B in the usual sense. Mark's solution is very similar - but my point is, you can pre-calculate the 'end' time if you use it properly in the compare. It's a good idea to define a macro to do the compare, since it becomes clear that (a) you are comparing times and (b) you are doing it with a special policy (which presumably works, or if it doesn't, can be fixed globally by changing the macro).
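
A compare macro along the lines Gregory suggests might look like this sketch (the name is made up):

#include <windows.h>

extern void do_stuff(void);

/* True if time a is earlier than time b; valid while the two are
   within half the modulus (24.85 days for a 32-bit ms counter). */
#define TIME_BEFORE(a, b) ((int)((a) - (b)) < 0)

void wait_until(void)
{
    DWORD wait = GetTickCount() + 10;  /* pre-calculated end time */
    while (TIME_BEFORE(GetTickCount(), wait)) {
        do_stuff();
    }
}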

... another point:
In real-time systems, it's a very bad idea to have clocks that roll over in a cycle that is more than 40 minutes but less than 2 times the worst-case power-on time of the equipment.

I.e., they should either never roll over, or they should roll over often enough that the rollovers get tested. JMHO

- Gregory Smith

