During a cold winter a few years ago our team was getting ready to showcase a brand new mobile phone platform at 3GSM Mobile World Congress. This very prestigious event takes place every February in the beautiful city of Barcelona, and it is the best place to hobnob with the who's who of mobile communications, indulge in the exquisite Catalan cuisine, and perhaps drink too much sangria.
We were eager to present our latest 3G platform in a form factor (a phone-like enclosure) for maximum impact and, since we were a small company, we were aiming to get some street cred in front of the major players. Back then, 3G phones were just starting to appear, with the promise of high data rates (see iPhone, G1) and our data capability was excellent.
Our handsets were handmade from breakout development boards, we had but a few to show around, and we had not previously tested with the new enclosures in our mad dash for Mobile World Congress – all major ingredients in a recipe for disaster, but as good engineers we soldiered on.
Our applications team worked around the clock to get all the video capabilities enabled, doing a fabulous job. One after another, our handsets were passing all the necessary data tests. A couple of weeks before the show the video demo was doing the rounds inside the company, and for once the engineers were garnering all the accolades.
We were riding high. Then some people decided, in their folly, to try a voice call (what on earth were they thinking?) and all hell broke loose.
It turned out that all the handsets were working properly for data, but voice was a different story. Lest we forget, these were phones we were talking about. In the past we had had our fair share of problems in the voice subsystem (see Part 2 for an example) and once again I was sucked into the fire. Further tests revealed that only one of the handsets was capable of performing a voice call, and it had a bright yellow flip phone enclosure, so we dubbed it (quite originally) the golden phone.
I ran a few tests and pretty quickly had an ominous flashback. A few months back a customer had reported a failure in some of our units that had been modified to increase the speed of the DSP clock.
The failing units crashed when voice was enabled. I had spent some time trying to figure out why this was happening, but the customer gave it low priority because voice was not their focus.
So this problem never received enough attention and I was pulled off to work on more important stuff – or so we thought. One of the unspoken rules of engineering is that when you wait long enough for a problem to go away, and it does, it will come back to bite you with a vengeance – and it did.
My notes from that previous investigation revealed that while the crash was apparently random, approximately one out of ten times the software would fail in the same module due to some inexplicable data corruption during a codebook search routine.
This was odd because during this codebook search the software would essentially compute information using values from ROM tables as input, and as we all know, ROM values never change. The debug sessions were long and not very effective on the new handsets. I could not find anything wrong with the software.
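For readers who have not met one, a codebook search can be sketched in a few lines of C. This is a generic nearest-entry search, not the codec routine from our platform; the table contents, the sizes, and the names codebook and codebook_search are all made up for illustration. The point is that the only inputs are the target vector and a constant table, so the same input should always produce the same answer.

```c
#include <stddef.h>
#include <stdint.h>

#define CB_ENTRIES 4u
#define CB_DIM     2u

/* Hypothetical codebook; on a target like ours it would sit in data ROM. */
static const int16_t codebook[CB_ENTRIES][CB_DIM] = {
    {0, 0}, {100, 0}, {0, 100}, {100, 100}
};

/* Return the index of the codebook entry with the smallest
 * squared error against the target vector. */
size_t codebook_search(const int16_t target[CB_DIM])
{
    size_t best = 0;
    int64_t best_err = INT64_MAX;

    for (size_t i = 0; i < CB_ENTRIES; i++) {
        int64_t err = 0;
        for (size_t k = 0; k < CB_DIM; k++) {
            int32_t d = (int32_t)target[k] - (int32_t)codebook[i][k];
            err += (int64_t)d * d;
        }
        if (err < best_err) {   /* keep the first entry on ties */
            best_err = err;
            best = i;
        }
    }
    return best;
}
```

A target of {90, 10} lands closest to entry 1, {100, 0}. If a table value or an intermediate result is corrupted on its way to the arithmetic, the search can pick a different entry even though the code itself is correct.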
Heated debates ensued, with plenty of the usual hardware vs. software name calling. With the tradeshow deadline looming, someone in the hardware team discovered that the problem would go away under two conditions: a) the temperature of the IC was decreased, and b) the voltage to the DSP core was increased. Option a) would require opening the handset enclosure and spraying the IC with cold air, something that would be very hard to pull off at a tradeshow.
Hence, we settled for option b) and the circuit boards going into the handsets were modified accordingly. We had our fix. The phones got really hot, and the batteries died really fast, but in the razzle-dazzle that is a tradeshow, it was all good.
Back at the mothership, however, all was not well. Having the software vindicated didn't feel like much of an accomplishment. Several of us were not pleased that we hadn't found the root cause of the problem.
We knew the current solution was not production quality. Much to my chagrin I was pulled off to work on something else since we now believed the problem resided in the hardware.
This turned out to be rather fortuitous because the person assigned to look into this problem was an SoC designer. She knew very little about the software and, being much smarter than me, approached the problem in a very simple way. This is what she discovered:
Any read from a particular RAM that was immediately preceded by a read from data ROM could be potentially corrupt.
Figure 1, below, shows what the problem looks like.
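In software terms, the failing pattern is a ROM read followed immediately by a RAM read. A rough C sketch of a check for that pattern follows. The buffer size, the fill pattern, and the function names are hypothetical, and on a real DSP the two reads would probably have to be written in assembly to guarantee that they land in back-to-back cycles. It is not the test our colleague actually ran.

```c
#include <stddef.h>
#include <stdint.h>

#define N_WORDS 8u

/* Stand-in for a table in data ROM (contents are arbitrary). */
static const int16_t rom_table[N_WORDS] = {3, -1, 4, -1, 5, -9, 2, -6};

/* Stand-in for the suspect RAM. */
static volatile int16_t ram_buf[N_WORDS];

/* Known value expected at RAM word i (arbitrary pattern). */
static int16_t expected_word(size_t i)
{
    return (int16_t)(0x1234 + (int)i);
}

/* Fill the RAM with the known pattern, then repeatedly read a ROM
 * word followed directly by a RAM word. Returns how many RAM reads
 * did not match the pattern. */
unsigned rom_then_ram_mismatches(unsigned passes)
{
    const volatile int16_t *rom = rom_table;
    unsigned mismatches = 0;

    for (size_t i = 0; i < N_WORDS; i++) {
        ram_buf[i] = expected_word(i);
    }

    for (unsigned p = 0; p < passes; p++) {
        for (size_t i = 0; i < N_WORDS; i++) {
            int16_t r = rom[i];      /* ROM read first ... */
            int16_t v = ram_buf[i];  /* ... then the RAM read under test */
            (void)r;                 /* the ROM value itself is not checked */
            if (v != expected_word(i)) {
                mismatches++;
            }
        }
    }
    return mismatches;
}
```

The volatile qualifiers keep the compiler from dropping or reordering the two reads, but they say nothing about bus cycles. On healthy hardware the count should stay at zero.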
Now the IC design team had to figure out why the DSP memories were susceptible to a voltage drop. Their analysis showed that the failing memory was the farthest away from the power grid, and the ROMs consumed more than six times as much current as the RAMs.
This combination reinforced the theory that ROM reads could cause the RAM in question to be browned out, i.e. starved of current. To confirm this as the culprit, one of the boards was modified to have independent power supplies for the DSP core and the DSP memory. Testing began, and the voltage to the DSP memory was lowered, and lowered, and lowered… but still all tests passed. Say what!?
When the DSP core voltage was lowered the failure began to manifest itself again, but this did not conform to the hypothesis. The failure was observed regardless of the voltage applied to the memory. Something else was afoot here. The next suspicion was that the failing RAM was not getting its memory enable on time, with two possible causes: a) the memory enable logic was slow, or b) the memory clock was fast.
Here's a quick explanation of how memory reads worked on that platform. The DSP core used an asynchronous memory interface that drove the signals to the memories off the rising edge of the DSP clock and expected the data back from the memory prior to the next rising edge of the clock. We used clocked memory instances, so the memory clock had to be generated somewhere between the two rising edges of the DSP clock.
In this design, the time required before and after the memory clock was not well balanced – in more technical terms, the logic delay from driving flip-flops + memory setup time before the clock was much less than the memory read access time + logic delay to capture flip-flops.
Using the falling edge of the DSP clock to clock the memories (Figure 2, below), seemingly the more straightforward approach, would have resulted in sub-optimal performance (the DSP clock would have to be stretched, leading to a loss of MHz). Therefore, to achieve maximum DSP performance, the memory clock was generated using a tuned delay chain.
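The cost of the falling-edge scheme can be put in numbers with a simplified timing budget. The sketch below assumes a 50% duty-cycle clock, ignores skew and margins, and uses hypothetical function names and picosecond units; it is not the analysis the design team ran. Here before_ps is the logic delay from the driving flip-flops plus the memory setup time, and after_ps is the memory read access time plus the logic delay to the capture flip-flops.

```c
#include <stdint.h>

/* Memory clocked on the falling edge of a 50% duty-cycle DSP clock.
 * The first half period must cover before_ps and the second half
 * must cover after_ps, so the longer side sets both halves. */
uint32_t min_period_falling_edge_ps(uint32_t before_ps, uint32_t after_ps)
{
    uint32_t worst = (before_ps > after_ps) ? before_ps : after_ps;
    return 2u * worst;
}

/* Memory clock placed by an ideally tuned delay, right where the
 * "before" side ends. The period only has to cover the two sides
 * laid end to end. */
uint32_t min_period_tuned_ps(uint32_t before_ps, uint32_t after_ps)
{
    return before_ps + after_ps;
}
```

With, say, 1,000 ps before the memory clock and 4,000 ps after it (made-up numbers), the falling-edge scheme needs an 8,000 ps period while an ideally placed memory clock needs only 5,000 ps. The more lopsided the two sides are, the more MHz the falling edge gives away, and the more tempting the delay chain becomes.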
The description of this tuned delay chain and memory clocking in general is beyond the scope of this article, but the take home lesson is one you will all be familiar with. In an effort to squeeze the last drop of performance out of our systems, we embedded engineers (both hardware and software) sometimes get too creative and outsmart even ourselves.
In this case a static timing analysis tool was used in a way it wasn't meant to be used, resulting in an overoptimistic analysis. As a result the memory clock generation that was meant to deliver higher performance caused a horrible bug that manifested itself first as an apparent software problem, then as a sure voltage drop problem, and only after much work and frustration did it reveal its true self.
In the end the team did the right thing, and in the spirit of continuous improvement, the pre-silicon verification process was fixed so that such a scenario would not occur again. Nevertheless, the memories remain.
Next in Part 3: “Bus Interrupted.”
Mauricio Gutierrez graduated from the University of Michigan at Ann Arbor with a Master's degree in Electrical Engineering. Since graduating he has worked on embedded software development for wireless communication devices at Motorola, PrairieComm and Freescale Semiconductor. He currently works as a consultant with a team of engineers providing services in wireless communications. He enjoys photography, biking, and chess, where he is competent enough to beat a sedated turtle. He can be contacted at firstname.lastname@example.org.