If you have been in the embedded software business long enough, surely you have experienced or heard horror stories about hardware upgrades claiming to be “backward compatible” with the software.
Particularly aggravating is how schedules to deploy products using such upgrades tend to get compressed. The argument for reduced schedules usually falls along the lines of “the effort should be minimal” or “it is just plug and play”.
This story takes place during the development of a high speed wireless mobile platform. The much trumpeted 3G technology responsible for bringing to life the iPhone, G1 and the latest BlackBerry smartphones is in fact a series of standards.
Each different version of the standard adds new features that result in enhanced user experience, mainly due to faster over the air data transfer speeds and increased network capacity compared to previous versions. It was during the upgrade of one of our products to a more recent 3G version that I experienced one of the most interesting problems of my career.
We had a working part that conformed to a particular set of specifications. As we designed its more advanced sibling, the hardware, software and architecture teams had all gone through a lot of effort to make sure everyone was on the same page about what was changing.
From a software point of view the new hardware supported features which we would have to develop, but there were few functional changes that needed to be addressed to get the product to the same level of functionality as its predecessor.
The team responsible for updating the drivers had already delivered all the required changes. For the most part the differences we needed to address were confined to the memory map.
Once we took delivery of the hardware from the team doing the board support we methodically started to activate the communications software stack. One by one we enabled modules and basked in our success rate.
Sure, we encountered some bugs, and stumbled on a few “details” that had been omitted from the due diligence, but for once it was beginning to appear as if we would be ahead of schedule – that alone should have tipped us off to what was ahead.
As we marched through the plan ticking off functional milestones from the list, we started to notice an intermittent problem. At some point during execution our debug tools would lose all communication with the target, and the whole development environment would lock up on us; you could almost see the dreaded spinning hourglass hover above our hardware.
The team found some bugs that when fixed would alter the point at which the lock-up occurred. However, after much toil and quite a few bug fixes that deep inside we knew were unrelated, it started to become apparent that we really had no clue what was causing the problem.
Very quickly we ate away the schedule cushion we had gotten ourselves with our early progress, and we were now on borrowed time. We were one functional milestone away from reaching our first schedule gate and moving on to the next phase of the program; a lot of people up the chain of command were keeping track of our progress. Slowly we started to assign more engineering resources to look at the problem.
One of the much repeated rules about software project management is that adding resources when you are in the thick of battle can at best not hinder. The software we are talking about is very complex and requires expertise in many different areas, so we wanted to make sure all areas were properly scrutinized for issues. We were obviously desperate because we knew that most of the modules we were asking the new recruits to inspect would not have been activated yet.
I must now take a short detour to give a few details about our debugging environment. We used very powerful integrated debuggers (provided by one of the advertisers in this publication) that can gather all types of real-time information from the target in a non-intrusive way. Code and data traces, OS statistics, memory snapshots, you name it.
Once captured we can replay at will all this information and subject it to analysis, a powerful way indeed to figure out the most complex of problems. Unfortunately, this time we were left in the dark because once the environment locked up on us we were unable to retrieve any information from the debugger.
Presumably the information would still be somewhere inside the debugging hardware, but this was a new platform and we hadn't fully tested our tools with it. The option to recover the debug information was of course (in honor of Murphy's law) not working.
Imagine an army battalion walking into battle only to discover that electronics, explosives and firearms were all not working properly. That was us. We had less than a week to meet our deadlines so we had to revert to some of the tried and true manual debugging methods, but it was already Friday.
We tried our best to avoid asking the team to work through the weekend, but this time we agreed it was necessary to maintain any hope of delivering on time. This was going to be an all hands weekend.
We got off to a rocky start that Saturday. The lights in the building were centrally controlled so when we arrived everything was dark. We scavenged whatever desk lamps we could from the surrounding cubes and brought them into the lab. Meanwhile our project manager was engaged in an entertaining conversation with building security. It went something like this:
Project Manager: “We need to have the lights turned on in building four, second floor, south side.”
Security Guard: “I am sorry sir, I can't do that.”
Project Manager: “So what do I need to do?”
Security Guard: “You need to call this number…”
Of course nobody answered so he left a message.
Project Manager: “I already called and described the situation. Can you turn them on?”
Security Guard: “I am sorry sir, I can't do that.”
Project Manager: “Come on, work with me. I have a team of more than 100 engineers who will be unable to do their job unless we fix this today.”
Security Guard: “Your company needs to authorize this by calling the number I gave you.”
Project Manager: “I already did. Do you mean they'll turn them on remotely?”
Security Guard: “No, they'll call me and then I can turn them on.”
Project Manager (in disbelief): “Well, I am authorizing it for the company. Can you please turn them on?”
Security Guard: “That's not how it works…”
Eventually they did turn on the lights, but this shows that there is more than a grain of truth to the old adage that says “when it rains, it pours.”
We set up several stations to try different parallel approaches. We had a good idea by now of the general vicinity of where the software would fail, but we still needed to home in on the exact location.
Some of the engineers were busy figuring out how to map GPIOs and connect the logic analyzer, but we had gotten lazy with our (now rendered useless) powerful debugging tools, and just figuring out all the details would take some time. Meanwhile, another group approached the problem from a different angle.
At the point of the lock up our software was executing from the highest priority task in the system (we ran in a pre-emptive multi-tasking environment), so in the unlikely event that we had a rogue interrupt corrupting us we might be able to tell by ceding control of the system while our task waited for a non-existent event.
We marched down the execution tree with our “wait for world peace” strategy and were able to rule out any rogue interrupts and narrow down the problem area to a loop.
The problem with this approach is that it is highly intrusive and ill-suited for loop debugging because it generates an unrecoverable change of flow. Regardless, now we had our target candidate and were ready for some GPIO action.
When the logic analyzer was ready we were able to observe where the problem was occurring, and as we suspected it was truly bizarre. Within the loop, there were a couple of consecutive read instructions to different memories in the system. We were able to ascertain the following behavior:
read memory A
read memory B
GPIO toggle <-- absent
Then, we increased the granularity of our GPIO toggles.
read memory A
read memory B
…die somewhere else
Clearly there was something wrong. Any instruction (even a nop) inserted between both memory reads would cause the problem to go away. Something like this bears the unequivocal stench of a hardware issue, but we needed to convince ourselves and find conclusive evidence.
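For readers who have not done this kind of GPIO tracing, here is a minimal host-side sketch of the idea, not the code we actually ran. Every name in it (`fake_gpio_out`, `fake_peripheral_a`, `fake_external_b`, `trace_toggle`, `read_pair_instrumented`) is hypothetical, and ordinary variables stand in for what would be fixed register addresses on a real target. The markers bracket the pair of reads instead of splitting it, because any instruction placed between the two reads made the failure disappear.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for hardware (assumption): on a real target these would be
   fixed addresses from the memory map, not ordinary variables. */
static volatile uint32_t fake_gpio_out;      /* hypothetical GPIO output register */
static volatile uint32_t fake_peripheral_a;  /* "memory A": peripheral memory     */
static volatile uint32_t fake_external_b;    /* "memory B": external memory       */

static unsigned toggle_count;  /* host-only bookkeeping, not needed on a target */

/* Flip one GPIO pin so a logic analyzer would see an edge at this point. */
static void trace_toggle(unsigned pin)
{
    assert(pin < 32u);
    fake_gpio_out ^= (UINT32_C(1) << pin);
    toggle_count++;
}

/* One pass over the suspect pair, with a marker before and after it.
   Nothing is placed between the two reads, so they stay back to back
   in the source (a compiler may still schedule them differently). */
static uint32_t read_pair_instrumented(void)
{
    trace_toggle(0);                 /* edge 1: about to read         */
    uint32_t a = fake_peripheral_a;  /* read memory A                 */
    uint32_t b = fake_external_b;    /* read memory B                 */
    trace_toggle(0);                 /* edge 2: both reads completed  */
    return a + b;
}
```

On a target, a missing second edge on the analyzer would correspond to the "GPIO toggle absent" observation above. On a host, calling `read_pair_instrumented()` simply produces two toggles and returns the sum of the two stand-in values.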
A quick analysis of the failing code gave us enough of an insight into what could be occurring. The first read was from a peripheral memory and the second read was from external memory, which meant it had to go through the cache.
We then modified the memory map to place the contents required by the second fetch in internal memory to bypass the cache. The problem disappeared! There remained quite a few instances of these back to back memory reads down the execution tree, but we now had a workaround.
By now it was Monday, and we were busy with our local hardware experts trying to figure out what could be happening. At the same time, part of the team was busy modifying the memory map to pull back into internal memory some of the external memory components.
The workaround we had would get us through this problem but it was certainly not production worthy. Internal memory is a precious resource because of its zero wait states, and consuming it to mask an existing problem would lead to unacceptable performance degradation down the line. The other alternative was to – cringe – litter the code with nops, something you don't want to do inside loops.
Now we were racing against the clock to meet our deadline. In the nick of time we got our functional milestone to work and spared ourselves a lot of grief. But the job was not done. We raised a big red flag to get some dedicated hardware support to identify the root cause of the problem. Our local experts were quick to point out the likely culprit after analyzing the hardware specs and observing the behavior.
Adding to the troubles, the designers of the suspect hardware were not local and spent our waking hours sleeping and vice-versa. It took quite a bit of time, and the usual hardware vs. software name calling (despite the irrefutable evidence gathered) to get to the bottom of the problem.
Since RTL hardware simulations take forever to run, we had to provide a small segment of code that could reproduce the problem. This turned out to be much harder than we thought. A simple piece of code like the one used here for illustration purposes was not enough.
We needed to replicate the system state without having to execute all the code leading to the point of failure. In the end we did manage to provide a small segment of code that wouldn't take days to simulate, and the hardware designers were able to isolate the problem.
So what happened?
At the heart of this problem was a system wide change that was introduced from one platform to the next. The first platform was intended for mass market devices, for users that are happy with a few basic applications. As such, it was built using a proven but not state of the art bus architecture.
The new platform was intended for much more powerful and demanding devices (e.g. smartphones), and the need to use a higher performance bus was stated as a requirement. Still, a good chunk of the internal hardware didn't need the higher performance bus, so the old bus was used for some peripherals and a translator gasket was introduced to go from one bus to the other (Figure 1 below).
Unfortunately there was some faulty logic introduced into a splitter block that resulted in the translator gasket getting confused, thereby mishandling a read from peripheral memory A on the old bus, followed by a read from external memory B on the new bus. This resulted in an invalid and irrecoverable state in one of the lines in the old bus, causing the system to lock up.
I am not a hardware expert, and the architecture details of the device are proprietary, but what the gasket did was take transactions from the old bus and make them look native to the new bus. This behavior was properly handled most of the time, though as we painfully learned, not always.
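The triggering pattern can be written down as a simple predicate over a sequence of bus accesses. The sketch below is only an illustration under stated assumptions: the types `bus_t`, `access_kind_t` and `bus_access_t`, and the functions `is_hazard_pair` and `count_hazard_pairs`, are all hypothetical, and the real gasket logic is proprietary and certainly more involved. Since the pattern was handled correctly most of the time, a match marks a candidate for trouble, not a guaranteed lock-up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a bus access; names are illustrative. */
typedef enum { BUS_OLD, BUS_NEW } bus_t;
typedef enum { ACCESS_READ, ACCESS_WRITE } access_kind_t;

typedef struct {
    bus_t bus;           /* which bus the access targets */
    access_kind_t kind;  /* read or write                */
} bus_access_t;

/* True when `first` immediately followed by `second` matches the
   pattern described in the text: a read on the old bus, then a read
   on the new bus. */
static int is_hazard_pair(bus_access_t first, bus_access_t second)
{
    return first.kind == ACCESS_READ && first.bus == BUS_OLD &&
           second.kind == ACCESS_READ && second.bus == BUS_NEW;
}

/* Count adjacent pairs in a recorded access sequence that match the
   pattern. Only strictly consecutive accesses are considered, since an
   intervening instruction was enough to avoid the failure. */
static size_t count_hazard_pairs(const bus_access_t *trace, size_t len)
{
    assert(trace != NULL || len == 0);
    size_t hits = 0;
    for (size_t i = 0; i + 1 < len; i++) {
        if (is_hazard_pair(trace[i], trace[i + 1])) {
            hits++;
        }
    }
    return hits;
}
```

A check like this could be run over any list of accesses, however it was obtained, to estimate how many places in the code would need a workaround.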
Once we knew the details of the problem we set out to find an adequate solution. We discovered that, more by serendipity than design, we had the option to reconfigure the cache in the communications processor to use a burst mode native in both the old and new buses, thereby bypassing the translator gasket.
The new configuration was taking advantage of a higher performance option present only in the new platform, an option that had not yet been enabled as we were still operating the device in backward compatibility mode. We were of course very happy to learn there would be no need for any software changes.
As usual, the original intentions were good: introduce a higher performance bus to deliver a rich user experience. Unfortunately, when a hardware bug slips through the verification cracks it usually results in many long nights for the software developers.
If anything, future project schedules will further compress under the pressure to deliver the latest and greatest devices into a very competitive market. The need for effective team communication is present now more than ever.
Both hardware and software engineers must reach across the aisle and work together if their products are to remain competitive in this market. The state of the union should be evidence enough that partisan politics inhibit rather than help progress.
Mauricio Gutierrez graduated from the University of Michigan at Ann Arbor with a Master's degree in Electrical Engineering. Since graduating he has worked on embedded software development for wireless communication devices at Motorola, PrairieComm and Freescale Semiconductor. He currently works as a consultant with a team of engineers providing services in wireless communications. He enjoys photography, biking, and chess, where he is competent enough to beat a sedated turtle. He can be contacted at email@example.com.