When I'm reading a magazine from the software or electronics industry, I almost always run into an article hammering on the subject of software quality. The topic, which warrants plenty of hammering, is usually illustrated with abundant examples of the poor-quality software that irritates our everyday lives. These articles either focus on the negatives (horror stories or why software development is so stinkin' hard) or issue pleas for development teams to use more “best practices” to mitigate some of the challenges. But through it all we tend to accept that software development will always be hard, that code will always be buggy, and that the very best to which we can aspire is some statistical reduction in how far we fall short of success.
I don't buy it.
I don't think that software development has to be such a quality disaster. Software development could be far more disciplined, manageable, and credible than it is today.
I'm not saying this from the viewpoint of some remote ivory tower. In my current job I write and maintain code for both Windows desktop GUIs and embedded systems using the 8051, 68HC08, Z180, ColdFire, MIPS, and Pentium processor families under home-grown RTOS code, Linux, and Windows XP Embedded–and that's just for our currently shipping products. I've worked professionally in various small teams designing, managing, training, and mostly coding for 22 years. I've seen and done lots of software development, and I know what's hard about it.
Working with a variety of environments and tool chains may have contributed to my tendency to question why development hassles occur, as they are occurring. Is it a shortcoming in this particular platform? Or of these particular tools? The language? The design? Some element of our process? Some shortcoming in my experience, training, or understanding? How can I make it better? And if I had unlimited resources with which to tackle the problem in some ideal world, what might I do about it?
We've somehow become conditioned to accept failures of computer code as some inevitable element of modern life. We talk about the “essential complexity” of software as if it were some immutable universal law. But recently I have come to suspect that these limitations apply, not to all possible software development, but solely to the software development paradigm that we've followed, unchallenged, for decades. I suspect there are ways out of this mess, but they all involve first stepping out of our familiar habits and finding new ones without these limitations.
The limited worldview
I first started questioning our paradigm after reading an article last fall on Embedded.com, “Margin,” by Jack Ganssle. The article started by recounting the September 21, 2005 emergency landing of an Airbus A320 jetliner with a broken nose wheel. The landing was accomplished with the wheel sideways, friction grinding half of the wheel completely away and stressing a wheel strut that was never designed for this sort of abuse. The plane landed safely because that strut survived the stress, having been specified to a greater strength than its intended application required. The article pointed out that in mechanical engineering, it's possible to design in margin to accommodate greater demands than expected, while in software, we can never have a margin. We can only try to reduce the number of imperfections.
This article bothered me. The comparison felt true and correct but when I thought about it, I began to feel that the concepts being compared were mismatched.
There were two variables to the success of the strut on that airplane. The first variable was the characteristics specified for that strut: did it meet or exceed the requirements? The second was whether the design was properly implemented: was that particular airplane built with the strut that had been designed? The first variable is an engineering choice where margins are possible (and expected): the strength of the part must exceed some known worst-case expectation. The second is a human accountability question, where the likelihood that the correct strut was used, that the strut itself had been made correctly, and so on, can approach but never reach perfection. The human accountability questions are addressed by processes and quality control that seek to minimize defects, while the engineering questions are addressed by careful analysis building upon a body of known and tested quality metrics and then adding safety margins.
In theory, software–particularly embedded software–should be no different. The specifications variable looks at whether the system has the resources to do the job. Is the CPU fast enough? Is the guaranteed interrupt latency low enough? Does the algorithm cover the expected range of input values? These are engineering questions, all of which allow for margins. The human accountability variable looks at whether the design was properly implemented. Was the correct algorithm used? Was it coded correctly? Are the code and operating system bug-free? Was a particular unit manufactured correctly? These questions are addressed by processes and quality control.
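To make the idea of a software margin concrete, here is a minimal sketch in C of checking a timing budget against a required amount of headroom. The structure, the function names, and every figure in it are illustrative assumptions, not taken from any real project or tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical timing budget for one task or interrupt handler. */
typedef struct {
    uint32_t deadline_us;    /* time the requirement allows */
    uint32_t worst_case_us;  /* measured or analyzed worst-case time */
} timing_budget;

/* Headroom left in the worst case, as a percentage of the deadline.
   A negative result means the worst case misses the deadline. */
int64_t margin_percent(timing_budget b)
{
    assert(b.deadline_us > 0);
    int64_t slack = (int64_t)b.deadline_us - (int64_t)b.worst_case_us;
    return (slack * 100) / (int64_t)b.deadline_us;
}

/* True when the design keeps at least required_percent of headroom. */
bool has_margin(timing_budget b, int64_t required_percent)
{
    return margin_percent(b) >= required_percent;
}
```

With an assumed 50-microsecond latency requirement and a 30-microsecond worst case, margin_percent reports 40, and a design rule demanding 25 percent headroom passes. The same check with a 45-microsecond worst case fails the rule even though the deadline itself is still met, which is the distinction between meeting a requirement and having margin.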
So, for both mechanical and software systems we have the same interplay of engineering choices, for which we can build margin into our choices, and accountability issues, for which we can only strive to improve, but never perfect, our success rate. It would seem, from this perspective, that both mechanical and software engineering should be able to yield the same high levels of success and the same low levels of catastrophic failure. Yet a complex software system turns out to be a lot less reliable. Every line of code represents a possible point of failure if done wrong, and failures tend to be catastrophic, halting the full function of the larger system.
Consider an everyday mechanical product we largely take for granted: the automobile. Every component–down to each screw, washer, or strand of wire–represents a possible point of failure, where the total product either got it right or fell short. The car sitting in my garage represents a larger number of points of failure than many quirky embedded systems projects, which raises the question: why doesn't the car just fall apart where it sits? There must be millions of points of possible failure in the manufacture of my car, so why does it work at all?
In my car, a failure at one of those millions of points is usually not able to do much harm. Let's say the structural integrity of one copper strand of wire in my engine fails because it was made wrong or stressed too much during assembly. So what? The other strands in the wire still carry electricity, and the cable's current-carrying capacity does not drop below its requirement. Or say that one machine bolt in a mounting was made from impure metal, can't hold its load, and snaps. I'll guess that if that bolt were fairly important (holding the frame together), other bolts are there too and the car survives; margin comes from redundancy.
Somehow we accept that in software, a bad strand or a bad bolt can cause the whole thing to disintegrate. Why do we accept it? Because our understanding of software technology, our software development worldview, says this is Just The Way It Is.
The universal bolt
Imagine for a minute that I've invented the Universal Bolt. This is a metal object for joining threaded holes that can extend or collapse to fit a variety of lengths. It can expand or contract to fit holes of different diameters. The really cool feature is that I have replaced the bolt's spiral ridge with a series of extendable probes that can accommodate different thread pitches. What a marvelous product! You no longer need to stock a variety of bolts of different sizes and lengths and thread spacings because my Universal Bolt can be used in place of any of them. Of course, with all this flexibility this bolt is admittedly far less solid than conventional bolts, but I have addressed this concern with super-high-strength metal and a clever micro-machined mechanism for locking it to the desired length, thickness, and threading. This introduces a lot of complications, and my Universal Bolt turns out to be more expensive than a conventional bolt. But no problem!
Because it's able to change configurations extremely quickly, a single Universal Bolt can take the place of many conventional bolts simultaneously. What we do is rig up a clever and very fast dispatcher device that quickly moves the bolt from hole to hole. If the dispatcher is fast enough, my Universal Bolt can spend a moment in each hole in turn and get the whole way through your product so fast that it returns to each hole before the joint has had a chance to separate. Sure, the bigger the project, the more complex the dispatcher becomes, but if we keep boosting the dispatcher speed and the bolt reconfiguration speed, we can keep the whole thing together.
Absolutely absurd, right? You'd have to be crazy to get into a car built along these lines. If anything caused the dispatcher to derail, the entire product would collapse in a second.
Well, yes, it is absurd. But it also pretty accurately describes the function of several embedded systems I've worked on. A fast and complex thread dispatcher keeps moving one simple and stupid integer-computation unit all over a big system tending to tasks rapidly enough that they all get done. And if that dispatcher ever once leads the CPU into an invalid memory address the whole thing crashes to a halt. Does this sound familiar to any software developers out there?
The present worldview
Software development started with programming. At one time a computer's primary purpose was to compute: that is, to be a programmable machine for the purpose of processing mathematical algorithms. A fairly simplistic but flexible arithmetic engine was given the ability to step through computations according to instructions that could be changed to handle the next problem. Want to grind out trigonometry tables? A computer is just the tool for you.
Then some of the programs became useful in their own right; something you would run over and over again with different inputs. Suddenly we had software applications. Add some long-term storage and the computer became suited for data processing. The computer was a big and expensive solution for applications, so some clever people came up with time-sharing (or multitasking or, if you prefer, the dispatcher for the Universal Bolt) to get more jobs done with the same hardware.
Sometimes these applications were just better tools for helping people do computations (for example spreadsheets such as VisiCalc). Other times applications had absolutely nothing to do with computing (word processors such as WordStar) but were implemented as computer applications simply because it was less expensive to build them in software than to build dedicated hardware alternatives. The same approach holds true today: Excel, Word, Access, PowerPoint, and so forth are still applications implemented on a computer because to do the same work in a dedicated appliance would be more expensive and less useful. But underneath, a very simple arithmetic engine is racing around a hideously complex and fragile maze of instructions to do the work. Embedded systems are not much different. Just like most big-name software applications, the very concept of the computer has grown and evolved into something unwieldy that tries to do too much.
Our desire to do powerful and flexible things with computing technology has stayed rooted for too long in the concept of executing a single algorithm; the result has simply become absurd. Somehow the tool we've had–creating expanded algorithms for ever-more powerful processors–has become the only way we know how to look at software needs.
We need a different model for software development.
Code reuse to the rescue?
Actually, it's worse than all that. Because not only do we try to create really complex applications by coding really complex algorithms, we tend to do it as a custom job every time.
Imagine, if you will, that you asked me to build you a car. In order to do this I was provided with your description of what a car ought to do and a forge for making the metal. No existing parts, no existing designs, just a description and some very raw materials. I can say with some confidence that what I would create–if indeed, I succeed in making any sort of car at all–would crash more decisively (and more literally) than any software I've seen.
I've heard about the importance of code reuse for most of my career: how silly it is to reinvent the wheel, how important it is for code quality to build upon trusted code. It's all true. I can also say that professionally we've still got a long way to go. The mainstream software development tools don't do much to facilitate code reuse and there seems to be a rule that the closer some development environment comes to allowing good reuse, the more the performance suffers or the more unstable the end result becomes. All I know is that in my career, the number-one explanation for discovered bugs (both mine and others') tends to be the “cut and paste mistake”–that is, an attempt at code reuse that caused problems.
When we as developers try to reuse code, the code we're reusing is rarely a perfect fit. I encounter this all the time: some module I wrote for a previous application is what I need for the new application. So I try to add the module to my new project and things break. In C or C++, the languages I use most, it turns out that the module needs different #include files than my previous project used, or different compiler switches, or different libraries to link against, or things like that. Every module carries with it its own tangle of requirements and dependencies, and I spend a block of time trying to get the project to build again. Almost always I have to make some code changes to the module, break routines out into different source modules, add some #ifdef conditional sections to reconcile the disagreements, change around the order of some headers or the like–and then see if the module still builds in the original project(s). Did I break something along the way? Very hard to tell. And that's ignoring the fact that the module probably isn't exactly what I need, functionally, so I have to tweak things. I can do this by adding run-time flexibility (adding parameters or flags, slowing down the execution, and making more execution paths that may not ever get tested) or by making #ifdef sections compile differently in different cases, which very quickly renders the code a completely unreadable mess that neither I nor my fellow developers can really maintain. This doesn't sound at all like the goal of reuse we were aiming for.
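The two ways of bending a reused module can be sketched in C as follows. The checksum routine and the PROJECT_B_INVERTED_CHECKSUM switch are made up for illustration; the point is only the shape each approach gives the code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Variant 1: run-time flexibility. One build serves both projects, but every
   call pays for the extra test, and both paths need test coverage. */
uint8_t checksum_rt(const uint8_t *data, size_t len, int invert)
{
    assert(data != NULL || len == 0);
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + data[i]);
    return invert ? (uint8_t)~sum : sum;
}

/* Variant 2: compile-time selection. PROJECT_B_INVERTED_CHECKSUM is a
   hypothetical switch; what this function does now depends on build
   settings that are not visible in the source file. */
uint8_t checksum_ct(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + data[i]);
#ifdef PROJECT_B_INVERTED_CHECKSUM
    sum = (uint8_t)~sum;
#endif
    return sum;
}
```

With one flag and one #ifdef both versions are still readable. The trouble described above starts when a module accumulates a dozen of each, and no single build or test run exercises every combination.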
I know, these problems are characteristic of the tools I'm using. But they're not unique to particular tools or languages. Let's face it: even leaving code 100% untouched doesn't guarantee that it doesn't change. When I'm working in 8051 assembly language it's not uncommon that when adding some initialization code, some medium-distance calls (ACALLs) later in the code will no longer build because their targets are no longer in the same memory block as the caller. So I change the offending ACALLs into the less-bounded but slower LCALL instructions, which changes the timing of the program. That grows the code size, which may suddenly cause conditional branches to no longer reach their targets, requiring that the conditions be rewritten to test the opposite condition and branch around a jump (AJMP or LJMP) that gets to where it needs to go, which again changes the timing. So much for leaving the code alone!
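The ACALL limit is easy to state precisely: the instruction is two bytes long and carries only 11 address bits, so the target must sit in the same 2 KB block as the byte that follows the ACALL. A small C sketch of that check, with a hypothetical function name and made-up addresses, shows how a few bytes of new code can push an untouched call out of range.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when an ACALL located at acall_addr can reach target.
   The upper 5 address bits are taken from the program counter after the
   2-byte instruction, so both addresses must share one 2 KB block. */
bool acall_reaches(uint16_t acall_addr, uint16_t target)
{
    uint16_t next_pc = (uint16_t)(acall_addr + 2u);
    return (next_pc & 0xF800u) == (target & 0xF800u);
}
```

In this sketch a call at 0x07F0 to a routine at 0x0100 is fine. Insert 0x20 bytes of initialization code ahead of both, so the call lands at 0x0810 and the routine at 0x0120, and the same unmodified source line no longer assembles.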
At the other end of the scale, in the desktop PC world of Windows and Pentiums, adding a bit of code causes unrelated and untouched code modules to link to different addresses, suddenly changing where in the code the paging breaks for virtual-memory swapping occur. Or worse yet, two important and heavily used routines now hash to the same locations in the processor's cache; you take a sudden performance hit in code you haven't changed for no reason you can detect.
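The cache effect can be sketched with a simple set-index calculation. The geometry below (64-byte lines, 64 sets) and the addresses are assumptions chosen for illustration; real processors differ in line size, set count, associativity, and whether they index by physical or virtual address.

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* assumed line size */
#define CACHE_SETS       64u  /* assumed number of sets */

/* Which cache set an address maps to under the assumed geometry. */
uint32_t cache_set_index(uint32_t address)
{
    return (address / CACHE_LINE_BYTES) % CACHE_SETS;
}
```

Under these assumptions, routines linked at 0x00401000 and 0x00402000 both map to set 0 and compete for the same lines, while a routine at 0x00401040 maps to set 1. Nothing in either routine's source changed; only the link addresses did.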
Our code reuse is, in many ways, trying to build our new insanely complex algorithm in part from snippets of other insanely complex algorithms, hoping that they fit and make our situation better than starting from scratch, although they may not. This observation sums it up: as developers, we're told of the virtues of both code reuse and refactoring, when in fact, these are opposite approaches to crafting code.
It occurs to me that the problem is neither too much nor too little code reuse. Instead, the problem is that our code reuse is all about helping to build huge algorithms, not creating components. What we're doing doesn't match what we think we're doing.
Now imagine that I wanted to construct a new office building. I suppose I could hire a single worker with every needed construction skill and ask him or her, alone, to make my building for me. But I wouldn't. I would hire a project leader and a team of overseers working together on various goals, who in turn would hire their teams of contractors and laborers to perform their various specialties to get the necessary tasks done, all working in parallel. This approach is faster and safer than the solo worker because its success doesn't depend nearly as much upon an individual. Sure, some work inefficiencies might occur when various steps can't proceed until other steps are completed by other workers, but overall, the teamwork is superior to the single “super builder” approach.
But no, in software development, we prefer that one entity (the CPU) do the entire job and then focus on trying to make him/her/it faster. That's a Bad Idea. If we're going to do anything to make software development more manageable, we need to divide up the work into more reasonable pieces.
To a software developer it may sound like I'm suggesting that we divide jobs into multiple threads, hardly a new idea. No, I emphatically do not suggest this. I have plenty of experience with multithreading and know when it helps and when it hurts. Threads always add complication, uncertainty, and difficulty in testing.
The problem with multithreading is easily illustrated when human beings do it. Suppose you have several tasks to perform, but you're constantly putting one down to work on another one, while being interrupted and redirected by phone calls and visitors. You may be more responsive to specific immediate demands but take longer to complete all your tasks and risk botching them because your attention is divided.
Asking a single processor to jump between tasks and to service interrupts introduces virtually the same reductions in efficiency and reliability. The admittedly greater ability of a processor to “concentrate” on its tasks is more than offset by the lack of “common sense” that enables humans to recognize when a neglected task is getting into trouble. Dividing a CPU across threads doesn't improve the overall picture. Rather, it means that the overall resulting “algorithm” is even more convoluted and less deterministic–hardly the goals we are seeking.
All this indicates to me that when our software development becomes unmanageable and untrustworthy we're probably asking the processor to do too much.
Choosing new models
We've looked at the internal workings and fallacies of the software developers' world. Let's look at how other engineering disciplines approach quality and see if we can apply some of these to software development.
In the mechanical world, complex systems are built out of less-complex assemblies built out of still-less-complex components. There are endless variations on the machine bolt because different needs call for different lengths, diameters, materials, and so on. When creating an assembly that needs to join two pieces of metal, an engineer doesn't need to rethink the question of how pieces can be joined; he or she draws from the existing body of work on bolts and other fasteners and selects a “right” choice on the basis of the specification requirements. Components can be characterized by their specific properties: how strong, how heavy, how durable, acceptable ranges of temperature or pressure, or whatever.
In software development we don't have much of this yet. Right now when we talk about creating “components” we mean complex assemblies that try to gain value by solving lots of variants on a problem. We may say a report-generation module or a spreadsheet-style grid control is a “component.” Nope, those are complex major assemblies. A software equivalent to the bolt might be, for example, the “pointer to next” in a linked list–a completely different level of thinking.
In the world of building construction, and indeed in most large endeavors, we employ a team of workers with differing skill sets working in an overlapping time domain to accomplish a large task. Beyond the timeframe and risk advantages, this approach reduces the breadth of skill required by any individual worker, improving the reliability of that worker's output. The efficiency and success of these projects involve the engineering of the process (design), the communication among members of the team (management), and the workers themselves (proficiency). More overall energy is expended than in a lone effort but it works out much better. In software development, both object orientation and threading have been proposed as comparable concepts but neither one really comes close. I submit that the nearest software equivalent would be to divide a task among a number of processors, each one handling its piece of the work as a member of a larger team.
So the basics of other engineering and design disciplines don't map very well onto our present software development. But recognizing this, I see a choice before us.
We can simply whine that software development “isn't like that,” and resign ourselves to inherently poor quality, striving merely to find the Best Process du Jour that might enable our efforts to suck less.
Or we can start with the requirement that software development should involve trustworthy components, specifications, and margins; that it should allow assemblies of increasing complexity to be built from trustworthy lesser components; it should involve a team approach to performing complex tasks; and it should be something that can be generally dependable and trustworthy. And then–and only then–start building the software development disciplines of the 21st century on these foundations.
Here's a thought. Suppose we pursued the teamwork model with a vengeance and decided that for every small task within a larger computing job, we assign another processor. Suppose that every software “object” or every subroutine call is a processor. Leaders, managers, specialists, laborers would all be implemented as separate processors, each with its own specialized “code.” Sure, this was impossible when computing started out, but in today's world simple processors are pretty cheap. (In fact, you can implement a whole bunch of modest processors inside a single CPLD or FPGA chip.) Sure, this doesn't fit today's reality of desktop PCs for a wide range of general use. It's easier to envision this approach in the embedded computing world where the tasks performed by the hardware are more static. But let's think it through.
Let's use as an example the common DVD player. Open one up and you'll see that there is an optical drive, a power supply, some controls, and so on, all managed by one super processor with a big heat sink doing all sorts of hard work. I don't care what DVD player you have, that processor contains a lot of software. I've also never seen a DVD player that is bug-free. From my cheapie Magnavox player that locks up after idling too long and sometimes forgets to acknowledge the remote control, to my beloved but aging Pioneer player that has decoding errors on just two commercial DVDs out of many hundreds tried, they all have software glitches. And that's overlooking widespread shortcuts in the MPEG-2 decoding that can result in subtly wrong visual effects on many different makes and models. It's software and it's imperfect.
Instead, imagine that the DVD player functionality was split across many component processors. One processor might handle seeking on the optical drive; another would handle focusing the laser. Turning the optical data into byte values, reading bytes into buffers of sectors, and presenting streams of information from consecutive sectors (which span tracks) are all tasks that might each warrant a processor. Some processor is keeping track of the progress through a disc (what is being displayed right now, in terms of title, track and time). Some processor is reversing the CSS encryption in the big VOB file representing one title on the disc. Another is breaking the decrypted VOB into the MPEG program stream packs and dispatching pieces to various other processors for video and audio processing. One video processor breaks out the video data into frames, another breaks the frames into slices, another breaks the slices into macroblocks, others crunch the bit compressed data into chunks of meaning, and still others run the DCT math to reconstruct the macroblocks from their encoding. There are processors keeping track of past frame data for handling the motion compensation for the “P” frames, processors precomputing frames in order to build the “B” frames, and processors reassembling the whole mess according to the presentation time stamps built into the data. You've got processors synchronizing the presentation of video with the work that those other audio processors were doing at the same time in breaking down the compressed audio. On top of this all you've got processors adding on-screen display elements (subtitle overlay for example). At a higher level you have a processor handling the interaction of the user with menus, which may involve the infrared remote receiver (another processor handles turning those blinks into commands) or buttons on the front (still another processor watches and de-bounces each button).
That's a lot of processors! But then, it's a lot of work. Today's DVD players are doing all these tasks already, just not putting a lot of workers on the job. Is it any wonder that there are bugs? And, supporting my premise, is it any wonder that the processors in modern DVD players generally have dedicated hardware accelerators for some of the low-level tasks like DCT math, onscreen display generation, or bit-stream unpacking?
But suppose we went crazy and made each distinct task run in its own processor. Setting aside for a moment some obvious complications like the lack of a single memory space holding the frame data required, the result of using lots of processors still looks more complicated than the current approach. And it is. But look inside each processor and you would see much simpler code. Actually you'd see a lot of today's code spread over a lot of different processors. But each processor would be much simpler in itself. The code, doing only one task, would be reasonably straightforward. And maintainable? Better than maintainable: once you got it right, it would never need to be maintained again! Why redesign a “bolt” that works?
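As a rough illustration of how small one of these single-task pieces could be, here is a sketch in C of the entire job of a button-debouncing processor. The sample count and all names are assumptions; a real design would pick them from the switch's characteristics.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBOUNCE_SAMPLES 4u  /* assumed: consecutive samples needed to accept a change */

typedef struct {
    bool    stable;  /* last accepted button state */
    uint8_t count;   /* consecutive samples that disagree with stable */
} debouncer;

/* Feed one raw sample per tick; returns the debounced state. */
bool debounce_step(debouncer *d, bool raw)
{
    if (raw == d->stable) {
        d->count = 0;             /* agreement: forget any glitch in progress */
    } else if (++d->count >= DEBOUNCE_SAMPLES) {
        d->stable = raw;          /* the change persisted: accept it */
        d->count = 0;
    }
    return d->stable;
}
```

There is one input, one output, and a state small enough to test exhaustively. A piece like this is the kind of “bolt” that could plausibly be characterized once and then left alone.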
In the overall system, you would end up with more overall complexity because in addition to all the various algorithm components in their own processors, you need overseers and a lot of communication among processors to pull it off. If there is only one YUV video data stream coming out of the system (composite or S-Video or component or HDMI or whatever) then there needs to be a processor managing just that stream, being fed information by a lot of different processors at different points, almost certainly with some “middle management” processors gathering data from underlings into fewer high-level elements to combine. The communication and management become the new challenges. So the total amount of code and silicon exceeds the current requirement. Is this progress?
Yes, it is progress because in this model the complexity has been so diluted that the only really complicated part is in putting together the pieces. Most of the pieces, once completed, never need to be touched again, so although the first DVD player built this way would be a major undertaking, the next model would not be. Want to add better on-screen configuration features or progressive scan generation? Most of the system is untouched. And more importantly, when tackling the next generation design (say, a Blu-ray player, which does all of these things but also adds MPEG-4, H.264 and VC-1 decoding, a Java interpreter for the user interface, and new protection systems), you've got components to reuse without fear. Look at it this way. The jump from DVD (480i, at 10 million active pixels per second) to Blu-ray (1080p, at 124 million pixels per second) is a twelve-fold jump in throughput. Such a jump can be accomplished by radically boosting the performance of a system that already has heat-dissipation concerns or by just adding more low-complexity team members to pitch in on doing the same kinds of work. Which sounds easier?
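The throughput figures can be reproduced with simple arithmetic. The sketch below assumes 720 x 480 at 30 full frames per second for 480i and 1920 x 1080 at 60 frames per second for 1080p; those dimensions and rates are assumptions used to match the quoted numbers, not values stated in the text.

```c
#include <stdint.h>

/* Active pixels delivered per second for a given frame size and rate. */
uint64_t pixels_per_second(uint32_t width, uint32_t height, uint32_t frames_per_second)
{
    return (uint64_t)width * height * frames_per_second;
}
```

Under those assumptions, pixels_per_second(720, 480, 30) gives 10,368,000 and pixels_per_second(1920, 1080, 60) gives 124,416,000, a ratio of exactly 12.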
I submit that dividing complex tasks over multiple processors is the way to rein in the mushrooming complexity, reliability, testability, and maintainability concerns.
Now, in this lots-of-processors model, just as in the real world of teams of human workers, communications and management of common resources become big issues. I don't wish to downplay these challenges and I don't have a ready-made solution for how they cooperate. But I know a model to throw away, one we generally use today across multiple processors: the Remote Procedure Call. That is, one processor asks another processor to do a task, waits until it gets the answer, and then moves on. This approach is comfortable to our Big Algorithm worldview but manages to give you the worst of both worlds. It's still a function call in a single algorithm that now has the added weakness of depending upon two processors to not screw up and the communications between them being perfect. No, the right model for multiple processors has to involve dispatching requests and signaling completions. This is a big deal. But it's what we need to focus on to get out of the trap of our present software development paradigm.
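One possible shape for “dispatching requests and signaling completions” is a mailbox that the requester posts to and checks later, rather than a call it waits on. The C sketch below is a single-threaded simulation with hypothetical names and a placeholder job; a real link between two processors would also need atomic accesses or a hardware handshake, which are left out here.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { SLOT_EMPTY, SLOT_REQUESTED, SLOT_DONE } slot_state;

/* Hypothetical one-entry mailbox shared by a requester and a worker. */
typedef struct {
    slot_state state;
    uint32_t   argument;
    uint32_t   result;
} mailbox;

/* Requester side: post the work and return at once. */
bool dispatch_request(mailbox *m, uint32_t argument)
{
    if (m->state != SLOT_EMPTY)
        return false;                   /* earlier request still in flight */
    m->argument = argument;
    m->state = SLOT_REQUESTED;
    return true;
}

/* Worker side: called from the worker's own loop. */
void worker_poll(mailbox *m)
{
    if (m->state == SLOT_REQUESTED) {
        m->result = m->argument * 2u;   /* placeholder for the real job */
        m->state = SLOT_DONE;           /* signal completion */
    }
}

/* Requester side: collect a finished result if there is one; never waits. */
bool take_completion(mailbox *m, uint32_t *result)
{
    if (m->state != SLOT_DONE)
        return false;
    *result = m->result;
    m->state = SLOT_EMPTY;
    return true;
}
```

The difference from a Remote Procedure Call is that neither function on the requester's side ever blocks. Between dispatch_request and a successful take_completion the requester is free to do other work, and it can decide for itself what to do if the completion never arrives.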
But think further. If we can work out the framework for all this communication and management, I'll bet we could work out the ability for the same processor to do different jobs at different times (just as, in constructing that office building, the same electrician w