Subtract software costs by adding CPUs - Embedded.com

Subtract software costs by adding CPUs

Click here for reader response to this article.

The best way to get a project done on time is to break it into smaller, more manageable chunks. That theory works better than some programmers suspect, extending even into multiprocessing systems. Jack Ganssle shows how breaking one big program into smaller programs that each run on their own processor can work wonders on performance, schedules, and sanity.

In 1946 programmers created software for the ENIAC machine by rewiring plug-boards. Two years later the University of Manchester's Small-Scale Experimental Machine, nicknamed Baby, implemented von Neumann's stored program concept, for the first time supporting a machine language. Assembly language soon became available and flourished. But in 1957 Fortran, the first high-level language, debuted and forever changed the nature of programming.

In 1964, Dartmouth BASIC introduced millions of (comparatively) non-techies to the wonders of computing while forever poisoning their programming skills. Three years later, almost as a counterpoint, OOP (object-oriented programming) appeared in the guise of Simula 67. The C language, still the standard for embedded development, and C++ appeared in 1969 and 1985, respectively.

By the 1990s, a revolt against big, up-front design led to a flood of new “agile” programming methods including eXtreme Programming (XP), SCRUM, Test-Driven Development, Feature-Driven Development, the Rational Unified Process, and dozens more.

In the 50 years since programming first appeared, software engineering has morphed to something that would be utterly alien to the software developer of 1946. That half-century has taught us a few pivotal lessons about building programs. Pundits might argue that the most important might be the elimination of goto, the use of objects, or building from patterns.

They'd be wrong. The fundamental insight of software engineering is to keep things small. Break big problems into little ones.

For instance, we understand beyond a shadow of a doubt the need to minimize function sizes. No one is smart enough to understand, debug, and maintain a 1,000-line routine, at least not in an efficient manner. Consequently, we've learned to limit our functions to around 50 lines of code (LoC). Reams of data prove that restricting functions to a page of code or less reduces bug rates and increases productivity.

But why is partitioning so important?

A person's short-term memory is rather like cache—a tiny cache, actually—that can hold only five to nine things before new data flushes the old. Big functions blow the programmer's mental cache. The programmer can no longer totally understand the code; errors proliferate.

The productivity crash
But there's a more insidious problem. Developers working on large systems and subsystems are much less productive than those building tiny applications.

Table 1: IBM productivity in lines of code per programmer per month

Project size
man months
Productivity
lines of code/month
1 439
10 220
100 110
1,000 55

Consider the data in Table 1, gathered from a survey of IBM software projects.1 Programmer productivity plummets by an order of magnitude as projects grow in scope! That is, of course, exactly the opposite of what the boss is demanding, usually quite loudly.

The growth in communications channels between team members sinks productivity on large projects. A small application, one built entirely by a single developer, requires zero communication channels—it's all in the solo guru's head. Two engineers need only one channel.

The number of communication channels between n engineers is:

This means that communication among team members grows at a rate similar to the square of the number of developers (see example in Figure 1). Add more people and pretty soon their days are completely consumed with email, reports, meetings, and memos.


Figure 1: Four workers have six comm channels

Fred Brooks in his seminal (and hugely entertaining) work “The Mythical Man-Month” described how the IBM 360/OS project grew from a projected staffing level of 150 people to over 1,000 developers, all furiously generating memos, reports, and the occasional bit of code.2 In 1975, he formulated Brook's Law, which states: adding people to a late project makes it later. Death-march programming projects continue to confirm this maxim, yet management still tosses workers onto troubled systems in the mistaken belief that an N man-month project can be completed in four weeks by N programmers.

Is it any wonder some 80% of embedded systems are delivered late?

Table 2 illustrates Joel Aron's findings at IBM.2 Programmer productivity plummets on big systems, mostly because of interactions required among team members.

Table 2: Productivity plummets as interactions increase

Interactions Productivity
Very few interactions 10,000 LoC/man-year
Some interactions 5,000 LoC/man-year
Many interactions 1,500 LoC/man-year

The Holy Grail of computer science is to understand and ultimately optimize software productivity. Tom DeMarco and Timothy Lister spent a decade on this noble quest, running a yearly “coding war” among some 600 organizations.3 Two independent teams at each company wrote programs to solve a problem posited by the researchers. The resulting scatter plot looked like a random cloud; there were no obvious correlations between productivity (or even bug rates) and any of the usual suspects: experience, programming language used, salary, and so forth. Oddly, at any individual outfit the two teams scored about the same, suggesting some institutional factor that contributed to highly—and poorly—performing developers.

A lot of statistical head-scratching went unrewarded till the researchers reorganized the data as shown in Table 3.

Table 3: Correlating interruptions with productivity

1st Quartile 4th Quartile
Dedicated workspace 78 sq ft 46 sq ft
Is it quiet? 57% yes 29% yes
Is it private? 62% yes 19% yes
Can you turn off phone? 52% yes 10% yes
Can you divert your calls? 76% yes 19% yes
Frequent interruptions? 38% yes 76% yes

The results? The top 25% were 260% more productive than the bottom quartile!

The lesson here is that interruptions kill software productivity, mirroring Joel Aron's results. Other work has shown it takes the typical developer 15 minutes to get into a state of “flow,” where furiously typing fingers create a wide-bandwidth link between the programmer's brain and the computer. Disturb that concentration with an interruption and the link fails. It takes 15 minutes to rebuild that link but, on average, developers are interrupted every 11 minutes.4

Interruptions are the scourge of big projects.

A maxim of software engineering is that functions should be strongly cohesive but only weakly coupled. Development teams invariably act in the opposite manner. The large number of communication channels makes the entire group highly coupled. Project knowledge is distributed in many brains. Joe needs information about an API call. Mary is stuck finding a bug in the interface with Bob's driver. Each jumps up and disturbs a concentrating team member, destroying that person's productivity.

COCOMO
Barry Boehm, the father of software estimation, derived what he calls the Constructive Cost Model, or COCOMO, for determining the cost of software projects.5 Though far from perfect, COCOMO is predictive, quantitative, and probably the most well-established model extant.

Boehm defines three different development modes: organic, semidetached, and embedded, where “embedded” means a software project built under tight constraints in a complex of hardware, software, and sometimes regulations. Though he wasn't thinking of firmware as we know it, this is a pretty good description of a typical embedded system.

Under COCOMO the number of man-months (MM) required to deliver a system developed in the “embedded” mode is:

where KSLoC is the number of lines of source code in thousands, and Fi are 15 different cost drivers.

Cost drivers include factors such as required reliability, product complexity, real time constraints, and more. Each cost driver is assigned a weight that varies from a little under 1.0 to a bit above. It's reasonable for a first approximation to assume these cost driver figures all null to about 1.0 for typical projects.

This equation's intriguing exponent dooms big projects. Schedules grow faster than code size does. Double the project's size and the schedule will grow by more than a factor of two—sometimes far more.

Despite his unfortunate eponymous use of the word to describe a development mode, Boehm never studied real-time embedded systems as we know them today so there's some doubt about the validity of his exponent of 1.20. Boehm used American Airlines' Sabre reservation system as a prime example of a real-time application. Users wanted an answer “pretty fast” after hitting the enter key. In the real embedded world where missing a deadline by even a microsecond means the 60 Minutes crew appears on the doorstep with multiple indictments, managing time constraints burns through the schedule at a prodigious rate.

Fred Brooks believes the exponent should be closer to 1.5 for real-time systems. Anecdotal evidence from some dysfunctional organizations gives a value closer to 2. The rule of thumb becomes: double the code and multiply man-months by four. Interestingly, that's close to the nearly n2 number of communication channels between engineers noted before.

Let's pick an intermediate and conservative value, 1.35, which sits squarely between Boehm's and Brooks' estimate and is less than most anecdotal evidence suggests. Figure 2 shows how productivity collapses as the size of the program grows.


Figure 2: The productivity collapse

(Most developers rebel at this point. “I can crank out a thousand lines of code over the weekend!” And no doubt that's true. However, these numbers reflect costs over the entire development cycle, from inception to shipping. Maybe you are a superprogrammer and consistently code much faster than, say, 200 lines of code per month. Even so, that only shifts the curve up or down on the graph. The shape of the curve, the exponential loss of productivity, is undeniable.)

Computer science professors show their classes graphs like the one in Figure 2 to terrify their students. The numbers are indeed scary. A million LoC project sentences us to the 32 LoC/month chain gang. We can whip out a small system over the weekend but big ones take years.

Or do they?

Partitioning magic
Figure 2 shows software hell but it also illuminates an opportunity. What if we could somehow cheat the curve, and work at, perhaps, the productivity level for 20-KLoC programs even on big systems? Figure 3 reveals the happy result.


Figure 3: Cheating the schedule's exponential growth

The upper curve in Figure 3 is the COCOMO schedule; the lower one assumes we're building systems at the 20-KLoC/program productivity level. The schedule, and hence costs, grow linearly with increasing program size.

But how can we operate at these constant levels of productivity when programs exceed 20 KLoC? The answer: by partitioning! And by understanding the implications of Brook's Law and DeMarco and Lister's study. The data is stark: if we don't compartmentalize the project, divide it into small chunks that can be created by tiny teams working more or less in isolation, schedules will balloon exponentially.

Professionals look for ways to maximize their effectiveness. As professional software engineers we have a responsibility to find new partitioning schemes. Currently 70 to 80% of a product's development cost is consumed by the creation of code and that ratio is only likely to increase because firmware size doubles about every 10 months. Software will eventually strangle innovation without clever partitioning.

Academics drone endlessly about top-down decomposition (TDD), a tool long used for partitioning programs. Split your huge program into many independent modules; divide these modules into functions, and only then start cranking code.

TDD is the biggest scam ever perpetrated on the software community.

Though TDD is an important strategy that allows wise developers to divide code into manageable pieces, it's not the only tool we available for partitioning programs. TDD is merely one arrow in the quiver, never used to the exclusion of all else. We in the embedded systems industry have some tricks unavailable to IT programmers.

One way to keep productivity levels high—at, say, the 20-KLoC/program number—is to break the program up into a number of discrete entities, each no more than 20KLoC long. Encapsulate each piece of the program into its own processor.

That's right: add CPUs merely for the sake of accelerating the schedule and reducing engineering costs.

Sadly, most 21st-century embedded systems look an awful lot like mainframe computers of yore. A single CPU manages a disparate array of sensors, switches, communications links, PWMs, and more. Dozens of tasks handle many sorts of mostly unrelated activities. A hundred thousand lines of code all linked into a single executable enslaves dozens of programmers all making changes throughout a Byzantine structure no one completely comprehends. Of course development slows to a crawl.

Transistors are cheap. Developers are expensive.

Break the system into small parts, allocate one partition per CPU, and then use a small team to build each subsystem. Minimize interactions and communications between components and between the engineers.

Suppose the monolithic, single-CPU version of the product requires 100K lines of code. The COCOMO calculation gives a 1,403 man-month development schedule.

Segment the same project into four processors, assuming one has 50KLoC and the others 20KLoC each. Communication overhead requires a bit more code so we've added 10% to the 100-KLoC base figure. The schedule collapses to 909 man-months, or 65% of that required by the monolithic version.

Maybe the problem is quite orthogonal and divides neatly into many small chunks, none being particularly large. Five processors running 22 KLoC each will take 1,030 man-months, or 73% of the original, not-so-clever design.

Transistors are cheap—so why not get crazy and toss in lots of processors? One processor runs 20 KLoC and the other nine each run 10-KLoC programs. The resulting 724 man-month schedule is just half of the single-CPU approach. The product reaches consumers' hands twice as fast and development costs tumble. You're promoted and get one of those hot foreign company cars plus a slew of appreciating stock options. Being an engineer was never so good.

Firmware worth its weight in gold

Firmware is the most expensive thing in the universe. In his book Augustine's Laws, Lockheed Martin's Chairman Norman Augustine relates a tale of trouble for manufacturers of fighter aircraft.6 By the late '70s it was no longer possible to add things to these planes because adding new functionality meant increasing the aircraft's weight, which would impair performance. However, the aircraft vendors needed to add something to boost their profits. They searched for something, anything, that weighed nothing but cost a lot. And they found it—firmware! It was the answer to their dreams.

The F-4 was the hot fighter of the '60s. In 2004 dollars these airplanes cost about $20 million each and had essentially no firmware. Today's F-22, just coming into production, runs a cool $257 million per copy. Half of that price tag is firmware.

The DoD contractors succeeded beyond their wildest dreams.

Reduce NRE, save big bucks
Hardware designers will shriek when you propose adding processors just to accelerate the software schedule. Though they know transistors have little or no cost, the EE zeitgeist is to always minimize the bill of materials. Yet since the dawn of the microprocessor age, it has been routine to add parts just to simplify the code. No one today would consider building a software UART, though it's quite easy to do and wasn't terribly uncommon decades ago. Implement asynchronous serial I/O in code and the structure of the entire program revolves around the software UART's peculiar timing requirements. Consequently, the code becomes a nightmare. So today we add a hardware UART without complaint. The same can be said about timers, pulse-width modulators, and more. The hardware and software interact as a synergistic whole orchestrated by smart designers who optimize both product and engineering costs.

Sum the hardware component prices, add labor and overhead, and you still haven't properly accounted for the product's cost. Non-recurring engineering (NRE) is just as important as the price of the printed circuit board. Detroit understands this. It can cost more than $2 billion to design and build the tooling for a new car. Sell one million units and the consumer must pay $2,000 above the component costs to amortize those NRE bills.

Similarly, when we in the embedded world save NRE dollars by delivering products faster, we reduce the system's recurring cost. Everyone wins.

Sure, there's a cost to adding processors, especially when doing so means populating more chips on the printed circuit board. But transistors are particularly cheap inside of an ASIC. A full 32-bit CPU can consume as little as 20,000 to 30,000 gates. Interestingly, users of configurable processor cores average about six 32-bit processors per ASIC, with at least one customer's chip using more than 180 processors! So if time to market is really important to your company (and when isn't it?), if the code naturally partitions well, and if CPUs are truly cheap, what happens when you break all the rules and add lots of processors? Using the COCOMO numbers a one million LoC program divided over 100 CPUs can be completed five times faster than using the traditional monolithic approach, at about one-fifth the cost.

Extreme partitioning is the silver bullet to solving the software productivity crisis.

Intel's pursuit of massive performance gains has given the industry a notion that processors are big, ungainly, transistor-consuming, heat-dissipating chips. That's simply not true; the idea only holds water in the strange byways of the desktop world. An embedded 32-bitter is only about an order of magnitude larger than the 8080, the first really useful processor introduced in 1974.

Obviously, not all projects partition as cleanly as those described here. Some do better, when handling lots of fast data in a uniform manner. The same code might live on many CPUs. Others aren't so orthogonal. But only a very few systems fail to benefit from clever partitioning.

The superprogrammer effect
Developers come in all sorts of flavors, from somewhat competent plodders to miracle workers who effortlessly create beautiful code in minutes. Management's challenge is to recognize the superprogrammers and use them efficiently. Few bosses do; the best programmers get lumped with relative dullards to the detriment of the entire team.

Table 4: The superprogrammer effect

Size in KLoC Best programmer
(months/KLoC)
Worst programmer
(months/KLoC)
1 1 6
8 2.5 7
64 6.5 11
512 17.5 21
2,048 30 32

Table 4 summarizes a study done by Capers Jones.7 The best developers, the superprogrammers, excel on small jobs. Toss them onto a huge project, and they'll slog along at about the same performance as the very worst team members.

Big projects wear down the superstars. Their glazed eyes reflect the meeting and paperwork burden; their creativity is thwarted by endless discussions and memos.

The moral is clear and critically important: wise managers put their very best people on the small sections partitioned off of the huge project. Divide your system over many CPUs and let the superprogrammers attack the smallest chunks.

Though most developers view themselves as superprogrammers, competency follows a bell curve. Ignore self-assessments. In “Unskilled and Unaware of It: How Difficulties in Recognizing One's Own Incompetence Lead to Inflated Self-Assessments,” Justin Kruger and David Dunning showed that though the top half of performers were pretty accurate in evaluating their own performance, the bottom half are wildly optimistic when rating themselves.8

Other benefits
Smaller systems contain fewer bugs, of course, but they also tend to have a much lower defect rate. A recent study by Chu, Yang, Chelf, and Hallem evaluated Linux version 2.4.9 In this released and presumably debugged code, an automatic static code checker identified many hundreds of mistakes. Error rates for big functions were two to six times higher than for smaller routines. Partition to accelerate the schedule, and ship a higher quality product.

NIST (the National Institute of Standards and Technology) found that poor testing accounts for some $22 billion in software failures each year.10 Testing is hard; as programs grow the number of execution paths explodes. Robert Glass estimates that for each 25% increase in program size, the program complexity—represented by paths created by function calls, decision statements and the like—doubles.11

Standard software-testing techniques simply don't work. Most studies find that conventional debugging and QA evaluations find only half the program bugs.

Partitioning the system into smaller chunks, distributed over multiple CPUs, with each having a single highly cohesive function, reduces the number of control paths and makes the system inherently more testable.

In no other industry can a company ship a poorly-tested product, often with known defects, without being sued. Eventually the lawyers will pick up the scent of fresh meat in the firmware world.

Partition to accelerate the schedule, ship a higher quality product and one that's been properly tested. Reduce the risk of litigation.

Adding processors increases system performance, not surprisingly simultaneously reducing development time. A rule of thumb states that a system loaded to 90% processor capability doubles development time (versus one loaded at 70% or less).12 At 95% processor loading, expect the project schedule to triple. When there's only a handful of bytes left over, adding even the most trivial new feature can take weeks as the developers rewrite massive sections of code to free up a bit of ROM. When CPU cycles are in short supply an analogous situation occurs.

More is more
In 1946 only one programmable electronic computer existed in the entire world. A few years later dozens existed; by the 1960s, hundreds. These computers were still so fearsomely expensive that developers worked hard to minimize the resources their programs consumed.

Though the microprocessor caused the price of compute cycles to plummet, individual processors still cost many dollars. By the 1990s, companies such as Microchip and Zilog were already selling complete microcontrollers for sub-dollar prices. For the first time most embedded applications could cost-effectively exploit partitioning into multiple CPUs. However, few developers actually do this; the non-linear schedule/LoC curve is nearly unknown in embedded circles.

Today's cheap transistor-rich ASICs coupled with tiny 32-bit processors provided as “soft” IP means designers can partition their programs over multiple—indeed many—processors to dramatically accelerate development schedules. Faster product delivery means lower engineering costs. When NRE is properly amortized over production quantities, lower engineering costs translate into lower cost of goods. The customer gets that cool new electronic goody faster and cheaper.

Or, you can continue to build huge monolithic programs running on a single processor, turning a win-win situation into one where everyone loses.esp

Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. Contact him at .

End notes

  • Jones, Capers. Applied Software Measurement: Assuring Productivity and Quality , McGraw Hill, New York, NY, 1991.
  • Brooks, Frederick. The Mythical Man-Month: Essays on Software Engineering, 20th Anniversary Edition , Addison-Wesley Professional, New York, NY, 1995.
  • Demarco, Tom and Timothy Lister. Peopleware : Productive Projects and Teams , Dorset House Publishing Company, New York, NY, 1999.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.