Embedded Agile: A Case Study In Numbers
Although Agile methodologies have been in use since the late 1990s, it is still rare to find more than anecdotal evidence for how well they really work. In evaluating whether to use agile methods, engineers have very little on which to base a judgment unless they happen to have some direct knowledge of an agile software project.
This paper describes a developer-led conversion to agile methods where the software team themselves recorded detailed data throughout the project. They used a very simple home-made unit test framework for development in C. Since the close of that project the senior members of the software team built a better unit test framework intended for doing agile software development in C. This paper gives a brief overview of the Catsrunner framework (CATS = C Automated Test System).
More analysis has been done on the data collected during the project, and some additional work has been completed to compare the team's results with the software industry in general. The purpose of this paper is two-fold:
1) close the gap in quantitative understanding of Agile methods for embedded software development (at least, as much as can be done given that the scope covers only one project).
2) describe a test framework for embedded C that was developed based on our agile experience.
The Grain Monitor System (GMS) project entailed building a ruggedized, mobile spectrometer initially for farming applications. Using spectroscopy principles the technology could quantify the components of a material, e.g. how much protein is in a wheat sample.
The team size varied between 4 and 6 members, and development went on for three years. The initial field units were ready about 6 months into the project, but so much was learned in the process of deploying them with a partner farm equipment company that the team continued on to support further work and implement many more new features.
At the start of the project there were many unknowns and technology risks that made it impossible to use waterfall techniques for this work. These included:
1) New scientific algorithm to decode near infra-red signals from grain samples
2) Early customer for new MPC555 microprocessor
3) First use of this operating system by the team
4) First customer of operating system port to 555
5) New prototype near infra-red sensor hardware
6) Early algorithms needed more MIPS than any known microprocessor could provide
7) Must handle extremes of temperature, vibration
8) Very low-noise circuitry required
9) No experience with CAN bus protocol
10) CAN bus protocol standard not in finalized form
11) Difficulty getting early MPC555 chips
12) Team lacked experience in multitasked apps
The team used generic agile practices at first (strong unit tests, iterations, common ownership of the code) and transitioned to full Extreme Programming methods during the project. As with the vast majority of agile teams, this one didn't implement every practice fully or flawlessly.
This was a "green field" project. Using the practices described here for legacy software would not be easy but might be worthwhile, especially if it is a safety-critical application. The advantages of agile methods for safety-critical applications are covered in another paper (Poppendieck and Morsicato, "XP In A Safety-Critical Environment"). Whether to back-fit agile unit tests to a legacy code base is a question that can only be answered case by case, and is outside the scope of this paper.
Data Gathering Methods
My role in the project was as Software Technical Lead. As such I compiled a list of all the defects that were found in integration test or later stages, including any found after delivery to our customer (the partner company that was conducting field trials of the GMS units on real farm machinery). For each defect I wrote up a root cause analysis at the time the defect was resolved.
We had independent testers engaged for the later part of the project, but for most of it, the software team delivered their code directly to internal users and to the partner company. Labor data was reported weekly by the team members themselves, and tracked by the team. The company's official time records were not available to us, and weren't broken out in categories useful to us. Data on source code size and cyclomatic complexity was obtained using C-Metric v. 1.0 from Software Blacksmiths.
At the end of three years of development the product was fully ready for manufacturing. There had been a grand total of 51 software defects since the start of the project (see Figure 1, below). There were never more than two open defects at any one time throughout the project. The team had produced 29,500 (non comment) lines of tested, working embedded code, plus several sets of related utility software that is outside the scope of this paper.
|Figure 1. Defects and software releases over three years|
The embedded GMS C code was equivalent to 230 function points (per the ESLOC-to-function-point conversion cited in the Capers Jones comparison later in this paper). The team's productivity in the first iteration was just under three times the industry norm for embedded software teams. The team became increasingly adept at delivering code on time according to the commitments made at the start of each iteration. Early iteration lengths varied from two to eight weeks, but two weeks became typical, and toward the end of the project a new release could be turned around in one day.
Team size varied during the project from 4 to 6 people, as shown in the staffing profile in Figure 2, below.
|Figure 2. Staffing profile for the GMS project|
The team used the categories shown in Figure 3 below to track labor through the first two years of the project. Year three's labor was not tracked, but there is no reason to think it would vary much from the rest.
|Figure 3. Labor tracking categories used for GMS software|
The labor distribution charts below in Figure 4, Figure 5, and Figure 6 give a view into the activities of the first iteration, first full year (including first iteration), and the second full year. Note that the labor in iteration 1 reflects the activities of a new team that has not worked together before, e.g. much time spent working out team processes.
|Figure 4. Team's labor distribution for first iteration|
|Figure 5. Team's labor distribution for first year of the project|
|Figure 6. Team's labor distribution for second year of project|
The code base for the GMS embedded software grew from zero to a raw line count of 60,638 (see Figure 7, below). C-Metric does a count that omits blank lines, comment lines, and lines with a single brace "}" on them. That filtered count of "effective source lines of code" (ESLOC) was 29,500 for the software at the end of the project. Short header files with long preambles, and lengthy change history blocks in all files, are the main cause of the high percentage of non-code lines.
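The filtering rule described above can be illustrated with a short sketch. This is not C-Metric's implementation; it is a simplified counter written for this article, and the function name `count_esloc` and the decision to skip a lone "{" as well as a lone "}" are assumptions.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the first non-whitespace character of s. */
static const char *skip_space(const char *s)
{
    while (*s != '\0' && isspace((unsigned char)*s)) {
        s++;
    }
    return s;
}

/* True if the text holds only a single brace, ignoring whitespace. */
static bool is_lone_brace(const char *s)
{
    if (*s != '{' && *s != '}') {
        return false;
    }
    return *skip_space(s + 1) == '\0';
}

/*
 * Count "effective" lines in a NUL-terminated C source text, skipping
 * blank lines, comment-only lines, and lone-brace lines.
 * Simplification: a block comment that opens after code on the same
 * line and continues onto later lines is not tracked.
 */
size_t count_esloc(const char *text)
{
    size_t count = 0;
    bool in_block_comment = false;
    char line[512];

    while (*text != '\0') {
        /* Copy the next line, without its newline, into a buffer. */
        size_t len = strcspn(text, "\n");
        size_t n = len < sizeof line - 1 ? len : sizeof line - 1;
        memcpy(line, text, n);
        line[n] = '\0';
        text += len;
        if (*text == '\n') {
            text++;
        }

        const char *p = skip_space(line);

        if (in_block_comment) {
            const char *end = strstr(p, "*/");
            if (end == NULL) {
                continue;               /* still inside the comment */
            }
            in_block_comment = false;
            p = skip_space(end + 2);    /* judge whatever follows it */
        }
        if (*p == '\0') {
            continue;                   /* blank line */
        }
        if (strncmp(p, "//", 2) == 0) {
            continue;                   /* line comment */
        }
        if (strncmp(p, "/*", 2) == 0) {
            const char *end = strstr(p + 2, "*/");
            if (end == NULL) {
                in_block_comment = true;
                continue;               /* comment continues below */
            }
            p = skip_space(end + 2);
            if (*p == '\0') {
                continue;               /* comment-only line */
            }
        }
        if (is_lone_brace(p)) {
            continue;
        }
        count++;
    }
    return count;
}
```

A file that opens with a long change-history comment contributes nothing to this count, which is consistent with the gap between the raw and filtered figures reported above.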
|Figure 7. Growth of the code base over three years|
It isn't possible to directly compute a figure for labor per line of code for two reasons: Much of the coding was change activity, not net additional code; and the team worked on utility applications to let users create and load calibration tables, exercise the hardware for test, or import new algorithm test data into our test harness. The labor for those utility code bases was not broken out separately.
Early in the project, before changing to Extreme Programming methods, the team had difficulty delivering by a target date. There aren't any figures to illustrate this. One of the reasons that Extreme Programming seemed appealing is its practice called "the Planning Game", which brings developers and management into partnership to negotiate the deliverables for each iteration.
The early use of the Planning Game gave us some difficulty. That experience is described in detail in an earlier paper . Once the team mastered the Planning Game technique, their releases were never more than a couple days late unless there was some drastic unforeseen circumstance (happened only once).
The defect rate remained fairly constant over the development period, despite the growing size of the code base. The team averaged about 1.5 defects per month. The open bug list never held more than two items all through development. In Figure 8 below, defects are grouped according to the quarter in which they were reported.
|Figure 8. Absolute number of defects per quarter|
Because the defect rate stayed low, independent of the code size, I conclude that the team's techniques of software development were effective at handling complexity. C-Metric was used to take a look at cyclomatic complexity. Four of the later releases were analyzed and the result was an average cyclomatic complexity of 6 or 7 for each of the releases. For more detail on this metric in our code, refer to .
Although we used agile development, the software still had phases such as detailed design, coding, test, and so on. In agile there is a tight loop of doing requirements, design and coding all in short increments of time so that you can re-run your unit tests about every 10 to 30 minutes.
Illustrated in Figure 9 below is a look at the phase where bugs were inserted and where they were found. This information comes from the root cause analysis of each defect. More discussion on the nature of the defects found is given in .
|Figure 9. Defect life span, year 1 of project|
It should be mentioned that the numerous software releases shown toward the end of the project (in Figure 1) do not represent panicky bug fix activity. Rather this was the software team creating custom releases to help electrical and optics engineers to isolate difficult system-level problems that only appeared when the whole system was running. The software was very stable and the team could deliver well-tested releases on a 1-day turnaround.
Comparison With Software Industry
I was able to make use of three industry sources of data for comparison of this team's performance. The first two are covered briefly since they will not be generally available to readers for measuring their own team's capability. The third (the data from Capers Jones) is something that anyone can make use of if their code can be characterized in terms of function points. This paper will therefore discuss that in some detail.
SEER SEM Estimation Data. Before the start of the project, our management considered an estimating tool called SEER SEM from Galorath. Consultants from that company did an estimate as part of demonstrating the tool. It gave a breakdown of staffers needed for each waterfall style phase and the hours that would be used by each, all based on a figure for lines of code at completion, which they got from me.
The one thing the prediction software could not foresee is the completed size of the application. The point is that with this data I could figure out the value for ESLOC/developer-hour that their database uses for this type of project. It was 1.2 ESLOC/hour. That's for fully tested, working embedded code in C. When iteration 1 was complete, the numbers showed the team had delivered 3.5 ESLOC/hour, or 292% of the industry norm, as given by Galorath's database.
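The percentage quoted above is simply the ratio of the two rates. A minimal sketch of that arithmetic, using a hypothetical helper name chosen for this article:

```c
/* Express a measured rate as a percentage of a baseline rate. */
double percent_of_norm(double team_rate, double norm_rate)
{
    return 100.0 * team_rate / norm_rate;
}
```

Calling `percent_of_norm(3.5, 1.2)` gives roughly 291.7, which rounds to the 292% figure.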
QSM Industry Data. QSM Associates Inc. also supplies software planning tools, and used to offer a free service via their website to compare your team's project data with their database of thousands of projects. I took the opportunity to input data for our iteration 1, such as the number of people on the team, duration, lines of code delivered, defects found, etc. The result was that the "Productivity Index" they calculated for the GMS Iteration 1 ranked us in the 90th percentile! This index, as they compute it, covers code complexity (based on size), schedule, efficiency, effort, and reliability.
Capers Jones Industry Data. Capers Jones, a principal at Software Productivity Research, has accumulated data from a wide variety of software projects, summarized in Figure 10 below.
|Figure 10. Software defect data from Capers Jones with GMS data point added|
The only thing necessary for anyone to compare their team's data with the information from Capers Jones is to be able to state their defects per function point. We did not count function points in our project. Knowing the ESLOC, you can simply look up a conversion to function points on the SPR website. See http://www.theadvisors.com/langcomparison.htm
The data in Figure 10 can be expressed in terms of defects delivered to the customer. The "Best In Class" software teams had 2.0 defects per function point (FP), and a defect removal efficiency of 95%. Defects to customer = Total FP * defects per FP * (1.0 - defect removal efficiency).
|Table 1. Defects delivered to customer per Capers Jones, tabular form|
Let's look at how the "Best In Class" teams would perform if their code was the same size as GMS, that is, 230 function points. Their total number of defects would be 230 * 2, or 460. Then they'd remove 95% of those, leaving 460 * (1.0 - 0.95) = 23, per Table 1 above. They would deliver 23 defects to the customer. The GMS embedded team delivered 21 bugs to their customer, according to the team's defect life span data (Figure 9 shows year 1).
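For readers who want to run the same comparison on their own project, the formula can be written as a small function. This is only a sketch of the arithmetic in the text; the function and parameter names are made up for illustration.

```c
/*
 * Defects delivered to the customer, following the formula in the text:
 * total FP * defects per FP * (1.0 - defect removal efficiency).
 * removal_efficiency is a fraction, for example 0.95 for 95%.
 */
double defects_to_customer(double function_points,
                           double defects_per_fp,
                           double removal_efficiency)
{
    return function_points * defects_per_fp * (1.0 - removal_efficiency);
}
```

With the "Best In Class" inputs used above, `defects_to_customer(230.0, 2.0, 0.95)` evaluates to approximately 23 (floating-point rounding may leave a tiny remainder).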
How to Achieve These Results
Lean Thinking is the fundamental concept underlying Agile software development practices (see Poppendieck, Lean Software Development). The two essentials you must have in place to succeed with this approach are:
1) You must match the amount of work undertaken to your capacity
2) You must mistake-proof the steps you use to produce the work
The first item is satisfied by using agile iteration planning techniques, and is outside the scope of this paper. For a developer-led agile conversion, regulation of the work stream is often very difficult to achieve because management must support it, or at least tolerate it. The second item is covered by a previous paper on agile test techniques for embedded software (Van Schooenderwoert and Morsicato, "Taming the Embedded Tiger").
The remaining sections of this paper discuss the most powerful way of mistake-proofing your software: the use of an appropriate test harness to efficiently catch bugs early.
Dual-platform Unit Testing as Key
For embedded software the hardware represents an extra dimension that must be addressed in the testing strategy. The GMS team built all the code as "dual target" software. It could run on a desktop PC as well as on the target MPC555 microprocessor, through the use of compile-time switches.
This strategy allowed the software to be tested first on the PC where hardware was stable. Timing would be incorrect but the logic could be fully exercised. Other compile-time switches would bypass sensor hardware and inject dummy grain data to drive computations.
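The following sketch shows one way such compile-time switches might look. The paper does not give the actual switch or function names, so `TARGET_MPC555`, `sensor_hw_read`, and `spectrum_sum` are all hypothetical, and the dummy values are arbitrary.

```c
#include <stddef.h>

#define SPECTRUM_POINTS 4

/*
 * TARGET_MPC555 is a hypothetical switch name. Building with
 * -DTARGET_MPC555 would select the real driver; the default is the
 * PC build.
 */
#if defined(TARGET_MPC555)
/* Target build: the real sensor driver is linked in from elsewhere. */
extern void sensor_hw_read(unsigned short *samples, size_t count);
#else
/* PC build: bypass the sensor and inject fixed dummy grain data. */
static void sensor_hw_read(unsigned short *samples, size_t count)
{
    static const unsigned short dummy[SPECTRUM_POINTS] = { 100, 200, 300, 400 };

    for (size_t i = 0; i < count && i < SPECTRUM_POINTS; i++) {
        samples[i] = dummy[i];
    }
}
#endif

/* Platform-independent logic: the same source on both builds. */
unsigned long spectrum_sum(void)
{
    unsigned short samples[SPECTRUM_POINTS] = { 0 };
    unsigned long sum = 0;

    sensor_hw_read(samples, SPECTRUM_POINTS);
    for (size_t i = 0; i < SPECTRUM_POINTS; i++) {
        sum += samples[i];
    }
    return sum;
}
```

Because the computation never knows where its samples came from, its logic can be exercised on the PC long before stable hardware exists.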
The team's unit tests consisted of a conditionally-compiled "main()" within each file that held a set of related functions. This 'tester main' had calls to each function in the module, often multiple calls to the same function but with parameters intended to test boundary cases.
There were perl scripts to execute the 'tester main' routines of all the modules and report the pass-fail status of all the tests. This simple test framework had tests designed to run on both platforms, and was used throughout the duration of the project.
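A 'tester main' of the kind described above might look roughly like this. The module, the switch name `CLAMP_TESTER_MAIN`, and the output format are invented for illustration; the GMS sources are not reproduced here.

```c
#include <stdio.h>

/* Production function: clamp a raw reading into the range [lo, hi]. */
int clamp_reading(int raw, int lo, int hi)
{
    if (raw < lo) {
        return lo;
    }
    if (raw > hi) {
        return hi;
    }
    return raw;
}

#ifdef CLAMP_TESTER_MAIN /* hypothetical switch: cc -DCLAMP_TESTER_MAIN clamp.c */
/* Print one result line and return 1 on failure so failures can be summed. */
static int check(const char *name, int got, int want)
{
    printf("%s: %s\n", name, got == want ? "PASS" : "FAIL");
    return got != want;
}

/* Several calls to the same function, aimed at its boundary cases. */
int main(void)
{
    int failures = 0;

    failures += check("below range", clamp_reading(-1, 0, 10), 0);
    failures += check("at lower bound", clamp_reading(0, 0, 10), 0);
    failures += check("inside range", clamp_reading(5, 0, 10), 5);
    failures += check("at upper bound", clamp_reading(10, 0, 10), 10);
    failures += check("above range", clamp_reading(11, 0, 10), 10);

    return failures ? 1 : 0;
}
#endif
```

A script can then build each module with its switch defined, run the resulting executable, and collect the PASS and FAIL lines or the exit status.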
Catsrunner: A Better Technique
The experience gained via the simple unit test framework of GMS led, a few years later, to the development of Catsrunner and CATS (C Automated Test System) by the partners at Agile Rules, some of whom were on the GMS project. Catsrunner has a more consistent way of inputting test parameters, and its output is easier to interpret. It allows separation of test code from production code. Also it behaves exactly the same on the PC and the target platform.
In short, it's the test framework we wish we'd had time to write during the GMS project. Catsrunner is a C software unit and acceptance testing suite based on CATS (see Figure 11, below). CATS is a cross-platform testing framework for C, especially designed to work well in embedded and multi-platform environments. Catsrunner provides the wrapper that calls the test and reports the results. Catsrunner is open source software released under GPL 2.0. See the Catsrunner entry in the reference list for the download location.
|Figure 11. Top: Catsrunner executing on Host, Bottom: executing on Target|
Catsrunner does three basic things: 1) reads, from the host PC, a list of unit tests to be run; 2) runs each unit test in turn; and 3) sends the results of each test back to the host PC. The middle step (running each unit test) can occur on either platform. The platform is determined by environment variable settings when building the Catsrunner executable. The present version of Catsrunner runs on a PC and on an ARM7 core.
Catsrunner calls CATS, which looks up the name of the test in a table holding pointers to the testing functions. At the heart of the CATS unit testing framework is an array of structures associating the names of functions with pointers to those functions.
When the name of a test function and its input parameters are passed to CATS, it looks up the function name in this array. When Catsrunner executes on the target hardware, it must communicate with the host to know which test to run next, and then to store the result of the test.
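A name-to-function table of this kind can be sketched in a few lines of C. This is not the actual CATS source: the type names, the function names, and the convention that 0 means "pass" are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* A test takes its parameters as text and returns 0 on pass. */
typedef int (*test_fn)(const char *params);

/* One table row: the name of a test and a pointer to its function. */
struct test_entry {
    const char *name;
    test_fn     fn;
};

static int test_always_passes(const char *params)
{
    (void)params;
    return 0;
}

static int test_needs_params(const char *params)
{
    return (params != NULL && params[0] != '\0') ? 0 : 1;
}

static const struct test_entry test_table[] = {
    { "always_passes", test_always_passes },
    { "needs_params",  test_needs_params  },
};

#define TEST_NOT_FOUND (-1)

/* Look the name up in the table and call the matching function. */
int run_named_test(const char *name, const char *params)
{
    for (size_t i = 0; i < sizeof test_table / sizeof test_table[0]; i++) {
        if (strcmp(test_table[i].name, name) == 0) {
            return test_table[i].fn(params);
        }
    }
    return TEST_NOT_FOUND;
}
```

Because tests are selected by name at run time, the list of tests to execute can live in a plain text file on the host rather than being compiled into the runner.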
A module named "Hostops" is part of Catsrunner, and in the case of the ARM7 target, hostops makes use of the Angel background debug monitor to accomplish the data transfer to the host. A user wishing to port Catsrunner to a new target will have to create a version of hostops that makes use of its I/O capabilities to do the equivalent data transfers.
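One way to picture the porting boundary is as a small set of I/O functions that the runner loop calls without knowing how they reach the host. The sketch below is an assumption about the shape of such an interface, not the real Hostops API. All names are hypothetical, and the PC-side versions simply use an in-memory list and `printf`.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical host I/O interface; each port supplies its own functions. */
struct hostops {
    /* Copy the next test name into buf; return 0 when the list is done. */
    int  (*next_test)(char *buf, size_t size);
    /* Deliver one result (0 = pass) back to the host. */
    void (*send_result)(const char *name, int result);
};

/* Stand-in for the framework call that runs a test by name. */
typedef int (*dispatch_fn)(const char *name);

/* The three steps: read the next test, run it, send back the result. */
int run_all(const struct hostops *ops, dispatch_fn dispatch)
{
    char name[64];
    int failures = 0;

    while (ops->next_test(name, sizeof name)) {
        int result = dispatch(name);

        ops->send_result(name, result);
        if (result != 0) {
            failures++;
        }
    }
    return failures;
}

/* PC-side illustration: the "host" is an in-memory list plus stdout. */
static const char *const pc_tests[] = { "led_on", "led_off", "bad_checksum" };
static size_t pc_next = 0;
static int pc_results_sent = 0;

static int pc_next_test(char *buf, size_t size)
{
    if (pc_next >= sizeof pc_tests / sizeof pc_tests[0]) {
        return 0;
    }
    strncpy(buf, pc_tests[pc_next++], size - 1);
    buf[size - 1] = '\0';
    return 1;
}

static void pc_send_result(const char *name, int result)
{
    printf("%s: %s\n", name, result == 0 ? "PASS" : "FAIL");
    pc_results_sent++;
}

/* Demo dispatcher: every test passes except the one named bad_checksum. */
static int demo_dispatch(const char *name)
{
    return strcmp(name, "bad_checksum") == 0 ? 1 : 0;
}

const struct hostops pc_hostops = { pc_next_test, pc_send_result };
```

A port to a new target would replace only the two functions behind `struct hostops`, for example with versions that talk through a debug monitor or a serial link, leaving `run_all` unchanged.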
A Catsrunner Test Examined
Catsrunner's approach to testing divides all the software into two categories: software that is inherently platform-independent, and software that "touches hardware". Platform-independent code can easily be run in an automated fashion, but when software drives a motor or turns on an LED, the result of that test cannot be captured without special test hardware (which was out of the question for us).
When testing hardware-related code on the target platform, we used manual tests. That is, the test code is contained in the unit test file, but when testing actual hardware we'd step through it by hand to watch the behavior of the hardware. Catsrunner uses this philosophy, as illustrated in Figure 12 below ("pure software" indicates platform-independent code).
|Figure 12. Unit test concept for software that drives hardware|
When testing hardware-related software on the PC, we'd capture outputs that would otherwise go to hardware, and the tester code could validate their correctness. For sensor input data, we'd just bring in dummy data in order to let the software continue on.
These practices are reflected in the code by having some modules with layered directories. For an LED module, the main directory would contain the platform-independent parts of the code and be called "led". Below that are directories for each platform, in this case ARM and PC, which contain functions that have the same names but are implemented differently on each platform.
The linker will bring in the platform-independent code from the "led" directory, and only one of the code sets from the lower directories, either "ARCH_ARM" or "ARCH_PC". The prefix "ARCH" indicates architecture-specific software. The directory layout is illustrated in Figure 13 below.
|Figure 13. Directory of LED driver|
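To make the layering concrete, here is a single-file sketch of the PC side of such an LED module. In the layout described above the two parts would live in separate files under "led" and "ARCH_PC"; they are combined here only so the example is self-contained. The function and variable names are hypothetical.

```c
#include <stdbool.h>

/* ---- ARCH_PC part (hypothetical): capture output instead of driving a pin ---- */
static bool captured_pin_level = false;
static int  captured_write_count = 0;

static void led_hw_write(bool level)
{
    captured_pin_level = level;     /* remember what the hardware would see */
    captured_write_count++;
}
/* An ARCH_ARM file would define led_hw_write() with the same signature,
 * writing to a GPIO register instead of these capture variables. */

/* ---- "led" part: platform-independent logic ---- */
static bool led_is_on = false;

void led_set(bool on)
{
    if (on != led_is_on) {          /* touch the hardware only on a change */
        led_is_on = on;
        led_hw_write(on);
    }
}

void led_toggle(void)
{
    led_set(!led_is_on);
}
```

On the PC, tester code can inspect the captured values to validate the logic automatically. On the target, the same calls would be stepped through by hand while watching the LED.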
It would seem that manually stepping through hardware-related code would slow development unacceptably. In practice, the GMS team found it to be no problem because those parts of the code changed little once they were written, and they were well encapsulated. (The team used a more primitive test framework that had this same philosophy for testing hardware-related code.)
This has been a brief introduction to the Catsrunner agile test framework. A complete user manual with much more detail is available with the open source download package (see the Catsrunner entry in the reference list).
The GMS team was a group of ordinary developers who achieved highly extraordinary results through the power of an idea. The team did not work excessive hours. Most needed to learn some significant skill on the job. They didn't follow the agile practices 100%, and didn't have any outside coaching or mentoring in how to use agile development practices.
It has been said that in order to do Extreme Programming you need a team of hand-picked gurus. Not so. All you need is people empowered to govern their work. The powerful idea is simply this: If you make it easier to find bugs than it is to create new ones, you have the possibility of producing bug-free software.
Bug-free software lets you build trust with your sponsors and customers, spend more of your time productively (troubleshooting is waste!), and stay in control of your project. These results are within reach for every software team whose management will support sufficient empowerment.
Nancy Van Schooenderwoert, of Agile Rules / XP Embedded, has extensive experience in building large-scale, real-time systems for flight simulation and ship sonars, as well as software development for safety-critical applications such as factory machine control and
Van Schooenderwoert, Nancy, "Embedded Extreme Programming: An Experience Report", Embedded Systems Conference Boston, 2004.
Poppendieck, Mary, with Ron Morsicato, "XP In A Safety-Critical Environment", Cutter IT Journal, Boston, Sept. 2002.
Van Schooenderwoert, Nancy, "Embedded Agile Project by the Numbers With Newbies", Agile 2006 Conference, 2006.
Poppendieck, Mary and Tom, Lean Software Development, Addison-Wesley Professional, 2003.
Van Schooenderwoert, Nancy, and Morsicato, Ron, "Taming the Embedded Tiger: Agile Test Techniques for Embedded Software", Agile Development Conference, 2004.
Catsrunner test framework for embedded C software is available from http://www.agilerules.com/projects/catsrunner/index.phtml
This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Boston 2006, and is used here with permission.