“When it comes to software quality and reliability, prevention is always better than cure.” So said Professor Roberto V. Zicari, editor of data management resource portal ODBMS.org and professor of database and information systems at Goethe University Frankfurt.
However, commercial pressure often forces software development teams to trade code quality against shipping new features. “No matter what we do, bugs always end up slipping in and being deployed into the field. So, what do you do when bugs do happen? Just as with our own health, investing in prevention is the right thing to do; but we will always need the hospitals. We need cure, just as much as prevention.”
On this premise, he and Greg Law, founder of software failure replay firm Undo, co-authored an e-book, released at the beginning of this year, entitled “10 Tips to Accelerate Time to Resolution of Software Defects”. Drawing on interviews with senior engineers building enterprise software systems about what they do when things go wrong, the book explores how they measure and reduce mean time to resolution (MTTR) when bug reports come in, and how they shorten the average cycle time for resolving bugs.
This article highlights some of the key points from a recent panel discussion held to launch the book and explore the issues it raises.
But before that, here’s a sampling of the key findings:
- “We always have at least one, maybe two, branches that are ready to release. If something does come up, we want to be ready for it. We’re essentially always ready to release.” Bryan Bowyer, director of engineering, Mentor (a Siemens business)
- “We have a zero-tolerance policy against defects. So, we never end up with a backlog of issues building up. That, in itself, saves us a lot of time and effort. When a defect creeps in, we’re pretty sure it’s a new bug we haven’t seen before and not an intermittent failure […] we haven’t previously dealt with.” Roisin McMahon, engineering director, Renesas Electronics Europe
- “Our robust continuous delivery pipeline allows us to develop a fix […] and minimize our mean time to resolution, which is a key metric we use to track the progress of our DevOps journey.” Ken Dickinson, VP of enterprise quality, SAS.
Hosted by Professor Zicari, the panel consisted of:
- Greg Law, Founder/CTO at Undo
- Snehal Khandkar, former senior engineering manager at Rubrik (now at Facebook)
- Haricharan Ramachandra, senior director of engineering, Salesforce
- Ken Dickinson, VP of enterprise quality, SAS
Prevention is better than cure
Is prevention always better than cure? The question raised arguments both ways.
Dickinson: You are assuming that you can prevent all the degradations that can happen to your system. Honestly, I don’t think that you can. I think that you have to account for entropy, you have to account for chaos. No matter how beautiful your testing suite is, there will always be things you cannot account for. You have to invest in both: robust testing within your delivery pipeline and your ability to recover. How quickly can you solve a problem that’s detected in production? That’s why mean time to resolution is important. You can’t predict everything, so how quickly can you recover?
Law: Humans are really bad at writing software. Engineering is always about trade-offs. How early can you get bugs out? And how much do you invest in that? The cost of a bug in an airplane is catastrophic (costs hundreds of lives and can come close to costing the company), compared to the cost of a game or an app where you won’t invest anything like as much in prevention.
Ramachandra: ‘Every feature is a bug in a tuxedo’. Every line of code we add increases the probability of a new bug being formed. That’s why building for failures is important. There are lots of causes of failures: software failures, network failures, hardware failures. Google actually pioneered this concept of designing for failure. There’s a famous paper by Jeff Dean that explains how they designed from the ground up assuming that their hardware will fail. Resilience in engineering is an important concept.
Automated tests: do they work?
Dickinson: Test automation is terrible at finding bugs. But it’s really good at preventing regressions and freeing up time for some of my creative humans to go find those bugs and think of those things that automation does not cover. Automated tests are still scripted tests. They do not discover what you haven’t thought of. They can only validate what you have already thought of. If you install an adequate layer of automated testing, it frees up hours of your week for your humans to go after what you can’t easily reproduce with automation and what you haven’t thought of yet. Where I’ve seen the most value in automation is in what humans can do when they have that as a tool at their disposal.
Metrics: how do you use them in practice?
What do terms like mean time to root cause and mean time to resolve mean? How do you use them in practice?
Ramachandra: These metrics relate to how we’re handling things when stuff happens in production.
- How long does it take to figure out what is happening – that’s mean time to root cause. From the time we acknowledge to the point we figure out what’s the problem.
- When do we fix it – that’s mean time to resolve.
- And before all of that, we also have mean time to acknowledge: is it my problem or your problem? Sometimes it takes a long time for teams to acknowledge that it is somebody’s problem.
The reason why we break it down this way is to understand where the bottlenecks are and help us build processes around these measures. If we don’t measure it, we can’t fix it.
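To make the breakdown concrete, here is a minimal sketch of how the three measures could be computed from incident timestamps. The `Incident` record and its field names are hypothetical, and measuring mean time to resolve from the initial report (rather than from acknowledgement) is an assumption; the panel did not specify the start point.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for illustration.
    reported: datetime      # when the problem was first raised
    acknowledged: datetime  # when a team accepted it as their problem
    root_caused: datetime   # when the cause was identified
    resolved: datetime      # when the fix was in place


def mean_hours(deltas):
    """Average an iterable of timedeltas and express the result in hours."""
    return mean(d.total_seconds() for d in deltas) / 3600


def incident_metrics(incidents):
    """Split the incident lifecycle into the three intervals described above."""
    return {
        # Report -> a team takes ownership.
        "mean_time_to_acknowledge_h": mean_hours(
            i.acknowledged - i.reported for i in incidents
        ),
        # Acknowledgement -> cause understood.
        "mean_time_to_root_cause_h": mean_hours(
            i.root_caused - i.acknowledged for i in incidents
        ),
        # Report -> fix in place (assumed start point).
        "mean_time_to_resolve_h": mean_hours(
            i.resolved - i.reported for i in incidents
        ),
    }
```

Reporting the intervals separately, rather than one end-to-end number, is what lets a team see whether the bottleneck is ownership, diagnosis, or the fix itself.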
Law: The above describes state-of-the-art best practice. My experience is that most software organizations, even large companies with greater resources, are not at that level yet. The most basic level – and this always makes me wince a bit – is where people have ‘open defect count’ as a primary KPI on how they are doing. If I go on a big testing spree, and I uncover 500 bugs and I put them in my bug tracking system, my software hasn’t got worse. I just know a little bit more about the ugly truth. Going back a few years, I’ve even seen a resistance to filing a bug because that’s going to make my KPI worse. If you’re going to measure that kind of thing, it’d be much more useful to measure ‘closed rate’ and the age of defects in the system. One of the insightful points that surfaced in one of the interviews I did is this: the longer a bug has been languishing untouched in your bug tracking system, the harder it is to reproduce.
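As a rough illustration of the two alternatives to open defect count, the sketch below computes a closed rate and the ages of still-open defects. The `Defect` record is hypothetical, and the definition of closed rate used here (the share of defects opened in a period that were closed by the end of it) is an assumption, since the panel did not define the term precisely.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Defect:
    # Hypothetical tracker record; real systems carry far more fields.
    opened: date
    closed: Optional[date] = None


def closed_rate(defects, start, end):
    """Share of defects opened in [start, end] that were closed by `end`.

    Returns None when nothing was opened in the period.
    """
    opened = [d for d in defects if start <= d.opened <= end]
    if not opened:
        return None
    closed = [d for d in opened if d.closed is not None and d.closed <= end]
    return len(closed) / len(opened)


def open_defect_ages(defects, today):
    """Ages in days of defects with no close date, oldest first."""
    return sorted(
        ((today - d.opened).days for d in defects if d.closed is None),
        reverse=True,
    )
```

Unlike a raw open count, neither number gets worse simply because a testing spree filed more bugs; the closed rate only drops if the new bugs are left unresolved, and the age list surfaces the ones that are going stale.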
Issues with automated tests
Dickinson: We have a quarantine policy on automated tests. It’s different with every team, but for example if a test fails 3 times within 2 weeks, it’s a flaky test; that test gets pulled out of the deployment pipeline and it is continuously executed – just not gating anything – and that test has to prove stability before it can be reintroduced into the pipeline. Stability matters.
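A quarantine rule of this kind might be sketched as follows. The three-failures-in-two-weeks threshold comes from Dickinson’s example; the class name, the method names, and the rule that a test “proves stability” by passing a fixed number of consecutive runs (`stable_runs`) are assumptions made for illustration.

```python
from collections import defaultdict, deque
from datetime import timedelta


class QuarantineTracker:
    """Quarantines a test once it fails `max_failures` times within `window`."""

    def __init__(self, max_failures=3, window=timedelta(weeks=2), stable_runs=20):
        self.max_failures = max_failures
        self.window = window
        self.stable_runs = stable_runs      # assumed bar for reinstatement
        self.failures = defaultdict(deque)  # test name -> recent failure times
        self.quarantined = {}               # test name -> consecutive passes

    def record(self, test, passed, when):
        """Record one run of `test` at datetime `when`."""
        if test in self.quarantined:
            # Quarantined tests keep running; a failure resets the streak.
            streak = (self.quarantined[test] + 1) if passed else 0
            self.quarantined[test] = streak
            if streak >= self.stable_runs:
                # Stability proven: return the test to the gating set.
                del self.quarantined[test]
                self.failures[test].clear()
            return
        if passed:
            return
        recent = self.failures[test]
        recent.append(when)
        # Forget failures that have aged out of the window.
        while when - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.max_failures:
            self.quarantined[test] = 0

    def is_gating(self, test):
        """True if a failure of this test should block the pipeline."""
        return test not in self.quarantined
```

The pipeline would consult `is_gating` before letting a failure block a deployment, while still calling `record` for every run so that quarantined tests keep producing data.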
Law: If everybody just ignored these flaky tests when they go off, it’s like having a smoke alarm that’s always sounding and people start ignoring things. But you do need a strategy for dealing with them. It is a smoke alarm: “I’m smelling smoke in the codebase”. It’s easy to dismiss but it can be costly. One of the things we’ve seen our customers do very effectively is quarantine those flaky tests and run them again and again to get the information they need in order to root cause them – whether that’s running them under recording with Software Failure Replay or whatever means you have available. You need to get it out of the pipeline but not ignore it.
Ramachandra: With reference to flaky tests – or transient failures – one of the challenges that most software engineers in our organization face when it comes to quality is separating the signals from the noise. There’s so much noise and not enough scrutiny on transient failures. Why are these tests flaky? It’s not just about the test code. It might be application behavior or the test environment. Often our test environment is different from the production environment. We cannot think of all the scenarios that can happen in production.
Khandkar: Automated tests safeguard you against regressions; they are testing against the bugs you already know of. They are not helping you find new bugs. At Rubrik, we found it valuable to introduce chaos in our large-scale systems. If you have a large-scale test setup, don’t execute it to plan; add some chaos to it, some unexpected behavior and see how your system responds to it. That was most effective in finding out new bugs.
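One simple way to add that kind of unexpected behavior to a test setup is to wrap a dependency so that some calls fail on purpose. The sketch below is a generic illustration, not a description of Rubrik’s tooling; `ChaosProxy`, `call_with_retries`, and `store_lookup` are hypothetical names, and the injected fault is limited to a `ConnectionError`.

```python
import random


class ChaosProxy:
    """Wraps a callable and makes a configurable fraction of calls fail."""

    def __init__(self, target, failure_rate, seed=0):
        self.target = target
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so a failing run can be replayed
        self.injected = 0               # number of faults injected so far

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise ConnectionError("injected fault")
        return self.target(*args, **kwargs)


def call_with_retries(fn, attempts, *args):
    """Example system under test: retry a dependency a bounded number of times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return fn(*args)
        except ConnectionError as err:
            last_error = err
    raise last_error


def store_lookup(key):
    # Stand-in for a real backend call.
    return key.upper()
```

Running the same test suite with `ChaosProxy(store_lookup, failure_rate=0.2)` in place of `store_lookup` exercises the error-handling paths that a run executed strictly to plan never touches, and the fixed seed makes any failure it uncovers repeatable.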
Reality vs lab testing
Zicari: We do all the tests in the lab and it works, and the same software in production might not work in the way I was thinking. Any of us could be a patient going to hospital, where software is used for doing something serious. How would I feel about it if I heard that they released the software, but it might not work as expected? What is your reaction to people like me who will think ‘gosh, this is scary!’?
Dickinson: The cost of a failure is directly influenced by how quickly we can recover from it. If we introduce a bug, but we can detect it and resolve it – oftentimes before the customer notices it – then the cost of failure is in the basement; so we can afford to be more innovative, and more aggressive in the changes that we introduce. If we’re operating in a market where it’s not the case, then we ratchet back the amount of aggressiveness.
Khandkar: The cost of a bug is very different if you’re looking at a failure in hospital software or software that goes in an airplane versus a gaming app. If I’m an airplane software writer, I will put in a lot more safeguards and checks; maybe not so much if I’m at the other end of the spectrum.
Law: You’re right to be scared. You should be terrified. In one of the earlier cases of software bugs killing people, there was a machine in a hospital giving radiation treatment for cancer. They tested it and the software worked fine. Once deployed in production, the operators started punching the keys to control the dosage quicker and quicker the more experienced they became; they started doing it too quickly and there was some integer overflow and the patient got fried and got killed by the device sending a hundred times too much radiation into the patient. The testing environment turned out to be different from the reality in an unexpected way.
“The vast majority of software in the world is not really understood by anybody.”
The full hour-long panel discussion can be viewed here.
The e-book, 10 Tips to Accelerate Time to Resolution of Software Defects, can be accessed here.