Designing for safety and security in a connected system
Good embedded software has always been designed for both safety and security. However, connectivity has introduced intolerable levels of security vulnerability in safety-critical applications such as medical devices, autonomous vehicles, and Internet of Things (IoT) devices.
The tight coupling of safety and security, combined with heightened threat levels, requires developers to fully understand the difference between the two and to apply industry best practices so that both are designed into a product right from the start (Figure 1).
Figure 1. Filtering out defects: Good software and hardware design demands that many layers of quality assurance, safeguarding, and protection be employed throughout the design process (Source: Barr Group)
The implications of poor design
With the IoT, systems are now vulnerable to "action at a distance." A recent case involved Sony networked security cameras that were found to have backdoor accounts. Hackers could use these accounts to infect the systems with botnet malware and to launch further attacks. Sony has since developed a firmware patch that users can download to close the backdoor, but many coding or design errors are non-recoverable and can be catastrophic.
To prove this point, two security researchers hacked an automobile while it was in motion, taking over the dashboard functions, steering, transmission, and brakes. The hack wasn't malicious, but was instead performed with permission of the driver to let the researchers demonstrate how easy it was to hack the network carrier's connection. Nonetheless, this hack led to 1.4 million vehicles being recalled.
Of course, systems don't have to be connected to the Internet to be vulnerable and rendered unsafe: poorly written embedded code and design decisions have already taken their toll. Take the case of the Therac-25, a radiation therapy machine introduced in 1983 to treat cancer. This is now a recognized case study in what not to do with regard to system design. A combination of software bugs, a lack of hardware interlocks, and generally poor design decision making led to fatal doses of radiation.
The main culprits in the case of Therac-25 were found to include:
- Immature and inadequate software development process ("untestable software").
- Incomplete reliability modeling and failure mode analysis.
- No (independent) review of critical software.
- Improper software re-use from older models.
One of the main failure modes involved a 1-byte counter in a test routine that frequently overflowed. If an operator provided manual input to the machine at the moment of overflow, the software-based interlock used by the system would fail.
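The following is a minimal sketch of this class of bug, not the actual Therac-25 code. The function and variable names are hypothetical. It assumes a 1-byte flag in which any nonzero value means "a safety check is still required," and contrasts incrementing that flag with assigning it a fixed value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags: nonzero/true means "a safety check is still required". */
static uint8_t check_needed_buggy = 0u;
static bool check_needed_fixed = false;

/* Buggy pattern: incrementing a 1-byte flag wraps to 0 on every 256th call,
 * so the check is silently skipped on that pass. */
void request_check_buggy(void)
{
    check_needed_buggy++;
}

bool must_check_buggy(void)
{
    return check_needed_buggy != 0u;
}

/* Safer pattern: assign a fixed value instead of counting, so there is
 * nothing that can overflow. */
void request_check_fixed(void)
{
    check_needed_fixed = true;
}

bool must_check_fixed(void)
{
    return check_needed_fixed;
}
```

After 256 requests, `must_check_buggy()` reports that no check is needed, while `must_check_fixed()` still demands one. Any input that arrives during that window bypasses the software interlock.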
In June of 1996, Ariane 5, Flight 501, departed from its intended flight plan and self-destructed because overflow checks were omitted for efficiency. When a variable holding horizontal velocity overflowed, there was no way to detect it and respond appropriately.
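As an illustration of the kind of check that was omitted, here is a minimal C sketch of a range-checked narrowing conversion. It is not the flight software; the function name `to_int16_checked` is hypothetical, and the choice of a 16-bit target is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Converts a floating-point value to int16_t only if it fits.
 * Returns false and leaves *out untouched when the value is out of range.
 * The comparison is written so that NaN (for which every comparison is
 * false) is rejected as well. */
bool to_int16_checked(double value, int16_t *out)
{
    if (out == 0) {
        return false;
    }
    if (!(value >= (double)INT16_MIN && value <= (double)INT16_MAX)) {
        return false;
    }
    *out = (int16_t)value; /* in range, so the conversion is well defined */
    return true;
}
```

The important design choice is that the failure is reported to the caller, which must then decide how to respond (for example, by saturating the value or switching to a degraded mode) instead of continuing with a corrupted number.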
Still, critical code and security vulnerabilities remain unchecked. In fact, Barr Group's 2017 Embedded Systems Safety and Security Survey revealed that, of engineers working on projects that are connected to the Internet and can kill if hacked:
- 22% do not have security as a design requirement.
- 19% follow no coding standard.
- 42% conduct only occasional code reviews or none at all.
- 48% do not bother to encrypt their communications over the Internet.
- More than 33% are not performing static analysis.
Understanding the true meanings of safety and security is a good start down the path toward remedying this situation.
Defining safety and security
The terms safety and security are often commingled. Some developers are under the misconception that if they write good code, it will be both safe and secure. That's clearly not the case.
A "safe" system is one that, during normal operation, itself causes no harm to the user or anyone else. A "safety critical" system is a system that can cause injury or death when it malfunctions. The designer's goal, then, is to ensure, as much as possible, that a system doesn't malfunction or fail.
Security, on the other hand, is primarily concerned with a product's ability to make its assets available to authorized users while also protecting them from unauthorized access, such as by hackers. These assets include in-flow or dynamic data, code and intellectual property (IP), processors and system control centers, communications ports, memory, and storage with static data.
By now it should start to become clear that, while a system can be secure, it's not automatically safe: a dangerously designed system can be just as secure as a safely designed system. However, an insecure system is always unsafe, because even if it's functionally safe at the outset, its vulnerability to unauthorized access means it can be rendered unsafe at any point.
Designing for safety and security
When it comes to designing for safety, there are many factors to consider, as the Therac-25 example showed. However, designers can control only their aspect of the design, and the focus in this article is on firmware.
A good example of a mission-critical application is the modern automobile. It can have upward of 100 million lines of code, yet it is placed in the hands of often-undertrained or distracted users (drivers). To compensate for such users, even more safety features and code are being added, in the form of cameras and sensors, as well as vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) communications. The amount of code keeps increasing. Exponentially.
While the sheer quantity of code makes coding and debugging such a system more difficult, much of the debug time can be eliminated if some core tenets are followed, such as:
- Hardware/software partitioning that factors in real-time performance, cost, upgradability, security, reliability, and safety.
- Implementation of fault-containment regions.
- Avoidance of single points of failure (Figure 2).
- Handling exceptions caused by code bugs, the program itself, memory management, or spurious interrupts.
- Inclusion of overflow checks (omitted in the Therac-25 and the Ariane 5 rocket).
- Sanitization of tainted data from the outside world (use range checking and CRCs).
- Testing at every level (unit test, integration test, system test, fuzzing, verification and validation, among others).
Figure 2. Safety-critical systems avoid single points of failure. (Source: Professor Phil Koopman)
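To make the "sanitization of tainted data" tenet concrete, here is a minimal sketch that validates an incoming frame with a CRC and a range check before using it. The 4-byte frame layout (a big-endian 16-bit setpoint followed by a big-endian CRC), the limit `SETPOINT_MAX`, and the function names are assumptions made for illustration. The CRC shown is the common CRC-16 variant with polynomial 0x1021 and initial value 0xFFFF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_LEN    4u     /* hypothetical: 2 data bytes + 2 CRC bytes */
#define SETPOINT_MAX 1000u  /* hypothetical upper limit for the setpoint */

/* Bitwise CRC-16 (polynomial 0x1021, initial value 0xFFFF, no reflection). */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if ((crc & 0x8000u) != 0u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Accepts a frame only if its length, CRC, and value range are all valid.
 * On any failure, *setpoint is left untouched. */
bool parse_setpoint(const uint8_t *frame, size_t len, uint16_t *setpoint)
{
    if (frame == NULL || setpoint == NULL || len != FRAME_LEN) {
        return false;
    }
    uint16_t received_crc = (uint16_t)(((uint16_t)frame[2] << 8) | frame[3]);
    if (crc16_ccitt(frame, 2u) != received_crc) {
        return false; /* corrupted in transit */
    }
    uint16_t value = (uint16_t)(((uint16_t)frame[0] << 8) | frame[1]);
    if (value > SETPOINT_MAX) {
        return false; /* intact, but outside the permitted range */
    }
    *setpoint = value;
    return true;
}
```

Note that a CRC detects accidental corruption only. It does not authenticate the sender, so it complements the security measures discussed below and does not replace them.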
For security, a designer or developer needs to become familiar with the intricacies of user and device authentication, public key infrastructure (PKI), and data encryption. Along with making assets available to authorized users and protecting them from unauthorized access, security also means a system does not do unexpected or unsafe things in the face of an attack or malfunction.
Of course, attacks come in various forms, including basic denial of service (DoS) and distributed DoS (DDoS). While developers can't control what attacks the system, they can control how the system reacts to the attack, and that awareness of how to react must apply system-wide. A system is only as secure as its weakest link, and it's wise to assume that the attacker will find that link.
One commonly targeted weak link is firmware updating on a device with its Remote Firmware Update (RFU) feature enabled. This feature can easily be attacked, so it's good to have a policy in place, such as having the user choose either to disable RFU or to load an update that requires subsequent images to be digitally signed.
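Such a policy might be enforced in firmware along the following lines. This is a sketch only: the `rfu_policy_t` structure, the `verify_fn` callback, and the function names are hypothetical, and the actual signature verification is assumed to be supplied by a vetted cryptographic library. The placeholder verifier shown here simply rejects everything.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback that checks an image's digital signature. */
typedef bool (*verify_fn)(const uint8_t *image, size_t len);

typedef struct {
    bool rfu_enabled;        /* user may disable remote updates entirely */
    bool signature_required; /* once set, unsigned images are refused */
    verify_fn verify;        /* provided by a real crypto library */
} rfu_policy_t;

/* Placeholder verifier for illustration: fails closed by rejecting all images. */
bool verify_reject_all(const uint8_t *image, size_t len)
{
    (void)image;
    (void)len;
    return false;
}

/* Decides whether an incoming image may be installed under the policy. */
bool rfu_accept_image(const rfu_policy_t *policy,
                      const uint8_t *image, size_t len)
{
    if (policy == NULL || image == NULL || len == 0u) {
        return false;
    }
    if (!policy->rfu_enabled) {
        return false;
    }
    if (policy->signature_required) {
        /* A missing verifier is treated as a failed verification. */
        return (policy->verify != NULL) && policy->verify(image, len);
    }
    return true;
}
```

The sketch fails closed: a disabled feature, a missing verifier, or a bad signature all lead to the image being refused.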
It may seem counterintuitive, but cryptography is rarely the weakest link. Instead, attackers look elsewhere for attack surfaces exposed by weaknesses in implementation, protocol security, APIs, and usage, or opened up by side-channel attacks.
How much effort, time, and resources go into each of these areas depends on the type of security threat, with each having specific defenses. Some general measures a developer can take to reduce product vulnerability are as follows:
- Use a microcontroller with no external memories.
- Disable the JTAG interface.
- Implement secure boot.
- Use a master key to generate each unit's device-specific key.
- Use object code obfuscation.
- Implement power-on self-test (POST) and built-in self-test (BIST).
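As one small example from the list above, a power-on self-test often includes a basic RAM check. The sketch below writes alternating bit patterns to a memory region and reads them back. The function name and the choice of patterns are assumptions, and a production POST would typically use a more thorough algorithm (such as a march test) and run before the region is in use, since this test overwrites its contents.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Destructive RAM check: fills the region with 0x55, verifies it, then
 * repeats with 0xAA so that every bit is tested as both 0 and 1.
 * 'volatile' keeps the compiler from optimizing the accesses away. */
bool post_ram_test(volatile uint8_t *region, size_t len)
{
    static const uint8_t patterns[] = {0x55u, 0xAAu};

    if (region == NULL) {
        return false;
    }
    for (size_t p = 0; p < sizeof patterns; p++) {
        for (size_t i = 0; i < len; i++) {
            region[i] = patterns[p];
        }
        for (size_t i = 0; i < len; i++) {
            if (region[i] != patterns[p]) {
                return false; /* stuck or faulty bit detected */
            }
        }
    }
    return true;
}
```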
Speaking of "obfuscation," there is a school of thought that teaches "security through obscurity." This thinking can literally be fatal if it is solely relied upon, as every secret creates a potential point of failure. Sooner or later, secrets escape, whether through social engineering, disgruntled employees, or through techniques such as dumping and reverse engineering. There is a role for obscurity, of course, such as keeping cryptographic keys secret.
Ensuring safety and security
While there are many techniques and technologies that can help developers and designers achieve a high level of both safety and security, there are a few fundamental steps that can ensure a system is optimized as much as reasonably possible. First, design to established coding, functional safety, and industry and application-specific standards. These include MISRA and MISRA-C, ISO 26262, Automotive Open System Architecture (Autosar), IEC 60335, and IEC 60730.
Adopting a coding standard like MISRA not only helps keep bugs out, it also makes code more readable, consistent, and portable (Figure 3).
Figure 3. The benefits of adopting a coding standard like MISRA go beyond helping to keep bugs out: it also makes code more readable, consistent, and portable (Source: Barr Group)
Second, use static analysis (Figure 4). This involves analyzing the software without executing the program. It's a symbolic execution, so it's essentially a simulation. In contrast, dynamic analysis uncovers defects during runtime execution of the actual code on a target platform.
Figure 4. Static analysis tools run a "simulation" of the source file, analyze for syntax and logic, and output warnings instead of object files (Source: Barr Group)
While static analysis is not a silver bullet, it does add another layer of assurance as it is very good at detecting potential bugs, like the use of uninitialized variables, possible integer overflow/underflow, and the mixing of signed and unsigned data types. Plus, static analysis tools are constantly improving.
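The mixing of signed and unsigned types is a good illustration of a defect that compiles cleanly yet misbehaves. In the hypothetical sketch below (the function names are invented for this example), the first version is the sort of code that static analysis tools and stricter compiler warning levels typically flag, and the second shows one way to resolve it.

```c
#include <stdbool.h>

/* Flagged pattern: 'limit' is implicitly converted to unsigned int for the
 * comparison, so a negative limit such as -1 becomes a very large value and
 * the test passes for any realistic length. */
bool within_limit_buggy(unsigned int length, int limit)
{
    return length < limit;
}

/* One possible fix: handle the negative case explicitly, then compare
 * values of the same type. */
bool within_limit_fixed(unsigned int length, int limit)
{
    if (limit < 0) {
        return false;
    }
    return length < (unsigned int)limit;
}
```

With a limit of -1, `within_limit_buggy(10u, -1)` returns true, while `within_limit_fixed(10u, -1)` returns false.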
Usually, static analysis implies use of a dedicated tool such as PC-Lint or Coverity, but developers should also consider re-analyzing their own code.
Third, undertake code reviews. This will improve code correctness while also helping with maintainability and extensibility. Code reviews also help in instances of recalls/warranty repairs and product liability claims.
Fourth, perform threat modeling. Start by using an attack tree. This requires a developer to think like an attacker and perform the following actions:
- Identify attack goals:
  - Each attack has a separate tree.
- For each tree (goal):
  - Determine the different attacks.
  - Identify the steps and options for each attack.
Note that this type of analysis benefits greatly from having multiple perspectives.
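An attack tree can also be captured as a small data structure so that alternatives can be compared. The sketch below is one possible representation, not a standard one: the type names and the idea of scoring each leaf with an estimated attacker "cost" are assumptions for illustration. Options are modeled as OR nodes (the attacker picks the cheapest) and required steps as AND nodes (the attacker must complete them all).

```c
#include <limits.h>
#include <stddef.h>

typedef enum { NODE_LEAF, NODE_OR, NODE_AND } node_type_t;

typedef struct attack_node {
    const char *name;
    node_type_t type;
    unsigned int cost; /* leaf only: estimated attacker effort */
    const struct attack_node *const *children;
    size_t child_count;
} attack_node_t;

/* Returns the cheapest way to achieve a node's goal:
 * OR takes the minimum over its children, AND sums them (saturating). */
unsigned int attack_cost(const attack_node_t *node)
{
    if (node == NULL) {
        return UINT_MAX;
    }
    if (node->type == NODE_LEAF) {
        return node->cost;
    }
    unsigned int result = (node->type == NODE_OR) ? UINT_MAX : 0u;
    for (size_t i = 0; i < node->child_count; i++) {
        unsigned int c = attack_cost(node->children[i]);
        if (node->type == NODE_OR) {
            if (c < result) {
                result = c;
            }
        } else {
            result = (c > UINT_MAX - result) ? UINT_MAX : result + c;
        }
    }
    return result;
}
```

For example, a goal of "replace firmware" might have two options: a malicious update that requires both intercepting an update (cost 5) and forging a signature (cost 100), or using an open JTAG port (cost 20). The tree evaluates to 20, which points at the JTAG port as the weakest link. The recursion is acceptable for an offline analysis tool, although it would usually be avoided in safety-critical firmware itself.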
But who has time to do it right?
As straightforward as it seems to perform the four basic steps presented above so as to minimize errors and enhance safety and security, they do take time, so a developer must budget accordingly. While project sizes vary, it's important to be as realistic as possible.
For example, add 15 to 50 percent to the design time for code review. Some systems need complete code reviews; some don't. Static analysis tools can take ten to hundreds of hours for initial setup, but once they are part of the development process, they add no further time to product development and end up paying for themselves through better systems.
Connectivity has added a big new concern to embedded systems design that requires extra emphasis on security and safety. A good understanding of these two concepts, combined with the proper application of best practices early in the design cycle, dramatically improves the overall safety and security of a product. These best practices include adopting coding standards, using static analysis tools, conducting code reviews, and performing threat modeling.
Andrew Girson, Barr Group co-founder and CEO, has over 20 years of experience in the embedded systems industry, first as a senior embedded software engineer and subsequently in executive roles as a CTO, VP of Sales and Marketing, and CEO. Andrew is the author of dozens of printed articles and conference presentations regarding high-quality embedded, wireless, and handheld systems, and he has served on the Board of Directors and Advisory Boards for several organizations. Andrew holds B.S. and M.S. degrees in electrical engineering from the University of Virginia. Andrew may be contacted via firstname.lastname@example.org
Dan Smith, Principal Engineer at Barr Group, has more than two decades of product development and project management experience in embedded systems design for the consumer electronics, industrial controls, telecommunications, medical devices, and automotive electronics industries. Dan is an expert in the industry's best practices used for embedded systems design for safety-critical applications, and has been a speaker at several industry conferences. He is currently leading Barr Group's "Best Practices for Designing Safe and Secure Embedded Systems" training course. Dan has a B.S. in Electrical Engineering from Princeton University. Dan may be contacted via email@example.com