Current programming technology is rather fragile: programs regularly crash, hang, or even give viruses free rein. An important reason is that many programs are developed in low-level programming languages. One of the most extreme instances is the widespread use of the C programming language.
Whereas most modern programming languages require an exception to be raised when exceptional behavior occurs (e.g. dereferencing a NULL pointer, integer overflow, or accessing an array out of bounds), C imposes no such requirement.
Instead, it classifies these behaviors as undefined and allows a program to do literally anything in such situations. On the one hand, this allows a compiler to omit runtime checks and to generate more efficient code; on the other hand, these undefined behaviors often lead to security vulnerabilities. There are two main approaches for improving this situation:
(1) Switch to a more modern, higher-level programming language. This approach reduces the number of programming errors, and if an error remains, the chance of it being exploitable is much lower.
One disadvantage of this approach is that it puts a thicker layer between the program and the hardware. This costs performance, both in execution speed and in memory usage, and it also reduces control over the behavior of the system. Especially for embedded systems and operating system kernels, this is an undesirable consequence.
(2) Stick to a low-level programming language like C, but add a formal methods layer on top of it to establish that programs do not exhibit undefined behavior.
Such a layer might allow the developer to annotate their programs with invariants, and to prove that these invariants indeed hold. To be practical, most of these invariants should be proven automatically, and the remaining ones through interactive reasoning.
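As an illustration of such annotations, the sketch below uses ACSL-style comments of the kind consumed by tools such as Frama-C; the exact annotation syntax is tool-specific and is shown here only as an example, not as the notation of any particular project. The function `count_zeros` and its contract are hypothetical.

```c
#include <stddef.h>

/*@ requires \valid_read(a + (0 .. n-1));
    ensures \result <= n;
*/
size_t count_zeros(const int *a, size_t n) {
    size_t count = 0;
    /*@ loop invariant 0 <= count <= \at(i, Here) <= n; */
    for (size_t i = 0; i < n; i++)
        if (a[i] == 0)
            count++;
    return count;
}
```

The `requires` clause rules out the out-of-bounds accesses that C itself leaves undefined; a verifier would attempt to discharge the loop invariant automatically and leave any remaining proof obligations to interactive reasoning. Since the annotations live in comments, the file remains an ordinary C translation unit.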
This second approach is an extension of static analysis. But whereas static analysis tools often yield false positives, this approach allows the developer to prove that a reported issue is indeed a false positive and not an actual error. This approach has also been successful for functional correctness: there have been various projects to prove the C source code of a microkernel operating system correct.
There are many tools for the second approach. However, they do not use an explicit formal semantics of C and only implicitly ‘know’ about it. The connection between the correctness proof and the behavior of the program when compiled with a real-world compiler is therefore shallow, and the soundness of these tools is questionable.
For this reason, as described in this paper, we started the Formalin project in 2011 at Radboud University to provide a formal semantics of the C programming language. Our goal was to formalize the ‘official’ semantics of C, as written down in the C11 standard.
We intended not to skim over the more difficult aspects of C, and to provide a semantics of the whole language. Unfortunately, the Formalin project has proven much harder than we anticipated, because the C11 standard turned out to be very difficult to formalize.
We describe various issues through small example programs, discuss what the C11 standard says about them, and show how a formal semantics may handle them. We argue here that the C11 standard does not allow Turing-complete implementations, and that its evaluation semantics does not preserve typing.
Finally, we claim that no strictly conforming programs exist. That is, there is no C program for which the standard can guarantee that it will not crash.