Updated: Jul 22, 2019
Designing software that never fails is a lofty goal.
The real world is messy, and even with the structure and simplifications that working with software brings, deploying in the real world is still messy. This means that getting your inputs nice and tidy is a task, not to mention getting all the logic under the hood right.
Though we do our best by creating and using design patterns and programming paradigms, most modern software languages still provide many ways to shoot yourself in the foot. For software that is designed to never fail, shooting yourself in the foot isn’t something you want to do.
On top of our own failures as software developers coming out and causing our software to fail, there are also “real-world aspects” that could make it fail. As much as we would love to believe that our software exists in a vacuum, it does not. It runs on real hardware in a noisy environment, and increasingly it has to share resources and time with other pieces of software.
Even if I as a software developer would do everything right, but my neighbor has created a memory leak and accidentally hogs all the memory, then I’m out of luck. My program will stall and fail.
And that’s just normal failure.
When we’re dealing with real-time, safety-critical systems, there’s also a time aspect to failure. If my software doesn’t react fast enough, then the car moves too far before it can start braking, and you have an accident. I’ll talk more about this in another post.
There are lots of ways for software to fail. So how do you design software that practically never fails?
Luckily, there are experts in the field of making things not break. These are the engineers in the aerospace and automotive fields where, if something goes wrong, very bad things happen as a consequence. These men and women have distilled their wisdom into safety standards: IEC 61508 for technical systems in general, DO-178B for aerospace, and ISO 26262 for automotive systems.
For now, because I’m in the field of autonomous driving software, I’ll focus on ISO 26262.
ISO 26262 Part 6 specifies a number of recommendations for the development of software in an automobile. These recommendations, broadly distilled in sixteen tables, cover a range of topics, from design and documentation, to development and testing, and beyond.
For now, I’ll highlight some of the recommendations for designing and developing software for the automotive use case, in particular recommendations for the SIL-4/ASIL-D rating, which is the highest level of functional safety, allowing only for a probability of less than 10 safety-critical failures per approx. 1 billion hours of operation or more than a billion miles driven.
There are a number of recommendations, many of which constitute good, rigorous software development practices, but the two most important recommendations are related to the time aspect of software that is designed to never fail.
The first recommendation is to “avoid dynamic objects”, which in other words means no dynamic memory allocation is allowed during your core runtime loop. This is sometimes referred to as making your application memory static.
The second recommendation is to avoid blocking calls. This means you can’t do something with an indeterminate or unbounded runtime, such as file I/O.
While the second is not terribly difficult to adhere to, the first goes against many standard programming paradigms, and thus requires good engineering discipline to avoid reverting to standard practice. For your troubles, this awkward practice buys you, when done correctly, an application free from memory leaks. What’s more, even if a neighboring application leaks memory and hogs everything, your application will be safe, at least from a memory perspective.
Following the guidelines of ISO 26262 to the letter doesn’t guarantee you will automatically have safe, fail-free code, but it certainly helps to mitigate, and sometimes even eliminate common software failure modes. This frees up developer brainpower for the harder and more esoteric problems involved in making software fail-free.
Robotic algorithms that are designed to not fail is even harder
Moving from general software to robotic algorithms makes the problem of building software that are designed to never fail even harder. Where software cares nothing for the niceties of the real world: noise on the line, cosmic rays and so on, the mathematical and theoretical world of robotic algorithms cares even less for the needs and wants of a production software system.
Take, for instance, the RRT* algorithm, which is one of the most important and popular motion planning algorithms in recent memory. If you also recall some CS 101 algorithms basics, the property of completeness is an important one. It guarantees that the algorithm will produce a correct answer for any input. For general problems where you cannot constrain the inputs, completeness is a fantastic and important algorithm property. RRT*, of course, is not a complete algorithm, but a probabilistically complete algorithm — it will almost surely produce the correct answer for any input in the limit of an infinite number of samples, runtime, and memory.
But that doesn’t fly in a safety-critical system that must never fail.
It’s hard enough choosing robotics algorithms for even a nominally working autonomous vehicle. Trying to make that vehicle safety-critical and live up to the moniker of “never failing” complicates this choice immensely. You now have to consider time and memory bounds, and how you can get those theoretical guarantees to hold in reality.
But it can be done.
Returning to planning algorithms, we want completeness and theoretical guarantees, but we don’t want, nor can we satisfy the requirement of infinite time and space. Luckily, another family of planning algorithms exist: optimization-based methods. These methods converge to a local optima — global if your problem setting is convex — and generally run in bounded memory. What’s more, if the problem is convex, the optimization process generally converges in a bounded number of iterations.
So it is possible to have a planning algorithm with theoretical guarantees that is bounded in memory and empirically bounded in runtime. Provided your problem setting is convex, of course.
If you cannot make your direct problem convex, there are still a number of other tricks you can employ. You can formulate your problem as a convex relaxation — a lower fidelity view of the world that is still safe to plan in. You could even throw away optimality and only focus on constraint satisfaction, relying on coarse solution initialization and local optimization techniques.
In short, building robotic algorithms for the safety critical context is hard, but it’s certainly doable.
Once you shift your thinking about what the parameters of the problem are, choosing the right algorithm, and building it up right is a tractable problem. The change in the problem parameters invalidates some of your prior knowledge, so a lot of research and reading has to be done. But once the “right” algorithm is chosen, building it is a matter of engineering know-how, discipline, and a good safety culture.
Building safe software and safe algorithms isn’t easy, but it isn’t impossible once you understand the problem, and know who to ask for help. At Apex.AI, we started from square one with the ideal of making safe software, and this is reflected in our culture and our software. We’re poised to make software fail-free for the safety-critical problem of autonomous driving, and you can be too.