Building Safe Algorithms in the Open, Part 1 - Design
Consider the waterfall. Powerful. Implacable. Always surging forward towards the inevitable bottom. Powered by one of the few fundamental forces in the universe.
Waterfalls are truly awe-inspiring constructs, so it’s no wonder engineers have a bit of an obsession with them. The old DOD-STD-2167A standard recommended the use of a waterfall model, and my out-of-date engineering education suggested the use of a Phase-Gate model, which looks an awful lot like a waterfall to me. On the flip side, those of us who studied computer science at a university probably know that the waterfall model is a bit of an anti-pattern. Our friends in the ivory tower of academia tell us that, no-no, agile is the way to go, and that seems to have rung true in industry.
So between the old-fashioned, antiquated waterfall model, and the new hotness that is agile, what’s a developer to choose? Does the equation change when you’re trying to build algorithms? Or some safety-critical software?
As with most things in life, the answer is somewhere in between.
Hybrids, V’s, and Spirals
Hybrid development is that in between. Where the waterfall model doesn’t permit going back and changing requirements, the hybrid model permits it. And where agile development doesn’t necessarily lend itself well towards up-front planning, hybrid development also gives you space for that. What’s more, hybrid development is touted to reduce the number of defects in a final product, something we probably want when developing algorithms for safety-critical applications.
Sounds good, but how good is it really?
To answer that question, we’re taking a stab at hybrid development as we develop the NDT algorithm for localization in the open. Localization is an integral part of any autonomous driving stack that goes beyond purely reactive control. If you don’t believe me, or you’re not familiar with localization, I gently recommend that you take a look at some of the design documents that came out of this process.
So what is hybrid development in a nutshell? From my layman’s view, I’d argue it’s the idealized V model, or the Spiral Model. You plan, design, implement, and test, and iterate on that whole process based on the lessons you’ve learned and the friends you’ve made along the way.
Putting it into practice
A little bit more concretely, with the NDT working group in Autoware.Auto, we’ve currently finished up our first swing down the left side of the V (aka the first iteration through the design phase) in preparation for the Autoware Hackathon in London (hosted by Parkopedia!). Our first pass through the design phase consisted of the following steps:
A literature review and a review of existing implementations
A high-level design built from use cases, requirements, and mechanisms
A failure analysis
The definition of metrics
An architecture and API design
You can have a gander at each of the resulting documents if you’re interested in those kinds of words, but for the remainder of this post, I’ll try to break down some of the why’s and what’s that went into each of these steps.
Literature review and existing implementations
The first step in any worthy undertaking (within which I would categorize the implementation of NDT) is to see what other people have done. Humans, after all, are social creatures, and in all of our achievements, we stand on the shoulders of giants.
Allusions aside, there are two important directions to look at when reviewing “prior art”: academic literature, and functional implementations.
It’s always a good idea to look at what starving graduate students have worked on in the height of their hunger. Best case, you find out there’s a wholly superior algorithm that you can implement instead. Worst case, you understand the space of, and variations in solutions (which can help inform architecture), and you can be made aware of some of the theoretical underpinnings of the algorithm (and thus which invariants you need to protect).
On the flip side, it’s similarly good to look at what other people have done, since it’s always easiest to start solving something with a good initial guess. Not only do you get to gratuitously borrow good architectural ideas, but you might also discover some gotchas and dirty tricks you might need to use in order to get the algorithm to work in practice (which you might even be able to holistically integrate into your architecture).
From our literature review of NDT, we gleaned the following useful bits of information:
The NDT family of algorithms has a couple of variations:
P2D (point-to-distribution)
D2D (distribution-to-distribution)
Constrained
Semantic
There are a bunch of dirty tricks you can use to make the algorithm work better
NDT is generally compared to ICP
NDT is a bit faster and a bit more robust
NDT works robustly (high rate of success) within a specified local neighborhood
Nothing earth-shattering, but useful points of information to file away for later use, in both design and in implementation.
Similarly, from our review of an existing implementation, not only did we see concretely what the steps are, but we also saw some interesting initialization strategies.
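To make the P2D variant above a little more concrete, here is a minimal sketch of the scoring at the heart of NDT: each (transformed) scan point is scored against the normal distribution stored in its map voxel. The function name and the bare 2D setup are illustrative, not Autoware.Auto’s actual API.

```python
import math

def ndt_point_score(point, mean, cov):
    """Score one scan point against a voxel's normal distribution (2D P2D-NDT).

    The voxel stores the mean and covariance of the map points that fell into
    it; the score is the (unnormalized) Gaussian density of the scan point
    under that distribution.
    """
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # Invert the 2x2 covariance by hand; a real implementation should
    # regularize near-singular covariances (more on that under failure modes).
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    # Mahalanobis distance: delta^T * cov^-1 * delta
    m = dx * (inv[0][0] * dx + inv[0][1] * dy) \
        + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.exp(-0.5 * m)
```

The full objective sums this score over all scan points, and the optimizer searches for the transform that maximizes the sum; D2D-NDT swaps the point for a distribution on the scan side as well.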
Use cases, requirements, and mechanisms
An integral part of any “design-or-plan-first” development process is to look at the problem you’re trying to solve at a high level. Broadly, from a functional safety perspective (which I am admittedly far from an expert on), “looking at the high-level problem” is organized something like the following process:
What are the use cases you’re trying to solve?
What are the requirements for (or constraints on) a solution in order to satisfy the above-stated use cases?
What mechanisms satisfy the above requirements?
What the above process does is provide a disciplined way of looking at the problem from a high level, and progressively getting more granular.
To get a picture of what this might look like, you can take a look at the high-level localization design document that we put together in preparation for our development of NDT. If you’re not in the mood for some bedtime reading, then read on below.
I like to think about use cases in three different ways (caveat emptor, I am not a functional safety guy):
What is the component supposed to do in general? (remember SOTIF!)
What are the different ways I might get inputs into the component (input or upstream use cases, as I like to term it)?
What are the different ways I might use the output (output or downstream use cases)?
Bonus question: What kind of whole system architectures might this component reside in?
Taking these all together, we came up with the following:
Most algorithms (can) use localization, but ultimately they can be divided into algorithms which operate locally or globally
Local algorithms need continuity in their history of transforms
You can use nearly any sensor as a localization source
We need a way to initialize, and we need a way to handle outages in our different localization modalities
On top of the different types of use cases you can think of, I also like to think about some general use cases with very strict requirements. For this, I have the use case (or problem) of a fully-autonomous cross-country road trip going through multiple tunnels and traveling in a caravan. Baked into this use case are a couple of nasty things for consideration, such as accumulating odometry error, floating point error, localization corrections, and service outages.
The purpose of use cases, beyond providing a top-level view of whatever problem you’re trying to solve, is to inform the definition of requirements. In order for a use case to happen (or be satisfied), there are probably some things that need to happen or need to be possible. In other words, each use case has some set of requirements.
At the end of the day, the general requirements for a localization system aren’t too bad:
Provide transforms for local algorithms
Provide transforms for global algorithms
Provide a mechanism for relative localization algorithms to initialize
Ensure transforms don’t grow too big
Comply with REP105
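As a sketch of how the “transforms don’t grow too big” and continuity requirements interact under REP 105, here is a translation-only 2D illustration of folding a discrete localization fix into the map->odom link, so that odom->base_link stays smooth for local algorithms. The function name is hypothetical, not an interface from the design.

```python
def update_map_to_odom(map_to_base, odom_to_base):
    """Fold a global localization fix into the map->odom correction (REP 105).

    Translation-only 2D sketch: map->base = map->odom + odom->base, so
    map->odom = map->base - odom->base. The odom->base transform stays
    smooth and continuous (the continuity local algorithms rely on), while
    discrete localization corrections only ever jump the map->odom link.
    """
    return (map_to_base[0] - odom_to_base[0],
            map_to_base[1] - odom_to_base[1])
```

Composing the updated map->odom with the untouched odom->base_link recovers the corrected global pose exactly, which is the point of the split.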
Skilled functional safety personnel can likely find many, many more requirements. The value of this exercise is that we explicitly spell out some requirements for (or constraints on) our design, aka the mechanisms which will satisfy our requirements.
The final result of any sort of analysis should be an actionable set of lessons or items. If there’s nothing we can use as a result of an analysis (even a negative result!), then the analysis was a lot of hot air for no particular purpose.
For a high-level design doc, this is a set of mechanisms or a design that encapsulates these mechanisms, that can appropriately satisfy our use cases.
This particular localization high-level design resulted in a set of software components, interfaces, and behaviors that comprise a localization architecture. A simple block diagram of the proposed architecture is below.
If you’re interested in more details of the architecture or design, I’d highly encourage you to read the full document.
Failure analysis
Since we’re in the business of trying to build components in safety-critical systems, failure is something we should try to avoid, or at the very least mitigate. So before we try to design or build anything in earnest, we should at the very least be aware of how things might break.
It’s helpful in a failure analysis, as in most things, to look at the component from a couple of different perspectives. For the failure analysis of the NDT algorithm, we looked at it in two different ways: as a general (relative) localization mechanism, and specifically as an instance of the NDT algorithm.
Viewed as a general localization mechanism, the main failure mode is “what do you do when your inputs are wrong?” Indeed, from the perspective of an individual component, there’s not a whole lot you can do beyond some basic sanity checking. On a system level, there’s more stuff you can do (such as enabling security features).
When looking at NDT as an individual algorithm, it’s helpful to abstract the algorithm details an appropriate amount. Using a pseudocode version of the algorithm is helpful here (in addition to helping you, the developer, to understand the algorithm). Here, we stepped through the algorithm and considered all the ways it might break.
Saying an implementation might be wrong is a perfectly valid failure mode, though one that is (more or less) easily mitigated through appropriate testing. What came up a little more frequently and insidiously were some gotchas regarding numerical algorithms. In particular, the inversion of matrices, or more generally the solving of linear systems, is susceptible to numerical errors. That’s a neat failure mode that should probably be covered.
Two other important failure modes we also identified were making sure certain expressions aren’t unbounded in magnitude (washing out floating point precision), and making sure the magnitude or size of the inputs are controlled.
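A minimal sketch of two of those mitigations, assuming a 2x2 Newton step H * x = -g of the kind that shows up in the NDT optimization: regularize a near-singular system before inverting, then clamp the resulting step so a bad linearization can’t produce an unbounded update. All thresholds here are illustrative, not tuned values from the design.

```python
def damped_solve_2x2(H, g, min_det=1e-9, damping=1e-3, max_step=1.0):
    """Solve H * x = -g for a 2x2 system with two mitigations.

    If H is near-singular (tiny determinant), add Tikhonov damping to the
    diagonal before solving; afterwards, clamp the step magnitude so the
    update stays bounded even for a pathological linearization.
    """
    (a, b), (c, d) = H
    det = a * d - b * c
    if abs(det) < min_det:
        a, d = a + damping, d + damping
        det = a * d - b * c
    # Cramer's rule for the damped 2x2 system
    x = (-(d * g[0] - b * g[1]) / det, -(a * g[1] - c * g[0]) / det)
    norm = (x[0] ** 2 + x[1] ** 2) ** 0.5
    if norm > max_step:
        x = (x[0] * max_step / norm, x[1] * max_step / norm)
    return x
```

A production implementation would likely use a proper decomposition and a condition-number estimate instead of a raw determinant check, but the shape of the mitigation is the same.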
In all, we came up with 15 recommendations. I’d encourage you to check them out.
As an aside, though we didn’t use it here, a fault tree analysis is an excellent tool for structuring and quantifying the problem of failure analysis.
Metrics
“What is measured gets managed”
Popular manager catchphrase
When building something professionally, it’s unfortunately not enough to shrug your shoulders when you’re sick of working on something and call it done. Fundamentally, any package of work (which again, building NDT falls into) needs acceptance criteria that both the customer and the vendor must agree on (if you are both customer and vendor, skip this). The whole legal profession exists to support this, but as engineers, we can just cut out the middlemen by defining metrics for the performance of our components. After all, numbers are (mostly) unambiguous and inarguable.
Even if acceptance criteria are unnecessary or irrelevant, a well-defined set of metrics is still nice to have as a way to characterize and improve the quality and/or performance of a project. After all, what is measured gets managed.
For our NDT implementation, we’ve broken the metrics into four broad groups:
General software quality metrics
General embedded software quality metrics
General algorithm metrics
Localization specific metrics
I won’t go into the metrics in any detail here because they’re all relatively standard. The important thing is that the metrics were defined and identified for our particular problem, and that’s about as far as we can get as designers for an open-source project. Ultimately, the bar for acceptance must be defined on a per-project basis by whoever is deploying the system.
One last thing I’ll repeat here is that while metrics are fantastic for smell tests, they are no replacement for actually inspecting and understanding an implementation, and a use case’s requirements.
Architecture and API
After laboriously trying to characterize the problem we’re trying to solve, and understanding the space of solutions, we can finally dip our toes into something verging on implementation.
Of late, I’ve been a fan of test-driven development. Like most engineers, I like to build things, and I found the notion of writing tests first to be cumbersome. By default, when I started programming professionally, I went ahead and did the standard test-after-development methodology (in spite of my professors in university telling me to do the opposite). Research seems to indicate that writing tests before implementing tends to result in fewer errors, higher test coverage, and generally higher-quality code. Perhaps more importantly, I find that test-driven development helps to decompose the weighty problem of implementing an algorithm.
So what does this look like?
Instead of a monolithic ticket called “Implement NDT” (including tests), which would result in an addition of several thousand lines of code (something more or less impossible to effectively review), we can instead break the problem up into something a little bit more tractable:
Write out the classes and public methods for the algorithm (aka architecture)
Write the test cases for the algorithm using the public API (they should fail!)
Implement the algorithm logic
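Concretely, steps 1 and 2 above might look something like the following sketch. The class and method names are hypothetical, not Autoware.Auto’s actual API: the public interface exists, the test encodes the contract, and it stays “red” until step 3 fills in the logic.

```python
class NdtLocalizer:
    """Step 1: a hypothetical public API, written before any algorithm logic."""

    def set_map(self, map_cloud):
        self.map_cloud = map_cloud

    def register(self, scan, initial_guess):
        """Return a refined pose estimate.

        Deliberately a stub: the tests below are written against this
        signature, and the real logic arrives in step 3.
        """
        raise NotImplementedError("step 3: implement the algorithm logic")


def test_register_against_public_api():
    # Step 2: a test written purely against the public API. It encodes the
    # contract (a pose comes back from register) and fails until step 3.
    loc = NdtLocalizer()
    loc.set_map([(0.0, 0.0), (1.0, 0.0)])
    try:
        loc.register(scan=[(0.1, 0.0)], initial_guess=(0.0, 0.0, 0.0))
    except NotImplementedError:
        return "red"   # expected while the stub is unimplemented
    return "green"
```

Each of the three steps then becomes its own reviewable chunk of work rather than one several-thousand-line change.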
The first step, then, is unequivocally to write out the architecture and API for an algorithm. I’ll talk about the other steps in another post.
While there are a lot of works that tell you how to “do architecture”, that mostly tells me that designing a software architecture is a bit of a black art. Personally, the way I like to think about software architecture is as drawing boundaries between concepts, and trying to characterize the degrees of freedom in the problem statement and solution method in terms of those concepts.
What are the degrees of freedom in NDT then?
Our literature review tells us that there are different ways of representing the scan or observation (i.e. P2D-NDT and D2D-NDT). Similarly, our high-level design document tells us that we have multiple ways of representing the map (i.e. static, or dynamic), so that too is a degree of freedom. More recent literature also tells us that the optimization problem can be modified. Even then, we see from comparing a practical implementation and the literature that even the details of the optimization solver can vary.
And the list goes on and on.
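One way to express those degrees of freedom in code is to give each one its own interface and compose the localizer from them, so that swapping P2D for D2D, or a static map for a dynamic one, doesn’t disturb the rest. The class names here are illustrative sketches, not the Autoware.Auto architecture.

```python
from abc import ABC, abstractmethod

class NdtMap(ABC):
    """Degree of freedom: map representation (e.g. static vs dynamic)."""

    @abstractmethod
    def cell_at(self, point):
        """Return the voxel statistics covering a point, or None."""

class OptimizationProblem(ABC):
    """Degree of freedom: the objective (e.g. P2D vs D2D, constrained)."""

    @abstractmethod
    def score(self, scan, pose, ndt_map):
        """Evaluate the registration objective for a candidate pose."""

class StaticVoxelMap(NdtMap):
    """A minimal static map: a fixed grid of unit-sized voxels."""

    def __init__(self, voxels):
        self.voxels = voxels  # maps (i, j) voxel index -> distribution

    def cell_at(self, point):
        return self.voxels.get((int(point[0]), int(point[1])))

class Localizer:
    """Composed from the degrees of freedom instead of hard-coding one
    variant of each; swapping an implementation leaves the rest untouched."""

    def __init__(self, ndt_map, problem, solver):
        self.map = ndt_map
        self.problem = problem
        self.solver = solver
```

The same boundaries also make the pieces independently testable, which feeds back into the test-driven workflow above.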
For our first pass design, we settled on the following concepts:
Algorithm and node interfaces, with some subdivision within each.
The ultimate hope for an architecture is that it is extensible and maintainable. Whether our proposed architecture fulfills this hope is something only time will tell.
It needs repeating that what was presented here was only the first pass through the design phase. It has been famously said that no battle plan survives contact with the enemy, and the same is very much true of software design. The ultimate failure of the waterfall model came from the assumption that the specification and design were implemented perfectly. Needless to say, no specification or design is perfect, and as implementation and testing occur, flaws will be discovered, and amendments will have to be made to the designs and documents outlined here.
And that’s fine. We as engineers aren’t our work, and all we can try to do is iterate our way towards more perfect systems. With all the words produced for the development of NDT, I think we’ve made a good first step.