Building Safe Algorithms in the Open, Part 3 - Integration
Updated: Feb 19, 2021
All good things must come to an end. Not this series of blog posts. Though it may be ending, it can’t strictly be considered to be a good thing.
We left off last time having dug through the trenches of code, so to speak, and we were left with something that was minimally functional, at least as far as the compiler and some tests were concerned. Now comes the hard part: making it work for real.
I’ll preface all further discussion in this post with the disclaimer: there’s no silver bullet for getting your code integrated (if there were, I would be consulting and much wealthier) — instead, all I have are recommendations that can hopefully make the process less painful.
And it can be painful. In my limited experience, the first time an engineer tries to integrate a big piece of code with other big pieces of code, it ends up being a long and painful process of banging their head against the wall.
But something that works comes out of the process, which is cool. Perhaps even more importantly, you become a little better for having done it the first, and worst, time. (I would argue writing an academic paper is very similar in this regard.)
Integration by another name
At the end of the day, integrating a chunk of code into other code is just plugging things in, figuring out why it doesn’t work, and fixing those issues. In other words, it’s a big and nasty troubleshooting problem.
One of my colleagues recently sent around a nice article on solving hard problems — I encourage you to look at it.
In my own words, I would suggest the following:
Understand the event chain in your process or algorithm
Develop tests to determine where in the event chain your problem is
Fix the problem
Fundamentally, computers are pretty much sequential. When you start squeezing in multithreading and get to the low level of assembly instructions, things can get kind of murky and have some weird orderings and asynchronous madness.
Still, at a macro level, algorithms are pretty much sequential. (That is, if you followed our earlier advice.)
Because our problem has this linear structure, we can then quickly nail down where the problem is (remember our dear friend binary search), be it at the input level, output level, or somewhere in between.
Testing whether something is right or wrong along the event chain can range from something as simple as spamming print statements and inspecting the results to something as complex as a dedicated test case that ensures consistent behavior with respect to whatever functional specification you have.
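To make the binary search idea concrete, here is a minimal sketch. It assumes the event chain can be written as a list of stage functions, that the input itself is fine, and that a bad intermediate value stays bad downstream. The names run_prefix, first_bad_stage and looks_ok are made up for illustration and do not come from any library.

```python
from typing import Any, Callable, Optional, Sequence


def run_prefix(stages: Sequence[Callable[[Any], Any]], data: Any, k: int) -> Any:
    """Run only the first k stages of the chain on the input."""
    for stage in stages[:k]:
        data = stage(data)
    return data


def first_bad_stage(stages: Sequence[Callable[[Any], Any]],
                    data: Any,
                    looks_ok: Callable[[Any], bool]) -> Optional[int]:
    """Return the index of the first stage whose output fails looks_ok.

    Assumes the raw input is fine and that a bad value stays bad
    downstream, which is what makes bisection valid here.
    """
    lo, hi = 0, len(stages)            # the prefix of length lo is known good
    if looks_ok(run_prefix(stages, data, hi)):
        return None                    # the whole chain checks out
    while hi - lo > 1:                 # the prefix of length hi is known bad
        mid = (lo + hi) // 2
        if looks_ok(run_prefix(stages, data, mid)):
            lo = mid
        else:
            hi = mid
    return hi - 1                      # the stage that turned good into bad


# Toy chain: the third stage (index 2) has a bug that drives the value negative.
toy_stages = [lambda v: v + 1.0, lambda v: v * 2.0,
              lambda v: v - 100.0, lambda v: v * 3.0]
print(first_bad_stage(toy_stages, 1.0, lambda v: v >= 0.0))
```

In practice looks_ok is the hard part. It can be anything from a sanity check on ranges to a comparison against a recorded known-good value.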
While in an ideal world, we would have a perfect specification, and perfect, beautiful test cases to ensure compliance with the spec, in the real world, this is not always the case. Specifications can be wrong or ambiguous, APIs might rightly or wrongly not lend themselves well to testing, and there’s always the pressing issue of expediency. Things need to be out of the door and bills need to get paid.
So do whatever works to test whether something is right or wrong, but my preference is for whatever makes the issue easiest and quickest to reproduce.
Once you can reproduce the issue, then fixing it usually isn’t too bad. It’s the same logic as before: break down the event chain to find the offending section, and stare hard until you understand the issue. If staring fails, you can always play the part of a machine learning algorithm and start throwing in pairs of inputs with expected results until you build an intuition for the chunk of code and where it’s going wrong.
And then the problem is solved.
Except, it isn’t. Unless we’re lucky (or good at coding!). Typically the first issue was hiding more issues, and you iterate until you can find no more, or at the very least, until your code works the way it needs to within the given operational design domain (ODD).
And that’s integration in a nutshell. It’s nothing more than troubleshooting by another name. It sounds long and arduous and painful, and the process certainly can be. That said, there are some things you can do that might lessen the pain, and I’ve also got some hints that might help in case you get stuck.
Tips to make integration less painful
Surprise! We’re back to testing.
Generally speaking, an integration test is when you have multiple components working in concert. Here, I like to define a component as being one ROS Node.
In an integration test, you want to reasonably emulate the operation of the component or components in a wider system. For me, this means spoofing some reasonable input (e.g. a sequence of related trajectories and vehicle states to test a controller) and expecting some reasonable output on the other side. What you get, at the cost of added complexity and some loss of control, is a much more powerful test that can more fully exercise your code and expose issues in an isolated manner.
Many tools exist to do integration testing. In the ROS domain, there’s the fantastic launch_testing framework (developed by our friend Pete who is an absolute wizard). If that’s not your flavor, you can also program up integration tests in the standard GTest framework.
As an aside, I was responsible for the first test at Apex.AI. It was a horrible one-line bash command which ran an executable and made sure it didn’t crash.
And that’s fine (though there are better ways to do it). Smoke testing is a perfectly valid form of testing, and any testing you do generally saves you pain later as things change.
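For illustration, a smoke test in that spirit could look roughly like the sketch below. It is not the original bash one-liner. The function name smoke_test and the timeout are assumptions, and the Python interpreter stands in for the executable under test.

```python
import subprocess
import sys
from typing import Sequence


def smoke_test(command: Sequence[str], timeout_s: float = 10.0) -> bool:
    """Return True if the command starts, runs, and exits with status 0."""
    try:
        result = subprocess.run(command, timeout=timeout_s,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except (OSError, subprocess.TimeoutExpired):
        return False  # could not start, or hung past the timeout
    return result.returncode == 0


# Stand-in for a real binary: an interpreter invocation that does nothing.
print(smoke_test([sys.executable, "-c", "pass"]))
```

A crash shows up as a non-zero return code, so it is reported as a failure in the same way as an explicit error exit.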
Back to State
Some readers might recall in the previous blog post that I harped on a lot about the state of the objects, and how understanding and protecting that state is (in my opinion) the key to writing good classes.
Well, we’re back to that again.
Subtle bugs can pop up when the state or invariants of an object are not adequately protected. The main implication is that you should:
Know what the invariants are
Make sure you keep the invariants satisfied
Minimize the state of the object
The first two are fairly straightforward (and can be enforced by contract mechanisms), and the last is an application of KISS. If there’s less state, then there’s less that can go wrong.
Concretely, a form of testing that can help identify whether your issues are related to state is back-to-back testing. Interestingly enough, this is a form of testing that is also recommended by ISO 26262, so that’s another good reason to use it.
As a live example, one of the major bugs in the MPC implementation was related to state and would have been caught by a back-to-back test. All the single-shot tests in the world only really check that your algorithm works once, so adding a back-to-back test is a good way to make sure you cover the i+1 cases.
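A minimal sketch of a back-to-back test in the sense used here, meaning repeated calls on the same object. It assumes a component whose answer should not depend on earlier calls. ToySmoother and back_to_back_ok are hypothetical names, and the bug shown is a made-up stand-in, not the actual MPC bug.

```python
from typing import Callable, List, Sequence


class ToySmoother:
    """Hypothetical component: averages a batch using a scratch buffer."""

    def __init__(self, clear_between_calls: bool = True) -> None:
        self._scratch: List[float] = []
        self._clear = clear_between_calls

    def compute(self, samples: Sequence[float]) -> float:
        if self._clear:
            self._scratch.clear()      # forgetting this is the state bug
        self._scratch.extend(samples)
        return sum(self._scratch) / len(self._scratch)


def back_to_back_ok(make_component: Callable[[], ToySmoother],
                    first: Sequence[float],
                    second: Sequence[float]) -> bool:
    """The second call on a reused object must match a fresh object's answer."""
    reused = make_component()
    reused.compute(first)                      # call i
    again = reused.compute(second)             # call i + 1
    fresh = make_component().compute(second)   # single-shot reference
    return again == fresh


print(back_to_back_ok(ToySmoother, [1.0, 2.0], [10.0, 20.0]))
print(back_to_back_ok(lambda: ToySmoother(clear_between_calls=False),
                      [1.0, 2.0], [10.0, 20.0]))
```

For components with intentional memory, such as anything with an integrator or a warm start, comparing against a fresh object is the wrong oracle. There you would compare against expected values for the whole call sequence instead.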
As a corollary to ensuring the integrity of your object’s state between calls to methods, it’s also important to keep a keen eye on the inputs to your algorithm.
From the perspective of your algorithm, you typically don’t control the inputs, and as such, there are many things that can go wrong there. The responsibility ultimately falls on the algorithm to ensure the inputs meet whatever requirements your algorithm has.
Generally speaking, it’s good to do this check before doing any computation, and before you modify any state of your object (recall the fastest code is the code that never runs, and also recall the notion of strong exception guarantees).
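As a rough sketch of checking first and committing last: ToyController below is a hypothetical class, and the two checks are placeholders for whatever requirements your algorithm actually has.

```python
import math
from typing import List, Sequence


class ToyController:
    """Hypothetical controller that stores the latest reference trajectory."""

    def __init__(self) -> None:
        self._reference: List[float] = []

    def set_reference(self, points: Sequence[float]) -> None:
        # 1. Check everything first, touching no member state.
        if len(points) == 0:
            raise ValueError("reference trajectory is empty")
        if not all(math.isfinite(p) for p in points):
            raise ValueError("reference trajectory contains NaN or inf")
        # 2. Only now commit: a rejected input leaves the old state intact.
        self._reference = list(points)

    @property
    def reference(self) -> List[float]:
        return list(self._reference)


controller = ToyController()
controller.set_reference([1.0, 2.0])
try:
    controller.set_reference([3.0, float("nan")])
except ValueError as error:
    print("rejected:", error)
print(controller.reference)  # still the last good trajectory
```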
A concrete example of this is in the trajectory input for MPC, where two big things are checked: the heading, and sample periods.
While the complex representation of heading makes it easy to avoid the singularities in the SO(2) space, the representation is only valid when the norm of the two numbers is one. For some controllers, the heading doesn’t even matter, but for others, such as MPC, it’s quite important. To be on the safe side, all of the heading values are checked, and if necessary, normalized.
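A minimal sketch of such a heading check, assuming the heading is stored as a (real, imaginary) pair, i.e. (cos, sin) of the yaw angle. The function name and both tolerances are assumptions chosen for illustration.

```python
import math
from typing import Tuple


def normalized_heading(real: float, imag: float,
                       tol: float = 1e-3) -> Tuple[float, float]:
    """Return a unit-norm copy of a complex heading (cos, sin)."""
    norm = math.hypot(real, imag)
    if not math.isfinite(norm) or norm < 1e-9:
        # Near-zero or non-finite headings carry no direction to recover.
        raise ValueError("heading has no usable direction")
    if abs(norm - 1.0) <= tol:
        return real, imag              # already valid, leave untouched
    return real / norm, imag / norm    # repair by rescaling


print(normalized_heading(3.0, 4.0))
```

Rescaling keeps the direction and fixes only the length, which is why it is a safe repair as long as the norm is not close to zero.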
In a similar vein, not all controllers strictly need a timestamp for a reference trajectory. Some only need a set of time-invariant points in space representing a path. Not so with MPC. Here, we check that the spacing of the time points associated with the trajectory points is reasonably close to the time discretization MPC wants, and if not, we try to repair it via interpolation.
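The check-then-repair step might be sketched as follows. This assumes a heavily simplified trajectory point, a (time, value) pair, and plain linear interpolation. A real trajectory point has many more fields, and heading in particular should not be interpolated this naively. The function name and the tolerance are assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (time in seconds, value), a simplified point


def resample(points: List[Point], dt: float, tol: float = 1e-3) -> List[Point]:
    """Return points spaced dt apart, linearly interpolating if needed."""
    if len(points) < 2 or dt <= 0.0:
        raise ValueError("need at least two points and a positive period")
    times = [t for t, _ in points]
    if any(b <= a for a, b in zip(times, times[1:])):
        raise ValueError("time stamps must be strictly increasing")
    if all(abs((b - a) - dt) <= tol for a, b in zip(times, times[1:])):
        return list(points)            # spacing is already close enough
    out: List[Point] = []
    seg = 0                            # index of the segment containing t
    k = 0
    while times[0] + k * dt <= times[-1] + 1e-12:
        t = min(times[0] + k * dt, times[-1])
        while seg < len(points) - 2 and times[seg + 1] < t:
            seg += 1
        (t0, x0), (t1, x1) = points[seg], points[seg + 1]
        w = (t - t0) / (t1 - t0)       # position inside the segment, 0 to 1
        out.append((t, x0 + w * (x1 - x0)))
        k += 1
    return out


print(resample([(0.0, 0.0), (0.3, 3.0), (1.0, 10.0)], 0.5))
```

Note that the output stops at the last grid time inside the original trajectory, so a trailing sample that does not land on the grid is dropped rather than extrapolated.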
Literature and Prior Art
If you’ve been following along with this series, then you might remember that I’ve previously advocated for writing some lengthy design documentation. Well, guess what? It’s back!
The main star here in the context of integration is the literature review and prior art document.
If you’re all out of ideas on how to get something working, it’s always a good idea to compare your nonfunctional implementation to a functional implementation.
As an example, the NDT implementation in Autoware.Auto had significant trouble with a center corridor in the map we were using for testing. A comparison of features showed that (in the interest of expediency) we were missing a couple of improvements from the original Autoware.AI version, namely:
Ensuring covariance matrices are well-conditioned
Moré-Thuente line search
So my dear (frustrated) colleague Yunus went ahead and tried each of those out to see which improved performance. To cut a long story short, both the covariance conditioning and the Moré-Thuente line search turned out to be needed.
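To give a flavor of what covariance conditioning can mean, here is a sketch of one common approach: clamping the smallest eigenvalue to a fraction of the largest. It is written for a 2D covariance so that it needs nothing beyond the standard library. Whether this matches the Autoware implementation is an assumption, and the ratio of 0.01 is a placeholder, not a recommended value.

```python
import math
from typing import Tuple

Sym2 = Tuple[float, float, float]  # (a, b, c) for the matrix [[a, b], [b, c]]


def condition_covariance(cov: Sym2, min_ratio: float = 1e-2) -> Sym2:
    """Clamp the small eigenvalue to at least min_ratio times the large one."""
    a, b, c = cov
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    big, small = mean + radius, mean - radius   # eigenvalues, big >= small
    if big <= 0.0:
        raise ValueError("covariance has no positive eigenvalue")
    small = max(small, min_ratio * big)
    # Principal axis angle of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    # Rebuild R * diag(big, small) * R^T with the clamped eigenvalue.
    return (big * cs * cs + small * sn * sn,
            (big - small) * cs * sn,
            big * sn * sn + small * cs * cs)


# Points lying exactly on a line give a singular covariance.
print(condition_covariance((1.0, 1.0, 1.0)))
```

The point of the clamp is that the inverse covariance used in the NDT score stays bounded even when all the points in a cell happen to lie on a line or a plane.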
Similarly, the MPC controller had stability and jitter issues. Again, cutting a long story short, the key missing feature was using integral action in the dynamics model, which is a well-known tip for improving performance in MPC controllers.
To sum up this section, we’ve got the following recommendations:
Compare with working implementations
Consult literature: I like the search terms “practical <algorithm name>”, “<algorithm name> in practice”, and “applied <algorithm name>”
Tighten the Validation Loop
Now comes the sad part of the story. I’m unfortunately out of concrete and helpful tips that might push you forward in the process of integration. Instead, all I have left are tips to ease the pain.
The first of these tips (which goes along well with most modern software development practices) is to tighten the loop for your fix-check-repeat process. This includes:
One-button building process (or command, for those of you who prefer the shell)
While one-button building is relatively par for the course in most modern projects, for esoteric problems that show up in integration, one-button testing isn’t quite so easy. For this, ROS 2’s new launch system is very useful. It can execute arbitrary commands, and what’s more, they can be sequenced. This goes a long way towards automating a test case.
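The sequencing idea can be illustrated without any ROS dependency. The sketch below is a plain standard-library stand-in, not the ROS 2 launch API. The helper name run_in_sequence is made up, and the Python interpreter stands in for real build and test commands.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_in_sequence(steps: Sequence[Sequence[str]]) -> Optional[int]:
    """Run each command in order; return the index of the first failure."""
    for index, command in enumerate(steps):
        if subprocess.run(command).returncode != 0:
            return index   # stop here so later steps never run on a bad build
    return None            # every step succeeded


# Stand-ins for "build", "start the node under test", "check the output".
step_ok = [sys.executable, "-c", "pass"]
step_fail = [sys.executable, "-c", "import sys; sys.exit(1)"]
print(run_in_sequence([step_ok, step_fail, step_ok]))
```

Wrapping the whole fix-check-repeat cycle in one command like this is what keeps the loop tight. A launch system adds the ability to run processes concurrently and react to their events.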
Do Something Else
Lastly, I can only offer the advice of packing up your toys and going home.
Do something else. Best case, the cogs of your mind will spin on in the background and you’ll eventually get some nice epiphany, and in the worst case, you’ll at least save yourself some immediate headache and give yourself time to relax.
What Comes Next
So that was NDT and MPC, and our first swing down and up the V cycle of development. We started with an idea, bulked the idea up with a plan, built the thing according to the plan, and made sure it all worked. Now, at the end of the day, we’re left with two algorithms that minimally do what they say they do--localize against some reference map and follow some reference trajectory.
So what comes next?
In general, there are two opposing directions we can go: to first resolve technical debt and improve reliability, or to add features (i.e. the availability/reliability tradeoff).
Technical debt is a fact of life, especially as we push hard to build a minimum viable product. And there’s always plenty of technical debt to resolve. In the context of NDT, we can improve some architectural considerations, add more integration tests, and improve reliability and robustness. My slightly famous colleague, Igor, has already started working on improving the architecture. For MPC, we can similarly add more integration tests, gracefully handle edge cases, and improve the reliability and stability of the solver.
In my opinion, technical debt is one of those things that will bite you in the butt if left alone, and resolving it will probably make development faster in the long term (provided you keep the new features and use cases in mind). It’s a little bit like the marshmallow test, but I’m not sure whether it’s of any relevance as a litmus test for the success of an open-source project.
The allure of new features, by contrast, is always present. New features are always cooler and sexier and provide the promise of more. But new features typically come with new technical debt, and then we’re back to the same question. Just as with technical debt, there’s never a shortage of cool and new features to add. For NDT and MPC both, we can play with the solver and the optimization problem. We could play around with D2D NDT or any of the many flavors of NDT to further improve the runtime and robustness of our localization algorithm.
Ultimately, though, the direction a project goes is dictated by the people keeping the gates, and the people paying the bills. I am neither of those people, so my say doesn’t matter too much.
But if nothing else, I would choose to resolve technical debt because, at the end of the day, a job well done and a product well built are pretty satisfying in themselves, and I think that with the algorithms we’ve built in the open, we’ve done a pretty good job.