In our recent Apex.Grace 2.2.0 and Apex.Ida 1.2.0 releases, we made a number of improvements to the data recording path. We conducted a data flow analysis and found that the number of data copies drastically impacts the CPU load and the final recording performance. We then made several optimizations in the core of Apex.Ida and Apex.Grace to eliminate extra copies during data recording and data transfer. The largest gains will be seen when publishers and subscribers run on the same CPU, with no external communication over the wire, so that shared memory transport can be used. However, we were still able to eliminate a few extra data copies even with external DDS communication.
Recording and replay play a crucial role in developing and debugging complex robotics applications. Therefore, we are often asked by our customers:
What are the capabilities and limits of the rosbag2 recorder?
What kind of improvements has Apex.AI made to rosbag2? Is it compatible with what is available in the open-source version?
What are the limits of rosbag2?
What is the maximum throughput we can expect on our system?
What kinds of improvements can we expect from the new MCAP format? Is it worth switching from SQLite3?
We decided to do some research in this area and answer most of these questions in this blog post.
In today’s robotics applications, cameras and LIDAR sensors produce the largest data flows, generating hundreds of megabytes per second of data. In order to effectively debug complex computer vision (CV) applications, developers and system architects need the ability to record the raw data from these sensors.
As an example, we will look at a classic CV pipeline as implemented in a typical ROS 2 system. For the sake of simplicity, this pipeline will be vision based and use only cameras.
In a typical system, there is input from multiple cameras, which are all connected via some high-speed bus to the ECU. Running on the ECU are camera drivers, which translate camera images from vendor-specific formats to the common ROS 2 Image message format. This allows downstream nodes in the pipeline to work with the images in a vendor-independent manner.
In order to capture the whole raw data stream, we will use `rosbag2` to record the output of the camera driver nodes. You may wonder how much data will be recorded in this scenario. To answer this, let us look at a typical camera setup for a CV application. In order to cover a 360-degree field of view, you will need 4-8 cameras. If depth information is needed, some of these cameras may need to be in pairs, but generally, the maximum number of cameras is 8-12.
For our example, we will consider a case with 8 cameras. Of course, the next question that comes up is that of resolution and frame rate.
CV engineers usually want images with high enough resolution to be able to read text on signs, see markers on the ground, and detect moderately small objects. In most applications, 1920x1080 RGB “FullHD” images work well and are widely supported by most cameras on the market today.
The frame rate is another important consideration since it defines the minimum time required for the system to recognize and react to an external event. As with everything, there is a trade-off here between reducing the latency and keeping the computational requirements manageable. For many applications, 25 fps, which results in an image every 40 ms, is a reasonable choice. In our experiments, however, we decided to go a little beyond this and chose a data stream with 8 FullHD color cameras streaming at 30 fps.
What kind of throughput should we expect in a system like this? Each FullHD RGB camera image is represented as 1920 * 1080 * 3 = 6,220,800 bytes. If the camera is running at 30 frames per second, we will end up with 6,220,800 * 30 = 186,624,000 bytes per second. Since there are 8 cameras, the total will be 186,624,000 * 8 = 1,492,992,000 bytes per second, or roughly 1.4 GByte/s (taking 1 GByte as 2^30 bytes). We will run each test for 13 seconds, which translates to roughly 18 GBytes of data. Already pretty impressive, right? If we record continuously, a 1 TByte disk will become full in just 12 minutes.
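The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch; it assumes binary units (1 GByte = 2^30 bytes, 1 TByte = 2^40 bytes), which is what makes the rounded figures of 1.4 GByte/s, 18 GBytes, and 12 minutes line up. All variable names are ours.

```python
# Back-of-the-envelope data rate for the camera setup described above.
WIDTH, HEIGHT, BYTES_PER_PIXEL = 1920, 1080, 3  # FullHD RGB, one byte per channel
FPS = 30
CAMERAS = 8
TEST_SECONDS = 13

GIB = 1024 ** 3  # assumption: "GByte" is counted in binary units
TIB = 1024 ** 4  # assumption: "TByte" is counted in binary units

bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL
bytes_per_camera_per_s = bytes_per_frame * FPS
total_bytes_per_s = bytes_per_camera_per_s * CAMERAS

gib_per_s = total_bytes_per_s / GIB
test_volume_gib = total_bytes_per_s * TEST_SECONDS / GIB
minutes_to_fill_1tib = TIB / total_bytes_per_s / 60

print(f"per frame:        {bytes_per_frame} bytes")
print(f"per camera:       {bytes_per_camera_per_s} bytes/s")
print(f"all cameras:      {total_bytes_per_s} bytes/s ({gib_per_s:.2f} GiB/s)")
print(f"13 s test volume: {test_volume_gib:.1f} GiB")
print(f"1 TiB disk full:  {minutes_to_fill_1tib:.1f} minutes")
```

In decimal units the same stream is about 1.5 GB/s, so a disk sold as "1 TB" would fill up slightly sooner than the figure above.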
In order to best simulate a real-world use case, we decided to run our tests on an ordinary NVMe drive, rather than on a RAM disk as is often done in other benchmarks.
Ok, let's have some fun and see what we get!
Testing system and test methodology
All tests were run with 8 publishers, each publishing a 1920x1080 FullHD image at 30 fps. Each publisher was on a separate topic, and the publishers were isolated to 4 dedicated CPU cores. The recording was done using one instance of rosbag2 running on 2 separate, dedicated CPU cores.
We allocated 2 dedicated cores for rosbag2 since it is divided into two major layers internally via a double-buffered cache mechanism.
The first "transport" layer, or producer, is responsible for receiving incoming messages. The second "storage" layer, or consumer, is responsible for saving these messages to disk. Each layer runs independently in its own thread. When the storage thread finishes dumping its buffer, it is swapped with the buffer that the transport thread has been writing messages into in the meantime. With this architecture, they can run completely independently in parallel as long as they are given enough CPU resources.
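The producer/consumer hand-off described above can be sketched in Python as follows. This is a simplified illustration of the double-buffering idea, not the rosbag2 implementation: the class and function names are hypothetical, the cache is unbounded (a real recorder has to bound it and decide what to do when it fills up), and appending to a list stands in for writing to disk.

```python
import threading


class DoubleBufferedCache:
    """Hypothetical two-buffer cache: one buffer is filled while the other is drained."""

    def __init__(self):
        self._producer_buf = []  # filled by the transport thread
        self._consumer_buf = []  # drained by the storage thread
        self._cond = threading.Condition()
        self._closed = False

    def push(self, msg):
        """Called by the transport (producer) thread for every incoming message."""
        with self._cond:
            self._producer_buf.append(msg)
            self._cond.notify()

    def close(self):
        """Signal that no more messages will arrive."""
        with self._cond:
            self._closed = True
            self._cond.notify()

    def swap_and_take(self):
        """Wait for data, swap the buffers, and return the full one.

        Returns None once the cache is closed and fully drained.
        """
        with self._cond:
            while not self._producer_buf and not self._closed:
                self._cond.wait()
            if not self._producer_buf:
                return None
            # The swap is the only step that needs the lock; draining happens outside it.
            self._producer_buf, self._consumer_buf = self._consumer_buf, self._producer_buf
            return self._consumer_buf


def storage_loop(cache, sink):
    """Storage (consumer) thread: drain one buffer while the other one fills."""
    while True:
        batch = cache.swap_and_take()
        if batch is None:
            break
        sink.extend(batch)  # stand-in for writing the batch to disk
        batch.clear()       # hand an empty buffer back at the next swap


cache = DoubleBufferedCache()
written = []
storage = threading.Thread(target=storage_loop, args=(cache, written))
storage.start()
for i in range(1000):  # the main thread plays the role of the transport layer
    cache.push(i)
cache.close()
storage.join()
print(f"recorded {len(written)} messages")
```

The point of the design is that the lock is held only for an append or a pointer swap, so the slow disk write never blocks message reception.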
HW and SW used during tests
● Lenovo P1 Gen4 with
○ CPU: 11th Gen Intel Core i7-11800H, 2.30 GHz with boost up to 4.6 GHz and Hyper-Threading enabled.
○ OS: Ubuntu 22.04.2 LTS, kernel 5.14.0-1051-oem.
○ RAM: 32 GB of DDR4 3200 MHz. Throughput approx. 27.5 GByte/s.
○ NVMe: SAMSUNG_MZVL21T0HCLR-00BL7 with a non-encrypted volume. The measured recording speed is around 2.7 GByte/s for block size = 6220848 bytes.
● Apex.Grace 2.2.0 and Apex.Ida 1.2.0 before and after release. For performance evaluation, we used the rosbag2_performance_benchmarking package with some modifications to be able to run with the MCAP storage plugin and to allow us to set the CPU affinity of threads, as well as measure the average CPU load per core and per running process.
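For readers unfamiliar with CPU affinity, the idea of pinning a workload to dedicated cores can be illustrated with Python's standard library on Linux. This is only an analogue of what the modified benchmark does, not its actual code; the helper name `pin_to_cores` is hypothetical, and the sketch simply picks the first two cores the process is already allowed to use.

```python
import os


def pin_to_cores(cores, pid=0):
    """Restrict process `pid` (0 = the caller) to `cores` and return the resulting set."""
    os.sched_setaffinity(pid, cores)
    return os.sched_getaffinity(pid)


pinned = None
chosen = None
if hasattr(os, "sched_setaffinity"):    # this API is only available on some platforms, e.g. Linux
    original = os.sched_getaffinity(0)
    chosen = set(sorted(original)[:2])  # first two allowed cores, for illustration only
    pinned = pin_to_cores(chosen)
    os.sched_setaffinity(0, original)   # restore, so the demo has no lasting effect
    print(f"pinned to cores {sorted(pinned)}, then restored to {sorted(original)}")
else:
    print("CPU affinity API not available on this platform")
```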
We ran multiple tests both before and after our improvements and with both the SQLite3 and MCAP backends for comparison. The test results labeled as “before improvements” will be very similar to what would be expected if run with baseline ROS 2 Rolling using rmw_cyclonedds_cpp. We intentionally did not take advantage of any Apex.Grace-specific loaned message APIs on the publisher side in order to have a fairer comparison with ROS 2.
We used the fastest configuration for SQLite3 with “PRAGMA journal_mode=MEMORY; PRAGMA synchronous=OFF;”. However, it should be noted that this configuration has a high risk of data loss or file corruption if the system crashes or power is lost.
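To show what these two pragmas do at the SQLite level, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout and the helper name `open_fast_unsafe` are hypothetical and are not the rosbag2 schema or API; only the two `PRAGMA` statements come from the configuration quoted above.

```python
import os
import sqlite3
import tempfile


def open_fast_unsafe(path):
    """Open a database with the throughput-oriented (but crash-unsafe) pragmas."""
    conn = sqlite3.connect(path)
    # Keep the rollback journal in RAM instead of in a file next to the database.
    journal_mode = conn.execute("PRAGMA journal_mode=MEMORY;").fetchone()[0]
    # Do not wait for the OS to flush writes to the physical disk.
    conn.execute("PRAGMA synchronous=OFF;")
    synchronous = conn.execute("PRAGMA synchronous;").fetchone()[0]
    return conn, journal_mode, synchronous


with tempfile.TemporaryDirectory() as tmp:
    conn, journal_mode, synchronous = open_fast_unsafe(os.path.join(tmp, "demo.db3"))
    # Hypothetical table, for illustration only.
    conn.execute("CREATE TABLE messages (timestamp INTEGER, data BLOB)")
    conn.execute("INSERT INTO messages VALUES (?, ?)", (0, b"\x00" * 16))
    conn.commit()
    row_count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    conn.close()

print(f"journal_mode={journal_mode}, synchronous={synchronous}, rows={row_count}")
```

Both settings trade durability for speed: with the journal in memory and synchronous writes disabled, an interrupted transaction cannot be rolled back after a crash, which is the corruption risk mentioned above.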
For MCAP, we also used the fastest configuration with no chunking, no compression, no CRC calculation and without indexes. This is the highest throughput MCAP configuration, with the caveat that recordings will need to be reindexed later in order to be read efficiently. Compared with the SQLite3 configuration above, this MCAP configuration is more crash resilient: it can only lose messages that are still in either the rosbag2 cache or the OS kernel cache.
Please note that there are two possible scenarios for data transfer over shared memory when using iceoryx. The first is true zero-copy, where data is transferred in non-serialized form. The second, less optimal scenario is where data is first serialized before being transferred over shared memory. Serialization is required when the same topic has subscribers using DDS transport in addition to subscribers on shared memory, or when the message type is not bounded or fixed in size.
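The two conditions above can be summarized as a small decision function. This is purely an illustration of the rule as stated in the text, not code from the middleware; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TopicState:
    """Hypothetical summary of the facts that decide the transfer mode for one topic."""
    has_dds_transport_subscribers: bool       # some subscriber on this topic uses DDS transport
    type_is_bounded_or_fixed_size: bool       # message type is bounded or fixed in size


def transfer_mode(topic: TopicState) -> str:
    """Return which of the two shared-memory scenarios applies to the topic."""
    if topic.has_dds_transport_subscribers or not topic.type_is_bounded_or_fixed_size:
        return "serialized over shared memory"
    return "true zero-copy"


local_fixed = transfer_mode(TopicState(False, True))
mixed_subscribers = transfer_mode(TopicState(True, True))
unbounded_type = transfer_mode(TopicState(False, False))
print(local_fixed, "|", mixed_subscribers, "|", unbounded_type)
```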
Test results for true zero-copy with no serialization
The number of messages recorded with rosbag2 before and after improvements, and with mcap and sqlite3 backends.
As you can see, even before our improvements, MCAP recorded 92.15% of the published messages compared to 68.8% for SQLite3, a difference of 92.15 - 68.8 = 23.35 percentage points. After our improvements, SQLite3 became much better, but it still lost almost 8% of messages.
Average CPU usage per core (2 cores total) before and after improvements, and with mcap and sqlite3 backends
The numbers on the graph represent the average CPU load for the process where we were running rosbag2, divided by the number of cores used. In our case, all tests were run with 2 CPU cores. Overall, we were able to reduce the average CPU load by a factor of 2 for MCAP, and by a factor of 1.5 for SQLite3. Pretty impressive, isn’t it?
Still, there are questions about why we are losing messages with SQLite3 even after our improvements, as well as how exactly the CPU load was spread across the two cores over time.
Let’s take a closer look and go under the hood.
The following graphs show the CPU load over time for two experiments run back to back. Each experiment was run for approximately 13 seconds, with a pause between them in order to let the recorder finish writing data to disk and to write the test summary. In these experiments, CPU cores 15 and 16, marked in red and dark blue, were dedicated to rosbag2.
CPU load per core before improvements with MCAP backend
From this graph, we can see that both cores are sometimes completely saturated. This could explain the message loss seen when using MCAP before our improvements, which was not large but still around 8%.
CPU load per core after improvements with MCAP backend
This graph looks much better. Each CPU core has an average load below 50%, with only occasional spikes up to 55-65%. This indicates that we have a good bit of headroom and should be capable of running well even on less powerful CPUs. Alternatively, we could also increase the amount of data being recorded on this system by adding even more cameras or LIDAR sensors.
CPU load per core before improvements with SQLite3 backend
Here, both CPU cores spend the vast majority of the experiment completely saturated. This likely explains the significant loss of messages when using SQLite3 before our improvements. In this case, we just can’t keep up with the speed of the data stream.
CPU load per core after improvements with SQLite3 backend
After our improvements, the situation with SQLite3 did get noticeably better but is still far from ideal. One core spends most of its time below 35% utilization, but the other core is still nearly saturated. This results in message loss even though the average CPU load dropped to 54%.
Test results for serialized data over shared memory
The improvements seen in the following experiments are very similar to those made to the data path for sending messages over the wire with RTPS (the Real-Time Publish-Subscribe protocol, which is part of DDS), since serialization is required in both cases.
In our test setup, we can’t measure the improvements for messages transferred over the network at data rates as high as 1.4 GByte/s. However, we can estimate them based on these experiments, which share a similar data path.
The number of messages recorded with rosbag2 before and after improvements, and with mcap and sqlite3 backends
As expected, the extra data copies required in this data path put a larger load on the system, and as a result, caused a larger number of messages to be lost. However, after our improvements, MCAP is performant enough to record 100% of published messages successfully.
Average CPU usage per core (2 cores total) before and after improvements, and with mcap and sqlite3 backend
The numbers on the graph represent the average CPU load for the process in which we were running rosbag2, divided by the number of cores used. In our case, all tests were run with 2 CPU cores. When using MCAP, you can see that the CPU load increased from 37% to 57% compared to the true zero-copy configuration above.
CPU load per core before improvements with MCAP backend
This graph looks similar to what was seen above in the true zero-copy scenario, though with the CPU cores being saturated somewhat more often.
CPU load per core after improvements with MCAP backend
Here, the data does not look as good as with true zero-copy, but it is still a noticeable improvement. Neither CPU core is ever completely saturated, and we are able to reliably record 100% of the published messages as a result. There is a small spike at the end of the second experiment, but this is likely just the final dump of cached data to disk when the experiment finishes and should not cause any data loss.
CPU load per core before improvements with SQLite3 backend
As in the true zero-copy scenario before, the CPU is completely saturated most of the time here, so the recorder will likely lose a large number of messages since it cannot keep up.
CPU load per core after improvements with SQLite3 backend
Here, as in the zero-copy scenario discussed above, there is some improvement, but one core is still saturated much of the time. Compared to true zero-copy, the other core, #16 (the red line on the graphs), also has a 20% higher load, which results in 15% fewer messages being recorded in this scenario.
Summary
With our improvements, we were able to reliably record 100% of published messages when using the MCAP storage backend, even when not taking advantage of true zero-copy. With MCAP and true zero-copy, the CPU load was reduced by a factor of 2, going down from 74% to 37%. The performance with SQLite3 also improved, although to a lesser degree: the CPU load dropped by a factor of 1.5, from 83% to 55%, and we were unable to reliably record all messages in either scenario when using the SQLite3 storage backend.
If you are interested in Apex.AI products for your projects, contact us. We’re always looking for talented people to join our international team. Visit our careers page to view our open positions.