Introducing ROS2 Sanitizer Report and Analysis
AWS Robotics has been working on ROS2 code quality and infrastructure improvements since January 2019. Our project specifically focused on issues reported by the AddressSanitizer (ASan) and ThreadSanitizer (TSan) C/C++ runtime analysis tools. Issues the sanitizers surfaced include:
- lock order inversion (potential deadlocks)
- data races
- heap-use-after-free (accessing heap memory after it is freed)
- memory leaks
- signal handler spoils errno (a signal handler overwrites errno)
The scope of our project was to analyze as much of the ROS2 core code base as possible (everything included when building with --packages-up-to system_test), fix issues, develop a process so the community can also fix issues, and integrate sanitizers into nightly CI. This document outlines each of these topics and what we have done so far.
ROS2 Core Code Report
Sanitizers capture very detailed information about code quality issues at runtime and print it to stderr. Many of the surfaced issues are duplicates, because the code path that triggers them can be executed many times. From the stderr output alone, it is difficult to tell how many distinct issues there are and which ones are duplicates. To make this easier, we implemented a colcon plugin that parses sanitizer issues as they are printed to stderr, deduplicates them, and writes them to a readable CSV file. If the same issue is printed 500 times to stderr during colcon test, it shows up as a single CSV line with a count of 500. The line also includes the information needed to debug the issue.
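To make the deduplication idea concrete, here is a minimal Python sketch. It is not the plugin's actual implementation: the grouping key (sanitizer, error type, and the first few stack frame function names), the CSV column names, and the helper names are all assumptions chosen for illustration. It also treats everything that follows one header line as a single report, which is a simplification.

```python
import csv
import io
import re
from collections import Counter

# Header lines as printed by the sanitizers, for example:
#   ==123==ERROR: AddressSanitizer: heap-use-after-free on address 0x...
#   WARNING: ThreadSanitizer: data race (pid=123)
HEADER = re.compile(
    r"(?:ERROR|WARNING): (AddressSanitizer|LeakSanitizer|ThreadSanitizer): "
    r"(.+?)(?: on address| \(pid=|$)"
)
# Stack frame lines. ASan prints "#0 0x4f2a in func file:line",
# TSan prints "#0 func file:line (binary+0x...)".
FRAME = re.compile(r"^\s*#\d+ (?:0x[0-9a-fA-F]+ in )?(\S+)")


def split_reports(stderr_text):
    """Return a list of (sanitizer, error_type, frames), one per header line."""
    reports = []
    for line in stderr_text.splitlines():
        header = HEADER.search(line)
        if header:
            reports.append((header.group(1), header.group(2).strip(), []))
            continue
        frame = FRAME.match(line)
        if frame and reports:
            reports[-1][2].append(frame.group(1))
    return reports


def deduplicate(stderr_text, key_frames=3):
    """Count reports that share the same error type and top stack frames."""
    counts = Counter()
    for sanitizer, error_type, frames in split_reports(stderr_text):
        counts[(sanitizer, error_type, tuple(frames[:key_frames]))] += 1
    return counts


def to_csv(counts):
    """Write one CSV row per unique issue (column names are hypothetical)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sanitizer", "error_type", "top_frames", "count"])
    for (sanitizer, error_type, frames), count in counts.most_common():
        writer.writerow([sanitizer, error_type, " <- ".join(frames), count])
    return out.getvalue()
```

Keying on only the top few frames is a design choice in this sketch: a larger key_frames value separates issues more finely, while a smaller one merges reports that reach the same faulty function through different callers.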
When we started the project, we created two reports, one for ASan and one for TSan, using the colcon tooling we built. The general workflow for generating the reports is as follows.
First, ROS2 code needs to be compiled with ASan or TSan enabled. We created colcon mixins to make it easier to do that.
- colcon build with flag for ASan mixin:
colcon build --mixin asan-gcc
- colcon build with flag for TSan mixin:
colcon build --mixin tsan
Then, run the tests with the sanitizer_report event handler enabled. This event handler parses all printed sanitizer issues, consolidates duplicates, and writes them to a CSV file.
- colcon test with the sanitizer_report event handler enabled:
colcon test --event-handlers sanitizer_report+
See our tutorial for a complete walk-through of building and testing with ASan or TSan enabled.
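For anyone scripting the two-step workflow above, the following Python sketch assembles the same colcon commands for either sanitizer. The function names and the MIXINS table are hypothetical helpers; only the colcon arguments themselves come from the commands shown above.

```python
import shlex
import subprocess

# Mixin names as used in the build commands above.
MIXINS = {"asan": "asan-gcc", "tsan": "tsan"}


def sanitizer_workflow(sanitizer):
    """Return the build and test commands for 'asan' or 'tsan' as argv lists."""
    mixin = MIXINS[sanitizer]  # raises KeyError for an unknown sanitizer
    return [
        ["colcon", "build", "--mixin", mixin],
        ["colcon", "test", "--event-handlers", "sanitizer_report+"],
    ]


def run_workflow(sanitizer):
    """Run the commands in order, stopping at the first failure."""
    for command in sanitizer_workflow(sanitizer):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe to try.
    for command in sanitizer_workflow("asan"):
        print(shlex.join(command))
```

Calling run_workflow assumes colcon and the mixins are already installed in the current workspace.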
We limited our testing to a subset of ROS2 core packages including those in rcl, rclcpp, rmw (Fast-RTPS only), rosidl (Fast-RTPS only), and system_tests repositories, which contain 83,760 source lines of code. While we started with only Fast-RTPS, we will include all DDS implementations in the future as we need to surface sanitizer issues in their associated rmw implementations.
The raw ASan report created on 2019-05-01 showed that the sanitizer raised 1,128 issues. Our deduplication logic reduced them to 37 unique (root) issues.
- 34 memory leaks
- 3 heap-use-after-free
The raw TSan report created on 2019-05-01 showed that the sanitizer raised 7,656 issues. Our deduplication logic reduced them to 61 unique (root) issues.
- 47 data races
- 10 potential deadlocks
- 2 heap-use-after-free
- 2 signal handler spoils errno
The TSan report includes errors that originate from both ROS2 core packages and DDS libraries.
Issues We’ve Fixed
To date, we’ve opened 21 pull requests in ROS2 fixing sanitizer issues in low-level ROS2 Core packages. Our strategy is to fix all sanitizer issues in ROS2 Core packages starting with the lowest-level dependencies and working up the ROS2 stack. While fixing issues, we created a tutorial which can be used by the community to discover and fix sanitizer issues. Below is a breakdown of our pull requests.
- 4 fixes in source code
  - heap-use-after-free in rclcpp - Fix heap-use-after-free and memory leaks reported from test_node.cpp
  - memory leaks in rcl
- 17 fixes in test code
  - memory leaks
    - rosidl - Fix leak in test_interfaces.c
    - rcpputils - Fix leak in test_basic.cpp
    - rcutils
    - rcl
    - system-tests - Fix memory leaks in test_communication tests
With the above fixes, tests in rcutils and rcpputils run without raising any sanitizer issues. The total number of reported ASan issues dropped from 1,128 to 117 (a decrease of nearly 90%), and the number of deduplicated root issues dropped from 37 to 19.
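The reduction figures can be checked with a few lines of Python. The helper name is ours; the counts are the ones quoted in this report.

```python
def percent_decrease(before, after):
    """Percentage by which a count dropped from 'before' to 'after'."""
    return 100.0 * (before - after) / before


raw = percent_decrease(1128, 117)   # all reported ASan issues
unique = percent_decrease(37, 19)   # deduplicated root issues

print(f"raw ASan issues: {raw:.1f}% fewer")        # prints 89.6% fewer
print(f"unique ASan issues: {unique:.1f}% fewer")  # prints 48.6% fewer
```

The raw count falls much faster than the unique count because a single root issue can account for many repeated reports.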
Most of the above fixes are in test code, as sanitizers are runtime analysis tools with significant overhead and are only practical to use during tests. They capture issues from any code that is exercised (including test code) and we can’t initially tell if an issue is in source or test code. Though we need to resolve all sanitizer issues for CI to be green, issues in source code are more concerning for production scenarios.
We reviewed our results with eProsima, and they have already submitted fixes for the following TSan issues in Fast-RTPS.
- data race
The New Sanitizer Nightly CI Jobs
One important aspect of this project was to integrate these sanitizers into the nightly ROS2 CI jobs. Once they are integrated and all issues are resolved (the jobs are green), we can begin to block the build whenever a new issue (a regression) is detected in these packages. You can see the sanitizer nightly jobs here:
- https://ci.ros2.org/view/nightly/job/nightly_linux_address_sanitizer/
- https://ci.ros2.org/view/nightly/job/nightly_linux_thread_sanitizer/
Initially, we focused on fixing sanitizer issues in the rcpputils and rcutils packages as they’re at the lower level in ROS2 dependencies and we knew we could address all the issues in these packages within the project timeline. As a result, both jobs are green (they run with zero sanitizer issues). Going forward, we want to work our way up the ROS2 stack, adding packages to these jobs while keeping them green.
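One way to picture the "block the build on new issues" step is a small gate that compares the current report against a known baseline. The Python sketch below is only an illustration under stated assumptions: the CSV column names in KEY_COLUMNS are hypothetical, and this is not how the nightly jobs are actually configured.

```python
import csv
import io

# Hypothetical column names; the real report's columns may differ.
KEY_COLUMNS = ("package", "error_name", "stack_trace")


def issue_keys(csv_text):
    """Return the set of unique issue keys found in a report CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {tuple(row[column] for column in KEY_COLUMNS) for row in reader}


def new_issues(baseline_csv, current_csv):
    """Issues present in the current report but absent from the baseline."""
    return sorted(issue_keys(current_csv) - issue_keys(baseline_csv))


def gate(baseline_csv, current_csv):
    """Return a process exit code: 0 if there are no regressions, 1 otherwise."""
    regressions = new_issues(baseline_csv, current_csv)
    for package, error_name, stack_trace in regressions:
        print(f"new sanitizer issue in {package}: {error_name} ({stack_trace})")
    return 1 if regressions else 0
```

For a job that is already green, the baseline contains only the header row, so any row in the current report fails the gate.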
Next Steps
We created new tools to make it easy to use ASan and TSan with ROS2, used those tools to identify issues in the ROS2 core code base, fixed many of those issues, created ASan and TSan nightly CI jobs, and got a few of the base ROS2 packages to green in those jobs. We will continue to improve our sanitizer tools and explore other means of ensuring good quality code.
We feel we’re in a state where we can also solicit input and involvement from the ROS2 community. We encourage the community to use our tutorial to learn how to use the new ASan and TSan tools to improve code quality of ROS2 and any project built on top of it.