Introducing ROS2 Sanitizer Report and Analysis

Introducing ROS2 Sanitizer Report and Analysis

AWS Robotics has been working on ROS2 code quality and infrastructure improvements since January 2019. Our project specifically focused on issues reported by the AddressSanitizer (ASan) and ThreadSanitizer (TSan) C/C++ runtime analysis tools. Issues the sanitizers surfaced include:

  • lock order inversion (potential deadlocks)
  • data races
  • heap-use-after-free (accessing heap memory after it is freed)
  • memory leaks
  • signal handler spoils errno (a signal handler overwrites errno)

The scope of our project was to analyze as much of the ROS2 core code base as possible (everything included when building with —packages-up-to system_test), fix issues, develop a process so the community can also fix issues, and integrate sanitizers into nightly CI. These topics, as well as what we have done, are outlined in this document.

ROS2 Core Code Report

Sanitizers capture very detailed information during runtime about code quality issues and print them to stderr. Many of the surfaced issues are duplicates as the code path that has them can be encountered multiple times. Just by looking at stderr output, it’s difficult to see how many issues there are and which ones are duplicates. To make it easy to see, we implemented a colcon plugin that parses the output of sanitizer issues as they are printed to stderr, deduplicates them, and writes them in a readable CSV. If the same issue is printed 500 times to stderr during colcon test, it shows up as a single CSV line with a count of 500. The line also includes the relevant information needed for debugging the issue.

When we started the project, we created two reports — one for ASan and one for TSan — using the colcon tooling we created. The general workflow for generating the reports is as follows.

First, ROS2 code needs to be compiled with ASan or TSan enabled. We created colcon mixins to make it easier to do that.

Then, run the tests with the sanitizer_report event handler enabled. This event handler parses all printed sanitizer issues, consolidates duplicates, and writes them to a CSV file.

See our tutorial for a complete walk-through of building and testing with ASan or TSan enabled.

We limited our testing to a subset of ROS2 core packages including those in rcl, rclcpp, rmw (Fast-RTPS only), rosidl (Fast-RTPS only), and system_tests repositories, which contain 83,760 source lines of code. While we started with only Fast-RTPS, we will include all DDS implementations in the future as we need to surface sanitizer issues in their associated rmw implementations.

The raw ASan report created on 2019-05-01 showed that the sanitizer raised 1,128 issues. Our deduplication logic reduced them to 37 unique (root) issues.

  • 34 memory leaks
  • 3 heap-user-after-free

The raw TSan report created on 2019-05-01 showed that the sanitizer raised 7,656 issues. Our deduplication logic reduced them to 61 unique (root) issues.

  • 47 data races
  • 10 potential deadlocks
  • 2 heap-use-after-free
  • 2 signal handler spoils errno

TSan report includes errors that originate from ROS2 core packages and from DDS libraries.

Issues We’ve Fixed

To date, we’ve opened 21 pull requests in ROS2 fixing sanitizer issues in low-level ROS2 Core packages. Our strategy is to fix all sanitizer issues in ROS2 Core packages starting with the lowest-level dependencies and working up the ROS2 stack. While fixing issues, we created a tutorial which can be used by the community to discover and fix sanitizer issues. Below is a breakdown of our pull requests.

With the above fixes, tests in rcutils and rcpputils run without raising any sanitizer issues. The total number of reported ASan issues dropped from 1128 to 117 (an 89% decrease) and the number of deduplicated root issues dropped from 37 to 19.

Most of the above fixes are in test code, as sanitizers are runtime analysis tools with significant overhead and are only practical to use during tests. They capture issues from any code that is exercised (including test code) and we can’t initially tell if an issue is in source or test code. Though we need to resolve all sanitizer issues for CI to be green, issues in source code are more concerning for production scenarios.

We reviewed our results with eProsima and they already submitted fixes for the following TSan issues in Fast-RTPS.

The New Sanitizer Nightly CI Jobs

One important aspect of this project was to integrate these sanitizers into the nightly ROS2 CI jobs. Once we have them integrated and all issues resolved (jobs are green), we can begin to block the build if any new issue is detected in these packages (new regressions). You can see the sanitizer nightly jobs here

Initially, we focused on fixing sanitizer issues in the rcpputils and rcutils packages as they’re at the lower level in ROS2 dependencies and we knew we could address all the issues in these packages within the project timeline. As a result, both jobs are green (they run with zero sanitizer issues). Going forward, we want to work our way up the ROS2 stack, adding packages to these jobs while keeping them green.

Next Steps

We created new tools to make it easy to use ASan and TSan with ROS2, used those tools to identify issues in the ROS2 core code base, fixed many of those issues, created ASan and TSan nightly CI jobs, and got a few of the base ROS2 packages to green in those jobs. We will continue to improve our sanitizer tools and explore other means of ensuring good quality code.

We feel we’re in a state where we can also solicit input and involvement from the ROS2 community. We encourage the community to use our tutorial to learn how to use the new ASan and TSan tools to improve code quality of ROS2 and any project built on top of it.

25 Likes

I’d be interested to see this work linked to a steppable simulator. Then, even with the overhead, you’d be able to test application code.

3 Likes

Thanks for this contribution @cevans87, @Thomas_Moulard and everyone in the AWS Robotics team!

After putting together a few nights and weekends of free time, I managed to reproduce your work and got some additional understanding on several of the sanitizers for dynamic bug finding. I’ve documented my attempts for ASan while applying it with the Dashing release in here (there’s an attempt to run it in OS X but I eventually gave up since I had very limited time, the Dockerfile should help reproduce the setup I’ve used).

Beyond applying this to the ROS 2 codebase, I thought this could have a decent impact in the MoveIt 2 development and wanted to see whether I could gain some quick insights including this together with the MoveIt 2 alpha release (which already includes some tests we’ve ported). My second attempt is described in here. By solely analyzing moveit_core, I was able to find several memory leaks so it seems we definitely want to include this in our development process. Not sure how much bandwidth myself or my colleagues will have in the short term but ideally, we should aim to get this into the MoveIt 2 CI (raised an issue for it).

This would definitely be of my interest as well (at least sounds like a fun project to try during a few nights). Particularly in combination with Gazebo, such toolset could become a rather powerful helper to diagnose vulnerabilities while simulating the actual robot behaviors intended to happen in the real world.

I’ve given it a thought but I’d like to ask for advice. I can certainly benefit from additional input. Do you have any intuition @gbiggs (or anyone else) on how you’d start with this in Gazebo? Do you foresee any modification required in Gazebo or rather, simply build the application code with the right set of flags to enable sanitizers’ reports (as demonstrated already)?

1 Like

Thanks for trying this out @vmayoral! This would be a great addition to MoveIt 2 IMHO, motion planning can be memory intensive so fixing leaks could lead to noticeable improvements.

Regarding simulation, a “normal” Gazebo could communicate with ASan/TSan enabled nodes. The only issue I expect would be that any code relying on timeout values will be fragile as the sanitizers induce a performance penalty. I saw a few ROS 2 tests relying on timeouts failing when ASan or TSan is enabled.

1 Like

I think it would work best with code that is designed to run in real-time with known execution times per step. That could possibly allow you to adjust the execution rate to account for the overhead of running the analysers. You would probably need to profile how much overhead the analysers add to your code first.

I don’t have any ideas off-hand for how to handle timeouts.

1 Like

Sorry for replying old topic.

I’m trying to fix remaining sanitizer issues.
I first tied rcl and found almost of them are about test code but a few are about rcl code.

I sent PR about these(I related them with [rcl isses 469 “Memory leak in test_subscription__rmw_fastrtps_cpp”](Memory leak in test_subscription__rmw_fastrtps_cpp)).

Also I found rcl line-coverage by UT is 75.2%.
Some functions are not tested in graph.c, init_options.c, node_options.c, publisher.c, subscription.c, timer.c.
So I plan to write more tests and send PR.

The same can be said about rmw.
Rmw passes ASan test, but test coverage is 58.6%(lines), 43.3%(functions).
I plan to address them too.

If there is a goal for test coverage or quality, please tell me.

Thank you.

Thanks a lot for your work on this @y-okumura-isp!

Did you take a look at the (future) REP-2004 ?

For ROS core packages, there is an ambitious goal set to reach 95% line code coverage.

On another note, we setup an ASAN nightly job there: https://ci.ros2.org/view/nightly/job/nightly_linux_address_sanitizer/

Currently, it only builds until rcpputils as only those few packages are 100% clean.
If merging your PRs could allow us to get more of ROS 2 core packages green, please consider updating the job to start building more code with ASAN every night by replacing rcpputils in the following code:

Thanks again for your help with this!

@Thomas_Moulard

Thank you for reply.

Did you take a look at the (future) REP-2004

OK, I’ll read it!