Adding clang thread safety analysis for ROS2 core packages

Hi all :wave:

There’s been some discussion about adding clang’s compile time thread safety analysis for ROS2 core packages, to help us find mistakes automatically and early on. I’ve talked to @tfoote and @Thomas_Moulard about this a bit offline, we wanted to move it into a public forum before moving further.

Annotation Macros

The thread safety features are enabled by some annotations that are best used by macros that can be turned off when using non-clang compilers (see mutex.h in the clang doc page), as a general utility. The first question is, where should these be defined?

  • One thought was to add it to rclcpp, but since we definitely want to use it in the rmw layer, that would be too high
  • Another candidate is rcutils, but these macros are C++ specific (extending functionality of std::mutex, and that packages is meant to be C-specific

Given that, I personally think creating a new rcpputils package makes the most sense. It hasn’t been needed yet but could make sense now given we have something that would fit there.

Usage

In terms of using these annotations, we have a Ci.ros2.org clang nightly job now available that could raise thread safety warnings for annotated code, at least in the nightly build.

For usage and roll-out, the rmw packages are prime candidates for first packages to start using the annotations, since they are highly threaded code. The thinking on guidelines

  • private mutexes
  • use GUARDED_BY on mutex-protected members to surface the rest of the usage needs
  • copy small things to avoid blocking

I’ve put together a proof of concept in rmw_fastrtps to illustrate

Feedback?

Before changing anything, want to hear thoughts on 1) where the macros should live, whether we should create that new package, and 2) usage patterns as put together in that WIP commit

4 Likes

How does the analysis handle threads that are not annotated? For example, threads inside the DDS implementation.

I noticed in your proof of concept that there are places where locks have been added and places where they have been removed. Is this for the analysis or because the analysis found errors?

Anything not annotated doesn’t get analyzed. The llvm libcxx implementation has the Capability annotations on its std::mutex implementation, but if you don’t mark any data as GUARDED_BY the mutex, the analysis has nothing to operate on

I kept all existing locking logic intact but shifted ownership of the locking around - The main difference is that in favor of a private mutex ownership patterm, I removed the “lock-wrapper” LockedObject pattern, because it spread out locking and safety in such a way that it would be easy to make more mistakes, spreads ownership of safety to any user of the data, instead of centralizing in the thread-accessed data. That may not be the way we want to go, it’s just what seems to make the most sense to me

Yep, that makes sense. What I’m wondering is does this have any impact on the analysis of threads that are annotated? If an annotated thread interacts with a non-annotated one, how trustworthy are the analysis results?

I haven’t looked at the details, but what you say makes sense from a maintainability point of view.

Good question - the way I understand it the annotations are more about data than the threads themselves - so if you annotate data as being guarded, the analysis will spit out warnings for other code/packages only if it is built with -Wthread-safety, and accesses that data unsafely (according to the the analysis rules). So we could keep “non-ported” code warning-free by not enabling that flag on it until it’s been looked at specifically.

OK, that makes sense to me. Thanks for the clarifications!

Proposal for phasing/staging moving forward on this:

  1. Create rcpputils package with annotation macros
  2. Enable -Wthread-safety on rmw_fastrtps and migrate to the private-mutex pattern for TopicCache
  3. Enable -Wthread-safety-negative on rmw_fastrtps and add negative capability annotations where needed
  4. Perform 2&3 on the other rmw implementations

@emersonknapp Thanks for your complete and well thought out proposal. Hopefully this can be a good example for others going forward.

Overall we have general support for a utility package with a scope like rcpputils in parallel with rcutils here. There are several utility functions in rclcpp that could be refactored out into this package too in the future.

There was some discussion about whether there might be another more generic package scope. But there wasn’t any strong alternative proposals so lets go ahead with this as proposed.

@tfoote what’s the process for creating the new repo? I have a take at a an empty package on my account, with a PR against it adding the macros

But I realized I can’t transfer ownership to ros2 because I don’t have permissions to create repos in the org. Advice?

Moving into an organization is a bit of a challenge. The most common thing to do is to transfer it to one of the individuals who has access and then they transfer it into the organization. Luckily github is relatively good about setting up redirects so most people won’t notice the indirect path.

If you can transfer it to me, I can then transfer it to the organization.

Great! Done, thanks!

Another thought, it seems a little unwieldy to add this block to the CMakeLists.txt for every package that wants to use this analysis

# Enable thread safety annotations and analysis when using clang
if(CMAKE_CXX_COMPILER_ID MATCHES "Clang")
  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -stdlib=libc++ -std=c++14")
  add_definitions(-D_LIBCPP_ENABLE_THREAD_SAFETY_ANNOTATIONS)
  add_compile_options(-Wthread-safety)
endif()

Would there be a better way to either

  1. Put that block in for all ament packages, so it’s enabled globally, but only affects code that is annotated using the C++ macros
  2. Add a cmake macro/function to enable it as a one-liner, like ament_enable_thread_safety_analysis?

@Thomas_Moulard maybe you have thoughts on this?

We have two ways IMHO:

  1. Keep TSA optional and write a colcon mixin to enable this behavior.
  2. Inject the statements you quoted everywhere using the *-extras.cmake mechanism.

My feeling is that, keeping this opt-in per-package will be challenging because mixing the clang and GNU standard libraries is probably unsafe. So it would need to be a global on/off flag. 2. is most likely too brutal of a change so 1. could be the best option.

I’ve hit a wall with enabling this - given the following constraint

  • we do not want to make our own Mutex wrapper, in favor of using std::mutex

High level problem -

  1. To get the capability annotations on std::mutex, I need to use libc++ (LLVM)
  2. On Linux, by default, all libraries use libstd++ (GNU)
  3. The build, using libc++ on an annotated package, can succeed just fine, BUT
  4. Any std objects passed across .so boundaries into this library now have runtime undefined behavior, because you call libc++ functions on libstdc++-created objects

Specific example case that I ran into -

  • I add thread safety annotation to rmw_fastrtps_shared_cpp
    • link it to libc++ to have std::mutex be analyzable
  • upstream, an rclcpp test creates a Node, which eventually resolves to fastrtps_shared_cpp::rmw_node.cpp::__rmw_create_node, which:
    • creates a ParticipantAttributes from fastrtps (which has not been modified, so it uses libstdc++ std::vector creation)
    • calls vector::resize on the member vector (which jumps into the libc++ implementation)
    • program hangs
    • slight variations on operations can cause a crash

Workarounds I’ve tried:

  • colcon build --cmake-args -DCMAKE_CXX_FLAGS=-stdlib=libc++ (mixin approach)
    • FAILS (on Linux) because poco_vendor is not built in the ROS2 workspace, and we get stdlib linker errors against it
  • rmw_shared_cpp-extras.cmake with the -stdlib=libc++ block
    • FAILS for same reason

In conclusion, here are a list of options I have come up with to move forward (I am also very open to other suggestions)

  1. When we run this analysis, libc++ linking flag that is turned on by the build caller, not specified at the package level
    • Static analysis runs on compilation, giving useful warning messages
    • We cannot expect linking or running code to work
    • So, somehow disable linking?
  2. Build all code from source, allowing forcing the standard library for all code
    • looking at libpoco specificially here
  3. Run this analysis on Mac only, where all the code will be linked against LLVM libc++
  4. Remove the initially stated constraint and provide our own annotated mutex wrapper, removing the need for libc++
  5. Modify Fast-RTPS interfaces to remove the need to interact with std:: containers directly
    • This may not be the only place where std:: objects are used across API boundaries, but it’s the place I identified in my first annotations

@tfoote thoughts?

I think this analysis is valuable, but it’s definitely got its technical hangups here

@Thomas_Moulard

Should the analysis always be run for best results, or can we limit it to a CI run? If the latter, then perhaps we can set up a separate CI run that builds everything from source using libc++ and run that once a day or something, rather than making it part of every CI build.

That sounds like the commonly used approach for this.

In general the same applies when mixing libraries built with different C++ compilers, e.g. g++ vs. clang. While you might be lucky this is not something guaranteed.

1 Like

I think that we gain value by running at all. Though it would be nice to run for every build, it seems prohibitively expensive given the current situation - we’re only running a Clang build nightly today as it is.

Based on @dirk-thomas input, my thinking now is to get all code building from source in the nightly Clang build and force that build to use libc++

As we talked offline, I would do as follow:

  1. Add and merge PRs adding the TSA colcon mixin, the rcpputils code for the macros and the class annotation. We can do that now and this is a no-op if the mixing is not being used.
  2. Try to find a way for the whole build + test run to succeed:
  • Only target Mac OS X for now (if clang stdc++ is the default one on OS X? otherwise I guess brew can pass custom flags?)
  • Try to build everything from source on Linux and check that it works. If yes, we can figure a way to build everything from source in the CI.