There’s been some discussion about adding clang’s compile time thread safety analysis for ROS2 core packages, to help us find mistakes automatically and early on. I’ve talked to @tfoote and @Thomas_Moulard about this a bit offline, we wanted to move it into a public forum before moving further.
Annotation Macros
The thread safety features are enabled by some annotations that are best used by macros that can be turned off when using non-clang compilers (see mutex.h in the clang doc page), as a general utility. The first question is, where should these be defined?
One thought was to add it to rclcpp, but since we definitely want to use it in the rmw layer, that would be too high
Another candidate is rcutils, but these macros are C++ specific (extending functionality of std::mutex, and that packages is meant to be C-specific
Given that, I personally think creating a new rcpputils package makes the most sense. It hasn’t been needed yet but could make sense now given we have something that would fit there.
Usage
In terms of using these annotations, we have a Ci.ros2.org clang nightly job now available that could raise thread safety warnings for annotated code, at least in the nightly build.
For usage and roll-out, the rmw packages are prime candidates for first packages to start using the annotations, since they are highly threaded code. The thinking on guidelines
private mutexes
use GUARDED_BY on mutex-protected members to surface the rest of the usage needs
copy small things to avoid blocking
I’ve put together a proof of concept in rmw_fastrtps to illustrate
Feedback?
Before changing anything, want to hear thoughts on 1) where the macros should live, whether we should create that new package, and 2) usage patterns as put together in that WIP commit
How does the analysis handle threads that are not annotated? For example, threads inside the DDS implementation.
I noticed in your proof of concept that there are places where locks have been added and places where they have been removed. Is this for the analysis or because the analysis found errors?
Anything not annotated doesn’t get analyzed. The llvm libcxx implementation has the Capability annotations on its std::mutex implementation, but if you don’t mark any data as GUARDED_BY the mutex, the analysis has nothing to operate on
I kept all existing locking logic intact but shifted ownership of the locking around - The main difference is that in favor of a private mutex ownership patterm, I removed the “lock-wrapper” LockedObject pattern, because it spread out locking and safety in such a way that it would be easy to make more mistakes, spreads ownership of safety to any user of the data, instead of centralizing in the thread-accessed data. That may not be the way we want to go, it’s just what seems to make the most sense to me
Yep, that makes sense. What I’m wondering is does this have any impact on the analysis of threads that are annotated? If an annotated thread interacts with a non-annotated one, how trustworthy are the analysis results?
Good question - the way I understand it the annotations are more about data than the threads themselves - so if you annotate data as being guarded, the analysis will spit out warnings for other code/packages only if it is built with -Wthread-safety, and accesses that data unsafely (according to the the analysis rules). So we could keep “non-ported” code warning-free by not enabling that flag on it until it’s been looked at specifically.
@emersonknapp Thanks for your complete and well thought out proposal. Hopefully this can be a good example for others going forward.
Overall we have general support for a utility package with a scope like rcpputils in parallel with rcutils here. There are several utility functions in rclcpp that could be refactored out into this package too in the future.
There was some discussion about whether there might be another more generic package scope. But there wasn’t any strong alternative proposals so lets go ahead with this as proposed.
Moving into an organization is a bit of a challenge. The most common thing to do is to transfer it to one of the individuals who has access and then they transfer it into the organization. Luckily github is relatively good about setting up redirects so most people won’t notice the indirect path.
If you can transfer it to me, I can then transfer it to the organization.
Keep TSA optional and write a colcon mixin to enable this behavior.
Inject the statements you quoted everywhere using the *-extras.cmake mechanism.
My feeling is that, keeping this opt-in per-package will be challenging because mixing the clang and GNU standard libraries is probably unsafe. So it would need to be a global on/off flag. 2. is most likely too brutal of a change so 1. could be the best option.
I’ve hit a wall with enabling this - given the following constraint
we do not want to make our own Mutex wrapper, in favor of using std::mutex
High level problem -
To get the capability annotations on std::mutex, I need to use libc++ (LLVM)
On Linux, by default, all libraries use libstd++ (GNU)
The build, using libc++ on an annotated package, can succeed just fine, BUT
Any std objects passed across .so boundaries into this library now have runtime undefined behavior, because you call libc++ functions on libstdc++-created objects
Specific example case that I ran into -
I add thread safety annotation to rmw_fastrtps_shared_cpp
link it to libc++ to have std::mutex be analyzable
upstream, an rclcpp test creates a Node, which eventually resolves to fastrtps_shared_cpp::rmw_node.cpp::__rmw_create_node, which:
creates a ParticipantAttributes from fastrtps (which has not been modified, so it uses libstdc++std::vector creation)
calls vector::resize on the member vector (which jumps into the libc++ implementation)
Should the analysis always be run for best results, or can we limit it to a CI run? If the latter, then perhaps we can set up a separate CI run that builds everything from source using libc++ and run that once a day or something, rather than making it part of every CI build.
That sounds like the commonly used approach for this.
In general the same applies when mixing libraries built with different C++ compilers, e.g. g++ vs. clang. While you might be lucky this is not something guaranteed.
I think that we gain value by running at all. Though it would be nice to run for every build, it seems prohibitively expensive given the current situation - we’re only running a Clang build nightly today as it is.
Based on @dirk-thomas input, my thinking now is to get all code building from source in the nightly Clang build and force that build to use libc++
Add and merge PRs adding the TSA colcon mixin, the rcpputils code for the macros and the class annotation. We can do that now and this is a no-op if the mixing is not being used.
Try to find a way for the whole build + test run to succeed:
Only target Mac OS X for now (if clang stdc++ is the default one on OS X? otherwise I guess brew can pass custom flags?)
Try to build everything from source on Linux and check that it works. If yes, we can figure a way to build everything from source in the CI.