When using ROS2 in combination with FIFO/realtime scheduling, as defined by the POSIX standard (see sched(7) - Linux manual page), the std::mutex and std::recursive_mutex instances used internally by ROS2 might cause a deadlock due to a priority inversion (see Priority inversion & its solutions – Tech Access Info). This situation is unacceptable for applications with high demands on reliability and determinism. The most prominent case of a priority inversion occurred in the Pathfinder spacecraft on Mars (see What really happened to the software on the Mars Pathfinder spacecraft? | Rapita Systems).
Ideally, ROS2 would avoid blocking mutex lock calls by design, through clever use of atomics and lock-free queues or by restricting mutex usage to try_lock calls, but that would presumably require a substantial redesign effort. Fortunately, the situation can be mitigated by using the priority inheritance capability for mutexes. This change would still not create an ideal situation for realtime applications, since in general it’s not possible to predict how long a mutex lock call will block a thread, but priority inheritance will be good enough for many situations and at least prevent deadlocks.
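To illustrate what "restricting mutex usage to try_lock calls" could look like, here is a minimal sketch. The names SharedSample and try_read_sample are made up for this example and do not exist in ROS2. The idea is that a realtime thread never blocks on the mutex and simply skips the update when the lock is taken.

```cpp
#include <mutex>

// Hypothetical shared state, guarded by a plain std::mutex.
struct SharedSample
{
  std::mutex mutex;
  int value = 0;
};

// Non-blocking read for a realtime thread: returns false instead of
// waiting when another thread currently holds the lock.
// Note: the C++ standard allows try_lock to fail spuriously, so callers
// must treat "false" as "try again in the next cycle".
bool try_read_sample(SharedSample & shared, int & out)
{
  std::unique_lock<std::mutex> lock(shared.mutex, std::try_to_lock);
  if (!lock.owns_lock()) {
    return false;
  }
  out = shared.value;
  return true;
}
```

The caller has to cope with a missed update, which is exactly why applying this pattern everywhere would mean a redesign rather than a drop-in change.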
Unfortunately, std::mutex and std::recursive_mutex do not support priority inheritance, but it’s possible to derive from these classes and recreate the internal mutex inside the respective constructor with the PTHREAD_PRIO_INHERIT flag. To demonstrate this, I created a pull request which contains two new mutex classes that use priority inheritance: Avoid priority inversions when using FIFO scheduling by WideAwakeTN · Pull Request #174 · ros2/rcpputils · GitHub (see rcpputils::PIMutex and rcpputils::RecursivePIMutex).
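For readers who don't want to open the pull request, the approach can be sketched roughly as follows. This is not the code from the pull request; PIMutexSketch is a made-up name, and the sketch assumes that native_handle() returns a pthread_mutex_t*, which holds for libstdc++ and libc++ on POSIX systems but is not guaranteed by the C++ standard.

```cpp
#include <mutex>
#include <pthread.h>
#include <system_error>

// Hypothetical class name, for illustration only.
class PIMutexSketch : public std::mutex
{
public:
  PIMutexSketch()
  {
    // Assumption: the native handle is a pthread_mutex_t*.
    pthread_mutex_t * handle = native_handle();

    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) {
      throw std::system_error(rc, std::generic_category(), "pthread_mutexattr_init");
    }

    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0) {
      // Replace the default mutex created by the std::mutex constructor
      // with one that uses the priority inheritance protocol.
      pthread_mutex_destroy(handle);
      rc = pthread_mutex_init(handle, &attr);
    }
    pthread_mutexattr_destroy(&attr);

    if (rc != 0) {
      throw std::system_error(rc, std::generic_category(), "priority inheritance setup");
    }
  }
};
```

Since the class derives from std::mutex, existing code such as std::lock_guard<std::mutex> keeps working with it, and lock, try_lock and unlock run the unmodified standard library code.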
As demonstrated for rclcpp in Avoid priority inversions in rclcpp by WideAwakeTN · Pull Request #2078 · ros2/rclcpp · GitHub it’s easy and straightforward to replace std::mutex and std::recursive_mutex with the respective new mutex class. In case the proposed concept is accepted it would be easy to make the same substitution in other ROS2 C++ components which are expected to be realtime capable.
- Not every OS supports POSIX or priority inheritance. Moreover, priority inheritance is only really important for systems with realtime/FIFO scheduling in order to avoid deadlocks; fair scheduling will not cause total deadlocks. For systems which do not support POSIX or priority inheritance, the new mutex classes will just be an alias for their respective C++ std counterpart.
- Deriving from std::mutex and std::recursive_mutex has the advantage that the source code that gets executed for lock, try_lock and unlock calls remains unchanged. This results in maximum compatibility. Thus the runtime behaviour of existing ROS2 applications should remain largely unaffected by the proposed changes, but the deadlock risk gets avoided.
- Should the C++ standard offer mutex priority inheritance one day it would be easy to get back to the C++ standard by simply modifying rcpputils::PIMutex and rcpputils::RecursivePIMutex. The source code of all other ROS2 components would remain unchanged.
- Using std::mutex, std::recursive_mutex, rcpputils::PIMutex and rcpputils::RecursivePIMutex inside the same application is possible and each of them will behave as expected/specified.
- Using PTHREAD_PRIO_INHERIT seems to make more sense than using PTHREAD_PRIO_PROTECT (see c - What is the difference between PTHREAD_PRIO_INHERIT and PTHREAD_PRIO_PROTECT? - Stack Overflow).
- Why does std C++ not offer priority inheritance? I guess the C++ committee omitted priority inheritance since not all systems support that feature and they didn’t want platform specific features. I don’t know if there are any plans to change that situation.
- Be aware that mutexes which were created by 3rd party code/libraries, including the OS itself, will likely not have priority inheritance enabled. If you need reliable realtime in your application you should simply not run code which has unknown or unpredictable runtime behaviour. In particular, calls to the OS heap or to a logger could bump into a mutex without priority inheritance.
- To me, the changes proposed here are a small but sensible step towards improving the realtime behavior of ROS2 and reaching industrial grade reliability, determinism and performance without larger code base changes.
- Should rcpputils::PIMutex and rcpputils::RecursivePIMutex be named differently?
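Regarding the PTHREAD_PRIO_INHERIT vs. PTHREAD_PRIO_PROTECT point above, the practical difference shows up when setting up the mutex attributes. The helper names below are made up for this sketch. With PTHREAD_PRIO_PROTECT a priority ceiling has to be chosen up front for every mutex, which a generic library like ROS2 can hardly know, while PTHREAD_PRIO_INHERIT needs no such parameter.

```cpp
#include <pthread.h>

// Hypothetical helper: prepares attributes for a priority inheritance mutex.
// Returns 0 on success, otherwise a POSIX error code.
// The caller is responsible for calling pthread_mutexattr_destroy().
int make_inherit_attr(pthread_mutexattr_t * attr)
{
  int rc = pthread_mutexattr_init(attr);
  if (rc != 0) {
    return rc;
  }
  return pthread_mutexattr_setprotocol(attr, PTHREAD_PRIO_INHERIT);
}

// Hypothetical helper: prepares attributes for a priority ceiling mutex.
// The ceiling must be at least the priority of the highest priority
// thread that will ever lock the mutex, so it has to be known in advance.
int make_protect_attr(pthread_mutexattr_t * attr, int ceiling)
{
  int rc = pthread_mutexattr_init(attr);
  if (rc != 0) {
    return rc;
  }
  rc = pthread_mutexattr_setprotocol(attr, PTHREAD_PRIO_PROTECT);
  if (rc != 0) {
    return rc;
  }
  return pthread_mutexattr_setprioceiling(attr, ceiling);
}
```

Whether PTHREAD_PRIO_PROTECT is available at all depends on the platform and C library, so treat the second helper as an illustration of the extra parameter rather than as portable code.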
Side discussion: thread configuration
I wrote a unit test which provokes a thread priority inversion (see rcpputils/test_mutex.cpp at d455916c60c0d018c3a4d664c51fa1252d0300fa · ros2/rcpputils · GitHub). With the new mutex classes the potential deadlock is successfully avoided. But in order to implement that test case I need realtime priorities/FIFO scheduling. I implemented that functionality in rcutils, see Added realtime thread configuration support by WideAwakeTN · Pull Request #406 · ros2/rcutils · GitHub, but there are at least two major ways to add thread configuration functionality to ROS2:
- 1: Put the basic thread configuration functionality into rcl or rcutils. This will be OS dependent code.
- 2: Add a thread creation factory interface to rcpputils or rclcpp which can be passed to ROS2 entities, like the executors.
I am not sure yet which approach is better. Solution 2 would keep OS specific code out of the ROS2 code base. In any case it should be possible to configure the thread priority, CPU core affinity and scheduling type. Thoughts, suggestions, opinions? Is there already a discussion on that matter somewhere? Until a consensus has been reached on the thread configuration question, I could withdraw my rcutils draft pull request 406 and confine my thread configuration code to the source code of my unit test.
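To make the discussion a bit more concrete, this is roughly the kind of OS dependent code that option 1 would put into rcl or rcutils. It is not the code from my pull request; ThreadConfig and configure_current_thread are made-up names, and the sketch is Linux/glibc specific because pthread_setaffinity_np is a non-portable extension.

```cpp
#include <pthread.h>
#include <sched.h>

// Hypothetical configuration struct covering the three settings named above.
struct ThreadConfig
{
  int policy = SCHED_OTHER;  // e.g. SCHED_FIFO for realtime scheduling
  int priority = 0;          // must be 0 for SCHED_OTHER
  int cpu_core = -1;         // -1 means: leave the affinity unchanged
};

// Applies the configuration to the calling thread.
// Returns 0 on success, otherwise a POSIX error code
// (typically EPERM if the process may not use realtime priorities).
int configure_current_thread(const ThreadConfig & cfg)
{
  sched_param param{};
  param.sched_priority = cfg.priority;
  int rc = pthread_setschedparam(pthread_self(), cfg.policy, &param);
  if (rc != 0) {
    return rc;
  }

  if (cfg.cpu_core >= 0) {
    cpu_set_t cpu_set;
    CPU_ZERO(&cpu_set);
    CPU_SET(cfg.cpu_core, &cpu_set);
    rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set), &cpu_set);
  }
  return rc;
}
```

With option 2, a function like this would live in user code behind a thread factory interface, and ROS2 entities such as the executors would only see the interface.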