Avoiding priority inversions (includes draft pull requests)

The issue
When using ROS2 in combination with FIFO/realtime scheduling, as defined by the POSIX standard (see sched(7) - Linux manual page), the std::mutex and std::recursive_mutex instances used internally by ROS2 might cause a deadlock due to a priority inversion (see Priority inversion & its solutions – Tech Access Info). This situation is inacceptable for applications with high demands on reliability and determinism. The most prominent case of a priority inversion occured in the Pathfinder spacecraft on Mars (see What really happened to the software on the Mars Pathfinder spacecraft? | Rapita Systems ).
Ideally, ROS2 would not use mutex lock calls, i.e. avoiding them by design through clever use of atomics, lockfree queues or by restricting mutex usage to try_lock calls, but that would presumably require a substantial redesign effort. Fortunately, the situation can be mitigated by using the priority inheritance capability for mutexes. This change would still not create an ideal situation for realtime applications since in general it’s not possible to predict how long a mutex lock call will block a thread, but priority inheritance will be good enough for many situations and at least prevent deadlocks.

Solution proposal
Unfortunately, std::mutex and std::recursive_mutex do not support priority inheritance, but it’s possible to derive from these classes and recreate the internal mutex inside the respective constructor with the PTHREAD_PRIO_INHERIT flag. To demonstrate this I created a pull request which contains two new mutex classes that use priority inheritance: Avoid priority inversions when using FIFO scheduling by WideAwakeTN · Pull Request #174 · ros2/rcpputils · GitHub (see rcpputils::PIMutex and rcpputils::RecursivePIMutex)
As demonstrated for rclcpp in Avoid priority inversions in rclcpp by WideAwakeTN · Pull Request #2078 · ros2/rclcpp · GitHub it’s easy and straightforward to replace std::mutex and std::recursive_mutex with the respective new mutex class. In case the proposed concept is accepted it would be easy to make the same substitution in other ROS2 C++ components which are expected to be realtime capable.

Implementation details

  • Not every OS supports POSIX or priority inheritance. Moreover, priority inheritance is only really important for systems with realtime/FIFO scheduling in order to avoid deadlocks, fair scheduling will not cause total deadlocks. For systems which do not support POSIX or priority inheritance the new mutex classes will just be an alias for their respective C++ std counterpart.
  • Deriving from std::mutex and std::recursive_mutex has the advantage that the source code that gets executed for lock, try_lock and unlock calls remains unchanged. This results in maximum compatibility. Thus the runtime behaviour of existing ROS2 applications should remain largely unaffected by the proposed changes, but the deadlock risk gets avoided.
  • Should the C++ standard offer mutex priority inheritance one day it would be easy to get back to the C++ standard by simply modifying rcpputils::PIMutex and rcpputils::RecursivePIMutex. The source code of all other ROS2 components would remain unchanged.
  • Using std::mutex, std::recursive_mutex, rcpputils::PIMutex and rcpputils::RecursivePIMutex inside the same application is possible and each of them will behave as expected/specified.

Discussion points

  • Using PTHREAD_PRIO_INHERIT seems to make more sense than using PTHREAD_PRIO_PROTECT (see c - What is the difference between PTHREAD_PRIO_INHERIT and PTHREAD_PRIO_PROTECT? - Stack Overflow).
  • Why does std C++ not offer priority inheritance? I guess the C++ committee omitted priority inheritance since not all systems support that feature and they didn’t want platform specific features. I don’t know if there are any plans to change that situation.
  • Be aware that mutexes which were created by 3rd party code/libraries, including the OS itself, will likely not have priority inheritance enabled. If you need reliable realtime in your application you should simply not run code which has unknown or unpredictable runtime behaviour. In particular, calls to the OS heap or to a logger could bump into a mutex without priority inheritance.
  • The changes proposed here to me are a small but sensible step towards improving the realtime behavior of ROS2 to reach industrial grade reliability, determinism and performance without larger code base changes.
  • Should std::PIMutex and std::RecursivePIMutex be named differently?

Side discussion: thread configuration
I wrote a unit test which provokes a thread priority inversion (see rcpputils/test_mutex.cpp at d455916c60c0d018c3a4d664c51fa1252d0300fa · ros2/rcpputils · GitHub ). With the new mutex classes the potential deadlock is successfully avoided. But in order to implement that testcase I need realtime priorities/FIFO scheduling. I impemented that functionality in rcutils, see Added realtime thread configuration support by WideAwakeTN · Pull Request #406 · ros2/rcutils · GitHub , but there are at least two major ways to add thread configuration functionality to ROS2:

  • 1: Put the basic thread configuration functionality into rcl or rcutils. This will be OS dependent code.
  • 2: Add a thread creation factory interface to rcpputils or rclcpp which can be passed to ROS2 entities, like the executors.

I am not sure yet which approach is better. Solution 2 would keep OS specific code out of the ROS2 code base. In any case it should be possible to configure the thread priority, cpu core affinity and scheduling type. Thoughts, suggestions, opinions? Is there already a discussion on that matter somewhere? Until a consensus has been reached on the thread configuration question I could revoke my rcutils draft pull request 406 and confine my thread configuration code to the source code of my unit test.

4 Likes

Very interesting. A few questions:

  • In what situation are you encountering multiple threads of an rclcpp enabled process having different priorities and contending over the same mutexes? Are you using components, or are you running into issues while running a single node, but with perhaps a MultiThreadedExecutor?
  • Which Real-Time OS’s specifically are you targeting?
1 Like

In what situation are you encountering multiple threads of an rclcpp enabled process having different priorities and contending over the same mutexes? Are you using components, or are you running into issues while running a single node, but with perhaps a MultiThreadedExecutor?

According to my experience realtime components usually have a critical path/code which is realtime, but also other code (usually RPCs) which are not realtime (and might take longer to process). There is a nice example which shows how to implement such an approach in ROS2:

Moreover, I suspect there are several ROS2 components which should use priority inversion safe mutexes, like the DDS middleware implementations.

I did not use realtime thread priorities yet with ROS2, but I will at some point and instead of just patching my local ROS2 system I thought it would be better to propose pull requests. I know that the probability for priority inversions is usually quite low, especially on multicore cpu machines, but in realtime applications some threads might be assigned to a specific core. Actually, I have seen thread priority inversions happening in a previous project. Therefore using priority inheritance enabled mutexes by default is a kind of best practice to me.

Which Real-Time OS’s specifically are you targeting?

The proposed changes should work on Linux with RT patch, VxWorks and QNX. Personally, I don’t have access to VxWorks or QNX systems. Hopefully someone else volunteers to get the code compiling there (which shouldn’t be a problem).
Note: on RT Linux elevated user rights are required for switching a thread to realtime/FIFO scheduling.

1 Like

Thanks for the context. When you say RPCs, you mean you have some sort of listener thread that is running on a system apart from ROS (ex: gRPC) but also have a thread doing ROS stuff in the same process?

Also, if you haven’t already, I would suggest looking up the meeting schedule for the real-time working group and bringing this up there.

With RPC I mean services and actions. RPC is just the classical umbrella term to me.

The processing of services and actions (through callbacks) is usually not part of the main/core cyclic dataflow (aka critical path) and their processing could disturb the processing of the cyclic dataflow to the point where deadlines are violated. For simple nodes this usually isn’t an issue, though. The multithreaded executor is one way of solving the disturbance problem, but the more fine grained solution is to use two or more executors, via the callback groups, and configure their threads individually (which is not possible yet through built in functionality and requires some “hacking”). The callback group approach is demonstrated in the cdb_executor example that I mentioned.

PS: I added the real-time working group meetings to my personal calendar. Next date seems to be January 30th.

Hi @WideAwakeTN ,

without having done a deep dive review of your PR, I very much appreciate it, because this issue exists and is real. You are right, that the chances are low to see it, but be sure it will happen if you have applications / robots that run for months without reboot.

In the past, I avoided ROS code in realtime loops or patched PI into it myself.

1 Like

After some helpful discussions I decided to close my rcutils draft pull request 406 and changed my rcpputils draft pull request 174 accordingly. Since the new thread configuration functionality in PR 406 was only required for the units tests in PR 174 that was no big deal. I simply moved the new functionality to the test implementation. I think this is ok until ROS2 offers some built in functionality for configuring cpu priority, cpu core affinity and scheduler type for threads.

1 Like

Another open question: should the code of my priority inheritance enabled mutex classes be moved from ros2/rcpputils to ros2/realtime_support ? Which of these two repos/packages is more suitable?
(Draft PR link: ros2/rcpputils#174 )