The general question: does anyone know of any reasonably up-to-date documentation or overview of concurrency, race conditions, and threading in ROS1 Python nodes? I have pursued this as far as I could without satisfaction, so I thought I'd try here.
I have put together a series of beliefs based on questions, answers, and old Google results about how to think about this. Forgive my need to understand in more detail than I probably need to do my job; it's just how I am.
Beliefs (all relate to rospy nodes). Some may be obvious and some may be wrong. I am not necessarily looking for you to answer them; I am more hoping you can help me find whether anyone has written this down.
One rospy node always runs as a separate process
Callbacks are invoked in one thread per subscribed topic. If somehow you have two callbacks for the same topic, it is still one thread (although I don't understand how that would happen).
A rospy.sleep() does what a regular Python sleep does (with some details that differ in sim time). A rospy.spin() is an infinite loop calling rospy.sleep() every so many ms.
If processing of a callback is slow enough that the next publish happens on that topic, that message will be queued up (which is why the Publisher and Subscriber constructors take a queue length). I am not sure what happens when the queue is full, though.
You can create a race condition by using a Python global variable and reading and writing it from two callbacks on two different topics. (However, I have not been able to make this happen.)
I recognize this won’t directly answer your question, but I can explain how some of this is handled in ROS 2, and maybe that will help you read through the rospy codebase to see how it handles things. I can’t speak for rospy, since I don’t use ROS 1, but I can speak to how concurrency can generally be handled without parallelism.
ROS 2’s rclpy and rclcpp both manage subscriptions, timers, and other waitables with executors. An executor, or event loop, is a way to manage asynchronous or concurrent programming while decoupling it from parallelism.
Node.js is famous for being single-threaded, and yet can handle many simultaneous requests. Why? Because it puts things in an event loop, and then can “tick” or “spin” each one at a time.
Golang has so-called “green” threads which don’t directly correspond to operating system threads. This is known as M:N threading or hybrid threading.
Similarly, rclpy and rclcpp have executors that loop through and “tick” their waitables, including subscriptions. In practice, the subscriptions call functions that poll the RMW layer for data and execute callbacks when data is received. This is how concurrency (doing many things at once) can be achieved without parallelism (using multiple threads on multiple CPU cores at the same time).
rclpy and rclcpp use single-threaded executors by default, which means that even if you have many different callbacks or timers, you are only ever using one thread. So if multiple callbacks access the same variables, you won’t get the race conditions you theorize in point 5! However, that assumption breaks as soon as you run your node with a multithreaded executor, in which case you need either mutually exclusive callback groups or mutexes to make your data accesses thread-safe.
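To make the executor idea concrete, here is a toy, entirely hypothetical single-threaded event loop in plain Python (this is not the real rclpy API): each “subscription” is just a message queue plus a callback, and spin_once polls every queue and runs callbacks one at a time, all in the calling thread:

```python
from collections import deque

class ToyExecutor:
    """Hypothetical single-threaded executor: concurrency without parallelism."""

    def __init__(self):
        self._subs = []  # list of (message queue, callback) pairs

    def add_subscription(self, msg_queue, callback):
        self._subs.append((msg_queue, callback))

    def spin_once(self):
        # Poll each waitable; run at most one callback per subscription.
        for msg_queue, callback in self._subs:
            if msg_queue:
                callback(msg_queue.popleft())

# Usage: two "topics" serviced by one thread, so callbacks never overlap.
executor = ToyExecutor()
odom_q, clock_q = deque(), deque()
log = []
executor.add_subscription(odom_q, lambda msg: log.append(("odom", msg)))
executor.add_subscription(clock_q, lambda msg: log.append(("clock", msg)))

odom_q.append(1)
clock_q.append(2)
executor.spin_once()
print(log)  # [('odom', 1), ('clock', 2)]
```

Because both callbacks run in the one thread that calls spin_once, they can share state freely without locks, which is the property the next paragraph relies on.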
I don’t know of anything specific for ROS 1, but handling concurrency and race conditions in a ros node is the same as in any other program. At one time I think that there was documentation for rospy, but I can’t find it now (the API documentation link on the wiki doesn’t go anywhere).
To confirm your beliefs:
Yes. rospy won’t start subprocesses, and I can’t think of a case where a single node would want to run multi-process. The closest exception is nodelets, which go the other direction by running multiple nodes as threads inside a single process, but that’s mostly (or entirely) a C++ thing.
Yes. There is one thread per topic to handle incoming connections. When the topic receives data, it sends the new message to each subscriber in a loop (check out receive_callback in topics.py).
Yes, with some machinery to stop sleeping if ROS is stopped.
Yes; processing has to be faster than the incoming topic data, otherwise the queue will fill up. You can test this behavior by putting a 2-second sleep in a topic handler while publishing to it every second. I didn’t test this, but I assume a queue.Full exception would be thrown somewhere.
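For what it’s worth, a full queue doesn’t have to raise at all: a bounded buffer can also silently discard the oldest entry, which is my (unverified) understanding of what rospy’s queue_size effectively does — worth checking against topics.py. Here is a plain-Python illustration of both possible behaviors:

```python
import queue
from collections import deque

# Behavior 1: a bounded deque silently discards the oldest entry
# (my understanding of how a queue_size-style buffer typically behaves).
d = deque(maxlen=2)
for msg in (1, 2, 3):
    d.append(msg)
print(list(d))  # [2, 3] -- message 1 was dropped, no exception raised

# Behavior 2: queue.Queue raises queue.Full on a non-blocking put,
# which is the exception guessed at above.
q = queue.Queue(maxsize=2)
q.put_nowait(1)
q.put_nowait(2)
try:
    q.put_nowait(3)
except queue.Full:
    print("queue.Full raised")
```

Either way, the observable symptom of a slow callback is the same: old messages stop reaching your handler.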
If you have handlers for different topics (in different threads) mutating the same global variable, you will have a race condition. It might not always show up, but it’s there.
I was able to confirm this with a small test program that listened on two topics. I then published to the topics simultaneously.
The correct solution to accessing data from multiple threads is to either write your program so this can’t happen, or to lock around access to shared data (with a threading.Lock for example).
Locking in Python isn’t a huge deal (compared to Java or C++), but I still try to avoid it. I tend to use topic callbacks only to push data onto a queue. Then I set up a separate worker thread that pops data off that queue and does whatever I need to do. That both avoids the race-condition problem and keeps the topic threads free to do their business.
Other than that, I agree with @seth on all points.
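That queue-plus-worker pattern can be sketched in plain Python (stand-in callbacks instead of real rospy subscriptions; None serves as a hypothetical shutdown sentinel):

```python
import queue
import threading

work_q = queue.Queue()  # thread-safe hand-off between callbacks and worker
results = []

def topic_callback(msg):
    # Callbacks do nothing but enqueue, so they return immediately
    # and never touch shared mutable state directly.
    work_q.put(msg)

def worker():
    # A single consumer owns `results`, so no locking is needed.
    while True:
        msg = work_q.get()
        if msg is None:           # shutdown sentinel
            break
        results.append(msg * 2)   # stand-in for the real processing

t = threading.Thread(target=worker)
t.start()
for i in range(5):                # simulate five incoming messages
    topic_callback(i)
work_q.put(None)
t.join()
print(results)  # [0, 2, 4, 6, 8]
```

queue.Queue is already thread-safe, so the callbacks and the worker need no explicit locks of their own.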
In Python, you just have to be aware that there are actually two queues - the normal message queue (configured via queue_size) and the receive byte buffer size (configured via buff_size). Both of them play an important role in the behavior when callbacks take too long.
Thanks for the very informative answers. I see a subtle difference between your explanations (or at least in my understanding of them).
Is there a single thread for all callbacks or one for each distinct topic?
Are these threads subject to a context switch due to a time slice or only when they yield control through a sleep?
With my attempt to create a race condition, I was using callbacks from Gazebo for /clock and /odom, and I could never see a race condition, no matter how I changed their timing with short sleeps in the callbacks (Gist of my test).
Since they use threading.Thread, they are preemptible and will context-switch even without a sleep.
You can definitely get race conditions, but they are a little harder to trigger in Python due to the GIL. I feel that your example should have revealed something, though, but I don’t have a ROS 1 install to test…