Hello.
Our team is investigating the real-time capabilities of ROS2.
After checking basic ROS2 functionality and the pendulum_demo, I noticed that ROS2 generates multiple threads in the background.
In particular, the scheduling priority of DDS-generated threads may affect timer and communication latency.
Therefore, we are investigating the effect of the priority of the child threads generated by ROS2 on timers and communication.
This post presents a basic investigation and the experimental results of the effect of thread priority on timers.
The results and source code for each measurement can be found on GitHub here.
Environment
The main environment is as follows:
- Hardware : Raspberry Pi 3B+
- OS : Ubuntu 18.04, 4.19.55-rt24-v7+
- ROS Distro : ROS 2 Eloquent Elusor
- DDS : FastRTPS or CycloneDDS
Tasks were assigned to the cores as follows.
- Core 0 : Kernel Thread
- Core 1~3: Measurement only (no CPU migration)
Building Environment shows detailed instructions.
To confirm real-time capability, I measured the nanosleep wake-up latency with cyclictest.
The wake-up latency falls within 10~60 us, so its jitter is small.
Basic investigation of ROS2 generated child threads
Number of threads
I investigated the number of generated threads in a variety of cases with different numbers of nodes and topics.
- Threads generated during initialization
- 1 thread in rclcpp::init
- Threads generated during the node declaration
- FastRTPS : 5 threads / node
- CycloneDDS : 7 threads
FastRTPS seems to generate threads on a per-node basis, while CycloneDDS seems to share 7 threads among all nodes.
FastRTPS calls createParticipant and generates threads when ROS2 initializes a node.
Scheduling policies
It turns out that child threads inherit the scheduling policy of the parent thread at the time of thread creation.
That is, the order in which the functions are called changes the policy of the child threads as follows:
- sched_setscheduler(RR) -> rclcpp::init -> node declaration : all threads RR
- rclcpp::init -> sched_setscheduler(RR) -> node declaration : only the rclcpp::init-generated thread is TS
- rclcpp::init -> node declaration -> sched_setscheduler(RR) : only the main thread is RR
Here, TS and RR mean the following.
RR: round-robin (SCHED_RR), a real-time scheduling policy
TS: time-sharing (CFS, SCHED_OTHER), a non-real-time scheduling policy
For example, pendulum_demo runs in the order init -> node -> sched, so only the main thread is RR and the other threads generated by ROS2 are TS.
A thread’s policy can be confirmed with the following command.
$ ps -em -o pid,tid,policy,pri,ni,rtprio,comm,psr --sort=+rtprio
In my opinion, the child threads need the same or higher priority than the thread performing the callbacks, because the child threads generated by DDS trigger the subscriber callbacks.
The communication processing of TS threads may not run while a higher-priority real-time process is running.
Experiment Summary
Pendulum_demo uses a custom RttExecutor to measure the wake-up latency of nanosleep + spin_some.
The policy of its child threads is TS.
The pendulum_demo keynote reports that a latency measurement under I/O and CPU stress observed large latencies three times in 7,000,000 loops.
I decomposed the pendulum_demo into “nanosleep accuracy” and “timer callback accuracy”.
Then, I performed the following experiments.
(1) The effect of the child threads' policy (TS/RR) on nanosleep wake-up latency.
(2) The effect of the executor (RttExecutor vs. SingleThreadedExecutor) on callback latency.
Each result is described below.
Comparison of nanosleep wake-up latency between child thread policies TS and RR
The effect of child thread priority on communication is currently under investigation.
As a first step, we examined the effect of the child threads' priority on timer wake-up latency.
The conditions are as follows:
- Child thread policy RR vs TS
- init only vs. init + node declaration
- FastRTPS vs. CycloneDDS
I imposed CPU stress and I/O stress on each core.
The result is shown below.
The upper figures show the results with only the rclcpp::init call, and the lower figures show the results with rclcpp::init plus node declarations.
The lower left figure shows that the latency falls within 10~60 us, reproducing the cyclictest result.
If the child threads' policy is TS and nanosleep + spin_some is used, as in pendulum_demo, the wake-up timing has very low jitter.
The lower right figure shows that the latency ranges from 10 to 500 us: the jitter of the wake-up latency gets worse if the child threads' policy is RR.
A likely reason is that the DDS-generated threads wake up periodically, and the scheduler switches from the main process to a DDS thread.
We captured packets with Wireshark and observed that FastRTPS periodically sends "RTPS: INFO_TS, DATA" packets.
FastRTPS also sends "RTPS: INFO_DST, HEARTBEAT" packets if the publisher and subscriber run in two separate processes.
These periodically sent RTPS control packets may affect timer accuracy.
Timer callback latency: nanosleep + spin_some vs. spin
ROS2 executes waiting callbacks sequentially after wake-up.
Therefore, the latency of callback execution is also important.
We therefore measured the latency of the timer callback under the following conditions:
- TS policy for child threads (RR is excluded because jitter obviously gets worse)
- FastRTPS vs. CycloneDDS
- nanosleep + spin_some vs. spin
I imposed CPU stress and I/O stress on each core.
The measured latencies are defined as follows:
- nanosleep + spin_some : the latency between the time nanosleep is expected to wake up and the time the callback is actually called.
- spin : the latency between the time the callback is expected to be called and the time it is actually called.
The result is shown below.
In the case of nanosleep + spin_some, the latency ranges from about 150 to 500 us.
In the case of spin, the latency is only 30~50 us, and the jitter is lower than or equal to that of cyclictest.
The high callback latency of nanosleep + spin_some may be because spin_some also performs waitset-related processing after wake-up.
To confirm the reliability of the callback latency with spin, we measured 8.64 million loops with a 10 ms cycle (24 hours).
The 24-hour results also showed low jitter, so the timer callback executes at accurate timing.
Conclusion
A basic investigation of ROS2-generated child threads and the experimental results for the timer callback are as follows.
- ROS2 generates child threads during the rclcpp::init call and node declarations.
- Child threads inherit the scheduling policy of the parent thread at the time they are created.
- Timer jitter gets worse if the child threads' policy is RR, the same as the main thread.
- The wake-up latency of the nanosleep timer has jitter as low as the cyclictest result if the child threads' policy is TS.
- When the child threads' policy is TS, the timer callback execution latency with spin is as low in jitter as, or better than, the cyclictest results.
For timer accuracy alone, it is better to set the child threads' policy to TS.
Alternatively, they could be given a real-time policy (FIFO/RR) with a priority lower than the main thread.
However, if the child threads have low priority, higher-priority processes may affect communication.
We will report on the effect of higher-priority processes on communication in the next post.
Depending on those results, ROS2 may also need to consider child-thread priorities to meet real-time requirements.
Since deterministic processing of each task is important for real-time performance, I verified the timer latency.
I’d like to continue to contribute to ROS2.
Questions, suggestions and advice are welcome.
Thank you.