ROS2-generated child thread scheduling policy affects timers

Hello.
Our team is investigating and studying ROS2 real-time capabilities.

After checking basic ROS2 functionality and the pendulum_demo, I’ve noticed that ROS2 generates multiple threads in the background.

In particular, the scheduling priority of DDS-generated threads may affect timer and communication latency.
Therefore, we are investigating the effect of the priority of the child threads generated by ROS2 on timers and communication.

This post presents a basic investigation and experimental results on the effect of thread priority on timers.

The results and source code for each measurement can be found on GitHub here.

Environment

The main environment is as follows:

  • Hardware : Raspberry Pi 3B+
  • OS : Ubuntu 18.04, kernel 4.19.55-rt24-v7+
  • ROS Distro : ROS 2 Eloquent Elusor
  • DDS : FastRTPS or CycloneDDS

All tasks were pinned to cores as follows.

  • Core 0 : kernel threads
  • Cores 1~3 : measurement tasks only (no CPU migration)

See Building Environment for detailed instructions.

To confirm the real-time capability, I measured the nanosleep wake-up latency with cyclictest.
The nanosleep wake-up latency falls within the range of 10~60 us, so the jitter is small.
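
For reference, a typical cyclictest invocation for this kind of per-core measurement looks like the following (these flags are an assumption on my part; the exact options used are in the linked repository):

$ sudo cyclictest -m -n -p 98 -i 1000 -l 100000 -t 1 -a 1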

[Figure: cyclictest wake-up latency, pure kernel vs. RT kernel]

Basic investigation of ROS2-generated child threads

Number of threads

I investigated the number of generated threads in a variety of cases with different numbers of nodes and topics.

  • Threads generated during initialization
    • 1 thread in rclcpp::init
  • Threads generated during node declaration
    • FastRTPS : 5 threads per node
    • CycloneDDS : 7 threads in total

FastRTPS seems to generate threads on a per-node basis, while CycloneDDS seems to share its 7 threads among all nodes.
FastRTPS calls createParticipant and generates its threads when ROS2 initializes a node.

Scheduling policies

It turns out that child threads inherit the scheduling policy of the parent thread at the time of thread creation.
That is, the order in which the functions are called changes the policy of the child threads as follows:

  • sched_setscheduler(RR) -> rclcpp::init -> node declaration : all threads RR
  • rclcpp::init -> sched_setscheduler(RR) -> node declaration : only the thread generated by rclcpp::init is TS
  • rclcpp::init -> node declaration -> sched_setscheduler(RR) : only the main thread is RR

Here, TS and RR mean the following.
RR: Round-Robin (SCHED_RR), a real-time scheduling policy
TS: Time-Sharing (CFS, SCHED_OTHER), a non-real-time scheduling policy

For example, the pendulum_demo runs in the order init -> node -> sched_setscheduler, so only the main thread is RR and the other threads generated by ROS2 are TS.
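
As a minimal sketch of the first ordering above (the RR priority of 90 is illustrative, and error handling is kept minimal), calling sched_setscheduler before rclcpp::init makes all child threads inherit RR:

#include <sched.h>
#include <cstdio>
#include <rclcpp/rclcpp.hpp>

int main(int argc, char **argv)
{
  struct sched_param param {};
  param.sched_priority = 90;  // illustrative RR priority

  // All threads created after this point inherit SCHED_RR.
  if (sched_setscheduler(0, SCHED_RR, &param) != 0) {
    std::perror("sched_setscheduler");
    return 1;
  }

  rclcpp::init(argc, argv);                            // init thread -> RR
  auto node = std::make_shared<rclcpp::Node>("demo");  // DDS threads -> RR
  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}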

A thread’s policy can be confirmed with the following command.
$ ps -em -o pid,tid,policy,pri,ni,rtprio,comm,psr --sort=+rtprio

In my opinion, the child threads need to have the same or a higher priority than the thread performing the callbacks.
This is because the child threads generated by DDS are the triggers for the subscriber callbacks.
The communication processing of TS threads may not be performed while a higher-priority real-time process is running.

Experiment Summary

Pendulum_demo uses a custom RttExecutor to measure the wake-up latency of a nanosleep + spin_some loop.
The policy of its child threads is TS.
The pendulum_demo keynote reports that, in a latency measurement under I/O and CPU stress, large latency was observed three times in 7,000,000 loops.
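
The measured loop is essentially the following pattern (a simplified sketch, not the actual pendulum_demo code; the function name and structure are illustrative):

#include <time.h>
#include <rclcpp/rclcpp.hpp>

// Simplified nanosleep + spin_some loop: sleep until an absolute
// deadline, then let spin_some process any callbacks that became ready.
void run_loop(const rclcpp::Node::SharedPtr & node, long period_ns)
{
  struct timespec next;
  clock_gettime(CLOCK_MONOTONIC, &next);
  while (rclcpp::ok()) {
    next.tv_nsec += period_ns;
    while (next.tv_nsec >= 1000000000L) {  // normalize the timespec
      next.tv_sec += 1;
      next.tv_nsec -= 1000000000L;
    }
    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, nullptr);
    // Wake-up latency = current time - `next` (measurement code omitted).
    rclcpp::spin_some(node);
  }
}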

I decomposed the pendulum_demo into “nanosleep accuracy” and “timer callback accuracy”.
Then, I performed the following experiments.

(1) The effect of the child threads’ policy (TS vs. RR) on the nanosleep wake-up latency.
(2) The effect of the executor (RttExecutor vs. SingleThreadedExecutor) on the callback latency.

Each result is described below.

Comparison of nanosleep wake-up latency between child thread policies TS and RR

The effect of child thread priority on communication is still under investigation.
As a first step, we experimented with the effect of the child threads’ policy on the timer wake-up latency.

The conditions are as follows:

  • Child thread policy RR vs TS
  • init only vs. init + node declaration
  • FastRTPS vs. CycloneDDS

I imposed CPU stress and I/O stress on each core.

The result is shown below.

The upper figures show the results with only the rclcpp::init call, and the lower figures show the results with the rclcpp::init call plus node declarations.

The lower left figure shows that the latency stays in the range of 10~60 us, reproducing the cyclictest result.
If the child threads’ policy is TS and nanosleep + spin_some is used, as in pendulum_demo, the jitter of spin_some execution is very low.

The lower right figure shows that the latency is in the range of 10~500 us; the jitter of the wake-up latency got worse when the child threads’ policy was RR.

A reason for this is that the DDS-generated threads wake up periodically, and the scheduler switches from the main process to a DDS thread.
We captured packets with Wireshark and observed that FastRTPS periodically sends “RTPS: INFO_TS, DATA(p)” packets.
FastRTPS also sends “RTPS: INFO_DST, HEARTBEAT” packets if the publisher and subscriber run in two separate processes.
These periodically sent RTPS control packets may affect timer accuracy.

Timer callback latency: nanosleep + spin_some vs. spin

ROS2 executes waiting callbacks sequentially after wake-up.
Therefore, the latency of callback execution is also important.

Then, we measured the latency of the timer callback under the following conditions.

  • TS policy for child threads (RR is excluded because the jitter obviously gets worse)
  • FastRTPS vs. CycloneDDS
  • nanosleep + spin_some vs. spin

I imposed CPU stress and I/O stress on each core.

The measured times are defined as follows (a sketch of the spin case follows the list).

  • nanosleep + spin_some : the latency between the time nanosleep is expected to wake up and the time the callback is actually called.
  • spin : the latency between the time the callback is expected to be called and the time it is actually called.
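
A minimal sketch of the spin-based measurement (the 10 ms period matches the 24-hour run described later; the node name and the steady-clock bookkeeping are my own illustration):

#include <chrono>
#include <cstdio>
#include <rclcpp/rclcpp.hpp>

int main(int argc, char **argv)
{
  using namespace std::chrono;
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("timer_latency");

  const auto period = 10ms;
  auto expected = steady_clock::now() + period;

  // The callback records how far its actual call time drifted from the
  // expected call time.
  auto timer = node->create_wall_timer(period, [&expected, period]() {
      const auto actual = steady_clock::now();
      const auto latency = duration_cast<nanoseconds>(actual - expected).count();
      std::printf("callback latency: %lld ns\n", static_cast<long long>(latency));
      expected += period;  // next expected call time
    });

  rclcpp::spin(node);  // the executor fires the timer callbacks
  rclcpp::shutdown();
  return 0;
}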

The result is shown below.

The results for nanosleep + spin_some show that the latency range is around 150~500 us.
In the case of spin, the latency range is only 30~50 us, and the jitter is as low as or lower than that of cyclictest.

The reason for the higher callback latency of nanosleep + spin_some may be that spin_some also performs waitset-related processing after wake-up.

To confirm the reliability of the callback latency with spin, we measured 8.64 million loops with a 10 ms cycle (24 hours).

[Figure: create_wall_timer callback latency histogram, 24 h]

The 24-hour measurement also showed low jitter, so the timing of timer callback execution is accurate.

Conclusion

The basic investigation of the ROS2-generated child threads and the experimental results for the timer callback are summarized as follows.

  • ROS2 generates child threads during the rclcpp::init call and node declarations.
  • Child threads inherit the scheduling policy of the parent thread at the time they are created.
  • Timer jitter gets worse if the child threads’ policy is RR with the same priority as the main thread.
  • If the child threads’ policy is TS, the wake-up latency of the nanosleep timer shows jitter as low as the cyclictest result.
  • If the child threads’ policy is TS, the timer callback execution latency with spin shows jitter as low as or lower than the cyclictest result.

For timer accuracy alone, it is more accurate to set the child thread policy to TS.
Alternatively, the child threads could be given a real-time policy (FIFO/RR) with a priority lower than the main thread.
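
A sketch of that alternative (the priorities 50 and 90 are illustrative; on Linux, sched_setscheduler with pid 0 affects only the calling thread):

#include <sched.h>
#include <rclcpp/rclcpp.hpp>

int main(int argc, char **argv)
{
  struct sched_param low {};
  struct sched_param high {};
  low.sched_priority = 50;    // illustrative child-thread priority
  high.sched_priority = 90;   // illustrative main-thread priority

  sched_setscheduler(0, SCHED_RR, &low);   // children will inherit RR/50
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("demo");
  sched_setscheduler(0, SCHED_RR, &high);  // raise only the main thread
  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}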

However, if the child threads are set to a low priority, higher-priority processes may affect communication.
We will report on the effect of higher-priority processes on communication in the next post.
Depending on those results, ROS2 may also need to consider the child threads’ priority to meet real-time requirements.

Since deterministic processing of each task is important for real-time performance, I verified the timer latency.
I’d like to continue to contribute to ROS2.
Questions, suggestions and advice are welcome.
Thank you.


Hello @hsgwa,

That is a really cool analysis you provided us here! Great insight and easily understandable :slight_smile:

Just wanted to make sure that eProsima is made aware of this, so I’ll add @Jaime_Martin_Losa here to have a look at it, if he hasn’t already done so. They are quite open to looking for improvements :slight_smile: At least that’s what I can see happening with Micro-ROS.

I’m not sure whom to add for CycloneDDS, if they are interested…?

@hsgwa, have you also considered doing your tests with other DDS implementations? There exist quite a few of them: https://design.ros2.org/articles/ros_middleware_interface.html
Also, it would be worth pointing out which exact versions of the different RMWs were used in your tests.
Maybe FastRTPS would already be faster with the new shared-memory communication in their latest release (coming to Foxy): ROS2 Default Behavior (Wifi)
Personally, I would also be interested in how https://github.com/ros2/rmw_iceoryx (shared-memory-only communication) would perform in your tests :slight_smile:

Also, I’m not sure if you are completely new to this community? Maybe you should check out the Real-Time Working Group: ROS 2 Real-time Working Group Online Meeting 16 - Apr 29, 2020 - Meeting Minutes @Dejan_Pangercic

They can probably help you figure out how to continue contributing to ROS2 :slight_smile:

@flo

Thank you for your valuable advice :smile:
Yes, it’s my first time in this community, and I’m glad to be here.

If there is continued demand for these tests, I’d like to expand them to other DDS implementations and versions, and integrate them into CI such as buildfarm_perf_test.

First, I will add the specific DDS versions used for this test.

I know that there are several projects working towards real-time support, but I haven’t been able to track what kinds of tests and improvements are in progress.
I’m embarrassed to say that I’m not very good at speaking English; would it be possible to communicate by text…?