ROS 2 Performance Benchmarking

In our testing of the speed-of-light performance of ROS 2 (Rolling) for moving messages through intra-process communication, we’re finding some bottlenecks that may or may not already be known. To flush out these issues, we developed a framework to measure the performance of a synthetic graph, similar to ros2_performance. With ros2_framework_perf (open to better naming suggestions), we can use normal launch graphs, which unlocks rclpy nodes in the graph, and we can track timer-based and synchronized messages, along with their histories, more cleanly.

We’ll be publishing the ros2_framework_perf framework shortly to reproduce our results, but here’s what we’ve found:

Identified Bottlenecks

We profile our code using perf, and the most prominent bottlenecks can be categorized into three main areas. At the system level, we observe significant overhead from memory management operations, with libc functions like _int_malloc and _int_free indicating frequent dynamic memory allocation and deallocation patterns. The kernel function asm_exc_page_fault indicates that page faults are not entirely eliminated. This suggests that the ROS 2 message passing pipeline is experiencing memory pressure, likely due to the high-frequency message publishing at 500 Hz. These memory-related bottlenecks might be contributing to the jitter we observe in the timing data.
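
To make that allocation pattern concrete, here is a minimal, illustrative sketch (not the benchmark’s actual emitter; the node name, topic, payload size, and message type are assumptions) of a 500 Hz publisher whose per-tick message construction is exactly the kind of hot-path heap traffic that shows up as _int_malloc/_int_free in perf:

```cpp
// Illustrative only: a 2 ms (500 Hz) wall timer that heap-allocates a new
// message and payload every tick. Names and sizes are assumptions, not the
// actual benchmark code.
#include <chrono>
#include <memory>
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/byte_multi_array.hpp"

class EmitterNode : public rclcpp::Node
{
public:
  EmitterNode()
  : rclcpp::Node("emitter_node")
  {
    pub_ = create_publisher<std_msgs::msg::ByteMultiArray>("tensor_encoder_output", 10);
    timer_ = create_wall_timer(std::chrono::microseconds(2000), [this]() {
      auto msg = std::make_unique<std_msgs::msg::ByteMultiArray>();  // heap allocation every tick
      msg->data.resize(4096);                                        // payload buffer allocation
      pub_->publish(std::move(msg));  // unique_ptr publish allows zero-copy intra-process handoff
    });
  }

private:
  rclcpp::Publisher<std_msgs::msg::ByteMultiArray>::SharedPtr pub_;
  rclcpp::TimerBase::SharedPtr timer_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<EmitterNode>());
  rclcpp::shutdown();
  return 0;
}
```

Every tick constructs a fresh message and its payload buffer, so at 500 Hz the allocator, and eventually the kernel’s page management, are exercised continuously.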

Additionally, we’ve identified some timer management and intra-process communication functions that appear prominently in our profiling data. The functions rclcpp::experimental::TimersManager::execute_ready_timers_unsafe(), rclcpp::experimental::TimersManager::execute_ready_timer(), and rclcpp::experimental::TimersManager::WeakTimersHeap::validate_and_lock() suggest that timer execution overhead may be contributing to the periodic latency spikes we observe, though some of this work may be unavoidable. Similarly, rclcpp::experimental::IntraProcessManager::add_shared_msg_to_buffers indicates that intra-process message handling could be adding latency to the message pipeline.
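
For context, these symbols come from the executor and intra-process configuration rather than from the application code itself. A sketch of the setup that exercises those code paths (the header path and names may vary by rclcpp version; this is an assumption, not the framework’s actual launch configuration):

```cpp
// Sketch: enabling intra-process communication and the EventsExecutor.
// The EventsExecutor drives timers through rclcpp::experimental::TimersManager,
// and use_intra_process_comms(true) routes publishes through
// IntraProcessManager::add_shared_msg_to_buffers -- the symbols seen in perf.
#include <memory>
#include "rclcpp/rclcpp.hpp"
#include "rclcpp/experimental/executors/events_executor/events_executor.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  auto options = rclcpp::NodeOptions().use_intra_process_comms(true);
  auto node = std::make_shared<rclcpp::Node>("tensor_encoder_node", options);

  rclcpp::experimental::executors::EventsExecutor executor;
  executor.add_node(node);
  executor.spin();

  rclcpp::shutdown();
  return 0;
}
```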

Finally, at the kernel level, functions like unmap_page_range and zap_page_range_single reveal that the application is causing memory churn, frequently requesting pages from and returning them to the operating system.

Some smaller functions that we think might be contributors to jitter:

FastDDS Layer

  • Key functions identified:

      • SharedMemTransport::select_locators

      • boost::interprocess::shared_memory_object::remove

Write/Read Operations

  • Key functions identified:

      • BaseWriter::update_cached_info_nts

      • WriterHistory::create_change

      • StatefulWriter::deliver_sample_to_network

Latency Analysis

The script generates a two-part plot to analyze message timing. The top graph shows the time difference between consecutive messages, with blue dots for deltas between received messages and red dots for deltas between published messages, which helps identify the source of message jitter by comparing the stability of the two. The bottom graph displays the end-to-end latency for each message, where green dots on the x-axis mark the publish time and blue 'x's show the receive time; the vertical distance between them represents the latency, making it easy to spot outliers and trends. As we can see in the graph above, there is an almost-periodic pattern of message publishing/receiving that doesn’t hit our 500 Hz target. We profile the entire message passing pipeline to identify what might be the cause of these spikes in timestamp deltas, as well as the jitter around 2000 microseconds.

Background - Test Overview

Purpose

The ros2_framework_perf framework measures and analyzes the performance characteristics of ROS 2’s core components, with a particular focus on the executor and transport layers. It sends messages of a fixed size through the ROS graph below to locate potential bottlenecks. To gain deeper insights, the entire benchmark execution is wrapped by the standard Linux perf tool, as orchestrated by run_benchmark_with_profiling.sh. This allows for a granular analysis of where time is spent, from the user’s application code down through the ROS Client Library (rcl), the middleware layer (rmw), and the underlying DDS implementation.

Benchmark Graph Details

The graph above illustrates our benchmark setup. It measures the end-to-end latency and performance characteristics of message passing between these nodes, with particular focus on the tensor_encoder_node to tensor_inference_node communication path. The tensor_encoder_node publishes processed tensor data at a consistent 500 Hz frequency on the /tensor_encoder_output topic, and the tensor_inference_node subscribes to this topic, processes the incoming messages, and publishes its results to /tensor_inference_output.

The benchmark captures comprehensive metrics including message publish/receive timing, end-to-end latency, and inter-message timing deltas. By measuring the time between when the tensor encoder publishes a message and when the inference node receives and processes it, we can identify bottlenecks in the ROS 2 message passing infrastructure, including serialization overhead, transport layer delays, and executor scheduling latency. The data shows that the system mostly maintains consistent performance; however, some jitter does exist.
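
As a rough illustration of the receive-side measurement (a minimal sketch under our own assumptions, not the framework’s actual instrumentation), a subscriber can record inter-arrival deltas on /tensor_encoder_output, which correspond to the blue dots in the top plot:

```cpp
// Illustrative receiver: records the delta between consecutive received
// messages (ideally ~2000 us at 500 Hz). The real framework also tracks
// publish timestamps and message histories, which are omitted here.
#include <cstdint>
#include <memory>
#include <optional>
#include <vector>
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/byte_multi_array.hpp"

class ReceiverNode : public rclcpp::Node
{
public:
  ReceiverNode()
  : rclcpp::Node("receiver_node")
  {
    sub_ = create_subscription<std_msgs::msg::ByteMultiArray>(
      "tensor_encoder_output", 10,
      [this](std_msgs::msg::ByteMultiArray::ConstSharedPtr /*msg*/) {
        const rclcpp::Time now = this->get_clock()->now();
        if (last_rx_) {
          // Inter-arrival delta in microseconds.
          rx_deltas_us_.push_back((now - *last_rx_).nanoseconds() / 1000);
        }
        last_rx_ = now;
      });
  }

  // Collected inter-arrival deltas for offline plotting.
  const std::vector<int64_t> & rx_deltas_us() const { return rx_deltas_us_; }

private:
  rclcpp::Subscription<std_msgs::msg::ByteMultiArray>::SharedPtr sub_;
  std::optional<rclcpp::Time> last_rx_;
  std::vector<int64_t> rx_deltas_us_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<ReceiverNode>());
  rclcpp::shutdown();
  return 0;
}
```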

What’s Being Benchmarked

The framework is testing the end-to-end latency of message passing between publishers and subscribers, with particular attention to:

  • Message serialization/deserialization overhead

  • Memory management operations

  • Transport layer efficiency

  • Synchronization mechanisms

  • String handling and memory allocation patterns

Test Architecture

The test setup consists of:

  1. An emitter node that publishes messages

  2. A receiver node that subscribes to these messages

  3. Analysis tools that process the timing data

Ideal Case:

Ideally, we’d see all publish and receive deltas on the horizontal dotted green line in the top graph, and all blue X’s on the x-axis in the bottom graph. This would indicate that every published and received message is hitting its desired frequency, and there is zero timestamp delta between when a given message was published and when it was received. We believe that because our benchmark is intra-process, the jitter we’re seeing can be minimized further.


Great job on this. I’ll be interested to see what bottlenecks can be improved from your analysis.
Slightly unrelated - I am curious to see how you use perf to profile the system. I have had limited success profiling our C++ application using perf.


Hi @MihirRaoNV ,

Without sharing too much, this is consistent with some internal tests we did for the MELFA ROS2 Driver. Even though we are using the ros2_control framework, we observe some jitter when controlling the MELFA at its designed frequency. As such, we recommend that our users apply at least a 10 ms low-pass filter in the MELFA robot controller parameters to the commands received from ROS 2, for validated, smooth, and reliable performance.

Your results are interesting and very relevant. Thanks for sharing!


Your findings indicate that you use the events executor and not the default executor. You should mention this. It would be interesting to see the difference between the two executors.


It certainly would be interesting to see the differences, but you are correct. For the benchmarking above, we use the EventsExecutor. We created a PR here that includes these changes. Thanks for pointing this out.

That’s very interesting! Would it be possible to create a similar benchmark in ROS Noetic?

You didn’t mention serialization overhead in the bottleneck list. In the rough benchmark I did on serialization, I identified get_message_typesupport_handle to be a main contributor to the loss of performance of serialize_message compared to ROS 1. Is it because its contribution is small compared to the other bottlenecks? Or is serialization bypassed because the graph runs in a single process?