In our testing of the speed-of-light performance of ROS 2 (Rolling) for moving messages through intra-process communication, we’re finding some bottlenecks that may or may not be already known. In order to flush out these issues, we developed a framework to measure the performance of a synthetic graph, similar to ros2_performance. In ros2_framework_perf (open to better naming suggestions), we can use normal launch graphs, which unlocks rclpy nodes in the graph, and we can track timer-based and synchronized messages along with their histories more cleanly. We’ll be publishing the ros2_framework_perf framework shortly so our results can be reproduced, but here’s what we’ve found:
Identified Bottlenecks
We profile our code using perf, and the most prominent bottlenecks can be categorized into three main areas. At the system level, we observe significant overhead from memory management operations, with libc functions like _int_malloc and _int_free indicating frequent dynamic memory allocation and deallocation. The kernel function asm_exc_page_fault indicates that page faults are not entirely eliminated. This suggests that the ROS 2 message passing pipeline is experiencing memory pressure, likely due to the high-frequency message publishing at 500 Hz. These memory-related bottlenecks might be contributing to the jitter we observe in the timing data.
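One direction for cutting per-publish heap traffic is to publish pre-sized messages as std::unique_ptr, so the intra-process layer can move ownership instead of copying. The sketch below is illustrative rather than the benchmark's actual code; the message type and payload size are placeholders:

```cpp
// Hedged sketch, not the benchmark's code: publish a pre-sized message as a
// std::unique_ptr so the intra-process layer can take ownership instead of
// copying. Message type and payload size are placeholders.
#include <cstddef>
#include <memory>
#include <utility>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/float32_multi_array.hpp"

using TensorMsg = std_msgs::msg::Float32MultiArray;

void publish_preallocated(
  const rclcpp::Publisher<TensorMsg>::SharedPtr & pub,
  std::size_t payload_elems)
{
  // One allocation for the message plus one for the payload, sized up front
  // so the vector does not grow (and reallocate) while being filled.
  auto msg = std::make_unique<TensorMsg>();
  msg->data.resize(payload_elems);
  // ... fill msg->data ...

  // Ownership is moved into the intra-process pipeline; with a single
  // subscription that takes a unique_ptr, no extra copy should be needed.
  pub->publish(std::move(msg));
}
```

Note that this only removes copies on the publish path; the per-cycle allocations themselves remain, which is where a message pool or (where the RMW supports it) loaned messages might help further.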
Additionally, we’ve identified timer management and intra-process communication functions that appear in our profiling data. The functions rclcpp::experimental::TimersManager::execute_ready_timers_unsafe(), rclcpp::experimental::TimersManager::execute_ready_timer(), and rclcpp::experimental::TimersManager::WeakTimersHeap::validate_and_lock() suggest that timer execution overhead may be contributing to the periodic latency spikes we observe, though some of that work may be necessary. Similarly, rclcpp::experimental::IntraProcessManager::add_shared_msg_to_buffers indicates that intra-process message handling could be adding latency to the message pipeline.
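As an aside on reproducibility: rclcpp::experimental::TimersManager is, to our knowledge, the timer dispatch path used by the experimental EventsExecutor, so seeing those symbols reflects the executor configuration rather than the nodes themselves. A minimal sketch of that configuration (placeholder node name and callback, not the actual benchmark code):

```cpp
// Minimal sketch (placeholder names): a 500 Hz wall timer dispatched by the
// experimental EventsExecutor, which is where rclcpp::experimental::
// TimersManager shows up on the timer path.
#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "rclcpp/experimental/executors/events_executor/events_executor.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  auto node = std::make_shared<rclcpp::Node>("timer_node");

  // 2 ms period -> 500 Hz. The callback body stands in for the real work.
  auto timer = node->create_wall_timer(
    std::chrono::microseconds(2000),
    []() { /* publish / periodic work here */ });

  rclcpp::experimental::executors::EventsExecutor executor;
  executor.add_node(node);
  executor.spin();

  rclcpp::shutdown();
  return 0;
}
```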
Finally, at the kernel level, functions like unmap_page_range and zap_page_range_single reveal that the application is causing memory churn, frequently allocating pages from and returning pages to the operating system.
Some smaller functions that we think might be contributors to jitter:
FastDDS Layer
- Key functions identified:
  - SharedMemTransport::select_locators
  - boost::interprocess::shared_memory_object::remove

Write/Read Operations
- Key functions identified:
  - BaseWriter::update_cached_info_nts
  - WriterHistory::create_change
  - StatefulWriter::deliver_sample_to_network
Latency Analysis
The script generates a two-part plot to analyze message timing. The top graph shows the time difference between consecutive messages, with blue dots for deltas between received messages and red dots for deltas between published messages, which helps identify the source of message jitter by comparing the stability of the two. The bottom graph displays the end-to-end latency for each message, where green dots on the x-axis mark the publish time and blue 'x's show the receive time; the vertical distance between them represents the latency, making it easy to spot outliers and trends. As we can see in the graph above, there is a nearly periodic pattern of message publishing/receiving that doesn’t hit our 500 Hz target. We profile the entire message passing pipeline to identify what might be the cause of these spikes in timestamp deltas, as well as the jitter around 2000 microseconds.
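For reference, the two plotted quantities reduce to simple differences over the recorded timestamps. The sketch below is schematic rather than the actual analysis script, and assumes nanosecond timestamps with a one-to-one pairing of published and received messages:

```cpp
// Schematic of the two plotted quantities, not the actual analysis script.
// Assumes nanosecond timestamps and a one-to-one pairing of published and
// received messages (i.e. no drops).
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Timing
{
  std::vector<std::int64_t> publish_delta_ns;  // top plot, red dots
  std::vector<std::int64_t> receive_delta_ns;  // top plot, blue dots
  std::vector<std::int64_t> latency_ns;        // bottom plot, receive - publish
};

Timing compute_timing(
  const std::vector<std::int64_t> & publish_ns,
  const std::vector<std::int64_t> & receive_ns)
{
  Timing out;
  // Inter-message deltas on each side; at 500 Hz the ideal value is
  // 2,000,000 ns.
  for (std::size_t i = 1; i < publish_ns.size(); ++i) {
    out.publish_delta_ns.push_back(publish_ns[i] - publish_ns[i - 1]);
  }
  for (std::size_t i = 1; i < receive_ns.size(); ++i) {
    out.receive_delta_ns.push_back(receive_ns[i] - receive_ns[i - 1]);
  }
  // End-to-end latency per message; the ideal value is 0.
  const std::size_t n = std::min(publish_ns.size(), receive_ns.size());
  for (std::size_t i = 0; i < n; ++i) {
    out.latency_ns.push_back(receive_ns[i] - publish_ns[i]);
  }
  return out;
}
```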
Background - Test Overview
Purpose
The ros2_framework_perf framework measures and analyzes the performance characteristics of ROS 2’s core components, with a particular focus on the executor and transport layers. The framework sends messages of a fixed size through the ROS graph below to locate potential bottlenecks. To gain deeper insights, the entire benchmark execution is wrapped by the standard Linux perf tool, as orchestrated by run_benchmark_with_profiling.sh. This allows for a granular analysis of where time is spent, from the user’s application code down through the ROS Client Library (rcl), the middleware layer (rmw), and the underlying DDS implementation.
Benchmark Graph Details
The graph above illustrates our benchmark setup. It measures the end-to-end latency and performance characteristics of message passing between these nodes, with particular focus on the tensor_encoder_node to tensor_inference_node communication path. The tensor_encoder_node publishes processed tensor data at a consistent 500 Hz frequency on the /tensor_encoder_output topic, and the tensor_inference_node subscribes to this topic, processes the incoming messages, and publishes its results to /tensor_inference_output.
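For readers who can’t see the graph image, the measured path reduces to roughly the following pair of nodes. This is a simplified sketch with a placeholder message type and payload, not the actual benchmark code:

```cpp
// Simplified sketch of the measured path (the real nodes and message types
// live in ros2_framework_perf; the message type and payload here are
// placeholders).
#include <chrono>
#include <memory>
#include <utility>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/float32_multi_array.hpp"

using TensorMsg = std_msgs::msg::Float32MultiArray;

class TensorEncoderNode : public rclcpp::Node
{
public:
  TensorEncoderNode()
  : Node("tensor_encoder_node", rclcpp::NodeOptions().use_intra_process_comms(true))
  {
    pub_ = create_publisher<TensorMsg>("/tensor_encoder_output", 10);
    // 500 Hz publishing cadence (2 ms period).
    timer_ = create_wall_timer(std::chrono::microseconds(2000), [this]() {
      auto msg = std::make_unique<TensorMsg>();
      msg->data.resize(1024);  // placeholder tensor payload
      pub_->publish(std::move(msg));
    });
  }

private:
  rclcpp::Publisher<TensorMsg>::SharedPtr pub_;
  rclcpp::TimerBase::SharedPtr timer_;
};

class TensorInferenceNode : public rclcpp::Node
{
public:
  TensorInferenceNode()
  : Node("tensor_inference_node", rclcpp::NodeOptions().use_intra_process_comms(true))
  {
    pub_ = create_publisher<TensorMsg>("/tensor_inference_output", 10);
    sub_ = create_subscription<TensorMsg>(
      "/tensor_encoder_output", 10,
      [this](TensorMsg::UniquePtr msg) {
        // Stand-in for inference work; the result is forwarded downstream.
        pub_->publish(std::move(msg));
      });
  }

private:
  rclcpp::Publisher<TensorMsg>::SharedPtr pub_;
  rclcpp::Subscription<TensorMsg>::SharedPtr sub_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::executors::SingleThreadedExecutor executor;
  auto encoder = std::make_shared<TensorEncoderNode>();
  auto inference = std::make_shared<TensorInferenceNode>();
  executor.add_node(encoder);
  executor.add_node(inference);
  executor.spin();
  rclcpp::shutdown();
  return 0;
}
```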
The benchmark captures comprehensive metrics, including message publish/receive timing, end-to-end latency, and inter-message timing deltas. By measuring the time between when the tensor encoder publishes a message and when the inference node receives and processes it, we can identify bottlenecks in the ROS 2 message passing infrastructure, including serialization overhead, transport layer delays, and executor scheduling latency. The data shows that the system mostly maintains consistent performance; however, some jitter does exist.
What’s Being Benchmarked
The framework is testing the end-to-end latency of message passing between publishers and subscribers, with particular attention to:
- Message serialization/deserialization overhead
- Memory management operations
- Transport layer efficiency
- Synchronization mechanisms
- String handling and memory allocation patterns
Test Architecture
The test setup consists of:
- An emitter node that publishes messages
- A receiver node that subscribes to these messages
- Analysis tools that process the timing data
Ideal Case:
Ideally, we’d see all publish and receive deltas on the horizontal dotted green line in the top graph, and all blue 'x's on the x-axis in the bottom graph. This would indicate that every published and received message is hitting its desired frequency, and that there is zero delay between when a given message was published and when it was received. We believe that because our benchmark is intra-process, the jitter we’re seeing can be minimized further.