ROS 2 benchmark open source release

We are releasing open source benchmark tooling for ROS 2 that measures the performance of graphs of nodes. This tooling allows realistic assessment of robotics applications under load, including message transport costs in RCL, for practical benchmarking indicative of real-world performance. It does not require modifying nodes to measure results, and it standardizes input rosbag datasets so benchmark results can be independently verified.

ros2_benchmark for Humble is available now at github.com/NVIDIA-ISAAC-ROS/ros2_benchmark.

The r2b dataset 2023, which provides standard input data for benchmarking, is available for download from NGC.

(r2b_turtlebot_takeoff sequence)

ros2_benchmark follows industry best practices and is professionally hardened for throughput and latency measurement of graphs of nodes in real-time robotics applications, including:

  • Dependable results - automated performance measurements are run for multiple seconds, N times (default N = 5), with the min and max results discarded to reduce variability. Benchmark results are reported in log files for import into your visualization tool of choice.
  • Input dataset - the r2b dataset 2023, available for download from NGC under a Creative Commons Attribution 4.0 license, provides consistent input to the graph from rosbag. Additional input data can be added as needed.
  • Input image resolutions - with a broad range of computing hardware available, image processing is performed at different resolutions depending on the robotics application.
  • Input & output transport time - time spent in RCL publishing and receiving messages, for both inter-process and intra-process communication, is included in the measurement results; this accurately represents what can be expected in a robotics application and avoids the inflated results that come from excluding message-passing costs.
  • Input & output type adaptation - input data is injected using standard ROS types, or using type adaptation and type negotiation.
  • Benchmark parameters - the parameters used for testing, including data input length, publishing rate, and input size, can be customized with a configuration file.
  • Throughput auto finder - measuring peak throughput of the graph, with <1% topic drops, requires automatically finding the publishing rate at which that peak occurs. The throughput auto finder efficiently searches for the input data publishing rate that delivers peak throughput (see the sketch after this list).
  • Real-time latency - a fixed topic publishing rate provides a real-time measure of latency. This shows what is delivered to the real-time system at the target fixed rate, whereas throughput shows the peak performance possible for the robotics application.
  • Cloud native - measurements can be performed on Kubernetes as part of automated testing or nightly CI/CD testing in modern software development. Measurements can also be performed on local developer systems.
  • Opaque testing - graphs of nodes are tested as binaries; all performance measurement tooling lives in the benchmark itself and does not modify code in the graph under test. This enables unobtrusive performance measurement of everything from open source to proprietary solutions with the same tools.
  • Transparency - results in JSON include the parameters used to run the benchmark, including an MD5 checksum of the input rosbag for independent verification of results.
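
For illustration, here is a minimal Python sketch of the throughput auto finder idea referenced above: binary-search for the highest publishing rate that keeps topic drops under 1%. The run_benchmark helper is hypothetical, standing in for one measurement pass of the graph under test; this is not ros2_benchmark's actual implementation.

# Minimal sketch of a throughput auto finder (not ros2_benchmark's actual code).
# Assumes a hypothetical run_benchmark(rate_hz) helper that plays the input
# rosbag into the graph at the given rate and returns the fraction of
# dropped output messages.

MAX_DROP_RATIO = 0.01  # peak throughput is defined as <1% topic drops

def find_peak_throughput(run_benchmark, low_hz=10.0, high_hz=1000.0, tolerance_hz=1.0):
    """Binary-search the highest publishing rate with an acceptable drop ratio.

    Assumes low_hz is sustainable and high_hz is not; a sketch, not a
    production search strategy.
    """
    best_rate = low_hz
    while high_hz - low_hz > tolerance_hz:
        candidate = (low_hz + high_hz) / 2.0
        drop_ratio = run_benchmark(candidate)
        if drop_ratio < MAX_DROP_RATIO:
            best_rate = candidate   # graph keeps up; try a higher rate
            low_hz = candidate
        else:
            high_hz = candidate     # graph drops too much; back off
    return best_rate

In ros2_benchmark, this kind of search, together with the N-run repetition described above, is automated end to end.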

While developing type adaptation (REP-2007) and type negotiation (REP-2009) for Humble to improve the performance of graphs of ROS 2 nodes, we needed a way to consistently measure performance and improvements. Prior ROS benchmarks focused on measuring the performance of publishing and receiving topics in the same process, or across processes using DDS. We needed to measure the performance of graphs of nodes, including time spent publishing topics and computation within the graph, in an automated, cloud-native, consistent way that best represents what can be expected in a robotics application. ros2_benchmark was created to provide this.

(r2b_lounge, r2b_storage, r2b_hope, and r2b_hallway sequences)
The r2b dataset provides sensor data captured in multiple scenarios, compressed to rosbag, as input data for benchmarking. This includes high-precision, time-synchronized sensor data from 3D LiDAR, stereo cameras, and IMUs.

ros2_benchmark is our professional-grade tool used to measure all of the results we publish for Isaac ROS. It is part of our CI/CD process to measure performance nightly and catch regressions; we run this tool on 7 platforms, using aarch64 and x86_64 architectures, across hundreds of graph configurations. The tool has been run tens of thousands of times in automated cloud-native testing over more than a year, and on developer systems.

A sample subset of the output JSON log for the Stereo Image Proc node (CPU-only) running on Jetson AGX Orin is provided below.

{ "BasicPerformanceMetrics.RECEIVED_DURATION": 4981.019205729167, "BasicPerformanceMetrics.MEAN_PLAYBACK_FRAME_RATE": 66.44940807620169, "BasicPerformanceMetrics.MEAN_FRAME_RATE": 66.45226335469312, "BasicPerformanceMetrics.NUM_MISSED_FRAMES": 0.0, "BasicPerformanceMetrics.NUM_FRAMES_SENT": 331.0, "BasicPerformanceMetrics.FIRST_SENT_RECEIVED_LATENCY": 14.696207682291666, "BasicPerformanceMetrics.LAST_SENT_RECEIVED_LATENCY": 14.484619140625, "BasicPerformanceMetrics.MAX_JITTER": 2.924560546875, "BasicPerformanceMetrics.MIN_JITTER": 0.000244140625, "BasicPerformanceMetrics.MEAN_JITTER": 0.2223148983662614, "BasicPerformanceMetrics.STD_DEV_JITTER": 0.28068860820818814, "CPUProfilingMetrics.MAX_CPU_UTIL": 27.633333333333336, "CPUProfilingMetrics.MIN_CPU_UTIL": 25.233333333333334, "CPUProfilingMetrics.MEAN_CPU_UTIL": 25.897777777777776, "CPUProfilingMetrics.STD_DEV_CPU_UTIL": 0.6463663508576992, "CPUProfilingMetrics.BASELINE_CPU_UTIL": 26.008333333333336,

“custom”: {
“data_resolution”: “Quarter HD (960,540)”
},
}
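
A log like this is straightforward to post-process. The sketch below, using hypothetical file paths, loads such a results file, prints selected metrics, and recomputes the MD5 of the input rosbag for independent verification against the checksum recorded in the log (the exact key under which that checksum is stored is not shown in this subset, so the comparison is left to the reader).

import hashlib
import json

# Hypothetical paths; substitute your own log and rosbag files.
LOG_PATH = "results/stereo_image_proc_cpu.json"
ROSBAG_PATH = "datasets/r2b_storage/r2b_storage_0.db3"

with open(LOG_PATH) as f:
    results = json.load(f)

print("Mean frame rate:", results["BasicPerformanceMetrics.MEAN_FRAME_RATE"])
print("Mean jitter (ms):", results["BasicPerformanceMetrics.MEAN_JITTER"])
print("Mean CPU util (%):", results["CPUProfilingMetrics.MEAN_CPU_UTIL"])

# Recompute the input rosbag checksum; compare it against the MD5
# recorded in the benchmark log to independently verify the result.
md5 = hashlib.md5()
with open(ROSBAG_PATH, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)
print("Input rosbag MD5:", md5.hexdigest())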

To use ros2_benchmark, clone the repositories you need into your ROS workspace and build them from source with colcon alongside your other ROS 2 packages, then start measuring the performance of graphs of nodes in your ROS 2 applications.

Thanks


Happy to see this contribution :clap:. In particular, well done contributing and disclosing this under Apache 2.0 :partying_face:.

The values you highlight above are well aligned with the REP-2014 PR, and with the ones we’re using for the RobotPerf project. The major differences are that RobotPerf 1) is a vendor-neutral benchmarking suite (evaluating :robot: robotics computing performance on various types of compute solutions, including CPUs, GPUs, and FPGAs), and 2) leverages ros2_tracing as a unified (community-accepted) approach to instrument graphs and measure performance.

I gave this a try and here’s some feedback at the moment:

  • The repository instructions don’t work out of the box on aarch64 (I also gave it a quick try in a Dockerized x86_64 ROS container and bumped into similar issues). The instructions are also unnecessarily complicated and currently very tied to Nvidia tools and packages. Instead, you should consider a development flow that’s simpler and capable of recognizing/benchmarking various compute substrates without imposing additional complexity on the developer. Here’s the flow we’re aiming for in RobotPerf, which you can try out today:
# Create a ROS 2 overlay workspace
mkdir -p /tmp/benchmark_ws/src

# Clone the benchmark repository
cd /tmp/benchmark_ws/src && git clone https://github.com/robotperf/benchmarks

# Fetch dependencies
source /opt/ros/humble/setup.bash
cd /tmp/benchmark_ws && sudo rosdep update || true && sudo apt-get update &&
  sudo rosdep install --from-paths src --ignore-src --rosdistro humble -y

# Build the benchmark
colcon build --merge-install --packages-up-to a1_perception_2nodes

# Source the workspace as an overlay, launch the benchmark
source install/setup.bash
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 launch a1_perception_2nodes trace_a1_perception_2nodes.launch.py

(taken from the README of a1 benchmark)

  • I love the fact that Nvidia embraced the discourse that “Robots are real-time systems”; however, things should be coherent with that statement. The way you’re collecting timestamps should be reconsidered. You should consider using a low-overhead framework for real-time tracing of ROS 2. That’s what ros2_tracing is for (it’s also why the whole ROS 2 stack is instrumented with it, and why we are all using it). A minimal example of launching a graph under tracing follows this list.
  • The approach implemented here complicates (and reinvents the wheel for) profiling accelerators to measure performance. You should consider leveraging ros2_tracing directly instead of pointing to Nsight Systems. ros2_tracing can be extended to integrate vendor-specific profilers; this was done in the past with systems from AMD and FPGA accelerators.
  • Using JSON is great, but there’s a good reason why we stuck with CTF. Besides being real-time, robots are systems of systems, often involving multiple networks, multiple clocks, etc. You want to stick to a format that can collect and store data from robot components spread across a network. CTF is really powerful in this regard.
  • I don’t believe it’s a good idea to discard min and max results to reduce variability. I think there are other ways to tackle variability, and the max result is often what we care about most in real-time systems.
  • It’s not clear to me whether or not you’re storing data for later analysis. Is this possible? One of the fantastic aspects of the approach we’re following at RobotPerf, in alignment with the REP-2014 PR, is that we decouple data collection from data analysis (we generate a series of CTF files that can later be post-processed as desired).
  • Fetching the r2b dataset is incredibly slow. Moreover, right now, the ngc-cli is somewhat buggy; it only downloads r2b_cafe for me, from what I can see.
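
For readers unfamiliar with it, ros2_tracing lets a launch file start a low-overhead LTTng tracing session alongside the graph under test via the Trace action from tracetools_launch, producing CTF traces for offline analysis. A minimal sketch follows; the session name is illustrative and the talker node is a placeholder for the graph under test:

# minimal_trace_launch.py - sketch of tracing a graph with ros2_tracing
from launch import LaunchDescription
from launch_ros.actions import Node
from tracetools_launch.action import Trace


def generate_launch_description():
    return LaunchDescription([
        # Start an LTTng session recording ROS 2 userspace tracepoints (CTF output).
        Trace(
            session_name='benchmark_trace',  # illustrative session name
            events_ust=['ros2:*'],
        ),
        # Placeholder node; replace with the graph being benchmarked.
        Node(package='demo_nodes_cpp', executable='talker'),
    ])

The resulting CTF trace can then be post-processed offline (for example with babeltrace or tracetools_analysis), which is the decoupled collect-then-analyze flow described above.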

Overall, I see this very positively, but I fear it’s very tied to NVIDIA-ISAAC-ROS marketing efforts and not that useful for the overall ROS community. I’m also a bit disappointed this came as a separate project, not connected/aligned to community efforts at the ROS 2 Hardware Acceleration Working Group. I will plan for a discussion around this topic in the next meeting (happening 2023-04-18T16:30:00Z; event link, past recordings).

I believe there’s much we can use from this disclosure, and we’ll try to streamline it for general consumption within RobotPerf.


Thanks for taking a look. Could you report issues at Issues · NVIDIA-ISAAC-ROS/ros2_benchmark · GitHub so we can work to address them?

(r2b_hope sequence left, r2b_storage sequence right)

We are reviewing r2b dataset deployment on NGC. I recommend starting with the r2b_hope sequence for testing a download, as it’s 30 MB in size. If I were to choose one sequence, I would use r2b_storage, as it contains many features specific to robotics. The rosbags are large because they contain compressed real sensor data; as such, they were too large to host practically on GitHub, which is why they are available for free from NGC.

Don’t be a bit disappointed that this is separate; it predates your work by more than a year. We developed ros2_benchmark in fall 2021 and have been running benchmarks nightly since.

We’ve wanted to share our expertise from decades of benchmarking accelerated computing platforms with the ROS community for quite some time. While there are applications where compute is sufficient, in many applications compute is limited. Tooling to measure throughput, latency, and compute utilization is essential to make informed design decisions on how best a robotics application can meet its real-time requirements, and where to focus optimization work. This release offers great value to the ROS 2 community in accomplishing this.

ros2_benchmark is a stand-alone package that measures performance for graphs of nodes off the shelf, with nothing vendor-specific to NVIDIA other than NVIDIA being its developer. It enables transparency and independent verification of the results we publish for accelerated computing packages in Isaac ROS. The work is complete and hardened, as we depend on it for our own accelerated computing work; as such, we will continue to maintain and improve it.

As you were replying to this, we were getting work done with Kubernetes, running 190 KPI measurements with ros2_benchmark on multiple platforms (Orin, Orin NX, Orin Nano, Xavier, Xavier NX, RTX 3060 Ti, RTX 4090, x86 Core i7 11th Gen, and x86 Core i7 12th Gen) while sleeping; getting work done as you sleep is wonderful.


(internal nightly performance coverage run with Kubernetes, on aarch64 and x86 platforms for our work on ROS 2)

In the documentation we provide methods for developers to profile, listing ros2_tracing for CPU and Nsight Systems for accelerated computing. I understand the ask of building custom plugins for ros2_tracing; this may be something we can do in the future, as our small ROS team is focused on providing great accelerated computing packages that let roboticists develop their applications while saving them the time of grinding on the nuances of hardware. With Nsight Systems, we give robotics developers free access to the same best-in-class profiling tools used to optimize amazing products, including ChatGPT and Nintendo Switch.

We have a lot in common: we agree that benchmarking is an essential tool for the development of optimized robotics applications. It’s great if you can benefit from this for your related work.


Following up: the Quickstart for ros2_benchmark has been updated to show how to run natively with Humble packages on platforms running Humble, as a vendor-agnostic solution to benchmarking.

See you on Thursday, May 4th for the webinar on using ros2_benchmark.

Thanks


We’re pleased to announce that live streaming has been added to ROS 2 Benchmark in a release update.

This enables benchmarking with sensors in the loop, measuring performance, CPU/GPU load, and latency at the incoming capture rate from the sensor, as sketched below.
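
For intuition, per-message latency with a live sensor can be approximated by subtracting each output message’s header stamp from the time of receipt. The rclpy sketch below illustrates the idea with an assumed topic name and message type; the actual live-streaming support in ROS 2 Benchmark is more sophisticated than this.

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image


class LatencyProbe(Node):
    """Rough per-message latency: receive time minus the header stamp."""

    def __init__(self):
        super().__init__('latency_probe')
        # Assumed topic name; point this at the output topic of the graph.
        self.sub = self.create_subscription(
            Image, '/camera/image_rect', self.callback, 10)

    def callback(self, msg):
        now = self.get_clock().now().nanoseconds
        stamp = msg.header.stamp.sec * 1_000_000_000 + msg.header.stamp.nanosec
        self.get_logger().info(f'latency: {(now - stamp) / 1e6:.3f} ms')


def main():
    rclpy.init()
    rclpy.spin(LatencyProbe())


if __name__ == '__main__':
    main()

Note that this assumes the sensor and the measuring host share a clock; across machines, clock synchronization matters.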

Thanks


Hi @ggrigor
This is great news. I’m going to test it very soon.