We are celebrating the release of ROS 2 Humble and the new features it brings to enable hardware acceleration for graphs of nodes.
This is the culmination of a nine-month collaboration with Open Robotics to improve ROS performance on hardware-accelerated platforms. Our thanks and appreciation go to Chris Lalancette, Audrow Nash, Gonzalo de Pedro, and William Woodall for all of their hard work and relentless effort; special thanks to Brian Gerkey for partnering on this work.
As robotics applications embrace AI, CV, and other compute-intensive workloads, it has become imperative to enable hardware acceleration in ROS. With hardware acceleration, these applications can perform more functions with higher throughput and better perf/watt. How these benefits are realized is often specific to the hardware implementation and therefore needs to be abstracted from ROS.

(example graph of nodes using hardware acceleration in Foxy (top graph) compared to the use of type adaptation in Humble (bottom graph). Type adaptation reduces copies from CPU to GPU in a pipeline of nodes, while increasing concurrency between the CPU and GPU)
TYPE ADAPTATION
ROS topics can be adapted to a format better suited for acceleration in hardware using type adaptation (REP-2007). A node using an adapted type can publish and/or receive the adapted type. Nodes using an adapted type need to provide functions to convert from the standard type to the adapted type, and vice versa. This enables a graph of nodes to use an adapted type, which can improve concurrency between the CPU and the hardware accelerator, offload compute tasks from the CPU, and eliminate memory copies between the CPU and the hardware accelerator.
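To make the copy savings concrete, here is a minimal sketch in plain Python. It is a toy model and not the ROS API: `ImageMsg`, `GpuImage`, `CopyCounter`, `invert_on_gpu`, and the two `run_with_*` functions are hypothetical names, and the "GPU buffer" is just a `bytearray`. The two conversion functions borrow their names from the conversion functions a C++ node provides through `rclcpp::TypeAdapter`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ImageMsg:
    """Hypothetical stand-in for a standard ROS image message in CPU memory."""
    data: bytes


@dataclass
class GpuImage:
    """Hypothetical adapted type: a handle to a buffer on the accelerator."""
    device_buffer: bytearray


@dataclass
class CopyCounter:
    """Counts simulated transfers between CPU and accelerator memory."""
    host_to_device: int = 0
    device_to_host: int = 0


def convert_to_custom(msg: ImageMsg, counter: CopyCounter) -> GpuImage:
    # Standard type -> adapted type: costs one simulated CPU-to-GPU copy.
    counter.host_to_device += 1
    return GpuImage(bytearray(msg.data))


def convert_to_ros_message(img: GpuImage, counter: CopyCounter) -> ImageMsg:
    # Adapted type -> standard type: costs one simulated GPU-to-CPU copy.
    counter.device_to_host += 1
    return ImageMsg(bytes(img.device_buffer))


def invert_on_gpu(img: GpuImage) -> GpuImage:
    # Stand-in for an accelerated kernel that works in place on device memory.
    for i, value in enumerate(img.device_buffer):
        img.device_buffer[i] = 255 - value
    return img


Stage = Callable[[GpuImage], GpuImage]


def run_with_standard_type(msg: ImageMsg, stages: List[Stage],
                           counter: CopyCounter) -> ImageMsg:
    # Every node receives and publishes the standard type, so every node
    # uploads its input and downloads its output.
    for stage in stages:
        gpu_image = convert_to_custom(msg, counter)
        msg = convert_to_ros_message(stage(gpu_image), counter)
    return msg


def run_with_adapted_type(msg: ImageMsg, stages: List[Stage],
                          counter: CopyCounter) -> ImageMsg:
    # Nodes exchange the adapted type, so the data is converted only at the
    # edges of the graph.
    gpu_image = convert_to_custom(msg, counter)
    for stage in stages:
        gpu_image = stage(gpu_image)
    return convert_to_ros_message(gpu_image, counter)
```

In this model, a pipeline of three stages performs three uploads and three downloads when every node uses the standard type, and one of each when the nodes pass the adapted type between them. The output is identical either way.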
TYPE NEGOTIATION
With a graph of ROS nodes using an adapted type, we can further benefit from optimizing the type used between nodes in the graph. Nodes supporting type negotiation (REP-2009) can share a list of the types they support as a publisher and as a subscriber, with a weight indicating their preference for each. ROS will review the publishers and subscribers participating in type negotiation, and optimize for preferences while maintaining compatibility with nodes that do not support type negotiation. Preferences are a way to reflect the performance or cost of a type; they should be tuned by the developer of the node, but can be overridden by the application developer.
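The idea of weighted preferences can be sketched in a few lines of plain Python. This is a simplification for illustration and not the algorithm specified in REP-2009: the `negotiate` function, the summed-weight scoring rule, and the `gpu_image` type name in the usage note are all assumptions. A node that does not support negotiation is modeled as one that lists only the standard type.

```python
from typing import Dict, List


def negotiate(publisher_prefs: Dict[str, float],
              subscriber_prefs: List[Dict[str, float]],
              fallback_type: str) -> str:
    """Pick one type for a topic from weighted preferences (toy model).

    Each dict maps a type name to a weight; a higher weight means the node
    prefers that type more. Only types supported by the publisher and by
    every subscriber are candidates.
    """
    common = set(publisher_prefs)
    for prefs in subscriber_prefs:
        common &= set(prefs)

    if not common:
        # No shared type was advertised, so stay on the standard type.
        return fallback_type

    def total_weight(type_name: str) -> float:
        # Assumed scoring rule: sum the weights of all participants.
        return publisher_prefs[type_name] + sum(
            prefs[type_name] for prefs in subscriber_prefs)

    # Sorting first makes ties resolve the same way on every run.
    return max(sorted(common), key=total_weight)
```

For example, with a publisher advertising `{"sensor_msgs/Image": 1.0, "gpu_image": 10.0}` and one subscriber advertising `{"sensor_msgs/Image": 1.0, "gpu_image": 8.0}`, this sketch selects `gpu_image`. Adding a second subscriber that lists only `sensor_msgs/Image` makes it select `sensor_msgs/Image`, which mirrors the compatibility requirement described above.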
(profile of Jetson AGX Xavier, with 89 ms on Foxy vs 32 ms on Humble with type adaptation for the same graph of nodes)
As performance improved with type adaptation and type negotiation, intra-process topic passing in ROS became a bottleneck. Nsight Systems was used to profile message passing and identify areas for improvement. Changes were made in rclcpp to reduce shared memory pointer copies and the checks performed for printing debug messages.
(ROS 2 node graph operating in sequence on 1080p CUDA buffers in Foxy vs the same node graph in Humble with type adaptation; results measured in Hz on JetPack 5.0 developer preview, Ubuntu 20.04 with Jetson AGX Orin and Xavier. The graph of nodes is designed to test framework performance by minimizing compute workload, bringing focus to overhead in the ROS Client Library)
In pixel processing, Jetson AGX Orin went from 0.55 gigapixels/sec in Foxy to 4 gigapixels/sec in Humble on this test.
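As a quick check on the figures quoted above, the improvement ratios can be computed directly. The helper names here are hypothetical; the inputs are the Xavier profile times and the Orin throughput numbers from this post.

```python
def latency_speedup(before_ms: float, after_ms: float) -> float:
    # Latency is lower-is-better, so the speedup is before / after.
    return before_ms / after_ms


def throughput_gain(before: float, after: float) -> float:
    # Throughput is higher-is-better, so the gain is after / before.
    return after / before


# Jetson AGX Xavier profile: 89 ms on Foxy vs 32 ms on Humble.
xavier_speedup = latency_speedup(89.0, 32.0)

# Jetson AGX Orin pixel processing: 0.55 vs 4 gigapixels/sec.
orin_gain = throughput_gain(0.55, 4.0)

print(f"Xavier graph latency: {xavier_speedup:.2f}x faster")
print(f"Orin pixel throughput: {orin_gain:.2f}x higher")
```

That works out to roughly 2.8x for the Xavier profile and roughly 7.3x for the Orin pixel throughput.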
Hardware-accelerated graphs of nodes using type adaptation and type negotiation improve performance, concurrency, and perf/watt. There are alternative approaches to implementing hardware acceleration, but they fork ROS, bypass ROS topics, or introduce incompatibilities with existing nodes. Type adaptation and type negotiation are native to ROS with Humble, compatible with existing nodes, and open to all types of hardware accelerators, including GPUs, DSPs, NN accelerators, and other HW blocks.
We are implementing type adaptation and type negotiation in NITROS (NVIDIA Isaac Transport for ROS) to optimize hardware acceleration for ROS 2 Humble; this will be released in Isaac ROS in late June.