As part of the work that we’ve been doing at the Hardware Acceleration WG to conduct testing and benchmarking on ROS 2 acceleration kernels, I’d like to raise a discussion around two topics:
a) adaptive ROS 2 Node computations and
b) how to constructively contribute to existing ROS 2 packages with acceleration kernels (either for FPGAs or GPUs). Both are interconnected, which is why I’m raising them in the same thread. While I’m at it, I’d also like to take the chance to invite you all to the upcoming HAWG meeting: 2021-11-24T14:00:00Z.
We proposed and contributed adaptive_component, a composable stateless container for Adaptive ROS 2 Node computations, which allows you to select where a ROS 2 Node runs (CPU, FPGA or GPU) at run-time.
Nodes using hardware acceleration are generally able to perform computations faster by relying on FPGAs or GPUs. Adaptive ROS 2 Nodes leverage hardware acceleration at run-time and aim to allow robotics engineers to select which computational resource a Node uses on-the-go, giving you finer-grained control over the resources your computational graphs use in the underlying hardware.
In a nutshell, this ROS 2 package provides a composable stateless container for Adaptive ROS 2 Node computations. It allows building Nodes that can select between FPGA, CPU or GPU at run-time. Technically, it’s a ROS 2 Node subclass programmed as a Component, including its own single-threaded executor to build adaptive computations. Adaptive ROS 2 Nodes can then be built easily and are able to perform computations on the CPU, the FPGA or the GPU. Adaptive behavior is controlled through the adaptive ROS 2 parameter, with the following values considered in the current implementation:

- 0: CPU
- 1: FPGA
- 2: GPU
adaptive_component is stateless by default. If you need your Adaptive Nodes to be stateful, you can inherit from composition::AdaptiveComponent and create your own stateful subclasses (e.g. see this example (ROS 2 component)).
```cpp
using NodeCPU = composition::DoubleVaddComponent;
using NodeFPGA = composition::DoubleVaddComponentFPGA;

rclcpp::NodeOptions options;

// Create an executor
rclcpp::executors::MultiThreadedExecutor exec;

// Create an adaptive ROS 2 Node using "components", the resulting
// Node is also programmed as a "component", retaining composability
auto adaptive_node = std::make_shared<composition::AdaptiveComponent>(
  "doublevadd_publisher_adaptive",
  options,
  // CPU
  std::make_shared<NodeCPU>("_doublevadd_publisher_adaptive_cpu", options),
  // FPGA
  std::make_shared<NodeFPGA>("_doublevadd_publisher_adaptive_fpga", options),
  // GPU
  nullptr);

exec.add_node(adaptive_node);  // fill up the executor
exec.spin();                   // spin the executor
```
Then, dynamically, one could switch from CPU to FPGA by setting the adaptive parameter of the /doublevadd_publisher_adaptive Node:
- To run in the CPU:

```shell
ros2 param set /doublevadd_publisher_adaptive adaptive 0
```
- To run in the FPGA:

```shell
ros2 param set /doublevadd_publisher_adaptive adaptive 1
```
The integration of hardware acceleration into ROS often requires rewriting parts of the Node computations to further exploit parallelism. These changes often conflict with CPU-centric architectures and, as a maintainer, you’re likely to care about not breaking CPU-centric implementations.
To integrate hardware acceleration consistently, composition::AdaptiveComponent allows extending ROS 2 CPU-centric Nodes with their accelerated computational counterparts while separating concerns at build-time. From a package-maintenance perspective, each Node (one per compute option) is written in a separate file and as a separate Component. These can live either within the same package, or in totally different (disconnected) ones.
adaptive_component takes care of putting them together at launch time, and no dependency on the package is required at build-time.
From an execution perspective, developers can easily create Adaptive ROS 2 Nodes and compose them together as desired at launch-time, with capabilities to adaptively switch between compute alternatives at run-time.
Currently there are no integrations with the ROS 2 launch system, but there’s a ticket tracking progress on that end, so if you have a few cycles and wish to contribute, that’s your ticket.
Examples of using adaptive_component:
- An Adaptive ROS 2 Node
- An Adaptive stateless ROS 2 Component (Node example using it)
- An Adaptive stateful ROS 2 Component (Node example using it)
As pointed out above, when integrating hardware acceleration we’ll often find ourselves rewriting Nodes to further exploit parallelism, fit other compute architectures and/or use existing acceleration libraries.
I recently prototyped an FPGA version of ResizeNode in a fork within the ros-acceleration organization to get a better grasp of the porting effort. This prototype still deserves some additional work, refactoring and CI alignment, but while early, it’s already been helpful to benchmark computations. We easily tested and benchmarked the adaptive_component above with this example (stay tuned to #9 if you’d like to reproduce it).
After having gone through this effort, a few questions were raised:
- Do maintainers want Pull Requests that complement CPU-centric Nodes with computational counterparts (e.g. for FPGAs or GPUs)? If so, do we have any community guidelines on how to do so?
- E.g. besides ensuring dependencies are found/met, what steps do I need to take to merge ResizeNodeFPGA upstream? (ping @JWhitleyWork @Vincent_Rabaud @jacob) Is it acceptable to create a new Node? Shall we set some conventions?
- One of the reasons why ROS has been so successful is because people contribute back. I wrote adaptive_component to simplify benchmarking and to separate compute concerns, so that Nodes (Components) can live in different packages. This allows for mixed public/private implementations as well. While reviewing these aspects, I found out that there already exists a renamed image_pipeline fork. Do we want to encourage forks which fragment each ROS 2 stack for hardware acceleration? Shouldn’t we instead advocate against this and ask people to get contributions upstream?
- I also found out that these unofficial forks ship with a license that differs from the original image_pipeline’s Apache 2.0 LICENSE. I found this interesting, so I ran a quick smoke test with JPlag to see how similar these implementations are, comparing the ResizeNode Component class across the official CPU implementation (resize.cpp), the FPGA one I wrote (resize_fpga.cpp) and the unofficial GPU fork’s (resize_node.cpp):
```
java -jar jplag/jplag/target/jplag-3.0.0-SNAPSHOT-jar-with-dependencies.jar -l cpp test
Initialized language C/C++ Scanner [basic markup]
JPlag initialized
Comparing resize.cpp-resize_fpga.cpp: 33.415234
Comparing resize.cpp-resize_node.cpp: 24.576271
Comparing resize_fpga.cpp-resize_node.cpp: 7.4941454
...
```
I would like to get input on these questions and thoughts, as well as (community-)alignment. Ideally, the answers should somehow be reflected in the REP-2008 PR, which aims to address exactly this by providing a reference architecture and conventions for hardware acceleration in ROS 2.