Hello everyone,
As part of the work we've been doing at the Hardware Acceleration WG to conduct testing and benchmarking of ROS 2 acceleration kernels, I'd like to raise a discussion around two topics: a) adaptive ROS 2 Node computations and b) how to constructively contribute acceleration kernels (either for FPGAs or GPUs) to existing ROS 2 packages. Both are interconnected, which is why I'm raising them in the same thread. While I'm at it, I'd also like to take the chance and invite you all to the upcoming HAWG meeting: 2021-11-24T14:00:00Z.
Adaptive ROS 2 Node computations
We proposed and contributed adaptive_component, a composable stateless container for Adaptive ROS 2 Node computations which allows you to select where to run a ROS 2 Node (between FPGA, CPU or GPU) at run-time.
Rationale
Nodes using hardware acceleration are generally able to perform computations faster by relying on FPGAs or GPUs, improving performance. Adaptive ROS 2 Nodes leverage hardware acceleration at run-time and aim to let robotics engineers select which computational resource a Node uses on the go, giving you finer-grained control over the resources your computational graphs use in the underlying hardware.
In a nutshell, this ROS 2 package provides a composable stateless container for Adaptive ROS 2 Node computations. It allows building Nodes that can select between FPGA, CPU or GPU at run-time. Technically, it's a ROS 2 Node subclass programmed as a Component, which includes its own single-threaded executor to build adaptive computations. Adaptive ROS 2 Nodes can then be built easily and are able to perform computations on the CPU, the FPGA or the GPU. Adaptive behavior is controlled through the `adaptive` ROS 2 parameter, with the following values considered in the current implementation:
- `0`: Hardware::CPU
- `1`: Hardware::FPGA
- `2`: Hardware::GPU
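For illustration, the parameter-to-substrate mapping could look like the following standalone sketch. Note this is a hypothetical mirror of the values listed above, not the package's actual internals (enum and function names are made up):

```cpp
#include <string>

// Hypothetical mirror of the compute-substrate choices; the real
// adaptive_component package maps the "adaptive" ROS 2 parameter
// to an equivalent enum internally.
enum class Hardware { CPU = 0, FPGA = 1, GPU = 2 };

// Map the integer value of the "adaptive" parameter to a substrate.
inline Hardware from_adaptive_param(int value) {
  switch (value) {
    case 1: return Hardware::FPGA;
    case 2: return Hardware::GPU;
    default: return Hardware::CPU;  // fall back to CPU on unknown values
  }
}

// Human-readable name, e.g. for logging which Node is spinning.
inline std::string to_string(Hardware hw) {
  switch (hw) {
    case Hardware::FPGA: return "FPGA";
    case Hardware::GPU: return "GPU";
    default: return "CPU";
  }
}
```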
`adaptive_component` is stateless by default. If you need your Adaptive Nodes to be stateful, you can inherit from composition::AdaptiveComponent and create your own stateful subclasses (e.g. see this example (ROS 2 component)).
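The stateful variant boils down to ordinary C++ inheritance. The following standalone sketch mimics the shape of such a subclass; class and member names here are invented for illustration (the real example linked above inherits from composition::AdaptiveComponent and uses rclcpp types):

```cpp
// Stand-in for the stateless container; the real base class is
// composition::AdaptiveComponent from the adaptive_component package.
class AdaptiveContainerSketch {
 public:
  virtual ~AdaptiveContainerSketch() = default;
  // Currently selected compute substrate (0: CPU, 1: FPGA, 2: GPU).
  int adaptive_value() const { return adaptive_value_; }
  void set_adaptive_value(int v) { adaptive_value_ = v; }

 private:
  int adaptive_value_ = 0;  // CPU by default
};

// A stateful subclass: it keeps state (here, a switch counter) that
// survives across substrate changes, which the stateless container
// deliberately does not do.
class StatefulAdaptiveSketch : public AdaptiveContainerSketch {
 public:
  void switch_to(int v) {
    set_adaptive_value(v);
    ++switches_;  // state preserved across CPU/FPGA/GPU switches
  }
  int switches() const { return switches_; }

 private:
  int switches_ = 0;
};
```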
How does it work?
#include "rclcpp/rclcpp.hpp"
#include "adaptive_component/adaptive_component.hpp"  // composition::AdaptiveComponent; include path may vary per installation

using NodeCPU = composition::DoubleVaddComponent;
using NodeFPGA = composition::DoubleVaddComponentFPGA;
rclcpp::NodeOptions options;
// Create an executor
rclcpp::executors::MultiThreadedExecutor exec;
// Create an adaptive ROS 2 Node using "components", the resulting
// Node is also programmed as a "component", retaining composability
auto adaptive_node = std::make_shared<composition::AdaptiveComponent>(
"doublevadd_publisher_adaptive",
options,
// CPU
std::make_shared<NodeCPU>("_doublevadd_publisher_adaptive_cpu", options),
// FPGA
std::make_shared<NodeFPGA>("_doublevadd_publisher_adaptive_fpga", options),
// GPU
nullptr);
exec.add_node(adaptive_node); // fill up the executor
exec.spin(); // spin the executor
Then, dynamically, one could switch from CPU to FPGA by setting the `adaptive` parameter of the /doublevadd_publisher_adaptive Node:
- To run on the CPU: ros2 param set /doublevadd_publisher_adaptive adaptive 0
- To run on the FPGA: ros2 param set /doublevadd_publisher_adaptive adaptive 1
Why should I care as a ROS package maintainer?
Integrating hardware acceleration into ROS often requires rewriting parts of a Node's computations to further exploit parallelism. These changes often conflict with CPU-centric architectures, and as a maintainer you likely care about not breaking the CPU-centric implementations.
To consistently integrate hardware acceleration, avoid unnecessary forks and discourage package fragmentation, composition::AdaptiveComponent allows you to extend CPU-centric ROS 2 Nodes with their accelerated computational counterparts, separating concerns at build-time. From a package-maintenance perspective, each Node (one per compute option) is written in a separate file as a separate Component. These can live either within the same package or in totally different (disconnected) ones. adaptive_component takes care of putting them together at launch time, and no build-time dependency between the packages is required.
From an execution perspective, developers can easily create Adaptive ROS 2 Nodes and compose them together as desired at launch-time, with capabilities to adaptively switch between compute alternatives at run-time.
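The "assemble at launch time, no build-time dependency" idea is essentially a plugin-style registry. The following standalone sketch illustrates the pattern with invented names (a tiny stand-in for rclcpp component loading, not the actual adaptive_component mechanism):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a ROS 2 Node.
struct NodeSketch {
  explicit NodeSketch(std::string n) : name(std::move(n)) {}
  std::string name;
};

// Packages register their Component factories by name, so a container
// can assemble an adaptive Node at launch time without depending on
// any of those packages at build time.
using Factory = std::function<std::shared_ptr<NodeSketch>()>;

std::map<std::string, Factory>& registry() {
  static std::map<std::string, Factory> r;
  return r;
}

// Look up a Component by name at launch time; nullptr if absent
// (mirroring how the GPU slot was left as nullptr above).
std::shared_ptr<NodeSketch> load_component(const std::string& name) {
  auto it = registry().find(name);
  return it == registry().end() ? nullptr : it->second();
}
```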
Currently there are no integrations with the ROS 2 launch system, but there's a ticket tracking progress on that end, so if you have a few cycles and wish to contribute, that's your ticket.
Some examples
Examples of using adaptive_component:
- An Adaptive ROS 2 Node
- An Adaptive stateless ROS 2 Component (Node example using it)
- An Adaptive stateful ROS 2 Component (Node example using it)
Hardware acceleration contributions - concerns about fragmentation, relicensing, etc.
As pointed out above, when integrating hardware acceleration we'll often find ourselves rewriting Nodes to further exploit parallelism, fit other compute architectures and/or use existing acceleration libraries.
I recently prototyped an FPGA version of image_pipeline's ResizeNode in a fork within the ros-acceleration organization to get a better grasp of the porting effort. This prototype still deserves some additional work, refactoring and CI alignment, but while early, it's already been helpful to benchmark computations. We easily tested and benchmarked ResizeNode (and ResizeNodeFPGA) using adaptive_component above with this example (stay tuned to #9 if you'd like to reproduce it).
After having gone through this effort, a few questions were raised:
- Do maintainers want Pull Requests that rewrite Nodes into accelerated computational counterparts (e.g. for FPGAs or GPUs)? If so, do we have any community guidelines on how to do so?
- E.g. besides ensuring dependencies are found/met, what steps do I need to take to merge ResizeNodeFPGA upstream? (ping @JWhitleyWork @Vincent_Rabaud @jacob) Is it acceptable to create a new Node? Shall we set some conventions?
- One of the reasons ROS has been so successful is that people contribute back. I wrote adaptive_component to simplify benchmarking and to separate compute concerns, so that Nodes (Components) can live in different packages. This also allows for mixed public/private implementations. While reviewing these aspects, I found out that a renamed image_pipeline fork already exists. Do we want to encourage forks which fragment each ROS 2 stack for hardware acceleration? Shouldn't we instead advocate against this and ask people to get their contributions upstream?
- I also found out that these unofficial forks ship with a different license, which seems very different from the original image_pipeline's LICENSE (Apache 2.0). I found this interesting, so I did a quick smoke test with JPlag to see how different these implementations are, comparing the ResizeNode Component class across the official CPU implementation (resize.cpp), the FPGA one I wrote (resize_fpga.cpp) and the GPU unofficial fork I found (resize_node.cpp):
java -jar jplag/jplag/target/jplag-3.0.0-SNAPSHOT-jar-with-dependencies.jar -l cpp test
Initialized language C/C++ Scanner [basic markup]
JPlag initialized
Comparing resize.cpp-resize_fpga.cpp: 33.415234
Comparing resize.cpp-resize_node.cpp: 24.576271
Comparing resize_fpga.cpp-resize_node.cpp: 7.4941454
...
I would like to get input on these questions and thoughts, as well as community alignment. Ideally, the answers should somehow be reflected in the REP-2008 PR, which aims to address exactly this by providing a reference architecture and conventions for hardware acceleration in ROS 2.