Adaptive ROS 2 Node computations and hardware acceleration contributions

Hello everyone,

As part of the work that we’ve been doing at the Hardware Acceleration WG to conduct testing and benchmarking on ROS 2 acceleration kernels, I’d like to raise a discussion around two topics: a) adaptive ROS 2 Node computations and b) how to constructively contribute acceleration kernels (either for FPGAs or GPUs) to existing ROS 2 packages. Both are interconnected, which is why I’m raising them in the same thread. While I’m at it, I’d also like to take the chance to invite you all to the upcoming HAWG meeting: 2021-11-24T14:00:00Z.

Adaptive ROS 2 Node computations

We proposed and contributed adaptive_component, a composable stateless container for Adaptive ROS 2 Node computations, which allows you to select where to run a ROS 2 Node (FPGA, CPU or GPU) at run-time.

Rationale

Nodes using hardware acceleration are generally able to perform computations faster by relying on FPGAs or GPUs. Adaptive ROS 2 Nodes leverage hardware acceleration at run-time and aim to let robotics engineers select which computational resource a Node uses on the go, giving finer-grained control over the resources computational graphs consume in the underlying hardware.

In a nutshell, this ROS 2 package provides a composable stateless container for Adaptive ROS 2 Node computations. It allows building Nodes that can select between FPGA, CPU or GPU at run-time. Technically, it’s a ROS 2 Node subclass programmed as a Component that includes its own single-threaded executor to build adaptive computations. Adaptive ROS 2 Nodes can then be built easily and are able to perform computations on the CPU, the FPGA or the GPU. Adaptive behavior is controlled through the adaptive ROS 2 parameter, with the following values considered in the current implementation:

  • 0: Hardware::CPU
  • 1: Hardware::FPGA
  • 2: Hardware::GPU
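
For illustration only, the mapping above corresponds to an enumeration roughly along these lines (a sketch on my side; the actual declaration lives inside adaptive_component and may differ):

// Sketch only: the compute-substrate enumeration implied by the parameter values above.
enum class Hardware : int { CPU = 0, FPGA = 1, GPU = 2 };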

adaptive_component is stateless by default; if you need your Adaptive Nodes to be stateful, you can inherit from composition::AdaptiveComponent and create your own stateful subclasses (e.g. see this example (ROS 2 component)).
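
A hedged sketch of that subclassing route (the include path and class name below are assumptions on my side; check the package for the actual API):

// Hypothetical include path; the real header in adaptive_component may differ.
#include "adaptive_component/adaptive_component.hpp"

// Sketch: a stateful Adaptive Node that inherits the container and adds its own
// state, which survives switches between compute substrates.
class MyStatefulAdaptive : public composition::AdaptiveComponent
{
public:
  // Reuse whatever constructors the base class provides (e.g. name, options and
  // the CPU/FPGA/GPU sub-Nodes, as in the snippet further down).
  using composition::AdaptiveComponent::AdaptiveComponent;

private:
  int iterations_{0};  // example state shared across substrate switches
};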

How does it work?


using NodeCPU = composition::DoubleVaddComponent;
using NodeFPGA = composition::DoubleVaddComponentFPGA;

rclcpp::NodeOptions options;

// Create an executor
rclcpp::executors::MultiThreadedExecutor exec;

// Create an adaptive ROS 2 Node using "components", the resulting
// Node is also programmed as a "component", retaining composability
auto adaptive_node = std::make_shared<composition::AdaptiveComponent>(
      "doublevadd_publisher_adaptive",        
      options,                                
                                              // CPU
      std::make_shared<NodeCPU>("_doublevadd_publisher_adaptive_cpu", options),
                                              // FPGA
      std::make_shared<NodeFPGA>("_doublevadd_publisher_adaptive_fpga", options),
                                              // GPU
      nullptr);

exec.add_node(adaptive_node);  // fill up the executor
exec.spin();  // spin the executor

Then, dynamically, one could switch from CPU to FPGA by setting the adaptive parameter in the /doublevadd_publisher_adaptive Node:

  • To run in the CPU: ros2 param set /doublevadd_publisher_adaptive adaptive 0
  • To run in the FPGA: ros2 param set /doublevadd_publisher_adaptive adaptive 1
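
The usual parameter tooling can also be used to check which substrate is currently selected:

  • To read the current value: ros2 param get /doublevadd_publisher_adaptive adaptive
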
Why should I care as a ROS package maintainer?

The integration of hardware acceleration into ROS often requires rewriting parts of a Node’s computations to further exploit parallelism. These changes often conflict with CPU-centric architectures and, as a maintainer, you’re likely to care about not breaking the CPU-centric implementations.

To consistently integrate hardware acceleration, avoid unnecessary forks and discourage package fragmentation, composition::AdaptiveComponent makes it possible to extend CPU-centric ROS 2 Nodes with their computational counterparts while separating concerns at build-time. From a package-maintenance perspective, each Node (one per computation option) is written in a separate file as a separate Component. These can live either within the same package, or in totally different (disconnected) ones. adaptive_component takes care of putting them together at launch time, and no dependency on the package is required at build-time.

From an execution perspective, developers can easily create Adaptive ROS 2 Nodes and compose them together as desired at launch-time, with capabilities to adaptively switch between compute alternatives at run-time.

Currently there are no integrations with the ROS 2 launch system, but there’s a ticket tracking progress on that end, so if you have a few cycles and wish to contribute, that’s your ticket :slight_smile: .

Some examples

Examples of using adaptive_component:

Hardware acceleration contributions - concerns on fragmentation, relicensing, etc.

As pointed out above, when integrating hardware acceleration we’ll often find ourselves rewriting Nodes to further exploit parallelism, fit other compute architectures and/or use existing acceleration libraries.

I recently prototyped an FPGA version of image_pipeline's ResizeNode in a fork within the ros-acceleration organization to get a better grasp of the porting effort. This prototype still needs some additional work, refactoring and CI alignment, but while early, it’s already been helpful to benchmark computations. We easily tested and benchmarked ResizeNode (and ResizeNodeFPGA) using adaptive_component above with this example (stay tuned to #9 if you’d like to reproduce it).

After having gone through this effort, a few questions were raised:

  1. Do maintainers want Pull Requests that rewrite CPU Nodes into computational counterparts for other substrates (e.g. FPGAs or GPUs)? If so, do we have any community guidelines on how to do so?
  2. E.g. besides ensuring dependencies are found/met, what steps do I need to take to merge ResizeNodeFPGA upstream? (ping @JWhitleyWork @Vincent_Rabaud @jacob) Is it acceptable to create a new Node? Shall we set some conventions?
  3. One of the reasons why ROS has been so successful is that people contribute back. I wrote adaptive_component to simplify benchmarking and to separate compute concerns, so that Nodes (Components) can live in different packages. This also allows for mixed public/private implementations. While reviewing these aspects, I found out that there already exists a renamed image_pipeline fork. Do we want to encourage forks which fragment each ROS 2 stack for hardware acceleration? Shouldn’t we instead advocate against this and ask people to contribute upstream?
  4. I also found out that these unofficial forks ship with a different license from the original image_pipeline’s LICENSE, which is Apache 2.0. I found this interesting, so I ran a quick smoke test with JPlag to see how different these implementations are, using the ResizeNode Component class across the official CPU implementation (resize.cpp), the FPGA one I wrote (resize_fpga.cpp) and the GPU unofficial fork I found (resize_node.cpp):
java -jar jplag/jplag/target/jplag-3.0.0-SNAPSHOT-jar-with-dependencies.jar -l cpp test
Initialized language C/C++ Scanner [basic markup]
JPlag initialized
Comparing resize.cpp-resize_fpga.cpp: 33.415234
Comparing resize.cpp-resize_node.cpp: 24.576271
Comparing resize_fpga.cpp-resize_node.cpp: 7.4941454
...

I would like to get input on these questions and thoughts, as well as (community) alignment. Ideally, the answers should somehow be reflected in the REP-2008 PR, which aims to address exactly this by providing a reference architecture and conventions for hardware acceleration in ROS 2.


I tried looking in the code to understand the core idea, but got a little lost. Could you somehow summarize the core idea of the adaptive node? Is it so that the adaptive parameter instructs one sub-node to unsubscribe from everything and another one to subscribe? Or is it doing something else? Does it somehow interact with the lifetime events of the sub-nodes?

Other than that, adaptive is a very bad choice of parameter name IMO. It tells nothing about what the parameter is doing (well, adapting…? :slight_smile: ).

Thanks for taking a look @peci1.

The core idea of an adaptive Node is that it should allow moving computations across compute substrates in an adaptive and (by default) stateless manner (e.g. going from the CPU to the GPU for optimized performance, or from the GPU to an FPGA for more deterministic behavior, etc.).

The approach followed in this first implementation is to create the AdaptiveComponent class as a container of Components (one for each compute substrate) with a single-threaded executor as an attribute, which is used to provide compute cycles to the desired substrate (and only that one). Then, through ROS parameters, one can change which sub-Component gets added to the executor.
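
To make it more concrete, here’s a heavily simplified sketch of that idea (this is not the actual adaptive_component implementation, and names like AdaptiveSketch and its members are illustrative only):

#include <array>
#include <chrono>
#include <memory>
#include <thread>

#include "rclcpp/rclcpp.hpp"

// Simplified sketch: a container Node owning a single-threaded executor, which
// periodically checks the "adaptive" parameter to decide which sub-Node the
// executor should spin (and therefore which compute substrate gets cycles).
class AdaptiveSketch : public rclcpp::Node
{
public:
  AdaptiveSketch(
    rclcpp::Node::SharedPtr cpu, rclcpp::Node::SharedPtr fpga, rclcpp::Node::SharedPtr gpu)
  : Node("adaptive_sketch"), nodes_{cpu, fpga, gpu}
  {
    declare_parameter("adaptive", 0);  // 0: CPU, 1: FPGA, 2: GPU
    // Poll the parameter and re-assign which sub-Node receives compute cycles.
    timer_ = create_wall_timer(std::chrono::seconds(1), [this]() {
      auto desired = get_parameter("adaptive").as_int();
      if (desired == selected_) {
        return;
      }
      if (selected_ >= 0 && nodes_[selected_]) {
        executor_.remove_node(nodes_[selected_]);  // stop spinning the old sub-Node
      }
      if (desired >= 0 && desired < 3 && nodes_[desired]) {
        executor_.add_node(nodes_[desired]);       // only the selected one is spun
      }
      selected_ = desired;
    });
    // The internal executor runs in its own thread, independently from the outer one.
    spin_thread_ = std::thread([this]() { executor_.spin(); });
  }

  ~AdaptiveSketch() override
  {
    executor_.cancel();
    if (spin_thread_.joinable()) {
      spin_thread_.join();
    }
  }

private:
  rclcpp::executors::SingleThreadedExecutor executor_;
  std::array<rclcpp::Node::SharedPtr, 3> nodes_;
  int64_t selected_{-1};
  std::thread spin_thread_;
  rclcpp::TimerBase::SharedPtr timer_;
};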

For now there are no ROS Lifecycle capabilities implemented in AdaptiveComponent (sub-components may or may not have them), but I’d like to explore that topic in the future (see my comment here).

Thanks for the input, do you have a better name suggestion for the parameter?

Thanks, now it’s clearer to me. However, I have one more question about the adaptive node being stateless. GPUs, for example, require initialization (memory reservations, etc.), so when will these be done? At the meta-node startup, or at each component activation?

Regarding the parameter name, what about compute_device or similar?

Good questions. Here are my views, but this is far from final and constructive criticism is needed:

  • Initialization should happen in each Node/Component as appropriate. AdaptiveComponent has been written as a container on top of a Node’s behavior and shouldn’t conflict with its sub-Components except for what concerns “giving them compute cycles in the appropriate substrate”, sequencing the corresponding Components as needed. Memory allocation/deallocation is a Node’s responsibility and part of the Node’s initialization routine. AdaptiveComponent doesn’t need to be aware of that (nor does the Node need to be aware that it’s been placed inside an adaptive Node). GPU or FPGA initializations should happen in the corresponding <node>_gpu or <node>_fpga.

  • The state needs to be set/reset at every node startup/switch (which means also at each component activation). By default, adaptive_component is stateless, but you can inherit from the AdaptiveComponent class and create your own stateful subclass. It’s just hard to generalize what a Node’s state is (each node’s state is different), so I defaulted to stateless.

Both of these concepts are demonstrated in this example which captures the “state” of a simple publisher by inheriting from AdaptiveComponent and wherein each node deals with its corresponding initializations.

I’ll take that into account, thanks.
