REP-2008 RFC - ROS 2 Hardware Acceleration Architecture and Conventions

As some of you may know, we at Xilinx have been busy contributing to ROS 2 and driving the Hardware Acceleration Working Group. After a few months of prototyping, developing acceleration kernels, and researching how best to integrate hardware acceleration, we decided to try and bring our learnings to the community. To do so we considered two important aspects:

  • a) we want this to remain technology-agnostic (i.e. avoiding vendor lock-in) so that roboticists can use whichever hardware accelerator they prefer, and silicon vendors wishing to enter the ROS world have “a way” to do so
  • b) we want to reduce (as much as possible, if not completely) the impact that introducing hardware acceleration has on package maintainers, so that upgrading existing ROS 2 packages to support hardware acceleration takes minimal effort. This last bit is very important. After all, we don’t want to encourage the unnecessary creation of ROS 2 package forks (or ROS 2 forks). There are already a few companies creating forks out there for commercial purposes, and besides the fragmentation concerns this brings, the reality is that, arguably, contributions don’t always get back as promised.

With this intent, I’m happy to share that we’re submitting a convention for hardware acceleration in the form of a ROS Enhancement Proposal (REP). You can find the current draft at this PR. Moreover, a reference implementation of this REP has finally received approval from our legal group, and will be shared in the ROS 2 Hardware Acceleration WG GitHub organization in the coming weeks.

We hope this REP will receive community attention and feedback, so I’m calling for an RFC. I’m sharing the Abstract and the Motivation sections below to give an intuition. Please refer to the full draft in the PR above and leave your constructive feedback:

Abstract

This REP describes the architectural pillars and conventions required to introduce hardware acceleration in ROS 2 in a scalable and technology-agnostic manner. A set of categories meant to classify supported hardware acceleration solutions is provided. Inclusion in a category is based on the fulfillment of the hardware acceleration capabilities.

Motivation

With the decline of Moore’s Law, hardware acceleration has proven itself as the answer for achieving higher performance gains in robotics. By creating specialized compute architectures that rely on specific hardware (e.g. FPGAs or GPUs), hardware acceleration empowers faster robots, with reduced computation times (real fast, as opposed to real-time), lower power consumption and more deterministic behaviours. The core idea is that instead of following the traditional control-driven approach for software development in robotics, a mixed control- and data-driven one allows designing custom compute architectures that further exploit parallelism. To do so, we need to integrate in the ROS 2 ecosystem external frameworks, tools and libraries that facilitate creating parallel compute architectures.

The purpose of this REP is to provide standard guidelines for how to use hardware acceleration in combination with ROS 2. These guidelines are realized in the form of a series of ROS 2 packages that integrate these external resources and provide a ROS 2-centric open architecture for hardware acceleration.

The architecture proposed extends the ROS 2 build system (ament) and the ROS 2 build tools (colcon), and adds a new firmware pillar to simplify the production and deployment of acceleration kernels. The architecture is agnostic to the computing target (i.e. considers support for edge, workstation, data center or cloud targets), technology-agnostic (considers initial support for FPGAs and GPUs), application-agnostic and modular, which enhances portability to new hardware solutions and other silicon vendors. The core components of the architecture are disclosed under an Apache 2.0 license, available and maintained at the ROS 2 Hardware Acceleration Working Group GitHub organization.

Value for stakeholders:

  • Package maintainers can use these guidelines to integrate hardware acceleration capabilities in their ROS 2 packages.
  • Consumers can use the guidelines in the REP, as well as the corresponding category of each hardware solution, to set expectations on the hardware acceleration capabilities that could be obtained from each vendor’s hardware solution.
  • Silicon vendors and solution manufacturers can use these guidelines to connect their firmware and technologies to the ROS 2 ecosystem, obtaining direct support for hardware acceleration in all ROS 2 packages that support it.

The outcome of this REP should be that maintainers who want to leverage hardware acceleration in their packages can do so with consistent guidelines and with support across multiple technologies (FPGAs and GPUs) by following the conventions set. This way, maintainers will be able to create ROS 2 packages with support for hardware acceleration that can run across hardware acceleration technologies, including FPGAs and GPUs.

In turn, the documentation of categories and hardware acceleration capabilities will improve. The guidelines in here provide a ROS 2-centric open architecture for hardware acceleration, which silicon vendors can decide to adopt when engaging with the ROS 2 community.

Read more at the REP’s PR.

13 Likes

@vmayoral great stuff, great initiative. I did not have time to properly read through your PR yet, but would you mind commenting on where this proposal feeds into or overlaps with: Unified ML Inference in Autoware - A proposal. More specifically, with the Apache TVM project.

2 Likes

@Dejan_Pangercic thanks for showing so much enthusiasm for this. Let me give you a bit of context before answering your question. Hopefully you’re already aware of all of this, and it’s just a reminder, but I’ve been getting similar questions from others in the community, so it should be helpful for those following this thread:


Hardware acceleration in the ROS 2 context refers to the process by which a ROS abstraction (typically a Node) offloads certain computing tasks onto specialized hardware components within the hardware system, enabling greater performance (faster), more determinism or security/safety (through isolation) capabilities, among others.

I’ve got this quite a few times already and I’ll be speaking about it at Adapt 2021 (and taking questions).

(hardware acceleration) that’s not something roboticists think about too much typically and you might want to start from a more basic explanation, since not everyone (or most people, I suspect) is going to have any idea what this is really about. For instance, if I use a GPU, typically I’m parallelizing some process or utilizing AI/ML models that can’t be run on a CPU. Oftentimes those libraries are tied into specific GPU vendors / implementations so it’s not like I’m going to go through the effort, as a company, to rewrite really low-level parts of TensorFlow or Caffe. In that context, I’m not really sure what this (REP) is trying to accomplish. When I am writing GPU code, I’m using OpenCL so it’s portable across Intel / AMD / Nvidia hardware and that generally solves the cross-vendor issue.

There’s merit in this comment. Particularly the bit about “When I am writing GPU code, I’m using OpenCL so it’s portable across Intel / AMD / Nvidia hardware”. This is the expectation (it was mine when I started working on this as well). Reality, though, is different.

OpenCL (unfortunately) has had only partial success. Most silicon companies (including the one I work for today, Xilinx) are only pushing it for interoperability at the CPU level (that is, how acceleration kernels interoperate with the CPU), but almost nobody writes kernels in OpenCL these days, at least from what I’ve seen. You’d use HIP for AMD (at a high level), CUDA with Nvidia, or HLS for C++ with Xilinx (or even Verilog if you like hardcore hardware like me). It’s important to understand that each silicon vendor has its own thing. There are business reasons behind it which need to be understood, but that should not overcomplicate the use of popular frameworks. Silicon vendors should be the ones making the effort (after all, they’re the ones deciding to waste the chance with OpenCL), not us.

The reason why you can use TensorFlow on all these compute substrates (FPGAs, GPUs, etc.) without caring about low-level details (such as HLS, HIP or CUDA, among others) is exactly because those projects (e.g. TensorFlow) became so big (so relevant) that each silicon vendor was forced to build its own accelerators (again, each in its corresponding form) for these projects if it wanted to keep up.

What I’m attempting to do with this REP, in a nutshell, is exactly (really) this. Start a community effort that elevates ROS 2 to the point of telling these big (silicon) companies to wake up, and start committing engineering resources to build accelerators for the ROS stacks in an organized, structured and ROS 2 package maintainer-centric manner.

Note the section in the REP about value for stakeholders.

Note also that this is in contrast to allowing each silicon vendor to create its own package forks for each ROS 2 package subject to hardware acceleration (which I’ve seen happening for months now). Some are even doing this for ROS 2 core packages, reinventing the wheel. This is horrible from a maintainer’s perspective, and even worse from a user’s:

  • For a maintainer, if she wants to support different hardware acceleration solutions, (e.g. AMD and Nvidia), she will find herself struggling to coordinate efforts between silicon vendors who likely won’t pay much attention, because their business drivers are just different (not community oriented).
  • For a user, he will struggle horribly while changing between acceleration solutions (e.g. moving from the Jetson to the Kria SOMs), because it’s not just the kernels themselves that change, but the whole underlying infrastructure: the libraries required to build those accelerators, the way BSPs are generated, the cross-compilation process. Lots of fun if you like embedded like me, but horrible otherwise.

I’m speaking about this in a recent paper we disclosed.

I’m in touch with AMD and Nvidia exactly for this. We at Xilinx, with this REP proposal, are trying to turn things around now, while there’s still time, so that there’s a general outer set of abstractions and primitives that maintainers can use when including hardware acceleration in their packages in a technology-agnostic manner. Then, it’s up to each silicon vendor to implement the low-level kernels that favour the ROS community’s interest for each ROS package/stack. This is explained in the architecture pillars of the REP.

That’s why this REP proposes a reference architecture for hardware acceleration and the corresponding conventions that will empower us all to use ROS with hardware acceleration in a ROS 2-centric manner (as opposed to in an HLS-centric manner, or HIP-centric way, etc). This REP is a community effort to provide these vendors “a way” for interoperability. A consensus of “a way in”. Otherwise, we won’t be able to run ROS in the same way you run TensorFlow, across hardware acceleration solutions.

And I definitely want to encourage this. That the hardware acceleration API we all use is based on the ROS official packages, not whatever-fork-company-A has created.

In the pillars, it looks like you’re just proposing some new CMake macros and colcon verbs. Why does this need to be a REP? Can’t this go in your own repo as a simple extension?

If we want package maintainers to be able to include support for hardware acceleration in a technology-agnostic manner, the cleanest and most scalable way (unless we start forking everything, which I’d argue strongly against) is to have a series of CMake macros that are included in the ROS 2 Common packages and that hide the complexity of each one of those underlying implementations. These CMake macros should operate as follows: if you have hardware acceleration, build the kernel; if you don’t, just skip it. It’s then up to each silicon vendor to match those CMake macros with their own low-level libraries and tools (we at Xilinx are providing a reference implementation for this, so that others can imitate it). This way, we’re making it possible to include hardware acceleration in existing ROS packages (by using these macros), without having to fork every single ROS 2 package out there that’s computationally expensive.
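
The "build it if acceleration is available, otherwise skip it" behaviour can be sketched in plain CMake. This is only an illustration under stated assumptions: the function name `hypothetical_acceleration_kernel` and the variable `HYPOTHETICAL_ACCELERATION_AVAILABLE` are made up for this example and are not taken from REP-2008 or from any existing ament extension.

```cmake
# Hypothetical sketch; the names here are illustrative, not part of REP-2008.
cmake_minimum_required(VERSION 3.10)

# Assumption: a vendor's extension (or the user) sets this variable when an
# acceleration toolchain and firmware are present. Default to "not available".
if(NOT DEFINED HYPOTHETICAL_ACCELERATION_AVAILABLE)
  set(HYPOTHETICAL_ACCELERATION_AVAILABLE OFF)
endif()

function(hypothetical_acceleration_kernel)
  # Parse keyword arguments into ARG_NAME, ARG_FILE and ARG_INCLUDE.
  cmake_parse_arguments(ARG "" "NAME;FILE" "INCLUDE" ${ARGN})

  if(NOT HYPOTHETICAL_ACCELERATION_AVAILABLE)
    # No accelerator support: skip quietly so the package still builds for CPU.
    message(STATUS "No acceleration support, skipping kernel '${ARG_NAME}'")
    return()
  endif()

  # A vendor backend would invoke its own kernel compiler at this point.
  message(STATUS "Would build kernel '${ARG_NAME}' from ${ARG_FILE}")
endfunction()

hypothetical_acceleration_kernel(
  NAME vadd
  FILE src/vadd.cpp
  INCLUDE include
)
```

The point of the design is that the maintainer's CMakeLists.txt stays identical whether or not an accelerator is present; only the variable set by the environment changes what the call does.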

I can’t stress enough how important it is to get this “right” now. If we don’t, then, the already existing ROS fragmentation (across distros) is soon to get a new dimension with hardware acceleration.

My very biased opinion is that we need a REP that somehow defines a series of CMake abstractions and conventions that:

  • a) users can use without caring for the underlying acceleration technology and buying/using what their budget allows for,
  • b) maintainers will use for including hardware acceleration support and
  • c) companies like mine (Xilinx), AMD or Nvidia (among others) should comply with if they wish to align with what benefits ROS users and package maintainers the most (which arguably should be IMHO the community’s position).

How can I see this live, with real demos and examples?

As I said above:

I’m working with our team to disclose everything in the coming weeks. I’ll ping back in here. But very soon. The initial implementation is done. Just big company’s problems which we’re addressing.

This assumes there are some capabilities like TensorFlow within ROS that need GPU/FPGA support – what are those, specifically?

Lots. E.g. one of the basic examples that I’ve been looking at lately is image_pipeline. Most vision-based computational graphs rely on it, and it appears many transformations can be accelerated heavily by using specialized hardware libraries/kernels that already exist. I’m currently looking at this and in the case of Xilinx, we have something called the Vitis Vision Library which implements primitives for rectify, resize or conversions (among others). Nvidia has a similar thing (VPI), and so does AMD (MIVisionX). Again, each one to their own tune.

I’m working with both AMD and Nvidia to try and get common ground. Launching this REP was my attempt to get us all to push in one direction.


Sure, and sorry for the long discourse above, but that’s what this forum is for, right :crazy_face: ?

In a nutshell, as far as my understanding (from a quick reading) goes, there’s no overlap. A complement possibly, if the AI demand suddenly grows in ROS :joy:. Apache TVM is ML-specific, and neither does the HAWG target this specifically, nor is all hardware acceleration ML-related. In fact, this question is probably best brought to the AI Edge WG at the TSC level once @joespeed has time for it (discussion pointer).

From a general point of view, Apache TVM seems somewhat equivalent to the example I was giving above about vision libraries and the image_pipeline ROS 2 package: there’s Xilinx’s Vitis Vision Library, AMD’s MIVisionX and Nvidia’s VPI, which can drive you a bit nuts. Having a project simplifying interoperability for ML sounds like a nice intent, but it’s not the ROS community’s responsibility to get down to the specifics of each ML library. That’s something silicon vendors (and in this case I guess the Apache foundation?) should arrange themselves. From a ROS perspective, we should simply tell them how we want their accelerators to be built and integrated into ROS 2 packages (and that’s what this REP is for).

A group that I led in the past somewhat initiated one such ML discussion in this community, and I still get questions today with overblown expectations of what AI can do in robotics. It’s surprising how little AI folks understand about robotics sometimes. Nevertheless, for this context, if I were to compare ML frameworks with ROS, there’s a big difference. For the former, most silicon vendors already have their own implementations (Caffe, TensorFlow, etc.). For the latter, for ROS, this is getting started now.

While TVM is an awesome initiative, it’s going to be (very) hard to change silicon vendors’ mindset so that they switch to supporting TVM, instead of their own self-implemented accelerators for Caffe or TensorFlow, among others. Instead, with ROS, we now have the chance to set the blueprint, and ask silicon vendors interested in growing their customer base in robotics to comply.

As I was saying, we’re still in time.

4 Likes

Our team published a white paper further motivating the importance of this topic: Adaptive Computing in Robotics Leveraging ROS 2 to Enable Software-Defined Hardware for FPGAs.

There’s an overview of the different compute substrates available to roboticists and how each one excels at different aspects. This should be a useful introduction for anyone considering how different compute architectures can help improve her/his computational graphs. Moreover, the extensions to the build system (ament) covered in the REP are further explained in the article. Feedback and discussion are welcome.

1 Like

Thinking from a ROS developer’s point of view, I don’t think this goes far enough. I can see the following benefits:

  • Accelerator SDKs are managed using ROS packages so that they are automagically pulled in (this is not actually explicitly mentioned in the REP)
  • Automatically build firmware artefacts (somewhat mentioned in the REP but could be more spelt out)
  • Automatically install firmware artefacts (not mentioned in REP, becomes very complicated in practice). For colcon to support firmware flashing in a generic way, this becomes a very large problem.

But as a developer, I still have to write my code specifically for an accelerator:

Generally, off-the-shelf software cannot be efficiently converted into accelerated hardware on an FPGA. Even if the software program can be automatically converted (or synthesized) into hardware, achieving acceptable quality of results (QoR) will require additional work such as rewriting elements of the software to help Vitis HLS achieve the desired performance goals. To help, you need to understand the best practices for writing good software for execution on the FPGA as discussed in Design Principles for Software Programmers in the Vitis High-Level Synthesis User Guide (UG1399).

This is also true for CUDA.

If I write a ROS 2 library for NDT, for example, I still have to write kernels for Xilinx, Nvidia, AMD and Intel, which I will most certainly not have the expertise or time to do. As you mentioned, roboticists don’t really think about hardware acceleration all that much. Hence your REP does not help them that much.

If I write code for my own specific robot, then the convenience this REP provides diminishes because I am ultimately just targeting one accelerator.

IMHO, we should go one step further, and unify the programming language for writing acceleratable kernels, through a generic DSL such as HALIDE https://halide-lang.org/. This way:

  • I write kernel only once.
  • The kernel can be compiled to run accelerated on every architecture CPU/GPU/FPGA.
  • Everything architecture-specific can be hidden from me.

This REP is a step in the right direction but I propose that it does not go far enough in terms of abstraction.

I assume Xilinx will not be very happy because Halide does not emit to FPGA at the moment. But instead of still forcing robotics people to write FPGA-specific code, Xilinx can provide a backend to Halide. At the end of the day, roboticists do not know how to write good FPGA code; Xilinx does. So we should enable everyone to focus on what they are best at.

Then there is still a lot of work to be done to enable Halide with ament, and all the magic talked about in this REP is still valid.

2 Likes

Glad to read the initiative is well received.

I think there are some good ideas in here, but some of your comments sound to me beyond the scope of what (in general) REPs contain and very technology-specific (e.g. how SDKs are managed or artefacts built), which is the opposite of what I’m trying to achieve by fighting so much to keep it technology-agnostic. Xilinx is providing a reference implementation that anyone’s welcome to imitate, but we should not impose this on other vendors.

Regarding the difficulties with colcon you point out, see the reference implementation of the firmware artifacts mentioned above (you’ll need to download the release to inspect the source code :slightly_frowning_face: , GitHub doesn’t allow me to push it in the repo due to size limitations). That, in combination with the ament_vitis ament extensions, is the magic you mention. Note that both the firmware and the ament extensions are purposely built so that they can be replaced with other technologies.

See our first meeting recording for more context on the architecture and goals proposed in the REP.

Here we disagree, but I’d love to hear your thoughts back:

  • First, if you’re a maintainer, you do not have to write kernels. You have the possibility to do so if you wish to dive into the hardware world, but that’ll rarely happen. In most cases, hardware experts will be the ones writing those kernels (especially employees of silicon vendor companies). This is the most likely scenario, and since it’s already happening (I’m doing it :wink: , and so are a few others), what makes the most sense is to coordinate a common abstraction layer that allows each one of these vendors to interoperate from a higher-level perspective.
    This is what was done successfully with other layers in the past. E.g. with DDS, nobody forced the vendors (i.e. DDS ones) to change. Similarly, it’s unreasonable to expect silicon vendors to change their acceleration languages (as I argued above, there are good reasons why HIP, HLS and CUDA are there, and they will not disappear), since you’ll obtain the best performance with them on each one of their technologies. Even if you’re a company just producing IP, you’ll still benefit from having such a reference architecture and conventions. After all, you’d like to sell your IP, possibly in various formats, to maximize revenue, which this REP favours.

  • Second, the proposed architecture gives you, as a ROS 2 package maintainer, the possibility to leverage kernels that others may have written, using a common and consistent syntax. To me, this refutes your claim that this REP doesn’t help.

  • Third, as a user (a company building a particular robot) that wants to leverage hardware acceleration, typically, she’ll look into different axes, including ease of use, integration with ROS 2 infrastructure (which is why the reference architecture extends ament and colcon, and doesn’t reinvent the wheel), performance, determinism, power consumption, etc. This REP facilitates a common path which allows you not just to benchmark acceleration hardware, but also to switch across solutions (even within the same technology, e.g. going from an embedded edge KV260 to a workstation-like PCIe Alveo card for more acceleration capabilities).
    Even if you have already picked whatever hardware you’ll be using, and you’ll build the accelerator yourself, this REP aims to provide a consistent way to integrate hardware acceleration with ROS 2, with examples that can kickstart your development. I don’t see how that diminishes the value of this REP, quite the opposite.

I like the HALIDE proposal. But again, I wouldn’t be overly ambitious if we want things to be actionable.

Don’t get me wrong, I’d love for the vision proposed to happen. It’s just that OpenCL has (as argued above) failed to force silicon vendors to converge on a kernel development language (though it’s widely used for host-to-kernel interaction). Arguably, it’s going to be hard for HALIDE to succeed where OpenCL failed.

The (best) way forward, though, is not to be exclusive, but inclusive, as we attempted in the HAWG architecture. See the ament_halide block:

  ROS 2 stack                   HAWG @ ROS 2 stack

+-------------+             +--------------------+
|             |             |  xilinx_examples   |
| user land   |  +-------------------+-----------+-------+--------------+
|             |  |       Drivers     |     Libraries     |    Cloud     |
+-------------+  +---------------+---+--------+-------------------------+
|             |  |   ament_vitis | ament_halide |          |  accel_fw    |
|             |  +---------------+----------+-+----------+-+------------+
|  tooling    |  |     ament_acceleration   | colcon_accel |  accel_fw  |
|             |  +------------------------------------------------------+
|             |  |      build system        |   meta build |  firmware  |
+-------------+  +--------------------------+--------------+------------+
|     rcl     |
+-------------+
|     rmw     |
+-------------+
|   adapter   |
+-------------+
|             |
| middleware  |
|             |
|             |
+-------------+

Anyone motivated to push forward HALIDE, can create ament_halide and provide a matching (halide-wise) firmware with this REP’s architecture and conventions.

Have you considered putting person-months into this @LiyouZhou? I’d be happy to walk you through the extensions needed to match HALIDE to the current architecture, and to test things together. Note again that your view can be completely embedded into the existing architecture.

No need to wait for the work to be done! You can try things out today with Xilinx’s KV260 using the following ROS 2 packages which extend your ROS 2 workspace to include hardware acceleration:

1 Like

Hi there,
I am following this thread, as I consider it very interesting.
I just want to add my two cents. I know this community is not that interested in the implementation of the acceleration, rather than using it, but following these points

I was thinking that an edalize-style tool could be used to reuse and abstract the kernel implementations. Thanks to Edalize, in principle (I’m not an expert), you could design one accelerator and implement it on several FPGAs, easing the porting to different targets.

Maybe this is putting too much stuff in at the same time…

1 Like

This is cool, thanks @imguruza. Sounds like including edalize would add support for a number of different backends in one go. ament_vitis focuses only on Vitis. I’m hoping this is just a starting point, with others appearing in the future. edalize could indeed help with this vision.

I saw Vivado amongst the backend options for edalize. This, I guess, means that creating ament_edalize will in fact support Vivado and quite a few others in one go. It also looks pretty straightforward to extend each backend with more modern capabilities, since they are simple Python files each (e.g. Vivado’s).

I don’t have lots of extra cycles right now to create the ament extensions for edalize myself, but I’d be happy to walk you through the changes needed if you are willing to put in the time @imguruza. Also, as a point of convergence, you could then try to extend the simple accelerated double vector add example to get it built with ament_edalize CMake macros (instead of using the ament_vitis ones). Provided ament_edalize is in the right form, the only changes needed in this package would be the CMake macro bits and the sources of the kernel (in whatever format the targeted EDA backend wants them).

For the sake of the discussion, note that one of the interesting features proposed in REP-2008 through the ament_acceleration ROS 2 package is to abstract the underlying acceleration backend through a series of technology-agnostic CMake macros.

In other words, instead of:

# Vitis-specific 
# ament_vitis CMake macros
vitis_acceleration_kernel(
  NAME vadd
  FILE src/vadd.cpp
  CONFIG src/kv260.cfg
  INCLUDE
    include
  TYPE
    hw
  PACKAGE
)

or potentially, for edalize:

# edalize-specific 
# ament_edalize CMake macros
edalize_acceleration_kernel(
  NAME vadd
  FILE src/vadd.cpp
  CONFIG src/kv260.cfg
  INCLUDE
    include
  TYPE
    hw
  PACKAGE
)

one could just use:

# generic hardware acceleration 
# ament_acceleration CMake macros
acceleration_kernel(
      NAME vadd
      FILE src/vadd.cpp
      INCLUDE
        include
      TARGET kv260  # this can actually be derived
                    # at build-time, from the firmware 
                    # and build args, so not needed
)
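
How a generic macro might forward to a vendor-specific one can be sketched as below. This is a guess at the general shape, not the ament_acceleration implementation: every `hypothetical_*` function and the `HYPOTHETICAL_ACCEL_BACKEND` variable are invented for this example, and the two backend functions are mere stand-ins for vendor macros such as the ones shown above.

```cmake
# Hypothetical sketch of backend dispatch; not the ament_acceleration code.
cmake_minimum_required(VERSION 3.10)

# Assumption: the selected firmware or the build arguments set this variable.
if(NOT DEFINED HYPOTHETICAL_ACCEL_BACKEND)
  set(HYPOTHETICAL_ACCEL_BACKEND "vitis")
endif()

# Stand-ins for vendor-provided, technology-specific macros.
function(hypothetical_vitis_kernel)
  message(STATUS "vitis backend received: ${ARGN}")
endfunction()

function(hypothetical_edalize_kernel)
  message(STATUS "edalize backend received: ${ARGN}")
endfunction()

# Generic entry point: forwards its arguments to the selected backend.
function(hypothetical_generic_kernel)
  if(HYPOTHETICAL_ACCEL_BACKEND STREQUAL "vitis")
    hypothetical_vitis_kernel(${ARGN})
  elseif(HYPOTHETICAL_ACCEL_BACKEND STREQUAL "edalize")
    hypothetical_edalize_kernel(${ARGN})
  else()
    # Unknown backend: skip rather than fail, so CPU-only builds keep working.
    message(STATUS "Unknown backend '${HYPOTHETICAL_ACCEL_BACKEND}', skipping")
  endif()
endfunction()

hypothetical_generic_kernel(
  NAME vadd
  FILE src/vadd.cpp
  INCLUDE include
)
```

With this shape, switching technologies means changing one variable in the build environment while the package's own CMake call stays untouched.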

Then, the ament_acceleration package will derive at build time which underlying ament extension to use, based on the firmware selected in the ROS 2 workspace and the colcon build arguments. More specifically, today you can do:

Btw, this video I found explaining edalize is awesome :joy: .

1 Like
