@Dejan_Pangercic thanks for showing so much enthusiasm for this. Let me give you a bit of context before answering your question. Hopefully you’re already aware of all of this, and it’s just a reminder, but I’ve been getting similar questions from others in the community, so it should be helpful for those following this thread:
Hardware acceleration in the ROS 2 context refers to the process by which a ROS abstraction (typically a Node) offloads certain computing tasks onto specialized hardware components within the hardware system, enabling greater performance (faster), more determinism or security/safety (through isolation) capabilities, among others.
I’ve been asked this quite a few times already, and I’ll be speaking about it at Adapt 2021 (and taking questions).
(hardware acceleration) that’s not something roboticists think about too much typically and you might want to start from a more base explanation that not everyone (or most people, I suspect) are going to have any idea what this is really about. For instance, if I use a GPU typically I’m parallelizing some process or utilizing AI/ML models that can’t be run on a CPU. Often times those libraries are tied into specific GPU vendors / implementations so its not like I’m going to go through the effort, as a company, to rewrite really low-level parts of TensorFlow or Caffe. In that context, I’m not really sure what this (REP) is trying to accomplish. When I am writing GPU code, I’m using openCL so its portable across Intel / AMD / Nvidia hardware and that generally solves the cross-vendor issue.
There’s merit in this comment, particularly the bit about “When I am writing GPU code, I’m using openCL so its portable across Intel / AMD / Nvidia hardware”. This is the expectation (it was mine as well when I started working on this). Reality, though, is different.
OpenCL (unfortunately) has had only partial success. Most silicon companies (including the one I work for today, Xilinx) are only pushing it for interoperability at the CPU level (that is, how acceleration kernels interoperate with the CPU), but almost nobody writes kernels in OpenCL these days, at least from what I’ve seen. You’d use HIP for AMD (at a high level), CUDA with Nvidia, or HLS for C++ with Xilinx (or even Verilog if you like hardcore hardware like me). It’s important to understand that each silicon vendor has its own thing. There are business reasons behind this which need to be understood, but that should not overcomplicate the use of popular frameworks. It should be the silicon vendors making the effort (after all, they’re the ones deciding to waste the chance with OpenCL), not us.
The reason why you can use TensorFlow on all these compute substrates (FPGAs, GPUs, etc.) without caring about low-level details (such as HLS, HIP or CUDA, among others) is exactly because those projects (e.g. TensorFlow) became so big (so relevant) that each silicon vendor was forced to build its own accelerators (again, each in its corresponding form) for these projects if it wanted to keep up.
What I’m attempting to do with this REP, in a nutshell, is exactly this: start a community effort that elevates ROS 2 to the point of telling these big (silicon) companies to wake up and start committing engineering resources to build accelerators for the ROS stacks in an organized, structured and ROS 2 package maintainer-centric manner.
Note the section in the REP about value for stakeholders.
Note also that this is the opposite of letting each silicon vendor create its own fork of every ROS 2 package subject to hardware acceleration (which is what I’ve been seeing for months now). Some are even doing this for ROS 2 core packages, reinventing the wheel. This is horrible from a maintainer’s perspective, and even worse from a user’s:
- For a maintainer, if she wants to support different hardware acceleration solutions (e.g. AMD and Nvidia), she will find herself struggling to coordinate efforts between silicon vendors who likely won’t pay much attention, because their business drivers are just different (not community oriented).
- For a user, he will struggle horribly while changing between acceleration solutions (e.g. moving from the Jetson to the Kria SOMs), because it’s not just the kernels themselves that change, but the whole underlying infrastructure: the libraries required to build those accelerators, the way BSPs are generated, the cross-compilation process. Lots of fun if you like embedded like me, but horrible otherwise.
I’m speaking about this in a recent paper we disclosed.
I’m in touch with AMD and Nvidia exactly for this. We at Xilinx, with this REP proposal, are trying to turn things around now, while there’s still time, so that there’s a general outer set of abstractions and primitives that maintainers can use when including hardware acceleration in their packages in a technology-agnostic manner. Then, it’s up to each silicon vendor to implement the low-level kernels that favour the ROS community’s interest for each ROS package/stack. This is explained in the architecture pillars of the REP.
That’s why this REP proposes a reference architecture for hardware acceleration and the corresponding conventions that will empower us all to use ROS with hardware acceleration in a ROS 2-centric manner (as opposed to in an HLS-centric manner, or HIP-centric way, etc). This REP is a community effort to provide these vendors “a way” for interoperability. A consensus of “a way in”. Otherwise, we won’t be able to run ROS in the same way you run TensorFlow, across hardware acceleration solutions.
And I definitely want to encourage this. That the hardware acceleration API we all use is based on the ROS official packages, not whatever-fork-company-A has created.
In the pillars, it looks like you’re just proposing some new cmake macros and colcon verbs. Why does this need to be an REP? Can’t this go in your own repo as a simple extensions?
If we want package maintainers to be able to include support for hardware acceleration in a technology-agnostic manner, the cleanest and most scalable way (unless we start forking everything, which I’d argue strongly against) is to have a series of CMake macros that are included in the ROS 2 Common packages and that hide the complexity of each one of those underlying implementations. These CMake macros should operate as follows: if you have hardware acceleration, then build the kernel; if you don’t, just skip it. It’s then up to each silicon vendor to match those CMake macros with their own low-level libraries and tools (we at Xilinx are providing a reference implementation for this, so that others can imitate it). This way, we make it possible to include hardware acceleration in existing ROS packages (by using these macros), without having to fork every single ROS 2 package out there that’s computationally expensive.
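The "build the kernel if acceleration is available, otherwise skip it" behaviour can be sketched in a few lines. This is only a rough model of the decision logic, written in plain Python rather than CMake so it can be run anywhere. Every name in it (`AccelerationBuild`, `acceleration_kernel`, `register_backend`, the `vendor_a` backend) is hypothetical and not taken from the REP or from any existing ROS 2 package.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical hook type: given a kernel name and its sources, a vendor's
# tooling "builds" the kernel and returns a short description of what it did.
KernelBuilder = Callable[[str, List[str]], str]


@dataclass
class AccelerationBuild:
    """Toy model of the decision a vendor-agnostic build macro would make."""

    backend: Optional[str] = None  # None means no hardware acceleration
    builders: Dict[str, KernelBuilder] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def register_backend(self, name: str, builder: KernelBuilder) -> None:
        # Each silicon vendor would plug its own low-level tools in here.
        self.builders[name] = builder

    def acceleration_kernel(self, name: str, sources: List[str]) -> bool:
        """Build the kernel if a backend is selected, otherwise skip it."""
        if self.backend is None:
            self.log.append(f"skip {name}: no acceleration backend selected")
            return False
        builder = self.builders.get(self.backend)
        if builder is None:
            self.log.append(f"skip {name}: no builder for backend {self.backend!r}")
            return False
        self.log.append(builder(name, sources))
        return True


if __name__ == "__main__":
    # The same package description is used with and without acceleration.
    cpu_only = AccelerationBuild()
    cpu_only.acceleration_kernel("resize", ["src/resize_kernel.cpp"])
    print(cpu_only.log)

    accelerated = AccelerationBuild(backend="vendor_a")
    accelerated.register_backend(
        "vendor_a", lambda name, sources: f"vendor_a built {name} from {sources}"
    )
    accelerated.acceleration_kernel("resize", ["src/resize_kernel.cpp"])
    print(accelerated.log)
```

The point of the sketch is that the maintainer’s call stays identical in both cases; only the selected backend and the vendor-supplied builder differ.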
I can’t stress enough how important it is to get this “right” now. If we don’t, then, the already existing ROS fragmentation (across distros) is soon to get a new dimension with hardware acceleration.
My very biased opinion is that we need a REP that somehow defines a series of CMake abstractions and conventions that:
- a) users can use without caring for the underlying acceleration technology and buying/using what their budget allows for,
- b) maintainers will use for including hardware acceleration support and
- c) companies like mine (Xilinx), AMD or Nvidia (among others) should comply with if they wish to align with what benefits ROS users and package maintainers the most (which arguably should be IMHO the community’s position).
How can I see this live, with real demos and examples?
As I said above:
I’m working with our team to disclose everything in the coming weeks. I’ll ping back in here. But very soon. The initial implementation is done. Just big company’s problems which we’re addressing.
This assumes there’s some capabilities like tensorflow within ROS that needs GPU/FPGA support – what are those, specifically?
Lots. E.g. one of the basic examples that I’ve been looking at lately is image_pipeline. Most vision-based computational graphs rely on it, and it appears many transformations can be accelerated heavily by using specialized hardware libraries/kernels that already exist. I’m currently looking at this, and in the case of Xilinx we have something called the Vitis Vision Library, which implements primitives for rectify, resize or conversions (among others). Nvidia has a similar thing (VPI), and so does AMD (MIVisionX). Again, each one to its own tune.
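To make the idea concrete, here is a minimal sketch of how a single operation such as resize could keep one call site while vendors plug in their own kernels. It is an illustration only: the registry, `register_resize`, the `backend` argument and the `vendor_a` name are all hypothetical, the image is a plain list of lists, and nothing here calls into Vitis Vision, VPI or MIVisionX.

```python
from typing import Callable, Dict, List, Optional

Image = List[List[int]]  # tiny stand-in for a real image type
ResizeFn = Callable[[Image, int, int], Image]


def resize_cpu(img: Image, new_h: int, new_w: int) -> Image:
    """Reference nearest-neighbour resize on the CPU (assumes a non-empty image)."""
    h, w = len(img), len(img[0])
    return [
        [img[r * h // new_h][c * w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


# Hypothetical registry: each vendor would register a wrapper around its own
# accelerated kernel under its backend name.
_ACCELERATED_RESIZE: Dict[str, ResizeFn] = {}


def register_resize(backend: str, fn: ResizeFn) -> None:
    _ACCELERATED_RESIZE[backend] = fn


def resize(img: Image, new_h: int, new_w: int, backend: Optional[str] = None) -> Image:
    """Use the accelerated kernel for `backend` if one exists, else the CPU path."""
    fn = _ACCELERATED_RESIZE.get(backend, resize_cpu)
    return fn(img, new_h, new_w)


if __name__ == "__main__":
    image = [[1, 2], [3, 4]]
    # No backend registered: the CPU reference implementation is used.
    print(resize(image, 4, 4))

    # A stand-in "vendor" kernel; a real one would offload to hardware.
    register_resize("vendor_a", lambda img, h, w: resize_cpu(img, h, w))
    print(resize(image, 4, 4, backend="vendor_a"))
```

Whether dispatch happens at build time or at run time is a design choice the sketch does not settle; it only shows that the user-facing call does not need to know which vendor sits underneath.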
I’m working with both AMD and Nvidia to try and get common ground. Launching this REP was my attempt to get us all to push in one direction.
Sure, and sorry for the long discourse above, but that’s what this forum is for, right?
In a nutshell, as far as my understanding (from a quick reading) goes, there’s no overlap. A complement possibly, if the AI demand suddenly grows in ROS. Apache TVM is ML-specific, and neither does the HAWG target this specifically, nor is all hardware acceleration ML-related. In fact, this question is probably best brought to the AI Edge WG at the TSC level once @joespeed has time for it (discussion pointer).
From a general point of view, Apache TVM seems somewhat equivalent to the example I was giving above about vision libraries and the image_pipeline ROS 2 package: there’s Xilinx’s Vitis Vision Library, AMD’s MIVisionX and Nvidia’s VPI, which can drive you a bit nuts. Having a project that simplifies interoperability for ML sounds like a nice intent, but it’s not the ROS community’s responsibility to get down to the specifics of each ML library.
That’s something silicon vendors (and in this case, I guess, the Apache foundation?) should arrange themselves. From a ROS perspective, we should simply tell them how we want their accelerators to be built and integrated into ROS 2 packages (and that’s what this REP is for).
A group that I led in the past somewhat initiated one such ML discussion in this community, and I still get questions today with overblown expectations of what AI can do in robotics. It’s surprising how little AI folks understand about robotics sometimes. Nevertheless, for this context, if I were to compare ML frameworks with ROS, there’s a big difference. For the former, most silicon vendors already have their own implementations (Caffe, TensorFlow, etc.). For the latter, for ROS, this is getting started now.
While TVM is an awesome initiative, it’s going to be (very) hard to change silicon vendors’ mindset so that they switch to supporting TVM instead of their own self-implemented accelerators for Caffe or TensorFlow, among others. With ROS, instead, we now have the chance to set the blueprint and ask silicon vendors interested in growing their customer base in robotics to comply.
As I was saying, we’re still in time.