Proposal for ROS 2 Hardware Acceleration Working Group (HAWG)

Thanks, I sent an invitation to join the group and the community. Hope to hear from them.

Have you given thought to SYCL support? I’ve found reasonable success keeping a single codebase for fairly different platforms: HEDT, edge CPU and GPU, using SYCL 2020 (with C++17 support). The ecosystem is young, but with support for OpenCL and HIP (courtesy of AMD), it looks promising.

The official implementation triSYCL even has some extensions for Xilinx FPGAs (disclaimer: I haven’t played with those or seen them in action).
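To give a flavor of what that single-source code looks like, here’s a minimal, hypothetical vector-add sketch (names and sizes are illustrative; assumes a conforming SYCL 2020 compiler such as DPC++ or triSYCL):

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  // Let the runtime pick a device: CPU, GPU, or another accelerator.
  sycl::queue q{sycl::default_selector_v};
  std::cout << "Running on: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";

  constexpr size_t n = 1024;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
  {
    // Buffers manage host/device transfers; results copy back on destruction.
    sycl::buffer<float> buf_a{a}, buf_b{b}, buf_c{c};
    q.submit([&](sycl::handler& h) {
      sycl::accessor ka{buf_a, h, sycl::read_only};
      sycl::accessor kb{buf_b, h, sycl::read_only};
      sycl::accessor kc{buf_c, h, sycl::write_only, sycl::no_init};
      // The same kernel source is compiled for every backend.
      h.parallel_for(sycl::range<1>{n},
                     [=](sycl::id<1> i) { kc[i] = ka[i] + kb[i]; });
    });
  }
  std::cout << c[0] << "\n";  // prints 3
}
```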

Thanks for organizing this, @vmayoral, and it does sound promising. The biggest challenge I see is that there isn’t a great API for write-once, run-everywhere accelerated code. Since Nvidia has not published an OpenCL driver for their popular Jetson platform, an OpenCL-only approach would exclude those devices. I haven’t looked at SYCL, but I would imagine that if someone had achieved it, we’d all be using it already.

I’ve been waiting for an initiative like this for a long time! Great to know that it is progressing. I am personally particularly interested in GPU hardware acceleration, as almost every x86 CPU nowadays comes with a “free” integrated GPU.

This is a very interesting initiative. We have been working with the Ultra96 boards and ROS for some time, and this year we started using Vitis. Partial reconfiguration of FPGAs is also something we’ve been trying to look into, but we haven’t had enough time or people for that side of the research.


@vmayoral definitely a great goal to organize the compute acceleration community around an application framework like ROS 2. I am not a big fan of either OpenCL or HLS for compute accelerators. OpenCL is too constrained for the complex concurrency patterns in optimization and linear algebra kernels; the good stuff that needs acceleration isn’t GEMM. HLS is difficult to use in an application development environment because its compilation times are not commensurate with the edit/compile/test cycles the software community expects.

The ROS 2 computational graph is a better structure to leverage. There are many examples of these types of graph runtimes in HPC for managing expanding and contracting concurrency (for example, MAGMA), where the nodes of the graph represent well-defined APIs and dispatch abstractions. A DAG scheduler would also be able to synthesize compute and latency demands, which would be instrumental in driving either accelerator design or, potentially, hardware synthesizers that yield accelerators perfectly tuned to the computational load.

The problem I see on the horizon is that this hardware acceleration problem for non-trivial computational graphs is likely to be solved more urgently in the networking/telecom space than in robotics: the applications, money, and skill sets are readily available in that vertical, whereas robotics budgets and teams are simply too small to really deliver on this engineering.

Personally, I think the most productive path forward here is to focus on bottleneck operators in perception and use architectural simplifications to deliver application-level value quickly and cost-effectively.

Thanks for your comments and input, everyone! Lots of interesting feedback. We’re glad to see the interest our initiative has raised. We’ve collected all the feedback and new ideas and plan to include them in a follow-up thread officially launching the WG.

We’ll give everyone a few more weeks to share their thoughts, but stay tuned :wink: . A few comments from our side:

This is an interesting angle @kunaltyagi. I haven’t looked at SYCL but I believe this would be a great contribution/extension to the initial hardware acceleration architecture.

True @rgov, but we believe that by relying on open standards we’ll maximize our chances of succeeding here. That’s why we’re proposing to bet on C, C++ and OpenCL, which can be used today to program hardware accelerators in many devices (across compute substrates, including FPGAs, CPUs and GPUs).
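For illustration only (a hedged sketch, not code from the WG), this is the kind of portable host-side code that betting on open standards enables; the same host logic can drive a CPU, GPU or FPGA device, with only the kernel compilation step differing per target:

```cpp
// Requires the OpenCL C++ wrapper (opencl.hpp) and a vendor runtime/ICD.
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/opencl.hpp>
#include <iostream>
#include <vector>

int main() {
  constexpr size_t n = 1024;
  std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

  // Same host logic regardless of the device behind the runtime.
  cl::Context context(CL_DEVICE_TYPE_DEFAULT);
  cl::CommandQueue queue(context);

  // CPU/GPU runtimes compile kernels from source at runtime; FPGA flows
  // would instead load a prebuilt binary via cl::Program(devices, binaries).
  const char* src = R"(
    __kernel void vadd(__global const float* a,
                       __global const float* b,
                       __global float* c) {
      int i = get_global_id(0);
      c[i] = a[i] + b[i];
    })";
  cl::Program program(context, src, /*build=*/true);

  cl::Buffer d_a(context, a.begin(), a.end(), /*readOnly=*/true);
  cl::Buffer d_b(context, b.begin(), b.end(), /*readOnly=*/true);
  cl::Buffer d_c(context, CL_MEM_WRITE_ONLY, n * sizeof(float));

  cl::KernelFunctor<cl::Buffer, cl::Buffer, cl::Buffer> vadd(program, "vadd");
  vadd(cl::EnqueueArgs(queue, cl::NDRange(n)), d_a, d_b, d_c);

  cl::copy(queue, d_c, c.begin(), c.end());
  std::cout << c[0] << "\n";  // prints 3
}
```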

Dynamic Function eXchange (DFX), formerly known as Partial Reconfiguration (PR), is central to our approach at Xilinx, @jopequ. That’s indeed a feature we will be leveraging while integrating such capabilities directly into ROS 2, so that using it doesn’t require you to be a hardware expert.

Before diving into DFX though, getting up to speed with accelerators is a must (that’s how we came up with the plan above :slight_smile: ). We don’t yet have a timeline for DFX integration, but stay tuned if you’re interested, or ping me if you have the resources to push on that front and would like to contribute.

Welcome to the community @Ravenwater. HLS has traditionally been a pain, I agree. Things are changing though.

Though I certainly can’t generalize across all hardware, speaking for our solutions at Xilinx, and more specifically our Kria portfolio, we propose three build targets for HLS:

  • Software Emulation (sw_emu): The kernel code is compiled to run on the host processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with the application, and verifying the behavior of the system. Simply put, it’s a transformation that runs all the code on an emulated processor matching the K26 SOM, as if there weren’t any accelerator.

  • Hardware Emulation (hw_emu): The kernel code is compiled into a hardware model (RTL), which runs in a dedicated simulator. This build-and-run loop takes longer but provides a detailed, cycle-accurate view of kernel activity. This target is useful for testing the functionality of the logic that will go into the FPGA and for getting initial performance estimates. In other words, a simulation within an emulation: the FPGA is simulated and runs inside of an emulation (QEMU), sitting alongside the emulated processors and allowing you to get performance estimates faster than on real hardware.

  • Hardware (hw): The kernel code is compiled into a hardware model (RTL) and then implemented on the FPGA, resulting in a binary that runs on the actual FPGA.

sw_emu allows you to run both the host code and the accelerator inside an emulator (QEMU), and the transformation takes about a minute. This includes the kernel code, which runs as if it were host code. I’m using this approach every day and I believe it is sufficient to meet the common software development flows in ROS/robotics.
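To make the discussion concrete, the “very simple kernel” I time below looks roughly like this; a hypothetical vector-add in the Vitis HLS C++ style (kernel name, ports and pragmas are illustrative):

```cpp
// krnl_vadd.cpp: hypothetical kernel; compiled with v++ for sw_emu/hw_emu/hw.
extern "C" {
void vadd(const unsigned int* in1,  // read-only input vector
          const unsigned int* in2,  // read-only input vector
          unsigned int* out,        // output vector
          int size) {               // number of elements
// Map pointer arguments onto AXI master interfaces to global memory.
#pragma HLS INTERFACE m_axi port = in1 bundle = gmem0
#pragma HLS INTERFACE m_axi port = in2 bundle = gmem1
#pragma HLS INTERFACE m_axi port = out bundle = gmem0

vadd_loop:
  for (int i = 0; i < size; ++i) {
// Ask HLS to issue one loop iteration per clock cycle.
#pragma HLS PIPELINE II = 1
    out[i] = in1[i] + in2[i];
  }
}
}
```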

Here are some numbers I just produced for the sake of the argument:

| Build target | ROS 2 accelerated package build time |
| ------------ | ------------------------------------ |
| `sw_emu`     | ≈ 1 minute                           |
| `hw_emu`     | ≈ 4 minutes                          |
| `hw`         | ≈ 23 minutes                         |

Note this depends heavily on your development workstation’s characteristics. Mine is based on an AMD Ryzen 5 PRO 4650G.

These numbers are approximate and based on a ROS 2 package in my current workspace (including host code and kernel) that contains a very simple kernel.

I’d be curious to hear your thoughts on this, @Ravenwater, and everyone else’s. I agree nevertheless that we should push towards a similar user experience, both time-wise and development-flow-wise.


@vmayoral The sw_emu/hw_emu/hw targets you describe are solid and will be very productive. Absolutely thrilled that this is coming to the community.


Hi @vmayoral, which ROS package are you building, and which kind of acceleration functions are you using for the tests?

Which is going to be the target platform? Kria KV260? How do you manage different FPGA board resources if I want to use the kernel on another board?

Do I need to compile the kernel and FPGA design every time on my own, or will I be able to download it?

Hi @vmayoral ,
I am currently a (Xilinx) FPGA user. I would love to help with the project’s open-source kernel acceleration work.


Thanks to everyone who showed interest! The WG has been announced and the first meeting scheduled. See details at Announcing the Hardware Acceleration WG, meeting #1.

In case you missed it, the first meeting of the WG will happen next week at 2021-06-30T18:00:00Z.

  • Coordinates: Zoom
    • Phone one-tap: US: +17209289299,99299967182#,0#,8504098917# or +19292056099,99299967182#,0#,8504098917#
    • Meeting URL: Launch Meeting - Zoom
    • Meeting ID: 992 9996 7182
    • Passcode: Xk.X73&rNY
  • Preliminary agenda:

    1. Introductions
    2. ROS 2 Hardware Acceleration WG, quick review of objectives, rationale and overview
    3. Initial hardware acceleration architecture for ROS 2 and short demonstrations
    4. Community hardware platforms (e.g. Ultra96-v2), process and steps
    5. Q&A
    6. (your acceleration project)

Hello @vmayoral ,

I am an embedded engineer with some experience in FPGAs. I’m really interested in joining this group and contributing. How can I get started?

Thanks.

@kscharan, you can get started by checking the resources at ROS 2 Hardware Acceleration Working Group · GitHub. Then watch the recordings of the first two HAWG meetings and review the linked resources.

Stay tuned for upcoming ones and open a ticket at GitHub - ros-acceleration/community: WG governance model & list of projects if you have any ideas/projects where you’d like to contribute.


Thanks @vmayoral for the quick reply. Excited to get started with my Ultra96v2. :relieved:

Back in early 2021 we proposed this WG and kicked off activities with the goals listed in the table below.

I’m happy to report on the following progress that happened during 2021:

The ROS 2 Hardware Acceleration WG work reached more than 250,000 users/roboticists and generated more than 2000 reactions. The community repo has ~200 biweekly views and the recorded meetings have more than 1000 views (data disclosed in the ROS 2 Hardware Acceleration Working Group 2021 dissemination report).

| Target | Description |
| ------ | ----------- |
| 2021 :white_check_mark: | 1) Design tools and conventions to seamlessly integrate acceleration kernels and related embedded binaries into ROS 2 computational graphs, leveraging its existing build system (ament_acceleration extensions) [1], meta build tools (colcon-acceleration extension) and a new firmware layer (acceleration_firmware) [2]. |
| 2021 :white_check_mark: | 2) Provide reference examples and blueprints for acceleration architectures used in ROS 2 and Gazebo. |
| 2022 :white_check_mark: | 3) Facilitate testing environments that allow benchmarking accelerators, with special focus on power consumption and time spent on computations (see HAWG benchmarking approach, community#9, tracetools_acceleration, ros2_kria). |
| 2022 :warning: | 4) Survey the community’s interests in acceleration for ROS 2 and Gazebo (see discourse announcement, survey). |
| 2022 :warning: | 5) Produce demonstrators with robot components, real robots and fleets that include acceleration to meet their targets (see acceleration_examples). |

During today’s meeting (2022-01-25T18:00:00Z), we’ll go through these and present a new set of objectives for 2022.


  1. See ament_vitis ↩︎

  2. See acceleration_firmware_kv260 for an exemplary vendor extension of the acceleration_firmware package ↩︎

Hi, and great work addressing this important issue. How is the support for writing accelerators in HDLs? I tried building a couple of the examples, and the HLS/OpenCL flow works really well, plug-and-play. One thing I find tedious with SoC FPGA design is manually creating the AXI interfaces to the PS and DDR, and the software drivers. Are there ways to automate this with the colcon/ament extensions for accelerators written in an HDL as well?


Awesome to hear @erlingrj, thanks and keep the feedback coming please.

I am not aware of anyone looking at this from the ROS side at the moment. From Xilinx’s tooling perspective, this is fully supported, but you currently need hardware skills to use it. A few pointers for hardware engineers:

The current CMake macros that integrate Vitis capabilities (ament_vitis) only allow you to either a) generate acceleration kernels (technically .xo files, Xilinx objects) from C++ [1], or b) link together (place and route) various kernels [2]. Ideally, we’d have CMake macros that generate a kernel out of HDL sources. The way to implement this would be to generate a Tcl file and pass it to Vivado for .xo packaging. This is demonstrated in this example. I prototyped Tcl script generation from CMake macros a while ago here.

If you have development cycles @erlingrj and want to merge these last two pointers together, I’d be happy to review a PR adding a new CMake macro for that to ament_vitis. From there, it’d be pretty easy to get a full example using those macros and RTL into acceleration_examples.

Can you describe a particular ROS use case that drives this ask (node/component/computational graph)?


  1. see vitis_acceleration_kernel ↩︎

  2. see vitis_link_kernel ↩︎

Thanks for your reply @vmayoral. I am not that well-versed in Vitis and Xilinx objects, so I am going to take a good look at your references. I could be interested in contributing a flow for RTL kernels, but first I would need to add support for the Zedboard (as it is my only Zynq board).


Sure, refer to the ticket tracking the port to the Ultra96v2; there are lots of good bits in there on how to bring up the architecture we’ve put together on a new board.