
Proposal for ROS 2 Hardware Acceleration Working Group (HAWG)

Totally. This has historically been one of the big hurdles for acceleration adoption, and not just in robotics btw. The Not Invented Here syndrome is a big thing in the acceleration world. Big silicon vendors are too self-centric and, rather than contributing to communities, prefer to reinvent the wheel with their yet-another-not-so-new-feature-in-acceleration-language-A. This is nuts. My team at Xilinx wants to tackle this and contribute to the ROS 2 community directly, connecting to real ROS users’ needs and simplifying things.

This is what Goal 1 above is for in our roadmap (the first one!). We’ll build tools that facilitate the integration of accelerators directly into ROS packages. In a nutshell, building an arbitrary ROS package with or without acceleration should be the same experience. Just colcon build it.

If enough people are interested in this, our plan is to disclose and contribute our implementations while pushing forward conventions, so that other silicon vendors can attach to them (instead of, yet again, reinventing the wheel).

2 Likes

I’m surely forgetting many, but here’re a few more folks that may find this of interest @Jmeyer, @andreucm, @crystaldust, @DavidT, @LiyouZhou, @cheng.chen, @yukkysaito, @chang.su, @ArunPrasad, @kosuke55

@vmayoral that’s exciting news! On behalf of the micro-ROS team and as coordinators of the EWG, consider us definitely interested in collaborating on this! It’d be nice to organize a dedicated meeting someday soon; we’ll contact you by PM. In the meantime, don’t hesitate to bring the topic to the upcoming EWG meetings to raise a public discussion with the community!
[Next one will be held on Tuesday the 4th of May at 5 pm CEST here]

3 Likes

Absolutely @FraFin! There’s definitely room for cooperation between EWG and HAWG! I look forward to it, and I’m confident the MCU focus you guys have is a good fit with the architectures we’re starting to prototype.

3 Likes

This is a great idea! It would be wonderful to have more free open source FPGA IP cores that a community can share, fix, and enhance.

My group creates, among other things, custom high speed cameras by connecting imagers to Xilinx FPGAs so we can perform image processing directly on the camera. I would love to create a TSN/DDS/ROS interface to the cameras.

One issue we always run into is that we have to purchase many FPGA IP cores and they are black boxes that we cannot inspect or modify. They sometimes have bugs that no one is interested in fixing and that makes many people hesitant to use FPGAs.

Often people prefer doing image processing on GPUs instead, but the nondeterminism and the delays of transferring images from a camera to a GPU through a CPU kill the performance.

It would be great to have a way of feeding images directly from an FPGA to a GPU and doing image processing on them both, in whichever ways each is optimal, using all free open source acceleration kernels.

5 Likes

There is an Open-Source FPGA Foundation with companies such as QuickLogic who are using open-source FPGA tools, so if you haven’t done so already, it might be worth contacting them and seeing if they are interested in working with the ROS folks to improve FPGA support in ROS 2.
https://osfpga.org/

3 Likes

Awesome! Looking forward to it :star_struck:

Lovely to hear @peterpolidoro! Empowering these kinds of use cases should be a key driver within this group.

Noted!

Thanks, I sent an invitation to join the group and the community. Hope to hear from them.

Have you given thought to SYCL support? I’ve found reasonable success in having a single codebase for relatively different platforms: HEDT, edge CPU and GPU, using SYCL 2020 (with C++17 support). The ecosystem is new, but with support for OpenCL and HIP (courtesy of AMD), it looks promising.

The triSYCL implementation even has some extensions for Xilinx FPGAs (disclaimer: I haven’t played with those or seen them in action).
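To give a flavour of what “single-source” means here, below is a minimal, hypothetical SYCL 2020 sketch (a vector addition; the kernel, buffer names and sizes are mine, not from any ROS package). It assumes a SYCL 2020 implementation such as DPC++, hipSYCL or triSYCL; the same C++ file is dispatched at runtime to whatever device the implementation selects.

```cpp
// Minimal single-source SYCL 2020 sketch (hypothetical vector addition).
// The same C++ code can target CPU, GPU or other backends depending on the
// SYCL implementation and the device selected at runtime.
#include <sycl/sycl.hpp>

#include <iostream>
#include <vector>

int main() {
  constexpr size_t N = 1024;
  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

  sycl::queue q;  // default selector: picks a CPU, GPU, ... at runtime
  {
    sycl::buffer<float, 1> buf_a{a.data(), sycl::range<1>{N}};
    sycl::buffer<float, 1> buf_b{b.data(), sycl::range<1>{N}};
    sycl::buffer<float, 1> buf_c{c.data(), sycl::range<1>{N}};

    q.submit([&](sycl::handler& h) {
      sycl::accessor A{buf_a, h, sycl::read_only};
      sycl::accessor B{buf_b, h, sycl::read_only};
      sycl::accessor C{buf_c, h, sycl::write_only};
      h.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
        C[i] = A[i] + B[i];
      });
    });
  }  // buffer destruction synchronizes results back into the host vectors

  std::cout << "c[0] = " << c[0] << std::endl;  // expect 3
  return 0;
}
```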

Thanks for organizing this, @vmayoral, and it does sound promising. The biggest challenge that I see is that there isn’t a great API for write once, run everywhere accelerated code. Since Nvidia has not published an OpenCL interface driver for their popular Jetson platform, it would exclude these devices. I haven’t looked at SYCL but I would imagine that if someone had achieved it, we’d all be using it already.

I have been waiting for a similar initiative for a long time! Great to know that it is progressing. I am personally particularly interested in GPU HW acceleration, as almost every x86 CPU nowadays comes with a “free” GPU.

This is a very interesting initiative. We have been working with the Ultra96 boards with ROS for a while and started using Vitis this year. Partial reconfiguration of FPGAs is also something we’ve been trying to look into, but we haven’t had enough time or people on this side of research.

1 Like

@vmayoral definitely a great goal to organize the compute acceleration community around an application framework like ROS 2. I am not a big fan of either OpenCL or HLS for compute accelerators. OpenCL is too constrained for the complex concurrency patterns in optimization and linear algebra kernels, since the good stuff that needs acceleration isn’t GEMM. And HLS is difficult to use in an application development environment because its compilation times are not commensurate with the edit/compile/test cycles that the software community expects.

The ROS 2 computational graph is a better structure to leverage. There are many examples of these types of graph runtimes in HPC for managing expanding and contracting concurrency (for example, MAGMA), and the nodes of the graph represent well-defined APIs and dispatch abstractions. The DAG scheduler would also be able to synthesize compute and latency demands, which would be instrumental in driving either accelerator design or, potentially, hardware synthesizers that yield accelerators perfectly tuned to the computational load.

The problem I see on the horizon is that this hardware acceleration problem for non-trivial computational graphs is likely going to be solved more urgently in the networking/telecom space than in robotics: the applications, money, and skill sets are readily available in that vertical, whereas in robotics the budgets and teams are simply too small to really deliver on this engineering.

Personally, I think the most productive path forward here is to focus on bottleneck operators in perception, and use architectural simplifications to deliver application level value quickly and cost-effectively.

Thanks for your comments and input everyone! Lots of interesting feedback. We’re glad to see the interest our initiative raised. We’ve collected all the feedback and new ideas and plan to include them in a follow-up thread officially launching the WG.

We’ll give a few more weeks for everyone to share their thoughts, but stay tuned :wink: . A few comments from our side:

This is an interesting angle @kunaltyagi. I haven’t looked at SYCL but I believe this would be a great contribution/extension to the initial hardware acceleration architecture.

True @rgov, but we believe that by relying on open standards we’ll maximize our chances of succeeding here. That’s why we’re proposing to bet on C, C++ and OpenCL, which can be used today for programming hardware accelerators in many devices (across compute substrates, including FPGAs, CPUs and GPUs).
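To illustrate what that bet looks like in practice, here’s a minimal, hypothetical sketch (not part of our proposal’s codebase; the kernel name vadd and the buffer sizes are made up) of a portable OpenCL vector add. The kernel string is compiled at runtime for whatever device the driver exposes (a CPU, a GPU, or, through vendor toolchains, an FPGA). It assumes the Khronos OpenCL C++ bindings (<CL/opencl.hpp>) and a working OpenCL driver.

```cpp
// Hypothetical portable OpenCL example: the same kernel source is built at
// runtime for whichever OpenCL device is available.
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/opencl.hpp>

#include <iostream>
#include <vector>

static const char* kKernelSrc = R"CLC(
__kernel void vadd(__global const float* a,
                   __global const float* b,
                   __global float* c) {
  size_t i = get_global_id(0);
  c[i] = a[i] + b[i];
}
)CLC";

int main() {
  constexpr size_t N = 1024;
  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

  // Pick the default device type exposed by the driver (CPU, GPU, ...).
  cl::Context context(CL_DEVICE_TYPE_DEFAULT);
  cl::CommandQueue queue(context);

  // Compile the kernel at runtime for the selected device.
  cl::Program program(context, kKernelSrc, /*build=*/true);
  cl::Kernel kernel(program, "vadd");

  cl::Buffer buf_a(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                   N * sizeof(float), a.data());
  cl::Buffer buf_b(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                   N * sizeof(float), b.data());
  cl::Buffer buf_c(context, CL_MEM_WRITE_ONLY, N * sizeof(float));

  kernel.setArg(0, buf_a);
  kernel.setArg(1, buf_b);
  kernel.setArg(2, buf_c);

  queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(N));
  queue.enqueueReadBuffer(buf_c, CL_TRUE, 0, N * sizeof(float), c.data());

  std::cout << "c[0] = " << c[0] << std::endl;  // expect 3
  return 0;
}
```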

Dynamic Function eXchange (DFX) (formerly known as Partial Reconfiguration (PR)) is central to our approach at Xilinx @jopequ. That’s indeed a feature we will be leveraging while integrating such capability directly into ROS 2, so that using it doesn’t require you to be a hardware expert.

Before diving into DFX though, getting up to speed with accelerators is a must (that’s how we came up with the plan above :slight_smile: ). We don’t yet have a timeline for DFX integration, but stay tuned if you’re interested, or ping me if you have the resources to push on that front and would like to contribute.

Welcome to the community @Ravenwater. HLS has traditionally been a pain, I agree. Things are changing though.

Though I certainly can’t generalize for all hardware, speaking for our solutions at Xilinx and more specifically our Kria portfolio, we propose three build targets for HLS:

  • Software Emulation (sw_emu): The kernel code is compiled to run on the host processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with the application, and verifying the behavior of the system. Simply put, a transformation which runs all the code in an emulated processor matching the K26 SOM, as if there weren’t any accelerator.

  • Hardware Emulation (hw_emu): The kernel code is compiled into a hardware model (RTL), which is run in a dedicated simulator. This build-and-run loop takes longer but provides a detailed, cycle-accurate view of kernel activity. This target is useful for testing the functionality of the logic that will go into the FPGA and getting initial performance estimates. In other words, a simulation within an emulation: the FPGA is simulated and runs inside an emulation (QEMU), sitting alongside the emulated processors, which allows you to get performance estimates faster.

  • Hardware (hw): The kernel code is compiled into a hardware model (RTL) and then implemented on the FPGA, resulting in a binary that will run on the actual FPGA.

sw_emu allows you to run both the host code and the accelerator inside an emulator (QEMU), and the transformation takes about a minute. This includes the kernel’s code, which runs as if it were host code. I’m using this approach every day and I believe it is sufficient to meet the common software development flows in ROS/robotics.

Here’re some numbers I just produced for the sake of the argument:

build target    ROS 2 accelerated package build time
sw_emu          ≈ 1 minute
hw_emu          ≈ 4 minutes
hw              ≈ 23 minutes

Note this depends heavily on your developer workstation characteristics. Mine is based on an AMD Ryzen 5 PRO 4650G.

These are approximate and based on a ROS 2 package in my current workspace (including host code and kernel) which includes a very simple kernel.
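To give an idea of what such a “very simple kernel” looks like, here’s a hypothetical sketch (not the actual kernel I measured): an HLS kernel in this flow is plain C/C++ with pragmas, which sw_emu compiles as regular host code, hw_emu simulates as generated RTL, and hw implements on the FPGA fabric.

```cpp
// Hypothetical Vitis HLS-style kernel (vector addition) with AXI interfaces.
// The same source is used across the sw_emu, hw_emu and hw build targets.
extern "C" {
void vadd(const int* a, const int* b, int* out, int size) {
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem0
#pragma HLS INTERFACE s_axilite port=size bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

  // One result per clock cycle once the pipeline fills.
  for (int i = 0; i < size; ++i) {
#pragma HLS PIPELINE II=1
    out[i] = a[i] + b[i];
  }
}
}
```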

I’d be curious to hear your thoughts on this, @Ravenwater and everyone else. I agree nevertheless that we should push towards a similar user experience, both time-wise and development-flow-wise.

3 Likes

@vmayoral The sw_emu/hw_emu/hw targets you describe are solid and will be very productive. Absolutely thrilled that this is coming to the community.

2 Likes

Hi @vmayoral, which ROS package are you building, and which kind of acceleration functions are you using for the tests?

What is going to be the target platform? Kria KV260? How do you manage different FPGA board resources if I want to use the kernel on another board?

Do I need to compile the kernel and FPGA design every time on my own, or will I be able to download them?

Hi @vmayoral ,
I am currently a (Xilinx) FPGA user. I would love to help with the project on open-source kernel acceleration.

1 Like

Thanks to everyone who showed interest! The WG has been announced and the first meeting called. See details at Announcing the Hardware Acceleration WG, meeting #1.

In case you’ve missed it, the first meeting of the WG will happen next week at: 2021-06-30T18:00:00Z.

  • Coordinates: Zoom
    • Phone one-tap: US: +17209289299,99299967182#,0#,8504098917# or +19292056099,99299967182#,0#,8504098917#
    • Meeting URL: Launch Meeting - Zoom
    • Meeting ID: 992 9996 7182
    • Passcode: Xk.X73&rNY
  • Preliminary agenda:

    1. Introductions
    2. ROS 2 Hardware Acceleration WG, quick review of objectives, rationale and overview
    3. Initial hardware acceleration architecture for ROS 2 and short demonstrations
    4. Community hardware platforms (e.g. Ultra96-v2), process and steps
    5. Q&A
    6. (your acceleration project)
1 Like