Proposal for ROS 2 Hardware Acceleration Working Group (HAWG)

Hello everyone, on behalf of Xilinx, I’m happy to bring to the community a proposal for a new working group, the Hardware Acceleration WG. Find below a short description of its objectives and intentions.

Why should I care about hardware acceleration?

There’s a critical relationship between the hardware and the software capabilities in a robot. Robotic systems usually have limited on-board resources, including memory, I/O, disk or compute capabilities. This complicates the system integration process (e.g. when extending, adapting or fixing them), makes it hard to meet real-time requirements and limits the robot’s reactions, speed or capabilities. It is therefore essential to choose a proper compute platform for the robotic system: one that simplifies system integration, meets power restrictions and adapts to the changing demands of robotic applications.

CPUs are widely used commercial compute platforms in robotics due to their availability and generalized use. The general-purpose nature of CPUs makes them especially interesting for roboticists kickstarting projects. However, this comes at a cost when translating into real applications:

  • their fixed architectures hinder adaptability to new robotic scenarios. Additional demands often require additional hardware, which usually implies additional system integration.
  • their general-purpose nature leads to timing inefficiencies, which hurts determinism (making hard real-time deadlines difficult to meet).
  • their power consumption is generally one or two orders of magnitude above specialized deeply pipelined compute architectures realized with FPGAs and ASICs.

Acceleration with dedicated compute architectures (in FPGAs, GPUs or ASICs) is presented as an alternative to CPUs: one that makes it possible to adaptively generate custom computing architectures that meet robotic demands, delivering a mixed-criticality solution that can comply with real-time requirements while increasing reliability and lowering power consumption. In particular, we at Xilinx know that FPGAs are heavily used by popular industrial manufacturers as well as in automotive, medical and space robotic applications.

If this proposal receives enough attention, we plan to move forward with a formal WG. We’re seeking here to drive the discussion on acceleration and engage with hardware, software and embedded engineers, as well as other silicon vendors and groups, to jointly get more involved, coordinate and contribute in the open with accelerators that impact ROS 2 and Gazebo flows.

Here’s the proposal:

Objective:

Drive creation, maintenance and testing of acceleration kernels on top of open standards (C++ and OpenCL) for optimized ROS 2 and Gazebo interactions over different compute substrates (including FPGAs, GPUs and ASICs).

Overview:

The Hardware Acceleration WG will focus on reducing the time (real fast, as opposed to real-time) to compute Gazebo and ROS 2-based robotic flows by producing specialized compute architectures that rely on FPGAs, GPUs or ASICs. Alignment with various other WGs is expected on specific acceleration topics (e.g. with Real-time WG to come up with dual real-time (deterministic) and real-fast (low latency) accelerators, with Embedded WG to leverage MCU-based ROS 2 implementations in soft cores, with Navigation to ensure AMCL accelerators meet their needs, etc.).

Initially, accelerators will target ROS 2 underlayers to optimize interactions between nodes within the ROS computational graph. After this, we’ll pivot into compute architectures at the application level with initial consideration for a) perception, b) actuation, c) navigation and d) manipulation (in no specific order for now). The acceleration kernels will initially target embedded hardware devices, but future efforts will attempt to extend the kernels to workstations, data centers or cloud as applicable.

The work of this WG will leverage open standards for parallel computing and acceleration, namely OpenCL and libraries built on top. While maintaining a ROS-centric experience, we expect ROS users to be able to adapt and reuse the resulting kernels for their applications. To that end, we expect to produce tools and conventions that make it seamless to include hardware acceleration in the existing ROS build system (ament) and meta build tools (colcon). To facilitate initiation and engagement, reference acceleration architectures, designs and examples will be produced. Ultimately, testing and benchmarks will be proposed to evaluate accelerators across the power-consumption and time domains.

WG goals and initial roadmap (date estimations):

The WG seeks to do the following for ROS 2:

  1. Design tools and conventions to seamlessly integrate acceleration kernels and related embedded binaries into the ROS 2 computational graphs leveraging its existing build system (ament) and meta build tools (colcon). In a nutshell, we want to build awesome tools and conventions for running ROS components while leveraging hardware acceleration in a simplified manner. (2021Q3/Q4)
  2. Provide reference examples and blueprints for acceleration architectures used in ROS 2 and Gazebo. First targeting ROS 2 underlayers. Second, stacks on top of ROS 2 and Gazebo. (2022+)
  3. Facilitate testing environments that allow benchmarking accelerators, with special focus on power consumption and time. (2022Q1)
  4. Survey the community interests on acceleration for ROS 2 and Gazebo. (2022)
  5. Produce demonstrators with robot components, real robots and fleets that include acceleration to meet their targets. Disseminate results and show how ROS 2 fulfills industry needs. (2022Q2/Q3)
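
As a rough illustration of goal 3, a benchmark harness could report latency statistics together with an energy estimate (power times elapsed time). The sketch below is only an assumption of how such a harness might look: `benchmark`, `BenchmarkResult` and the `read_power_watts` callback are hypothetical names, and a real setup would read power from a board sensor or an external meter rather than the constant used here.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BenchmarkResult:
    mean_latency_s: float
    max_latency_s: float   # worst case matters for determinism
    energy_joules: float


def benchmark(kernel: Callable[[], None],
              read_power_watts: Callable[[], float],
              runs: int = 100) -> BenchmarkResult:
    """Time `kernel` over `runs` calls and estimate the energy it used.

    Energy is approximated per run as (power sample) * (run latency).
    `read_power_watts` is a hypothetical hook for a power sensor.
    """
    latencies: List[float] = []
    energy = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        kernel()
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        energy += read_power_watts() * elapsed
    return BenchmarkResult(
        mean_latency_s=statistics.mean(latencies),
        max_latency_s=max(latencies),
        energy_joules=energy,
    )


def cpu_kernel() -> None:
    # Placeholder workload standing in for a ROS 2 node callback.
    sum(i * i for i in range(10_000))


# Assumed constant 5 W draw, purely for illustration.
result = benchmark(cpu_kernel, read_power_watts=lambda: 5.0, runs=20)
```

Running the same harness against a CPU implementation and an accelerated one would give comparable numbers in both the time and the power-consumption domain.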

Selected acceleration-related past threads:

Pinging a subset of the folks who I’m aware have shown past interest in acceleration and hardware architectures: @rgov, @ak-nv, @pedrombmachado, @YangZ, @Flemming_Sundance, @tomoyafujita, @MartinCornelis, @arthur.berkowitz, @jopequ, @kscharan

Pinging some more @marguedas, @codebot, @jlamperez, @Amamory, @malapatiravi, @amitgoel, @kunaltyagi, @simonschmeisser, @peci1, @smac

Sounds interesting! I’d be curious to learn more about which exact abstractions within ROS would be targeted. There’s a lot that can be done with FPGAs and it’s a great way to try out new dedicated hardware!

This is a great initiative! :ok_hand:

Sounds interesting! I’d be curious to learn more about which exact abstractions within ROS would be targeted. There’s a lot that can be done with FPGAs and it’s a great way to try out new dedicated hardware!

Glad to hear you also find it interesting @gbalke! FPGAs are indeed great for prototyping and accelerating specialized compute architectures in robotics. I highly encourage you to read through [2009.06034] A Survey of FPGA-Based Robotic Computing, which nicely surveys recent work in that direction.

To your question, the initial plan, as indicated in point 2 above, is to focus on ROS 2 underlayers. In particular, and for starters, we’d like to empower ROS 2 users to leverage time-sensitive communications at the Data Link layer (OSI L2) and obtain distributed sub-microsecond synchronization capabilities from the ROS 2 computational graph. This is strategic and core for distributed computational graphs as discussed in here. Without optimized computational graphs, it’ll be hard to leverage other accelerators, since the graph itself will be the bottleneck.

    ROS 2 stack              Software stack               OSI stack

+--------------------+    +--------------------+    +--------------------+    
|    user land       |    |                    |    |                    |
+--------------------+    +                    +    +                    +    
| ROS client library |--->|                    |    |                    |
+--------------------+    +       ROS 2        +    +   7. Application   +    
|   middleware iface |    |                    |    |                    |
+--------------------+    +                    +    +                    +    
|    DDS adapter     |    |                    |    |                    |
+--------------------+    +--------------------+    |                    |
|                    |    |                    |    +--------------------+
|     DDS impl       |--->|         DDS        |    |   6. Presentation  |
|                    |    |                    |    +--------------------+    
+--------------------+    |                    |    |   5. Session       |
                          +--------------------+    +--------------------+
                          |                    |    |   4. Transport     |
                          |      UDP / IP      |    +--------------------+
                          |                    |    |   3. Network       |
                          +--------------------+    +--------------------+
                          |                    |    |   2. Data Link     |
                          |      Ethernet      |    +--------------------+
                          |                    |    |   1. Physical      |
                          +--------------------+    +--------------------+

From there, HAWG should go up based on the input we get. We don’t have a specified order and that’s what this post is for, to get your input. We’ve heard in the past that it’d be awesome to run DDS in the Programmable Logic (PL, the FPGA itself) optimizing intra-network flows while leveraging specialized communication buses for nanosecond-level intra and inter-process communications. We also have some additional ideas ourselves but as pointed out, we’d love to hear your thoughts and needs.

One important aspect is that we don’t want HAWG to be just about FPGAs. It’d be a waste since many accelerators can nowadays be developed with OpenCL and modern C++ (leveraging HLS). We are hoping to create some “open” accelerators and inspire others to contribute impacting not just ROS 2, but also Gazebo. Wouldn’t it be awesome to supercharge our simulations with GPUs?

I did a quick pass-over of the paper. Very interesting overview and confirming a lot of what I’ve seen! I really like FPGAs for edge applications but it gets a bit trickier in core compute.

Partial Reconfiguration (PR) takes this flexibility one step further, allowing the modification of an operating FPGA design by loading a partial configuration file. Using PR, part of the FPGA can be reconfigured at runtime without compromising the integrity of the applications running on those parts of the device that are not being reconfigured.

I didn’t even realize this was a feature. This is amazing :exploding_head:

FPGA programming is still much more challenging than regular software programming, and the supply of FPGA engineers is still limited.

This was my biggest concern w/r/t FPGAs moving into ROS. It is a very specific skillset and it’s just a question of manpower to be able to bring up and support them. I’ve worked on the Zynq Ultrascale platform for a couple of years and really enjoy how much effort Xilinx has put into the tools/education portion of the product. I definitely think the combination of FPGA + CPU is a game changer and will continue to attract a lot of interest from industry (which will hopefully in turn mean more engineers)!

We’ve heard in the past that it’d be awesome to run DDS in the Programmable Logic (PL, the FPGA itself) optimizing intra-network flows while leveraging specialized communication buses for nanosecond-level intra and inter-process communications.

DDS is my first thought on what could be optimized but that exists outside of ROS so it’s likely an independent project? I’m not sure where the entry point is for that. I had some concerns about the fact that there will be limitations regarding the node count/message size. I noticed an attempt at this for ROS 1 that seemed to go well!

We are hoping to create some “open” accelerators and inspire others to contribute impacting not just ROS 2, but also Gazebo. Wouldn’t it be awesome to supercharge our simulations with GPUs?

I 100% agree that optimizations with highly parallelized compute would be awesome and very much part of the modern workflow. I have not done much work in the area but have seen plenty of proof from the computer graphics + ML worlds to know that it’s great for physics environments with lots of variables :grinning_face_with_smiling_eyes:

Note: Reposted to maintain reply linking.

Totally. This has historically been one of the big hurdles for acceleration adoption. Not just in robotics btw. The Not Invented Here syndrome is a big thing in the acceleration world. Big silicon vendors are too self-centric and, rather than contributing to communities, prefer to reinvent the wheel with their yet-not-new-feature-in-acceleration-language-A. This is nuts. My team at Xilinx wants to tackle this and contribute to the ROS 2 community directly, connecting to real ROS users’ needs and simplifying things.

This is what Goal 1 above is for in our roadmap (first one!). We’ll build tools that facilitate the integration of accelerators directly into ROS packages. In a nutshell, building an arbitrary ROS package with or without acceleration should be the same experience. Just colcon build-it.

If enough people are interested in this, our plan is to disclose and contribute our implementations while pushing forward conventions, so that other silicon vendors can build on them (instead of, yet again, reinventing the wheel).

I’m surely forgetting many, but here’re a few more folks that may find this of interest @Jmeyer, @andreucm, @crystaldust, @DavidT, @LiyouZhou, @cheng.chen, @yukkysaito, @chang.su, @ArunPrasad, @kosuke55

@vmayoral that’s exciting news! On behalf of the micro-ROS team and as coordinators of the EWG, consider us definitely interested in collaborating on this! It’d be nice to organize a dedicated meeting someday soon, we’ll contact you by pm. In the meantime, don’t hesitate to bring the topic to the upcoming EWG meetings to raise a public discussion with the community!
[Next one will be held on Tuesday the 4th of May at 5 pm CEST here]

Absolutely @FraFin! There’s definitely room for cooperation between EWG and HAWG! I look forward to it and I’m confident the MCU focus you guys have is a good fit with the architectures we’re starting to prototype.

This is a great idea! It would be wonderful to have more free open source FPGA IP cores that a community can share, fix, and enhance.

My group creates, among other things, custom high speed cameras by connecting imagers to Xilinx FPGAs so we can perform image processing directly on the camera. I would love to create a TSN/DDS/ROS interface to the cameras.

One issue we always run into is that we have to purchase many FPGA IP cores and they are black boxes that we cannot inspect or modify. They sometimes have bugs that no one is interested in fixing and that makes many people hesitant to use FPGAs.

Often people prefer doing image processing on GPUs instead, but the nondeterminism and the delays of transferring images from a camera to a GPU via a CPU kill performance.

It would be great to have a way of feeding images directly from an FPGA to a GPU and doing image processing on them both, in whichever ways each is optimal, using all free open source acceleration kernels.

There is an Open-source FPGA Foundation with companies such as QuickLogic who are using Open-Source FPGA tools, so if not done so already, it might be worth contacting them and seeing if they are interested in working with the ROS folks to improve FPGA support in ROS 2.
https://osfpga.org/

Awesome! Looking forward to it :star_struck:

Lovely to hear @peterpolidoro! Empowering these kinds of use cases should be a key driver within this group.

Noted!

Thanks, I sent an invitation to join the group and the community. Hope to hear from them.

Have you given any thought to SYCL support? I’ve had reasonable success keeping a single codebase for relatively different platforms (HEDT, edge CPU and GPU) using SYCL 2020 (with C++17 support). The ecosystem is new, but with support for OpenCL and HIP (courtesy of AMD), it looks promising.

The official implementation triSYCL even has some extensions for Xilinx FPGAs (disclaimer: I haven’t played with those or seen them in action)

Thanks for organizing this, @vmayoral, and it does sound promising. The biggest challenge that I see is that there isn’t a great API for write once, run everywhere accelerated code. Since Nvidia has not published an OpenCL interface driver for their popular Jetson platform, it would exclude these devices. I haven’t looked at SYCL but I would imagine that if someone had achieved it, we’d all be using it already.

I was waiting for a similar initiative for a long time! Great to know that it is progressing. I am personally particularly interested in GPU HW acceleration as almost every x86 CPU comes nowadays with a “free” GPU.

This is a very interesting initiative. We have been working with the Ultra96 boards with ROS for some time and this year started using Vitis. Partial reconfiguration of FPGAs is also something we’ve been trying to look into, but we haven’t had enough time or people on this side of research.

@vmayoral definitely a great goal to organize the compute acceleration community around an application framework like ROS-2. I am not a big fan of either OpenCL or HLS for compute accelerators. OpenCL is too constrained for the complex concurrency patterns in optimization and linear algebra kernels, since the good stuff that needs acceleration isn’t GEMM. HLS is difficult to use in an application development environment, since its compilation times are not commensurate with the edit/compile/test cycles that the software community expects.

The ROS-2 computational graph is a better structure to leverage. There are many examples of these types of graph runtimes in HPC to manage expanding and contracting concurrency (for example MAGMA), and the nodes of the graph represent well-defined APIs and dispatch abstractions. The DAG scheduler would also be able to synthesize compute and latency demands that would be instrumental in driving either acceleration design or, potentially, hardware synthesizers that yield accelerators perfectly tuned to the computational load.
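
To make the idea concrete, here is a minimal sketch of how a scheduler could derive a latency demand from a computational graph: given per-node latency estimates, the longest (critical) path gives the end-to-end latency and points at the nodes worth accelerating first. The node names and latency figures are made up for illustration, and `critical_path` is a hypothetical helper, not part of ROS 2.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List, Set, Tuple


def critical_path(deps: Dict[str, Set[str]],
                  latency_ms: Dict[str, float]) -> Tuple[float, List[str]]:
    """Return (end-to-end latency, nodes on the longest path) of a DAG.

    `deps` maps each node to the set of nodes it depends on.
    """
    finish: Dict[str, float] = {}
    best_pred: Dict[str, str] = {}
    # Visit nodes so that every dependency is processed before its user.
    for node in TopologicalSorter(deps).static_order():
        preds = deps.get(node, set())
        start = 0.0
        if preds:
            slowest = max(preds, key=finish.__getitem__)
            best_pred[node] = slowest
            start = finish[slowest]
        finish[node] = start + latency_ms[node]
    # Walk back from the node that finishes last.
    end = max(finish, key=finish.__getitem__)
    path = [end]
    while path[-1] in best_pred:
        path.append(best_pred[path[-1]])
    return finish[end], path[::-1]


# Illustrative graph and latencies (assumed values, in milliseconds).
deps = {
    "camera_driver": set(),
    "rectify": {"camera_driver"},
    "detect": {"rectify"},
    "lidar_driver": set(),
    "localize": {"lidar_driver"},
    "plan": {"detect", "localize"},
}
latency_ms = {
    "camera_driver": 2.0, "rectify": 4.0, "detect": 15.0,
    "lidar_driver": 1.0, "localize": 6.0, "plan": 3.0,
}

total_ms, path = critical_path(deps, latency_ms)
```

With these assumed numbers the critical path is camera_driver, rectify, detect, plan at 24 ms, so the perception branch (and `detect` in particular) would be the first candidate for acceleration.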

The problem I see on the horizon is that this hardware acceleration problem of non-trivial computational graphs is likely going to be solved more urgently in the networking/telecom space than in the robotics space, as the applications, money, and skillsets are readily available in that vertical, whereas in robotics, budgets and teams are simply too small to really deliver on this engineering.

Personally, I think the most productive path forward here is to focus on bottleneck operators in perception, and use architectural simplifications to deliver application level value quickly and cost-effectively.