Thanks for your comments and input, everyone! Lots of interesting feedback. We're glad to see the interest our initiative raised. We've collected all the feedback and new ideas and plan to include them in a follow-up thread officially launching the WG.
We'll give everyone a few more weeks to share their thoughts, but stay tuned. A few comments from our side:
This is an interesting angle @kunaltyagi. I haven’t looked at SYCL but I believe this would be a great contribution/extension to the initial hardware acceleration architecture.
True @rgov, but we believe that by relying on open standards we'll maximize our chances of succeeding here. That's why we're proposing to bet on C, C++ and OpenCL, which can be used today for programming hardware accelerators in many devices (across compute substrates, including FPGAs, CPUs and GPUs).
Dynamic Function eXchange (DFX) (formerly known as Partial Reconfiguration (PR)) is central to our approach at Xilinx, @jopequ. That's indeed a feature we will be leveraging while integrating such capability directly into ROS 2, so that using it doesn't require you to be a hardware expert.
Before diving into DFX, though, getting up to speed with accelerators is a must (that's how we came up with the plan above). We don't yet have a timeline for DFX integration, but stay tuned if you're interested, or ping me if you have the resources to push on that front and would like to contribute.
Welcome to the community @Ravenwater. HLS has traditionally been a pain, I agree. Things are changing though.
Though I certainly can't generalize for all hardware, speaking for our solutions at Xilinx, and more specifically our Kria portfolio, we propose three build targets for HLS:

- **Software Emulation (`sw_emu`)**: The kernel code is compiled to run on the host processor. This allows iterative algorithm refinement through fast build-and-run loops. This target is useful for identifying syntax errors, performing source-level debugging of the kernel code running together with the application, and verifying the behavior of the system. Simply put, a transformation which runs all the code in an emulated processor matching the K26 SOM, as if there weren't any accelerator.
- **Hardware Emulation (`hw_emu`)**: The kernel code is compiled into a hardware model (RTL), which is run in a dedicated simulator. This build-and-run loop takes longer but provides a detailed, cycle-accurate view of kernel activity. This target is useful for testing the functionality of the logic that will go in the FPGA and getting initial performance estimates. In other words, a simulation within an emulation: the FPGA is simulated and runs inside of an emulation (QEMU), sitting alongside the emulated processors and allowing you to obtain performance estimates without a full hardware build.
- **Hardware (`hw`)**: The kernel code is compiled into a hardware model (RTL) and then implemented on the FPGA, resulting in a binary that will run on the actual FPGA.
`sw_emu` allows you to run both the host code and the accelerator inside an emulator (QEMU), and the transformation takes about a minute. This includes the kernel's code, which runs as if it were host code. I'm using this approach every day and I believe it is sufficient to meet the common software development flows in ROS/robotics.
Here are some numbers I just produced for the sake of the argument:
| build target | ROS 2 accelerated package build time |
| --- | --- |
| `sw_emu` | ≈ 1 minute |
| `hw_emu` | ≈ 4 minutes |
| `hw` | ≈ 23 minutes |
Note that this depends heavily on your developer workstation characteristics; mine is based on an AMD Ryzen 5 PRO 4650G.
These numbers are approximate and based on a ROS 2 package in my current workspace, consisting of host code plus a very simple kernel.
I'd be curious to hear your thoughts on this, @Ravenwater and everyone else. I agree, nevertheless, that we should push towards a similar user experience, both time-wise and development-flow-wise.