Challenges of GPU acceleration in ROS

rgov · March 5, 2021, 5:20pm

(I’m trying to phrase this not as a support request, more of an observation of a general problem.)

I have been using an Nvidia Jetson TX2 on an autonomous vehicle that uses a camera to follow a target. The benefit of the Jetson platform of course is that you have access to CUDA for acceleration.

I haven’t settled on whether I will write my own node to actually track the target or use a community-developed package, but either way the solution will probably end up using OpenCV. Already I am using the spinnaker_sdk_camera_driver for the camera which in turn uses cv_bridge, both of which require OpenCV.

The conundrum is how to actually make use of the GPU compute capability of the Jetson.

On the Jetson, which does not support OpenCL (as of 2019), the only option is to use the CUDA API, which is an invasive code change.

(It is a bit of an exercise even to install OpenCV with CUDA, as Nvidia’s published package does not include the CUDA module that they themselves contributed…?)

The consequence of this is that an accelerated OpenCV-based computer vision pipeline in ROS would basically require parallel implementation of several different nodes. (And the more message passing involved in the pipeline, the less you realize the benefit of acceleration.)
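The parenthetical about message passing can be made concrete with a rough cost model. This is only a sketch: the stage timings and the transfer cost below are hypothetical numbers, and the model assumes every node boundary forces a download to host memory and a fresh upload to the GPU.

```python
def pipeline_time_ms(stage_compute_ms, transfer_ms, fused):
    """Total time for a chain of GPU stages.

    fused=True:  all stages share one node, so there is one upload at the
                 start and one download at the end.
    fused=False: every stage is its own node, so each stage pays for an
                 upload before and a download after its computation.
    """
    transfers = 2 if fused else 2 * len(stage_compute_ms)
    return sum(stage_compute_ms) + transfers * transfer_ms


# Hypothetical timings in milliseconds (not measurements).
cpu_stages = [4.0, 6.0, 2.0]   # the same three stages on the CPU
gpu_stages = [1.0, 1.5, 0.5]   # the same three stages on the GPU
transfer = 2.0                 # one host<->device copy of a frame

cpu_total = sum(cpu_stages)
fused_total = pipeline_time_ms(gpu_stages, transfer, fused=True)
split_total = pipeline_time_ms(gpu_stages, transfer, fused=False)

print(f"CPU only:        {cpu_total} ms")
print(f"GPU, one node:   {fused_total} ms")
print(f"GPU, three nodes: {split_total} ms")
```

With these made-up numbers, the three-node GPU pipeline ends up slower than the CPU-only one, even though each individual stage is four times faster on the GPU.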

This problem probably generalizes to other types of accelerated computing, but it is less likely that, say, an ML model would require multiple nodes, so I think it especially impacts vision pipelines.

What can the community do, if anything, to support increasingly popular accelerated computing platforms like the Jetson, which often require different APIs, without creating fragmentation?

Katherine_Scott · March 5, 2021, 6:14pm

I think @ak-nv and @amitgoel are ROS developers at NVIDIA. Perhaps they can help route your request to the right people.

smac · March 5, 2021, 10:04pm

I had no idea that was true. That puts the brakes on some work I had planned on supporting GPUs in some of the Nav2 algorithms. I’ll probably scale back that project to just multi-threading support now.

I’m not really sure, I think the biggest thing is for Nvidia to come to the table and start contributing GPU optimizations in ROS / related projects to let their users make use of these techniques when reasonable. I don’t really see any way around that first incremental step. From a starting ground that Nvidia’s involved and making reasonable efforts to support their own customers in this area, things could really start to snowball from there quickly.

It’s hard for me, for instance, to justify any significant amount of time developing GPU optimizations requested by Nvidia’s customers when Nvidia themselves aren’t really doing it in any part of ROS. Plus the CUDA comment above further makes me less interested if my OpenCL work wouldn’t even work on the Jetsons anyway. I would probably merge CUDA optimizations in my projects, but I wouldn’t develop any myself. If I do something, I want it to be completely cross-platform and impact the maximum number of people possible, and CUDA can never be that unless it gets supported by every major GPU manufacturer. I don’t write CPU code thinking that 50% of the market can’t use my work, why would I accept that for a GPU?

This motivation is clearly different for companies building discrete products. They can just build hardware-limiting code based on a product line. For open-source though, that doesn’t align as well, since we want our work to be usable by everyone, and a lot of work goes into making our implementations very general and complete to enable that (from design, to the choice of algorithms, to code readability, language choice, and more).

So I think the major items would have to be

Nvidia contributing the GPU optimizations that their customers are asking for into ROS and related projects
OpenCL support on Jetsons so that open-source developers can support GPUs themselves
Companies / groups that leverage GPUs, and so need these optimizations written, maintaining and contributing them to open-source themselves (less easy to solve)

peci1 · March 5, 2021, 10:31pm

Are you sure the OpenCV preinstalled with Jetpack isn’t accelerated? What sense would that make?

ak-nv · March 5, 2021, 11:29pm

On the Jetson platform, OpenCV comes pre-installed with JetPack. You can check out our newly built Docker images with CycloneDDS here. We do offer accelerated AI packages and CUDA-accelerated PCL library support on ROS/ROS2, which can be found on the ros2_jetson GitHub page.

We are working towards building more accelerated packages and end-to-end projects on Jetson with ROS/ROS2. We can certainly discuss these points in the Edge WG meeting.

Looking forward to contributing. Please let us know if there are any specific packages you are looking for, and we will certainly add them to our priority list.

rgov · March 5, 2021, 11:29pm

See this thread. But even if it is accelerated, you must use the CUDA APIs, so just dropping in the accelerated version and running existing ROS nodes that are not designed to use those APIs will not gain anything from it.
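To illustrate why a drop-in replacement is not enough, here is a minimal sketch of what opting in looks like with OpenCV's Python bindings. The helper names `cuda_available` and `to_gray` are hypothetical; the OpenCV calls used (`cv2.cuda.getCudaEnabledDeviceCount`, `cv2.cuda_GpuMat`, `cv2.cuda.cvtColor`, `cv2.cvtColor`) exist in CUDA-enabled builds, but treat the overall structure as an illustration rather than a recommended design. The module is passed in as a parameter so the sketch does not depend on a particular build being installed.

```python
def cuda_available(cv):
    """Return True if this OpenCV build reports at least one CUDA device."""
    cuda = getattr(cv, "cuda", None)
    if cuda is None:
        return False
    try:
        return cuda.getCudaEnabledDeviceCount() > 0
    except Exception:
        # Treat any failure to query the device count as "no GPU".
        return False


def to_gray(cv, frame_bgr):
    """Convert a BGR frame to grayscale, on the GPU when possible."""
    if cuda_available(cv):
        # The CUDA path uses different types and functions: the frame must
        # be uploaded to a GpuMat, processed with cv.cuda.*, and downloaded.
        gpu_frame = cv.cuda_GpuMat()
        gpu_frame.upload(frame_bgr)
        gpu_gray = cv.cuda.cvtColor(gpu_frame, cv.COLOR_BGR2GRAY)
        return gpu_gray.download()
    # The plain CPU path that existing nodes already call.
    return cv.cvtColor(frame_bgr, cv.COLOR_BGR2GRAY)
```

Usage would be along the lines of `import cv2; gray = to_gray(cv2, frame)`. A node that only ever calls `cv2.cvtColor` stays on the CPU path no matter which OpenCV build is installed.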

@ak-nv That’s interesting that installing with Jetpack gives an accelerated version but the L4T apt repositories do not contain it I guess?

ak-nv · March 5, 2021, 11:48pm

You can either install it through the JetPack SDK Manager or, as I would suggest, this way, which is the easiest.

smac · March 6, 2021, 1:20am

Top priority for all ROS users:

Rviz
PCL, a very important dependency
Gazebo / Ignition rendering
BFL / filtering libraries / Ceres / g2o / etc

Medium priority, for navigation users:

grid_maps update / processing / insertion / access, which many folks use (and which is the future for Nav2’s environmental height modeling)
voxel and obstacle layer raycasting in costmap_2d
DWB critic and trajectory evaluation
costmap layer updates in costmap_2d
OpenVSLAM
AMCL / new localization framework in Nav2 (in progress, can’t expect contributions until done)
Likely more. These are the obvious low-hanging fruit to me, but I’m not an expert in what can easily be GPU-ized.

Lower priority nice to haves:

Image transport compression plugins
image_pipeline OpenCV with GPU acceleration options

chfritz · March 6, 2021, 1:38am

+1 on “voxel and obstacle layer raycasting in costmap_2d”

smac · March 6, 2021, 1:49am

Yeah, this would go a long way to having it work on smaller platforms (and when using 3D lidars or something). Long term, we want to move to a grid_maps-based height map solution, but costmap_2d will be in Nav2 forever, since a height map is not appropriate for everyone. A lot of people genuinely do work in fully planar environments. Costmap2D is the ripest place for impact in Navigation; even some extremely coarse testing of parallelizing the costmap layer updates with OpenMP sped things up 20%.
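For a feel of what a figure like that implies, Amdahl's law gives a quick back-of-the-envelope check. The sketch below makes two assumptions that are not stated in the post: that "20%" means a 1.2x overall speedup, and that the test ran on four threads.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Overall speedup when only part of the work runs in parallel."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


def parallel_fraction_for(speedup, workers):
    """Invert Amdahl's law: the parallel fraction that yields `speedup`."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / workers)


workers = 4              # assumption: thread count is not given in the post
observed_speedup = 1.2   # assumption: "20%" read as a 1.2x speedup

p = parallel_fraction_for(observed_speedup, workers)
print(f"Implied parallel fraction: {p:.3f}")
print(f"Check: {amdahl_speedup(p, workers):.3f}x on {workers} workers")
```

Under those assumptions, only a bit over a fifth of the update cycle would be running in parallel, which is consistent with the test being described as extremely coarse.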

ak-nv · March 6, 2021, 3:29am

Thank you for the list Steve.

You can find the CUDA-accelerated PCL on GitHub, along with a brief blog post link.

Will keep you all posted here with more updates.

vmayoral · March 6, 2021, 6:18am

My team and I bumped into the same situation with Jetson in the past. We ended up moving to different compute substrates for that specific reason.

While this is true from a tooling perspective, I’d also mention that ROS 2 users have additional priorities computational graph-wise which should be considered. E.g. it appears to be quite a hot topic to optimize interactions between nodes (inter-process, intra-process and even over the network). Accelerators on this end will add significant value to the overall ROS 2 computational graphs (and even to the underlying data layer graphs, which, at the end of the day, matter the most in real deployments). In other words, while it’d be great to have OpenCL kernels to accelerate specific computations, real impact won’t be achieved unless a holistic ROS view is applied (bottlenecks identified, etc.).

Determinism and real-time are also important topics to consider. While GPUs have historically performed poorly on RT aspects, most robotic behaviors have some sort of deadlines. It’d be great to hear from @ak-nv and team how pipelined accelerators in GPUs could help here as well.

Emphasizing one of @smac’s points above, I’d heavily encourage you and team @ak-nv to consider GPU accelerators for simulation. It’s a long-held dream of many to get such capabilities, and this somehow aligns well with your current trend of AI-related demos. This can heavily impact ROS developers but also AI researchers (e.g. RL setups like this could heavily benefit from GPU simulation accelerations).

As a point of criticism @ak-nv, NVIDIA is somehow disengaged from the ROS community (with their Gems, its own physics engine not integrated in Gazebo/Ignition, its own UIs, etc). IMHO there are lots of wasted resources here. ROS/Gazebo communities already provide much of that and there’s a decent level of “reinventing the wheel”. Why don’t you consider integrating PhysX in Gazebo and/or helping developers accelerate their simulations and integrate it in their DevOps pipelines (that’s a fantastic business case for selling more GPUs to roboticists)?

gavanderhoorn · March 6, 2021, 5:28pm

The big challenge I believe – as @smac already hints at – is platform / technology dependence and vendor lock-in (@rgov points to the same problem essentially: “ROS would basically require parallel implementation of several different nodes”).

OpenCL was intended to provide a neutral environment for CPU/GPU/ASIC accelerated workflows, but it doesn’t seem like it’s achieved the same acceptance/uptake as the vendor-specific offerings have.

I’m far from an expert though, so I’d love to hear I’m wrong, or whether there is an alternative to OpenCL these days (SYCL?).

Without a cross-platform and vendor-agnostic development and execution environment we may be required to maintain parallel implementations to support the available hw accelerators. Realistically, there are only three when it comes to GPUs: AMD, NVIDIA and Intel.

(that’s far from a desirable situation of course)

Chris_Albertson · March 6, 2021, 6:28pm

The parallel node design is not so bad in ROS2 as you can do “zero copy” message passing where the cost of moving data is very low.

About OpenCV. It is very good for development, but after you have your algorithm working, how much of OpenCV do you really use? For most of us, it is maybe a half dozen functions out of the hundreds available. You could implement the one or two functions you need yourself. This is what I assume is meant by “parallel nodes”. But if those nodes are inside the same Linux process the cost is low.

BTW, I’ve decided I’m better off using a Raspberry Pi 4 and a Google/Coral “USB Accelerator”. But this is limited to int-8, while the Jetson does have floating-point. If you can use int-8, TensorFlow runs well on it.
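The difference between copying a message and handing it over in place can be shown with a small analogy in plain Python. This is not the ROS 2 API (in ROS 2 the intra-process path lives in the client library and applies to nodes composed into one process); the function names and the frame size are made up for illustration.

```python
# Hypothetical 8-bit mono frame, 640x480 pixels.
image = bytearray(640 * 480)


def publish_copy(buf):
    """Hand the subscriber its own copy, as serialization would."""
    return bytes(buf)


def publish_zero_copy(buf):
    """Hand the subscriber a view onto the publisher's buffer."""
    return memoryview(buf)


copied = publish_copy(image)
shared = publish_zero_copy(image)

# Changing the original shows which subscriber shares the storage.
image[0] = 255
print(copied[0])  # the copy still holds the old value
print(shared[0])  # the view sees the new value
```

The copying path costs time and memory proportional to the frame size on every hop, while the shared path costs the same regardless of how large the frame is.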

gavanderhoorn · March 6, 2021, 6:31pm

Yes, it’s about the cost of maintenance of multiple sets of nodes which do the same thing, but with a different back-end for GPU/ASIC acceleration.

It would be the opposite of what we use CBSE frameworks like ROS for, in my opinion.

Alan_Federman · March 6, 2021, 11:10pm

In general, I am frustrated by Nvidia’s lack of support for ROS Noetic/Foxy. In particular, after spending a long time upgrading the standard Jetpack Ubuntu to 20.04 and ROS Noetic, I discovered there is no CUDA support in a current Ubuntu LTS release. Nvidia’s response apparently is “Just run ROS in docker containers, and we will get around to a Jetson 20.04 version when we feel like it.”

My main reason for getting the Jetson Nano over another SBC for a small ROS robot was GPU-enabled AI in vision (OpenCV 4) and deep learning (YOLO) applications.

rgov · March 7, 2021, 12:28am

I don’t mean to start a thread to vent about Nvidia. I do think it’s a great sign that there are contributors there like @ak-nv who join us on the forum and are thinking about how to improve ROS on their hardware.

I’m looking forward to Ubuntu 20.04 support for Jetson too, but I would argue that it is actually pretty cool that with containerization we can run a new OS release without waiting for the hardware vendor to release a new board support package. And on the flip side, if your dependencies are stuck on Melodic, when L4T moves on to newer releases, you’ll still be able to run Melodic in a container.

doisyg · March 7, 2021, 11:53am

I am very happy about this discussion, as it is a vast topic and many things can be done.

My 2 cents as a ROS user:
I see two domains that can be distinguished, with different constraints:
- Production environment
- Development environment

For a production environment, I fully agree with @smac that the HW optimizations should, as much as possible, NOT be vendor specific. I cannot allow my developments to be too dependent on specific hardware. We all have stories about a vendor suddenly stopping a range of products. The ideal solution would be to use OpenGL/OpenCL. However, with much more work, there are good examples of software stacks that dynamically use the different vendor optimizations, OpenCV for instance.

Btw, just allowing OpenCV to use its HW accelerations where it is used in ROS (image_pipeline?) would be a great start.
In my applications, HW/GPU optimizations would benefit the sensor processing operations the most: filtering, transformations, and downsampling for images, laser scans and point clouds. OpenCL-accelerated PCL would be amazing.

For a development environment, I am okay with having vendor-specific optimizations. Everything related to simulation, for instance. It is acceptable for me to have development machines with a decent NVIDIA GPU to ensure smooth operation of Ignition Gazebo. But GPU rendering development demands skills quite different from robotics development (I recently discovered the world of shaders by porting laser_retro support to Ignition, and well, the learning curve is steep).

kunaltyagi · March 8, 2021, 1:28am

Re: Nvidia support for PCL

The link shared by @ak-nv is a stand-alone distribution (and, it goes without saying, unsupported by PCL).

Personally, I’m disappointed that Nvidia has gone the route of dumping binaries. No change or support for the in-built CUDA modules has been up-streamed to PCL, so it looks like someone will soon (have to) make a package that allows catkin/colcon to overlay the libraries installed by the package manager to allow seamless operation for ROS users.

vmayoral · April 29, 2021, 9:25am

Connected to the Challenges of GPU acceleration in ROS, have a look at Proposal for ROS 2 Hardware Acceleration Working Group (HAWG) and our attempt to push forward a common open front of acceleration (including GPU-based) at the ROS level.

We tried to account for the concerns and comments raised by @rgov, @smac, @Alan_Federman and others. Please do let us know if we missed something important that needs to be added to the roadmap “early”.
