Introducing rosidlcpp: Building interface packages 10x faster

Hi everyone,

I’m excited to share a side project I’ve been working on:

Motivation

Building interfaces in ROS 2 is slow. Interface packages with a lot of messages may well be the packages that take the most time to build in your workspace. If you work with ublox_msgs or px4_msgs, you know what I mean.

During the build, the CPU doesn’t look very busy: typically a single thread is running a mysterious rosidl_generator_ or rosidl_typesupport_ script, with shorter bursts in between where all cores are fully loaded.

Also, modifying a single interface file rebuilds the entire package. This, combined with the long build time, results in an unpleasant development process.

The idea behind rosidlcpp was to reimplement rosidl in C++ so that the generator scripts could be parallelized. As it turns out, the speedup from the C++ reimplementation alone was large enough that I still haven’t started on multithreading.

Issues with rosidl

Slow Python Generators

Most of the time spent building interface packages is spent on generating files. With the default install, 10 file generators need to run, each generating multiple files per interface.

I suspect there are two reasons for the generators to need this time:

  • Parsing the IDL files (rosidl_generator_type_description generates a single file per message but still takes a long time).
  • Complex file templates to generate (see the last generator rosidl_generator_py).

Slow Compilation

Zooming in on the main compilation section (after the rosidl_typesupport_ generators) reveals that some files take over 300ms to compile.

Adding -ftime-report to one of these build stages reveals that most of the time (around 90%) is spent parsing the file.

```
[3/5] Building CXX object CMakeFiles/px4_msgs__rosidl_typesupport_fastrtps_c.dir/rosidl_typesupport_fastrtps_c/px4_msgs/msg/detail/action_request__type_support_c.cpp.o

Time variable                                   usr           sys          wall           GGC
 phase setup                        :   0.00 (  0%)   0.00 (  0%)   0.00 (  0%)  1882k (  2%)
 phase parsing                      :   0.14 ( 88%)   0.11 (100%)   0.26 ( 93%)    68M ( 90%)
 phase lang. deferred               :   0.01 (  6%)   0.00 (  0%)   0.01 (  4%)  3741k (  5%)
 phase opt and generate             :   0.01 (  6%)   0.00 (  0%)   0.01 (  4%)  1746k (  2%)
 |name lookup                       :   0.04 ( 25%)   0.01 (  9%)   0.04 ( 14%)  2962k (  4%)
 |overload resolution               :   0.00 (  0%)   0.00 (  0%)   0.02 (  7%)  3727k (  5%)
 callgraph construction             :   0.01 (  6%)   0.00 (  0%)   0.00 (  0%)   272k (  0%)
 preprocessing                      :   0.01 (  6%)   0.04 ( 36%)   0.07 ( 25%)  2166k (  3%)
 parser (global)                    :   0.04 ( 25%)   0.02 ( 18%)   0.03 ( 11%)    25M ( 34%)
 parser struct body                 :   0.02 ( 13%)   0.00 (  0%)   0.02 (  7%)    17M ( 23%)
 parser function body               :   0.01 (  6%)   0.01 (  9%)   0.02 (  7%)  1857k (  2%)
 parser inl. func. body             :   0.02 ( 12%)   0.00 (  0%)   0.04 ( 14%)  2720k (  4%)
 parser inl. meth. body             :   0.00 (  0%)   0.02 ( 18%)   0.02 (  7%)  8009k ( 10%)
 template instantiation             :   0.05 ( 31%)   0.02 ( 18%)   0.06 ( 21%)    14M ( 19%)
 constant expression evaluation     :   0.00 (  0%)   0.00 (  0%)   0.01 (  4%)   118k (  0%)
 initialize rtl                     :   0.00 (  0%)   0.00 (  0%)   0.01 (  4%)    12k (  0%)
 TOTAL                              :   0.16          0.11          0.28           75M
```

The reason parsing takes so long is that the file’s includes, once expanded, result in a translation unit over 50,000 lines long and over 1 MB in size. Every generated source file includes the same set of headers, and the compiler parses them again for each file.

What rosidlcpp Does Differently

  1. Complete reimplementation in C++. The IDL files are parsed into JSON data structures and passed to the Inja template engine (Why Inja? It’s the first result when searching “C++ template engine”. I have no major complaints).
  2. Precompiled headers. The headers shared by all interfaces are precompiled once, which significantly reduces the time it takes to compile each .c/cpp file.
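To illustrate the idea behind step 1, here is the parse-to-data-then-template pipeline sketched in Python. rosidlcpp does this in C++ with JSON data structures fed to Inja; the message, field names, and template below are made up for the example.

```python
from string import Template

# Hypothetical parsed IDL, represented as plain data
# (rosidlcpp parses .idl files into a similar JSON structure).
message = {
    "name": "ActionRequest",
    "fields": [
        {"type": "uint8", "name": "action"},
        {"type": "uint64", "name": "timestamp"},
    ],
}

# Made-up templates: the real ones live in the generator packages
# and are far more involved.
struct_template = Template("struct $name {\n$members};\n")
member_template = Template("  $type $name;\n")

def render_message(msg: dict) -> str:
    """Render one message struct from the parsed IDL data."""
    members = "".join(member_template.substitute(f) for f in msg["fields"])
    return struct_template.substitute(name=msg["name"], members=members)

print(render_message(message))
```

The point is that once parsing and rendering are plain data transformations, they are cheap and easy to reason about; the template engine only ever sees already-parsed data.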

The resulting build is almost entirely spent on actually compiling files:

Replacing rosidl Generators (This is the Jankiest Part of rosidlcpp)

There is, to my knowledge, no clean way to replace existing generators with others.

To replace the rosidl_ generators, the rosidlcpp_ generators manually edit the CMake AMENT_EXTENSIONS_rosidl_generate_idl_interfaces variable (which is normally updated by calling ament_register_extension). Also, all rosidlcpp_ generators register only as rosidl_generator_packages, even though the typesupport generators would normally be expected to register as rosidl_typesupport_c[pp].

Apart from that, the build is exactly the same as the one performed by the rosidl_ generators, with rosidl substituted for rosidlcpp in a few places.

Benchmarks

Compilation times were measured on an AMD Ryzen 9 9900X (12 cores / 24 threads) with 32 GB of RAM, in a Docker container based on osrf/ros:rolling-desktop-full.

Times are taken from the .ninja_log files (there is about 1-2 seconds of additional overhead on top of them).
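For reference, the wall time can be pulled out of a .ninja_log with a few lines of Python. This is a quick sketch assuming the v5 log format (tab-separated start and end times in milliseconds, followed by mtime, output path, and command hash); check the header line of your log if in doubt.

```python
def build_wall_time(log_text: str) -> float:
    """Return wall-clock seconds from the first start to the last end."""
    starts, ends = [], []
    for line in log_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip the "# ninja log v5" header and blank lines
        start_ms, end_ms = line.split("\t")[:2]
        starts.append(int(start_ms))
        ends.append(int(end_ms))
    return (max(ends) - min(starts)) / 1000.0

# Synthetic two-entry log for illustration.
sample = "\n".join([
    "# ninja log v5",
    "0\t2090\t0\tfoo.o\tdeadbeef",
    "120\t2435\t0\tbar.o\tcafebabe",
])
print(build_wall_time(sample))  # → 2.435
```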

| Package | rosidl | rosidlcpp (not precompiled) | rosidlcpp | rosidlcpp speedup |
|---|---|---|---|---|
| test_msgs | 7.680s | 2.435s | 2.090s | 3.7× |
| px4_msgs | 72s | 14.450s | 5.685s | 12.7× |
| ublox_msgs | 42s | 5.985s | 2.670s | 15.7× |

Great and solid work!

Hope there will be a standalone hello-world sample GitHub repo.

Great job! Do you plan to propose your approach to be upstreamed and used by default?


@TonyWelte thanks for this contribution and starting the discussion! I’m excited to check it out and see what we can do to either A) make it easier to switch out these generators or B) move this implementation into a place that everyone can benefit from it.

I’m going to add it as a discussion point in the PMC meeting. Would you be interested in coming to present in the next week or two?


@TonyWelte: I would perhaps suggest submitting the changes to the various rosidl_*_generate_interfaces.cmake which add precompiled header support as a first step to getting your optimisations merged.

They should be fairly risk free (compared to replacing the whole of rosidl), rather self-contained, and on my system (72 threads, 64 GB RAM) they seem to already result in a nice speedup (not on the same order you report in the final table in your OP, but still).


Edit: I was curious where some of the bottlenecks in rosidl were, and with some really primitive benchmarking it looks like rosidl_parser.parser.parse_idl_file(..) is one of the slower functions. It’s used by many of the packages @TonyWelte mentions.

Very crudely wrapping it in a multiprocessing.Pool in rosidl_pycommon.generate_files(..), rosidl_generator_py.generate_py(..) and rosidl_generator_type_description.generate_type_hash(..) seems to speed things up approximately 5x on my system (from 180 to 35 sec for ublox_msgs).
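For the curious, the change is roughly this (a sketch, not the actual patch; parse_one below stands in for rosidl_parser.parser.parse_idl_file, which I haven’t reproduced here):

```python
from multiprocessing import get_context

def parse_one(idl_path: str) -> dict:
    # Stand-in for rosidl_parser.parser.parse_idl_file(); the real parse
    # is CPU-bound, which is what makes a process pool worthwhile.
    return {"path": idl_path, "ast": idl_path.upper()}

def parse_all(idl_paths: list[str]) -> list[dict]:
    # The sequential per-file loop replaced by a process pool. The "fork"
    # start method is assumed here (Linux); pool.map keeps input order,
    # which the generators rely on.
    with get_context("fork").Pool() as pool:
        return pool.map(parse_one, idl_paths)

if __name__ == "__main__":
    print(parse_all(["msg/a.idl", "msg/b.idl"]))
```

The workers must receive a picklable top-level function, which is why the parse call is wrapped in a module-level helper rather than a lambda.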


@peci1 That would be nice at some point, but I think it will be a while before it’s stable enough to be used by default. I think I’ll try to release a proper version for the next ROS release so more people can try it out. In the meantime, there are a few things I’ve learned along the way that would benefit rosidl, so I’ll probably make some PRs to address those.

@mjcarroll Yes, I should be able to attend the PMC meetings. Let me know when you’re planning to discuss it.

@gavanderhoorn I agree that it would be a good first step. I actually already have forks of rosidl, rosidl_typesupport, and rosidl_typesupport_fastrtps with precompiled headers (although they are not up to date with the latest changes I’ve made to rosidlcpp). rosidl_typesupport and rosidl_typesupport_fastrtps need to precompile headers that they don’t include directly (that are included by _functions.h, _struct.hpp, etc.), which I’m not too keen on. I’ll try making an issue this week on rosidl to discuss how it could be done.


The rosidlcpp generators are about 50 times faster than rosidl (last time I checked), so I would expect you could obtain a similar if not better speedup using multiprocessing on your 72-thread system.

Great work @TonyWelte!

Looks amazing @TonyWelte. I will definitely try it out.