Thank you so much for this detailed review. This means ROS1 is still at least three times as fast as ROS2 in this example. That is quite a disappointing result, especially since the issue has been known for a couple of years now.
I’d say that’s a feeling a huge number of ROS users share, and it should get much higher priority. I’ve been teaching ROS1 and ROS2 for more than 10 years now, and ROS2 still brings much more trouble when debugging communication problems, which is frustrating for newcomers and far beyond their scope. (On top of some other hurdles for beginners, including the build process, especially for Python.)
So +1 to make ROS2 just as easy to use and implement as ROS1 was.
Forgive me if this exists and I was too lazy to find it: is there an architecture diagram somewhere that follows the path of a message from one node to another, including any potential serialization, all the way down to the network protocol (or local shared memory) in use? It’d be great to get this drawing for ROS1 and compare it to a few of the different RMWs in ROS2, just to get a starting point. A comparison of this drawing for custom types vs. primitives would also be helpful.
I think the main innovation in ROS2 is not necessarily DDS but the RMW layer (see the ROS 2 middleware interface design document).
It decouples ROS2 from DDS, Zenoh, or any other protocol. Which brings me to the idea that ROS2 could get a new RMW implementation based on TCPROS (the protocol ROS1 is built on), so people who prefer ROS1 for its out-of-the-box performance could use it.
There is already an RMW based on SMTP, so why not an RMW based on TCPROS?
Another point is that the default ROS2 DDS profile is not well suited to workloads with small messages where latency matters.
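For example, with rmw_fastrtps_cpp one tuning that is often suggested for small-message latency is switching the publish mode from asynchronous to synchronous via an XML profile, loaded by pointing `FASTRTPS_DEFAULT_PROFILES_FILE` at the file. This is only a sketch, assuming I remember the Fast DDS profile schema correctly:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
  <!-- Applied to all publishers unless a more specific profile matches. -->
  <publisher profile_name="low_latency_pub" is_default_profile="true">
    <qos>
      <!-- Send on the caller's thread instead of a background writer
           thread, avoiding one hand-off per small message. -->
      <publishMode>
        <kind>SYNCHRONOUS</kind>
      </publishMode>
    </qos>
  </publisher>
</profiles>
```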
I don’t know if this is helpful for the discussion, but I did some naïve performance testing a few years ago to compare ROS 2 to another project. The results are on page 6: https://files2.wasontech.com/RobotRaconteur_CASE2023.pdf . The RTT for ROS 2 was around 100–300 µs in my tests, but with a high deviation. The source code is here: GitHub - johnwason/rr_ros_latency_tests
I can find the original dataset if people are interested.
Another consideration is that the latency may be caused by the context-switch performance of Linux. I have noticed that at times the latency to switch threads when new data arrives is the problem, rather than the performance of the communication or the code. Context-switch performance is affected by the computer’s power settings and a myriad of kernel settings. Anecdotally, Windows has better default context-switch performance when receiving small amounts of data at high frequencies, because its scheduler prioritizes that scenario. Linux, in my experience, can have pretty high latency when context switching rapidly to receive small amounts of data.
I also did a quick benchmark of the serialization alone (I’m probably not the first though).
It’s a bit rough, so take it with a grain of salt:
Noetic:

```
Run on (12 X 4213.38 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.56, 0.59, 0.88
---------------------------------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
BM_serialization/std_msgs_header             35.4 ns         35.4 ns     20223211 bytes_per_second=565.588M/s
BM_serialization/std_msgs_bool               31.8 ns         31.8 ns     20964098 bytes_per_second=30.019M/s
BM_serialization/geometry_msgs_pose_array     883 ns          883 ns       806462 bytes_per_second=11.8422G/s
```
Galactic (using rmw_fastrtps_cpp):

```
Run on (12 X 4500 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 0.36, 0.65, 0.94
---------------------------------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
BM_serialization/std_msgs_header             1166 ns         1166 ns       595924 bytes_per_second=17.9911M/s
BM_serialization/std_msgs_bool               1155 ns         1155 ns       606509 bytes_per_second=6.60702M/s
BM_serialization/geometry_msgs_pose_array    8226 ns         8225 ns        84909 bytes_per_second=1.27132G/s
```
There’s quite a big overhead in ROS2, which appears to come from rmw_serialize calling get_message_typesupport_handle and reconstructing the MessageTypeSupport structure on every call. Maybe caching that information would be a low-hanging fruit?