My team and I have noticed that our image publisher is much slower over shared memory transport than over UDP transport alone.
We have two Docker containers, both running with net: host and ipc: host. One container publishes an image topic from an RTSP IP camera using a GStreamer-to-ROS bridge; the other container has a single subscriber to that image topic.
The image topic publishes raw images at 1080p and has the following bandwidth:
# With SHM transport
23.04 MB/s from 43 messages
Message size mean: 4.15 MB min: 4.15 MB max: 4.15 MB
# With UDP transport
62.72 MB/s from 100 messages
Message size mean: 4.15 MB min: 4.15 MB max: 4.15 MB
It has the following publish QoS, and there is only one BEST_EFFORT subscriber.
When checking the image FPS from my subscribing container, I get around 5.5 FPS. I’ve verified in Fast DDS Monitor that the image publisher is using shared memory.
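For reference, a frame-rate check like this can be done with a small subscriber that counts received images per second; a minimal rclpy sketch (the topic name /camera/image_raw is only a placeholder, and the BEST_EFFORT reliability matches the subscriber described above):

import time

import rclpy
from rclpy.node import Node
from rclpy.qos import QoSProfile, QoSReliabilityPolicy
from sensor_msgs.msg import Image


class FpsCheck(Node):
    """Counts received images and logs the rate roughly once per second."""

    def __init__(self):
        super().__init__('fps_check')
        qos = QoSProfile(depth=10, reliability=QoSReliabilityPolicy.BEST_EFFORT)
        # NOTE: the topic name is a placeholder; substitute the real image topic.
        self.sub = self.create_subscription(Image, '/camera/image_raw', self.on_image, qos)
        self.count = 0
        self.t0 = time.monotonic()

    def on_image(self, msg):
        self.count += 1
        now = time.monotonic()
        if now - self.t0 >= 1.0:
            self.get_logger().info(f'{self.count / (now - self.t0):.1f} FPS')
            self.count = 0
            self.t0 = now


def main():
    rclpy.init()
    rclpy.spin(FpsCheck())


if __name__ == '__main__':
    main()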
If I set RMW_FASTRTPS_PUBLICATION_MODE=ASYNCHRONOUS in the image-publishing container, I get around 19 FPS. However, this seems odd to me, since my GStreamer-to-ROS bridge publishes the images with RELIABLE QoS, and Fast DDS Monitor also shows the publish mode as SYNCHRONOUS_PUBLISH_MODE.
For anyone facing this issue: it was caused by an insufficient shared memory segment size.
The default shared memory segment size is 0.5 MB, which caused my 4.1 MB image to be heavily fragmented and hurt the frame rate. Increasing the segment size to at least 4.1 MB resolved the issue; I am now getting a consistent 25 FPS in both containers.
It also helped to increase the allocated /dev/shm size: going from 16 GB to 32 GB of /dev/shm on my 32 GB RAM machine raised the frame rate from 22 to 25 FPS.
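For anyone sizing the segment for a different camera, the idea is simply that one segment should comfortably hold a full raw image. A rough back-of-the-envelope sketch (the 2 bytes per pixel assumes a yuv422-style encoding, which is consistent with the 4.15 MB message size above, and the headroom value is only a guess):

# Rough estimate of the SHM segment size needed for one raw 1080p image.
width, height, bytes_per_pixel = 1920, 1080, 2   # ~2 B/px assumed (e.g. yuv422); rgb8 would be 3 B/px
payload = width * height * bytes_per_pixel       # 4_147_200 bytes, about 4.15 MB
headroom = 64 * 1024                             # margin for headers/metadata (a guess)
print(payload + headroom)                        # 4_212_736 bytes, same ballpark as the 4198400 used below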
This was the Fast DDS XML profile that solved my issue:
<?xml version="1.0" encoding="UTF-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <profiles>
        <transport_descriptors>
            <!-- Set larger segment size to reduce fragmentation -->
            <transport_descriptor>
                <transport_id>shm_transport</transport_id>
                <type>SHM</type>
                <segment_size>4198400</segment_size>
            </transport_descriptor>
        </transport_descriptors>
        <!-- Link the Transport Layer to the Participant -->
        <participant profile_name="SHMParticipant" is_default_profile="true">
            <rtps>
                <userTransports>
                    <transport_id>shm_transport</transport_id>
                </userTransports>
            </rtps>
        </participant>
    </profiles>
</dds>
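Fast DDS picks a profile file like this up through the FASTRTPS_DEFAULT_PROFILES_FILE environment variable, which has to be set in each container before the nodes start (for example via the container environment). If you prefer to do it from the node itself, a small Python sketch (the path /config/shm_profile.xml is only a placeholder) that exports the variable before the DDS participant is created:

import os

# Placeholder path; must point to the XML profile above inside the container.
os.environ['FASTRTPS_DEFAULT_PROFILES_FILE'] = '/config/shm_profile.xml'

import rclpy  # only initialise ROS after the variable has been exported

rclpy.init()
node = rclpy.create_node('image_consumer')  # the participant now uses the SHM transport profile

Setting the variable in the container environment is the more usual route; the in-code variant only works because it runs before rclpy creates the DomainParticipant.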
We are using Fast DDS. The camera-publishing container is actually running Jazzy, while the rest of our nodes in the other container are running Iron. Having different ROS versions is not ideal, but it seems to be working fine for us for now.
I have looked into zero-copy data sharing using loaned messages. However, from what I can see (and I’m happy to be corrected), there is no way to enable loaned messages through the XML profile alone. The API also only seems to be available in C++, whereas our image-related nodes are in Python (and we don’t wish to rewrite them in C++ at the moment).
Seeing as we’re using the images solely for ML perception, we might eventually move to DeepStream, which also supports zero-copy.
It’s documented (here) and is just a property of shared memory in general: segments backing it have a size, as all files do.
I’m not saying it’s immediately obvious this could cause the behaviour @darrenjkt reports – although the documentation I linked mentions it – but this is a bit like complaining about Nagle’s algorithm in TCP/IP and its effect on message delivery in TCPROS.
Would you, however, also want one for all other transports, which may need to rely on fragmentation to get large(r) messages sent across a transport with a smaller maximum message/frame size?
I’m not sure how generalizable my view is. However, I understand that all network-based transports have to fragment packets at the end of the day into chunks between 1518 and 9000 B, depending on jumbo frame availability. That’s a hard constraint. With a local-memory transport, though, I’d expect it to simply “bite off” chunks as large as needed for each message, since there is no such natural chunk-size limitation. If the limitation comes from the underlying technology (segmenting memory into fixed-size chunks for easier memory management?), I guess this could be communicated to the user…
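To put rough, purely illustrative numbers on that intuition (protocol overheads ignored):

# A 4.15 MB raw image necessarily spans thousands of Ethernet-sized chunks,
# while a single SHM segment can hold it whole if the segment is simply made
# large enough; with the 0.5 MB default it gets split into ~8 pieces instead.
image_bytes = 1920 * 1080 * 2      # ~4.15 MB raw 1080p frame (2 B/px assumed)
print(image_bytes / 1500)          # ~2765 standard-MTU chunks
print(image_bytes / 9000)          # ~461 jumbo-frame chunks
print(image_bytes / (512 * 1024))  # ~7.9 default-sized SHM segments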