Shared memory transport in FastDDS for Image publisher has lower frame rate than UDP transport

My team and I have noticed that our image publisher achieves a much lower frame rate over shared memory transport than over UDP transport alone.

We have two Docker containers running with net: host and ipc: host. One container publishes an image topic from an RTSP IP camera via a GStreamer-to-ROS bridge; the other container runs a single subscriber to that topic.

The image topic publishes raw images at 1080p and has the following bandwidth:

# With SHM transport
23.04 MB/s from 43 messages
        Message size mean: 4.15 MB min: 4.15 MB max: 4.15 MB

# With UDP transport
62.72 MB/s from 100 messages
        Message size mean: 4.15 MB min: 4.15 MB max: 4.15 MB

The publisher has the following QoS and only one BEST_EFFORT subscriber.

Node name: gst_rosimagesink_bow
Node namespace: /vessel_1
Topic type: sensor_msgs/msg/Image
Topic type hash: RIHS01_d31d41a9a4c4bc8eae9be757b0beed306564f7526c88ea6a4588fb9582527d47
Endpoint type: PUBLISHER
GID: 01.0f.a7.96.db.5f.19.ef.01.00.00.00.00.00.14.03
QoS profile:
  Reliability: RELIABLE
  History (Depth): UNKNOWN
  Durability: VOLATILE
  Lifespan: Infinite
  Deadline: Infinite
  Liveliness: AUTOMATIC
  Liveliness lease duration: Infinite

When checking the image FPS from my subscribing container, I get around 5.5 FPS. I’ve verified in Fast DDS monitor that the image publisher is using shared memory.
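
(For anyone reproducing this: the frame rate and bandwidth can be checked from inside the subscribing container with the standard CLI tools, along these lines; the topic name here is just a placeholder.)

ros2 topic hz /vessel_1/image_raw
ros2 topic bw /vessel_1/image_raw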

Now, if I use UDP transport with the following Fast DDS profile, applied only to my image-publishing container, I get around 16 FPS.

<?xml version="1.0" encoding="UTF-8"?>
<dds>
    <profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
        <transport_descriptors>
            <transport_descriptor>
                <transport_id>udp_transport</transport_id>
                <type>UDPv4</type>
            </transport_descriptor>
        </transport_descriptors>
        
        <participant profile_name="no_shm_participant" is_default_profile="true">
            <rtps>
                <userTransports>
                    <transport_id>udp_transport</transport_id>
                </userTransports>
                <useBuiltinTransports>false</useBuiltinTransports>
            </rtps>
        </participant>
    </profiles>
</dds>
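
(In case it helps anyone, I'm assuming the usual way of loading such a profile, i.e. pointing Fast DDS at the file via environment variables in the container; the path below is just a placeholder.)

export RMW_IMPLEMENTATION=rmw_fastrtps_cpp
export FASTRTPS_DEFAULT_PROFILES_FILE=/path/to/no_shm_profile.xml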

If I set RMW_FASTRTPS_PUBLICATION_MODE=ASYNCHRONOUS in the image-publishing container, I get around 19 FPS. However, this is odd to me, since my GStreamer-to-ROS bridge publishes the images with RELIABLE QoS and Fast DDS Monitor also shows the publish mode as SYNCHRONOUS_PUBLISH_MODE.

For anyone facing this issue, it was caused by an insufficient shared memory segment size.

The default shared memory segment size is 0.5 MB, which caused my 4.1 MB images to be heavily fragmented and hurt the frame rate. Increasing the segment size to at least 4.1 MB (4198400 bytes in the profile below) resolved my issue. I am now getting a consistent 25 FPS in both containers.

It also helped to increase the allocated /dev/shm size: going from 16 GB to 32 GB of /dev/shm on my 32 GB RAM machine raised the frame rate from 22 to 25 FPS.
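
(Side note for anyone trying to reproduce this: since the containers share the host IPC namespace, the relevant /dev/shm is the host's. One common way to resize it, shown here only as a sketch with an example size, is a tmpfs remount; containers with a private IPC namespace would instead use docker run --shm-size or shm_size in docker-compose.)

# enlarge the host /dev/shm until the next reboot (size is an example)
sudo mount -o remount,size=32G /dev/shm

# or make it persistent via an /etc/fstab entry
tmpfs /dev/shm tmpfs defaults,size=32G 0 0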

This is the Fast DDS XML profile that solved my issue.

<?xml version="1.0" encoding="UTF-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <profiles>
        <transport_descriptors>
            <!-- Set larger segment size to reduce fragmentation -->
            <transport_descriptor>
                <transport_id>shm_transport</transport_id>
                <type>SHM</type>
                <segment_size>4198400</segment_size>
            </transport_descriptor>
        </transport_descriptors>

        <!-- Link the Transport Layer to the Participant -->
        <participant profile_name="SHMParticipant" is_default_profile="true">
            <rtps>
                <userTransports>
                    <transport_id>shm_transport</transport_id>
                </userTransports>
            </rtps>
        </participant>
    </profiles>
</dds>

Seems this is another WTF moment that comes with ROS 2 and its “goodies”. You can add it to the Dds middleware complaint thread.

@darrenjkt thanks for sharing the experience and solution that you found.

there is one more thing you would want to try, that is Enable Zero Copy Data Sharing. Shared Memory Transport uses shared memory at the transport layer, whereas Data Sharing lets the reader access the writer's history in shared memory directly, bypassing the transport layer.

btw, which distribution do you use? if that is Humble, Fast-DDS Data Sharing needs to be enabled by configuration.
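
for Humble, a rough and untested sketch of enabling it via XML looks something like the below (element names follow the Fast-DDS DataSharingQosPolicy documentation; the profile name is only a placeholder, and how rmw_fastrtps matches writer profiles may depend on the version, so please double-check):

<?xml version="1.0" encoding="UTF-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <profiles>
        <!-- data sharing is a DataWriter/DataReader QoS, not a transport descriptor -->
        <data_writer profile_name="image_writer_profile"> <!-- placeholder name -->
            <qos>
                <data_sharing>
                    <!-- AUTOMATIC turns data sharing on whenever the conditions allow it -->
                    <kind>AUTOMATIC</kind>
                </data_sharing>
            </qos>
        </data_writer>
    </profiles>
</dds>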

besides that, ROS 2 loaned messages on the subscription side are currently disabled by default. if you want true zero copy, please see Configure Zero Copy Loaned Messages — ROS 2 Documentation: Rolling documentation for more details.

hope this helps a bit!

thanks again for sharing and openness.
Tomoya

Thanks for your suggestion @tomoyafujita

We are using Fast DDS. The camera-publishing container is actually running Jazzy, while the rest of our nodes in the other container are running Iron. Having different versions of ROS is not ideal, but it seems to be doing OK for us for now.

I have looked into zero-copy data sharing using loaned messages. However, from what I can see (and I'm happy to be corrected), there is no way to implement loaned messages using the XML profile. The API also only seems to exist in C++, whereas our image-related nodes are in Python (and we don't wish to rewrite them in C++ at the moment).

Seeing as we're using the images solely for ML perception, we might eventually move to DeepStream, which also supports zero memory copy.

The camera-publishing container is actually running Jazzy, while the rest of our nodes in the other container are running Iron.

i see, that is why you bind the host IPC namespace into the containers.

there is no way to implement loaned messages using the XML profile.

No, the LoanedMessage feature (class) is not related to Fast-DDS; it is an API provided by ROS 2.

since you are using Iron or later, Fast-DDS data sharing is enabled by default. (again, this is not ROS 2 but the Fast-DDS implementation)

The API also only seems to exist in C++, whereas our image-related nodes are in Python (and we don't wish to rewrite them in C++ at the moment).

and you are right, LoanedMessage is a C++ class, so the application needs to be written in C++.
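
just to illustrate, here is a minimal, untested sketch of what the rclcpp side looks like (not taken from the original poster's system). note that loaned messages only work for fixed-size message types, so sensor_msgs/msg/Image itself cannot be loaned as-is, and (if i remember correctly, per the Configure Zero Copy Loaned Messages page linked above) loaning also needs to be enabled at runtime, e.g. ROS_DISABLE_LOANED_MESSAGES=0.

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/float64.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("loaned_message_demo");
  auto pub = node->create_publisher<std_msgs::msg::Float64>("chatter", 10);

  if (pub->can_loan_messages()) {
    // the middleware hands us a buffer it owns, so publishing avoids a copy
    auto loaned = pub->borrow_loaned_message();
    loaned.get().data = 42.0;
    pub->publish(std::move(loaned));
  } else {
    // fall back to a regular (copied) publish
    std_msgs::msg::Float64 msg;
    msg.data = 42.0;
    pub->publish(msg);
  }

  rclcpp::shutdown();
  return 0;
}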

tbh I find it a bit unfair to call this a WTF.

It’s documented (here) and is just a property of shared memory in general: segments backing it have a size, as all files do.

I’m not saying it’s immediately obvious this could cause the behaviour @darrenjkt reports – although the documentation I linked mentions it – but this is a bit like complaining about Nagle’s algorithm in TCP/IP and its effect on message delivery in TCPROS.


What about a runtime warning for each topic whose messages get fragmented? That would direct the user exactly where they need to look…

Technically it would probably be possible.

Would you, however, also want one for all the other transports that may need to rely on fragmentation to get larger messages across a transport with a smaller maximum message/frame size?

Not sure how generalizable my view is. However, I understand that all network-utilizing transports have to fragment the packets in the end of the day to chunks between 1518 and 9000 B based on Jumbo frames availability. That’s a hard constraint. However, with a local-memory transport, I’d expect it just “bites” as large chunks as needed for each message as there is no such natural chunk size limitation. If the limitation is caused by limitation of the underlying technology (segmenting memory to fixed-sized chunks to have easier memory management?), I guess this could be communicated to the user…