I am building a DDS node that will interact with a ROS2 node. I have found that the Fast-RTPS DDS middleware is using what I consider an excessive amount of memory.
Running the HelloWorld example allocated 404,340 KB of memory. This seems extremely large for such a simple example.
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
tco 29836 0.0 0.1 404340 10320 pts/0 Sl 15:41 0:01 ./HelloWorldExample publisher
When I build out my actual application the VSZ grows as large as 3 GB. Each publisher and subscriber that my application creates adds megabytes to the memory usage, even if the actual data is only a single integer.
Is this level of memory usage expected, or is it possible to get Fast-RTPS running in under 40 MB with approximately 20 topics involved? My embedded target has 256 MB of RAM in total.
@NotMe that sounds like a bug, but I'm unable to reproduce it, at least on my macOS machine (I haven't tried Linux yet).
If I run talker or listener I get a steady 4.7 MB of memory usage for both. If I modify it to create 40 of each publisher/subscriber (each on a different topic), like this:
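A rough sketch of that kind of modification, assuming the rclcpp demo talker (the node name, topic names, and std_msgs::msg::String type below are illustrative, not the exact code; the subscriber side is analogous):

// Illustrative only: create many publishers on distinct topics so memory
// usage can be observed as the count grows.
#include <string>
#include <vector>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("many_talkers");

  std::vector<rclcpp::Publisher<std_msgs::msg::String>::SharedPtr> publishers;
  for (int i = 0; i < 40; ++i) {
    // One publisher per topic: "chatter_0" ... "chatter_39".
    publishers.push_back(
      node->create_publisher<std_msgs::msg::String>("chatter_" + std::to_string(i), 10));
  }

  // Keep the node alive so memory usage can be inspected with ps aux or top.
  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}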
I still see only about 8.4 MB of memory usage for each.
Maybe you could provide a similarly minimal example which has this problem? We haven't gotten to the point where we are profiling to look for unintended memory usage, so there might be a bug in there.
Also, I'm using the master branch; which version of ROS2 are you using? That might affect your Fast-RTPS version, and maybe the issue you're seeing is already addressed on master.
Quick remark: I think we are talking about different examples here. It looks like the example you are referring to is a Fast-RTPS example, which uses different QoS settings (amongst other things) than the ROS2 talker / listener example that @wjwwood is referring to. @NotMe, do you have the same behavior with the ROS2 talker / listener examples?
Before my first post I was running the HelloWorld example from the eProsima Fast-RTPS repository. Specifically, it was this commit:
"
tco@osboxes:~/Fast-RTPS-git$ git show
commit 2c86bb2d42e414007c2962942e321e82ff43ae9a
Merge: 87881ec b285f9b
Author: JavierIH <javierisabel@eprosima.com>
Date: Thu Apr 20 08:08:08 2017 +0200
Merge branch 'master' of git.sambaserver.eprosima.com:rtps/rtps
Running the "talker" and "listener" examples in separate terminals I see that each has allocated over 540 megs according to the "VSZ" column in "ps aux".
I attached two screenshots to help communicate exactly what I used for the test before my first post and the test that I just ran this morning.
I ran the "talker" example from the "lastSuccessfulBuild" link that you sent. The virtual memory usage is still in the hundreds of megabytes range. Screenshot is attached.
The "virtual memory size" (VSZ) listed in ps isn't what you think it is. That is the total size of the address space assigned to the executable, which includes code, data, memory-mapped files, shared libraries, swap usage, and pages that have been allocated but not actually used yet. It's not uncommon for that to be very large; I've got a virtual machine running right now whose VSZ is about 46 GB...
You should actually pay attention to the resident set size (RSS), which is how much physical memory the task is actually using. In your case, 10,320 kB seems more reasonable.
Thank you for bringing this up as a point of clarification. I am aware of the difference between virtual memory size and resident memory size. I have attached a screenshot below showing how virtual memory becomes resident when an application actually uses the memory, not immediately when malloc is called.
The issue in my case is that I do not have any way to provide swap space to the OS on my embedded target. Initially everything works since the resident size of all my processes is less than the physical memory size (256 MB). After adding publishers and subscribers for all the required topics, the resident size exceeds the physical RAM size and Linux kills a seemingly arbitrary process, as described in the following link:
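The same effect can be seen with a small stand-alone sketch (illustrative only, not the test in the screenshot): allocate a large buffer, check RSS with ps, then touch the buffer and check again.

// Illustrative only: malloc'd pages become resident when written to,
// not at allocation time.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <unistd.h>

int main()
{
  const std::size_t size = 100 * 1024 * 1024;  // 100 MB
  char * buffer = static_cast<char *>(std::malloc(size));
  if (buffer == nullptr) {
    return 1;
  }

  std::printf("allocated 100 MB: VSZ is up, RSS is not\n");
  sleep(30);  // check ps aux now

  std::memset(buffer, 0xAB, size);  // touch every page
  std::printf("touched the buffer: RSS now includes the 100 MB\n");
  sleep(30);  // check ps aux again

  std::free(buffer);
  return 0;
}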
The only way I can ensure that this will not happen in the released software is to enforce that the virtual memory size never exceeds the physical memory available.
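One possible way to enforce such a cap at the process level (a sketch only; I have not validated this with Fast-RTPS) is to limit the address space with setrlimit(RLIMIT_AS, ...) at startup, so that allocations fail outright instead of overcommitting and later waking the OOM killer:

// Illustrative only: cap the address space so malloc/new fail once the
// process would exceed the limit, rather than triggering the OOM killer.
#include <sys/resource.h>
#include <cstdio>

int main()
{
  struct rlimit limit;
  limit.rlim_cur = 200UL * 1024 * 1024;  // 200 MB soft limit (example value)
  limit.rlim_max = 200UL * 1024 * 1024;  // 200 MB hard limit
  if (setrlimit(RLIMIT_AS, &limit) != 0) {
    std::perror("setrlimit");
    return 1;
  }
  // ... create the DDS participant, publishers and subscribers here;
  // any allocation that would push VSZ past the cap now fails.
  return 0;
}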
My embedded target is running on an SD card and I do not think it is a good idea to add a swap partition due to the wear that occurs on this type of media over numerous write cycles. If I am wrong and it is acceptable to use an SD card partition as swap space then that is very good news for me. Do you know if this is acceptable?
I created this topic in an effort to confirm that the behavior I see on my system is expected and to ask if there is any way to configure Fast-RTPS to allocate significantly less memory. Can you please provide confirmation that the behavior I see is expected? Are you aware of any configuration to reduce the memory allocation?
For a point of reference I have an evaluation copy of a commercial DDS middleware and the VSZ for a simple "hello world" application is only 4 megabytes. This is a big difference from 300 megs.
Thank you again for taking the time to respond to my post.
I'm not familiar with Fast-RTPS; my point is just that VSZ can be deceptive because there are many things that add into it which don't affect resident memory at all. If Fast-RTPS uses memory-mapped files or has large shared libraries, it could easily balloon far beyond the size of your physical memory without having any negative impact. If the actual RSS value is growing enough that processes are being killed, you might want to look into using a memory profiling tool like Google perftools or Valgrind's Massif on your program to see what's allocating so much.
As far as SD cards go, how much wear one can handle depends largely on the card. There are industrial SD cards that supposedly can handle hundreds of TB of writes before they fail, so they would last for years even if you're writing hundreds of GB per day.
Thank you again. I will look into an industrial SD card.
Before posting here I spent several days reading the Fast-RTPS code and running Valgrind's Massif tool on the sample applications. I was not able to find anything that could be easily reduced or eliminated.
It seems to me that the code, as implemented, is memory intensive and is intended to run on systems with several GB of RAM and plenty of swap space.
I hoped someone here could tell me that I was wrong and that there was a network buffer size or something else that I could configure to be smaller.
Yeah, I think the VSZ is not out of the ordinary. By comparison, my dockerd and gnome-terminal instances in my Linux VM have over 600 MB each in VSZ. I have no doubt that this might be contributing to your OOM killer issues, but it doesn't seem to be unique to talker (so it's probably not a bug). In fact, I think the vast majority of that VSZ is shared amongst all of them.
Not that ROS 1 is the standard-bearer, but out of curiosity I had a look at the talker from roscpp_tutorials:
% ps aux | grep talker
william 22046 0.7 0.2 397428 11896 pts/4 Sl+ 13:47 0:00 /opt/ros/kinetic/lib/roscpp_tutorials/talker
And it is quite similar, with ~397 MB of VSZ and ~12 MB of RSS.
That being said, I think Fast-RTPS is mostly designed for "typical" desktop machines with at least a few GB of memory. There might be simple things we can do to reduce the memory pressure these programs put on the system, but after reading around on the internet, looking at pmap, and taking a cursory look at the output of Massif, I don't see any low-hanging fruit.
I still haven't tried Connext or OpenSplice yet, but they might perform better in your use case until we can improve Fast-RTPS here. That's part of the idea behind the middleware abstraction: we can use different implementations with different intrinsic qualities where applicable.
We can almost certainly fine-tune the existing software to use less memory, but we just haven't gotten to that level of refinement yet. If anyone has any recommendations (for the ROS 2 code or Fast-RTPS), we're happy to try them out to improve this.
It seems I will need to continue down the path of a commercial middleware. I will let you know if I find something specific to improve or recommend in ROS 2 or Fast-RTPS.
The rmw_coredx integration layer is relatively new, so there hasnât been much optimization done yet; but in general CoreDX DDS does well in resource constrained environments.
Thanks @wjwwood for making me aware of this thread.
After checking what @NotMe has detected, I have been investigating why Fast RTPS requires so much virtual memory. Thanks to this link I've discovered that each pthread thread requires 72 MB of virtual memory. This has been the case since a change in glibc 2.15.
I was comparing a small program that uses pthread threads with the same program using std::thread. A std::thread only requires 8 MB of virtual memory.
Fast RTPS uses the Asio library. On my system the Asio library is using pthreads. I was able to configure the Asio library to use std::thread by changing src/cpp/CMakeLists.txt. I changed line 183 with this new line:
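A sketch of the std::thread variant of that comparison (illustrative only; VSZ is observed with ps aux while the threads are still alive):

// Illustrative only: spawn several std::thread workers and keep them alive
// so the per-thread virtual memory cost shows up in the VSZ column.
#include <chrono>
#include <thread>
#include <vector>

int main()
{
  std::vector<std::thread> threads;
  for (int i = 0; i < 10; ++i) {
    threads.emplace_back([] {
      // Sleep so the thread and its stack mapping stay around for inspection.
      std::this_thread::sleep_for(std::chrono::minutes(5));
    });
  }
  for (auto & t : threads) {
    t.join();
  }
  return 0;
}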
I think it would be useful to have a meeting with you to better understand your requirements and give you potential solutions. Could you please send me an email (jaimemartin@eprosima.com) with your contact details to organize a conference call?