Posted by @mxgrey:
The memory chart in the first post was generated with the following code modified to always return no conflict here, as shown below. Will negotiation even be triggered?
With that modification you’re right: there should be no negotiations taking place, so there shouldn’t be any memory taken up by the negotiation system.
Traffic Schedule Memory
The most likely reason that the traffic schedule node’s memory flattens out is that every minute we cull any schedule data more than 2 hours old, so it shouldn’t be able to grow unbounded unless there’s a bug in the culling mechanism.
While auditing the implementation of the schedule database, I realized that we’re holding onto traffic history for much longer than is really necessary. Essentially every change, including each delay signal, for every robot gets tracked until it is more than 2 hours old. This is done with the intention of making network traffic lighter, so mirrors can be updated with diffs of the schedule instead of repeatedly republishing the whole schedule. However, the memory cost of this obviously does not scale well.
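To make that concrete, here is a rough sketch of the pattern being described. This is purely illustrative Python with invented names, not the actual rmf_traffic database code:

```python
# Illustrative sketch only -- not the actual rmf_traffic implementation.
# Every change (including each delay signal) is appended per participant
# and is only dropped once it is more than 2 hours old, so chatty delay
# updates dominate the memory held between culls.
import time
from collections import defaultdict

CULL_HORIZON_SEC = 2 * 60 * 60   # keep 2 hours of history for mirror diffs


class ScheduleHistory:
    def __init__(self):
        # participant name -> list of (timestamp, change) entries
        self._changes = defaultdict(list)

    def push_change(self, participant, change):
        # Every trajectory update and delay signal lands here and is
        # retained so mirrors can be sent incremental diffs.
        self._changes[participant].append((time.time(), change))

    def cull(self):
        # Runs periodically (every minute in the schedule node): drop
        # anything older than the 2-hour horizon.
        cutoff = time.time() - CULL_HORIZON_SEC
        for participant, entries in self._changes.items():
            self._changes[participant] = [
                (t, c) for (t, c) in entries if t >= cutoff
            ]
```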
I’ve added some commits to Reduce memory footprint of database by mxgrey · Pull Request #116 · open-rmf/rmf_traffic · GitHub that should drastically reduce how much history is retained between culls. With these changes I expect memory usage to stay on the scale of MB rather than GB, although I haven’t tested at the scale of robots that you’re using. If you can try it out and let me know how it goes, I would appreciate it.
Fleet Adapter Memory
In the fleet adapter, the schedule mirror shouldn’t be growing at all. It should reach a steady state very quickly and use much less memory than the traffic schedule node because it doesn’t attempt to retain any history. Whatever is happening in the fleet adapter, I expect it’s unrelated to what has been happening in the traffic schedule. I’ve added memory profile logging for the mirrors in this PR, but I expect that will just tell us the mirrors aren’t growing.
So here are my best guesses for what the cause might be:
1. Task Log Growth
Everything inside rmf_fleet_adapter besides the traffic mirror and negotiations uses RAII, so every time a new task is started for a robot, the memory of its previous task should be cleared out. However, we are aware of an issue where task logs can grow very large for long-running tasks. This can even happen while a robot is idle and charging, since charging is still considered a task and may produce logs.
If you’re willing to test this out, you could apply this patch to rmf_task, which will block the task logging mechanism entirely. Virtually all task logging will still get sent to the ROS logger, so you’ll still be able to find the information in the stdout of the fleet adapter; it just won’t go to the operator dashboard. If you find that this is the cause of the memory growth, then we can discuss options for addressing this.
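For a sense of why this matters, here is a hypothetical sketch of the failure mode. It is not rmf_task’s actual log structure, just an unbounded log shown next to one possible mitigation (a capped ring buffer):

```python
# Hypothetical illustration -- not rmf_task code.
from collections import deque


class UnboundedTaskLog:
    """Grows for the entire lifetime of a long-running (or charging) task."""
    def __init__(self):
        self.entries = []

    def info(self, text):
        self.entries.append(text)


class BoundedTaskLog:
    """Caps memory by silently dropping the oldest entries."""
    def __init__(self, max_entries=1000):
        self.entries = deque(maxlen=max_entries)

    def info(self, text):
        self.entries.append(text)
```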
2. Python memory leaks
Another possibility to consider is whether there are memory leaks happening in your integration code. This is especially risky if you are integrating with Python since most likely you wouldn’t be using weakref to manage circular dependencies. I don’t have personal experience with debugging memory leaks in Python, but there seem to be some options available.
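As a minimal sketch of the kind of cycle that tends to leak and how weakref breaks it (the class names here are invented for illustration and are not the rmf_fleet_adapter Python API):

```python
import weakref


class StatusPublisher:
    def __init__(self, handle):
        # A strong reference back to the handle would complete a cycle
        # (handle -> publisher -> handle). A weak reference does not keep
        # the handle alive, so it can be freed as soon as the rest of the
        # system lets go of it.
        self._handle = weakref.ref(handle)

    def publish(self):
        handle = self._handle()      # resolves to None once the handle is gone
        if handle is None:
            return
        print('publishing state for', handle.name)


class RobotCommandHandle:
    def __init__(self, name):
        self.name = name
        self.publisher = StatusPublisher(self)   # the handle owns the publisher
```

Pure-Python cycles can eventually be reclaimed by the cyclic garbage collector, but cycles that pass through objects held from C++ bindings are easy to leak, which is why avoiding them with weakref in the first place is the safer habit.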
3. Something else??
Unfortunately, memory profiling is very difficult to do in general. The only way I know how to do it with C/C++ code is valgrind, but valgrind adds so much overhead that the fleet adapter can’t actually function properly when run through it. If neither of the above theories pans out, then we may have to start considering more obscure corners, like the websocket server or the quality-of-service settings for the ROS primitives.
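On the Python side, one of the options alluded to under point 2 is the standard library’s tracemalloc module, which can diff heap snapshots taken some time apart and at least tell you whether the growth is coming from the integration code. A rough sketch:

```python
import tracemalloc

tracemalloc.start(25)            # record up to 25 stack frames per allocation

baseline = tracemalloc.take_snapshot()
# ... let the fleet adapter integration run through a few task cycles ...
current = tracemalloc.take_snapshot()

# Show the ten source lines that accumulated the most memory in between.
for stat in current.compare_to(baseline, 'lineno')[:10]:
    print(stat)
```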