How do you mitigate bugs and issues related to system resources?
I’ve seen some CRAZY things happen when resources get overloaded.
I wrote a blog post illustrating some of the crazy things I’ve seen. After talking with hundreds of robotics teams, a pattern started emerging: resource monitoring is very hard with ROS and with robots in general.
We (at Freedom Robotics) built a tool - The Robotics Resource Monitor - to log and monitor system resources on any robot (GPU, network connectivity, topic data, WebRTC diagnostics) and correlate this information with ROS-specific topics and processes. I would love to get feedback on how I can make this better.
It installs in one line and data is logged historically so you can go back in time… and we’re giving it away for free for a year here.
Here is network bandwidth broken down by topic. It shows a /tf topic with message size and update rate graphed over time. I’ve used this to correlate robot-specific issues with compute overload or internet dropouts.
So honestly, I didn’t give a hoot about CPU, GPU, memory or network utilization for the longest time. It just seemed like such a detail, but I’ve changed my mind completely on this.
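If you want a rough feel for where per-topic numbers like these come from, here is a minimal sketch in plain Python. It is not how the Resource Monitor computes them; it just assumes you have already recorded (topic, timestamp, size) tuples somewhere, and the function name `bandwidth_by_topic` is made up for illustration.

```python
from collections import defaultdict


def bandwidth_by_topic(samples):
    """Estimate rate, mean message size and bandwidth per topic.

    samples: iterable of (topic, timestamp_seconds, size_bytes) tuples.
    This input format is an assumption for the sketch, not a ROS API.
    """
    per_topic = defaultdict(list)
    for topic, stamp, size in samples:
        per_topic[topic].append((stamp, size))

    stats = {}
    for topic, msgs in per_topic.items():
        msgs.sort()
        span = msgs[-1][0] - msgs[0][0]
        if len(msgs) < 2 or span <= 0:
            continue  # not enough data to estimate a rate
        # n messages cover n - 1 intervals
        rate_hz = (len(msgs) - 1) / span
        mean_size = sum(size for _, size in msgs) / len(msgs)
        stats[topic] = {
            "rate_hz": rate_hz,
            "mean_size_bytes": mean_size,
            "bytes_per_s": rate_hz * mean_size,
        }
    return stats


if __name__ == "__main__":
    # Synthetic /tf traffic: 200-byte messages every 0.1 s for one second.
    fake = [("/tf", i / 10.0, 200) for i in range(11)]
    print(bandwidth_by_topic(fake))
```

Plotting `rate_hz` and `bytes_per_s` over a sliding window is what lets you line up a drop in update rate with a CPU spike or a network dropout.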
I broke a lot of robots by accident…
Here is what happened. Full usage of the GPU, combined with poor cooling, increased the temperature of the unit, and the CPU switched to a lower-compute mode. Perception algorithms weren’t able to keep up with the frame rate, so frame callbacks started to stack up with unstable consequences. Finally, motor updates slowed down and became unstable, which created a resonance in the robot. All of this was identified and tracked by viewing historical system resources.
These resources are the lifeblood of a robot - when they run out, bananas 🍌 stuff happens. Am I crazy? Does anyone else feel this way?
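The first link in that chain (hot unit, CPU clocking down) can be spotted with a very small check. This is only a sketch under assumptions: it reads the sysfs files that most Linux boards expose, the zone and CPU indices (`thermal_zone0`, `cpu0`) may differ on your hardware, and the 80 °C and 70% thresholds are numbers I picked for illustration.

```python
from pathlib import Path

# Common Linux sysfs locations; the indices are assumptions and vary by board.
THERMAL = Path("/sys/class/thermal/thermal_zone0/temp")  # millidegrees C
CUR_FREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")  # kHz
MAX_FREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq")  # kHz


def read_int(path):
    """Return the integer stored in a sysfs file, or None if unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def looks_throttled(temp_c, cur_khz, max_khz, hot_c=80.0, min_ratio=0.7):
    """Heuristic: the CPU is hot AND running well below its maximum clock.

    A low clock alone is not enough, since an idle CPU also clocks down.
    """
    return temp_c >= hot_c and cur_khz < min_ratio * max_khz


def sample():
    """Take one reading; returns None when the sysfs files are missing."""
    temp_milli = read_int(THERMAL)
    cur = read_int(CUR_FREQ)
    top = read_int(MAX_FREQ)
    if temp_milli is None or cur is None or top is None:
        return None
    temp_c = temp_milli / 1000.0
    return {
        "temp_c": temp_c,
        "cur_khz": cur,
        "max_khz": top,
        "throttled": looks_throttled(temp_c, cur, top),
    }


if __name__ == "__main__":
    print(sample())
```

Logging this once a second next to your callback timings makes it much easier to see whether a slowdown started with the temperature or with the software.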
Here is an example of compute and memory broken down by nodes. This one has a pretty obvious memory leak.
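A leak like this can also be flagged automatically once memory is logged per node. The sketch below is one simple way to do it, not the tool's actual method: it fits a straight line to each node's memory samples and reports nodes whose usage grows faster than a threshold. The input format, the function names and the 1 KiB/s threshold are all assumptions.

```python
def growth_rate(samples):
    """Least-squares slope of (time_s, rss_bytes) samples, in bytes/second."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0  # all samples share one timestamp, no slope to fit
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def suspected_leaks(rss_by_node, min_bytes_per_s=1024.0):
    """Return node names whose memory grows at least min_bytes_per_s."""
    return sorted(
        node
        for node, samples in rss_by_node.items()
        if growth_rate(samples) >= min_bytes_per_s
    )


if __name__ == "__main__":
    # Synthetic data: one node leaking 5 kB/s, one node flat.
    history = {
        "/leaky_node": [(t, 100_000_000 + 5000 * t) for t in range(60)],
        "/steady_node": [(t, 80_000_000) for t in range(60)],
    }
    print(suspected_leaks(history))
```

A line fit over a long window is deliberately crude: it ignores short spikes from normal allocation and only reacts to sustained growth, which is what a leak looks like in the graph above.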