How do you mitigate robot bugs and issues related to system resources?

Hi All,

How do you mitigate bugs and issues related to system resources?

I’ve seen some CRAZY things happen when resources get overloaded.

I wrote a blog post illustrating some of the crazy things I’ve seen. After talking with hundreds of robotics teams, a pattern started emerging: resource monitoring is very hard with ROS and robots in general.

We (at Freedom Robotics) built a tool - The Robotics Resource Monitor - to log and monitor system resources on any robot (GPU, network connectivity, topic data, WebRTC diagnostics) and correlate this information with ROS-specific topics and processes. I would love to get feedback on how I can make this better.

It installs in one line and data is logged historically so you can go back in time… and we’re giving it away for free for a year here.


Here is network bandwidth broken down by topic. It shows a /tf topic with message size and update rate graphed over time. I’ve used this to correlate robot-specific issues with compute overload or internet dropouts.

So honestly, I didn’t give a hoot about CPU, GPU, memory or network utilization for the longest time. It had seemed like such a minor detail, but I’ve changed my mind completely on this.

I broke a lot of robots by accident…

Here is what happened. Full usage of the GPU, combined with poor cooling, increased the temperature of the unit, and the CPU switched to a lower-compute mode. Perception algorithms weren’t able to keep up with the frame rate, so frame callbacks started to stack up, with unstable consequences. Finally, motor updates slowed down and became unstable, which created a resonance in the robot. All of this was identified and tracked by viewing historical system resources.
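The first link in that chain (rising temperature) can be watched with very little code. The following is only a minimal sketch, assuming a Linux machine that exposes temperatures under /sys/class/thermal in millidegrees Celsius. The function names and the 80 °C limit are hypothetical choices for illustration, not anything from our tool.

```python
import glob
import os


def read_zone_temps_c(base_dir="/sys/class/thermal"):
    """Return {zone_name: temperature in Celsius} for each readable thermal zone.

    Assumes the Linux sysfs layout, where each thermal_zone*/temp file
    holds an integer in millidegrees Celsius.
    """
    temps = {}
    for zone in sorted(glob.glob(os.path.join(base_dir, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as f:
                temps[os.path.basename(zone)] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            # Skip zones that cannot be read or do not contain a number.
            continue
    return temps


def hot_zones(temps_c, limit_c=80.0):
    """Names of zones at or above limit_c (80.0 is an illustrative value)."""
    return sorted(name for name, t in temps_c.items() if t >= limit_c)


if __name__ == "__main__":
    print(hot_zones(read_zone_temps_c()))
```

Logging these values alongside callback rates is what makes a chain like the one above visible after the fact.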

These resources are the lifeblood of a robot - when they run out, bananas 🍌 stuff happens. Am I crazy? Does anyone else feel this way?


Here is an example of compute and memory broken down by nodes. This one has a pretty obvious memory leak.
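A leak like this can also be flagged numerically rather than by eye. Here is a minimal sketch that fits a straight line to memory samples and reports the growth rate. The function names and the 1024 bytes-per-second threshold are assumptions made for illustration and do not describe how the tool itself works.

```python
from typing import Sequence


def memory_growth_rate(times_s: Sequence[float], rss_bytes: Sequence[float]) -> float:
    """Least-squares slope of memory usage over time, in bytes per second."""
    n = len(times_s)
    if n != len(rss_bytes) or n < 2:
        raise ValueError("need at least two paired samples")
    mean_t = sum(times_s) / n
    mean_m = sum(rss_bytes) / n
    var_t = sum((t - mean_t) ** 2 for t in times_s)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in zip(times_s, rss_bytes))
    return cov / var_t


def looks_like_leak(times_s, rss_bytes, threshold_bytes_per_s=1024.0) -> bool:
    """True if memory grows faster than the (hypothetical) threshold."""
    return memory_growth_rate(times_s, rss_bytes) > threshold_bytes_per_s
```

A steady positive slope over a long window is only a hint: caches and buffers that fill up once and then level off will look similar over a short window.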

I’ve been an engineer and come from a systems background - it’s interesting to me to see tools being built for this from a robotics perspective.

Cell data is our constant bane. It’s vital we keep bandwidth low to prevent racking up expensive bills. Have you thought about building this to support that?

I 100% agree that resource monitoring is crucial.

This is not a robotics-specific problem and, not surprisingly, the systems community has built a huge set of open source telemetry and monitoring tools, which are easy to integrate. So, by extension, I do not agree with the statement of the original poster that this is “very hard”, and do sense a bit of a product placement aspect to that statement, if I may be so bold. Of course it’s great that several companies are starting to build robotics-specific tools that make this even easier, and if you’re running your robots at scale, these will probably save you money overall, but don’t let them tell you that this is too hard to do with open source tools on your own, at least for your robotic prototypes :wink:

I can say that one of the awesome open source tools we’re using is collectd. It is very lightweight and has a huge set of plugins for monitoring almost every conceivable aspect of your machine. It is also easy to write new plugins. collectd is well supported by a variety of backends, e.g. we’re using both influxdb and logstash, which connect to Grafana and Kibana, respectively.

IIRC, Ian Sherman showed this in his presentation at ROSCon19 as well: Beyond Autonomy: ROS in solution architecture.

Seems only the recording is available (no slides): video (part about monitoring: link).

@Ingo_Lutkebohle - Thanks for the response. I love the work that you have done and also enjoyed your 2019 ROS Industrial talk on ROS2 Tracing. I’m pretty excited to see all the new ways we will be able to instrument ROS systems moving forward.

Much of our system is intended to be agnostic to how the data is acquired. For most people, our own agent gathers the resource data and sends it, but anyone can build a piece of code that sends data using our API. Perhaps we could do this for collectd and use our tool as a visual frontend for it.

One thing I have found, for myself at least, is that I have historically placed far too little importance on monitoring resources for my robotics systems, and in discussions with others I find they do too. That’s one of the main things I am bringing up here - not that it’s hard, but that I don’t witness this kind of instrumentation being done in practice on machines.

E.g. when I ask my robotics friends how they’re going about checking what their system resources looked like while they ran a demo, they’ll most likely type something analogous to htop >> cpu.txt in the terminal.
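For comparison, a timestamped log is not much more work than that one-liner. This is only a sketch, assuming a Linux machine (it reads /proc/meminfo and uses os.getloadavg). The function names, CSV columns and file name are made up for illustration.

```python
import csv
import os
import time
from typing import Optional

FIELDS = ["time", "load1", "mem_available_kb"]


def parse_mem_available_kb(meminfo_text: str) -> Optional[int]:
    """Extract the MemAvailable value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None


def sample() -> dict:
    """Take one timestamped sample of load average and available memory."""
    try:
        load1 = os.getloadavg()[0]  # 1-minute load average (Unix only)
    except (OSError, AttributeError):
        load1 = None
    try:
        with open("/proc/meminfo") as f:  # Linux only
            mem_kb = parse_mem_available_kb(f.read())
    except OSError:
        mem_kb = None
    return {"time": time.time(), "load1": load1, "mem_available_kb": mem_kb}


def log_samples(path: str, count: int, interval_s: float) -> None:
    """Append `count` samples to a CSV file, one every `interval_s` seconds."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        for i in range(count):
            writer.writerow(sample())
            f.flush()  # keep data on disk even if the robot crashes mid-run
            if i < count - 1:
                time.sleep(interval_s)


if __name__ == "__main__":
    log_samples("resources.csv", count=60, interval_s=1.0)
```

Because every row carries a timestamp, the file can be lined up against a rosbag or log from the same demo afterwards.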

From what I see there is a large variety of roboticists out there and not many of them have the same awesome systems background that you have - and they might not necessarily be tracking and monitoring the things they should.

Have you seen anything similar? What are the top few things you always make sure you monitor on a new robotics project? I would love to continue to improve our tools for developers and large fleets alike.

I tend to agree that resource monitoring is an afterthought - at least at first - for new roboticists (raises his own hand :laughing:)

Sure, I’ve used the ELK stack + grafana for resource monitoring before, but it’s actually pretty nice to have a built-in tool for this, in a platform that already comes with a bunch of other telemetry + control stuff.

One question. I keep running into internet dropouts. I’m curious about how you handle this if your solution is cloud based - are you able to buffer resource data during these dropouts?
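To make the question concrete, what I have in mind is a store-and-forward scheme along the lines of the sketch below. It is purely illustrative and says nothing about how any particular product handles dropouts. The class name, the `send` callback and the buffer size are hypothetical.

```python
from collections import deque
from typing import Callable, Deque


class StoreAndForwardBuffer:
    """Keep samples in memory while the uplink is down and send them later.

    `send` is a caller-supplied function that returns True when a sample
    was delivered and False when the network is unavailable.
    """

    def __init__(self, send: Callable[[dict], bool], max_samples: int = 10_000):
        self._send = send
        # Bounded: once full, the oldest samples are dropped first.
        self._pending: Deque[dict] = deque(maxlen=max_samples)

    def submit(self, sample: dict) -> None:
        """Queue a new sample and try to deliver everything pending."""
        self._pending.append(sample)
        self.flush()

    def flush(self) -> int:
        """Send pending samples oldest-first; stop at the first failure."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

The bounded deque trades completeness for safety: a long outage loses the oldest data instead of exhausting memory on the robot. Spilling to disk would be the obvious next step.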

@sjhansen3 thanks for the kind words :slightly_smiling_face: My talks are motivated by making roboticists more aware of systems issues, as I share some of your observations.

Fortunately, outside of academia and very early-stage startups, I usually encounter mixed teams which include both software and systems engineers next to roboticists. In such teams, building bridges is important, and that’s what I’m trying to do here, by making people aware of the existing work, and by identifying gaps.

Particularly, things like CPU and memory usage – that’s really standard, but of course there are also things which are more robotics-specific that would be of interest. Your topic statistics would be an example of something like that, even though it’s still mostly middleware stuff.

Unfortunately, I cannot directly give you a list of things I’d like to see in robotics-specific tools, precisely because this is not yet generally available and thus considered valuable intellectual property by some people. But I think if you’ve seen my talks, particularly the older ones, they should give you some ideas :wink:

Check out atop. I was able to solve many problems regarding system resources with it. It’s very similar to top except that the log files can be replayed and viewed directly through the command line interface. atop is very easy to automate and can be triggered by external events. With a logging cycle of 1 minute most problems can be found but the used storage space can be quite large. It can also be quite helpful to find the reason for a system crash.

https://www.atoptool.nl/

I always wanted to write an interface for ROS to publish the data directly as a message, but never got around to it. I think it could be quite helpful for debugging purposes.

giving it away for free for a year here

The link seems to be broken:
https://bit.ly/2X7nwzB

Try https://app.freedomrobotics.ai/#/signup and use the current code “MONITOR” in all caps. See if that helps!