Scalability issues with a large number of nodes

My team and I are developing a mobile platform for industrial tasks (such as rivet fastening or drilling), fully based on the ROS 2 stack (Humble).

The stack comprises a number of nodes for different tasks (SLAM, motion planning, fiducial registration…), which are coordinated through a state machine node (based on smach).

The issue we are facing is that the state machine node (which is connected to most of the nodes in the stack) gets slower and slower until it stops receiving events from other nodes.

We’ve been debugging this issue, and our feeling is that the number of objects (nodes/clients/subscribers…) is too high and the whole stack suffers a lot of overhead, which is most noticeable in the “biggest” node (the state machine).

Our stack has 80 nodes and a total of 1505 objects:

  • Stack clients: 198
  • Stack services: 636
  • Stack publishers: 236
  • Stack subscribers: 173

My questions are:

  • Is this number of nodes too high for an industrial robotics project? How large do ROS 2 projects usually get?
  • What is the maximum number of objects in a stack? Is this an rmw limitation or a limitation of ROS 2 itself?

Hi,

This is an interesting question (sorry, I don’t have the answer), since a similar architecture is common in BehaviorTree.CPP too.

The first things that come to my mind are:

  • A deep analysis of the dataflow (number of messages) using some tracing library; this may be a good option: GitHub - ros2/ros2_tracing: Tracing tools for ROS 2. It should work out of the box. Alternatively, you may consider instrumenting your code and using some other logging mechanism (a minimal sketch is at the end of this post).

  • The number of stack services seems very high. I wonder whether you should consider a different architecture to reduce that number, or use publish-subscribe instead.

In general, before starting any optimization, it is always advisable to pinpoint the actual bottleneck.
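For the “instrument your code” route, here is a minimal sketch in rclpy (the topic name and message type are placeholders, not from the original post): it just counts callbacks and logs the rate, which is often enough to see where a busy node’s executor starts falling behind compared to the publish rate.

```python
# Hypothetical monitor node: counts messages on one topic and logs the rate,
# so you can compare what is published with what the subscriber actually sees.
import time

import rclpy
from rclpy.node import Node
from std_msgs.msg import String  # placeholder message type, replace with yours


class CallbackRateMonitor(Node):
    def __init__(self):
        super().__init__('callback_rate_monitor')
        self._count = 0
        self._t0 = time.monotonic()
        # '/sm_events' is a placeholder for whatever topic feeds the state machine.
        self.create_subscription(String, '/sm_events', self._on_msg, 10)

    def _on_msg(self, msg):
        self._count += 1
        elapsed = time.monotonic() - self._t0
        if elapsed >= 5.0:  # report every 5 seconds
            self.get_logger().info(f'{self._count / elapsed:.1f} msg/s received')
            self._count = 0
            self._t0 = time.monotonic()


def main():
    rclpy.init()
    rclpy.spin(CallbackRateMonitor())


if __name__ == '__main__':
    main()
```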

Davide has, as always, hit the nail on the head: always profile before optimizing.

One approach we have taken where necessary on hot message paths is node composition (it is usually a low-overhead refactor) to reduce interprocess comms, depending on which DDS you are using. As a reminder, some DDS vendors use synchronous publishing, which can be a major bottleneck too. Large message serialization/deserialization was another common sticking point for us.
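For reference, here is a rough composition sketch (the package and plugin names are made up for illustration, not the poster’s setup): two components loaded into one container process can talk intra-process instead of over DDS.

```python
# Hypothetical launch file: two components share one container process, and
# intra-process communication is requested for both of them.
from launch import LaunchDescription
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode


def generate_launch_description():
    container = ComposableNodeContainer(
        name='manipulation_container',   # placeholder name
        namespace='',
        package='rclcpp_components',
        executable='component_container',
        composable_node_descriptions=[
            ComposableNode(
                package='my_slam_pkg',                # placeholder
                plugin='my_slam_pkg::SlamComponent',  # placeholder
                extra_arguments=[{'use_intra_process_comms': True}],
            ),
            ComposableNode(
                package='my_planner_pkg',                  # placeholder
                plugin='my_planner_pkg::PlannerComponent', # placeholder
                extra_arguments=[{'use_intra_process_comms': True}],
            ),
        ],
    )
    return LaunchDescription([container])
```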

Time to reintroduce roscore… ehm, pardon me, the DDS Discovery Server 🙂 See e.g. New Discovery Server.
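For anyone curious, a rough sketch of what that setup can look like with Fast DDS (the node is a placeholder, and the exact `fastdds discovery` CLI flags may differ between Fast DDS versions): one process acts as the discovery server, and the nodes are pointed at it via `ROS_DISCOVERY_SERVER` instead of using multicast simple discovery.

```python
# Hypothetical launch file: run a Fast DDS discovery server and make the nodes
# started here register with it instead of using multicast simple discovery.
from launch import LaunchDescription
from launch.actions import ExecuteProcess, SetEnvironmentVariable
from launch_ros.actions import Node


def generate_launch_description():
    return LaunchDescription([
        # Start the discovery server on the default port (11811); check your
        # Fast DDS version for the exact CLI options.
        ExecuteProcess(cmd=['fastdds', 'discovery', '-i', '0']),
        # Processes launched below will discover each other through the server.
        SetEnvironmentVariable('ROS_DISCOVERY_SERVER', '127.0.0.1:11811'),
        Node(package='demo_nodes_cpp', executable='talker'),  # placeholder node
    ])
```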

Always relevant…

@matthews-jca 100% agree.

Try different DDS vendors and maybe enable shared memory transport.

Using components is a great idea.

I am currently using ROS 1 Noetic, but we also use smach for controlling our state machine. Something that jumped out at me is that you mention your state machine being connected to most other nodes, which seems odd. Does your state machine also do significant work? I ask because we use a design where smach does very little work other than triggering other nodes. Within our smach concurrences, we have about 5-6 sub-states, but most are simply listening for transition messages.

One issue you may be running into is that smach only creates the active state(s) and then destroys them when they become inactive. If you transition between states quickly, you may be seeing a very high load on message discovery as states (and therefore the publishers, subscribers, and service handlers within them) get created and destroyed. Smach is also inherently single-threaded and, in my experience, not an especially performant library. It is useful for relatively simple state machines.

Hey, I’m a coworker of @leander2189, and I have a few comments to clarify our scope.

Connected to most other nodes, which seems odd.

Our state machine triggers the task to be performed, so it could be a client of any service/action in the stack. That’s why it is “connected.”

Does your state machine also do significant work?

Nothing apart from handling the process logic and triggering other nodes.

Within our smach concurrences, we have about 5-6 sub-states.

We may have around 1000 states in many different containers.

One issue you may be running into is that smach only creates the active state(s) and then destroys them when they become inactive.

Our ROS 2 smach keeps all the objects in memory (300-400 MB). That is the main scalability issue and causes the system to slow down. Recently, we changed this so that the objects are created/deleted only when the state is active.

My main concern is that at some point our state machine calls a service and the response takes an eternity to arrive (>60 s), or in some cases does not arrive at all (something similar to a deadlock). We have traced that the response is returned quickly by the service server, but the client is not notified in time. Reducing the number of clients managed by the state machine makes the system run normally (the client gets the response immediately). Other nodes that are not as loaded run normally without any issue.

It is useful for relatively simple state machines.

Our state machine is anything but simple. It drives a very complex manufacturing process with many conditions, behaviors, and recoveries.

Just to mention, we are using CycloneDDS and it has consistently demonstrated higher reliability compared to FastDDS.

Most of our nodes are in Python; I’m not sure if components can be written in Python or if it is just an rclcpp feature.

In addition, there are many other posts where people complain about the performance of nodes written in Python.

I am not sure if I understand this correctly (sorry if I’m missing the point), but you must never destroy your service clients at run time.

Find a way to create them at launch time and keep them in memory somewhere.
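A minimal rclpy sketch of that pattern (the service names and the `Trigger` type are placeholders for illustration): the clients live as members of the node for its whole lifetime, and every state simply reuses them.

```python
# Hypothetical orchestrator: service clients are created once at startup and
# reused, so the executor's set of entities never changes at runtime.
import rclpy
from rclpy.node import Node
from std_srvs.srv import Trigger  # placeholder service type


class Orchestrator(Node):
    def __init__(self):
        super().__init__('orchestrator')
        # Created once; states only reference these members, never new clients.
        self.start_drilling = self.create_client(Trigger, '/drilling/start')  # placeholder
        self.start_riveting = self.create_client(Trigger, '/riveting/start')  # placeholder

    def trigger(self, client):
        if not client.wait_for_service(timeout_sec=2.0):
            self.get_logger().info('service not available yet')
            return None
        # Returns a Future; spin on it elsewhere instead of blocking the executor.
        return client.call_async(Trigger.Request())


def main():
    rclpy.init()
    rclpy.spin(Orchestrator())


if __name__ == '__main__':
    main()
```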

Could it be related to the QoS depth? I ran into a similar issue that was caused by too many nodes accessing the same service or topic concurrently, which led to the queue being full.
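For illustration, bumping the history depth on the subscription side in rclpy looks like the sketch below (the topic and values are placeholders). With many publishers hitting one subscriber, a KEEP_LAST queue of the default size can drop events before the callback ever runs.

```python
# Hypothetical QoS profile with a deeper queue for a busy event topic.
from rclpy.qos import HistoryPolicy, QoSProfile, ReliabilityPolicy
from std_msgs.msg import String  # placeholder message type

event_qos = QoSProfile(
    history=HistoryPolicy.KEEP_LAST,
    depth=100,                               # default depth is 10
    reliability=ReliabilityPolicy.RELIABLE,
)

# Inside a Node subclass:
#   self.create_subscription(String, '/sm_events', self.on_event, event_qos)
```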

Hm, our stack has

  • 61 nodes
  • 129 topics
  • 627 services on the nodes

and it runs fine on an Intel i7-10700TE; idle load is around 4-8% on all cores, working load around 30%.
Most of our nodes are written in C++ though, as the Python nodes just eat up your CPU.

Can you give a reference for the CPU you are using and the system load you are experiencing while the system is idle?

Stack clients: 198
Stack services: 636

Out of curiosity, do you have more service servers than clients? Or does your application create the client endpoint objects at runtime when they become necessary and destroy them every time, meaning service clients scale up to the thousands?

Can you give a reference for the CPU you are using and the system load you are experiencing while the system is idle?

+1 for this.

I would like to know this kind of configuration and environment information too, if it can be disclosed.

Besides this question:

  • What RMW implementation do you use?
  • Are all nodes on the same local host, or are they communicating over the network?
  • Is this problem only observed on the state machine node, or on other nodes as well?
  • Do you enable ROS 2 security enclaves?

I would recommend creating an issue in the appropriate rmw implementation repository to track this. (I am not saying that the problem is in the rmw implementation yet, but that could be the first place to post.)

Thanks,
Tomoya

Have you tried switching DDS vendors?
We had similar issues with Humble that disappeared after switching to Cyclone DDS + localhost only (with multicast enabled on the lo interface).
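For anyone wanting to try that kind of setup, a rough sketch (the node below is only a placeholder): select the Cyclone RMW and restrict traffic to the loopback interface via environment variables. Multicast still has to be enabled on `lo` at the OS level.

```python
# Hypothetical launch file: use Cyclone DDS and keep all DDS traffic on localhost.
from launch import LaunchDescription
from launch.actions import SetEnvironmentVariable
from launch_ros.actions import Node


def generate_launch_description():
    return LaunchDescription([
        # Pick the Cyclone DDS RMW for every process started from this launch file.
        SetEnvironmentVariable('RMW_IMPLEMENTATION', 'rmw_cyclonedds_cpp'),
        # Restrict DDS traffic to the loopback interface.
        SetEnvironmentVariable('ROS_LOCALHOST_ONLY', '1'),
        Node(package='demo_nodes_cpp', executable='talker'),  # placeholder node
    ])
```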

Thank you all for your answers,

We’ve tried both FastDDS and Cyclone. I’ll look into shared memory transport.

I am attaching two images captured on an i9-12900H with 32 GB of RAM:

[Image: Idle]

[Image: Running the process]

Everything is running in the same local system.

Not that I am aware of.

The problem is observed on the state machine node. We think this is just because that node orchestrates everything, so the issue is most noticeable there.

A quick solution could be composable nodes.

Nav2 evaluated the effects of composable nodes in Nav2 composition

Unfortunately, we have also observed poor performance from the default Python executor. In our case, it struggled with high-throughput topics, which would pin a CPU core at or close to 100%. But I wouldn’t be surprised if it struggled to service a large number of services/subscriptions, since it rebuilds the list of callbacks, etc., at each iteration.

Hm, this actually looks fine. Any processes at 100% CPU time? Judging from the load graph, I would guess not.

Could you give further explanation? Doing that seems to help in our case.

“Must” is too strong here; “should” is the correct word.

Creating and destroying clients on the fly, instead of storing them in class members and reusing them, creates extra load in the executor, as it needs to regenerate its internal state whenever entities (subscriptions, publishers, waitables, services, etc.) are added or removed.

Hm, on the other hand, having hundreds of service clients around that you never use also creates static overhead in the executor, as they need to be added to the rcl wait set on every wait. I think this is a trade-off.