ROS graph information tools implementation discussion

The stretch goals for Beta 1 (due in mid-December) include rostopic. rostopic is a fairly vital tool when developing with ROS, and is probably one of the most-used tools in the ROS-based robot developer’s toolkit. However, there are some challenges in developing rostopic for ROS 2, particularly the rostopic list command.

In ROS 1, rostopic can easily go off and get the list of known topics from the master, because the master Knows All. This approach is not so straightforward in ROS 2. The use of distributed discovery in DDS means that, by default, no one knows for certain the entire state of the DDS network. Furthermore, when a new participant starts up, it has absolutely no knowledge of the state of the network (beyond what may be hard-coded in) and has to wait for the discovery process to begin producing results. This process may be very quick for things on the local machine, but for remote computers it is likely to take a significant amount of time.

If we implement rostopic list by making it start up, wait for a given length of time, then print the result, there is a good chance that it will not give a complete picture of the ROS graph and so will not be a useful tool.

The purpose of this topic is to decide how to implement rostopic such that it provides the same rapid response as ROS 1, while dealing with the distributed nature of the graph in ROS 2, and with as complete information as possible given the limitations of distributed discovery.

To get right into it, my proposal is:

  • A daemon is started by something, e.g. roslaunch, or at system startup, or when starting the terminal and sourcing setup.bash, or by some kind of start_all_the_ros_stuff_ill_need command.
  • This daemon’s job is to listen for changes in the graph. It records all changes and stores the current state of the graph. The recent work on listening for graph changes by @wjwwood would be used. (A rough sketch of such a daemon follows this list.)
  • The daemon provides a ROS topic on a well-known topic, e.g. /ros/topics (based on the idea that the /ros namespace is reserved for infrastructure topics). Connecting to this topic will provide the current state of the graph.
  • This topic probably needs to be a service to avoid wasting resources broadcasting something only needed intermittently.
  • If it is a service, there is still value in having a topic that broadcasts on change for long-running tools to listen to.
  • It may be worth considering having a separate domain for the infrastructure topics.
  • Each computer being used should run a copy of the daemon so that connecting to it does not involve traversing the network and so is fast.
  • The rostopic list command, upon starting, connects to the daemon’s topic, gets the current graph info, and prints it out.
  • If rostopic cannot find the expected topic, then it prints a warning, then waits a configurable length of time (with a suitable default), running the DDS discovery process and listening for graph information. After the time limit it prints the result.
  • If you see that warning, you should understand that something is broken in your ROS system.
  • This is a downside of the daemon approach: It is a single point of failure for the rostopic functionality.
  • The daemon can also provide the same information over any other useful protocols. One particularly useful one would be a REST service, so that tools (for example, web-based tools) could connect and get the information without needing to use DDS or be part of the ROS graph.
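
To make the shape of this concrete, here is a minimal sketch of such a daemon, assuming rclpy. The /ros/topics topic name comes from the proposal above, but the use of std_msgs/String carrying YAML, the one-second polling timer (instead of reacting to graph-change events), and the “latched” QoS settings are all illustrative placeholders rather than a settled design.

```python
import yaml

import rclpy
from rclpy.node import Node
from rclpy.qos import QoSDurabilityPolicy, QoSProfile
from std_msgs.msg import String


class GraphInfoDaemon(Node):
    """Caches the observed graph state and republishes it for tools to read."""

    def __init__(self):
        super().__init__('graph_info_daemon')
        # "Latched" publisher so late-joining tools get the last snapshot.
        qos = QoSProfile(depth=1,
                         durability=QoSDurabilityPolicy.TRANSIENT_LOCAL)
        self._pub = self.create_publisher(String, '/ros/topics', qos)
        self._last = None
        # Poll the graph on a timer; a real daemon would instead react to
        # graph-change events (the work by @wjwwood mentioned above).
        self.create_timer(1.0, self._publish_graph)

    def _publish_graph(self):
        topics = dict(self.get_topic_names_and_types())
        if topics != self._last:
            self._last = topics
            self._pub.publish(String(data=yaml.safe_dump(topics)))


def main():
    rclpy.init()
    rclpy.spin(GraphInfoDaemon())


if __name__ == '__main__':
    main()
```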

The daemon mentioned above could be easily extended to also provide information for rosnode, rosservice, and similar commands.

If we use the above approach, I would like to fix the ROS topics provided by the daemon, as well as any other protocol interfaces/ports and data formats (e.g. YAML), so that they are usable by other tools and tool implementers, such as rviz and rqt_graph.

There has also been work done on rostopic and friends by the fine folks at Erle Robotics:

https://github.com/ros2/ros2/issues/197

Their implementation, I think, uses a listener waiting for information to pour in on a specific topic. However, rather than trying to summarise their work myself, it would be great if @vmayoral could drop by and give us a description himself.

Hey @gbiggs,

Totally agree with your reasoning above. Good hearing that there’s someone else interested in this :).

The issue you pointed out above summarizes our work pretty nicely, I believe. We had an initial (more complex) implementation for OpenSplice and then switched to FastRTPS. As of now, we’ve got simple rostopic list and rosnode list functionalities published and available. Changes are needed in rclpy, rcl, rmw and rmw_<dds_implementation>.

Motivated by your comments, I just went ahead and submitted a set of pull requests to integrate the changes needed upstream. Here they are:

@marguedas, I think you’ll be interested in this.
Cheers!

A dedicated information service seems to be a good approach, and I’m not overly worried about introducing a point of failure there.

However, when googling for this, some patents popped up! Check out https://www.google.ch/patents/US20150055509 and https://www.google.ch/patents/US20110258313. They are not exact matches, but particularly “network assisted peer discovery” is pretty close.

By the way, from the first patent you can find a number of other patents related to DDS. I know this is tangential to your question, but it has me a bit worried.

Could you give a brief overview of your approach, for a basis of discussion? How much does it differ from my proposal above?

@iluetkeb That’s a little disturbing. There’s probably prior art (I basically described DNS), but it could put off companies.

@gbiggs, the current proposal from our side (shared in the PRs above) uses the FastRTPS primitives to inspect the existing DDS participants/topics and report them through the ROS tooling. This simple approach did the job for us, but we understand that a more generic, DDS-vendor-agnostic layer might be put in place at some point.

Hope that helps clarify our approach.

I am very interested in this discussion, as we at ASI have been playing around with ROS 2 a lot. We have implemented our own version of rostopic list by using Node::get_topic_names_and_types().

We haven’t had the time to really dig into the rcl and rmw layers to understand fully what’s going on, but using this call has worked out fairly well. I will attest to @gbiggs’ statement that using this call comes at the cost of time. We have to let the tool spin for many seconds (currently around 15 seconds) just to be sure we’ve collected info from everyone in the system. Pretty inconvenient from a timing perspective, but still better than nothing!
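
For reference, the “spin and wait” approach described above looks roughly like the sketch below, written in Python using rclpy’s equivalent of the same call (the original at ASI used the C++ Node::get_topic_names_and_types(); the 15-second budget is just the figure mentioned above, not a recommendation).

```python
import time

import rclpy
from rclpy.node import Node


def list_topics(wait_sec=15.0):
    """Spin for a fixed period and return whatever topics discovery has found."""
    rclpy.init()
    node = Node('rostopic_list_probe')
    deadline = time.monotonic() + wait_sec
    topics = {}
    while time.monotonic() < deadline:
        # Let discovery trickle in and keep merging whatever we see.
        rclpy.spin_once(node, timeout_sec=0.1)
        topics.update(dict(node.get_topic_names_and_types()))
    node.destroy_node()
    rclpy.shutdown()
    return topics


if __name__ == '__main__':
    for name, types in sorted(list_topics().items()):
        print(name, types)
```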

Also, I’ve noticed a new flaw in this approach which I think may be related to my previous discussion here. When we set the Participant Index for OpenSplice to ‘none’, I think it causes get_topic_names_and_types() to come up empty-handed. Or at least, it comes up empty sporadically. I still have to investigate that some more.

As an aside, we’ve also got a rostopic echo working on most basic messages. It’s a simple Python hack-fest (roughly sketched below) where we have a script that does the following:

  • takes commandline input for topic name, message type and qos
  • does some smart file searching for .msg files
  • does some string-replacing on a listener-template.py file
  • executes that newly generated listener.py.
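
A rough sketch of what that hack might look like; the template file name, placeholder tokens and output file name here are made up for illustration, and the real script also does the .msg file searching mentioned above.

```python
import subprocess
import sys


def generate_and_run(topic, msg_package, msg_type,
                     template='listener-template.py'):
    """Fill in a listener template and run the generated script."""
    with open(template) as f:
        source = f.read()
    # Naive string replacement of placeholder tokens in the template.
    source = (source
              .replace('@TOPIC@', topic)
              .replace('@MSG_PACKAGE@', msg_package)
              .replace('@MSG_TYPE@', msg_type))
    with open('listener.py', 'w') as f:
        f.write(source)
    subprocess.run([sys.executable, 'listener.py'])


if __name__ == '__main__':
    # e.g. ./echo_hack.py /chatter std_msgs String
    generate_and_run(*sys.argv[1:4])
```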

This has helped tremendously, even as hacky as it is.

I could post the code if anyone thinks it’ll be useful.

I think the idea of a daemon process to optionally provide the requested information faster is a good approach :thumbsup:

I would like to comment on several aspects mentioned in the thread:

  • IMO, all of the following cases should work:

    • The daemon got started before (by launch, the system, wherever, this should be an implementation detail).
    • The daemon gets started on demand by the first invocation of a command line tool needing it (slower on first call).
    • No daemon is there, but the tool should still provide reasonable results (a trade-off between wait time and completeness).
  • The command line tools are only one use case of that interface. It should be possible to write code on top of it which accesses the information without caring about the internal optimization of having a daemon. So all the functionality should be exposed in an API which is then also used to implement the CLI.

  • I am not convinced that the daemon should have a ROS-based interface. In order to call its service, the client would still need to wait for some discovery phase. Even if that is local, it will take additional time which we want to avoid.
    Therefore I think the daemon should provide its interface using a different protocol. That protocol needs to work without any “discovery” and be available across all targeted platforms.

  • With the choice of the protocol also comes the decision of how the daemon is identified. It could be one single daemon per ROS graph, or it could be one daemon per host to make the queries “local”. If the daemon were, e.g., to expose its interface via D-Bus, you would probably run it on each host. Someone mentioned a REST service; it could be either a single global one or a local one per host. Either way, we need to determine how the daemon is referred to in the context of running two separate ROS systems on a single machine.

  • Any of these tools and interfaces should be independent of any specific rmw implementation. If the rmw interface is implemented by the newest and hottest discovery / marshalling / transport solution, the tools should continue to work. The rmw interface might need to be extended to provide all the necessary information.

Non-Roboticist outsider looking in.

Service discovery in a distributed environment, while not necessarily a solved problem, has many implementations currently in use in production across a variety of companies and deployments.

Off the top of my head, I’m thinking of etcd and Consul.

Has anything like that been considered?

Glad to see that many people have expressed interest in this discussion.
As expected, these tools are needed by everybody, and that has resulted in a lot of prototype solutions being created by each and every one of us. Hence the need for this discussion to provide a global, rmw-agnostic solution for these tools.

Thanks @gbiggs for this great summary.

A couple of notes / open questions on top of what has been stated:

  • Agreed with @dirk-thomas’ comment that the daemon should be seen as a way to provide information faster rather than as a single point of failure. The feature should hence work without it (at a performance cost).

  • :+1: on the fact that the ROS side of things should be rmw_implementation agnostic and allow users to develop their own tools or daemons using the protocol of their choice.

  • Protocol-wise, I’d be curious to know what everyone’s preference is. It seems to me that REST services have become very popular now that most people provide web interfaces to their systems, but would it be your preferred choice?

  • In the case of multiple juxtaposed ROS systems, it comes down to how we identify “different” ROS systems. One approach could be to have the daemon aggregate all graph information for a given DDS domain, and thus create a daemon for each DDS domain used. What other approaches for differentiating ROS systems could be used?

  • Another point being addressed is the trade-off between duplicating daemons (and information storage) on each host vs. having a single global daemon and service. If the default protocol allows it, this could be left up to the end user by providing an option to specify when and how to launch the daemon, and the way to query daemon information should be adapted accordingly. We will still need to specify the default behavior for this.

Thanks for enumerating those. I agree whole-heartedly with all of them. (Especially the last! I implement my tools like that and it’s been very useful.)

I strongly support having a non-ROS-based interface. You’ve listed the advantages already. However, perhaps it could be useful to have a ROS-based interface in case someone wants to use the functionality from a node? Perhaps this should be a lower-priority feature unless we find an actual use case for it.

Yes, there are several ways that this could go. I think that if we use an environment variable, we can build an automatic mechanism without too much trouble. For example, the rostopic tool/library could look for an environment variable defining the graph information daemon’s address, and if it isn’t defined, attempt to contact one on the local machine. If that doesn’t work, then it could request that the daemon be launched on the local machine.
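
As a sketch of that lookup logic: the ROS_GRAPH_DAEMON_URI variable name, the default port, and the ros2 graph_daemon launch command below are all hypothetical, and this assumes a REST-style daemon interface.

```python
import os
import subprocess
import time
import urllib.error
import urllib.request

# Hypothetical default address for a daemon on the local machine.
DEFAULT_URI = 'http://localhost:11511/topics'


def get_graph_info():
    """Fetch graph info from the daemon, starting one locally if needed."""
    uri = os.environ.get('ROS_GRAPH_DAEMON_URI', DEFAULT_URI)
    try:
        with urllib.request.urlopen(uri, timeout=1.0) as response:
            return response.read().decode()
    except (urllib.error.URLError, OSError):
        # No daemon reachable: ask for one to be started locally, then retry.
        subprocess.Popen(['ros2', 'graph_daemon'])  # hypothetical command
        time.sleep(1.0)
        with urllib.request.urlopen(uri, timeout=5.0) as response:
            return response.read().decode()
```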

Absolutely agree. This is another benefit of hiding it behind a well-known service description, I think: It becomes easier to abstract away the implementation. Then we can look into using existing technologies such as those @Barrett_Strausser mentioned and see if they do what we need already.

I’m in favour of a REST service as the initial goal. I’ve also been told by someone locally who does a lot of tool implementations, using the browser as an interface, that he wants “everything” to be available over REST. I’m assuming he’s talking about introspection facilities, not absolutely everything.

However, I think the information should be available on other protocols from the same daemon if someone wants it that way. If we specify the data structure and the protocol used to access it separately then it should be relatively easy to add additional access methods.

Being able to configure which domains the daemon aggregates data for, and run different daemons for different domains, sounds like a sensible use case. Especially with the potential for using domains for security zone partitioning.

I would check whether requiring a non-ROS interface (which, in this case, I take to mean a non-DDS interface) is strictly necessary. Maybe we can avoid the discovery phase by other means, e.g. by pre-supplying the necessary information, assuming that the daemon runs locally (which other means would also have to assume).

My reason for that is to avoid duplicating communication code and functionality. While at first it might appear simple to open a TCP connection and exchange some data, you have to handle edge cases, like when the port is blocked by something else, or supporting in-process communication, and so on. I think the reliance on XML-RPC in ROS 1 was a cause of complexity, and one of the advantages of using DDS is that we can avoid that.

It would still be possible, of course, to offer a non-ROS interface in addition, e.g. for use by web-based services. This would not necessarily have to run on every machine, however, but only when necessary.

@dirk-thomas

Demand starting is a tricky business. I would prefer that, in the first instance, the launch procedure starts the daemon and the client implementations fall back to DDS discovery when the daemon is not present. Also, I would prefer that when demand-starting is implemented, we do it through a minimal daemon process which loads the “real” process upon first connection. This again keeps demand-starting functionality out of the core libraries.

Regarding independence from the specific rmw implementation: the interface should be independent, but the internal implementation could be vendor-specific. Several DDS vendors offer this functionality already (including FastRTPS, as far as I can tell), and it would be very good to re-use this. I would be worried that we would otherwise be inefficient by bypassing the vendor.

So we’ve got a vote here for the ROS interface being the only interface.

I think it is reasonable to have the same information available over different interfaces. Certain tools (e.g. web-based tools) would benefit from not needing to use the DDS stack.

Which interfaces are required, and which interface is considered the “default” interface by tools, are the questions that we need to settle.

How fast a tool using a ROS topic can start up and get the information it wants is something I hope to test today.

I think this overstates the difficulty. If we implement using Python, as an example, there are good libraries available for providing web services, such as Django, CherryPy and several lower-level-but-still-abstracted libraries included in Python 3.

The counter-point there is that, like XML-RPC in ROS 1, we would be increasing the number of dependencies if we use something like Django or CherryPy.
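
For what it’s worth, a dependency-free version is possible using only the Python 3 standard library (http.server). A minimal sketch is below; the path, port and payload shape are illustrative, and a real daemon would serve its cached graph state rather than the hard-coded example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder for the daemon's cached graph state.
GRAPH_STATE = {'/chatter': ['std_msgs/String']}


class GraphHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != '/topics':
            self.send_error(404)
            return
        body = json.dumps(GRAPH_STATE).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == '__main__':
    # Serve the graph snapshot at http://localhost:11511/topics
    HTTPServer(('localhost', 11511), GraphHandler).serve_forever()
```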

@gbiggs

I think this overstates the difficulty.

I meant not just implementation difficulty, but also use-time complexity. While not entirely comparable, consider that in ROS 1, the use of XML-RPC was the root cause of all the mess with ROS_HOSTNAME/ROS_IP and the various DNS resolution issues that this led to. At least in my experience, lots of people struggled with that.

You’re pulling in an entirely new communications stack, with all the side effects that has.

Sure, you can try and avoid that, for this special case, but that’s just why I’m suggesting that we investigate whether it’s really necessary first, instead of barging headlong into an implementation.

You are right about use-time difficulty; I hadn’t thought of that. The less that can go wrong, the better, from that perspective.

I agree that investigation is necessary. I originally envisioned the ROS tools like rostopic using the DDS stack to get the information, with the REST interface being a convenience thing for makers of tools that benefit from not using the whole ROS stack. @dirk-thomas would prefer it be the other way around, I think.

Anyway, I spent some time today hacking a wholly unscientific benchmark into the add_two_ints_client example. Here are my results, based on 100 runs on a 2015 MacBook Pro (because for some reason Alpha 8 just point blank doesn’t run on my Ubuntu desktop and I haven’t had time to figure out why yet). Times are in seconds.

call                       minimum   first quartile  median    mean      third quartile  maximum
rclcpp::init               0.000041  0.000049        0.000051  0.000052  0.000053        0.000071
rclcpp::Node::make_shared  0.005342  0.005714        0.005867  0.005948  0.006078        0.007536
node->create_client        0.005635  0.005979        0.006139  0.006234  0.006385        0.008588
make_shared request        0.005643  0.005987        0.006147  0.006243  0.006394        0.008600
wait_for_service           0.008758  0.009588        0.010619  0.030617  0.011594        1.011910
async_send_request         0.008900  0.009737        0.010776  0.030780  0.011735        1.012290
receive result             0.224500  0.232855        0.235795  0.251124  0.238558        1.015980
rclcpp::shutdown           0.224662  0.233006        0.235900  0.251237  0.238677        1.016050
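
For anyone who wants to reproduce the methodology, a rough Python analogue of that instrumentation is sketched below; the numbers above came from the C++ add_two_ints_client demo, and the service and message names here assume the example_interfaces package.

```python
import time

import rclpy
from example_interfaces.srv import AddTwoInts


def stamp(label, start):
    """Print the elapsed time since `start` for one step."""
    print(f'{label}: {time.monotonic() - start:.6f} s')


def main():
    start = time.monotonic()
    rclpy.init()
    stamp('init', start)
    node = rclpy.create_node('timing_probe')
    stamp('create_node', start)
    client = node.create_client(AddTwoInts, 'add_two_ints')
    stamp('create_client', start)
    client.wait_for_service()
    stamp('wait_for_service', start)
    future = client.call_async(AddTwoInts.Request(a=2, b=3))
    stamp('async_send_request', start)
    rclpy.spin_until_future_complete(node, future)
    stamp('receive result', start)
    rclpy.shutdown()
    stamp('shutdown', start)


if __name__ == '__main__':
    main()
```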

I would like to second the argument of @iluetkeb about the use-time complexity of multiple communication mechanisms, from a different point of view: I have a student working on developer understanding of ROS and the current variety of communication mechanisms, and the consensus among the people being interviewed is that there are too many of them.

I am all in favor of keeping it simple, and implementation-wise, the original proposal using a daemon and the existing ROS communication mechanism looks very attractive to me. The daemon always runs and has the information available as a ROS service call, plus uses a topic to broadcast changes to the graph such that long-running tools are informed of this.

If later on it turns out that more is needed (e.g. REST API), this can simply be implemented as extra functionality on top of this layer.

I just want to reiterate: the reason the current command line tools are slow is that they have to wait for the discovery phase to finish before they can query the desired information. The idea of the daemon is that it is already running beforehand and has already accumulated the information.

If the command line tool wants to use the ROS interface to request information from the daemon, it again needs to wait for the discovery phase to finish before it can do so. It might only need to wait for the availability of that daemon, but this still implies a significant overhead in waiting time for the user. I don’t think a command line tool (which, e.g., is often also used for completion) can have that time penalty. While I am certainly not a big fan of having a different transport mechanism, I don’t see how the requirement of the lowest latency possible can be fulfilled using a distributed peer-to-peer system like DDS.

I guess @iluetkeb’s point was to use DDS communication, but find a way to avoid the discovery phase in the specific case of connecting to the daemon. It seems to me that letting a tool like rostopic know how to directly connect to the daemon with DDS is not a much different problem from letting the tool know how to connect to the daemon through some other protocol.

Thanks to @NikolausDemmel for making my point better than I did :wink: Yes, that is exactly what I was trying to suggest.

If the command line tool wants to use the ROS interface to request information from the daemon it again needs to wait for the discovery phase to finish before it can do so.

Not necessarily. See http://eprosima-fast-rtps.readthedocs.io/en/latest/advanced.html#matching-endpoints-the-manual-way, for example.