[preprint] Message Flow Analysis with Complex Causal Links for Distributed ROS 2 Systems

I’m happy to share this new paper:

It is a significantly improved ROS 2 version of the prototype I made for ROS 1 almost 3 years ago, now using ros2_tracing (see also the ros2_tracing paper).

The method can extract and visualize the path of a message across a ROS 2 system. It works for distributed systems (any number of hosts!), and also supports user-level links between input and output messages. There’s also a visualization of the state of the executor instances over time.

See the example below based on the Autoware reference system split over two hosts.

Finally, I’m happy to say that the ros2_tracing paper was accepted for publication in IEEE Robotics and Automation Letters!


@christophebedard happy to review a PR to add this as a report to the reference_system :wink:


@christophebedard Thanks for sharing this very interesting paper! I’ll try it.

So far, traces can be analyzed from Jupyter with tracetools_analysis. Will you be developing tracetools_analysis as well?

tracetools_analysis: ros-tracing / tracetools_analysis · GitLab

This is a bit off-topic, but ISP and TIER IV are also working on a similar tool, CARET, for tracing Autoware.universe. CARET derived ideas from ros2_tracing and RAPLET.

RAPLET: Demystifying publish/subscribe latency for ros applications.

First, let me share a brief announcement about it.

CARET

Its features and its differences from ros2_tracing are listed below.

Features:

  • Low overhead with LTTng-based tracepoints for sampling events in ROS/DDS layer
  • Flexible tracepoints added by function hooking with LD_PRELOAD
  • Python-based API for flexible data analysis and visualization
  • Application-layer event tracing in cooperation with TILDE, a runtime message tracer

Differences:

  • CARET-dedicated tracepoints are added by function hooking with LD_PRELOAD
    • CARET also utilizes existing tracepoints for ros2_tracing
    • Functions implemented as C++ templates cannot be hooked with LD_PRELOAD, so we added a few tracepoints to rclcpp directly
  • Our main target is the ROS 2/DDS layer; OS events such as sched:wakeup are out of scope
  • CARET will trace the /tf topic after the v0.3.x release; this function is currently under test
  • We observe and visualize data in Jupyter notebooks using the Python-based API provided by CARET
  • Calculating the latency of a node with complicated dependencies between its inputs and outputs is difficult, so we use wrappers for publishers and subscriptions to annotate each message
    • CARET will cooperate with another tool, TILDE, a framework that detects deadline overruns. This is under development.
  • Only single-host applications are supported; CARET cannot be applied to an application that runs on multiple machines over a network
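The wrapper idea in the list above can be illustrated with a minimal pure-Python sketch. This is not the CARET or TILDE API: `Annotated`, `LinkContext`, `wrap_subscription`, and `wrap_publisher` are hypothetical names made up for this example, and no ROS 2 library is used. It only shows the principle: the subscription wrapper remembers which input messages a node consumed, and the publisher wrapper stamps those IDs onto the next output message.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Hypothetical ID generator for output messages (illustration only).
_ids = itertools.count(1)


@dataclass
class Annotated:
    """A message payload plus the annotation used to link inputs to outputs."""
    payload: Any
    msg_id: int
    inputs: Tuple[int, ...] = ()


class LinkContext:
    """Remembers the input messages a node consumed since its last publish."""

    def __init__(self) -> None:
        self.pending: List[int] = []


def wrap_subscription(ctx: LinkContext, callback: Callable[[Any], None]):
    """Record the ID of each received message, then run the user callback."""
    def wrapped(msg: Annotated) -> None:
        ctx.pending.append(msg.msg_id)
        callback(msg.payload)
    return wrapped


def wrap_publisher(ctx: LinkContext, publish: Callable[[Annotated], None]):
    """Attach the pending input IDs to the outgoing message, then publish."""
    def wrapped(payload: Any) -> Annotated:
        msg = Annotated(payload, next(_ids), tuple(ctx.pending))
        ctx.pending.clear()
        publish(msg)
        return msg
    return wrapped


# Usage: a made-up fusion node that consumes two inputs and publishes once.
sent: List[Annotated] = []
cache = {}
ctx = LinkContext()
on_lidar = wrap_subscription(ctx, lambda p: cache.update(lidar=p))
on_camera = wrap_subscription(ctx, lambda p: cache.update(camera=p))
publish_fused = wrap_publisher(ctx, sent.append)

on_lidar(Annotated("scan", msg_id=101))
on_camera(Annotated("image", msg_id=202))
out = publish_fused(dict(cache))
print(out.inputs)  # IDs of the inputs that contributed to this output
```

The trade-off is the one this list describes: the link between inputs and outputs is explicit and cheap to record, but every node has to use the wrappers.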

CARET separates path selection and latency calculation.

  1. Select the path to be evaluated from the node graph

A single path is represented as a nodes-chain.

  2. Draw message flows for the path

Each line corresponds to one message and represents latency.
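As a rough illustration of keeping path selection separate from latency calculation, here is a small pure-Python sketch. It does not use the CARET API; the node names, the `FLOWS` data, and both helper functions are hypothetical and exist only for this example. The path is a plain list of node names, and each message flow maps a node to the time at which the message reached it.

```python
from typing import Dict, List, Optional

# Hypothetical node chain selected from the node graph (names are made up).
PATH: List[str] = ["sensor", "filter", "planner"]

# Hypothetical trace data: for each message, the time (in ms) at which it
# reached each node. A missing node means the message never got that far.
FLOWS: List[Dict[str, float]] = [
    {"sensor": 0.0, "filter": 2.5, "planner": 7.0},
    {"sensor": 10.0, "filter": 12.0, "planner": 18.5},
    {"sensor": 20.0, "filter": 23.0},  # never reached the planner
]


def path_latency(flow: Dict[str, float], path: List[str]) -> Optional[float]:
    """Latency from the first to the last node of the path, or None if the
    message did not traverse the whole path."""
    if any(node not in flow for node in path):
        return None
    return flow[path[-1]] - flow[path[0]]


def hop_latencies(flow: Dict[str, float], path: List[str]) -> List[float]:
    """Latency of each consecutive hop that the message actually completed."""
    hops = []
    for src, dst in zip(path, path[1:]):
        if src not in flow or dst not in flow:
            break
        hops.append(flow[dst] - flow[src])
    return hops


latencies = [path_latency(flow, PATH) for flow in FLOWS]
print(latencies)
print([hop_latencies(flow, PATH) for flow in FLOWS])
```

Because the path is just data, the same latency functions can be reused for any other node chain picked from the graph.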

Result

For reference, here is an example of a measurement with Autoware.universe.

Our team has been so eager to trace Autoware.universe with CARET that we were late in introducing this tool to the ROS community.

Since CARET’s goal is very similar to that of ros2_tracing, I’m willing to share my experience and what I learned from applying CARET to a large piece of software, Autoware.universe, if you are interested.


@hsgwa I knew about RAPLET (and we included it in both the ros2_tracing paper and the new message flow paper), but CARET looks very nice! You should definitely announce these kinds of tools in their own posts as soon as they’re ready!

Being able to select a specific path in the DAG is a good feature. I’m looking forward to learning more about TILDE, whenever it is released :smiley: I’m also interested in seeing how you support /tf and other pub/sub extensions (e.g., message_filters, image_transport). As mentioned in our paper, since we want our method to work on existing systems out of the box, detecting links between input and output messages without pub/sub wrappers is quite a challenge.

This is a fair question. While our overall goals are similar in general (i.e., tracking messages and getting various timing-related data), as you pointed out, the exact features are different. Besides supporting distributed systems, our goal with this paper is really to be able to extract and display execution data that we can compare with data from other sources (e.g., Linux kernel and other application-level data) to get the full picture in order to analyze the performance of the whole system. This is why our implementation uses Trace Compass. Therefore we do not plan on porting this method over to tracetools_analysis.

I would definitely be interested in reading a full standalone post about this!


@christophebedard Thanks for the comment!
We’ll post a new article as soon as it is ready.

detecting links between input and output messages without pub/sub wrappers is quite a challenge.

We’re not completely ready to deal with them yet, but the development version can measure /tf with just a hook using LD_PRELOAD.
However, intra-node links between subscriptions and publishers are difficult to detect in arbitrary node implementations, so we are thinking of using wrappers; TILDE is a pub/sub wrapper.
We’ll also announce TILDE when it is ready :wink:

