Consideration of performance measurement methods

I’m considering how to measure the performance of ROS2 applications.
I’m still in the middle of this project, but I would appreciate any comments.

Goal

Measuring latency and identifying its causes.

Latency measurement

I’d like to measure

  • callback duration
  • communication latency
  • node latency
  • end-to-end latency

in a real application.

For example, I’d like to create the following diagram.

[Figure: node_latency]

Visualizing Behavior

ROS 2 involves communication and scheduling, so an application may behave differently from what the developer intended and end up with poor performance.
Communication QoS settings, callback groups, and callback priorities are examples of factors that affect this behavior.

So, I’d like to create the following diagram.

In terms of content, the following article does something similar.

WIP

I think the following trace points are mainly necessary for measurement.

Since I’m still experimenting, I’m mainly using LD_PRELOAD hooks to add trace points.
However, ROS2 has some header implementations that cannot be hooked.
To add trace points to headers, I also add trace points on the forked rclcpp and ros2_tracing.
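
For reference, the general shape of such an LD_PRELOAD hook is sketched below. This is not my actual instrumentation: the rcl_publish signature is simplified to opaque pointers, and the fprintf stands in for a real LTTng trace point.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for RTLD_NEXT
#endif
#include <dlfcn.h>
#include <cstdio>
#include <ctime>

// Intercepts rcl_publish, records a monotonic timestamp, then forwards the
// call to the real implementation.
extern "C" int rcl_publish(const void * publisher, const void * ros_message, void * allocation)
{
  using orig_fn = int (*)(const void *, const void *, void *);
  static auto orig = reinterpret_cast<orig_fn>(dlsym(RTLD_NEXT, "rcl_publish"));

  timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);  // timestamp for the trace point
  std::fprintf(stderr, "[hook] rcl_publish msg=%p t=%ld.%09ld\n",
               ros_message, static_cast<long>(ts.tv_sec), ts.tv_nsec);

  return orig ? orig(publisher, ros_message, allocation) : 1;  // forward to the real function
}

// Build: g++ -shared -fPIC -o libhook.so hook.cpp -ldl
// Run:   LD_PRELOAD=$PWD/libhook.so ros2 run <package> <executable>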

In the future, I’m considering proposing new trace points to ROS if the measurements seem useful.

Problem

I’m currently trying to use FastDDS and CycloneDDS for measurements, but on_data_available is not working as expected.
on_data_available is not used in rmw_cyclonedds, and there may be better places to insert trace points.
I think it’s necessary if we want to evaluate pure communication latency (i.e., latency that does not depend on scheduling), though.
I don’t know much about DDS and rmw, so I would be glad if you could comment on this as well.
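
For context on where on_data_available sits, here is a minimal Fast DDS-style listener sketch. It is only meant to show the hook location; it is not how rmw_fastrtps is actually instrumented, and the fprintf is a placeholder for a real trace point.

#include <cstdio>
#include <fastdds/dds/subscriber/DataReader.hpp>
#include <fastdds/dds/subscriber/DataReaderListener.hpp>

// Listener whose on_data_available fires when data reaches the DataReader,
// i.e. before any executor/callback scheduling in the ROS 2 layers.
class TracingListener : public eprosima::fastdds::dds::DataReaderListener
{
public:
  void on_data_available(eprosima::fastdds::dds::DataReader * reader) override
  {
    // A trace point here would capture the arrival time at the reader.
    std::fprintf(stderr, "on_data_available: reader=%p\n", static_cast<void *>(reader));
  }
};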

8 Likes

@hsgwa thanks for sharing your thoughts. This looks nice :+1:

I got a question about message flow.

Can we trace a specific message flow? To me, this looks more like statistics from each independent endpoint.
I was thinking about this kind of traceability before. I think doing that requires integrating a unique identifier into the message to trace it, because we have no idea how many frames are delayed in the application, and the data processing rate/frequency might differ between applications.

anyway, it would be really nice to trace the message flow :slightly_smiling_face:

1 Like

@tomoyafujita

Thanks for the comment!

I thought latency statistics for each component are also needed to understand the overall performance, so I included them as an example.
Measurements are tracked per message and can be used for more detailed analysis.

Specifically, the following steps are used to track message flow.

Communication

In order to link a publish to its subscription, I use the source_timestamp that the DDS layer sends with the message as a unique message ID.

[Figure: message_track]
I thought about using header.stamp, like the topic_statistics built into rclcpp, but I gave up because I was concerned it would restrict the message type and clobber the stamp value.

The source_timestamp is a value generated by the DDS layer, so on the publisher side we have to add a step that binds the message address to the source_timestamp.
The concern is that I have also added an LD_PRELOAD hook inside DDS, and that implementation depends on the DDS version as well :thinking:
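
Conceptually, the publisher-side binding is just a small table keyed by the message address. The sketch below is illustrative only; the class and function names are made up, and the way the source_timestamp is obtained inside the hook is not shown.

#include <cstdint>
#include <mutex>
#include <unordered_map>

// Maps a published message's address to the source_timestamp assigned by the
// lower layer, so later trace points can use the timestamp as the message ID.
class SourceTimestampBinder
{
public:
  void bind(const void * message, int64_t source_timestamp)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    map_[message] = source_timestamp;
  }

  int64_t lookup(const void * message) const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = map_.find(message);
    return it != map_.end() ? it->second : -1;  // -1: unknown message
  }

private:
  mutable std::mutex mutex_;
  std::unordered_map<const void *, int64_t> map_;
};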

The development-stage source code used for the LD_PRELOAD hooks can be found here.

Callback

Messages entering and leaving a callback have different types, so it is difficult to bind them by an ID such as source_timestamp.
Therefore, I bind callback_start with the next rclcpp_publish.
This way, we can bind the source_timestamp of the input message with the source_timestamp of the output message.

[Figure: bind_source_timestamps]

At the current stage, I’m trying to provide the callback graph information in advance.

The actual trace log also records information from other callbacks and publishes, so I use callback addresses and the publisher handle to bind them together into pairs that make sense.
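
In analysis terms, the pairing boils down to remembering, per thread, which callback is currently running when an rclcpp_publish event appears. Below is a simplified sketch of that pass; the event structure and field names are assumptions for illustration, not the actual trace format.

#include <cstdint>
#include <unordered_map>
#include <vector>

enum class EventType { CallbackStart, RclcppPublish, CallbackEnd };

struct TraceEvent
{
  EventType type;
  uint32_t tid;              // thread that emitted the event
  const void * callback;     // callback address (CallbackStart/CallbackEnd)
  int64_t source_timestamp;  // input message (CallbackStart) or output message (RclcppPublish)
};

struct Link
{
  const void * callback;
  int64_t input_stamp;
  int64_t output_stamp;
};

// Attributes each rclcpp_publish to the callback running on the same thread,
// binding the input and output source timestamps.
std::vector<Link> bind_publishes_to_callbacks(const std::vector<TraceEvent> & events)
{
  std::vector<Link> links;
  std::unordered_map<uint32_t, TraceEvent> running;  // per-thread current callback

  for (const auto & e : events) {
    if (e.type == EventType::CallbackStart) {
      running[e.tid] = e;
    } else if (e.type == EventType::RclcppPublish) {
      auto it = running.find(e.tid);
      if (it != running.end()) {
        links.push_back({it->second.callback, it->second.source_timestamp, e.source_timestamp});
      }
    } else {  // CallbackEnd
      running.erase(e.tid);
    }
  }
  return links;
}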

3 Likes

@hsgwa

thanks for the explanation, that helps me understand :+1:

The source_timestamp is a value generated by the DDS layer

(nit) Precisely speaking, it is the rmw implementation, not DDS.

So this is required and is the responsibility of the rmw implementation; as far as I can see, all tier-1 implementations do that.

In order to link a publish to its subscription, I use the source_timestamp that the DDS layer sends with the message as a unique message ID.

I think there are a couple of concerns:

  • A timestamp is not a unique ID. It could be the same for two messages, and it also depends on accuracy.
  • What if the system time is adjusted? I am not sure whether only a monotonic clock is used to fill in this field.

The development-stage source code used for the LD_PRELOAD hooks can be found here.

I see, that is a hot-patch.

Messages entering and leaving a callback have different types

Not only the types but also the rate/frequency would be different. Assuming a subscription takes 3 messages and the publisher publishes 1 message, can we still trace the message in this case?

By the way, is this on-target (localhost) traceability, or is it for distributed systems?

Sorry for coming up with so many questions; I am interested in this feature :smiley:

best,

2 Likes

This is great, thanks for sharing @hsgwa! I also think having this kind of information is important, and I’ve been working in this direction as well. Those callback diagrams are very nice!

please do! You can open an issue first to start a discussion, or we can just discuss in a PR.

It looks like you’re mostly using LD_PRELOAD to instrument DDS implementations, but just in case it might be useful, I’ve started instrumenting Cyclone DDS directly: Add read/write instrumentation with LTTng by christophebedard · Pull Request #898 · eclipse-cyclonedds/cyclonedds · GitHub. It’s way more basic (currently at least) than your current LD_PRELOADed instrumentation though.

In the message flow proof of concept for ROS 1, I tried to use network packets to match published->received messages, but to get it to work and be reliable, it would require a lot of work (see my “future work” notes in the blog post).

I’ve also been relying on the source timestamp. Personally, I’m not looking to cover all edge cases, so I think combining source timestamp + topic name (implicitly) is good enough to start. Putting aside clock adjustment (not sure what type of clock is used either), I think the probability of having messages published on the same topic with the same exact source timestamp is quite low. At the very least, you can detect situations like that when a subscription receives two messages with the same source timestamp.

@eboasson has given me some great tips about the internals of Cyclone DDS (+ I found some cool docs). Perhaps he and eProsima people could comment on this (or under an issue) and give some pointers for their respective DDS implementations?

I’ve also been thinking about this ever since I worked on the message flow thing for ROS 1. Could you provide/link to an example where this happens? I think taking a specific use-case/example and applying some sort of heuristic based on knowledge of the implementation could work. Then for the trace data analysis part, the user (i.e., person running the analysis) has to provide information about the sub->pub type/behaviour for the data to be correctly interpreted. Another option is to provide an application-level tracepoint so that developers can provide this “link” themselves. Personally I’m always aiming for all of this instrumentation to be hidden from developers, but for these special use-cases it might be the best solution.
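
As an illustration of that last option, a developer-provided "link" trace point could be as simple as the following, using lttng-ust's tracef() as a quick stand-in for a proper tracepoint provider. The function name and arguments are made up for this sketch.

#include <cinttypes>
#include <cstdint>
#include <lttng/tracef.h>  // link with -llttng-ust

// Hypothetical application-level trace point: the developer declares that the
// message about to be published was derived from a given input message.
inline void trace_message_link(int64_t input_source_timestamp, int64_t output_source_timestamp)
{
  tracef("message_link: in=%" PRId64 " out=%" PRId64,
         input_source_timestamp, output_source_timestamp);
}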

4 Likes

@tomoyafujita

Thanks for the questions!
I’m trying to organize my thoughts as I answer them, so it’s much appreciated :grinning_face_with_smiling_eyes:

  • A timestamp is not a unique ID. It could be the same for two messages, and it also depends on accuracy.

If the timestamps are the same, the measurement results will be skewed.
However, as Christophe said, I think we can reduce the possibility by adding other information.

If we use the source timestamp, topic name, and publisher GID, I think we can reduce the possibility even further.
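
As a sketch, the message ID would then be a composite key rather than a single integer. The field types below are assumptions (the GID storage size in particular depends on the rmw version).

#include <array>
#include <cstdint>
#include <string>
#include <tuple>

// Composite message ID: source timestamp + topic name + publisher GID.
struct MessageId
{
  int64_t source_timestamp;
  std::string topic_name;
  std::array<uint8_t, 24> publisher_gid;  // opaque bytes from rmw_gid_t; size varies by rmw version

  bool operator==(const MessageId & other) const
  {
    return std::tie(source_timestamp, topic_name, publisher_gid) ==
           std::tie(other.source_timestamp, other.topic_name, other.publisher_gid);
  }
};

Duplicates can then be detected on the subscription side when two received messages map to the same MessageId.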

  • What if the system time is adjusted? I am not sure whether only a monotonic clock is used to fill in this field.

Regarding the source timestamp, the results aren’t affected by which clock is used to obtain it, as long as the value is not duplicated.
However, a clock that is adjusted backwards does raise the possibility of duplicated source timestamps.

I also did some research on the clocks used by LTTng. (https://lists.lttng.org/pipermail/lttng-dev/2018-April/027993.html)
LTTng records the monotonic and realtime values at initialization, and uses the monotonic clock to timestamp the trace points.
Since trace points use the monotonic clock, the measurement results should not be affected.

Not only the types but also the rate/frequency would be different. Assuming a subscription takes 3 messages and the publisher publishes 1 message, can we still trace the message in this case?

I think complete tracking is difficult when there are multiple inputs and outputs or different frequencies, but there are several ways to approach it.

One is to track with the assumption that the most recent message received will be used.
The source code looks something like this.

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

class RelayNode : public rclcpp::Node {
  void timer_callback() {
    // Publish whatever message was received most recently.
    pub_->publish(msg_);
  }
  void subscription_callback(const std_msgs::msg::String & msg) {
    // Always overwrite the member variable with the latest message.
    msg_ = msg;
  }
  std_msgs::msg::String msg_;  // latest received message
  rclcpp::Publisher<std_msgs::msg::String>::SharedPtr pub_;
};

This is the case where the member variable is always overwritten with the latest message.
Messages that are not used due to overwriting are counted as lost.

The first method does not cover all implementations.
Specifically, there is the case where messages are put into a buffer and selected based on their timestamps.
For example: fusion using message_filters’ synchronizer, or the tf listener.
I’m thinking of tracking these by matching the timestamp values at the callback input/output, as sketched below.
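
A rough sketch of that matching, assuming the trace points record the header stamp of each message entering the node and of the message it publishes (class and function names are illustrative):

#include <cstdint>
#include <optional>
#include <unordered_map>

// Matches a published message back to a buffered input by header stamp.
class StampMatcher
{
public:
  // Record an input message (e.g. at callback_start), keyed by its header stamp.
  void add_input(int64_t header_stamp, int64_t input_source_timestamp)
  {
    inputs_[header_stamp] = input_source_timestamp;
  }

  // When the node publishes, look up which input carried the same header stamp.
  std::optional<int64_t> match_output(int64_t header_stamp) const
  {
    auto it = inputs_.find(header_stamp);
    if (it == inputs_.end()) {
      return std::nullopt;
    }
    return it->second;
  }

private:
  std::unordered_map<int64_t, int64_t> inputs_;
};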

By the way, is this on-target (localhost) traceability, or is it for distributed systems?

I’m thinking of limiting it to localhost for now.
LTTng uses the monotonic clock, as described above.
If multiple monotonic clocks run at different speeds, the analysis results will be skewed.
When I compared CLOCK_MONOTONIC and CLOCK_REALTIME on a Raspberry Pi, there was a discrepancy of 4 ms over 2 minutes.
For example, if the measurement covers only a short period of time, say 1 minute, it may be possible to use it without the analysis results deviating too much.
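
A minimal sketch of how such a comparison can be done (not the exact script I used):

#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>

// Converts a timespec to seconds as a double.
static double to_sec(const timespec & ts)
{
  return static_cast<double>(ts.tv_sec) + static_cast<double>(ts.tv_nsec) * 1e-9;
}

int main()
{
  timespec mono0, real0, mono1, real1;
  clock_gettime(CLOCK_MONOTONIC, &mono0);
  clock_gettime(CLOCK_REALTIME, &real0);

  std::this_thread::sleep_for(std::chrono::minutes(2));  // measurement interval

  clock_gettime(CLOCK_MONOTONIC, &mono1);
  clock_gettime(CLOCK_REALTIME, &real1);

  // If the two clocks ran at the same rate, the elapsed times would be equal.
  const double drift = (to_sec(real1) - to_sec(real0)) - (to_sec(mono1) - to_sec(mono0));
  std::printf("drift over 2 minutes: %.6f s\n", drift);
  return 0;
}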

Just an idea: if the two monotonic clocks are simply linearly misaligned, then including a step that adjusts the timestamps onto a single clock may make this applicable to distributed systems.

2 Likes

@christophebedard
Thanks for the comment!

please do! You can open an issue first to start a discussion, or we can just discuss in a PR.
I’ve started instrumenting Cyclone DDS directly:

I’m currently aiming to measure against Autoware, and I haven’t settled on the trace points I want to add yet.
I will make a proposal after some more verification.
Since there are multiple DDS implementations, I thought it would be difficult to add trace points to DDS itself.
Right now I’m adding trace points via LD_PRELOAD, but I am in favor of them being officially adopted :wink:

Could you provide/link to an example where this happens?

The message synchronizer is difficult in the following ways:

  • Messages are buffered, and messages are selected from the buffer based on their timestamp values.
  • The graph of multiple callbacks becomes complex.

The source code can be found here.

The general flow of the process is as follows (a simplified sketch of this pattern follows the list).

  1. Receive each LiDAR’s data.
  2. As soon as all three LiDAR messages are available, combine the most recent values and publish.
  3. If the third LiDAR message is not received (in case of failure), the timer callback combines the two available LiDAR messages and publishes.
    [Each message is published only once.]
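
A simplified sketch of this pattern (this is not Autoware’s actual implementation; topic names, message handling, and the timeout logic are illustrative only):

#include <array>
#include <chrono>
#include <string>
#include <vector>
#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/point_cloud2.hpp"

// Three subscriptions store the latest cloud in member variables; the output is
// published either when all three have arrived or from a timeout timer.
class ConcatNode : public rclcpp::Node
{
public:
  ConcatNode() : rclcpp::Node("concat_node")
  {
    using sensor_msgs::msg::PointCloud2;
    pub_ = create_publisher<PointCloud2>("points_concat", 10);
    for (size_t i = 0; i < 3; ++i) {
      subs_.push_back(create_subscription<PointCloud2>(
        "points_" + std::to_string(i), 10,
        [this, i](const PointCloud2::SharedPtr msg) {
          clouds_[i] = msg;  // message passing via member variable
          if (clouds_[0] && clouds_[1] && clouds_[2]) {
            publish_concat();  // all three available: publish immediately
          }
        }));
    }
    timer_ = create_wall_timer(std::chrono::milliseconds(100), [this]() {
      if (clouds_[0] && clouds_[1]) {
        publish_concat();  // timeout: publish even if the third cloud is missing
      }
    });
  }

private:
  void publish_concat()
  {
    sensor_msgs::msg::PointCloud2 out;  // (actual concatenation omitted in this sketch)
    pub_->publish(out);
    for (auto & cloud : clouds_) {
      cloud.reset();  // each message is used only once
    }
  }

  std::array<sensor_msgs::msg::PointCloud2::SharedPtr, 3> clouds_;
  std::vector<rclcpp::Subscription<sensor_msgs::msg::PointCloud2>::SharedPtr> subs_;
  rclcpp::Publisher<sensor_msgs::msg::PointCloud2>::SharedPtr pub_;
  rclcpp::TimerBase::SharedPtr timer_;
};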

Autoware doesn’t use message_filters, but implements something similar on its own in order to publish with two messages even if the third LiDAR message is not delivered.

In such nodes, message passing via member variables occurs, and the callback graph becomes more complex.

For example, if we write a callback graph, it would look like this.

[Figure: callback graph (implementation intent)]

[Figure: callback graph (each callback drawn separately)]

Since the callback graphs of such nodes are too complex, I think developers need to provide the necessary information to build them.

1 Like

I am involved in standardisation and safety assessment/testing.
So far your work is highly interesting to me as a possible way to verify the performance of safety-related ROS nodes that are part of safety control loops and systems.

Most interesting would be to couple these checks with the actual performance of a robot, for example one tested for performance or for safety.
Any ideas on this?

2 Likes

If we can produce a message flow, we can evaluate performance across multiple nodes.
For example, it would be nice to measure the time it takes for a message entering the sensor node to be reflected in the actuator node.
However, it is important to note that we can only track the message flow, not the actual movement of the actuator.

I don’t know much about the safety aspect, and this is just an idea, but by using hooks it may be possible to test deadline-miss behavior.
For example, hook callback_start and callback_end with LD_PRELOAD and intentionally insert a wait.
This has the same effect as increasing the callback duration, so we can make the process exceed its deadline without having to prepare sensor data that requires more processing time.
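
A minimal sketch of that idea, assuming callback_start is reachable through tracetools’ ros_trace_callback_start symbol (the hookable symbol may differ depending on the ROS 2 version and how tracing is built):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for RTLD_NEXT
#endif
#include <dlfcn.h>
#include <unistd.h>

// Injects an artificial delay whenever a callback starts, which has the same
// effect as increasing the callback duration, then forwards to the original
// function so the trace point is still emitted.
extern "C" void ros_trace_callback_start(const void * callback, bool is_intra_process)
{
  using orig_fn = void (*)(const void *, bool);
  static auto orig = reinterpret_cast<orig_fn>(dlsym(RTLD_NEXT, "ros_trace_callback_start"));

  usleep(5000);  // 5 ms of extra "processing time"

  if (orig) {
    orig(callback, is_intra_process);
  }
}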

2 Likes