[REP-2014] RFC - Benchmarking performance in ROS 2

Good to see this proposal. We have feedback on the draft, with the intent of making it more broadly applicable to benchmarking hardware acceleration.

Tracing into a separate REP

REP-2014 should remove tracing, as it is independent of benchmarking. Adding probes alters the entity under test and should not be part of benchmarking that entity. Tracing is very useful for triage, debugging, and optimization when developers want to improve benchmark results, but it is not required for an objective benchmark.

Hence our feedback focuses on benchmarking, not tracing.

Unbiased names

Terms like “black-box” and “grey-box” imply (implicitly or explicitly) that one color is better than another; this propagates a bias and can be construed negatively. We recommend unbiased naming such as “opaque” and “transparent” in place of colors.

Opaque performance tests

Requiring packages to be instrumented in their source prevents a common benchmark from being used: the implementer must recompile the source with added probes, which can affect the results. There is considerable prior art in industry benchmarking showing that no source access is required to assess the performance of an entity under test.

For example, one can evaluate the acceleration or fuel efficiency of two cars with external measurement, without looking under the hood.

Measurement can be performed at the level of a node, or a graph of nodes, by monitoring subscriptions to topics.

A benchmark should be performed as an opaque performance test.
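
As a minimal sketch of what such an opaque test can look like, the probe below subscribes to the output topic of the entity under test and reports the observed message rate. The topic name (`/output`) and message type (`sensor_msgs/Image`) are placeholders for whatever interface the graph actually exposes; nothing in the entity's source is instrumented or recompiled.

```python
# Minimal sketch of an opaque throughput probe, assuming the entity under test
# publishes sensor_msgs/Image on "/output" (topic name and message type are
# placeholders; substitute the graph's real interfaces).
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image


class OpaqueThroughputProbe(Node):
    """Measures the output rate of the entity under test from outside its process."""

    def __init__(self):
        super().__init__('opaque_throughput_probe')
        self._count = 0
        self._start = None
        # Subscribing to the output topic is the only coupling to the entity
        # under test; its source is never modified.
        self.create_subscription(Image, '/output', self._on_output, 10)
        # Report the observed rate once per second.
        self.create_timer(1.0, self._report)

    def _on_output(self, msg):
        if self._start is None:
            self._start = self.get_clock().now()
        self._count += 1

    def _report(self):
        if self._start is None:
            return
        elapsed = (self.get_clock().now() - self._start).nanoseconds / 1e9
        if elapsed > 0.0:
            self.get_logger().info(
                f'{self._count} msgs in {elapsed:.1f} s '
                f'({self._count / elapsed:.2f} msg/s)')


def main():
    rclpy.init()
    rclpy.spin(OpaqueThroughputProbe())


if __name__ == '__main__':
    main()
```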

Input data

Performance measurement requires the system to perform some function, which requires input data to operate on. For some functions the input data can be repeated; for others it needs to be sequential.

To benchmark a function, input data needs to be provided by a data loader, using real or synthetic data from a file, a rosbag, or directly from a sensor (live).

Without consistent input data, benchmark measurements performed independently cannot be compared; the measured performance is applicable only to the person who measured it, as it is not reproducible by others.
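
A minimal sketch of a synthetic data loader is shown below. The topic name (`/input`), message type, resolution, and rate are illustrative assumptions, not part of REP-2014; a rosbag- or sensor-backed loader would replace the synthetic payload with recorded or live data.

```python
# Minimal sketch of a synthetic input data loader, assuming the entity under
# test subscribes to sensor_msgs/Image on "/input"; topic, resolution, and
# rate are illustrative parameters only.
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image


class SyntheticDataLoader(Node):
    """Publishes a repeatable synthetic image stream at a fixed rate."""

    def __init__(self):
        super().__init__('synthetic_data_loader')
        self.declare_parameter('publish_rate_hz', 30.0)
        self.declare_parameter('width', 640)
        self.declare_parameter('height', 480)
        rate = self.get_parameter('publish_rate_hz').value
        self._pub = self.create_publisher(Image, '/input', 10)
        self._msg = self._make_image()
        self.create_timer(1.0 / rate, self._publish)

    def _make_image(self):
        w = self.get_parameter('width').value
        h = self.get_parameter('height').value
        msg = Image()
        msg.width, msg.height = w, h
        msg.encoding = 'mono8'
        msg.step = w
        msg.data = [0] * (w * h)  # deterministic (all-zero) payload
        return msg

    def _publish(self):
        # Stamp each message so an opaque probe can compute latency downstream.
        self._msg.header.stamp = self.get_clock().now().to_msg()
        self._pub.publish(self._msg)


def main():
    rclpy.init()
    rclpy.spin(SyntheticDataLoader())


if __name__ == '__main__':
    main()
```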

Output data check

Performance measurement requires confirmation that the work was completed while measuring the time spent on it. Without an assessment of the work completed, optimization can inadvertently, or deliberately, produce “improvements” that introduce functional errors.

For example, we measured an impressive 3x improvement in AprilTag CPU performance, but had to disable our quality check on completed work to record this result; the “improvement” reduced the number of detections, which failed the quality check.

Benchmark tooling needs to perform a minimal check of work results.
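
As a rough sketch of such a check, the node below subscribes to an assumed `vision_msgs/Detection2DArray` output on `/detections` (placeholders chosen to mirror the AprilTag example above) and flags any frame whose detection count falls below a minimum.

```python
# Minimal sketch of an output work checker; the message type, topic name, and
# minimum-detections criterion are assumptions for illustration.
import rclpy
from rclpy.node import Node
from vision_msgs.msg import Detection2DArray


class OutputWorkChecker(Node):
    """Flags benchmark runs whose measured work falls below a quality bar."""

    def __init__(self):
        super().__init__('output_work_checker')
        self.declare_parameter('min_detections', 1)
        self._min = self.get_parameter('min_detections').value
        self._failures = 0
        self.create_subscription(
            Detection2DArray, '/detections', self._on_detections, 10)

    def _on_detections(self, msg):
        # A throughput number is only valid if the work was actually done.
        if len(msg.detections) < self._min:
            self._failures += 1
            self.get_logger().warn(
                f'quality check failed: {len(msg.detections)} < {self._min} '
                f'({self._failures} failures so far)')


def main():
    rclpy.init()
    rclpy.spin(OutputWorkChecker())


if __name__ == '__main__':
    main()
```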

Benchmark parameters

Benchmarks need customizable parameters for the entity (or entities) under test. Parameters are used for data set size/length, input test data, and publishing rate; when performing throughput testing, we need to identify the peak throughput rate within a specified tolerance of dropped work from the entity (e.g. DDS or node drops).
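
As an illustration of the throughput case, the sketch below binary-searches for the peak input rate whose drop ratio stays within a tolerance. The `measure(rate_hz)` callable is an assumed harness hook (one benchmark run at the given input rate), not an existing API.

```python
# Minimal sketch of a peak-throughput search, assuming the harness exposes a
# measure(rate_hz) callable returning (messages_published, messages_received)
# for one run at that input rate; the callable and tolerance are illustrative.
def find_peak_throughput(measure, drop_tolerance=0.01,
                         low_hz=1.0, high_hz=1000.0, resolution_hz=1.0):
    """Binary-search the highest input rate whose drop ratio stays within tolerance."""
    peak = low_hz
    while high_hz - low_hz > resolution_hz:
        rate = (low_hz + high_hz) / 2.0
        published, received = measure(rate)
        drop_ratio = 1.0 - (received / published) if published else 1.0
        if drop_ratio <= drop_tolerance:
            peak, low_hz = rate, rate   # within tolerance: try a higher rate
        else:
            high_hz = rate              # too many drops: back off
    return peak


if __name__ == '__main__':
    # Stand-in measurement: pretend the entity under test saturates at 240 Hz.
    def fake_measure(rate_hz):
        sent = int(rate_hz * 10)          # 10 s run
        recv = min(sent, 240 * 10)        # drops above 240 Hz
        return sent, recv

    print(f'peak throughput ~ {find_peak_throughput(fake_measure):.1f} Hz')
```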

Remove interpretation of results

Performance measurement needs to be performed by highly trusted, scientific instruments that provide objective measurements. The analysis of results should be left to those making decisions based on the measurements.

The REP proposal makes statements in “Performance metrics in robotics” about how to interpret results. While these statements may be well-intentioned examples, they should not be part of the REP, as they carry inherent bias in how results are interpreted. Guidance on how to analyze measurements can be provided separately.

The introduction to the REP can justify the value of objective measurements for those performing analysis with their own criteria, without interpreting the results or drawing conclusions from them.

All references to and examples of results interpretation should be removed, so that the REP defines a trusted, objective measurement and avoids a biased system.

In summary, there are several issues to consider and address in REP-2014:

  1. Remove tracing from this REP as it’s independent of benchmarking
  2. Use unbiased names
  3. Opaque performance tests
  4. Input data loader and data
  5. Output data monitor and checker
  6. Benchmark parameters
  7. Remove interpretation of results

We run ~200 benchmarks nightly in cloud-native Kubernetes systems, across 4 different compute platforms including 2 different instruction set architectures, to provide objective performance measurements for nodes and graphs of nodes in ROS 2. These measurements are used to analyze our work on hardware acceleration. For this REP to replace what we use in practice to deliver hardware acceleration in ROS 2, we would like all of us to work from a level playing field and address the issues above.

Thank you
