[Nav2][Discussion] Metrics / framework for quantitative evaluation of navigation performance


I have spent some time working with the nav2 stack recently and have found that fine-tuning it for each use case can be a complex task, given the number of plugins you can use and the number of parameters each of them has.

I have struggled to find standard metrics that make it possible to quantitatively assess the impact of changing a plugin / parameter on the overall robot navigation performance (if I’m missing some documentation I would really appreciate it if you could point me in the right direction :slight_smile: ). If this does not exist, I think building a framework for this kind of test may be a valuable addition to the stack, since it would allow users to better tune it for their setup and could potentially be used alongside gazebo simulations to grasp the correlation between the parameters and how the navigation performs.

Are there any metrics people out there are using for evaluating navigation on their robots?


Interesting point. Something important to remember is that many systems can be independently tested and benchmarked. For instance, for localization and SLAM there are plenty of frameworks described in the literature for that kind of thing.

The two parameter-heavy things requiring fine-tuning are localization and trajectory planning. In many ways, trajectory planning has some degree of subjectivity (what’s better: exact path tracking? Maximizing distance from obstacles? Reactivity? Predictable behavior to bystanders? Smoothness of execution? etc.). Many of these can be quantitatively analyzed, but there’s no objectively “best” solution: each application will have different needs and requirements (which is why it’s my goal to offer a number of trajectory planners to cover the cross-section of behaviors).

With that said, we actually are working on some benchmarking utils for comparing algorithms, mostly for sniff-testing & high-level academic benchmarking: navigation2/tools at main · ros-planning/navigation2 · GitHub. My intention was not for this to be for fine performance tuning, but I suppose you could use it in that way, and I’d be happy to have contributions to extend the collection of data to help in making more fine-tuned decisions. While right now there is not a trajectory planning benchmark (only for planning and smoothing), there’s an open ticket for it, Add Controller Benchmark to tools/benchmarks then break out the benchmarks into a new package · Issue #3239 · ros-planning/navigation2 · GitHub, with some movement (I’ve been gone for a couple of weeks, so haven’t caught up on that discussion yet).

In my opinion, these are good for first-level tuning for general performance, but at the end of the day, fine-tuning of things like trajectory planning should be done on hardware and localization tuning done on datasets, since simulated data for both of these things is never going to be fully sufficient to model vehicle dynamics or the data quality of different surfaces. Additionally, these random goal / task benchmarking scripts will be good general metrics, but there are always going to be better benchmarks for a specific application (e.g. using representative maps of your environments with representative goal poses / tasks / obstacles / agents).

Personally, trajectory planning is the type of thing I usually tune generally in simulation and then move onto hardware for a few days to home in. Planning/localization tuning I usually do offline and metrics-oriented, since those are a lot less behaviorally sensitive. Perception luckily is mostly based on analytical properties, requirements, and general common-sense choices, so that’s never been something I’ve had to spend too much time tuning (just much more time developing).

It’s targeting more high-level benchmarking, but if you think they can be adapted for more fine-tuning, I’d be happy to chat about it!

One of the big issues with quantitative evaluation in this community is that each group/maintainer seems to feel the need to create its own tools. This is summarized generically here, but if you speak with enough maintainers you’ll notice how each has its own scripts. Most of these tools and approaches are great for development, but aren’t really usable in production systems and/or to assess complete graphs in real applications.

We at Acceleration Robotics have been doing lots of performance benchmarking over the last couple of years, across multiple ROS stacks and across hardware. This is especially relevant when involving accelerators (GPUs, FPGAs), as we’ve observed how performance is often reported wrongly, or in a misleading way, by some silicon vendors. Our need for quantitative evaluation of robotics computational graphs is only growing over time, so we’re trying to push forward community initiatives to standardize performance benchmarking via:

  1. REP-2014, Benchmarking performance in ROS 2
  2. RobotPerf, a consortium of robotics leaders from academia, research labs, and industry whose mission is to build fair and useful robotics benchmarks that provide unbiased evaluations of robotics computing performance for hardware, software, and services—all conducted under prescribed conditions.

We’re discussing both of these topics on 2022-11-03 at 17:00 UTC, in this meeting.

We are working on this to improve navigation2 default CPU performance and leverage hardware acceleration. We are doing so in a non-functional and grey-boxed manner (meaning of this), particularly using a benchmarking approach that leverages LTTng and ros2_tracing, which together provide a ROS-enabled low-overhead framework for real-time tracing of ROS 2 graphs. Also, we’re pushing our approach into a community standard via REP-2014 (see here for a readable version).

I made an attempt a while ago to start contributing this approach upstream to navigation2 but the effort never got too far. Happy to chat about this again if there’s interest.

LTTng and ros2_tracing are certainly not the simplest tools, but in my experience they provide plenty of versatility, are easy to integrate into CI/CD infrastructures, and allow tracing of distributed ROS 2 systems.


@vmayoral, I don’t think that’s the kind of thing the user is referring to. It’s not about run-time performance, it’s about navigation system and algorithm performance for their robot / application needs. Those aren’t really related topics :slight_smile:

Yep, you’re right @smac, @Pepis is probably more interested in functional benchmarking data, but that can also be enabled with our approach described above, and that’s the point that I wanted to make:

             Probe      Probe
             +            +
             |            |
    +--------|------------|-------+     +-----------------------------+
    |        |            |       |     |                             |
    |     +--|------------|-+     |     |                             |
    |     |  v            v |     |     |        - latency   <--------------+ Probe
    |     |                 |     |     |        - throughput<--------------+ Probe
    |     |     Function    |     |     |        - memory    <--------------+ Probe
    |     |                 |     |     |        - power     <--------------+ Probe
    |     +-----------------+     |     |                             |
    |      System under test      |     |       System under test     |
    +-----------------------------+     +-----------------------------+

              Functional                            Non-functional

    +-------------+                     +----------------------------+
    | Test App.   |                     |  +-----------------------+ |
    |  + +  +  +  |                     |  |    Application        | |
    +--|-|--|--|--+---------------+     |  |                   <------------+ Probe
    |  | |  |  |                  |     |  +-----------------------+ |
    |  v v  v  v                  |     |                            |
    |     Probes                  |     |                      <------------+ Probe
    |                             |     |                            |
    |       System under test     |     |   System under test        |
    |                             |     |                      <------------+ Probe
    |                             |     |                            |
    |                             |     |                            |
    +-----------------------------+     +----------------------------+

             Black-Box                            Grey-box

With a bit of effort I believe we can actually make these topics related and consistent (and hopefully re-usable across ROS stacks). @christophebedard did good work creating a data model that could be used to determine functional aspects a posteriori (after the computational graph has run, using the trace data). By the way, I think this is a good complement to the tools you linked above, and I believe it’d be interesting to compare the same benchmarks. Happy to partner up on this!


I would be very interested in how functional benchmarking can be performed through the framework mentioned in REP-2014. My work revolves around navigation parameter tuning in the context of AMRs, so this would be tremendously useful. I hope this is discussed further in the hardware acceleration meeting tomorrow.

I’m not sure I see how any of this tells me: is my controller tuned well and following the path with the behavior I find optimal, or do my localization settings / model meet my requirements for accuracy and not jump from similar areas into others? But maybe I’m not thinking creatively enough. These aren’t functions of latency, CPU time, or similar system metrics.

Hi @Pepis, @gsvikhe,

A brief overview of my background so that you can relate to my thoughts:

Our company develops tools for automated (model-based) evaluation for series development of AMRs. The evaluation of trajectory planning is a common issue we encounter. In addition to @smac’s post, which is in alignment with my experience, I’d like to add the following:

  1. Be aware of your development boundaries: A simple distinction is series development (“I’m happy if the failure rate is below 1/1000”) vs. PoC/Demo/Research (“I’m happy if it works once”). This pretty much determines how sophisticated the evaluation and the process to achieve it have to be.

  2. Be aware of the business case of the AMR: Examples are mission time in intralogistics and area covered in outdoor applications. In the field of safety engineering, time-to-collision is a common metric. More examples can be found e.g. in ANSI/UL 4600, chapter 16. Each metric has advantages and disadvantages. For series development, robustness is often rated more relevant than efficiency, as the patience of end customers is very limited. In addition, multiple metrics including boundary conditions are used, e.g. power consumption, as it is often directly related to the total cost of ownership (customer procurement decision KPI).

  3. Be aware of technical boundaries: You can use sophisticated automation methods in evaluation most efficiently if your metrics are continuous and deterministic. Discrete or stochastic metrics limit the potential of most algorithmic fine tuning approaches.

These are only examples to provide a basic understanding and are not meant to be collectively exhaustive.

I believe functional performance benchmarks like "is my controller tuned well and following the path with the behavior I find optimal?" can be inferred from the computational graph data (and the world model/abstractions, if running in simulation), which is what I was hinting at above.

Building upon @christophebedard’s work and data model, with some additional tracepoints, one could probably collect information about these and then store it in a trace file. From my understanding of your scripts, this is not very different from what’s being done (e.g. metrics.py). It’s just that you’d instead be using LTTng under the hood, a low-overhead framework for real-time systems tracing.

Then, afterwards, with data processing scripts similar to the ones you’ve put together (e.g. this one), you could analyze everything. There’re a few examples of such scripts in acceleration_examples using LTTng (e.g. this one). The approach is somewhat similar.

We monitor the performance of the trajectory planning in real time by measuring cross-track error of the current position relative to the planned position.
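
A minimal sketch of such a check, assuming cross-track error is defined as the shortest distance from the current position to the planned path treated as a 2D polyline (the exact definition used in the post above is not given, and the function name here is hypothetical):

```python
import math


def cross_track_error(position, path):
    """Shortest distance from position (x, y) to a polyline path [(x, y), ...].

    Assumption: the planned path is a 2D polyline with at least two points.
    """
    if len(path) < 2:
        raise ValueError("path needs at least two points")
    px, py = position
    best = math.inf
    for (ax, ay), (bx, by) in zip(path, path[1:]):
        dx, dy = bx - ax, by - ay
        seg_len_sq = dx * dx + dy * dy
        if seg_len_sq == 0.0:
            t = 0.0  # degenerate segment: both endpoints coincide
        else:
            # Project the position onto the segment and clamp to its ends.
            t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
            t = max(0.0, min(1.0, t))
        cx, cy = ax + t * dx, ay + t * dy
        best = min(best, math.hypot(px - cx, py - cy))
    return best
```

For example, `cross_track_error((5.0, 2.0), [(0.0, 0.0), (10.0, 0.0)])` returns `2.0`. Logging this value at each control cycle gives a time series that can be summarized (mean, maximum) per run.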

That’s the “gotcha” on that one. Trajectory planning cannot be fine tuned in simulation, in my experience. You need to tune based on the actual response and dynamics of your vehicle. There are many elements that can be accurately tuned in simulation, but localization and trajectory planning need real data. Those also happen to be the 2 most parameter intensive subsystems.


Interesting. I trust your experience here, but can’t you still use the same approach (i.e. ros2_tracing and LTTng) on the real vehicle in those cases where you need to? At the end of the day, that’s exactly what your scripts are doing. The only difference is that instead of relying on various custom-made tools and pickle, you’d be relying on a low-overhead framework for real-time tracing (hopefully soon to be standardized through REP-2014), purposely made for that. Actually, LTTng has lots of tooling for collecting data over distributed systems.

My point again, I’m pretty certain that ros2_tracing and LTTng can bring additional value to a framework for quantitative evaluation of navigation performance when compared to existing approaches. This includes both functional and non-functional benchmarking tests.

I agree that defining what good behavior is depends a lot on each user’s application. However, I think having metrics that can abstract characteristics of the system-wide navigation behavior can help users grasp, in a more deterministic way, how changes they make influence the way in which their robot moves (i.e. “by changing the costmap size by X my robot was able to move Y% faster on average”), rather than relying just on their qualitative perception. After having that, users can choose which metric to prioritize based on their requirements (speed, reaction time, smoothness, predictability, etc.).
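
As an illustration of the “Y% faster on average” kind of statement, here is a small sketch that compares the mean of a metric between a baseline configuration and a modified one. The function name and the sample values are hypothetical, not taken from any existing Nav2 tool:

```python
import statistics


def percent_change(baseline, modified):
    """Relative change (in %) of the mean of `modified` versus `baseline`.

    Both arguments are lists of per-run values of the same metric,
    e.g. average speed in m/s over repeated runs of the same task.
    """
    base_mean = statistics.mean(baseline)
    if base_mean == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    return 100.0 * (statistics.mean(modified) - base_mean) / base_mean


# Hypothetical average speeds (m/s) before and after a parameter change.
speeds_before = [0.4, 0.5, 0.6]
speeds_after = [0.5, 0.55, 0.6]
change = percent_change(speeds_before, speeds_after)
```

With only a handful of runs the difference in means can easily be noise, so it is probably worth reporting the spread across runs alongside the percentage.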

Have you heard about people doing this kind of evaluation? What are the most common characteristics of navigation you think users are tuning for? Do you think that a set of application-independent metrics from which users can choose even exists?

The tooling @vmayoral proposes seems interesting, I will definitely take a closer look, thanks for bringing that up. Until now I have been mostly relying on rosbags + python scripts for that.

I’m still not really buying that argument. If you’re proposing essentially that we just use these frameworks to store the data, it’s not like that solves the problem of needing to crunch the numbers offline in a separate script to come up with the final metrics. So really, for our situation, this is (from what I gather) just proposing a file format.

Something worth noting here is that what we’re recording isn’t odd data from internal processes in the server. We’re capturing the standard returns from the servers that are the result of the action requests (e.g. path planner: make me a path, OK, here’s a path back). So we’re already getting access to the data from the automation script that generates the requests. It seems senseless to remove the 1:1 correlation of data recording to returns. In many ways, that’s actually weaker, because we cannot control which sets of data are being stored, since a given request knows nothing of previous requests. An example of where this is important is the planning benchmark: we only store data for which all the planners succeed, so we’re comparing apples to apples. If we stored the data internal to the server, then we wouldn’t know if a particular data point was worth keeping or not based on the results of later or previous algorithm calls.

This doesn’t seem like the right tool for the job. If we were interested in capturing server internal information, I think I could buy into this more. But for the most part, our interest is in outputs that are already correlated to requests via the action API. To gather data, you still need an automation script to design the experiment. So if we have something making requests which are receiving responses, it makes sense to keep things aligned in one place.
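
To illustrate the “apples to apples” point about the planning benchmark, here is a rough sketch of keeping only the trials in which every planner returned a path before computing a per-planner metric. This is not the code in navigation2/tools; the data layout, function names and planner names are assumptions for the example:

```python
import math


def path_length(path):
    """Sum of straight-line distances between consecutive (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def comparable_trials(trials):
    """Keep only trials where every planner produced a path.

    Each trial is a dict mapping planner name -> path, or None on failure.
    """
    return [t for t in trials if t and all(p is not None for p in t.values())]


def mean_path_length(trials):
    """Mean path length per planner over the comparable trials only."""
    kept = comparable_trials(trials)
    if not kept:
        return {}
    return {
        name: sum(path_length(t[name]) for t in kept) / len(kept)
        for name in kept[0]
    }


# Hypothetical results: the second trial is dropped because planner "B" failed.
trials = [
    {"A": [(0, 0), (3, 4)], "B": [(0, 0), (3, 0), (3, 4)]},
    {"A": [(0, 0), (1, 0)], "B": None},
]
summary = mean_path_length(trials)  # {"A": 5.0, "B": 7.0}
```

The filtering step needs the results of all planners for the same request side by side, which is the reason given above for keeping recording and request generation in the same script.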

@smac, I think I should probably try and join one of the upcoming navigation WG meetings and discuss this while at it. It’ll be more effective.

In the meantime, I’d encourage you and anyone else interested to take a further look at LTTng and ros2_tracing and consider reading more about them. Both the framework and my proposal above aim at much more than a file format (by the way, the file format used is CTF, which is pretty common :rofl:).



This is a great practice, and one we use to evaluate the functional performance of the paths produced in planning.

The run-time performance of the function is separate from its navigational performance. Tooling that measures run-time performance does not assess whether the path from planning is better or worse; that is a separate performance metric. The input and the correlated output are needed for objective measurement of results.

Hi @ggrigor, would you be able to share the specific metrics you use for the functional evaluation?


I would like to add another tool to the discussion: Google Perfetto.

Perfetto is the profiling/tracing tool used by both the Chromium and the Android project. It is available as standard on every reasonably current Android device. It is capable of doing system-level tracing (including kernel metrics, and GPU metrics), and also custom application-level metrics using its C++ Perfetto SDK (a new C ABI-based SDK for non-C++ language integration is in the works and should be released “soon”.)

It has low overhead, and the trace collection points can be enabled/disabled selectively using filters at runtime or completely disabled at compile time. Runtime-disabled trace points have only 1-2 ns overhead, so it may be feasible to ship production builds with tracepoints. That way, when necessary, possible performance problems can be profiled and debugged in production without the need to distribute custom “profiler enabled” builds.

We used it successfully in a ROS1 / ROS2-based automotive AR HUD system. We mainly used it in the rendering component (which was a ROS node itself), but other nodes could be augmented as well.

Its tracing format is based on Protobuf (as it is standard with Google tools), and the standard data collection implementation is using its own shared memory-based ring buffers from the observed processes to the collection daemon.

Data collection is only one side of the story. Perfetto has a beautiful web UI, where you can filter and drill down into traces graphically, and use SQL to run custom queries on the trace data. The core of the web UI, called the Trace Processor, is written in C++ and distributed as WASM with the UI. This means that your traces never leave your computer.

If you are working with large traces (larger than ~1GB), the Trace Processor is also available as a standalone native application so you can run it on your local machine to do the heavy lifting, while the Web UI connects to it using a Websocket API. (The Web UI actually detects if a Trace Processor is running on your machine, and offers to use it instead of the built-in WASM version.)

Kind regards,

Functional metrics are measured across dynamics, plan, and consistency on AMRs. We measure in SIM and REAL. @smac may disagree, but in our experience, with accurate drivetrain models in simulation and great physics, we see good SIM-to-real matching for most results. This saves significant time, as we can run a lot of testing in SIM and cross-check against real. SIM has the added benefit of measuring with perception (sensors) in the loop, with ground truth to compare differences against. We run SIM both as SIL (software in the loop) and HIL (hardware in the loop), where the task runs on the same compute used in the robot. It’s worth noting that recreating dynamic tests including moving obstacles is more difficult in real testing, so SIM affords a lot more coverage and variations than is possible with real testing.

Dynamics measure acceleration, velocity, and jerk across maneuvers laterally and longitudinally, including keeping within the parameterized limits depending on the maneuver.
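
A rough sketch of how such a dynamics check could look for one axis, assuming timestamped speed samples (e.g. longitudinal speed from odometry) and using simple finite differences. The function names, limit parameters and result keys are hypothetical; real data would likely need filtering before differentiating:

```python
def derivative(times, values):
    """Forward finite differences between consecutive samples."""
    return [
        (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        for i in range(len(values) - 1)
    ]


def dynamics_summary(times, speeds, max_accel, max_jerk):
    """Peak |acceleration| and |jerk| for one maneuver, checked against limits."""
    accel = derivative(times, speeds)
    # Each acceleration sample belongs to the midpoint of its time interval.
    mid_times = [(times[i] + times[i + 1]) / 2.0 for i in range(len(times) - 1)]
    jerk = derivative(mid_times, accel)
    peak_accel = max((abs(a) for a in accel), default=0.0)
    peak_jerk = max((abs(j) for j in jerk), default=0.0)
    return {
        "peak_speed": max(speeds, default=0.0),
        "peak_accel": peak_accel,
        "peak_jerk": peak_jerk,
        "within_limits": peak_accel <= max_accel and peak_jerk <= max_jerk,
    }
```

The same function can be applied to lateral velocity samples, with the limits chosen per maneuver.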

Plan covers completing a task, accuracy of the final pose, time to complete the task with distance travelled, and proximity to obstacles. Tests are performed on a collection of scenarios informed by design choices, known issues, and discovered issues.

Consistency measures repeatability of results.
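
One possible way to quantify repeatability, sketched under the assumption that each scenario is run several times and each run records whether the task completed, how long it took, and the final pose error. The field names and function name are made up for the example:

```python
import statistics


def consistency_summary(runs):
    """Success rate plus mean and standard deviation of per-run metrics.

    Each run is a dict with 'success' (bool), 'time_s' and 'final_error_m'.
    Only successful runs contribute to the time and error statistics.
    """
    if not runs:
        raise ValueError("no runs to summarize")
    done = [r for r in runs if r["success"]]
    summary = {"success_rate": len(done) / len(runs)}
    for key in ("time_s", "final_error_m"):
        values = [r[key] for r in done]
        if not values:
            summary[key] = None  # nothing to report if every run failed
            continue
        summary[key] = {
            "mean": statistics.mean(values),
            # Sample standard deviation needs at least two values.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary
```

A low standard deviation across repeated runs of the same scenario indicates repeatable behavior; comparing it between SIM and REAL runs is one way to cross-check the two.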

The recently released Mission Dispatch (GitHub - nvidia-isaac/isaac_mission_dispatch: VDA5050-compatible cloud service for fleet mission dispatch) is a key part of our measurement; it allows for repeatable testing at scale by sending tasks to robots in real environments and digital twins (simulation) to test and measure functional results.

We separately measure performance metrics including run-time, latency, and resource utilization.

