
Evaluation of robotics data recording file formats

Evaluation of robotics data recording file formats.pdf (73.0 KB)

This document is motivated by developing tooling that interoperates with ROS1, ROS2, and other robotic frameworks and speaking with customers about the recording file formats that exist today. I think there is space for a next-generation recording format similar to the ROS1 .bag (v2.0) format, with the flexibility to support ROS2’s pluggable middleware and other robotics frameworks. Before jumping into a discussion on how to lay out bytes on disk though, this evaluation attempts to distill requirements and review what exists today.


If you do write data to disk, PLEASE use a format for which readers are already common. It would be good to be able to directly read using R or Python without first needing to import anything connected with ROS.

The goal should be to leverage the large ecosystem of data analysis tools that are already in common use in other industries (such as data science).

I/O performance should be a concern too.


Thank you for the feedback. Integrating with the existing ecosystem of data science and analysis tooling is a requirement that should be spelled out. I can add a section to the document covering this.

Unfortunately, there are no Python libraries you can import today that will parse CDR payloads given a ROS2 msgdef or DDS IDL. The next best thing we can do is create a standalone library that is published to package repositories and installable in as many environments as possible, and allow this library to understand the contents of a recording without referencing a ROS installation or any external data dependencies. This is addressed in the requirements section.
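To make the "no importable parser" point concrete, here is a minimal sketch of hand-decoding a CDR payload for one fixed message type (a `std_msgs/String`-style string) using only the Python standard library. The encapsulation-header handling follows the common DDS convention (0x0001 = CDR little-endian); the rest is an illustration for a single known layout, not a general msgdef-driven CDR parser.

```python
import struct

def parse_cdr_string(buf: bytes) -> str:
    """Decode a CDR-serialized string message (e.g. std_msgs/String).

    Sketch only: assumes the 4-byte encapsulation header used by
    DDS-based RMWs, where 0x0001 means CDR little-endian.
    """
    rep_id, = struct.unpack_from('>H', buf, 0)   # representation identifier
    endian = '<' if rep_id == 0x0001 else '>'
    # Payload starts after the 4-byte header; a CDR string is a uint32
    # length (including the terminating NUL) followed by the bytes.
    length, = struct.unpack_from(endian + 'I', buf, 4)
    raw = buf[8:8 + length]
    return raw.rstrip(b'\x00').decode('utf-8')

# Hand-built example payload: CDR_LE header + length + "hi\0"
payload = b'\x00\x01\x00\x00' + struct.pack('<I', 3) + b'hi\x00'
```

A real library would of course drive this from the message definition rather than hard-coding one type; the point is only that nothing beyond the standard library is strictly required.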


We (at Continental Automotive R&D) have had good experiences combining HDF5 (as a container format) with Google Protobuf as the message serialization format. The HDF5 container provides the generic structure of the measurement, with a global measurement header (containing topic names, message descriptors, and additional tags), and HDF5 takes care of features like file splitting, compression, interfaces to different languages, and so on.
Every recorded sample then just contains timestamps (sender and receiver, for latency analysis), a monotonic counter (to detect drops), and the serialized, dynamically sized payload byte array.
Protobuf is of course only ONE option and can be replaced by another serialization format if needed. In that case the outer HDF5 API still works; only the second-level processing that does message decoding needs to be adapted.
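A minimal sketch of the per-sample record described above (sender/receiver timestamps, monotonic counter, opaque payload), framed here with Python's `struct` rather than the actual HDF5 API; the field names and exact layout are my own illustration, not eCAL's on-disk format.

```python
import struct

# Hypothetical layout for one recorded sample, mirroring the scheme
# described above: sender/receiver timestamps for latency analysis,
# a monotonic counter for drop detection, and the serialized payload
# as an opaque, variable-length byte array.
SAMPLE_HEADER = struct.Struct('<QQQI')  # send_ts, recv_ts, counter, payload_len

def pack_sample(send_ts_ns: int, recv_ts_ns: int, counter: int,
                payload: bytes) -> bytes:
    return SAMPLE_HEADER.pack(send_ts_ns, recv_ts_ns, counter,
                              len(payload)) + payload

def unpack_sample(buf: bytes):
    send_ts, recv_ts, counter, n = SAMPLE_HEADER.unpack_from(buf, 0)
    off = SAMPLE_HEADER.size
    return send_ts, recv_ts, counter, buf[off:off + n]
```

Because the payload is opaque bytes, the outer framing is unchanged no matter which serialization (Protobuf or otherwise) produced it, which is exactly the property described above.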


Why is it desirable to have the recording format be serialization agnostic? I can see the benefit, but it seems that an additional storage plugin to rosbag2 may have a dedicated serialization format, and that seems fine. I really like the idea of adding protobuf as a storage plugin for rosbag2…

@aposhian the distinction is between the container format and the message serialization/encoding format. An analogy to video files would be mp4 (container) vs h264 (encoding). The container format needs to be able to support multiple message serialization formats, because:

  • Robots already use different message serialization formats (e.g. ROS 1 msg, Protobuf, and ROS 2 supports pluggable RMW with different serializations)
  • For performance and stability reasons, robots should not be required to re-serialize their messages into a different format during recording

Thank you, that is helpful. You write that:

There are many different concepts of time in a recording: the time a message was originally published, the time when it was sent by the publisher (this can be very different for messages that are held in a queue and replayed), when the data recording process received it, or when the recorder wrote it to disk. Using a consistent definition of time is the most critical requirement, while storing multiple timestamps can enable additional use cases.

To me it seems that the time that the publisher sent the message is not very relevant to creating a deterministic playback of the system unless you store something about the network topology. The current implementation of rosbag or rosbag2 is a single observer running at a particular place in the network, so it makes sense to me that all of its timestamps should be relative to the receipt of those messages, rather than when the publisher published them. However, as you mentioned, queuing or failure on the recorder’s part to receive messages fast enough could distort things.

Regarding the timestamps mentioned before, it's very helpful to record both (publication time and the time when the message finally reached the recorder).

You can decide later which one to use or how to correct latencies but at least you have the chance to detect latencies in your distributed system.
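The latency check described here is just the difference of the two recorded timestamps; a tiny sketch, assuming the publisher and recorder clocks are synchronized well enough (e.g. via NTP/PTP) for the analysis:

```python
def latencies_ns(samples):
    """Per-message latency: receive time minus publish time.

    `samples` is a list of (publish_ts_ns, receive_ts_ns) pairs, as
    recorded when both timestamps are stored per the discussion above.
    """
    return [recv - sent for sent, recv in samples]

# Example: the third message sat in a queue noticeably longer.
obs = [(100, 150), (200, 240), (300, 500)]
```

Having both timestamps in the file means this decision (which clock to trust, how to correct for latency) can be deferred to post-processing instead of being baked into the recording.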

An additional option is to not run one single recording instance anywhere in your network but to run one recording node on every host that is part of your infrastructure. All these instances are orchestrated (started, stopped, configured) by a central recording master application. This is how the decentralized eCAL recording is designed.

The advantage is that your recording nodes see nearly the same timing as the user nodes running on the same machine, and the recording adds no extra load on the network to collect data from other hosts. For later deterministic replay you can reconstruct the messages for every node more closely to the original timing seen at recording time on the hosting machine.


This may be off-topic, but what about also thinking where non-file-based recording mechanisms fit into all of this (i.e. feeding into a time-series database). There is merit to having files that are easy to pass around and load into different programs, but all file-based solutions will eventually hit performance bottlenecks. The alternative is using a time-series database, which could carry higher overhead for where the recorder is running, but also potentially provide higher write throughput.

I think there are two separate approaches to this:

How can I get real time data from my robot into a time series database?
Usually you need to first solve the problem of how your robot is connected to a server to ingest this real time data. There are several great fleet management tools available that can do this (Formant, Freedom Robotics, InOrbit, etc). It would also be nice to see some lightweight open source tooling to support more simple use cases (Transitive Robotics is working on something in this space).

How can I record data on robot and then later make it accessible in a time series database?

Typically on the robot you want to get messages written to disk with as little CPU or I/O overhead as possible, which is where a file-based recording format excels. However, I believe a good robotics data file format should come with an ecosystem of tools and libraries to make it easy to convert the data into different formats, which could make it easy to ingest and analyze in a time series database after the fact.
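The "convert after the fact" path could be as simple as reading (topic, timestamp, payload) records out of the file and bulk-inserting them. A sketch using `sqlite3` purely as a stand-in for a real time-series store (InfluxDB, TimescaleDB, etc.); the schema is hypothetical:

```python
import sqlite3

def ingest(records, db_path=':memory:'):
    """Bulk-load decoded recording samples into a database.

    `records` is an iterable of (topic, ts_ns, payload) tuples, i.e.
    what a file-format reader library would hand back. sqlite3 is a
    placeholder here for an actual time-series database.
    """
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS messages '
                 '(topic TEXT, ts_ns INTEGER, payload BLOB)')
    conn.executemany('INSERT INTO messages VALUES (?, ?, ?)', records)
    conn.commit()
    return conn

conn = ingest([('/imu', 1, b'\x00'), ('/imu', 2, b'\x01')])
```

This keeps the on-robot write path cheap (append-only file) while the heavier indexing work happens off-robot at ingestion time.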


As promised in the writeup, here is a draft proposal from the Foxglove team for a recording format. The current working name is “MCAP”, short for Message Capture. The draft document is open for public viewing and comments, so please take a look and comment either here or in the doc if you have any feedback.

A quick side note, I’ve been integrating drone recording formats into Foxglove Studio (starting with PX4 ULog) and there may be interesting concepts from those recording formats that could be captured in MCAP as well. I will comment on the document with findings as I go.

I would add to the “Message” one additional (receive) timestamp and a counter.

The counter is created by the message provider (e.g. a publisher) and can be used to quickly detect message drops later on. The additional receive timestamp is set when the data reaches the sink (like the subscribing recorder application). This is very useful for later latency checks.
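The counter-based drop detection described here is a one-pass scan for gaps; a minimal sketch (function name is my own):

```python
def find_drops(counters):
    """Find gaps in the publisher-side monotonic counter.

    Returns (expected, got) pairs at each gap; a non-empty result
    means messages were dropped somewhere between publisher and sink.
    """
    drops = []
    for prev, cur in zip(counters, counters[1:]):
        if cur != prev + 1:
            drops.append((prev + 1, cur))
    return drops
```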

In the eCAL HDF5 recording format, both pieces of information have helped us a lot in the post-processing of our AD vehicle measurement sets.

Just to ask … what is the reason not to use HDF5 as the container format? It has exactly the described options for chunks, indexing, and header information, and a lot of available APIs to read and write it.


In an age long past (10 years ago), @Ingo_Lutkebohle and I did quite a bit of work looking at improving the format used for recording bags in ROS 1. There were three different approaches we looked at.


A format built on HDF5, which has been mentioned so many times. I can’t remember much about what we did, but there’s a partial implementation here. I don’t think we got very far; the work done was probably mainly prototyping and benchmarking. I can’t say why the work was dropped, but I do agree that HDF5, while very widely used and supported, has limitations that reduce its usefulness for us.

Extended ROS 1 bag format

@Ingo_Lutkebohle and his colleagues at Bosch did most of the work on this one. There was a website describing the format, but I can’t find it now. It was an extended version of the rosbag format to deal with some of the shortcomings of that format.

An EBML format based on the Matroska format

I did most of the work on this one, so it’s the one I know the most about. Matroska uses a file format called EBML, which stands for Extensible Binary Meta Language. You can think of it as a binary form of XML. An advantage of EBML is that you can specify a schema that defines the file structure, which is exactly what Matroska does.
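To make the "binary XML" idea concrete: every EBML element starts with a variable-length integer (VINT) for its ID and another for its size, where the count of leading zero bits in the first byte gives the total width. A rough sketch of a VINT reader, following the encoding later standardized in RFC 8794:

```python
def parse_vint(buf: bytes, offset: int = 0):
    """Parse one EBML variable-length integer (VINT).

    The number of leading zero bits in the first byte determines the
    total byte width; the marker bit is stripped to recover the value.
    Returns (value, bytes_consumed).
    """
    first = buf[offset]
    if first == 0:
        raise ValueError('VINT wider than 8 bytes not supported here')
    width, mask = 1, 0x80
    while not first & mask:      # count leading zero bits
        width += 1
        mask >>= 1
    value = first & (mask - 1)   # strip the length-marker bit
    for b in buf[offset + 1:offset + width]:
        value = (value << 8) | b
    return value, width
```

Everything in an EBML file (and hence a Matroska file) is built from elements framed this way, which is why unknown elements can be skipped so cheaply: read the ID, read the size, seek past the body.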

While I was able to define the complete format and produce part of an implementation (“tawara” means “sack” in Japanese - I spent way too much time trying to choose a “cool name”), I unfortunately got shifted to a different project before I had time to complete it. It was also up against the Jupiter-sized inertia of the existing rosbag format by that point, and I don’t think it would have ever been adopted even had I completed an implementation. I did, however, get far enough to convince myself that the format solved all the problems of the rosbag format and then some.

The Matroska format looks very complex, and it is, because it has to deal with all of the foibles of various media formats to enable it to be a flexible container format. Fortunately, we didn’t need most of that complexity (no differentiation between audio and video, for example, or support for 3D video), so the adapted format is much simpler while retaining all of the flexibility of a container format. Because ultimately that’s what we need for rosbag2: a fast-to-write, easy-to-read container format.

I remember from when I was doing this work that although the format has more complexity than the rosbag format or the format that Ingo’s team produced, it had advantages and additional features that could have been useful. Some that I can remember are:

  • Ability to write data fast and index it later (or not at all if you don’t care about easy seeking)
  • Fast seeking, when index information is available
  • Chapters, for rapid jumps to interesting parts of a bag file
  • Can store the schema of messages, as well as any other attachment you like (thumbnail of the visualised data, for example)
  • Serialisation-agnostic, including a different serialisation format for each topic if that’s your thing
  • Robust to errors, including the use of CRC-32 checksums, the ability to skip corrupted elements, and re-indexing after recording
  • Robust to version changes, as unknown elements can easily be skipped, allowing old parsers to play files produced by newer writers (to a limit, of course)
  • Files can be rewritten in place on disc to a degree, if consideration is given at recording time. This is most commonly used to allow index information to be added later without changing the file size, especially when splitting files
  • Segmentation, which means that not only can you split it into multiple files, you can choose which of those files to play back in what order later on - more useful for AV data than robotics data, but could be useful to splice multiple scenes together
  • Support for tags, such as when produced, robot serial number, who reviewed it, information on the parser that produced it, or whatever else you might want to tag a bag file with
  • Relative time stamps, meaning you can move an entire block in time by shifting the block time stamp - useful for adding a sudden delay in data for testing, for example
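The error-robustness point in the list above (CRC-32 checksums, skipping corrupted elements) can be sketched in a few lines: each element carries a checksum of its body, so a reader can discard a damaged element and continue rather than aborting the whole file. The framing below is hypothetical, not Matroska's actual layout:

```python
import zlib

def frame(body: bytes) -> bytes:
    """Prefix an element body with a CRC-32 of its contents."""
    return zlib.crc32(body).to_bytes(4, 'little') + body

def check(element: bytes):
    """Verify an element; return its body, or None to signal
    'skip this corrupted element and keep reading'."""
    crc = int.from_bytes(element[:4], 'little')
    body = element[4:]
    return body if zlib.crc32(body) == crc else None
```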

Keep in mind that this was done 10 years ago, when we all knew a lot less about what we need from a recording format. There are some things I think I would change or add in the format now. A few I can think of off the top of my head are:

  • Explicitly storing the message schema in the topic information rather than using an attachment per track.
  • Change some of the element names to be more like what they are rather than just reusing as-is the names from the Matroska specification.
  • Add an element to the header containing PNG/HDF5/MCAP-style magic bytes to detect transfer errors, because those look useful.
  • Add an additional timestamp field for “message received time”.
  • Provide space for whole-file information such as earliest and latest time stamps, total message count, etc. Information that can be calculated and written after recording is complete.
  • Update the format to comply with the new EBML RFC draft, including providing an XML-format schema, and remove redundant information defined in the RFC, such as the definition of void and CRC-32.
  • Change the specification to be Markdown instead of reStructuredText? Depends on table-rendering capability.
  • Choose a better name.

I’d love to revise this format and produce an implementation. In case anyone is interested in iterating on this format, I’ve pushed the specification to a new repository. Fixes, additions, improvements, and other useful contributions are welcome! I can move the repository somewhere less “belongs to me” as well, if that’s preferred (perhaps an OSRF repository?).


wow, I just remembered I’ve actually posted to the mailing list that was set up for this as well (I did a “review” of MPEG-TS, with similar goals/structures as Matroska).

I still believe MKV would make a good container format.

One thing you haven’t mentioned @gbiggs: being able to reuse existing tooling for muxing, splitting, splicing, etc. And many of those tools are (already) cross-platform and well supported.

To short-circuit future discussion: would the evaluation of HDF5 be accessible somewhere? @rex-schilasky makes a good case for using it.

Indeed; that’s one of the ones I forgot, and a very strong one.

Unfortunately I don’t remember where any of that information ended up. Most likely it was on the mailing list, which I can’t even remember the location of now. It might have been a KU Leuven mailing list.

I think HDF5 would be usable but suffers drawbacks from not being a container format in the truest sense of the term.

It seems narkive has archived all of it: RRFC.

And the Wayback Machine has the website: (well, bits of it).

Would the container format have native support for encryption of the data? We have had to extend the ROS BAG format to include encryption (as well as compression).

What is the use case for encrypting part of the file instead of the entire file? Do you want encryption on a per-topic basis, or per-message, or something else?

Encryption is rarely incorporated directly into container formats, since it is trivial to encrypt entire files; it only makes sense when there is a compelling use case for mixing encrypted and unencrypted data in a single container, or for using different encryption keys or algorithms within a single file.

RE: compression, what compression did you extend the rosbag format with? It already supports LZ4 and BZ2, was this to add ZSTD or another algorithm?


An EBML/MKV-based container format would support encryption of individual streams. I’m not sure what the use case is, though.