
New perception architecture: message types

@yukkysaito - I think the field you’re referencing is “detection_level” which is meant to indicate whether this object has been “detected” or “tracked.” From the notes in, here are the definitions:

A Detected object is one which has been seen in at least one scan/frame of a sensor.
A Tracked object is one which has been correlated over multiple scans/frames of a sensor.
An object which is detected can only be assumed to have valid pose and shape properties.
An object which is tracked should also be assumed to have valid twist and accel properties.
The validity of the individual components of each object property are defined by the property's covariance matrix.
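To make the definitions above concrete, here is a minimal sketch in plain Python (not a ROS message definition). The names `DetectionLevel`, `detection_level_for` and `valid_properties` are hypothetical, and the rule that more than one correlated scan means "tracked" is an assumption for illustration only.

```python
from enum import Enum


class DetectionLevel(Enum):
    DETECTED = 1  # seen in at least one scan/frame of a sensor
    TRACKED = 2   # correlated over multiple scans/frames of a sensor


def detection_level_for(scan_count: int) -> DetectionLevel:
    """Hypothetical rule: more than one correlated scan means TRACKED."""
    if scan_count < 1:
        raise ValueError("an object must be seen in at least one scan")
    return DetectionLevel.TRACKED if scan_count > 1 else DetectionLevel.DETECTED


def valid_properties(level: DetectionLevel) -> set:
    """Properties a consumer may assume are valid at each level."""
    props = {"pose", "shape"}
    if level is DetectionLevel.TRACKED:
        props |= {"twist", "accel"}
    return props


print(valid_properties(detection_level_for(1)))
print(valid_properties(detection_level_for(3)))
```

The per-component validity would still come from each property's covariance matrix; this sketch only covers which properties can be assumed to be populated.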

classification_age indicates the number of “scans” or “detections” made of the object where the classification type is the same. When a sensor classifies an object, it usually tells you how many “scans” of that object have been sent since the object was classified as that type. This helps determine the certainty of the classification.
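As a rough illustration of how such a counter could behave, the sketch below resets the age whenever the classification changes. The function name `update_classification_age` and the choice to start the age at 1 on the first scan are assumptions, not something a particular sensor or the message definition specifies.

```python
def update_classification_age(prev_type, prev_age, new_type):
    """Return (type, age) after one more scan of the same object.

    The age grows while the classification stays the same and
    restarts at 1 (an assumed convention) when it changes.
    """
    if new_type == prev_type:
        return new_type, prev_age + 1
    return new_type, 1


cls, age = None, 0
for observed in ["CAR", "CAR", "TRUCK", "TRUCK", "TRUCK"]:
    cls, age = update_classification_age(cls, age, observed)

print(cls, age)  # TRUCK 3
```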

Based on your experience in robotics and autonomous vehicles, I would like to hear your opinions, @JWhitleyAStuff @Dejan_Pangercic and @sgermanserrano, about this message definition not including sensor data (e.g. ImageROI, PointCloud).
Do you think it is necessary? Or do you think it’s better to keep it like this to add an abstraction layer?

@amc-nu I think this is based on your intent for the message. If you intend to follow a “domain-specific-controller” approach, then you are trusting that the individual sensor processing nodes know how to correctly filter the raw data and produce abstracted objects for the most part. The uncertainty is then encoded into the covariance matrix and the classification quality data.

However, if you intend to do either data fusion before object segmentation or fusion and segmentation in the same node, you would need the raw data in the message as well. My understanding is that Autoware is shooting for the first approach so I would say we don’t need the raw data.


Agreed with @JWhitleyAStuff’s comment. I would expect the raw data to be processed in a separate step so that the filtered sensor data is in a usable state. I think it also makes sense in a setup where you might have an edge device on the sensor that is pre-processing the raw data for consumption by higher-level nodes.


I think this work is being blocked by a lack of a shared understanding of what it is we want to achieve. What objects do we want to recognise, where do we want to recognise them, what sorts of data do we want to use, what data rates, should data be synchronised or can information be added on to a detection after the fact, do we or do we not use consecutive detections to strengthen an object’s presence, how interchangeable/optional do we want different algorithms and detection types to be, and so on. There are a huge number of unanswered questions that need to be defined and then answered before we can even begin to think about the messages used.

In other words, we need to define our requirements before we try to solve them. Otherwise we are solving an unknown or undefined problem.

We also need to keep in mind that we are designing Autoware for all Autoware users, not just for Tier IV’s favourite sensor set, or AutonomousStuff’s specific demonstration. I’m not saying that that is what is happening, but it is easy to forget.

Additionally, I think it would be useful to draw up a list of:

  • The different types of sensors we expect to be used. Not just ones we use now, but also ones a potential Autoware user might use.
  • The different types of data we might process. Obviously this closely relates to the sensors used, but don’t forget using post-processed data as an input to an algorithm, e.g. merged dense point clouds versus individual sparse point clouds, or point clouds with or without RGB data added from a camera.
  • The object locating, object identifying, object tracking, object predicting, etc. algorithm types that we might use.
  • Possible orderings of algorithms.

@amc-nu in my experience you need to decide between performance and synchronization. That is if you do not have many nodes and you have fast middleware, then you can use message types that also include raw data. If you have the opposite case then you should go with small messages.

If you split your message types too much, you will have to deal with time synchronization once the messages are received by the end node. That is both hard to do and computationally expensive.
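To give a feel for what the end node has to do when related data arrives in separate messages, here is a minimal sketch that pairs timestamps from two streams by nearest match within a tolerance. The function `match_by_stamp` and the integer-millisecond stamps are hypothetical; a real node would also have to buffer messages and handle out-of-order arrival.

```python
from bisect import bisect_left


def match_by_stamp(stamps_a, stamps_b, tolerance):
    """Pair each stamp in stamps_a with the nearest stamp in stamps_b.

    stamps_b must be sorted. Pairs further apart than `tolerance`
    are dropped. All values share one unit (milliseconds here).
    """
    pairs = []
    for t in stamps_a:
        i = bisect_left(stamps_b, t)
        # Only the neighbours on either side of t can be the nearest.
        candidates = stamps_b[max(i - 1, 0): i + 1]
        if not candidates:
            continue
        best = min(candidates, key=lambda s: abs(s - t))
        if abs(best - t) <= tolerance:
            pairs.append((t, best))
    return pairs


objects_ms = [0, 100, 200]   # stamps of object messages
images_ms = [10, 115, 300]   # stamps of separately published raw data
print(match_by_stamp(objects_ms, images_ms, tolerance=20))
```

Here the third object message finds no partner within 20 ms, which is exactly the kind of case the end node must decide how to handle.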

In any case, I believe that we should finish the computational graph architecture first (how many nodes and their composition) and then define the messages, not the other way around. I assume that AS has a solid computational graph architecture that lets them define such messages.


I listed the differences between derived_object_msgs/ObjectWithCovariance and DynamicObject, and heard why DynamicObject is constructed the way it is. I’d like to clarify the differences and the reasons behind them, and to build a more common view across the whole community.

@JWhitleyAStuff Would you comment about the different information?

Same information

  • geometry_msgs/PoseWithCovariance pose
  • geometry_msgs/TwistWithCovariance twist
  • geometry_msgs/AccelWithCovariance accel
  • geometry_msgs/Polygon polygon and geometry_msgs/Polygon Shape::footprint

New common view

Different information


PoseWithCovariance[] past_paths is removed from DynamicObject since planning is not interested in past information. Though prediction will require past information, it can be resolved inside prediction.


Thank you for the summary. Do you have any opinions?

I think the reasons behind DynamicObject are reasonable, since I created the table based on what I heard :wink: My opinion is already reflected in the table, e.g. object_classified and past_paths are not necessary.

@JWhitleyAStuff Sorry to bother you again. Would you comment about the different information? I’m not familiar with the background of ObjectWithCovariance.

@kfunaoka I’m sorry it has taken so long to get back to you. Here are responses addressing your issues:

The object does have a header field. However, it does not need to be populated and ObjectWithCovarianceArray also has a header.

The types listed for “classification” are not exhaustive nor definitive. As far as I know, the message type has not been extensively used so it is open to modification. We have only done tests with it internally and have not released any packages that use it. I agree with your assessments for the “UNKNOWN_” types. They were types provided by a sensor vendor so we included them.

Not a problem to change the classification_certainty to a float (0-1).

Many algorithms use a convex hull bounding area to define an object. This is why “shape” was included. There is also geometry_msgs/Polygon polygon in the message for defining non-normal polygons.
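For readers unfamiliar with the convex hull bounding area mentioned here, the sketch below computes a 2D hull of an object's points with Andrew's monotone chain algorithm. This is a generic textbook method chosen for illustration, not the algorithm used by any particular Autoware or AutonomouStuff node.

```python
def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a right turn or lie on a line.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each half is the first point of the other.
    return lower[:-1] + upper[:-1]


cluster = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(convex_hull(cluster))  # [(0, 0), (2, 0), (2, 2), (0, 2)]
```

The resulting vertex list is the kind of data that could populate a polygon footprint field.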

Regarding the rest of the comments: I think the overall concept is that our message structure for these is flexible. We can add or modify just about anything in the message, though I would prefer not to remove much (if any) of the fields.

@JWhitleyAStuff Thank you very much. It seems our basic concepts do not conflict much :slight_smile:
I’d prefer to remove might-be-useful fields because they bloat the msgs and confuse developers.

@gbiggs @sgermanserrano @Dejan_Pangercic What do you think the next step should be?
Any comments or questions?

My opinion from my previous post is unchanged.

In addition to that, I agree that we should have the most minimal message possible as the starting point. Message types are the API of nodes, and like any API it is relatively easy to add but extremely difficult to take away. Anything we have in there now will be in for the foreseeable future, so if it is not intended to be used right away I would prefer not to include it.

I would also like to hear from the people at Apex.AI who have been doing their object detection stack about what sort of information they think is needed. @cho3 @yunus.caliskan @esteve


I think this representational problem needs to be thought of from a use case perspective. Why are we detecting objects?

I think @yukkysaito has it mostly right: the purpose of objects is so that the information can be used by a planner. I think there is also a potential use case for detection acting as a preprocessing step (i.e. determining RoI) for scene understanding, such as detecting buildings, signage etc.

Fundamentally, I broadly agree with the flow of:

Detection -> Tracking -> Path Prediction -> Planning

I also broadly agree with the general specification of what fields should be present after each stage.

However, I disagree that they should necessarily have the same common type, with fields selectively filled in. I think that would create ambiguity from both a developer and user perspective. An object ID, for example, has no semantic meaning when we’re talking about an instantaneous detection.

For example, as the output of an object detection stack, the purpose is to provide representation of the objects in the world instantaneously detected. This instantaneous detection could be directly used by planning, but it is most probably used by tracking.

Fundamentally, the representation should cover the following at a bare minimum:

  • Where the objects are in 3-space and what space they cover (size/shape/orientation)

This is the core functionality needed for the planning use case, and it is also sufficient for the tracking case.

You can then add on top of that other features of an object that you can detect instantaneously, for example:

  • Label (with confidence)
  • Orientation (with confidence)
  • Velocity (e.g. using FMCW radars)
  • Shape features (for matching/classification)

I think the fundamental size/shape/position fields can be satisfied by any reasonable object detection stack, i.e. LiDAR, stereo camera, RADAR, etc.
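One way to picture separate types per stage, rather than one common type with selectively filled fields, is sketched below in plain Python. The class and field names are hypothetical, and the constant-velocity prediction is only a placeholder assumption to show data flowing through Detection -> Tracking -> Path Prediction.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3 = Tuple[float, float, float]


@dataclass(frozen=True)
class DetectedObject:
    position: Point3  # where the object is in 3-space
    size: Point3      # length, width, height
    yaw: float        # orientation about the vertical axis


@dataclass(frozen=True)
class TrackedObject:
    object_id: int    # only meaningful once an object is tracked
    detection: DetectedObject
    velocity: Point3


@dataclass(frozen=True)
class PredictedObject:
    track: TrackedObject
    predicted_path: List[Point3]


def predict_constant_velocity(track, horizon_s, step_s):
    """Placeholder predictor: extrapolate position along the velocity."""
    steps = int(round(horizon_s / step_s))
    p, v = track.detection.position, track.velocity
    path = [
        (p[0] + v[0] * k * step_s, p[1] + v[1] * k * step_s, p[2] + v[2] * k * step_s)
        for k in range(1, steps + 1)
    ]
    return PredictedObject(track=track, predicted_path=path)


det = DetectedObject(position=(0.0, 0.0, 0.0), size=(4.0, 2.0, 1.5), yaw=0.0)
trk = TrackedObject(object_id=7, detection=det, velocity=(2.0, 0.0, 0.0))
print(predict_constant_velocity(trk, horizon_s=1.0, step_s=0.5).predicted_path)
```

With this layout a detection simply has no ID field to fill in incorrectly, which is the ambiguity the post is concerned about.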


That’s the point I was trying to make. Thanks for making it clearer! We need to go right back to the start and figure out what we want to be able to achieve from perception.

I strongly agree with this. Message types are APIs, and APIs should make it easy for the developer to do things right and hard for them to do things wrong. Fields that are selectively filled in make it easy for the developer to screw up.

However from an efficiency point of view this raises some questions about keeping memory copies to a minimum, and we may need to do some work on the ROS 2 side to enable ensuring data structures are bit-compatible when they become part of a larger message, making messages out of pointers, etc. to reduce memory copying through the processing pipeline.

And FMCW LiDARs, soon! (If Aurora doesn’t buy them all up.)

Are you referring to using any one of those sensors to fulfil the above needs, or a combination?


However from an efficiency point of view this raises some questions about keeping memory copies to a minimum, and we may need to do some work on the ROS 2 side to enable ensuring data structures are bit-compatible when they become part of a larger message, making messages out of pointers, etc. to reduce memory copying through the processing pipeline.

True, especially in a high performance context.

However, while I think this is important to keep in mind, I think for the near to medium term this isn’t a big issue, as IIRC ROS 2 makes something like 5 copies during a normal pub/sub process.

If it does come to the point where the copies into different message types are the bottleneck, then I imagine we could probably use a base message type (where the memory is allocated from the middleware layer–a feature which I think ROS 2 does not yet have), and add some kind of wrapper around it for different views.
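The "base type plus wrapper views" idea could look roughly like the sketch below. It is plain Python with hypothetical class names, and it only illustrates the concept of stage-specific views sharing one underlying object without copying; it says nothing about how ROS 2 or any middleware would allocate the memory.

```python
class BaseObject:
    """Single underlying record holding every field once."""
    __slots__ = ("position", "size", "object_id", "velocity")

    def __init__(self, position, size, object_id, velocity):
        self.position = position
        self.size = size
        self.object_id = object_id
        self.velocity = velocity


class DetectionView:
    """Exposes only the fields that make sense for a detection."""

    def __init__(self, base):
        self._base = base  # a reference, not a copy

    @property
    def position(self):
        return self._base.position

    @property
    def size(self):
        return self._base.size


class TrackingView(DetectionView):
    """Adds the fields that only exist once an object is tracked."""

    @property
    def object_id(self):
        return self._base.object_id

    @property
    def velocity(self):
        return self._base.velocity


base = BaseObject([1.0, 2.0, 0.0], [4.0, 2.0, 1.5], 7, [0.5, 0.0, 0.0])
det_view = DetectionView(base)
trk_view = TrackingView(base)
print(det_view.position is base.position)  # same list object, no copy
```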

In the near term, I think we should stick with different message types to make things more clear from the development perspective.

And FMCW LiDARs, soon! (If Aurora doesn’t buy them all up.)


Are you referring to using any one of those sensors to fulfil the above needs, or a combination?

I think a minimal requirement is that any object detection stack should be able to generate an object list with position and shape information in 3-space.

How this is achieved, I think is implementation defined. Some sensor setups which may be able to achieve this include:

  • LiDAR only
  • Stereo camera
  • LiDAR + Camera (pixel/point-wise fusion, AKA raw/early fusion)
  • RADAR
  • Some combination of the above

If you combine different implementations (i.e. to support different sensor modalities in different ways), then the result can be combined/fused on the object or tracking level for later processing.

I hope that made some sense.


I’m hopeful we will be able to utilise the intra-process zero-copy transport. The API leaves something to be desired currently but most of the issues are isolated to node construction rather than implementation.

I do agree though that for now we should avoid early optimisation and concentrate on designing a good set of message types.

Do you think there is value in designing the interfaces used by common sensor setups, with the idea being that people might chain them? e.g. Take the LiDAR and Radar systems and then combine them downstream to produce a combined result? I’m mainly thinking about this in terms of how we can make the perception pipeline modular and flexible so users can relatively easily construct a pipeline specific to their sensor setup.


Yeah I think that’s something you’ll need at some point to have a modular and sustainable ecosystem of parts.

My personal opinion is that I think this common point of contact or language would be the object list representation. You could then have a standard component which combines object lists, or you could do the fusion on the tracking level.
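A standard component that combines object lists might, in its simplest form, look like the following sketch. It uses greedy nearest-neighbour association with a distance gate and averages matched positions. The function name, the 2D centres and the averaging rule are all assumptions for illustration; a real fusion component would weigh covariances and shapes.

```python
import math


def fuse_object_lists(list_a, list_b, gate_m):
    """Merge two lists of (x, y) object centres.

    Objects closer than `gate_m` are treated as the same object and
    averaged; everything else is passed through unchanged.
    """
    fused = []
    unmatched_b = list(list_b)
    for a in list_a:
        best, best_d = None, gate_m
        for b in unmatched_b:
            d = math.dist(a, b)
            if d <= best_d:
                best, best_d = b, d
        if best is None:
            fused.append(a)
        else:
            unmatched_b.remove(best)
            fused.append(((a[0] + best[0]) / 2, (a[1] + best[1]) / 2))
    fused.extend(unmatched_b)
    return fused


lidar_objects = [(10.0, 0.0), (20.0, 5.0)]
radar_objects = [(10.0, 1.0), (40.0, 0.0)]
print(fuse_object_lists(lidar_objects, radar_objects, gate_m=2.0))
```

Because both inputs share the same object list representation, the component does not need to know which sensor stack produced them.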

I think trying to do this at a higher level (as in higher in the data flow, closer to the drivers) would be difficult, as doing this successfully relies on having shared semantics between intermediate components of each particular stack. That is hard to guarantee across many different possible implementations, and maybe not the best use of developer time at this early stage.

However, I think it should be possible and reasonable to combine modalities on an early/raw level (i.e. combining a LiDAR scan and a camera image seems standard).


I agree with @cho3 @gbiggs. Thank you. When defining the message types, I also wondered whether one type should be common to detection, tracking and prediction. I understand both Geoff’s and Cho3’s opinions, but I made it a common type because, depending on the requirements, planning might use tracking or detection results without prediction. (This could also be solved with a pass-through.) I would like to discuss this @cho3 @gbiggs.

Also, we implemented a prototype to verify that there is no information leak in the proposed type.
It works with camera lidar fusion, point pillars, euclidean cluster based on the proposed type.
camera lidar fusion :
point pillars :
euclidean cluster :

Even when I consider new algorithms, I feel that there is no problem with the type. However, whether it should be common to detection, tracking and prediction needs discussion.


Yeah I think the aspects you highlighted as being filled for each stage definitely make sense. I’m hardly surprised that the message definition works well for 3 different object detection stacks.

I also definitely agree with and subscribe to your idea of using multiple types for planning/collision checking. In fact, I was inspired by autoware to start thinking about this myself.

However, I think we need to be careful in how much we couple our software components. If the main use case is just for handling collision checking from different sources (also what about point clouds?), then I would very strongly caution against using a common message type for only this reason.

My rationale is that doing that would basically tie everything in the perception and planning stack together, making them all very tightly coupled, which I think is generally understood to not be a good thing architecturally.

Personally, the way I would structure it (for the collision checking only use case) would be to set up the collision checking data structure in such a way that it can handle multiple kinds of inputs, but still allow for sensible queries.
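As a sketch of what such a collision checking data structure could look like, the example below rasterises different input kinds into one occupancy set and answers a single kind of query. The class name, the 2D grid and the axis-aligned boxes are simplifying assumptions for illustration only.

```python
import math


class CollisionChecker:
    """Accepts several input kinds, answers one kind of query."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self._occupied = set()

    def _cell(self, x, y):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def add_points(self, points):
        """Input kind 1: raw (x, y) points, e.g. from a point cloud."""
        for x, y in points:
            self._occupied.add(self._cell(x, y))

    def add_box(self, cx, cy, length, width):
        """Input kind 2: an axis-aligned object box (assumed, no yaw)."""
        x0, y0 = self._cell(cx - length / 2, cy - width / 2)
        x1, y1 = self._cell(cx + length / 2, cy + width / 2)
        for ix in range(x0, x1 + 1):
            for iy in range(y0, y1 + 1):
                self._occupied.add((ix, iy))

    def is_occupied(self, x, y):
        return self._cell(x, y) in self._occupied


checker = CollisionChecker(cell_size=1.0)
checker.add_points([(0.5, 0.5)])
checker.add_box(cx=5.0, cy=5.0, length=2.0, width=2.0)
print(checker.is_occupied(5.5, 4.1), checker.is_occupied(2.5, 2.5))
```

The perception outputs stay separate types here; only the checker knows how to ingest each of them.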

If there are more use cases (i.e. chaining to avoid copies as @gbiggs mentioned ), then it might be viable.

Sorry for the late reply.
I talked with @Kosuke_MURAKAMI @gbiggs.
In conclusion, we agreed to “decouple the msgs for detection, tracking, and prediction so that each contains only the necessary information”.
In, use this msg type. Thank you.
