New perception architecture: message types

@JWhitleyWork Thank you very much. It seems our basic concepts are not really in conflict :slight_smile:
I’d prefer to remove the “might-be-useful” fields because they make the msgs bloated and confuse developers.

@gbiggs @sgermanserrano @Dejan_Pangercic What do you think about the next step?
Any comments or questions?

My opinion from my previous post is unchanged.

In addition to that, I agree that we should have the most minimal message possible as the starting point. Message types are the API of nodes, and like any API it is relatively easy to add but extremely difficult to take away. Anything we have in there now will be in for the foreseeable future, so if it is not intended to be used right away I would prefer not to include it.

I would also like to hear from the people at Apex.AI who have been doing their object detection stack about what sort of information they think is needed. @cho3 @yunus.caliskan @esteve

@gbiggs

I think this representational problem needs to be thought of from a use case perspective. Why are we detecting objects?

I think @yukkysaito has it mostly right: the purpose of objects is so that the information can be used by a planner. I think there is also a potential use case for detection acting as a preprocessing step (i.e. determining RoI) for scene understanding, such as detecting buildings, signage etc.

Fundamentally, I broadly agree with the flow of:

Detection -> Tracking -> Path Prediction -> Planning

I also broadly agree with the general specification of what fields should be present after each stage.

However, I disagree that they should necessarily have the same common type, with fields selectively filled in. I think that would create ambiguity from both a developer and user perspective. An object ID, for example, has no semantic meaning when we’re talking about an instantaneous detection.
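
To make the distinction concrete, here is a rough sketch. These are hypothetical C++ stand-ins for the message definitions, not proposed types:

```cpp
#include <cstdint>

// One common type with selectively-filled fields: at detection time the id
// (and any prediction fields) exist but carry no meaning, so every consumer
// has to know which stage actually produced the message.
struct CommonObject
{
  uint64_t id;  // meaningless for an instantaneous detection
  // ... pose, shape, twist, predicted paths, all "optional" in practice ...
};

// Separate per-stage types: fields a stage cannot produce simply do not
// exist, so misuse is caught at compile time rather than at runtime.
struct DetectedObject  { /* pose + shape only */ };
struct TrackedObject   { uint64_t id; /* pose + shape + twist */ };
struct PredictedObject { uint64_t id; /* ... plus predicted paths */ };
```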

For example, the purpose of the output of an object detection stack is to provide a representation of the objects instantaneously detected in the world. This instantaneous detection could be used directly by planning, but it is most probably consumed by tracking.

Fundamentally, the representation should cover the following at a bare minimum:

  • Where the objects are in 3-space, and what space they cover (size/shape/orientation)

This is the core functionality needed for the planning use case, and it is also sufficient for the tracking case.

You can then add on top of that other features of an object that you can detect instantaneously, for example:

  • Label (with confidence)
  • Orientation (with confidence)
  • Velocity (e.g. using FMCW radars)
  • Shape features (for matching/classification)

I think the fundamental size/shape/position fields can be provided by any reasonable object detection stack, e.g. LiDAR, stereo camera, radar, etc.
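
Expanding the DetectedObject stand-in from my earlier sketch, the minimal content plus the optional extras could look roughly like this. Again, the names are hypothetical and the real definition would be a ROS 2 .msg rather than a C++ struct:

```cpp
#include <string>
#include <vector>

#include <geometry_msgs/msg/point32.hpp>
#include <geometry_msgs/msg/pose_with_covariance.hpp>
#include <geometry_msgs/msg/twist_with_covariance.hpp>

// Optional per-object classification, always carried with its confidence.
struct ObjectClassification
{
  std::string label;       // e.g. "car", "pedestrian"
  float confidence{0.0F};  // 0..1
};

struct DetectedObject
{
  // Bare minimum: where the object is and what space it covers.
  geometry_msgs::msg::PoseWithCovariance pose;     // position + orientation in 3-space
  std::vector<geometry_msgs::msg::Point32> shape;  // footprint polygon
  float height{0.0F};                              // extruded height of the footprint

  // Optional, instantaneously observable extras.
  std::vector<ObjectClassification> classifications;  // label(s) with confidence
  geometry_msgs::msg::TwistWithCovariance twist;      // e.g. radial velocity from an FMCW radar
  bool twist_is_valid{false};                         // whether the sensor provided a velocity
};
```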


That’s the point I was trying to make. Thanks for putting it more clearly! We need to go right back to the start and figure out what we want to be able to achieve from perception.

I strongly agree with this. Message types are APIs, and APIs should make it easy for the developer to do things right and hard for them to do things wrong. Fields that are selectively filled in make it easy for the developer to screw up.

However, from an efficiency point of view this raises some questions about keeping memory copies to a minimum. We may need to do some work on the ROS 2 side to ensure data structures are bit-compatible when they become part of a larger message, to allow making messages out of pointers, etc., in order to reduce memory copying through the processing pipeline.

And FMCW LiDARs, soon! (If Aurora doesn’t buy them all up.)

Are you referring to using any one of those sensors to fulfil the above needs, or a combination?


However, from an efficiency point of view this raises some questions about keeping memory copies to a minimum. We may need to do some work on the ROS 2 side to ensure data structures are bit-compatible when they become part of a larger message, to allow making messages out of pointers, etc., in order to reduce memory copying through the processing pipeline.

True, especially in a high-performance context.

However, while I think this is important to keep in mind, for the near to medium term this isn’t a big issue, as IIRC ROS 2 already makes something like 5 copies during a normal pub/sub process.

If it does come to the point where the copies into different message types are the bottleneck, then I imagine we could use a base message type (where the memory is allocated by the middleware layer, a feature which I think ROS 2 does not yet have) and add some kind of wrapper around it for different views.
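
Just to sketch what I mean by a wrapper, here is a toy version. All names are hypothetical, and it assumes the base message owns the storage:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical base message whose storage would ideally be allocated by the
// middleware layer; it contains every field any stage might fill in.
struct BaseObject
{
  uint64_t id;
  double x, y, z;
  // ... shape, twist, predicted paths, ...
};
struct BaseObjectArray
{
  std::vector<BaseObject> objects;
};

// A non-owning "view" that exposes only the fields that are meaningful after
// tracking, without copying the underlying buffer into a second message type.
class TrackedObjectsView
{
public:
  explicit TrackedObjectsView(const BaseObjectArray & base) : base_(base) {}

  std::size_t size() const { return base_.objects.size(); }
  uint64_t id(std::size_t i) const { return base_.objects[i].id; }

private:
  const BaseObjectArray & base_;  // reference only, no copy
};
```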

In the near term, I think we should stick with different message types to make things more clear from the development perspective.

And FMCW LiDARs, soon! (If Aurora doesn’t buy them all up.)

=)

Are you referring to using any one of those sensors to fulfil the above needs, or a combination?

I think it’s a minimal requirement that any object detection stack should be able to generate an object list with position and shape information in 3-space.

How this is achieved is, I think, implementation-defined. Some sensor setups which may be able to achieve this include:

  • LiDAR only
  • Stereo camera
  • LiDAR + Camera (pixel/point-wise fusion, AKA raw/early fusion)
  • Radar
  • Some combination of the above

If you combine different implementations (e.g. to support different sensor modalities in different ways), then the results can be combined/fused at the object or tracking level for later processing.

I hope that made some sense.


I’m hopeful we will be able to utilise the intra-process zero-copy transport. The API leaves something to be desired currently but most of the issues are isolated to node construction rather than implementation.
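
For what it’s worth, a minimal sketch of that path, assuming intra-process communication is enabled on the node; std_msgs::msg::String is just a stand-in for the eventual object-list message:

```cpp
#include <memory>
#include <utility>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

class DetectorNode : public rclcpp::Node
{
public:
  DetectorNode()
  : rclcpp::Node("detector", rclcpp::NodeOptions().use_intra_process_comms(true)),
    pub_(create_publisher<std_msgs::msg::String>("detected_objects", 10))
  {}

  void publish_once()
  {
    auto msg = std::make_unique<std_msgs::msg::String>();
    msg->data = "object list goes here";
    // Publishing a unique_ptr hands ownership to the middleware, which lets
    // an intra-process subscriber receive it without an extra copy.
    pub_->publish(std::move(msg));
  }

private:
  rclcpp::Publisher<std_msgs::msg::String>::SharedPtr pub_;
};
```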

I do agree though that for now we should avoid early optimisation and concentrate on designing a good set of message types.

Do you think there is value in designing the interfaces used by common sensor setups, with the idea being that people might chain them? e.g. take the LiDAR and radar systems and then combine their outputs downstream to produce a combined result. I’m mainly thinking about this in terms of how we can make the perception pipeline modular and flexible so users can relatively easily construct a pipeline specific to their sensor setup.


Yeah I think that’s something you’ll need at some point to have a modular and sustainable ecosystem of parts.

My personal opinion is that I think this common point of contact or language would be the object list representation. You could then have a standard component which combines object lists, or you could do the fusion on the tracking level.
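
A rough sketch of such a combiner node, using geometry_msgs::msg::PoseArray as a stand-in for the eventual object-list message (timestamps, frames and deduplication are ignored for brevity):

```cpp
#include <array>

#include <geometry_msgs/msg/pose_array.hpp>
#include <rclcpp/rclcpp.hpp>

class ObjectListCombiner : public rclcpp::Node
{
public:
  ObjectListCombiner()
  : rclcpp::Node("object_list_combiner"),
    pub_(create_publisher<geometry_msgs::msg::PoseArray>("objects_fused", 10))
  {
    // Each sensor-specific pipeline publishes its own object list; the
    // combiner simply republishes the union for the tracker downstream.
    sub_lidar_ = create_subscription<geometry_msgs::msg::PoseArray>(
      "objects_lidar", 10,
      [this](const geometry_msgs::msg::PoseArray::SharedPtr msg) {latest_[0] = msg; publish_union();});
    sub_radar_ = create_subscription<geometry_msgs::msg::PoseArray>(
      "objects_radar", 10,
      [this](const geometry_msgs::msg::PoseArray::SharedPtr msg) {latest_[1] = msg; publish_union();});
  }

private:
  void publish_union()
  {
    geometry_msgs::msg::PoseArray out;
    for (const auto & list : latest_) {
      if (list) {
        out.poses.insert(out.poses.end(), list->poses.begin(), list->poses.end());
      }
    }
    pub_->publish(out);
  }

  std::array<geometry_msgs::msg::PoseArray::SharedPtr, 2> latest_{};
  rclcpp::Publisher<geometry_msgs::msg::PoseArray>::SharedPtr pub_;
  rclcpp::Subscription<geometry_msgs::msg::PoseArray>::SharedPtr sub_lidar_;
  rclcpp::Subscription<geometry_msgs::msg::PoseArray>::SharedPtr sub_radar_;
};
```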

I think trying to do this at a higher level (as in higher up the data flow, closer to the drivers) would be difficult, as doing it successfully relies on having shared semantics between the intermediate components of each particular stack. That is hard to guarantee across many different possible implementations, and maybe not the best use of developer time at this early stage.

However, I think it should be possible and reasonable to combine modalities at an early/raw level (e.g. combining a LiDAR scan and a camera image seems standard).


I agree with @cho3 @gbiggs. Thank you. I also wondered whether the message type should be common to detection, tracking, and prediction when defining it. I understand both Geoff’s and Cho3’s opinions, but I made it a common type thinking that, depending on the requirements, planning might use tracking or detection results without prediction. (This can be solved by pass-through as one possible solution.) I would like to discuss this with @cho3 @gbiggs.

Also, we implemented a prototype to verify that no information is missing from the proposed type.
It works with camera-LiDAR fusion, PointPillars, and Euclidean clustering based on the proposed type.
camera-LiDAR fusion: https://www.youtube.com/watch?v=gqBKpq3Ejvs&t=13s
PointPillars: https://www.youtube.com/watch?v=sOLCgRkWlWQ&t=2s
Euclidean cluster: https://www.youtube.com/watch?v=wmh0OJr7baQ&t=25s

Even when I examine new algorithms, I feel there is no problem with the type. However, whether it should be common to detection, tracking, and prediction needs discussion.

@yukkysaito

Yeah, I think the aspects you highlighted as being filled in for each stage definitely make sense. I’m hardly surprised that the message definition works well for three different object detection stacks.

I also definitely agree with and subscribe to your idea of using multiple types for planning/collision checking. In fact, I was inspired by autoware to start thinking about this myself.

However, I think we need to be careful about how much we couple our software components. If the main use case is just for handling collision checking from different sources (also, what about point clouds?), then I would very strongly caution against using a common message type for only this reason.

My rationale is that doing that would basically tie everything in the perception and planning stack together, making them all very tightly coupled, which I think is generally understood to not be a good thing architecturally.

Personally, the way I would structure it (for the collision-checking-only use case) would be to set up the collision-checking data structure in such a way that it can handle multiple kinds of inputs, but still allow for sensible queries.
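
A toy sketch of that structure (axis-aligned boxes just to keep it short, and all names are hypothetical; a real implementation would use polygons or an occupancy grid):

```cpp
#include <algorithm>
#include <vector>

struct Box
{
  double min_x, min_y, max_x, max_y;
  bool overlaps(const Box & other) const
  {
    return min_x <= other.max_x && other.min_x <= max_x &&
           min_y <= other.max_y && other.min_y <= max_y;
  }
};

class CollisionWorld
{
public:
  // Producers insert obstacles in whatever form they have...
  void insert_object(const Box & bounding_box) { obstacles_.push_back(bounding_box); }
  void insert_point(double x, double y, double cell_size = 0.1)
  {
    obstacles_.push_back({x - cell_size, y - cell_size, x + cell_size, y + cell_size});
  }

  // ...but planning always asks the same question.
  bool in_collision(const Box & footprint) const
  {
    return std::any_of(
      obstacles_.begin(), obstacles_.end(),
      [&footprint](const Box & b) {return b.overlaps(footprint);});
  }

private:
  std::vector<Box> obstacles_;
};
```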

If there are more use cases (e.g. chaining to avoid copies, as @gbiggs mentioned), then it might be viable.

Sorry for the late reply.
I talked with @Kosuke_MURAKAMI @gbiggs.
As a conclusion, we agreed to decouple the msgs for detection, tracking, and prediction so that each carries only the necessary information.
We will use this msg type in autoware.ai. Thank you.
