I think this representational problem needs to be approached from a use-case perspective. Why are we detecting objects?
I think @yukkysaito has it mostly right: the purpose of object representations is to provide information that a planner can consume. I think there is also a potential use case for detection acting as a preprocessing step (i.e. determining RoI) for scene understanding, such as detecting buildings, signage, etc.
Fundamentally, I broadly agree with the flow of:
Detection -> Tracking -> Path Prediction -> Planning
I also broadly agree with the general specification of what fields should be present after each stage.
However, I disagree that they should necessarily share a single common type with fields selectively filled in. That would create ambiguity from both a developer and a user perspective. An object ID, for example, has no semantic meaning when we're talking about an instantaneous detection.
For example, the output of an object detection stack exists to represent the objects detected in the world at a single instant. Such an instantaneous detection could be consumed directly by planning, but it will most likely be consumed by tracking.
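To make the distinction concrete, here is a minimal C++ sketch of what keeping the types separate might look like. All type and field names here are hypothetical placeholders for illustration, not a proposal for concrete message definitions:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical pose/shape placeholders, just for illustration.
struct Pose  { double x, y, z, roll, pitch, yaw; };
struct Shape { std::vector<double> footprint; double height; };

// An instantaneous detection: deliberately no ID field, because
// identity across frames is meaningless before tracking has run.
struct DetectedObject {
  Pose pose;
  Shape shape;
};

// A tracked object: the ID (and a filtered velocity) only become
// meaningful once a tracker has associated detections over time.
struct TrackedObject {
  std::uint64_t id;
  Pose pose;
  Shape shape;
  double vx, vy, vz;  // velocity estimated by the tracker
};
```

Keeping these as separate types means a downstream consumer can never read an `id` that was never assigned, which is exactly the ambiguity a single common type invites.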
Fundamentally, the representation should cover the following at a bare minimum:
- Where each object is in 3D space, and what volume it occupies (size/shape/orientation)
This is the core information needed for the planning use case, and it is also sufficient for tracking.
On top of that, you can then add other features of an object that can be detected instantaneously, for example (see the sketch after this list):
- Label (with confidence)
- Orientation (with confidence)
- Velocity (e.g. using FMCW radars)
- Shape features (for matching/classification)
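As a rough sketch of how those optional instantaneous attributes could attach to the core detection type (reusing `Pose` and `Shape` from the sketch above; again, all names are hypothetical), `std::optional` expresses that any given sensor stack may or may not be able to provide each field:

```cpp
#include <optional>
#include <string>
#include <vector>

// Pose and Shape as defined in the earlier sketch.

// Hypothetical label-with-confidence pair.
struct Classification {
  std::string label;
  double confidence;  // in [0, 1]
};

struct DetectedObjectWithAttributes {
  Pose pose;    // required core fields
  Shape shape;

  std::optional<Classification> classification;         // label (with confidence)
  std::optional<double> yaw_confidence;                 // orientation confidence
  std::optional<double> radial_velocity;                // e.g. measured by an FMCW radar
  std::optional<std::vector<double>> shape_descriptor;  // features for matching/classification
};
```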
I think the fundamental size/shape/position fields can be satisfied by any reasonable object detection stack, whether based on LiDAR, stereo cameras, radar, or anything else.