Proposal - New Computer Vision Message Standards

Thanks for the awesome feedback, everyone! I’ll try to address everything that was brought up.

First, let me note that although I only created Classification and Detection messages, I think it makes sense to keep this as a general vision_msgs package, so additional computer vision-related messages can be added over time. I think that's more useful than making an overly specific classification_msgs or similar.

@reinzor, thanks for linking to your message definitions! I think annotations are already covered by the existing implementation: you could provide the bounding box coordinates in a Detection message, and the most likely label as the only result in the class probabilities. If we want to add other information, such as the color of the outline, then maybe that would be a better fit for visualization_msgs or another package.
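
To make that concrete, here's a rough sketch of what a hand-labeled annotation could look like when expressed as a 2D detection. The field names and types below are just placeholders to illustrate the idea, not the actual proposed definitions:

```
# Illustrative sketch only -- field names and types are placeholders,
# not the proposed definitions.

# Region of the image that was annotated (hypothetical 2D box type:
# center x/y in pixels plus width and height).
BoundingBox2D bbox

# Class probabilities. For a manual annotation there is exactly one
# entry, holding the annotated label with a score of 1.0.
string[]  labels    # e.g. ["mug"]
float32[] scores    # e.g. [1.0]
```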

On another note, is human pose estimation standardized enough to warrant a custom message type? Or is it best described by a TF tree, an arbitrary set of geometry_msgs/Pose, or some other existing ROS construct? I'm thinking of the fact that different human detectors provide different levels of fidelity, so it might be difficult to standardize.
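
Just to illustrate the "arbitrary set of poses" option, a minimal sketch might look something like the following, where a lower-fidelity detector would simply publish fewer keypoints. Again, the names and fields here are purely hypothetical:

```
# Hypothetical sketch of a keypoint-based human pose estimate -- not a
# proposal, just illustrating the "set of geometry_msgs/Pose" option.
std_msgs/Header header

# One entry per detected keypoint; detectors with less fidelity publish
# fewer entries rather than filling a fixed-size skeleton.
string[]             keypoint_names   # e.g. "nose", "left_wrist", ...
geometry_msgs/Pose[] keypoints        # same length as keypoint_names
float32[]            confidences      # optional per-keypoint confidence
```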

@ruffsl, my idea with having two poses is that the bounding box could actually have a different pose from the expressed object pose. For example, the bounding box center for a coffee mug might have some z-height and be off-center with respect to the body of the mug, while the expressed object pose might be centered on the cylindrical portion of the mug and placed at its bottom. However, maybe it makes sense to forgo the bounding box information, as this could be stored in the object metadata, along with a mesh, etc.
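
To illustrate the distinction, a 3D detection carrying both poses might look roughly like this (again, these names are placeholders, not final):

```
# Illustrative sketch of a 3D detection with two distinct poses.

# Pose of the object itself, e.g. centered on the mug's cylindrical
# body and placed at its base.
geometry_msgs/Pose object_pose

# Bounding box, whose center pose can differ from object_pose
# (e.g. raised in z and shifted toward the handle).
geometry_msgs/Pose    bbox_pose
geometry_msgs/Vector3 bbox_size   # extent of the box along each axis
```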

On the topic of nesting, I’m open to the idea of flattening the hierarchy and having Classification/Detection 2D/3D all include a new CategoryDistribution message. I’m not sure how much message nesting is considered standard practice, so I’ll look at some other packages to get an idea.
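
For reference, the flattened version I have in mind would look roughly like this, with the caveat that none of these names are settled:

```
# CategoryDistribution.msg (sketch)
string[]  ids      # category identifiers; could also be integer ids
float32[] scores   # one probability per entry in ids

# Detection2D.msg (sketch), reusing the distribution directly instead
# of nesting it inside a separate classification message
std_msgs/Header      header
CategoryDistribution results
BoundingBox2D        bbox   # hypothetical 2D box type
```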

@Jeremie, I like the idea of adding a standardized VisualFeature.msg or similar message, as long as there is some common baseline that can cover a lot of feature types. In my own experience, there's usually a lot of variation in how a feature is actually defined and represented, so I haven't been able to find a "lowest common denominator" myself. If you feel there's something that could be broadly useful, please feel free to post it here or make a pull request. I also agree with @Loy: although many classifiers use features internally, this should be hidden in the implementation except in special cases like the SLAM case described.