Proposal - New Computer Vision Message Standards

Hello computer vision users,

Please help us define a new set of computer vision specific ROS messages by reviewing the proposal and providing your feedback, either here or on the repository.

At OSRF, we are in the process of defining a new standard set of ROS messages for the computer vision community, and we’d like your help. This need was identified from our computer vision survey as a first step towards improving the ROS computer vision ecosystem, so thank you for the feedback!

The end result of this effort may be a new message package in common_msgs, a REP, or both. Our goal is to capture as many common computer vision use cases as possible, with the exception of navigation. (We feel that navigation and localization are already well-defined by the community and REP 105.) Object recognition and image classification are two primary targets we are hoping to hit, and we want to cover both 2D and 3D use cases.

The repository we have created is very much a work in progress, and only with your feedback can we make it better. Any feedback is welcome, but here are a couple of questions I have identified:

  1. Are there major use cases or edge cases not covered by this set of messages?
  2. Is this set of messages broad enough to encompass both handcrafted and machine learning-based approaches to computer vision?



Nice effort! What about also defining interfaces for annotations?

We also try to define “generic” interfaces:

However, it’s hard to capture things as:

I think you would also like to define something like detection or feature groups that belong together. Maybe take a look at how GeoJSON does it.

Currently, what is the difference in roles between the two poses in Detection3D: the top-level one and the one in the BoundingBox3D nested inside it?

  vision_msgs/Classification3D classification
  geometry_msgs/Pose pose <-
  vision_msgs/BoundingBox3D bbox
    geometry_msgs/Pose pose <-
    geometry_msgs/Vector3 size

As in, what is the relationship that would afford the use of the nested classification.header to convey the detection’s frame_id?

I suppose this is a larger question of semantics or dichotomy, but perhaps I’m of the thought that classifications derive from detections, such as ROIs, as opposed to vice versa. In either case, I think the relationship should be clarified if we are starting to nest standard message types.

To just throw this out here, I’ve been using SPENCER recently, and I’m beginning to really appreciate the message type layout they’ve used. Perhaps we could take some hints from the project:

When talking about computer vision message standards, one thing comes to mind: features.
It is not unusual to have different nodes exploiting the same kind of features (think SIFT/SURF etc), so that rather than extracting the same features several times, a single ‘extraction’ node does the job and publishes them. They are in turn exploited by (several?) others. A standard message may not be straightforward and I’m not sure this problem fits the scope of this proposal, however it certainly would be useful.


I think features should be regarded as an implementation detail of the classifier/detector and should therefore specifically not be exposed in these messages.

I really like this proposal, there should be a standard for this in my opinion.

In detail, what I like more about Reinzor’s message definitions is that there is a shared CategoricalDistribution.msg, which takes the role of the ids+scores combination in Classification2D.msg.

I think using the CategoricalDistribution reduces complexity a little on the user end, but the difference is very small. Another benefit is that the message definition is reusable.
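To make the reuse argument concrete, here is a minimal Python sketch of a shared distribution type used by two message-like classes. The field names (`ids`, `scores`, `results`, `bbox`) and the `most_likely` helper are assumptions for illustration; they are not taken from either repository.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CategoricalDistribution:
    """Hypothetical Python mirror of a shared ids+scores message."""
    ids: List[int]       # class identifiers
    scores: List[float]  # one score per id, in the same order

    def most_likely(self) -> int:
        """Return the id with the highest score."""
        if not self.ids or len(self.ids) != len(self.scores):
            raise ValueError("ids and scores must be non-empty and equal length")
        best = max(range(len(self.scores)), key=self.scores.__getitem__)
        return self.ids[best]


@dataclass
class Classification2D:
    """Hypothetical: holds the shared type instead of its own ids/scores."""
    results: CategoricalDistribution


@dataclass
class Detection2D:
    """Hypothetical: the same distribution type reused next to a box."""
    results: CategoricalDistribution
    bbox: Tuple[int, int, int, int]  # (x, y, width, height), illustration only


dist = CategoricalDistribution(ids=[3, 7, 12], scores=[0.1, 0.7, 0.2])
classification = Classification2D(results=dist)
detection = Detection2D(results=dist, bbox=(10, 20, 64, 48))
print(classification.results.most_likely())
```

The user-facing code for reading the top label is the same for both types, which is the small reduction in complexity mentioned above.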

This topic’s header is ‘Computer Vision Message Standards’, however the discussion seems to focus on classification. Did I misunderstand the point, or is the title not appropriate?

It’s not clear to me if the proposal supports per-pixel segmentations. There is the source field of Classification2D and Classification3D which might be used for segmentations, but from the documentation I’m not entirely sure if it is also meant for segmentations or not. The name source is a bit confusing to me.

There might also be other types of detection besides bounding boxes and segmentation that I’m not currently thinking of, though the two seem like a pretty solid base for now.



# ROS parameter name where the metadata database is stored in XML format.
# The exact information stored in the database is left up to the user.
string database_param

Why XML? Why not YAML or JSON, or completely implementation defined? Or what about the name of a tree of parameters on the ROS parameter server?

It indeed says computer vision and not classification/detection only. My bad.
What kind of messages would be needed for tasks beyond those 2?

No problem @Loy, it’s just that I have a specific use case in mind. It is as follows:
Local features (a point and a descriptor -> SIFT/SURF etc) are one of the basic components of CV and are used for geometry algos as well as for appearance-based algos. In feature-based Visual-SLAM (e.g. ORB-SLAM) you rely on features both for pose estimation (geometry) and place recognition (appearance). Those two tasks can be executed in parallel threads. Assuming you are using the same features for both tasks, one could communicate a Features.msg (or such) to the other.
It is something (a Features.msg) I have been hackily doing here and there, feeding different classifiers - different processes for that matter.
I am just wondering here if a standardized way of moving such objects around would not make sense?

PS: To be fair, local features are also extracted from other sensor readings (e.g. laser scan, point cloud), so my question may be a little out of the scope of this thread.
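As a rough illustration of the "extract once, consume twice" idea, here is a minimal Python sketch. The `Features` class, its fields, and both consumer functions are hypothetical; no such message exists in the proposal, and real descriptors vary a lot in type and length.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Features:
    """Hypothetical container: keypoints plus one descriptor per keypoint."""
    descriptor_type: str  # free-form label, e.g. "ORB"
    keypoints: List[Tuple[float, float]] = field(default_factory=list)
    descriptors: List[bytes] = field(default_factory=list)

    def add(self, xy: Tuple[float, float], descriptor: bytes) -> None:
        """Append a keypoint and its descriptor, keeping both lists aligned."""
        self.keypoints.append(xy)
        self.descriptors.append(descriptor)


def geometry_consumer(msg: Features) -> List[Tuple[float, float]]:
    """Pose-estimation side: only needs the keypoint locations."""
    return msg.keypoints


def appearance_consumer(msg: Features) -> List[bytes]:
    """Place-recognition side: only needs the descriptors."""
    return msg.descriptors


# Extract once (values are made up), then hand the same object to both consumers.
features = Features(descriptor_type="ORB")
features.add((10.0, 20.0), b"\x01\x02")
features.add((30.5, 40.5), b"\x03\x04")

print(len(geometry_consumer(features)), len(appearance_consumer(features)))
```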

Thanks for the awesome feedback, everyone! I’ll try to address everything that was brought up.

First, let me start off by noting that although I only created Classification and Detection messages, I think it makes sense to keep this as a general vision_msgs package, and additional computer vision-related messages can be added as time goes on. I think it’s more useful than making a too-specific classification_msgs or similar.

@reinzor, thanks for linking to your message definitions! I think that annotations are already covered under the existing implementation. You could provide the bounding box coordinates in a Detection message, and the most likely label as the only result in the class probabilities. If we want to add other information, such as color of the outline, etc. then maybe this would be a better fit for visualization_msgs or another package.

On another note, is human pose estimation standardized enough to make a custom message type for it? Or is it best described by a TF tree, arbitrary set of geometry_msgs/Pose, or some other existing ROS construct? I’m thinking of the fact that different human detectors provide different levels of fidelity, so it might be difficult to standardize.

@ruffsl, My idea with having two poses is that the bounding box could actually have a different pose from the expressed object pose. For example, the bounding box center for a coffee mug might have some z-height and be off-center wrt the body of the mug, but the expressed object pose might be centered on the cylindrical portion of the mug and be at the bottom. However, maybe it makes sense to forego the bounding box information, as this could be stored in the object metadata, along with a mesh, etc.
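The mug example can be put into numbers with a small Python sketch. The dimensions are made up, orientation is left out, and the class names only loosely mirror the message names; treat all of it as an illustrative assumption rather than the actual message layout.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class BoundingBox3D:
    center: Vec3  # stands in for bbox.pose (orientation omitted)
    size: Vec3    # extent along x, y, z in metres


@dataclass
class Detection3D:
    pose: Vec3            # expressed object pose (position only here)
    bbox: BoundingBox3D


def bbox_from_extents(lo: Vec3, hi: Vec3) -> BoundingBox3D:
    """Build an axis-aligned box from its minimum and maximum corners."""
    center = tuple((a + b) / 2.0 for a, b in zip(lo, hi))
    size = tuple(b - a for a, b in zip(lo, hi))
    return BoundingBox3D(center=center, size=size)


# Made-up mug: cylinder of radius 0.04 m and height 0.10 m,
# with a handle sticking out 0.03 m along +x.
mug = Detection3D(
    pose=(0.0, 0.0, 0.0),  # bottom centre of the cylindrical body
    bbox=bbox_from_extents(lo=(-0.04, -0.04, 0.0), hi=(0.07, 0.04, 0.10)),
)

# The box centre is shifted toward the handle and sits at half height,
# so it does not coincide with the expressed object pose.
print(mug.pose, mug.bbox.center)
assert not math.isclose(mug.bbox.center[0], mug.pose[0])
```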

On the topic of nesting, I’m open to the idea of flattening the hierarchy and having Classification/Detection 2D/3D all include a new CategoryDistribution message. I’m not sure how much message nesting is considered standard practice, so I’ll look at some other packages to get an idea.

@Jeremie, I like the idea of adding a standardized VisualFeature.msg or other similar message, as long as there is some common baseline that can cover a lot of feature types. From my own understanding of visual features, there’s usually a lot of variation in how the feature is actually defined and represented, so I’m not able to find a “lowest common denominator” from my own experience. If you feel there’s something there that could be broadly useful, please feel free to post it here or make a pull request. I agree with @Loy as well: although many classifiers use features internally, this should be hidden in the implementation except in special cases like the SLAM case described.

I didn’t design the current messages to support per-pixel segmentation, and I’ll have to look into how that is usually represented to get a good idea of how to craft a message for it. My initial guess is that it will be a separate message type from Classification and Detection.

On the topic of the parameter server, I think it’s worth having a discussion about representation format. From talks with other OSRF folks, I don’t think it’s a good idea to use a tree of parameters; a single parameter would be better. For example, if you are loading the ImageNet class names, that’s 1000 items on the parameter server, just to store the names. Add object meshes, sizes, etc., and it could balloon very quickly.

While JSON/XML/YAML might work equally well in terms of expressive power, with XML, we can be sure that both C++ and Python will have the ability to read the database. TinyXML is already included as a low-level dependency in ROS C++, but the same can’t be said for a YAML or JSON parser. Rather than allow people to use whatever’s convenient, I think it’s worth it to restrict/recommend everyone to use a format that can be parsed from more languages. We could do it in the REP, but not enforce it, so if someone really wants to use YAML in their Python-only implementation, they could do so. That’s my position, but I’m interested in hearing other ideas.
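For what it’s worth, reading such a database from Python needs only the standard library. The sketch below assumes a made-up XML layout (`<database>` with `<class id=… name=…/>` entries); the proposal leaves the actual schema up to the user, so the element and attribute names are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical database contents. In a ROS 1 node this string would be read
# from the parameter named by database_param, e.g. via rospy.get_param().
DATABASE_XML = """
<database>
  <class id="0" name="apple"/>
  <class id="1" name="banana"/>
  <class id="2" name="pear"/>
</database>
"""


def load_class_names(xml_text: str) -> dict:
    """Parse the (assumed) schema into a {class id: class name} mapping."""
    root = ET.fromstring(xml_text.strip())
    return {int(e.get("id")): e.get("name") for e in root.iter("class")}


names = load_class_names(DATABASE_XML)
print(names[1])
```

The whole database lives in one string parameter, so the parameter server holds a single entry regardless of how many classes are listed.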

I’ve updated the repository with the changes discussed above: I added CategoryDistribution and flattened the message hierarchy accordingly.

On the topic of dense pixel segmentation, is there a reason that sensor_msgs/Image is inadequate?

In regard to pixel labeling, I’ve also seen sensor_msgs/Image used as a means to publish, along with some custom structure to define the mapping between pixel values and labels on a separate topic. It would be cool to also have a message type to publish an array of convex bounding polygon vertices with label IDs. That’s a common use case when labeling regions of an image, and would be a good compressed representation to transmit instead for classification modalities that utilise that format.

For pixel-based segmentation, I imagine a message PixelSegmentation.msg like

CategoricalDistribution[] results
sensor_msgs/Image mask

where the pixel value of each pixel in the mask corresponds to an index in the results-array.

This looks like it would be a clean implementation. Just to be sure (since I’m a segmentation newbie), the size of results would be the number of pixels in the mask? There’s a distribution for each pixel?

I am not an expert in this field, but I would like to clarify whether the classification and detection messages are specific to 2D image processing, or if they can also be used for 3D point cloud processing or even 2D laser scan processing. If there is a possibility that they may be used outside of image processing, then perhaps a classification_msgs or similar package actually is appropriate. Just something I think should be considered.

A CategoricalDistribution for every unique pixel value. E.g. pixel [45, 89] has value 21. That means its labeling can be found at

results[21]

That CategoricalDistribution might say, e.g., that the pixel is either a pear or a banana, with corresponding probabilities in the distribution.

Pixel [45, 90] also has value 21, referring to the same CategoricalDistribution, though pixel [56,94] has value 2 so refers to results[2], which says that pixel is most likely an apple etc.

The maximum pixel value in the mask, plus 1, corresponds to the length of the results array.
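A minimal Python sketch of that lookup, using plain lists in place of sensor_msgs/Image and dicts in place of CategoricalDistribution. The tiny 2x3 mask and the probabilities are made up for illustration.

```python
# Each entry stands in for one CategoricalDistribution (label -> probability).
results = [
    {"apple": 0.9, "pear": 0.1},    # index 0
    {"pear": 0.6, "banana": 0.4},   # index 1
    {"background": 1.0},            # index 2
]

# Stand-in for the mask image: each pixel value is an index into results.
mask = [
    [2, 2, 0],
    [2, 1, 0],
]


def distribution_at(mask, results, row, col):
    """Return the distribution that the pixel at (row, col) points to."""
    return results[mask[row][col]]


# Pixels sharing a value share one distribution object.
assert distribution_at(mask, results, 0, 2) is distribution_at(mask, results, 1, 2)

# Largest pixel value + 1 equals the number of distributions,
# provided the highest index is actually used in the mask.
assert max(max(row) for row in mask) + 1 == len(results)

print(distribution_at(mask, results, 1, 1))
```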

Is it likely that many pixels in the image will have identical distributions? It seems that “apple” pixels near the edge of the apple would have a different probability distribution than those near the center. All the ML-based segmentation systems I’ve seen either predict a single output class for a pixel (such as a binary classifier), or they produce probability vector

It seems like a bit of a halfway solution to define a small set of distributions that the image uses as an index, then transmit that set with every result. I feel that these two options would work based on use case:

  1. The image is segmented into some small finite set of output classes, which do not have probability distributions that vary in space/time: use an Image message where the lookup value of the pixel is the output class. If desired, static probability distributions for each class can be communicated in a one-time fashion, such as via a single CategoryDistribution[] message, or via the parameter server.

  2. The output segmentation includes varying probability distributions that are calculated per-pixel or per-small region: use a CategoryDistribution[] array whose length equals the number of pixels in the image, where each pixel has its own unique distribution that may change every frame.
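To show how the two options relate, here is a small Python sketch that reduces a per-pixel probability output (option 2) to a class-index image (option 1) by taking the most likely class at each pixel. The class list and the 2x2 probabilities are made up, and nested lists stand in for the actual messages.

```python
CLASS_NAMES = ["background", "apple", "banana"]  # hypothetical label set

# Option 2 style payload: one probability vector per pixel (2x2 image).
per_pixel = [
    [[0.80, 0.10, 0.10], [0.20, 0.70, 0.10]],
    [[0.10, 0.30, 0.60], [0.05, 0.90, 0.05]],
]


def to_label_image(probabilities):
    """Option 1 style payload: keep only the most likely class index per pixel."""
    return [
        [max(range(len(p)), key=p.__getitem__) for p in row]
        for row in probabilities
    ]


label_image = to_label_image(per_pixel)
print(label_image)

# Payload size per frame: one value per pixel versus one value per pixel per class.
values_option_1 = sum(len(row) for row in label_image)
values_option_2 = sum(len(p) for row in per_pixel for p in row)
print(values_option_1, values_option_2)
```

The label image is much smaller but drops the per-pixel uncertainty, which is the trade-off between the two cases.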

Let me know if I missed something! If you have some code available for a use case, that’s really helpful. I’m currently in the process of writing example classifiers to use the Classification/Detection messages and finding it a useful exercise.

3D point cloud processing generally falls under the topic of “computer vision.” But I had not considered laser scan processing, good point. The package name will probably be subject to review from more senior OSRF architects, and we’ll keep that in mind!