
Proposal - New Computer Vision Message Standards

No problem @Loy, it’s just that I have a specific use case in mind. It is as follows:
Local features (a point and a descriptor, e.g. SIFT/SURF) are one of the basic components of CV and are used for geometry algorithms as well as for appearance-based algorithms. In feature-based visual SLAM (e.g. ORB-SLAM) you rely on features both for pose estimation (geometry) and for place recognition (appearance). Those two tasks can be executed in parallel threads. Assuming you are using the same features for both tasks, one thread could communicate a Features.msg (or similar) to the other.
It is something (a Features.msg) I have been hackily doing here and there, feeding different classifiers, or different processes for that matter.
I am just wondering whether a standardized way of moving such objects around would make sense.

PS: To be fair, local features are also extracted from other sensor readings (e.g. laser scans, point clouds), so my question may be a little outside the scope of this thread.
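
To make the idea concrete, a Features message could be mirrored in Python roughly as below. This is only a sketch: `Feature`, `Features`, and every field name are hypothetical and not part of any existing ROS package, and storing the descriptor as a list of floats is a simplification (binary descriptors such as ORB would need a different encoding).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    """One local feature: an image location plus its descriptor (hypothetical layout)."""
    x: float                 # keypoint column, in pixels
    y: float                 # keypoint row, in pixels
    descriptor: List[float]  # length depends on the feature type


@dataclass
class Features:
    """A batch of features extracted from one image (hypothetical layout)."""
    frame_id: str
    stamp: float             # acquisition time of the source image, in seconds
    descriptor_type: str     # free-form tag such as "SIFT" or "SURF"
    features: List[Feature] = field(default_factory=list)


def keypoints(msg: Features):
    """Geometry consumers (e.g. pose estimation) only need the point locations."""
    return [(f.x, f.y) for f in msg.features]


def descriptors(msg: Features):
    """Appearance consumers (e.g. place recognition) only need the descriptors."""
    return [f.descriptor for f in msg.features]


msg = Features("camera", 0.0, "SIFT", [Feature(10.0, 20.0, [0.1, 0.2])])
```

Both consumers read the same message, so the features only have to be extracted once.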


Thanks for the awesome feedback, everyone! I’ll try to address everything that was brought up.

First, let me start off by noting that although I only created Classification and Detection messages, I think it makes sense to keep this as a general vision_msgs package, and additional computer vision-related messages can be added as time goes on. I think it’s more useful than making a too-specific classification_msgs or similar.

@reinzor, thanks for linking to your message definitions! I think that annotations are already covered under the existing implementation. You could provide the bounding box coordinates in a Detection message, and the most likely label as the only result in the class probabilities. If we want to add other information, such as color of the outline, etc. then maybe this would be a better fit for visualization_msgs or another package.

On another note, is human pose estimation standardized enough to make a custom message type for it? Or is it best described by a TF tree, arbitrary set of geometry_msgs/Pose, or some other existing ROS construct? I’m thinking of the fact that different human detectors provide different levels of fidelity, so it might be difficult to standardize.

@ruffsl, My idea with having two poses is that the bounding box could actually have a different pose from the expressed object pose. For example, the bounding box center for a coffee mug might have some z-height and be off-center wrt the body of the mug, but the expressed object pose might be centered on the cylindrical portion of the mug and be at the bottom. However, maybe it makes sense to forego the bounding box information, as this could be stored in the object metadata, along with a mesh, etc.

On the topic of nesting, I’m open to the idea of flattening the hierarchy and having Classification/Detection 2D/3D all include a new CategoryDistribution message. I’m not sure how much message nesting is considered standard practice, so I’ll look at some other packages to get an idea.

@Jeremie, I like the idea to add a standardized VisualFeature.msg or other similar message, as long as there is some common baseline that can cover a lot of feature types. From my own understanding of visual features, there’s usually a lot of variation in how the feature is actually defined and represented, so I’m not able to find a “lowest common denominator” from my own experience. If you feel there’s something there that could be broadly useful, please feel free to post it here or make a pull request. I agree with @Loy as well, although many classifiers use features internally, this should be hidden in the implementation except in special cases like the SLAM case described.

I didn’t design the current messages to support per-pixel segmentation, and I’ll have to look into how that is usually represented to get a good idea of how to craft a message for it. My initial guess is that it will be a separate message type from Classification and Detection.

On the topic of the parameter server, I think it’s worth having a discussion about representation format. From talks with other OSRF folks, I don’t think it’s a good idea to use a tree of parameters; a single parameter would be better. For example, if you are loading the ImageNet class names, that’s 1000 items on the parameter server, just to store the names. Add object meshes, sizes, etc., and it could balloon very quickly.

While JSON/XML/YAML might work equally well in terms of expressive power, with XML, we can be sure that both C++ and Python will have the ability to read the database. TinyXML is already included as a low-level dependency in ROS C++, but the same can’t be said for a YAML or JSON parser. Rather than allow people to use whatever’s convenient, I think it’s worth it to restrict/recommend everyone to use a format that can be parsed from more languages. We could do it in the REP, but not enforce it, so if someone really wants to use YAML in their Python-only implementation, they could do so. That’s my position, but I’m interested in hearing other ideas.
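
As a rough illustration of the single-parameter idea, the sketch below packs a class list into one XML string and reads it back with Python's standard library. The element and attribute names (`object_database`, `object`, `id`, `name`) are assumptions made up for this example, not a proposed schema.

```python
import xml.etree.ElementTree as ET


def build_class_database(class_names):
    """Serialize class names into one XML string (hypothetical schema)."""
    root = ET.Element("object_database")
    for class_id, name in enumerate(class_names):
        ET.SubElement(root, "object", id=str(class_id), name=name)
    return ET.tostring(root, encoding="unicode")


def parse_class_database(xml_text):
    """Recover the id -> name mapping from the XML string."""
    root = ET.fromstring(xml_text)
    return {int(obj.get("id")): obj.get("name") for obj in root.findall("object")}


# In a ROS node the whole string would be stored under a single parameter
# instead of one parameter per class; that call is left out here so the
# sketch runs without ROS installed.
database_xml = build_class_database(["apple", "banana", "pear"])
id_to_name = parse_class_database(database_xml)
```

Extra per-object data (mesh path, size, and so on) could be added as further attributes or child elements without increasing the number of parameters.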

I’ve updated the repository with the changes discussed above: I added CategoryDistribution and flattened the message hierarchy accordingly.

On the topic of dense pixel segmentation, is there a reason that sensor_msgs/Image is inadequate?

In regard to pixel labeling, I’ve also seen sensor_msgs/Image used as a means to publish, along with some custom structure to define the mapping between pixel values and labels on a separate topic. It would be cool to also have a message type to publish an array of convex bounding polygon vertices with label IDs. That’s a common use case when labeling regions of an image, and it would be a good compressed representation to transmit for classification modalities that utilise that format.

For pixel-based segmentation, I imagine a message PixelSegmentation.msg like

CategoricalDistribution[] results
sensor_msgs/Image mask

where the pixel value of each pixel in the mask corresponds to an index in the results-array.

This looks like it would be a clean implementation. Just to be sure (since I’m a segmentation newbie), the size of results would be the number of pixels in the mask? There’s a distribution for each pixel?

I am not an expert in this field, but I would like to clarify whether the classification and detection messages are specific to 2D image processing, or if they can also be used for 3D point cloud processing or even 2D laser scan processing. If there is a possibility that they may be used outside of image processing, then perhaps a classification_msgs or similar package actually is appropriate. Just something I think should be considered.

A CategoricalDistribution for every unique pixel value. E.g. pixel [45, 89] has value 21. That means its labeling can be found at results[21].


That CategoricalDistribution might say, for example, that the pixel is either a pear or a banana, with the corresponding probabilities in the distribution.

Pixel [45, 90] also has value 21, referring to the same CategoricalDistribution, though pixel [56,94] has value 2 so refers to results[2], which says that pixel is most likely an apple etc.

The max value of the image +1 corresponds to the length of the .results-array.
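
The lookup described above can be sketched in plain Python. The nested list stands in for the sensor_msgs/Image mask and each dict stands in for one CategoricalDistribution; the labels, probabilities, and helper names are made up for illustration.

```python
# Each entry stands in for one CategoricalDistribution: label -> probability.
results = [
    {"background": 1.0},
    {"pear": 0.6, "banana": 0.4},
    {"apple": 0.9, "pear": 0.1},
]

# Stand-in for the mask image: mask[row][col] is an index into results.
mask = [
    [0, 0, 1],
    [0, 2, 1],
]


def distribution_at(row, col):
    """Return the distribution that the pixel value points to."""
    return results[mask[row][col]]


def most_likely_label(row, col):
    """Return the label with the highest probability for this pixel."""
    dist = distribution_at(row, col)
    return max(dist, key=dist.get)


# The largest pixel value plus one gives the length of the results array.
max_index = max(max(row) for row in mask)
```

Pixels that share a value share the same distribution object, which is what keeps the message small.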

Is it likely that many pixels in the image will have identical distributions? It seems that “apple” pixels near the edge of the apple would have a different probability distribution than those near the center. All the ML-based segmentation systems I’ve seen either predict a single output class for a pixel (such as a binary classifier), or they produce a probability vector per pixel.

It seems like a bit of a halfway solution to define a small set of distributions that the image uses as an index, then transmit that set with every result. I feel that these two options would work based on use case:

  1. The image is segmented into some small finite set of output classes, which do not have probability distributions that vary in space/time: use an Image message where the value of each pixel is the output class. If desired, static probability distributions for each class can be communicated in a one-time fashion, such as via a single CategoryDistribution[] message, or via the parameter server.

  2. The output segmentation includes varying probability distributions that are calculated per pixel or per small region: use a CategoryDistribution[] array with one entry per pixel of the image, where each pixel has its own unique distribution that may change every frame.
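
The two options are related: a per-pixel output (option 2) can be collapsed into a label image (option 1) by taking the most probable class at each pixel. A minimal sketch, assuming row-major storage and a made-up three-class list:

```python
CLASS_NAMES = ["background", "apple", "banana"]  # hypothetical class list

# Option 2: one probability vector per pixel, stored row-major.
width, height = 2, 2
per_pixel = [
    [0.8, 0.1, 0.1], [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3], [0.1, 0.2, 0.7],
]


def to_label_image(distributions, width, height):
    """Collapse per-pixel distributions into an option-1 style label image."""
    if len(distributions) != width * height:
        raise ValueError("need exactly one distribution per pixel")
    # Index of the largest probability in each vector (argmax).
    labels = [max(range(len(d)), key=d.__getitem__) for d in distributions]
    return [labels[r * width:(r + 1) * width] for r in range(height)]


label_image = to_label_image(per_pixel, width, height)
```

The conversion is lossy, which is why option 2 is still needed when downstream nodes care about the confidence at each pixel.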

Let me know if I missed something! If you have some code available for a use case, that’s really helpful. I’m currently in the process of writing example classifiers to use the Classification/Detection messages and finding it a useful exercise.

3D point cloud processing generally falls under the topic of “computer vision.” But I had not considered laser scan processing, good point. The package name will probably be subject to review from more senior OSRF architects, and we’ll keep that in mind!

First off: Thanks for this proposal! A standardized set of vision messages has been sorely missing for years.

Regarding Detection3D:

I strongly believe we need a separate Pose for each object hypothesis. For example, when meshes are used to represent the object classes, the Pose specifies the pose of the mesh’s reference frame in the Detection3D.header.frame_id frame. For example, the reference frame of the following mesh is at the bottom of the mug and in the center of the mug’s round part, not at the center of the mesh’s bounding box:

Without a Pose for each object class, we cannot express “this object could be either a laptop in its usual orientation, or a book lying flat (i.e., rotated by 90° if your mesh is of a book standing upright)”.

My proposal would be to either include an array of Poses in a 3D-specific CategoryDistribution message, or (since we now have 3 arrays that must be the same size) as an array of ObjectHypothesis messages (or whatever we want to call it) that would have one id, score and Pose each.
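
A sketch of the second variant, with one id, score, and pose per hypothesis instead of three parallel arrays. The `Pose` class is only a stand-in for geometry_msgs/Pose, and the ids, scores, and poses are invented for the laptop-or-book example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pose:
    """Stand-in for geometry_msgs/Pose."""
    position: Tuple[float, float, float]
    orientation: Tuple[float, float, float, float]  # quaternion (x, y, z, w)


@dataclass
class ObjectHypothesis:
    id: int
    score: float
    pose: Pose


def best_hypothesis(hypotheses: List[ObjectHypothesis]) -> ObjectHypothesis:
    """Return the hypothesis with the highest score."""
    return max(hypotheses, key=lambda h: h.score)


IDENTITY = (0.0, 0.0, 0.0, 1.0)
ROT_90_ABOUT_X = (0.7071068, 0.0, 0.0, 0.7071068)

hypotheses = [
    # id 1: laptop in its usual orientation
    ObjectHypothesis(id=1, score=0.55, pose=Pose((1.0, 0.0, 0.5), IDENTITY)),
    # id 2: book lying flat, i.e. the upright mesh rotated by 90 degrees
    ObjectHypothesis(id=2, score=0.45, pose=Pose((1.0, 0.0, 0.5), ROT_90_ABOUT_X)),
]
```

Because each hypothesis carries its own pose, the arrays cannot get out of sync in length.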

I was also sorry to see that BoundingBox3D was removed. (This was meant to represent a bounding box of the points in source_cloud, right?) I’ve always included this in my own message definitions, and I’ve found it extremely useful.

On the other hand, this information can be re-computed from the source_cloud, so I can live with that (although it’s a bit wasteful). Also, other people might prefer to use a shape_msgs/Mesh bounding_mesh, like in object_recognition_msgs/RecognizedObject, or something completely different, and it would overcomplicate the message if we’d include all possible kinds of extra information.

You’re right, when you have a CategoryDistribution, it is likely to change per pixel.

For most applications I’ve seen, a single label per pixel is fine. So then:

string[] results
sensor_msgs/Image mask

What do you think about unifying BoundingBox2D.msg and BoundingBox3D.msg? We could use three fields (or a vector) for the bounding box size; for a 2D bounding box, the Z field would be zero. I think we would gain in simplicity, in the same way that the pose message (geometry_msgs/Pose) is generic and serves both 2D and 3D poses, depending on what is set in Z.


Something like this, for both 2D and 3D bounding boxes:

# position and rotation of the bounding box
geometry_msgs/Pose pose

# size of the bounding box
geometry_msgs/Point size
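
A consumer of such a unified message might tell the two cases apart as sketched below. The classes are plain-Python stand-ins (orientation is omitted for brevity), and `is_2d` and `extent` are hypothetical helpers, not part of any proposed message.

```python
from dataclasses import dataclass


@dataclass
class Point:
    """Stand-in for geometry_msgs/Point."""
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0


@dataclass
class BoundingBox:
    center: Point  # position part of the pose
    size: Point    # extent along each axis; z == 0 marks a 2D box

    def is_2d(self) -> bool:
        return self.size.z == 0.0

    def extent(self) -> float:
        """Area for a 2D box, volume for a 3D box."""
        if self.is_2d():
            return self.size.x * self.size.y
        return self.size.x * self.size.y * self.size.z


box_2d = BoundingBox(Point(50.0, 40.0), Point(20.0, 10.0))
box_3d = BoundingBox(Point(1.0, 0.0, 0.5), Point(0.2, 0.1, 0.5))
```

One trade-off of this convention is that a degenerate 3D box with zero height cannot be distinguished from a 2D box.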

EDIT: I created a PR for these proposals:

In case somebody is still following this thread: The discussion seems to have moved to the GitHub repository. I’ve just discovered that my proposed changes were silently accepted! :tada:


I am forwarding a comment from a colleague, Jordi Pages:

In general, the messages included so far look pretty good IMHO.

We could point to our own vision messages in so you may have some new ideas.

For example, we could check how some specific detections like faces (along with person name/ID, emotions and facial expressions), actions, planar objects or legs (even though the latter is not really vision-based detection but laser-based) can be encoded in their messages or whether some additional fields might be required.

You include in some messages an optional field of type sensor_msgs/Image, I suppose for debugging, for monitoring which part of the input image has been processed, or for post-processing purposes. Depending on the main purpose, you might consider using a sensor_msgs/CompressedImage instead (or both) to avoid penalizing remote subscribers, as the frequency of the topic might drop for them. Including an image is not strictly necessary: if the timestamp of the Header is set equal to the timestamp of the input image, a TimeSynchronizer can later be used to recover the same information using the BoundingBox2D. But I also like having it here, as it saves the burden of using a TimeSynchronizer…
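
The timestamp-matching idea can be illustrated without ROS. The sketch below is not the message_filters API; `ImageCache` and `crop` are hypothetical helpers, timestamps are modelled as (secs, nsecs) tuples, and images as lists of rows.

```python
class ImageCache:
    """Keeps recent images keyed by their header timestamp (hypothetical helper)."""

    def __init__(self, max_size=30):
        self.max_size = max_size
        self._images = {}  # stamp -> image; dicts preserve insertion order

    def add(self, stamp, image):
        self._images[stamp] = image
        # Drop the oldest entries once the cache is full.
        while len(self._images) > self.max_size:
            oldest = next(iter(self._images))
            del self._images[oldest]

    def match(self, detection_stamp):
        """Return the image with exactly this stamp, or None if it was dropped."""
        return self._images.get(detection_stamp)


def crop(image, x, y, width, height):
    """Cut the bounding-box region out of an image stored as a list of rows."""
    return [row[x:x + width] for row in image[y:y + height]]


cache = ImageCache(max_size=2)
cache.add((1, 0), [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
cache.add((2, 0), [[9, 8, 7], [6, 5, 4], [3, 2, 1]])

# A detection stamped (2, 0) with a 2x2 box whose top-left corner is at x=1, y=0.
patch = crop(cache.match((2, 0)), x=1, y=0, width=2, height=2)
```

This only works if the detector copies the input image's stamp into the detection header unchanged, which is the precondition stated above.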

An additional field I found useful in some cases is a geometry_msgs/TransformStamped expressing the pose of the camera in the moment that the processed image was acquired.


I’ve been asked to provide overall feedback on the proposal. Please find my review below.

In general, I find the effort very valuable and useful. However, I’d like to add some comments on the proposal:


  • Do we need to agree on where the origin of the BB is (top-left, center, …)? If so, say it explicitly in the message comments.

  • It’s not rigorous to give a pose (point + orientation) to something called “origin”, which implicitly is just a point. Maybe calling it “pose” or “origin_pose” would be more adequate.

  • size in 2D is implemented with two ints, while in 3D it is a Vector3 of doubles. I’m trying to imagine whether there is some situation where a float 2D BB is required (subpixel approaches). Just a heads-up on that.


  • Sometimes detectors also provide uncertainty in the pose space of the detection. Providing just a BB for the spatial data of a detection leaves no way to convey this valuable information, especially for fusion (e.g. tracking) algorithms that need to work with pose-space uncertainty.

Lack of Services

  • Consider whether it would be useful to add some services to the package, based mainly on the proposed messages. This would allow detectors to work in a client-server mode, with customizable requests.

Best Regards, and thanks again for the effort!

+1. All the message comments should be much more explicit anyway IMO.

If we use rotated bounding boxes with float width + height, the origin should be the center (see below). If we don’t have rotation and use a uint32 width + height, the origin should be the upper left corner, otherwise we can’t represent even-sized bounding boxes properly (the center would need to be at a “*.5” position).
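
The “*.5” problem is easy to check with a little arithmetic. The sketch assumes integer pixel indices, so a box with top-left corner x and width w covers columns x through x + w - 1; the function names are made up for this example.

```python
def center_from_top_left(x, y, width, height):
    """Center of an axis-aligned box covering pixels x..x+width-1 and y..y+height-1."""
    return (x + (width - 1) / 2.0, y + (height - 1) / 2.0)


def has_integer_center(width, height):
    """True only if both dimensions are odd, so the center falls on a whole pixel."""
    return width % 2 == 1 and height % 2 == 1


odd_box = center_from_top_left(10, 20, 5, 3)   # both sizes odd: center on a whole pixel
even_box = center_from_top_left(10, 20, 4, 2)  # both sizes even: center between pixels
```

With an integer center field, `even_box` could not be stored exactly, whereas the top-left corner is an integer for every box size.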

The 2D Bounding Box format is currently being discussed in this PR; everyone, please feel free to contribute to that discussion.

The pose (in the current proposal) is actually meant to have an orientation; otherwise we shouldn’t be using a Pose2D but a point, as you say. I’m unsure whether it’s a good or a bad idea to have rotated bounding boxes and subpixel (float) resolution, see the discussion in the PR. Please chime in.


Good catch on including pose uncertainty information. I’ll update the 3D poses to use a geometry_msgs/PoseWithCovariance.
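
For reference, geometry_msgs/PoseWithCovariance stores the covariance as a row-major 6x6 matrix in a flat array of 36 floats, ordered (x, y, z, rotation about X, rotation about Y, rotation about Z). A minimal sketch of filling it for a detector that only reports independent per-axis standard deviations; the helper name and the numbers are illustrative assumptions.

```python
def diagonal_covariance(sigmas):
    """Build a row-major 6x6 covariance with independent axes.

    sigmas: standard deviations for (x, y, z, rot_x, rot_y, rot_z).
    """
    if len(sigmas) != 6:
        raise ValueError("expected six standard deviations")
    covariance = [0.0] * 36
    for i, sigma in enumerate(sigmas):
        covariance[i * 6 + i] = sigma ** 2  # variance on the diagonal
    return covariance


# Positions in metres, rotations in radians (made-up values).
cov = diagonal_covariance([0.05, 0.05, 0.10, 0.2, 0.2, 0.5])
```

A tracker that fuses several detections can then weight each one by its covariance instead of treating all poses as equally reliable.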
