
New robust pre-trained semantic segmentation models [CVPR '20]

Hello ROS Community,

We are very excited to announce our recent work on multi-domain semantic segmentation, entitled MSeg, which we presented at CVPR last month. You can try out one of our models in Google Colab on your own images here. We trust these models will be useful to roboticists – please watch our YouTube video to see the results of our single MSeg model in dozens of different environments – indoors, driving, in crowds, mounted on a drone, and more.

We’ve trained semantic segmentation models on MSeg, a single, composite dataset that unifies 7 of the largest semantic segmentation datasets (COCO+COCO Stuff, ADE20K, Mapillary Vistas, Indian Driving Dataset, Berkeley Driving Dataset, Cityscapes, and SUN RGB-D). Since they all utilize different taxonomies, we reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images, requiring more than 1.34 years of collective annotator effort.
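As a rough illustration of what that relabeling involves (with made-up label IDs for illustration only – the real dataset-to-unified mappings are released with MSeg, not the numbers below), remapping a per-pixel mask into a unified taxonomy looks something like:

```python
import numpy as np

# Hypothetical label IDs -- the actual MSeg taxonomy mapping files
# define the real dataset->unified correspondences.
CITYSCAPES_TO_UNIFIED = {0: 7, 1: 3, 2: 12}
IGNORE_LABEL = 255  # pixels with no unified counterpart are excluded from the loss

def remap_mask(mask, mapping):
    """Relabel a per-pixel annotation mask into the unified taxonomy."""
    out = np.full_like(mask, IGNORE_LABEL)
    for src_id, unified_id in mapping.items():
        out[mask == src_id] = unified_id
    return out

mask = np.array([[0, 1],
                 [2, 9]])       # class 9 has no entry in the mapping
unified = remap_mask(mask, CITYSCAPES_TO_UNIFIED)  # [[7, 3], [12, 255]]
```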

While models in the literature are often trained and tested on data from the same domain/distribution, we present a different approach to evaluating a model’s robustness – zero-shot cross-dataset generalization – that we believe is more aligned with real-world performance. We test our model on a suite of held-out datasets – PASCAL VOC, PASCAL Context, Camvid, WildDash, KITTI, ScanNet – and find our model can function effectively across domains and even generalizes to datasets that were not seen during training. Training on MSeg yields substantially more robust models than training on any individual dataset or naive mixing of datasets. Our approach also outperforms recent state-of-the-art multi-task learning and domain generalization algorithms (see our paper for more details).
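For anyone reproducing this style of cross-dataset evaluation: the mIoU metric behind these comparisons can be computed from a confusion matrix. A minimal sketch (not the evaluation code from our repo):

```python
import numpy as np

def compute_miou(pred, gt, num_classes, ignore_label=255):
    """Mean intersection-over-union from flat prediction / ground-truth label arrays."""
    valid = gt != ignore_label
    pred, gt = pred[valid], gt[valid]
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(gt * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    present = union > 0                      # skip classes absent from both pred and gt
    return (inter[present] / union[present]).mean()

pred = np.array([0, 0, 1, 1])
gt   = np.array([0, 1, 1, 1])
miou = compute_miou(pred, gt, num_classes=2)  # (1/2 + 2/3) / 2 ≈ 0.583
```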

For those of you with interest in autonomous driving, a model trained on MSeg ranks first on the WildDash leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.

A few links:

  • Read our paper: PDF link
  • Download the dataset here.
  • Download pre-trained semantic models, demo scripts, and evaluation code here.
  • Watch our teaser video.

Thank you!

John Lambert*, Zhuang Liu*, Ozan Sener, James Hays, and Vladlen Koltun. MSeg: A Composite Dataset for Multi-domain Semantic Segmentation. Computer Vision and Pattern Recognition (CVPR), 2020.

*Indicates equal contribution.


Whoa! This got buried over the weekend. This looks like absolutely fantastic work.

I just skimmed the paper briefly. Do you plan to publish controlled hardware performance benchmarks, so one can trade off accuracy against computational performance?



Thanks for sharing this! It’s interesting timing because we’re looking at how we might use semantic information in the navigation2 stack for ROS2. We’ve run into an interesting problem I’d be curious to hear your opinion on.

We’ve run into some problems where models are trained on autonomous driving datasets like you mention. As a result, we’ve seen that 3D detectors and certain classes of segmentors just won’t work for a mobile robot use-case out of the box, because the data they were trained on is so different from a mobile robotics task and sensor placement (e.g. we don’t get a bird’s-eye view like an AV does with cameras and lidars on a roof rack; our sensors are often close to the floor with limited frustum views. The same goes for hand-held datasets, where the point of view is relatively high for a mobile robot).

Given you posted this on the ROS discourse - I’d be curious if you tested and this works on a mobile robot or manipulator use-case out of the box?


Take a look at the video and the sample gifs. I was surprised at the number of perspectives in there (drones, human eye level, and a bit lower like a vehicle). I don’t think they’ve solved the generalization problem but they certainly put a dent in it. I am more curious about the overall performance, how big of a GPU would you need to get the network to 30FPS+?

One +1 for GPU sizing: the new Jetson Xavier NX has so much GPU that I can’t imagine needing more for a long time if you’re only running 1-2 networks. It’s an order of magnitude more GPU than the TX2, which was underpowered, and it’s 75% of the autonomous-vehicle Xavier AGX boards at the price point of the TX2.

See the figure at the bottom of this article. If you’ve made a network that can’t hit 30 fps with that hardware, maybe reconsider the architecture design :wink:

This is true, but I’ve seen a bunch of papers cough things that I’m testing cough that have similar videos and when I stick the network on a robot-height camera, the performance I actually see is much, much lower. I don’t think any of those datasets are representative of a robotics use-case.

Hi @smac and @Katherine_Scott, thanks very much for your interest in our work. Let me see if I can answer a few of your questions.

@smac Generalizing to all viewpoints and environments is still an extremely challenging problem, but I believe our work has made a strong step in the right direction. A semantic segmentation model pre-trained on COCO is probably the “go-to” these days for all-purpose use, and we outperform such a COCO model (trained for the same number of iterations) by 29% – our harmonic mean of 59.2 mIoU vs. their 45.8 mIoU over all held-out test datasets (see table below).


If there was a model I would put on my robot today, this would be the one I would choose.
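For reference, the harmonic mean used in that comparison is dominated by the weakest entries, so it penalizes poor performance on any single held-out dataset more than an arithmetic mean would. With hypothetical per-dataset scores (not numbers from the paper):

```python
def harmonic_mean(values):
    """Harmonic mean: pulled down by the weakest entries, unlike the arithmetic mean."""
    return len(values) / sum(1.0 / v for v in values)

# Hypothetical per-dataset mIoU scores, for illustration only.
scores = [70.0, 45.0, 60.0]
hmean = harmonic_mean(scores)   # lower than the arithmetic mean of ~58.3
```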

@Katherine_Scott – we were most focused on the generalization aspect and haven’t explored the accuracy vs. speed tradeoff formally. However, we have seen that an input image can be downscaled even to 360p before being fed into the network, with results only around 6% mIoU lower than what you would get with 720p or 1080p input. Multi-scale aggregation of category probabilities helps a bit (usually a 1-2% boost in mIoU), so if your GPU RAM can fit differently scaled versions of the image in a single batch, that will help improve accuracy as well.
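A minimal sketch of that multi-scale aggregation idea (with a toy stand-in model and nearest-neighbor resizing, not our actual inference code) might look like:

```python
import numpy as np

NUM_CLASSES = 3  # toy taxonomy size

def resize_nearest(img, h, w):
    """Nearest-neighbor resize for an (H, W, C) array (stand-in for real interpolation)."""
    ys = (np.arange(h) * img.shape[0] / h).astype(int)
    xs = (np.arange(w) * img.shape[1] / w).astype(int)
    return img[ys][:, xs]

def toy_model(img):
    """Toy stand-in network: softmax over the image's channels."""
    e = np.exp(img)
    return e / e.sum(axis=-1, keepdims=True)

def multiscale_predict(model, image, scales=(0.5, 1.0, 2.0)):
    """Average per-pixel class probabilities over several input scales."""
    h, w = image.shape[:2]
    acc = np.zeros((h, w, NUM_CLASSES))
    for s in scales:
        scaled = resize_nearest(image, max(1, int(h * s)), max(1, int(w * s)))
        probs = model(scaled)               # per-pixel class probabilities at this scale
        acc += resize_nearest(probs, h, w)  # bring back to the original resolution
    return (acc / len(scales)).argmax(axis=-1)  # final per-pixel labels

image = np.random.rand(8, 8, NUM_CLASSES).astype(np.float32)
labels = multiscale_predict(toy_model, image)   # (8, 8) array of class indices
```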


It would be nice to have some high-level number like XYZ resolution at ABC frames per second on 123 platform (e.g. qVGA quality at 22 fps on a Jetson TX2 or something). Just as a datapoint for users to know if it’s practical for their platform or not.

I’ll keep this on my queue to have tested, it seems compelling to at least try out!
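A rough way to gather such a datapoint yourself is a simple timing loop; a sketch with a stand-in model (not the MSeg network):

```python
import time
import numpy as np

def benchmark_fps(model, height, width, channels=3, warmup=3, iters=20):
    """Rough frames-per-second estimate at a fixed input resolution.

    On a real GPU you would also synchronize the device before reading
    the clock (e.g. torch.cuda.synchronize() when using PyTorch).
    """
    frame = np.random.rand(height, width, channels).astype(np.float32)
    for _ in range(warmup):            # warm-up iterations, excluded from timing
        model(frame)
    start = time.perf_counter()
    for _ in range(iters):
        model(frame)
    elapsed = time.perf_counter() - start
    return iters / elapsed

# Stand-in for a real segmentation network.
dummy_model = lambda x: x.mean()
fps = benchmark_fps(dummy_model, 240, 320)   # qVGA-sized input
```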

Thanks for the feedback about hardware/timing @smac, that’s totally valid, we’ll do some timing analysis and get back to you about that.

Regarding whether our held-out test datasets are representative of a robotics use-case: this is an interesting question, and I believe they are representative of several use cases (especially autonomous driving or indoor navigation, e.g. something like Hello-Robot). Here are a few details about 4 of the 6 datasets we evaluate on, which all come from “streaming” video sources:

  • ScanNet – handheld moving camera in 1513 indoor environments; significant motion blur.
  • KITTI – video from a robot-mounted camera (autonomous vehicle in Karlsruhe, Germany)
  • WildDash – dashcam-mounted camera videos from YouTube with hazardous scenarios (captured in countries all over the world)
  • CamVid – video from a robot-mounted camera (autonomous vehicle in Cambridge, UK)

And some randomly sampled images from each:

![](upload://7V1vhsa261X8irWV999HKU5qHTi.jpeg) ![](upload://raafhBP3gBN4cMRtlizBGx1EUJW.jpeg) ![](upload://vi0vYko6I08uRNDmIke6r5wt8ni.jpeg) ![](upload://tZqN58RhcoZlgPEIk1usWsUv5cv.jpeg) ![](upload://313YCcKYK0EuecwafOB7aPoE6gE.jpeg)


![](upload://wdIhQbLn8YpXtbYM3AEKkLONleY.jpeg) ![](upload://pQ27M155eyYiEf818EWVOvT7phM.jpeg) ![](upload://8e52p1a6NplfWyBoq85Jr0g6qQv.jpeg) ![](upload://9t9jh1w4b7MiG0NBXG8Lvfivteq.jpeg) ![](upload://bA8YAwjBLgcXQGx9agW6d0GTHm7.jpeg) ![](upload://wpmgywU5ARWrRKZUBOGUit1rCyf.jpeg) ![](upload://o3jHPllFTwC4115yolElYQFz0zv.jpeg) ![](upload://lq3TahimrwKSofvAZZ3YmsfNOBa.jpeg) ![](upload://ushJTM3Vp8h8dpcFdl72gOEQxnG.jpeg)


![](upload://zwlqOh3Y0EaRhJhHj4YF0CHosrP.png) ![](upload://tpnugI4Hbc0dDa2YWGROd59aPUb.png) ![](upload://nSD5E3rwzuEUnc0Ioz72WWa1rJZ.png)


![](upload://gGH81ToDKdlXvxPwY96RKQweC4n.png) ![](upload://pesbF6yBUd8gefv8OvW2FLY6mY4.png) ![](upload://uFuIZKGaZaEg20XXIMTBFJipecM.png)

Looks like my photo uploads didn’t work before – here are a few examples from the test datasets.

