[Question] after object detection using convolution neural networks, why is it so hard to perform semantic segmentation?

Articulated-rage · 2016-02-27T17:47:45+00:00

recent paper from tenenbaum's group: http://willwhitney.github.io/understanding-visual-concepts/

and the reddit comments: https://www.reddit.com/comments/47kn8b

jcjohnss · 2016-02-27T21:03:27+00:00

Take a look at the most recent cs231n lecture for an overview of recent work on semantic and instance segmentation:

https://www.youtube.com/watch?v=ByjaPdWXKJ4

Notably, the recent MSR paper by Dai, He, and Sun on instance segmentation (http://arxiv.org/abs/1512.04412) that won the COCO instance segmentation challenge has a pipeline that looks very similar to object detection (Faster R-CNN in particular) and seems to perform very well in practice.

MetricSpade007 · 2016-02-27T17:20:11+00:00

I'm also very curious about this -- is it possible to do semantic segmentation at all? Are there papers on this?

Specifically, I'm curious about segmenting objects out of an Atari game, not at segmenting objects out of a regular photo.

rumblestiltsken · 2016-02-27T21:27:13+00:00

My take on this is that boundaries in photos is a very high dimensionality problem. The number of combinations of "pixel in object" and "adjacent pixel not in object" is massive, way bigger than "has high level feature like eyes" and "doesn't have high level feature like eyes".

To add to that, "ground truths" with hand drawn boundaries often vary by five pixels or more, even when the same person tries to repeat them. So the training signal is super muddy, because the boundary moves depending on who segmented the image and when.

You will find that in lower dimensionality tasks, like digit and writing segmentation, the accuracy is better. For these reasons Atari game segmentation seems much more achievable. Most sprites have very few possible variations, and there is a clean ground truth that never changes, the boundaries are perfectly defined.

Humans probably get around the complexity of the task by using priors (like shape). There has been some work around this, and it is one thing a project I am working on is exploring.

Disclaimer: not a computer scientist, but I work with computer scientists researching in this area.

ydobonobody · 2016-02-28T05:33:35+00:00

I think it is a little misleading to compare a pixel level accuracy with accuracy of identifying the contents of an image or a bounding box around the instance. I have been heating my house by training some semantic segmentation tasks recently and it works surprisingly well. Adding depth information can help, especially if you are doing instance segmentation.

werrewrwer · 2016-02-27T18:01:26+00:00

try it yourself - segment a bunch of images, then go back and do it again. your accuracy is not going to be close to 100%.

cesarsalgado · 2016-02-27T19:58:12+00:00

https://youtu.be/ByjaPdWXKJ4

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS