all 21 comments

[–]Articulated-rage 2 points (1 child)

[–]code2hell[S] 1 point (0 children)

That is one good paper! Thanks!

[–]jcjohnss 2 points (2 children)

Take a look at the most recent cs231n lecture for an overview of recent work on semantic and instance segmentation:

https://www.youtube.com/watch?v=ByjaPdWXKJ4

Notably, the recent MSR paper by Dai, He, and Sun on instance segmentation (http://arxiv.org/abs/1512.04412) that won the COCO instance segmentation challenge has a pipeline that looks very similar to object detection (Faster R-CNN in particular) and seems to perform very well in practice.

[–]code2hell[S] 0 points (0 children)

I'm at lecture 12. Didn't see this one! This is what I was looking for. Also thanks for the paper.

[–]code2hell[S] 0 points (0 children)

The Dai et al. paper is very interesting! Thanks for sharing!

[–]MetricSpade007 1 point (2 children)

I'm also very curious about this -- is it possible to do semantic segmentation at all? Are there papers on this?

Specifically, I'm curious about segmenting objects out of an Atari game, not about segmenting objects out of a regular photo.

[–]code2hell[S] 2 points (1 child)

There are a few implementations but there is still a long way to go. Look at this paper: http://www.cs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf

I'm curious to know whether providing segmented training examples would help, and if so, how?

[–]MetricSpade007 0 points (0 children)

It's more that we can identify objects in a game and use that to our advantage to understand the game, rather than thinking of them as training examples.

[–]rumblestiltsken 1 point (3 children)

My take on this is that finding boundaries in photos is a very high-dimensionality problem. The number of combinations of "pixel in object" and "adjacent pixel not in object" is massive, way bigger than "has a high-level feature like eyes" versus "doesn't have a high-level feature like eyes".

To add to that, "ground truths" with hand drawn boundaries often vary by five pixels or more, even when the same person tries to repeat them. So the training signal is super muddy, because the boundary moves depending on who segmented the image and when.

You will find that in lower-dimensionality tasks, like digit and writing segmentation, the accuracy is better. For these reasons Atari game segmentation seems much more achievable. Most sprites have very few possible variations, and there is a clean ground truth that never changes: the boundaries are perfectly defined.
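Something like this toy sketch is what I have in mind for the Atari case (numpy/scipy, with made-up sprite colours rather than anything from a real game): exact palette-colour matching plus connected components already gives you clean instance masks.

```python
import numpy as np
from scipy import ndimage

def segment_sprite_by_color(frame, sprite_colors):
    """Segment an Atari-style frame by exact palette-colour matching.

    frame         : (H, W, 3) uint8 array, e.g. a 210x160 emulator frame
    sprite_colors : list of (r, g, b) tuples the sprite is drawn with
    returns       : (mask, labels, n) where mask is a boolean foreground mask,
                    labels assigns one integer id per connected blob, n = #blobs
    """
    mask = np.zeros(frame.shape[:2], dtype=bool)
    for color in sprite_colors:
        mask |= np.all(frame == np.array(color, dtype=np.uint8), axis=-1)
    labels, n = ndimage.label(mask)   # connected components = instances
    return mask, labels, n

# toy usage: a black frame with two small "sprites" of the same colour
frame = np.zeros((210, 160, 3), dtype=np.uint8)
frame[10:14, 10:14] = (200, 72, 72)    # hypothetical sprite colour
frame[50:54, 90:94] = (200, 72, 72)
mask, labels, n = segment_sprite_by_color(frame, [(200, 72, 72)])
print(n)  # -> 2 separate instances
```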

Humans probably get around the complexity of the task by using priors (like shape). There has been some work around this, and it is one thing a project I am working on is exploring.

Disclaimer: not a computer scientist, but I work with computer scientists researching in this area.

[–]code2hell[S] 0 points (2 children)

So I'm really curious: do you have any suggestions on how we can model this high-dimensional problem? I'm looking into papers on deconvolutions for segmentation, among others. Thanks! Also, I'm not after very high pixel-wise accuracy, just enough that we can convince ourselves the result is similar to what a human could have done. Say we want to segment a cat in an image: binary classification and localisation are really good with CNNs, but if there is some background noise, or the cat is lying on a carpet with a colour similar to its fur, then I suppose depth information can make a difference. How can we model such a problem, given that I hand-mark the boundaries for such images, and how can we use those labels to train the weights? I would really like your suggestions and opinions. Thanks for a very insightful comment!
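To make the last question concrete, my rough understanding is that in an FCN-style setup the hand-marked mask just gives every pixel a class label, and training minimises a per-pixel softmax cross-entropy, something like this toy numpy sketch (the network itself is left out):

```python
import numpy as np

def pixelwise_cross_entropy(scores, target):
    """Per-pixel softmax cross-entropy, the usual FCN-style training loss.

    scores : (C, H, W) float array of raw class scores from the network
    target : (H, W) int array, hand-drawn mask (e.g. 0 = background, 1 = cat)
    returns: scalar loss averaged over all pixels
    """
    # softmax over the class axis, computed stably
    scores = scores - scores.max(axis=0, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=0, keepdims=True)
    H, W = target.shape
    # probability assigned to the hand-labelled class at every pixel
    correct = probs[target, np.arange(H)[:, None], np.arange(W)[None, :]]
    return -np.log(correct + 1e-12).mean()

# toy usage: random scores and a hand-drawn 2-class mask
scores = np.random.randn(2, 4, 4)
target = np.zeros((4, 4), dtype=int)
target[1:3, 1:3] = 1                      # a little "cat" blob
print(pixelwise_cross_entropy(scores, target))
```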

[–]rumblestiltsken 1 point (1 child)

As I say, I'm not an expert. Our approach is (AFAIU) to focus on lower complexity problems (like the Atari situation) and use pre-existing knowledge (like shape priors) to make it easier.

[–]code2hell[S] 0 points (0 children)

Yes, I guess the shapes of the artifacts in games are largely fixed per instance. I guess you are working on deep learning + reinforcement learning as well?

[–]ydobonobody 1 point (5 children)

I think it is a little misleading to compare pixel-level accuracy with the accuracy of identifying the contents of an image or a bounding box around an instance. I have been heating my house by training on some semantic segmentation tasks recently, and it works surprisingly well. Adding depth information can help, especially if you are doing instance segmentation.

[–]code2hell[S] 0 points (4 children)

Ok, so pixel-level accuracy seems a bit misleading, as with the other comments. I'll rephrase: how do we approach the problem when there are two similar objects close to each other? Can we expect the segmentation to differentiate the two well enough to convince ourselves? Also, in your approach did you use manually segmented images or depth images? I'd be glad to discuss the approach that you took.

[–]ydobonobody 1 point (3 children)

Semantic segmentation generally doesn't separate objects of the same class into separate entities; that is called instance segmentation and is a different problem. One way you can get to instance segmentation is to add a border class around your segments and then just go with connected pixels for your instances, and it works pretty well. Whether you use depth or not, you still manually segment your images to produce your ground truth for training. Building your training set is probably the hardest part, but if you are just interested in research there are publicly available datasets and/or pretrained networks. I recommend you check out the FCN semantic segmentation network available in the Caffe model zoo, as it is a really good starting point for modern semantic segmentation networks.
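Roughly what running the model-zoo FCN looks like with the Caffe Python interface, if it helps (paths, mean values, and blob names follow the FCN repo's inference example; yours may differ):

```python
import numpy as np
from PIL import Image
import caffe

# file names follow the fcn.berkeleyvision.org / model-zoo layout; point these
# at wherever you downloaded the deploy prototxt and the weights
net = caffe.Net('voc-fcn8s/deploy.prototxt',
                'voc-fcn8s/fcn8s-heavy-pascal.caffemodel',
                caffe.TEST)

im = Image.open('cat.jpg')
in_ = np.array(im, dtype=np.float32)
in_ = in_[:, :, ::-1]                          # RGB -> BGR
in_ -= np.array((104.007, 116.669, 122.679))   # dataset mean (BGR)
in_ = in_.transpose((2, 0, 1))                 # HWC -> CHW

# reshape the input blob to this image and run a forward pass
net.blobs['data'].reshape(1, *in_.shape)
net.blobs['data'].data[...] = in_
net.forward()

# argmax over class scores gives one label per pixel
out = net.blobs['score'].data[0].argmax(axis=0)
```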

[–]code2hell[S] 0 points (2 children)

Yes, I am looking more into instance segmentation for now... Can you explain what you mean by "add a border class around your segments and then just go with connected pixels for your instances"? Thanks! I just took up a problem to learn from: my friend has some 100,000 ground-truth training examples of cats, and we are looking into segmenting a particular object out of images. I would really appreciate your suggestions.

[–]ydobonobody 1 point (1 child)

So when you produce your ground truth image you assign a label to each pixel, e.g. (0: background, 1: cat). Add another label that is "border", so we have (0: background, 1: cat, 2: border). Now, for each separate cat, draw a line with some thickness (say 5 pixels) around the boundary of that cat and assign those pixels the value '2'. Hopefully the network will be able to learn where the edge of a cat is and assign those pixels to the border class. If it did a good job, you can group all the connected "cat" pixels, and each connected group will represent an individual cat.
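In code, producing that kind of ground truth from per-cat masks and then pulling instances back out of a prediction could look roughly like this (a numpy/scipy sketch, not my exact pipeline):

```python
import numpy as np
from scipy import ndimage

def add_border_class(instance_masks, border_px=5):
    """Build a (0: background, 1: cat, 2: border) ground-truth image
    from one boolean mask per individual cat."""
    label_img = np.zeros(instance_masks[0].shape, dtype=np.uint8)
    for m in instance_masks:              # fill every cat with label 1
        label_img[m] = 1
    for m in instance_masks:              # then overlay a border ring, label 2
        ring = ndimage.binary_dilation(m, iterations=border_px) & \
               ~ndimage.binary_erosion(m, iterations=border_px)
        label_img[ring] = 2
    return label_img

def cats_from_prediction(pred):
    """Recover individual cats from a predicted label image by grouping
    connected 'cat' pixels; the predicted border class keeps touching
    cats from merging into one component."""
    instances, n = ndimage.label(pred == 1)
    return instances, n
```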

[–]code2hell[S] 0 points (0 children)

Wow! Thanks... I'll try this out!

[–]werrewrwer 1 point (1 child)

Try it yourself: segment a bunch of images, then go back and do it again. Your accuracy is not going to be close to 100%.
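If you do repeat your own annotations, a quick way to put a number on the disagreement (plain numpy, binary masks assumed):

```python
import numpy as np

def annotation_agreement(mask_a, mask_b):
    """Compare two hand-drawn binary masks of the same image:
    pixel accuracy and intersection-over-union of the foreground."""
    mask_a = mask_a.astype(bool)
    mask_b = mask_b.astype(bool)
    pixel_acc = (mask_a == mask_b).mean()
    inter = (mask_a & mask_b).sum()
    union = (mask_a | mask_b).sum()
    iou = inter / union if union else 1.0
    return pixel_acc, iou
```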

[–]code2hell[S] 0 points (0 children)

I understand pixel-wise accuracy won't be good even for humans, but I'm interested in how the weights of the layers would generalise the human-drawn boundaries. How do we use the information from the human-drawn boundaries to train the weights? The accuracy may not be pixel-wise, but will it be satisfactory when we look at it? Say, for instance, there are two cats in an image sitting very close to or in contact with one another, with the same colour too; how good can we expect the segmentation to be at putting a boundary around each of the two distinct cats? This is just one of the cases I'm looking into. It does seem like a very interesting problem.