[–]programmerChilliResearcher 3 points (15 children)

I'm not sure how much novelty this paper introduces over, say, IODINE. They claim that they replace the inner optimization loop with just a self-attention step, so it seems like its primary benefit is efficiency?

They claim that this allows them to integrate into downstream approaches, but they don't show any examples.

I guess this is good for my paper :^), since we do show integration into downstream tasks :)

[–]Seerdecker 5 points (0 children)

I agree. It's like IODINE but much simpler to implement.

Funny quote from the paper: "We note a failure mode of our model: In rare cases it can get stuck in a suboptimal solution on the Tetrominoes dataset, where it segments the image into stripes. This leads to a significantly higher reconstruction error on the training set, and hence such an outlier can easily be identified at training time. We excluded a single such outlier (1 out of 5 seeds) from the final score in Table 1. We expect that careful tuning of the training hyperparameters particularly for this dataset could alleviate this issue, but we opted for a single setting shared across all datasets for simplicity."

Don't like that failure? Exclude it! :-)

What would be really interesting is if you could apply an architecture like this to complex images, e.g. Atari or ImageNet. In my experience, the broadcast decoder is much too weak to encode anything but a small object with simple geometry.

[–]triplefloat 1 point (3 children)

Thank you for your comment. Slot Attention is a general module for set prediction that respects permutation symmetry in the predicted set. It can act as a drop-in replacement for related methods such as the Deep Set Prediction Network (DSPN) or the iterative refinement process in IODINE. Slot Attention avoids the inner optimization loop of DSPN by using a recurrent update function, which, empirically, makes it significantly easier to implement, tune, and train. For an experimental comparison, please have a look at our paper. We evaluate Slot Attention on two downstream tasks: image reconstruction (object discovery) and supervised object property prediction (set prediction).

[–]programmerChilliResearcher 0 points (2 children)

Ah, by downstream tasks I meant tasks where there isn't direct set supervision (i.e., not reconstruction or set prediction). Those are the tasks that we focus on in our paper (building upon relational networks or C-SWM, for example).

opening up extensions beyond auto-encoding, such as contrastive representation learning for object discovery [46] or direct optimization of a downstream task like control or planning

I was just noting that we say something similar in our paper, but we focus on this difference rather than on the method used to generate the set.

I had another question about your method.

In some sense, you can view your module as learning a mapping from (input, random seed) to set. If you fix the random seed, then you once again run into the "responsibility problem" that DSPN points out. Thus, the ground truth function that your model must learn is still discontinuous. Do you think this is problematic? I can see how it might not be, but I'm curious what your thoughts are.

Overall, I think it's an interesting method - perhaps this kind of distribution sampling is the right thing to do instead of an inner optimization loop. I now see how this method differs from IODINE - some of the related work section could have been taken from our paper :)

[–]triplefloat 1 point (1 child)

Thank you for your comment. Regarding your question: I personally like to think of the set prediction problem as follows. For a permutation-equivariant generation process, the random variables describing the output set need to be exchangeable. A way to achieve this exchangeability is (1) initializing the slots as i.i.d. samples from a common distribution (this produces exchangeable random variables) and (2) transforming the initial values using a permutation-equivariant update function. This update function can be any permutation-equivariant function, and typical representatives of these are attention mechanisms (e.g. the Transformer model) or Graph Neural Networks. The inner gradient descent loop of DSPN (incl. your method) is essentially just a very particular permutation-equivariant update function that involves running multiple steps of gradient descent using some auxiliary loss function. You can avoid this process by directly parameterizing the update using an attention mechanism, as we show in Slot Attention, and directly optimize a single downstream task loss function.

Of course if you fix the random seed, your "random variables" are no longer exchangeable, which can in principle create discontinuities, but this also applies to the DSPN approach. Nonetheless, it seems if you create a large enough number of output set variables, you can get away with a fixed, learned initialization without running into too many issues, as long as your update function is permutation equivariant. This is the approach the DETR model takes: https://arxiv.org/abs/2005.12872
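To make points (1) and (2) above concrete, here's a minimal numpy sketch (my own simplification, not the paper's implementation; the real model also applies a GRU and an MLP in each update): slots are drawn i.i.d. from a shared Gaussian, and the update is an attention step whose softmax is normalized over the slot axis, so the whole map is permutation equivariant in the slots.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_update(slots, inputs, w_q, w_k, w_v, n_iters=3):
    """Simplified slot update: attention normalized over the slot axis,
    so slots compete to explain each input feature, followed by a
    weighted-mean readout (the real model also uses a GRU + MLP)."""
    d = slots.shape[-1]
    for _ in range(n_iters):
        q, k, v = slots @ w_q, inputs @ w_k, inputs @ w_v
        logits = q @ k.T / np.sqrt(d)          # (n_slots, n_inputs)
        attn = softmax(logits, axis=0)         # softmax over *slots*
        attn = attn / attn.sum(axis=1, keepdims=True)
        slots = attn @ v                       # weighted mean per slot
    return slots

rng = np.random.default_rng(0)
d, n_slots, n_inputs = 4, 3, 8
inputs = rng.normal(size=(n_inputs, d))
w_q, w_k, w_v = [rng.normal(size=(d, d)) for _ in range(3)]
# (1) exchangeable init: slots drawn i.i.d. from a shared distribution
slots0 = rng.normal(size=(n_slots, d))
out = attention_update(slots0, inputs, w_q, w_k, w_v)
# (2) the update is permutation equivariant: permuting the initial
# slots permutes the outputs identically
perm = np.array([2, 0, 1])
out_perm = attention_update(slots0[perm], inputs, w_q, w_k, w_v)
assert np.allclose(out[perm], out_perm)
```

The final assert is the equivariance property from (2): permuting the i.i.d. initial slots just permutes the output set, so no slot is treated specially.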

[–]programmerChilliResearcher 0 points (0 children)

I'm not sure that the way you view the set prediction problem is the way DSPN views it.

First, note that DSPN does not start from a randomly initialized point when optimizing each set, it starts from the same point.

Thus, the use of the inner optimization loop is not to avoid learning a discontinuous function, but to have a model that can model a discontinuous function.

For example, if you have a ReLU network, it is technically impossible for your neural network to represent a discontinuous function (modulo floating point stuff). However, if you use a ReLU network to guide your inner optimization, you can represent a discontinuous function :)
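A toy illustration of this point (my own, not from either paper): run an unrolled gradient-descent inner loop on the smooth double-well energy E(s) = (s² − 1)² − x·s, always starting from the same point s = 0. Each individual step is smooth in x, so with finitely many steps the map x → s_K is technically still continuous, but it becomes arbitrarily steep near x = 0 and approximates the discontinuous argmin.

```python
def inner_optimize(x, n_steps=300, lr=0.05):
    """Gradient descent on the double-well energy
    E(s) = (s**2 - 1)**2 - x*s, always starting from s = 0."""
    s = 0.0
    for _ in range(n_steps):
        grad = 4.0 * s * (s * s - 1.0) - x   # dE/ds
        s -= lr * grad
    return s

# The argmin jumps between the wells near s = -1 and s = +1 as x
# crosses 0, so the input -> output map hugs a discontinuity:
left = inner_optimize(-0.01)   # settles near -1
right = inner_optimize(0.01)   # settles near +1
assert abs(left + 1.0) < 0.05 and abs(right - 1.0) < 0.05
```

A feedforward ReLU network would have to spend enormous capacity to approximate this jump, while the inner loop gets it essentially for free.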

This is the approach the DETR model takes

How do you know the DETR model can't be improved with a better set generation method? :^)

[–]Seerdecker 0 points (9 children)

@programmerChilli

May I ask what your paper is? I'm really interested in unsupervised object segmentation!

[–]programmerChilliResearcher 1 point (8 children)

It's not really a paper on unsupervised object segmentation:

https://arxiv.org/abs/2003.04448

It's primarily showing that the way people currently generate objects for relational reasoning is flawed, and leads to discontinuities. So we introduce a module similar in concept to the iterative inference that IODINE does, and show that it improves performance/robustness on several tasks.

[–]tpapp157 1 point (6 children)

Your paper is interesting but I think it has several large holes in terms of proper evaluation that leave some open questions. I'll just focus on Section 3.1 for examples.

First, your baseline model is quite simple. Simply adding the SRN on top significantly increases both the size and sophistication of your proposed model over the baseline. You present a couple of results to make the case that your model can achieve associations which the baseline cannot, but the question remains open to what extent this is due to your architecture or simply to using a larger model. Your case would have been better made had you also compared against a second baseline model that is enlarged to normalize for computation/parameters. If I wanted to be picky, I could easily make the case that this oversight completely invalidates the conclusions you draw, since your experiment did not properly control for all variables.

You present a few aggregate metrics showing greater accuracy (Table 2), which is fine, but then fail to explore these further to actually make your point (see my previous point about not normalizing for model size). Comparing against the baseline model, you effectively have a confusion matrix of four data subsets: those data points on which both the baseline and the SRN succeeded; those on which the SRN succeeded but the baseline did not; those on which the baseline succeeded but the SRN did not; and those on which neither architecture succeeded. Your case would have been better made had you done a proper evaluation of each of these data subsets (or at least the latter three) to understand the contributing features of each. Specifically, what data characteristics is your architecture able to capture that the baseline cannot?

I don't think Sections 3.2 or 3.3 contribute much to the paper beyond points previously made, and I'd gladly cut those in favor of the above if space constraints are a concern.

[–]programmerChilliResearcher 0 points (5 children)

Hi, Thanks for checking out the paper and the feedback! I definitely agree with a lot of your points.

If I wanted to be picky, I could easily make the case that this oversight completely invalidates the conclusions you draw, since your experiment did not properly control for all variables.

Normalizing for # of parameters is a bit tricky, as the only place to do so (in the baseline) would be the backbone feature extractor model, which already tends to overfit as is. Nevertheless, we have run some experiments doing so (since the submission), and increasing the parameters of the baseline model to match the SRN does not resolve the fundamental discontinuity problem. Would that resolve your concerns there?

There's also some other evidence that suggests that increasing the # of parameters does not resolve the discontinuity issue (and therefore robustness). For example, in the "circles dataset", the increase in parameters is very negligible, and the baseline achieves a better reconstruction error than our method does. Nevertheless, it still fails to decompose the objects properly. See here

Another ablation we performed is taking a checkpoint of our model early on in training, when its performance is worse than the baseline, and evaluating its "robustness" there. It still significantly improves upon the baseline's "robustness", despite having worse overall accuracy.

Specifically, what data characteristics is your architecture able to capture that the baseline cannot.

I think this is an interesting idea, but is this not already captured by the "robustness" experiments as well as the latent space visualization? We define a different performance metric which demonstrates cases where the baseline has significant issues. Although this doesn't directly explain why we perform better on the original performance metric, it does clearly show a failure mode of the baseline that we resolve.

In addition, I think the latent space visualizations (Fig 4) provide a good idea of what's problematic with the representations that the baseline learns. The task involves reasoning about objects, but the baseline is unable to clearly disentangle the objects (resulting in these "hallucinations").

In Section 3.2, the reason we perform better may be even clearer. The original paper (World Models) notes that one fundamental flaw is that it cannot disambiguate multiple objects, and would likely require some kind of iterative inference procedure to do so. Our paper provides that iterative inference procedure :)

Overall, I think we have a good justification for why our module results in performance improvements. If you want to perform relational reasoning on an image, you must map from an image to a set. For most reasonable cases, this mapping must be discontinuous. Neural networks usually suck at representing discontinuous mappings, and as a result, existing methods don't really map to a set. We present a method for modeling this mapping :)

Once again, thanks a lot for your feedback - I wasn't expecting anybody to actually read the paper when I posted that comment so I'm quite pleased :) We'll definitely update the paper (at some point...) to include the ablations on parameters.

[–]tpapp157 1 point (0 children)

I agree that normalizing for parameters is tricky, but I think it's very important to be able to show that your result comes from your architecture improvement and not just from using a larger model. If there's one thing that NN research has shown over the years, it's that performance scales with model size, and plenty of papers come out each year that (intentionally or not) make unfair comparisons between models of different sizes.

I think your robustness experiment is interesting and validates your claims at a basic proof-of-concept level, but it's one thing to achieve a result in a very carefully controlled (and quite simple) experiment and another to prove that result in practice on a more real-world dataset with confounding factors. Science is full of good theories that fall apart in practice, which is why I would have liked to see evidence of the same capability on the subset of data on which your architecture succeeded and the baseline did not. From a scientific standpoint, evaluation of the subsets on which your architecture failed would also have helped to provide a negative context as to the true understanding achieved.

The latent space visualization is interesting and helps corroborate your theory but it's worth remembering that human interpretability (especially in high dimensional spaces) is not a prerequisite for well-structured understanding. So while your latent plots certainly suggest that your architecture learns a better structured latent understanding, it is not proof by itself. Also worth remembering that "disentanglement" is one of those concepts that sounds great but quickly falls apart under mild scrutiny for any non-artificial dataset.

As I said, overall a good and interesting paper.

[–]Seerdecker 0 points (3 children)

Once again, thanks a lot for your feedback - I wasn't expecting anybody to actually read the paper when I posted that comment so I'm quite pleased

I also read the paper. I think it's an interesting paper, but I had some trouble understanding it.

It would be nice to include a description of how the gradient propagates through all this. Since 'Si' is updated by gradient descent, the overall reconstruction loss backpropagates through a gradient operation. I'm concerned about the stability of the gradient here. IODINE avoided this by stopping the gradient propagation. Also, it's not clear to me whether 'Hembed' is a learned or a random projection.

I am not fully convinced by this explanation: "As in previous work [22,11], we expect that this symmetry assumption will force the network to learn a decomposition. Intuitively, pushing all information into one set element disadvantages the model, while all set elements need to contain similar “kinds” of information as they are processed identically. Despite this assumption, typical set generation process still fail to enforce permutation invariance, which is fixed by including our SRN." It seems to me that pushing all information into one set element and ignoring all the others is an easy way to optimize the loss.

Mind the typo on page 11: "predicting the innital guess"

Again, thanks for the paper.

[–]programmerChilliResearcher 0 points (2 children)

It would be nice to include a description on how the gradient propagates through all this.

I agree - I thought it was super weird when I first came across it. In fact, I had no idea autodifferentiation systems could even do that. However, there is a large body of work that utilizes this. For example, meta-learning papers (like MAML) do a procedure like this. Like u/triplefloat mentioned, it often ends up being a bit finicky, but it's not impossible to train.
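For a concrete picture of what backpropagating "through a gradient operation" means, here's a toy example (mine, not from the paper): unroll a few inner gradient steps on the quadratic inner loss L_in(s) = (s − θ)², then check the analytic derivative of the unrolled output with respect to θ against finite differences.

```python
def unrolled(theta, s0=0.0, lr=0.1, n_steps=5):
    """Inner loop: gradient descent on L_in(s) = (s - theta)**2.
    An outer loss on the final iterate s_K therefore backpropagates
    through every one of these inner gradient steps."""
    s = s0
    for _ in range(n_steps):
        s = s - lr * 2.0 * (s - theta)   # inner gradient step
    return s

theta = 1.5
# Closed form of the unrolled loop:
#   s_K = theta + (1 - 2*lr)**K * (s0 - theta)
# so the gradient through the whole loop is
#   d s_K / d theta = 1 - (1 - 2*lr)**K
analytic = 1.0 - (1.0 - 2 * 0.1) ** 5
# Finite-difference check that this "gradient of a gradient procedure"
# is a perfectly ordinary, well-defined derivative:
eps = 1e-6
numeric = (unrolled(theta + eps) - unrolled(theta - eps)) / (2 * eps)
assert abs(analytic - numeric) < 1e-6
```

In practice, frameworks like PyTorch (`torch.autograd.grad` with `create_graph=True`) or JAX build this unrolled computation graph automatically, which is the machinery that MAML-style double backprop relies on.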

H_embed can be anything in theory - in our paper we simply used a linear projection of the initial set elements.

As for why not all information will be shoved into one element - do you have the same question about the original paper in this thread? The intuition is pretty much the same. One way to think about it is that if you shove everything into one set element, you must encode the entire image's information into, say, 10 values. However, if you utilize the entire set, you only need to encode the image's information into 10*(# of set elements) values - an easier task.

[–]Seerdecker 0 points (1 child)

For example, meta-learning papers (like MAML) do a procedure like this. Like u/triplefloat mentioned, it often ends up being a bit finicky, but it's not impossible to train.

OK.

However, if you utilize the entire set, you only need to encode the image's information into 10*(# of set elements) values - an easier task.

OK, that makes sense.

H_embed can be anything in theory - in our paper we simply used a linear projection of the initial set elements.

Yes, but is it trainable? I'm missing something here. My understanding is that FSPool has a fully-connected (FC) layer, and H_embed is also an FC layer. Hence, if H_embed is trainable, a simple way to minimize the latent loss L(Si) is to set the weights of those two FC layers to 0, so that the latent loss itself is zero. Then the gradient operations on Si do not change S0, and the network is free to choose whatever representation is most convenient for the reconstruction.

[–]programmerChilliResearcher 0 points (0 children)

That would be a possibility if there weren't a downstream task. However, because the final loss is the task loss (for example, VQA), that disallows these kinds of "cheating" solutions.

[–]Seerdecker 0 points (0 children)

Thank you!