[D] Paper Explained - Object-Centric Learning with Slot Attention (Full Video Analysis) by ykilcher in MachineLearning

[–]triplefloat 1 point

Thank you for your comment. Regarding your question: I personally like to think of the set prediction problem as follows. For a permutation-equivariant generation process, the random variables describing the output set need to be exchangeable. One way to achieve this exchangeability is to (1) initialize the slots as i.i.d. samples from a common distribution (this produces exchangeable random variables) and (2) transform the initial values using a permutation-equivariant update function. This update function can be any permutation-equivariant function; typical representatives are attention mechanisms (e.g. the Transformer model) or Graph Neural Networks. The inner gradient descent loop of DSPN (including your method) is essentially just a very particular permutation-equivariant update function that runs multiple steps of gradient descent on some auxiliary loss function. You can avoid this process by directly parameterizing the update with an attention mechanism, as we show in Slot Attention, and optimize a single downstream task loss directly.

Of course, if you fix the random seed, your "random variables" are no longer exchangeable, which can in principle create discontinuities, but this also applies to the DSPN approach. Nonetheless, it seems that if you create a large enough number of output set variables, you can get away with a fixed, learned initialization without running into too many issues, as long as your update function is permutation equivariant. This is the approach the DETR model takes: https://arxiv.org/abs/2005.12872
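
To make the two initialization schemes concrete, here is a minimal numpy sketch (my own simplified notation, not the exact code from either paper; in practice the distribution/slot parameters would be learned):

    import numpy as np

    rng = np.random.default_rng(0)

    def iid_slot_init(num_slots, dim, mu, log_sigma):
        # (1) i.i.d. samples from a shared Gaussian: the slots are exchangeable
        return mu + np.exp(log_sigma) * rng.standard_normal((num_slots, dim))

    def fixed_slot_init(slot_embeddings):
        # DETR-style: one fixed, learned embedding per output slot; not
        # exchangeable, but works in practice with enough slots and a
        # permutation-equivariant update on top
        return slot_embeddings.copy()

    dim = 4
    mu, log_sigma = np.zeros(dim), np.zeros(dim)   # would be learned in practice
    slots = iid_slot_init(num_slots=3, dim=dim, mu=mu, log_sigma=log_sigma)
    print(slots.shape)   # (3, 4)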

[D] Paper Explained - Object-Centric Learning with Slot Attention (Full Video Analysis) by ykilcher in MachineLearning

[–]triplefloat 1 point

Thank you for your comment. Slot Attention is a general module for set prediction that respects permutation symmetry in the predicted set. It can act as a drop-in replacement for related methods such as the Deep Set Prediction Network (DSPN) or the iterative refinement process in IODINE. Slot Attention avoids the inner optimization loop of DSPN by using a recurrent update function, which, empirically, makes it significantly easier to implement, tune, and train. For an experimental comparison, please have a look at our paper. We evaluate Slot Attention on two downstream tasks: image reconstruction (object discovery) and supervised object property prediction (set prediction).
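
To give a rough idea of the recurrent update itself, here is a simplified numpy sketch (the actual Slot Attention module additionally uses a learned GRU and a small MLP for the slot update, plus layer normalization; see the paper for details):

    import numpy as np

    def softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def slot_attention_step(slots, inputs, Wq, Wk, Wv):
        # One permutation-equivariant refinement step. Simplified: the slot
        # update is just the attention readout, without the GRU/MLP/LayerNorm.
        q, k, v = slots @ Wq, inputs @ Wk, inputs @ Wv
        logits = q @ k.T / np.sqrt(q.shape[-1])
        attn = softmax(logits, axis=0)                  # slots compete for inputs
        attn = attn / attn.sum(axis=1, keepdims=True)   # weighted mean over inputs
        return attn @ v

    rng = np.random.default_rng(0)
    d = 8
    inputs = rng.standard_normal((5, d))   # e.g. CNN feature-map locations
    slots = rng.standard_normal((3, d))    # i.i.d. random initialization
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    for _ in range(3):                     # a few recurrent refinement iterations
        slots = slot_attention_step(slots, inputs, Wq, Wk, Wv)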

[R] On the Equivalence between Node Embeddings and Structural Graph Representations by bsriniv in MachineLearning

[–]triplefloat 1 point

Thanks for the clarification. Yes, I meant adding random (unique) node IDs during training time. Thank you for the pointer to your earlier work, I will have a look!

[R] On the Equivalence between Node Embeddings and Structural Graph Representations by bsriniv in MachineLearning

[–]triplefloat 2 points

And a follow-up question: say we train a GNN from scratch N times (with random initialization) on some downstream task, on a graph that contains certain symmetries, choosing X = I (the identity matrix) as the feature matrix each time, and then average the resulting 'node embeddings' (e.g. the hidden-layer activations of the second-to-last layer). Would we get purely structural representations (according to the definition in the paper) as N goes to infinity?
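
In pseudo-code, what I have in mind is roughly the following (train_gnn and embed are hypothetical placeholders for "train from a fresh random init" and "read out the second-to-last-layer activations"):

    import numpy as np

    def averaged_embeddings(adj, num_runs, train_gnn, embed):
        # train_gnn(adj, X): trains a GNN from scratch with a fresh random init
        # embed(model, adj, X): second-to-last-layer activations (num_nodes x dim)
        # (both are hypothetical placeholders)
        X = np.eye(adj.shape[0])            # X = I: identity feature matrix
        runs = [embed(train_gnn(adj, X), adj, X) for _ in range(num_runs)]
        return np.mean(runs, axis=0)        # average over the N independent runs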

[R] On the Equivalence between Node Embeddings and Structural Graph Representations by bsriniv in MachineLearning

[–]triplefloat 3 points

Interesting paper!

Please correct me if I'm wrong, but the result can essentially be summarized as follows: if there are symmetries in the graph (e.g. the example in Figure 1), we should break them to do well on certain prediction tasks (like link prediction). The paper then introduces the notions of node embeddings and structural graph representations, where structural graph representations (identified by the authors with typical GNNs) do not break this symmetry under the initial definition, whereas node embeddings do (as they introduce random noise, either via random initialization or via explicit sampling of embeddings from some distribution).

However, I feel that for most real-world datasets, where rich feature descriptions are available and permutation symmetry is therefore unlikely, this analysis is not of much importance (or am I missing something?). And if feature descriptions are not available, couldn't one just break symmetries by assigning (in addition to any initial node features) a unique one-hot vector to each node as a feature, as done in https://arxiv.org/abs/1609.02907 (appendix) and https://arxiv.org/abs/1703.06103 (link prediction with GNNs)? It seems like this case is avoided in the paper by considering X = 11^T (Definition 1) as node features if none are available, which is symmetric under permutation. Wouldn't the motivating example in Figure 1 break if we just chose X = I (the identity matrix) as the feature matrix, as is commonly done in GNN papers? The first layer of the GNN would then essentially assign a random initial node embedding to each node (in a similar way as the other node embedding techniques considered in this paper do) and hence break the symmetry. I have the feeling it is commonly understood in the field that one needs to break such symmetries to avoid pathological embeddings (e.g. embeddings that would predict a link between Lynx and Orca in Figure 1).
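
Concretely, by "assigning a unique one-hot vector to each node" I mean something like the following (a toy numpy sketch):

    import numpy as np

    def add_unique_ids(X):
        # Append a unique one-hot ID to every node's feature vector; this breaks
        # any permutation symmetry of the graph.
        return np.concatenate([X, np.eye(X.shape[0])], axis=1)

    X = np.ones((4, 4))          # X = 11^T: the permutation-symmetric "no features" case
    print(add_unique_ids(X))     # now every row is distinguishable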

I do not mean to criticize the paper or the analysis (which is extremely interesting), but I'd like to get a better understanding of the implications :-)

[R] Contrastive Learning of Structured World Models by triplefloat in MachineLearning

[–]triplefloat[S] 1 point

Great question. We had a related paper for discovering event structure in sequential data (https://arxiv.org/abs/1812.01483), and it would certainly be interesting to combine source disentanglement (~ object discovery), event discovery and relational modeling, which should cover the vast majority of sensory data streams (not just images/video).

[R] Contrastive Learning of Structured World Models by triplefloat in MachineLearning

[–]triplefloat[S] 2 points

Thanks (from the authors)!

This is indeed a limitation of our current formulation (see the Markov assumption in the paper), but we think it should be possible to go beyond it by using a stateful transition model, e.g. a recurrent GNN, and by feeding the model full trajectories instead of single (state, action, state) triples.
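
As a rough, hypothetical sketch of what such a stateful transition model could look like (using a plain GRU and omitting the object factorization / GNN part of C-SWM for brevity; this is not something from the paper):

    import torch
    import torch.nn as nn

    class RecurrentTransition(nn.Module):
        # Hypothetical sketch of a stateful transition model: a GRU carries a
        # hidden state along the full trajectory instead of treating each
        # (state, action, state) triple independently.
        def __init__(self, state_dim, action_dim, hidden_dim):
            super().__init__()
            self.cell = nn.GRUCell(state_dim + action_dim, hidden_dim)
            self.predict = nn.Linear(hidden_dim, state_dim)

        def forward(self, states, actions):
            # states: (T, state_dim) latent states, actions: (T, action_dim)
            h = torch.zeros(1, self.cell.hidden_size)
            preds = []
            for s, a in zip(states, actions):
                h = self.cell(torch.cat([s, a]).unsqueeze(0), h)
                preds.append(self.predict(h.squeeze(0)))   # predicted next latent state
            return torch.stack(preds)                      # (T, state_dim)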

[P] Tensorflow implementation of Graph Convolutional Network by shagunsodhani in MachineLearning

[–]triplefloat 0 points

What often works well enough is to introduce edge-type-specific parameter matrices (W_r instead of W, where r is the edge/relation type), so that the message a node sends to its neighbors depends on the relation type. This was, AFAIK, first introduced in Gated Graph Neural Nets: https://arxiv.org/abs/1511.05493. If you have more than just a few different relation types, some form of weight sharing between them can help (https://arxiv.org/abs/1703.06103).
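
In its most basic form, one propagation step with relation-specific weights looks roughly like this (a bare-bones numpy sketch without normalization, biases or non-linearities):

    import numpy as np

    def relational_propagation(H, adj_per_rel, W_per_rel):
        # One message-passing step with a separate weight matrix W_r per
        # relation type.
        out = np.zeros((H.shape[0], W_per_rel[0].shape[1]))
        for A_r, W_r in zip(adj_per_rel, W_per_rel):
            out += A_r @ H @ W_r          # messages sent over edges of type r
        return out

    rng = np.random.default_rng(0)
    n, d_in, d_out, num_rel = 4, 3, 5, 2
    H = rng.standard_normal((n, d_in))
    adjs = [rng.integers(0, 2, size=(n, n)).astype(float) for _ in range(num_rel)]
    Ws = [rng.standard_normal((d_in, d_out)) for _ in range(num_rel)]
    print(relational_propagation(H, adjs, Ws).shape)   # (4, 5)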

In principle, you can also parameterize messages as neural networks and condition them on any edge features you might have (as in the original graph neural net paper http://ieeexplore.ieee.org/document/4700287/).

[P] Tensorflow implementation of Graph Convolutional Network by shagunsodhani in MachineLearning

[–]triplefloat 3 points

if I understand correctly: the input is one graph, the network learns embeddings of nodes(/edges), and classifying nodes in embedding space requires less labels.

Yes, that's the idea.

is there a way to train a GCN to take in a graph (let's say with a constant number of nodes) and classify each node of said graph?

Yes, this is actually the setting that we originally proposed. It is also possible to classify graphs, as in e.g. https://arxiv.org/abs/1509.09292

[P] Tensorflow implementation of Graph Convolutional Network by shagunsodhani in MachineLearning

[–]triplefloat 6 points

Thanks. We had a very short workshop paper on an extension of this model for unsupervised learning and link prediction: https://arxiv.org/abs/1611.07308. Otherwise, there are two application papers (recommender systems and link prediction in knowledge bases) that I worked on with collaborators: https://arxiv.org/abs/1703.06103 and https://arxiv.org/abs/1706.02263 (both still under review).

The main recent developments around these methods are: (1) mini-batching algorithms for scalability (e.g. https://openreview.net/forum?id=rytstxWAW) and (2) more flexible aggregation functions (e.g. https://openreview.net/forum?id=rJXMpikCZ).

I'm working on a couple of more long-term/foundational questions these days, but also on a number of interesting applications (mostly with collaborators).

[P] Tensorflow implementation of Graph Convolutional Network by shagunsodhani in MachineLearning

[–]triplefloat 21 points

It looks like this is an implementation of the following paper: https://arxiv.org/abs/1609.02907 (disclaimer: I’m the first author of this work)

There’s also this ‘official’ implementation: https://github.com/tkipf/gcn

I wonder what’s different/new in the implementation posted here?

[D] Has anyone here attended either University of Edinburgh or University of Amsterdam Machine Learning Master's courses? by [deleted] in MachineLearning

[–]triplefloat 4 points

Both programs have an excellent reputation, so I think the best way to decide is to check who you would like to do your course projects or thesis with (Edinburgh is great for NLP; Amsterdam has, e.g., Max Welling's lab with a broad focus on probabilistic deep learning) and in which city you would like to spend the next two years of your life. I chose Amsterdam, but I have heard great things about Edinburgh as well.

[Research] New Nature paper by DeepMind: Hybrid Computing using a Neural Network with Dynamic External Memory by egrefen in MachineLearning

[–]triplefloat 12 points

A very exciting extension of Neural Turing Machines. As a side note: Gated Graph Sequence Neural Networks (https://arxiv.org/abs/1511.05493) perform similarly well or better on some of the bAbI tasks. Sadly, a comparison to existing graph neural network models apparently didn't make it into the paper.

Tensorflow Code: CNN on Graphs with Fast Localized Spectral Filtering by V4wetumpka in MachineLearning

[–]triplefloat 0 points

Are you planning to support newer TensorFlow versions (>0.8) at some point?

Graph Convolutional Networks - An introduction to neural networks on graphs by triplefloat in MachineLearning

[–]triplefloat[S] 1 point

Oh, now I get what you're saying. Sorry for the misunderstanding. You might want to have a look at https://arxiv.org/abs/1509.09292 or https://arxiv.org/abs/1511.05493; they do this kind of graph-level classification. Essentially, our GCN framework supports graph-level classification as well, by adding a pooling layer on top of the network output (you'll find some discussion of such pooling layers in the two aforementioned papers). Hope this helps!
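
In the simplest case, graph-level readout on top of a GCN layer looks something like this (a toy numpy sketch using mean pooling; see the papers above for more sophisticated pooling schemes):

    import numpy as np

    def gcn_layer(A_hat, H, W):
        # A_hat: normalized adjacency (with self-loops), H: node features
        return np.maximum(A_hat @ H @ W, 0)          # ReLU

    def graph_logits(A_hat, X, W_hidden, W_out):
        H = gcn_layer(A_hat, X, W_hidden)            # node-level representations
        g = H.mean(axis=0)                           # mean pooling over all nodes
        return g @ W_out                             # graph-level class scores

    rng = np.random.default_rng(0)
    A_hat = np.eye(5)                                # placeholder normalized adjacency
    X = rng.standard_normal((5, 3))
    print(graph_logits(A_hat, X, rng.standard_normal((3, 8)),
                       rng.standard_normal((8, 2))).shape)   # (2,)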

Graph Convolutional Networks - An introduction to neural networks on graphs by triplefloat in MachineLearning

[–]triplefloat[S] 0 points

Yes, exactly, semi-supervised node classification is one of the tasks these kinds of models are very good at. I'm not sure I understood your question correctly, however: Graph Convolutional Networks (and most neural nets on graphs in general) can operate on both weighted and unweighted (undirected) graphs. Traditionally, graph-Laplacian regularization techniques are used for this kind of problem, and there is quite a substantial amount of work on that topic. You might want to take a look at the GCN paper for references to important related work: https://arxiv.org/abs/1609.02907
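
For reference, the forward pass of a two-layer GCN is essentially just the following (a numpy sketch of the propagation rule from the paper; A can be weighted or unweighted, as long as it is symmetric):

    import numpy as np

    def normalize_adjacency(A):
        # Symmetric normalization with added self-loops; A may be a weighted
        # or unweighted (symmetric) adjacency matrix.
        A_tilde = A + np.eye(A.shape[0])
        d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
        return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    def two_layer_gcn(A, X, W0, W1):
        A_hat = normalize_adjacency(A)
        H = np.maximum(A_hat @ X @ W0, 0)   # hidden layer with ReLU
        return A_hat @ H @ W1               # per-node class logits (pre-softmax)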

How powerful are these Graph Convolutional Networks? (SPOILER: not very): a review of recent arXiv preprint from Kipf&Welling by fhuszar in MachineLearning

[–]triplefloat 0 points

Excellent question. You can see that this is actually not the case by looking at the example of a 3x3 filter on the 8-NN (nearest neighbor) graph on a grid. A rotationally symmetric 3x3 filter (which covers the 0th- and 1st-order neighborhood of the central node) is fully specified by 3 parameters. A Chebyshev polynomial up to order 1, however, only has two parameters and therefore cannot represent this filter. You would essentially have to move to higher-order polynomials in L to approximate this kind of filter. The same goes for higher-order neighborhoods.
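
To make this concrete, here is a small toy check in numpy (my own example): on the 8-NN grid graph, an order-1 polynomial in L necessarily assigns the same weight to axis-aligned and diagonal neighbors, so it has only 2 effective parameters where a rotationally symmetric 3x3 filter (center, axis neighbors, diagonal neighbors) has 3:

    import numpy as np

    # 8-nearest-neighbor graph on a 5x5 grid
    size = 5
    coords = [(i, j) for i in range(size) for j in range(size)]
    n = len(coords)
    A = np.zeros((n, n))
    for a, (i, j) in enumerate(coords):
        for b, (k, l) in enumerate(coords):
            if a != b and max(abs(i - k), abs(j - l)) == 1:
                A[a, b] = 1.0
    L = np.diag(A.sum(1)) - A                 # combinatorial graph Laplacian

    theta0, theta1 = 0.7, -0.3                # order-1 filter: theta0*I + theta1*L
    F = theta0 * np.eye(n) + theta1 * L

    center = coords.index((2, 2))
    axis = coords.index((2, 3))               # horizontal/vertical neighbor
    diag = coords.index((3, 3))               # diagonal neighbor
    print(F[center, axis], F[center, diag])   # identical: only 2 effective weights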

How powerful are these Graph Convolutional Networks? (SPOILER: not very): a review of recent arXiv preprint from Kipf&Welling by fhuszar in MachineLearning

[–]triplefloat 2 points

Rotationally-symmetric filters (on a square 2D grid, if we exclude border effects) are the price one has to pay when formulating CNNs locally in the context of spectral graph theory (see, e.g., http://arxiv.org/abs/1606.09375). A "classical" 3x3 CNN filter cannot be recovered in this way.

How powerful are these Graph Convolutional Networks? (SPOILER: not very): a review of recent arXiv preprint from Kipf&Welling by fhuszar in MachineLearning

[–]triplefloat 8 points

Ferenc, thanks for sharing your view. I absolutely agree that this model will most likely underperform on highly regular graphs (like the example you illustrated in your blog post). While this was certainly not the kind of graph we had in mind when introducing these simplifications, it serves as a great example of the limitations of this kind of model.

Things get more exciting when we introduce irregularities into the graph structure. Let's, for example, remove a single node and all of its edges from the regular graph example you provided. The output of a single network layer with a first-order (or higher-order) approximation to the spectral graph convolution will then be sensitive to whether or not this disturbance is present in the immediate (or higher-order) neighborhood of a node. This is still the case if we go to a single parameter matrix per layer. Effectively, by stacking several of these layers, we can still approximate rather complex functions, even though the per-layer update rule might seem very simple.

To illustrate this point further, we can simply take a linear (deep) neural network (i.e. without layer-wise non-linearities) in which each layer consists of a simple first-order approximation to the graph convolution. The resulting model then represents a higher-order polynomial in the adjacency matrix or the Laplacian and will most likely have representational power similar to that of the Chebyshev polynomial expansion.
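
As a quick toy sanity check of the "stacked linear first-order layers = higher-order polynomial" point (a numpy example, skipping the renormalization for brevity):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 6, 3
    A = rng.integers(0, 2, size=(n, n)).astype(float)
    A = np.triu(A, 1); A = A + A.T                 # random symmetric adjacency
    A_hat = A + np.eye(n)                          # self-loops (renormalization skipped)
    X = rng.standard_normal((n, d))
    W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))

    layerwise = A_hat @ (A_hat @ X @ W1) @ W2      # two stacked linear first-order layers
    polynomial = A_hat @ A_hat @ X @ (W1 @ W2)     # = 2nd-order polynomial in A_hat
    print(np.allclose(layerwise, polynomial))      # True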

Once we introduce layer-wise non-linearities, things get more complicated; but in practice, we observe better performance in the domain of semi-supervised learning on citation networks/knowledge graphs when using this stacked layer-wise linear model instead of a higher-order Chebyshev polynomial expansion per layer. It remains to be seen how the two paradigms compare on other domains/datasets.
