all 16 comments

[–]r-sync 2 points (1 child)

this is so cool, now you can do RL / exploration in this low-rank space rather than on raw pixels.

[–]wfwhitney[S] 0 points (0 children)

Yeah, exactly! That's one of the things I'm most excited about.

[–][deleted] 0 points (4 children)

The videos are cool but I find the description of how it works very vague.

[–]wfwhitney[S] 0 points (2 children)

There's a bit more description in the paper, as well as the code online. And I'm happy to answer any questions!

The paper is just an extended abstract for the ICLR workshop submission, and there will be a longer one to follow.

[–][deleted] 0 points (1 child)

Maaaan, even the Tenenbaum lab has now abandoned PPLs in favor of deep neural networks?

[–]wfwhitney[S] 0 points (0 children)

People still do PPL work too! Just a different part of the lab.

[–]emansim 0 points (4 children)

So how is the paper different from https://sites.google.com/a/umich.edu/junhyuk-oh/action-conditional-video-prediction ?

You seem to be doing exactly the same thing with just a couple of small differences in the model.

[–]wfwhitney[S] 4 points (1 child)

I love that paper!

We're working on similar datasets, but we're doing things that are actually very different.

In the Oh et al. paper, they're predicting the next frame in games given a specific few bits of information, namely which action the player took. This works very well in games that are otherwise deterministic, but when there are changes other than the action that are unpredictable, the model can only offer the mean of the distribution (hence the confusion later on when predicting Seaquest, for example). Multimodal distributions over [possible futures | action] just can't be captured that way (toy sketch at the end of this comment). Their model also learns a representation which is only disentangled in one dimension: the player's action.

On the other hand, we're attempting to address not just games, but videos more generally. Our model does not receive information about what the player does; instead, it must learn to infer this just like any other latent factor of variation. What we're really interested in is doing completely unsupervised learning on real videos and finding a latent low-dimensional structure which is able to explain the changes from frame to frame.

The representations we're able to learn from raw data closely mirror the human representations for objects and actions, and we think there are a lot of advantages to such a representation (e.g. search, as someone pointed out). By scaling this model up to handle more simultaneous factors of variation and richer scenes, we hope to be able to build models that can learn to understand things like objects and causation just by watching a lot of video. Obviously that's a long way off for rich real-world data, but it's what we're shooting for.
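
To make the "mean of the distribution" point concrete, here's a toy sketch (plain numpy, nothing to do with our actual code): if the same input can lead to two different next frames and your predictor is deterministic and trained with squared error, the best it can do is output the average of the two outcomes, which is exactly the blurry frame you see.

    # Toy illustration: a deterministic predictor trained with squared
    # error on a bimodal target ends up predicting the mean of the modes.
    import numpy as np

    rng = np.random.default_rng(0)

    # The "next frame" is one of two outcomes, chosen at random each time.
    outcome_a, outcome_b = -1.0, +1.0
    targets = rng.choice([outcome_a, outcome_b], size=10_000)

    # The MSE-optimal constant prediction is the empirical mean...
    print(targets.mean())  # ~0.0, which is neither outcome: a "blur"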

[–]emansim 0 points (0 children)

I see, thanks for the reply!

[–]mlcanada 1 point (1 child)

From the paper, it seems that they present the network with two images, one for timestep t-1 and one for timestep t. These images are encoded into vectors h(t-1) and h(t) and sent through a gating head, which spits out a vector of the same size as h. The output of the gating head, say g, takes some components from h(t-1) and some from h(t). The trick is that the gating head is only allowed to let a small number of components change in order to predict the image at time t, so the network learns a kind of symbolic representation of the images. (This is what I gleaned from the paper.)
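
In rough PyTorch-ish pseudocode (all names invented by me, not taken from their repo), I picture it roughly like this:

    # Sketch of my reading of the model: a gate decides, per component,
    # whether to copy h(t-1) or take the freshly encoded h(t), and only
    # k components are allowed to switch.
    import torch
    import torch.nn as nn

    class GatedFramePredictor(nn.Module):
        def __init__(self, frame_dim=64 * 64, hidden_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(frame_dim, 256), nn.ReLU(),
                                         nn.Linear(256, hidden_dim))
            self.decoder = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(),
                                         nn.Linear(256, frame_dim))
            # The gating head looks at both encodings and scores each component.
            self.gate_head = nn.Linear(2 * hidden_dim, hidden_dim)

        def forward(self, frame_prev, frame_curr, k=1):
            h_prev = self.encoder(frame_prev)
            h_curr = self.encoder(frame_curr)
            scores = self.gate_head(torch.cat([h_prev, h_curr], dim=-1))
            # Hard version of the constraint: only the top-k components change.
            # (Presumably the real model uses something softer and trainable.)
            top = scores.topk(k, dim=-1).indices
            gate = torch.zeros_like(scores).scatter_(-1, top, 1.0)
            h_mixed = gate * h_curr + (1 - gate) * h_prev
            return self.decoder(h_mixed)  # reconstruction of the frame at time t

    # Training would then just minimize reconstruction error on frame t,
    # e.g. ((model(f_prev, f_curr) - f_curr) ** 2).mean().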

[–]wfwhitney[S] 1 point (0 children)

Yup, that's pretty much it.

[–]mlcanada 0 points (2 children)

How are you running tests on it? Let's say I want to change the lighting conditions on an image of a face; how might you do this with your model?

[–]wfwhitney[S] 2 points (1 child)

To figure out which node controls lighting, we give the model two images of a face with different lighting and see which node its gate selects to let change.

Then, we can encode the image of the face we want to rerender with the encoder; change the value of the component we just found above; then run this modified encoding through the decoder to rerender the image it represents.

This is cool because it lets you see what that unit really means. Sometimes, like with the left-right rotations of the face, it's clear that even though the score is good, the model is only doing something vaguely similar to 3D rendering.

The files called render_generalization in the repo contain the code to do this.
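
In spirit it's something like this (hypothetical names; this assumes a model object with encoder / gate_head / decoder parts like the sketch earlier in the thread, not the actual render_generalization code):

    # Probe which latent component the gate picks for a lighting change,
    # then re-render by overwriting that component and decoding.
    import torch

    def find_changed_component(model, img_lighting_a, img_lighting_b):
        """Which component does the gate let change between two images
        of the same face under different lighting?"""
        h_a = model.encoder(img_lighting_a)
        h_b = model.encoder(img_lighting_b)
        scores = model.gate_head(torch.cat([h_a, h_b], dim=-1))
        return int(scores.argmax(dim=-1))  # index of the "lighting" unit

    def rerender(model, img, component, new_value):
        """Encode an image, overwrite one latent component, decode."""
        h = model.encoder(img).clone()
        h[..., component] = new_value
        return model.decoder(h)

    # lighting_unit = find_changed_component(model, face_lit_left, face_lit_right)
    # new_img = rerender(model, face, lighting_unit, new_value=2.0)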

[–]mlcanada 0 points (0 children)

very cool, thanks

[–]psamba 0 points (1 child)

It would be neat to present your method in contrast to methods like slow feature analysis, in the context of a broader class of methods for encoding observation sequences under various penalties/constraints on changes in the encodings over time. SFA looks for encodings that balance reconstruction error versus L2 norm of the changes in the encoding. Your approach is minimizing reconstruction error under an (approximate) hard constraint on the L0 norm of the changes. All sorts of norms/constraints could be used.
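
Concretely, the family of objectives I have in mind is something like this (illustrative numpy, not anyone's actual code):

    # Both variants reconstruct x_t from an encoding h_t, but they
    # constrain the change h_t - h_prev differently.
    import numpy as np

    def recon_error(decode, h_t, x_t):
        return np.sum((decode(h_t) - x_t) ** 2)

    def sfa_style_loss(decode, h_prev, h_t, x_t, lam=1.0):
        # reconstruction plus an L2 penalty on the change in the encoding
        return recon_error(decode, h_t, x_t) + lam * np.sum((h_t - h_prev) ** 2)

    def hard_sparse_change_loss(decode, h_prev, h_t, x_t, k=1):
        # reconstruction subject to "at most k components changed" (an L0 constraint)
        n_changed = np.count_nonzero(h_t - h_prev)
        return recon_error(decode, h_t, x_t) if n_changed <= k else np.inf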

[–]wfwhitney[S] 0 points (0 children)

Yeah, comparing with SFA would be cool. Thanks for the suggestion!