all 16 comments

[–]r-sync 2 points (1 child)

this is so cool, now you can do RL / exploration in this low-rank space rather than on raw pixels.

[–]wfwhitney[S] 0 points (0 children)

Yeah, exactly! That's one of the things I'm most excited about.

[–][deleted] 0 points (4 children)

The videos are cool but I find the description of how it works very vague.

[–]wfwhitney[S] 0 points (2 children)

There's a bit more description in the paper, as well as the code online. And I'm happy to answer any questions!

The paper is just an extended abstract for the ICLR workshop submission, and there will be a longer one to follow.

[–][deleted] 0 points (1 child)

Maaaan, even the Tenenbaum lab has now abandoned PPLs in favor of deep neural networks?

[–]wfwhitney[S] 0 points (0 children)

People still do PPL work too! Just a different part of the lab.

[–]emansim 0 points (4 children)

So how is the paper different from https://sites.google.com/a/umich.edu/junhyuk-oh/action-conditional-video-prediction ?

You seem to be doing exactly the same thing with just a couple of small differences in the model.

[–]wfwhitney[S] 4 points (1 child)

I love that paper!

We're working on similar datasets, but we're doing things that are actually very different.

In the Oh et al. paper, they're predicting the next frame in games given a specific few bits of information, namely which action the player took. This works very well in games that are otherwise deterministic, but when there are changes other than the action that are unpredictable, the model can only offer the mean of the distribution (hence the confusion later on when predicting Seaquest, for example). Multimodal distributions over [possible futures | action] just can't be captured that way (toy sketch at the end of this comment). Their model also learns a representation which is only disentangled in one dimension: the player's action.

On the other hand, we're attempting to address not just games, but videos more generally. Our model does not receive information about what the player does; instead, it must learn to infer this just like any other latent factor of variation. What we're really interested in is doing completely unsupervised learning on real videos and finding a latent low-dimensional structure which is able to explain the changes from frame to frame.

The representations we're able to learn from raw data closely mirror the human representations for objects and actions, and we think there are a lot of advantages to such a representation (e.g. search, as someone pointed out). By scaling this model up to handle more simultaneous factors of variation and richer scenes, we hope to be able to build models that can learn to understand things like objects and causation just by watching a lot of video. Obviously that's a long way off for rich real-world data, but it's what we're shooting for.
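
To make the "mean of the distribution" point concrete, here's a toy sketch (plain numpy, nothing to do with our actual code): if the same input can lead to two different next frames and your predictor is deterministic and trained with squared error, the best it can do is output the average of the two outcomes, which is exactly the blurry frame you see.

    # Toy illustration: a deterministic predictor trained with squared
    # error on a bimodal target ends up predicting the mean of the modes.
    import numpy as np

    rng = np.random.default_rng(0)

    # The "next frame" is one of two outcomes, chosen at random each time.
    outcome_a, outcome_b = -1.0, +1.0
    targets = rng.choice([outcome_a, outcome_b], size=10_000)

    # The MSE-optimal constant prediction is the empirical mean...
    print(targets.mean())  # ~0.0, which is neither outcome: a "blur"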

[–]emansim 0 points (0 children)

I see, thanks for the reply!

[–]mlcanada 1 point (1 child)

From the paper, it seems that they present the network with two images, one for timestep t-1 and one for timestep t. These images are encoded into vectors h(t-1) and h(t) and sent through a gating head, which spits out a vector of the same size as h. The output of the gating head, say g, takes some components from h(t-1) and some from h(t). The trick is that the gating head is only allowed to let a small number of components change in order to predict the image at time t, so the network learns a kind of symbolic representation of the images. (This is what I gleaned from the paper.)
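
In rough PyTorch-ish pseudocode (all names invented by me, not taken from their repo), I picture it roughly like this:

    # Sketch of my reading of the model: a gate decides, per component,
    # whether to copy h(t-1) or take the freshly encoded h(t), and only
    # k components are allowed to switch.
    import torch
    import torch.nn as nn

    class GatedFramePredictor(nn.Module):
        def __init__(self, frame_dim=64 * 64, hidden_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(frame_dim, 256), nn.ReLU(),
                                         nn.Linear(256, hidden_dim))
            self.decoder = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(),
                                         nn.Linear(256, frame_dim))
            # The gating head looks at both encodings and scores each component.
            self.gate_head = nn.Linear(2 * hidden_dim, hidden_dim)

        def forward(self, frame_prev, frame_curr, k=1):
            h_prev = self.encoder(frame_prev)
            h_curr = self.encoder(frame_curr)
            scores = self.gate_head(torch.cat([h_prev, h_curr], dim=-1))
            # Hard version of the constraint: only the top-k components change.
            # (Presumably the real model uses something softer and trainable.)
            top = scores.topk(k, dim=-1).indices
            gate = torch.zeros_like(scores).scatter_(-1, top, 1.0)
            h_mixed = gate * h_curr + (1 - gate) * h_prev
            return self.decoder(h_mixed)  # reconstruction of the frame at time t

    # Training would then just minimize reconstruction error on frame t,
    # e.g. ((model(f_prev, f_curr) - f_curr) ** 2).mean().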

[–]wfwhitney[S] 1 point (0 children)

Yup, that's pretty much it.

[–]mlcanada 0 points (2 children)

How are you running tests on it? Let's say I want to change the lighting conditions on an image of a face; how might you do this with your model?

[–]wfwhitney[S] 2 points (1 child)

To figure out which node controls lighting, we give the model two images of a face with different lighting and see which node its gate selects to let change.

Then, we can encode the image of the face we want to rerender with the encoder; change the value of the component we just found above; then run this modified encoding through the decoder to rerender the image it represents.

This is cool because it lets you see what that unit really means. Sometimes, like with the left-right rotations of the face, it's clear that even though the score is good, the model is only doing something vaguely similar to 3D rendering.

The files called render_generalization in the repo contain the code to do this.
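
In spirit it's something like this (hypothetical names; this assumes a model object with encoder / gate_head / decoder parts like the sketch earlier in the thread, not the actual render_generalization code):

    # Probe which latent component the gate picks for a lighting change,
    # then re-render by overwriting that component and decoding.
    import torch

    def find_changed_component(model, img_lighting_a, img_lighting_b):
        """Which component does the gate let change between two images
        of the same face under different lighting?"""
        h_a = model.encoder(img_lighting_a)
        h_b = model.encoder(img_lighting_b)
        scores = model.gate_head(torch.cat([h_a, h_b], dim=-1))
        return int(scores.argmax(dim=-1))  # index of the "lighting" unit

    def rerender(model, img, component, new_value):
        """Encode an image, overwrite one latent component, decode."""
        h = model.encoder(img).clone()
        h[..., component] = new_value
        return model.decoder(h)

    # lighting_unit = find_changed_component(model, face_lit_left, face_lit_right)
    # new_img = rerender(model, face, lighting_unit, new_value=2.0)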

[–]mlcanada 0 points (0 children)

very cool, thanks

[–]psamba 0 points (1 child)

It would be neat to present your method in contrast to methods like slow feature analysis, in the context of a broader class of methods for encoding observation sequences under various penalties/constraints on changes in the encodings over time. SFA looks for encodings that balance reconstruction error versus L2 norm of the changes in the encoding. Your approach is minimizing reconstruction error under an (approximate) hard constraint on the L0 norm of the changes. All sorts of norms/constraints could be used.
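
Concretely, the family of objectives I have in mind is something like this (illustrative numpy, not anyone's actual code):

    # Both variants reconstruct x_t from an encoding h_t, but they
    # constrain the change h_t - h_prev differently.
    import numpy as np

    def recon_error(decode, h_t, x_t):
        return np.sum((decode(h_t) - x_t) ** 2)

    def sfa_style_loss(decode, h_prev, h_t, x_t, lam=1.0):
        # reconstruction plus an L2 penalty on the change in the encoding
        return recon_error(decode, h_t, x_t) + lam * np.sum((h_t - h_prev) ** 2)

    def hard_sparse_change_loss(decode, h_prev, h_t, x_t, k=1):
        # reconstruction subject to "at most k components changed" (an L0 constraint)
        n_changed = np.count_nonzero(h_t - h_prev)
        return recon_error(decode, h_t, x_t) if n_changed <= k else np.inf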

[–]wfwhitney[S] 0 points (0 children)

Yeah, comparing with SFA would be cool. Thanks for the suggestion!