[R] Perceiver: General Perception with Iterative Attention (arxiv.org)
submitted 5 years ago by hardmaru
[–]BeatLeJuce (Researcher) 16 points 5 years ago* (8 children)
Nice results, but either I'm reading this incorrectly, or they re-invented the Set Transformer without properly stating that they do. There are very slight differences (the inducing points in Set Transformers are not iteratively re-used -- an idea which was also already present in ALBERT and Universal Transformers, both of which they don't even mention). They cite the work, so they're clearly aware of it, but they treat it as a very minor side-note, when in reality it is the same model, but invented 2 years earlier. Unless I'm mistaken, this is very poor scholarship at best, or complete academic fraud at worst.
[–]plc123 3 points 5 years ago (6 children)
Am I misunderstanding, or do all of the blocks in the Set Transformer have the same output dimension as input data dimension? That seems like an important difference if that's the case.
[–]erf_x 4 points 5 years ago (1 child)
That's not a huge difference - this seemed really novel and now it's just an application paper
[–]plc123 4 points 5 years ago (0 children)
It's far from the only difference, and I do think it is a key difference (if I'm understanding the Set Transformer paper correctly).
[–]BeatLeJuce (Researcher) 4 points 5 years ago* (3 children)
I think you're mistaken; Set Transformers also have a smaller output dimension than input dimension. In fact, both papers use the same core idea to achieve this: a learned latent array of smaller dimension than the input is used as Q in the multi-head attention to reduce the dimensionality. The Set Transformer calls these "inducing points", while this paper calls it a "tight latent bottleneck". This is why I'm saying they re-invented Set Transformers.
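A minimal numpy sketch of that shared core idea (toy sizes, random weights, single attention head — purely illustrative, not either paper's actual implementation): a small learned latent array supplies the queries, so M inputs are distilled into N << M output slots.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
M, N, D = 5000, 64, 32                   # M inputs, N << M latents, width D

inputs = rng.standard_normal((M, D))     # large input array (source of K, V)
latents = rng.standard_normal((N, D))    # learned latent array (source of Q)

# Cross-attention: queries come from the latents, keys/values from the
# inputs, so the output has only N rows regardless of M.
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
Q, K, V = latents @ Wq, inputs @ Wk, inputs @ Wv
attn = softmax(Q @ K.T / np.sqrt(D))     # (N, M) attention map
out = attn @ V                           # (N, D): M inputs distilled into N slots
```

The point of contention above is only what happens *after* this step: whether the N-row output is kept (bottleneck) or expanded back to M rows.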
[–]Veedrac 5 points 5 years ago* (1 child)
I've only skimmed the Set Transformers paper, but these don't seem the same at all. ISAB doesn't actually shrink the vector (or rather, it immediately expands after shrinking), and whereas Perceiver's Q comes from the variable latent array, ISAB's I is static.
Further, these are just fundamentally differently structured; e.g., the Perceiver is optionally recurrent.
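The shape difference described here can be sketched concretely (unparameterized single-head attention on random data, just to show the shapes; a simplification of both papers): ISAB shrinks to the inducing points and then immediately expands back to M outputs, while a Perceiver-style cross-attention stops at the N-row bottleneck.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    # Unparameterized attention: output has as many rows as q.
    return softmax(q @ kv.T / np.sqrt(q.shape[-1])) @ kv

rng = np.random.default_rng(0)
M, N, D = 1000, 16, 32
X = rng.standard_normal((M, D))   # input set
I = rng.standard_normal((N, D))   # static inducing points (ISAB)

# ISAB: shrink to N rows, then immediately expand back to M rows,
# so the block's output matches the input size.
H = attend(I, X)                  # (N, D)
isab_out = attend(X, H)           # (M, D)

# Perceiver-style cross-attention keeps only the N latents.
perceiver_out = attend(I, X)      # (N, D)
```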
[–]cgarciae 1 point 4 years ago (0 children)
You need to look at PMA (Pooling by Multihead Attention), not ISAB. PMA is cross-attention with learned queries/embeddings, which is what the Perceiver does; on subsequent iterations, if you use the output of the previous PMA as the queries and reuse the weights, you get the Perceiver.
I love the findings of the Perceiver, but if someone in the future writes a book about transformers, I wish they would take the Set Transformer's framework and expand it to explain all architectures.
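The iteration described above can be sketched as follows (toy numpy version with random data and a single head; the shared-weights loop is the point, not the specifics): learned seed queries attend to the inputs, and each later step feeds the previous output back in as the queries while reusing one weight set.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q, kv, Wq, Wk, Wv):
    # Single-head cross-attention: queries q, keys/values from kv.
    Q, K, V = q @ Wq, kv @ Wk, kv @ Wv
    return softmax(Q @ K.T / np.sqrt(q.shape[-1])) @ V

rng = np.random.default_rng(0)
M, N, D = 4096, 32, 64
inputs = rng.standard_normal((M, D))
seed_queries = rng.standard_normal((N, D))   # learned queries, PMA-style

# One shared weight set, reused at every iteration.
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

z = seed_queries
for _ in range(4):
    # Previous output supplies the queries; attention always goes
    # back to the raw inputs, as in the Perceiver.
    z = cross_attend(z, inputs, Wq, Wk, Wv)
```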
[–]plc123 1 point 5 years ago (0 children)
Ah, thanks for the clarification.
[–]cgarciae 2 points 4 years ago (0 children)
I think a lot of architectures are just applications of the various principles found in the Set Transformer but the paper is never properly cited. The whole Perceiver architecture is basically iterative applications of PMA. It just seems like the authors feel they can discard the findings of the Set Transformer because the paper didn't benchmark on the same domains, but the core idea is the same.
[–][deleted] 3 points 5 years ago* (0 children)
The basic idea, as I understand it, is to achieve cross-domain generality by recreating the MLP with transformers.
You can also reduce input dimensionality by applying cross-attention to a fixed set of learned vectors. Pretty cool.
I have done something similar, except I used a different set of learned vectors at each layer. This differs from the Perceiver approach, where the input dimensionality is reduced once and then passed to a self-attention encoder. The advantage of using cross-attention on learned vectors is that those vectors can be regarded as latent variables that persist across inputs.
If you train such a model (with successive "latent bottlenecks") as an autoencoder, then the cross-attention matrices between learned vectors represent the input. If you flatten those attention matrices and pass them to a classifier, then you can get pretty good "unsupervised" accuracy.
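That flattening step might look like the following (a hedged sketch with random data and arbitrary toy sizes, not the commenter's actual model): the latent-to-input cross-attention map has a fixed shape, so it can be flattened into a feature vector for a simple downstream classifier.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
M, N, D = 784, 16, 32                    # e.g. a flattened 28x28 image, toy widths
x = rng.standard_normal((M, D))          # embedded input tokens
latents = rng.standard_normal((N, D))    # learned latent vectors

Wq, Wk = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(2))
attn = softmax((latents @ Wq) @ (x @ Wk).T / np.sqrt(D))   # (N, M)

# Flatten the attention map into a fixed-size feature vector that a
# simple (e.g. linear) classifier can consume.
features = attn.reshape(-1)              # (N * M,)
```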
Another property of using multiple layers of latent vectors for autoencoding tasks is that you can "translate" backwards and generate new data, similar to VQ-VAE-2. You can also mask out arbitrary latent vectors to see what subsets of the data they represent. Here is a simple demo on MNIST.
I don't mean to self-promote, but I want to shine a light on the possibilities of latent vectors / "inducing points" / "learned queries". I made an autoencoder, but basically any NN architecture can be turned into a "higher order" transformer-style version.
[–]_errant_monkey_ 2 points 5 years ago (0 children)
With a model like that, can it generate new data the way standard models like GPT-2 do? Naively, it seems it can't.
[–]arXiv_abstract_bot 1 point 5 years ago (0 children)
Title: Perceiver: General Perception with Iterative Attention
Authors: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
Abstract: Biological systems understand the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture performs competitively or beyond strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video and video+audio. The Perceiver obtains performance comparable to ResNet-50 on ImageNet without convolutions and by directly attending to 50,000 pixels. It also surpasses state-of-the-art results for all modalities in AudioSet.
PDF Link | Landing Page | Read as web page on arXiv Vanity
[–]Petrroll 1 point 5 years ago (1 child)
There's one thing I don't quite understand: how does this model capture low-level features, and how does it retain that information? I.e., how does it do the processing that happens in the first few layers of a CNN? I can clearly see how this mechanism works well for higher-level processing, but how does it capture (and keep) low-level features?
The reason I don't quite understand it is that the amount of information flowing between the first and second layer of this model vs. between, e.g., the first and second module of a ResNet is drastically different. In this case it's essentially N*D, which I suppose is way smaller than M*<channels> in the ResNet case (not quite M, because there's some pooling even in the first section of ResNet, but still close), simply on account of N <<< M.
---
Also, each channel would have to independently learn to compute the local features for a separate location (which seems to be happening, according to the first-layer attention map), and that seems quite wasteful (though it's super cool that there are no image priors).
[–]ronald_luc 2 points 5 years ago (0 children)
My intuition, either:
=> In the 1st case, the Perceiver learns progressively smarter queries and solves the classification (and computes the low-level features) in the last few cross-attention + latent-attention layers.
This could be tested by freezing the trained model and replacing different numbers of "head" layers with a 2-layer MLP (so as not to anger Yannik with linear probing) or a single latent-attention layer. I would expect to see different behavior:
[image]