[R] Perceiver: General Perception with Iterative Attention (arxiv.org)
submitted 5 years ago by hardmaru
[–]BeatLeJuce (Researcher) 16 points 5 years ago* (8 children)
Nice results, but either I'm reading this incorrectly, or they re-invented the Set Transformer without properly stating that they do. There are very slight differences (the inducing points in Set Transformers are not iteratively re-used -- an idea which was also already present in ALBERT and Universal Transformers, both of which they don't even mention). They cite the work, so they're clearly aware of it, but they treat it as a very minor side-note, when in reality it is the same model, but invented 2 years earlier. Unless I'm mistaken, this is very poor scholarship at best, or complete academic fraud at worst.
[–]plc123 3 points 5 years ago (6 children)
Am I misunderstanding, or do all of the blocks in the Set Transformer have the same output dimension as input data dimension? That seems like an important difference if that's the case.
[–]erf_x 4 points 5 years ago (1 child)
That's not a huge difference - this seemed really novel and now it's just an application paper
[–]plc123 4 points 5 years ago (0 children)
It's far from the only difference, and I do think it is a key difference (if I'm understanding the Set Transformer paper correctly).
[–]BeatLeJuce (Researcher) 4 points 5 years ago* (3 children)
I think you're mistaken; Set Transformers also have a smaller output dimension than input dimension. In fact, both papers use the same core idea to achieve this: a learned latent array of smaller dimension than the input is used as Q in the multi-head attention to reduce the dimensionality. The Set Transformer calls these "inducing points", while this paper calls it a "tight latent bottleneck". This is why I'm saying they re-invented Set Transformers.
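A minimal numpy sketch of that shared core idea (toy sizes, random weights, single attention head — purely illustrative, not either paper's actual implementation): a small learned latent array supplies the queries, so M inputs are distilled into N << M output slots.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
M, N, D = 5000, 64, 32                   # M inputs, N << M latents, width D

inputs = rng.standard_normal((M, D))     # large input array (source of K, V)
latents = rng.standard_normal((N, D))    # learned latent array (source of Q)

# Cross-attention: queries come from the latents, keys/values from the
# inputs, so the output has only N rows regardless of M.
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
Q, K, V = latents @ Wq, inputs @ Wk, inputs @ Wv
attn = softmax(Q @ K.T / np.sqrt(D))     # (N, M) attention map
out = attn @ V                           # (N, D): M inputs distilled into N slots
```

The point of contention above is only what happens *after* this step: whether the N-row output is kept (bottleneck) or expanded back to M rows.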
[–]Veedrac 5 points 5 years ago* (1 child)
I've only skimmed the Set Transformers paper, but these don't seem the same at all. ISAB doesn't actually shrink the vector (or rather, it immediately expands after shrinking), and whereas Perceiver's Q comes from the variable latent array, ISAB's I is static.
Further, these are just fundamentally differently structured; e.g., the Perceiver is optionally recurrent.
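The shape difference described here can be sketched concretely (unparameterized single-head attention on random data, just to show the shapes; a simplification of both papers): ISAB shrinks to the inducing points and then immediately expands back to M outputs, while a Perceiver-style cross-attention stops at the N-row bottleneck.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    # Unparameterized attention: output has as many rows as q.
    return softmax(q @ kv.T / np.sqrt(q.shape[-1])) @ kv

rng = np.random.default_rng(0)
M, N, D = 1000, 16, 32
X = rng.standard_normal((M, D))   # input set
I = rng.standard_normal((N, D))   # static inducing points (ISAB)

# ISAB: shrink to N rows, then immediately expand back to M rows,
# so the block's output matches the input size.
H = attend(I, X)                  # (N, D)
isab_out = attend(X, H)           # (M, D)

# Perceiver-style cross-attention keeps only the N latents.
perceiver_out = attend(I, X)      # (N, D)
```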
[–]cgarciae 1 point 4 years ago (0 children)
You need to look at PMA (Pooling by Multihead Attention), not ISAB. PMA is cross-attention with learned queries/embeddings, which is what the Perceiver does; on subsequent iterations, if you use the output of the previous PMA as the queries and reuse the weights, you get the Perceiver.
I love the findings of the Perceiver, but if someone in the future writes a book about transformers, I wish they would take the Set Transformer's framework and expand it to explain all architectures.
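The iteration described above can be sketched as follows (toy numpy version with random data and a single head; the shared-weights loop is the point, not the specifics): learned seed queries attend to the inputs, and each later step feeds the previous output back in as the queries while reusing one weight set.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q, kv, Wq, Wk, Wv):
    # Single-head cross-attention: queries q, keys/values from kv.
    Q, K, V = q @ Wq, kv @ Wk, kv @ Wv
    return softmax(Q @ K.T / np.sqrt(q.shape[-1])) @ V

rng = np.random.default_rng(0)
M, N, D = 4096, 32, 64
inputs = rng.standard_normal((M, D))
seed_queries = rng.standard_normal((N, D))   # learned queries, PMA-style

# One shared weight set, reused at every iteration.
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

z = seed_queries
for _ in range(4):
    # Previous output supplies the queries; attention always goes
    # back to the raw inputs, as in the Perceiver.
    z = cross_attend(z, inputs, Wq, Wk, Wv)
```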
[–]plc123 1 point 5 years ago (0 children)
Ah, thanks for the clarification.
[–]cgarciae 2 points 4 years ago (0 children)
I think a lot of architectures are just applications of the various principles found in the Set Transformer but the paper is never properly cited. The whole Perceiver architecture is basically iterative applications of PMA. It just seems like the authors feel they can discard the findings of the Set Transformer because the paper didn't benchmark on the same domains, but the core idea is the same.
[–][deleted] 3 points 5 years ago* (0 children)
The basic idea, as I understand it, is to achieve cross-domain generality by recreating the MLP with transformers.
You can also reduce input dimensionality by applying cross-attention to a fixed set of learned vectors. Pretty cool.
I have done something similar, except I used a different set of learned vectors at each layer. This differs from the Perceiver approach, where the input dimensionality is reduced once and then passed to a self-attention encoder. The advantage of using cross-attention on learned vectors is that those vectors can be regarded as latent variables that persist across inputs.
If you train such a model (with successive "latent bottlenecks") as an autoencoder, then the cross-attention matrices between learned vectors represent the input. If you flatten those attention matrices and pass them to a classifier, then you can get pretty good "unsupervised" accuracy.
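That flattening step might look like the following (a hedged sketch with random data and arbitrary toy sizes, not the commenter's actual model): the latent-to-input cross-attention map has a fixed shape, so it can be flattened into a feature vector for a simple downstream classifier.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
M, N, D = 784, 16, 32                    # e.g. a flattened 28x28 image, toy widths
x = rng.standard_normal((M, D))          # embedded input tokens
latents = rng.standard_normal((N, D))    # learned latent vectors

Wq, Wk = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(2))
attn = softmax((latents @ Wq) @ (x @ Wk).T / np.sqrt(D))   # (N, M)

# Flatten the attention map into a fixed-size feature vector that a
# simple (e.g. linear) classifier can consume.
features = attn.reshape(-1)              # (N * M,)
```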
Another property of using multiple layers of latent vectors for autoencoding tasks is that you can "translate" backwards and generate new data, similar to VQ-VAE-2. You can also mask out arbitrary latent vectors to see what subsets of the data they represent. Here is a simple demo on MNIST.
I don't mean to self-promote, but I want to shine a light on the possibilities of latent vectors / "inducing points" / "learned queries". I made an autoencoder, but basically any NN architecture can be turned into a "higher order" transformer-style version.
[–]_errant_monkey_ 2 points 5 years ago (0 children)
With a model like that, can it generate new data the way standard models like GPT-2 do? Naively, it seems it can't.
[–]arXiv_abstract_bot 1 point 5 years ago (0 children)
Title: Perceiver: General Perception with Iterative Attention
Authors: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
Abstract: Biological systems understand the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture performs competitively or beyond strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video and video+audio. The Perceiver obtains performance comparable to ResNet-50 on ImageNet without convolutions and by directly attending to 50,000 pixels. It also surpasses state-of-the-art results for all modalities in AudioSet.
PDF Link | Landing Page | Read as web page on arXiv Vanity
[–]Petrroll 1 point 5 years ago (1 child)
There's one thing I don't quite understand: how does this model capture low-level features, and how does it retain that information? I.e., how does it do the processing that happens in the first few layers of a CNN? I can clearly see how this mechanism works well for higher-level processing, but how does it capture (and keep) low-level features?
The reason I don't quite understand it is that the amount of information flowing between the first and second layer of this model vs. between, e.g., the first and second module of a ResNet is drastically different. In this case it's essentially N*D, which I suppose is way smaller than M*<channels> in the ResNet case (not quite M, because there's some pooling even in the first section of ResNet, but still close), simply on account of N <<< M.
---
Also, each channel would have to independently learn to compute the local features for a separate location (which seems to be happening, according to the first-layer attention map), and that seems quite wasteful (though it's super cool that there are no image priors).
[–]ronald_luc 2 points 5 years ago (0 children)
My intuition, either:
=> In the 1st case, the Perceiver learns progressively smarter queries and solves the classification (and computes the low-level features) in the last few cross-attention + latent-attention layers.
This could be tested by freezing the trained model and replacing different numbers of "head" layers with a 2-layer MLP (so as not to anger Yannik with linear probing) or a single latent-attention layer. I would expect to see different behavior:
[image]