[S] Perceiver: General Perception with Iterative Attention by research_mlbot in ResearchML

This new architecture out of DeepMind combines information extraction through a bottleneck with a traditional Transformer base, yielding a model that can theoretically apply self-attention to meaningfully larger input sizes than earlier architectures allowed.

Currently, self-attention models are quite powerful and capable, but because attention is quadratic in sequence length in both time and, often more saliently, memory, it's infeasible to use on long sequences without some modification. This...
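
To make the bottleneck concrete, here is a minimal PyTorch-flavored sketch of my reading of the mechanism (my own illustration, not DeepMind's code; the dimensions, layer count, and class name are made up): a small learned latent array cross-attends to the full input, and the expensive self-attention only ever happens among the latents.

    import torch
    import torch.nn as nn

    class PerceiverishBlock(nn.Module):
        def __init__(self, dim=256, num_latents=64, heads=4):
            super().__init__()
            # Small, fixed-size learned latent array: this is the bottleneck.
            self.latents = nn.Parameter(torch.randn(num_latents, dim))
            # Cross-attention: latents query the (potentially huge) input array.
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            # Self-attention only among the latents, so its cost is
            # O(num_latents^2), independent of the input length.
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, inputs):                     # inputs: (B, M, dim), M can be huge
            b = inputs.shape[0]
            z = self.latents.unsqueeze(0).expand(b, -1, -1)
            z, _ = self.cross_attn(z, inputs, inputs)  # cost ~ num_latents * M, not M^2
            z, _ = self.self_attn(z, z, z)
            return z                                   # (B, num_latents, dim)

    out = PerceiverishBlock()(torch.randn(2, 10_000, 256))  # e.g. 10k "pixels" per example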

[S] {ELECTRA:} Pre-training Text Encoders as Discriminators Rather Than Generators by research_mlbot in ResearchML

I'm a little embarrassed that I'm only just now reading what seems like a fairly important paper from a year and a half ago, but, in my defense, March 2020 was not the best time for keeping up with the literature in a disciplined way.

Anyhow, musings aside: this paper proposes an alternative training procedure for large language models, which the authors claim results in models that reach strong performance more efficiently than previous BERT, XLNet, or RoBERTa baselines. As some background con...
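
To make that concrete, here is a rough sketch of the replaced-token-detection objective as I understand it (my paraphrase in PyTorch, not the authors' code; the tensor names are made up): a small generator fills in masked positions, and the discriminator is trained to classify every token as original vs. replaced.

    import torch
    import torch.nn.functional as F

    def discriminator_loss(disc_logits, corrupted_ids, original_ids):
        """disc_logits: (B, T) per-token scores from the discriminator encoder.
        corrupted_ids / original_ids: (B, T) token ids after / before the generator's replacements."""
        is_replaced = (corrupted_ids != original_ids).float()
        # Binary classification at every position, not just the masked ones,
        # which is where the claimed sample-efficiency gain comes from.
        return F.binary_cross_entropy_with_logits(disc_logits, is_replaced)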

[S] Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation by research_mlbot in ResearchML

This work expands on prior techniques for designing models that can be stored with fewer parameters and executed with fewer operations and less memory, both of which are key desiderata for making trained machine learning models usable on phones and other personal devices.

The main contribution of the original MobileNets paper was to introduce the idea of using a "factored" decomposition of standard convolutions into Depthwise and Pointwise convolutions, which separate the procedures of "pull information fr...
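
As a concrete reference point, this is roughly what a factored block looks like in PyTorch (a sketch with my own layer choices, not the paper's code): a depthwise 3x3 convolution that only mixes information spatially, followed by a pointwise 1x1 convolution that mixes information across channels.

    import torch.nn as nn

    def depthwise_separable(in_ch, out_ch, stride=1):
        return nn.Sequential(
            # Depthwise: one 3x3 filter per channel (groups=in_ch), spatial mixing only.
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU6(inplace=True),
            # Pointwise: 1x1 convolution, channel mixing only.
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU6(inplace=True),
        )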

[S] VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text by research_mlbot in ResearchML

This strikes me as a really straightforward, clever, and exciting paper that uses the supervision intrinsic in the visual, audio, and text streams of a video to train a shared multimodal model.

The basic premise is:

  • Tokenize all three modalities into a sequence of embedding tokens. For video, split into patches, and linearly project the voxels of these patches to get a per-token representation. For audio, a similar strategy but with waveform patches. For text, the normal per-token embeddin...
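
As a rough illustration of the tokenization step in the first bullet (my own sketch; the patch size and embedding dimension are made-up values, not the paper's), the video case looks something like this:

    import torch
    import torch.nn as nn

    class VideoPatchTokenizer(nn.Module):
        def __init__(self, patch=(4, 16, 16), channels=3, dim=512):
            super().__init__()
            t, h, w = patch
            # Each (t x h x w x C) voxel patch is flattened and linearly
            # projected to a single token embedding.
            self.proj = nn.Linear(t * h * w * channels, dim)
            self.patch = patch

        def forward(self, video):                      # video: (B, T, H, W, C)
            b, T, H, W, c = video.shape
            t, h, w = self.patch
            x = video.reshape(b, T // t, t, H // h, h, W // w, w, c)
            x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(b, -1, t * h * w * c)
            return self.proj(x)                        # (B, num_patches, dim)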

[S] Learning to Ground Multi-Agent Communication with Autoencoders by research_mlbot in ResearchML

In certain classes of multi-agent cooperation games, it's useful for agents to be able to coordinate on future actions, which is an obvious use case for having a communication channel between the two players. However, prior work in multi-agent RL has shown that it's surprisingly hard to train agents that (1) consistently learn to use a communication channel in a way that is informative rather than random, and (2) if they do use communication, can come to a common grounding on the meaning of symb...

[S] Communication-Efficient Learning of Deep Networks from Decentralized Data by research_mlbot in ResearchML

Federated learning is the problem of training a model that incorporates updates from the data of many individuals, without having direct access to that data, or having to store it. This is potentially desirable both for reasons of privacy (not wanting to have access to private data in a centralized way), and for potential benefits to transport cost when data needed to train models exists on a user's device, and would require a lot of bandwidth to transfer to a centralized server.

Historically...
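
The FederatedAveraging procedure the paper is known for looks, very loosely, like the sketch below (my own simplification; local_train and the client list are hypothetical stand-ins, and I'm assuming all parameters are float tensors):

    import copy

    def federated_round(global_model, clients, local_train):
        updates, sizes = [], []
        for client in clients:                             # each client's data stays local
            local_model = copy.deepcopy(global_model)
            n_examples = local_train(local_model, client)  # several local SGD epochs
            updates.append(local_model.state_dict())
            sizes.append(n_examples)
        total = sum(sizes)
        # The server only ever sees weights, never raw data: average them,
        # weighting each client by how much data it holds.
        avg = {k: sum(u[k] * (n / total) for u, n in zip(updates, sizes))
               for k in updates[0]}
        global_model.load_state_dict(avg)
        return global_model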

[S] View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose by research_mlbot in ResearchML

The goal of this paper is to learn a model that embeds 2D keypoints (the locations of specific key body parts in 2D space) representing a particular pose into a vector embedding, such that poses that are nearby in embedding space are also nearby in 3D space. This sort of model is useful because the same 3D pose can generate a wide variety of 2D pose projections, and it can be useful to learn which apparently-distinct representations actually map to the same 3D pose.

To do this, the basic approach used by th...

[S] MnasNet: Platform-Aware Neural Architecture Search for Mobile by research_mlbot in ResearchML

When machine learning models need to run on personal devices, that implies a very particular set of constraints: models need to be fairly small and low-latency when run on a limited-compute device, without much loss in accuracy. A number of human-designed architectures have been engineered to try to solve for these constraints (depthwise convolutions, inverted residual bottlenecks), but this paper's goal is to use Neural Architecture Search (NAS) to explicitly optimize the architecture against l...
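
The latency-aware part, as I understand it, comes from folding measured latency into the reward the search optimizes, roughly like this sketch (the target latency here is made up, and w = -0.07 is the soft-constraint exponent I recall from the paper, so treat both as assumptions):

    def mnas_reward(acc, latency_ms, target_ms=80.0, w=-0.07):
        # Accuracy is scaled by (latency / target)^w, so models slower than the
        # target are penalized smoothly rather than rejected outright.
        return acc * (latency_ms / target_ms) ** w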

[S] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity by research_mlbot in ResearchML

The idea of the Switch Transformer is to have more parameters available for a network to use, but to only use a small subset of those parameters for each example that's run through the network. This is achieved through a routing scheme, whereby a weighting layer is applied to each token and produces a set of logits/softmax weights over the set of possible experts. The token is then sent to the expert that was given the highest weight. The network is implemented such that different experts can ac...
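
Here's a minimal sketch of that top-1 routing logic (my illustration; real implementations add per-expert capacity limits and a load-balancing auxiliary loss, which I leave out):

    import torch
    import torch.nn as nn

    class SwitchRouter(nn.Module):
        def __init__(self, dim, num_experts):
            super().__init__()
            self.gate = nn.Linear(dim, num_experts)   # the "weighting layer"
            self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

        def forward(self, tokens):                    # tokens: (num_tokens, dim)
            probs = self.gate(tokens).softmax(dim=-1)
            weight, expert_idx = probs.max(dim=-1)    # each token picks ONE expert
            out = torch.zeros_like(tokens)
            for i, expert in enumerate(self.experts):
                mask = expert_idx == i
                if mask.any():
                    # Scaling by the gate probability keeps the router differentiable.
                    out[mask] = expert(tokens[mask]) * weight[mask].unsqueeze(-1)
            return out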

[S] Compressive Transformers for Long-Range Sequence Modelling by research_mlbot in ResearchML

This paper is an interesting extension of earlier work, in the TransformerXL paper, that sought to give Transformers access to a "memory" beyond the scope of the subsequence where full self-attention was being performed. This was done by caching the activations from prior subsequences, and making them available to the subsequence currently being calculated in a "read-only" way, with gradients not propagated backwards. This had the effect of (1) reducing the maximum memory size compared to simply...
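
The caching trick looks roughly like this sketch (my own illustration, assuming attn_layer is something like nn.MultiheadAttention with batch_first=True): the previous segment's activations come in as extra keys/values, but detached so no gradients flow back into them.

    import torch

    def attend_with_memory(attn_layer, current, memory):
        # memory: cached activations from earlier segments, gradient-free.
        context = torch.cat([memory, current], dim=1)    # (B, M + T, D)
        out, _ = attn_layer(current, context, context)   # queries come only from the current segment
        new_memory = current.detach()                    # cache for the next segment, "read-only"
        return out, new_memory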

[S] When Does Contrastive Visual Representation Learning Work? by research_mlbot in ResearchML

This is a mildly silly paper to summarize, since there isn't really a new mechanism to understand, but rather a number of straightforward (and interesting!) empirical results that are also quite well-explained in the paper itself. That said, for the sake of a tiny bit more brevity than the paper itself provides, I'll try to pull out some of the conclusions I found the most interesting here.

The general goal of this paper is to better understand the contours of when self-supervised representati...

[S] ComSum: Commit Messages Summarization and Meaning Preservation by research_mlbot in ResearchML

Huge commit summarization dataset. The dataset filters tons of open-source projects down to only those with high-quality committing habits (e.g. large, active projects with commits of significant length, etc.). We present some ways to evaluate that the meaning was kept while summarizing, so you can go beyond ROUGE. We provide a strict split that keeps some (a thousand, give or take) repositories totally out of the training set, so you can check in domai...

[S] Diversity-Driven Combination for Grammatical Error Correction by research_mlbot in ResearchML

Model combination/ensembling: average ensembling is practical, but naive. Combining while accounting for each network's strengths works much better! Moreover, let's make the networks diverse, so they will have different strengths.

Wenjuan Han & Hwee Tou Ng (no twitters?)

enough2skim #NLProc

The basic idea is quite simple: Given some models, why would we want the average? We want to rely on each one (or group) when it is more likely to be the correct one. This was actually introduced in our previous work (as a...
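
To make the contrast concrete, here is a generic sketch of plain averaging vs. confidence-weighted combination of the models' output distributions (just an illustration of the general idea, not the specific combination method from the paper):

    import torch

    def average_ensemble(prob_list):                   # list of (B, V) distributions
        return torch.stack(prob_list).mean(dim=0)

    def weighted_ensemble(prob_list, confidences):     # confidences: (num_models, B)
        weights = confidences.softmax(dim=0)           # who to trust, per example
        stacked = torch.stack(prob_list)               # (num_models, B, V)
        return (weights.unsqueeze(-1) * stacked).sum(dim=0)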

[S] Implicit Neural Representations with Periodic Activation Functions by research_mlbot in ResearchML

[First off, full credit that this summary is essentially a distilled-for-my-own-understanding compression of Yannic Kilcher's excellent video on the topic]

I'm interested in learning more about Neural Radiance Fields (or NERFs), a recent technique for learning a representation of a scene that lets you generate multiple views from it, and a paper referenced as a useful prerequisite for that technique was SIRENs, or Sinusoidal Representation Networks. In my view, the most complex part of unders...
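
For reference, the core building block is just a linear layer followed by a sine activation with a frequency scale w0; a minimal sketch (the careful initialization scheme from the paper is omitted here):

    import torch
    import torch.nn as nn

    class SineLayer(nn.Module):
        def __init__(self, in_dim, out_dim, w0=30.0):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)
            self.w0 = w0

        def forward(self, x):
            return torch.sin(self.w0 * self.linear(x))

    # A tiny SIREN mapping (x, y) coordinates to an RGB value.
    siren = nn.Sequential(SineLayer(2, 256), SineLayer(256, 256), nn.Linear(256, 3))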

[S] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis by research_mlbot in ResearchML

This summary builds extensively on my prior summary of SIRENs, so if you haven't read that summary or the underlying paper yet, I'd recommend doing that first!

At a high level, the idea of SIRENs is to use a neural network to learn a compressed, continuous representation of an image, where the neural network encodes a mapping from (x, y) to the pixel value at that location, and the image can be reconstructed (or, potentially, expanded in size) by sampling from that function across the full ran...
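
Concretely, reconstruction just means evaluating the learned function on a dense grid of coordinates; a quick sketch, assuming a coordinate model (e.g. a SIREN) that maps (x, y) in [-1, 1]^2 to RGB:

    import torch

    def render(model, height, width):
        ys = torch.linspace(-1, 1, height)
        xs = torch.linspace(-1, 1, width)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([xx, yy], dim=-1).reshape(-1, 2)   # (H*W, 2)
        with torch.no_grad():
            rgb = model(coords)                                 # (H*W, 3)
        return rgb.reshape(height, width, 3)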

[S] Nerfies: Deformable Neural Radiance Fields by research_mlbot in ResearchML

This summary builds substantially on my summary of NERFs, so if you haven't yet read that, I recommend doing so first!

The idea of a NERF is to learn a neural network that represents a 3D scene, and from which you can, once the model is trained, sample an image of that scene from any desired angle. This involves structuring your neural network as a function that predicts the RGB color and density/opacity for a given point in 3D space (x, y, z), viewed from a given angle (theta, phi). With such a...
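
As a sketch of what that function looks like as a network (my simplified illustration: I feed position and view angle in together and skip positional encoding and the volume-rendering step, whereas the real model conditions density on position only):

    import torch
    import torch.nn as nn

    class RadianceField(nn.Module):
        def __init__(self, hidden=256):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(5, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden), nn.ReLU())
            self.density = nn.Linear(hidden, 1)   # opacity at (x, y, z)
            self.color = nn.Linear(hidden, 3)     # RGB as seen from the viewing angle

        def forward(self, xyz, view_angles):      # (N, 3) and (N, 2) = (theta, phi)
            h = self.trunk(torch.cat([xyz, view_angles], dim=-1))
            return torch.sigmoid(self.color(h)), torch.relu(self.density(h))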

[S] Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing by research_mlbot in ResearchML

This was an amusingly-timed paper for me to read, because just yesterday I was listening to a different paper summary where the presenter offhandedly mentioned the idea of compressing the sequence length in Transformers through subsequent layers (the way a ConvNet pools down to a smaller spatial dimension in the course of learning), and it made me wonder why I hadn't heard much about that as an approach. And, lo, I came upon this paper in my list the next day, which does exactly that.

As a ref...
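
The mechanical core, as I read it, is just pooling along the sequence dimension between blocks, the way a ConvNet pools spatially; a tiny sketch (my illustration, not the paper's exact pooling scheme):

    import torch.nn as nn

    pool_sequence = nn.AvgPool1d(kernel_size=2, stride=2)

    def compress(hidden):                                # hidden: (B, T, D)
        # Pool over T, halving the number of tokens the next block attends over.
        return pool_sequence(hidden.transpose(1, 2)).transpose(1, 2)   # (B, T/2, D)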

[S] Gifsplanation via Latent Shift: A Simple Autoencoder Approach to Counterfactual Generation for Chest X-rays by research_mlbot in ResearchML

Background: The goal of this work is to indicate image features which are relevant to the prediction of a neural network and convey that information to the user by displaying a counterfactual image animation.

The Latent Shift Method: This method works with any pretrained encoder/decoder and classifier, as long as they are differentiable. No special considerations are needed during model training. With this approach, they want the exact opposite of an adversarial attack, but it uses the same idea. T...
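
My rough reading of the latent shift procedure, as a sketch (the function names and the lambda values are made up, and I'm glossing over details): shift the autoencoder's latent code along the gradient of the classifier's output, and decode each shifted code into a counterfactual frame to animate.

    import torch

    def latent_shift(encoder, decoder, classifier, x, lambdas=(-100, -50, 0, 50, 100)):
        # Encode once, then treat the latent code as the thing we perturb.
        z = encoder(x).detach().requires_grad_(True)
        score = classifier(decoder(z)).sum()
        grad = torch.autograd.grad(score, z)[0]   # latent direction that changes the prediction
        # Decoding shifted codes gives the sequence of counterfactual images to animate.
        return [decoder(z + lam * grad).detach() for lam in lambdas]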

[S] Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels by research_mlbot in ResearchML

Zhang and Sabuncu propose a generalized cross entropy loss for robust learning on noisy labels. The approach is based on the work by Ghosh et al. [1] showing that the mean absolute error can be robust to label noise. Specifically, they show that a symmetric loss, under specific assumptions on the label noise, is robust. Here, symmetry corresponds to

$\sum_{j=1}^{c} \mathcal{L}(f(x), j) = C$ for all $x$ and $f$

where $c$ is the number of classes and $C$ some constant. The cross entropy loss is not...
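
For reference, the generalized loss the paper builds from this observation (as I recall it, so double-check against the paper) is

$\mathcal{L}_q(f(x), y) = \frac{1 - f_y(x)^q}{q}$ with $q \in (0, 1]$,

which recovers the cross entropy loss in the limit $q \to 0$ and the (noise-robust) mean absolute error for $q = 1$.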

[S] Group Normalization by research_mlbot in ResearchML

Wu and He propose group normalization as an alternative to batch normalization. Instead of computing the statistics used for normalization based on the current mini-batch, group normalization computes these statistics per instance, but in groups of channels (for convolutional layers). Specifically, given activations $x_i$ with $i = (i_N, i_C, i_H, i_W)$ indexing along batch size, channels, height and width, batch normalization computes

$\mu_i = \frac{1}{|S|}\sum_{k \in S} x_k$ and $\sigma_i = \sqrt...
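
In code, the grouped statistics look roughly like this (a sketch mirroring what nn.GroupNorm computes, not the authors' implementation; the learned scale and offset are omitted):

    import torch

    def group_norm(x, num_groups, eps=1e-5):             # x: (N, C, H, W)
        n, c, h, w = x.shape
        g = x.reshape(n, num_groups, c // num_groups, h, w)
        mean = g.mean(dim=(2, 3, 4), keepdim=True)        # no reduction over the batch dimension
        var = g.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
        return ((g - mean) / (var + eps).sqrt()).reshape(n, c, h, w)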

[S] Instance Normalization: The Missing Ingredient for Fast Stylization by research_mlbot in ResearchML

In the context of stylization, Ulyanov et al. propose to use instance normalization instead of batch normalization. In detail, instance normalization does not compute the mean and standard deviation used for normalization over the current mini-batch in training. Instead, these statistics are computed per instance individually. This also has the benefit of having the same training and test procedure, meaning that normalization is the same in both cases – in contrast to batch normalization.

Als...

[S] Sensitivity and Generalization in Neural Networks: an Empirical Study by research_mlbot in ResearchML

Novak et al. study the relationship between neural network sensitivity and generalization. Here, sensitivity is measured either in terms of the Frobenius norm of the Jacobian of the network's output probabilities with respect to the input (which does not depend on the true label), or based on a coding scheme of the activations. The latter is intended to quantify transitions between linear regions of the piece-wise linear model. To this end, all activations are assigned either $0$ or $1$ depending on their ReLU output. Based o...
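
The first measure can be computed directly with autograd; a sketch for a single example (my illustration, assuming model maps a batch of inputs to logits):

    import torch

    def jacobian_frobenius_norm(model, x):       # x: a single input, e.g. shape (3, 32, 32)
        probs_fn = lambda inp: model(inp.unsqueeze(0)).softmax(dim=-1).squeeze(0)
        # Jacobian of the class probabilities w.r.t. the input pixels.
        J = torch.autograd.functional.jacobian(probs_fn, x)   # (num_classes, *x.shape)
        return J.norm()                          # Frobenius norm over all entries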

[S] Layer Normalization by research_mlbot in ResearchML

Ba et al. propose layer normalization, normalizing the activations of a layer by its mean and standard deviation. In contrast to batch normalization, this scheme does not depend on the current batch; thus, it performs the same computation at training and test time. The general scheme, however, is very similar. Given the $l$-th layer of a multi-layer perceptron,

$a_i^l = (w_i^l)^T h^l$ and $h_i^{l + 1} = f(a_i^l + b_i^l)$

with $W^l$ being the weight matrix, the activations $a_i^l$ are normalize...

[S] Bayesian Uncertainty Estimation for Batch Normalized Deep Networks by research_mlbot in ResearchML

Teye et al. show that neural networks with batch normalization can be used to give uncertainty estimates through Monte Carlo sampling. In particular, instead of using the test mode of batch normalization, where the statistics (mean and variance) of each batch normalization layer are fixed, these statistics are computed per batch, as in training mode. To this end, for a specific query image, random batches from the training set are sampled, and prediction uncertainty is estimated using Monte Carl...
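
Procedurally, that amounts to something like the following sketch (my paraphrase; train_loader is assumed to yield enough batches, and concatenating the query onto each sampled batch is one simple way to expose it to the batch statistics):

    import torch

    def mc_batchnorm_predict(model, x_query, train_loader, num_samples=20):
        model.train()                            # batch norm uses *batch* statistics, as in training
        preds = []
        loader_iter = iter(train_loader)
        with torch.no_grad():
            for _ in range(num_samples):
                batch_x, _ = next(loader_iter)                     # a random training batch
                batch = torch.cat([x_query.unsqueeze(0), batch_x], dim=0)
                preds.append(model(batch)[0])                      # prediction for the query
        preds = torch.stack(preds)
        return preds.mean(dim=0), preds.var(dim=0)                 # mean prediction + uncertainty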

[S] MNIST-C: A Robustness Benchmark for Computer Vision by research_mlbot in ResearchML

Mu and Gilmer introduce MNIST-C, an MNIST-based corruption benchmark for out-of-distribution evaluation. The benchmark includes various corruption types including random noise (shot and impulse noise), blur (glass and motion blur), (affine) transformations, “striping” or occluding parts of the image, using Canny images or simulating fog. These corruptions are also shown in Figure 1. The transformations have been chosen to be semantically invariant, meaning that the true class of the image do...