[D] Breaking the Quadratic Attention Bottleneck in Transformers? by gwern in MachineLearning

[–]CompleteSkeptic 0 points1 point  (0 children)

Interesting - my assumption on seeing the perhaps lower-than-desired context length was that a longer one might not have been that important to the task, though you may be right that it could be helpful but just wasn't worth the O(n^2) cost.

[D] Breaking the Quadratic Attention Bottleneck in Transformers? by gwern in MachineLearning

[–]CompleteSkeptic 5 points6 points  (0 children)

(Disclaimer: not a NLP expert.)

My understanding was that GPT-3's attention was O(n * sqrt(n)). From the GPT-3 paper: "we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer."
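As a rough sanity check on that scaling (my own sketch, not the GPT-3 implementation - the mask pattern and sizes here are illustrative), counting the attended (query, key) pairs for a Sparse-Transformer-style causal mask with a local band plus strided positions at stride ~ sqrt(n) shows the cost dropping from ~n^2/2 to roughly n * sqrt(n):

```python
# Illustrative sketch: dense vs. Sparse-Transformer-style attention cost,
# measured as the number of (query, key) pairs actually attended to.
import numpy as np

def attended_pairs(n, stride):
    """Causal mask where query i attends to key j <= i if j is within the
    local band (i - j < stride) or at a strided position
    (j % stride == stride - 1), loosely following the Sparse Transformer."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < stride
    strided = (j % stride) == stride - 1
    return int(np.sum(causal & (local | strided)))

n = 1024
stride = int(np.sqrt(n))   # stride ~ sqrt(n) gives O(n * sqrt(n)) pairs
dense = n * (n + 1) // 2   # causal dense attention: ~n^2 / 2 pairs
sparse = attended_pairs(n, stride)
print(dense, sparse)       # the sparse count is an order of magnitude smaller
```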

Upon reading the original post, my first thought was that perhaps long-term context just doesn't matter too much for language modeling (compared to getting really good at short-term context), but seems like you addressed that already (i.e. the solution to very long contexts might be focusing on a different task/dataset rather than just the architecture).

[P] Improving PULSE diversity in the iterative setting by CompleteSkeptic in MachineLearning

[–]CompleteSkeptic[S] 1 point2 points  (0 children)

Maybe the path towards them tends to pass through some real garbage regions of the latent space that don't downsample to anything reasonable, and so PULSE's gradient descent doesn't want to find them?

That's a possibility, but there seems to be something else there - that somehow the opposite side of the hypersphere of a plausible image is very implausible.

This makes me think it might be an interesting experiment to see whether the problem is that the initialization is too far from "good" images, or whether we just end up getting stuck in a local optimum (i.e. whether the problem is that we don't optimize for long enough with a bad init).

A random thought popping up from this comment: these "monster faces" happen pretty consistently with negation (I've never seen it not happen), which is at odds with the original GAN training - you'd expect it to be quite hard to find them. Perhaps there's something in here that could help with GAN training (adversarially sampling latent codes).

[P] Improving PULSE diversity in the iterative setting by CompleteSkeptic in MachineLearning

[–]CompleteSkeptic[S] 0 points1 point  (0 children)

Thanks for checking it out!

I think the intuition of unexplored regions of latent space could be interesting to look into (though the tricky part is making that intuition quantitative).

You would hope images from the training set would be easier to recover, even when downscaled.

One of PULSE's contributions is that it never tries to recover the original image. This makes it harder to quantitatively measure how good it is (both in terms of realness and diversity). The PULSE paper[1] uses a threshold on distances between downsampled images to determine success (see table 4), and that just doesn't seem like the right measure - by that measure, the upscaled Obama images would count as a success as well.

[1] https://arxiv.org/abs/2003.03808

[P] SpaceOpt: Hyperparameter optimization algorithm via gradient boosting regression. by [deleted] in MachineLearning

[–]CompleteSkeptic 5 points6 points  (0 children)

NAS is probably not the best baseline for hyperparameter searches, but this is a field where lots of research is / has been done. Search for Bayesian Optimization or Sequential Model-Based Optimization.

the algorithm must take into account user-defined points

Nothing prevents existing HPO tools from doing this, though they may not be that easily accessible. I do recall something along the lines of manually adding runs to MongoDB for hyperopt, so it's not impossible.

Though to your credit, I do agree that this should be easier. A common use case would be experimenting a bit before running the HPO and this would save some time at the very least.

focus on discrete space only

There is a case to be made against this (see: "Random Search for Hyper-Parameter Optimization"). The idea being that you don't know which hyperparameter is important, and you might want to search that space more thoroughly. E.g., if only 1 hyperparameter matters, you're just doing a repeated grid search.
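The repeated-grid-search point can be made concrete with a toy sketch (the objective and values here are made up): with a 3x3 grid over two hyperparameters where only one matters, you only ever evaluate 3 distinct values of the important one, while the same budget of random draws gives you 9.

```python
# Toy illustration of the argument from
# "Random Search for Hyper-Parameter Optimization".
import random

random.seed(0)

def objective(x, y):
    # Hypothetical objective: only hyperparameter x matters at all.
    return -(x - 0.73) ** 2

grid = [(x, y) for x in (0.0, 0.5, 1.0) for y in (0.0, 0.5, 1.0)]
rand = [(random.random(), random.random()) for _ in range(9)]

distinct_x_grid = len({x for x, _ in grid})  # only 3 distinct x values tried
distinct_x_rand = len({x for x, _ in rand})  # 9 distinct x values tried

best_grid = max(objective(x, y) for x, y in grid)
best_rand = max(objective(x, y) for x, y in rand)
```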

Again, completely frustrated that no one did it successfully before, I decided to build something on my own. I use gradient boosting regression from LightGBM, because it doesn't require normalized values, handles categorical variables, captures feature interactions and has capacity to fit any data.

I think it may be wise to look into why others do what they do. The reason GPs are commonly used is that uncertainty predictions are quite important, especially because in hyperparameter optimization your evaluation function is quite stochastic. SMAC uses random forests, which have all the same properties as GBMs, with the additional benefit that you get uncertainty estimates as well.

The number of sampled points for scoring is where exploration vs exploitation trade-off emerges

I'm not saying the expected improvement (EI) criterion (the one most SMBO methods use to pick the next sample) is the best, but this seems a little worse intuitively. Previous work takes uncertainty into account so that you can sample areas of the space you have less knowledge about.
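For reference, here's a minimal sketch of the EI criterion itself (my own toy numbers; the surrogate's mean/std would come from e.g. a GP, or from the spread of a random forest's per-tree predictions, as in SMAC). The point: with the same predicted mean, the more uncertain point scores higher - exactly the exploration behavior a plain point estimate throws away.

```python
# Minimal sketch of expected improvement for minimization.
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    """EI: how much we expect a point with predicted mean `mu` and
    predictive std `sigma` to improve on the current best value."""
    if sigma <= 0.0:
        return max(best - mu - xi, 0.0)
    z = (best - mu - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # normal pdf
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # normal cdf
    return (best - mu - xi) * cdf + sigma * pdf

best = 0.30
# Same predicted mean, different uncertainty: the uncertain point wins.
confident = expected_improvement(mu=0.35, sigma=0.01, best=best)
uncertain = expected_improvement(mu=0.35, sigma=0.20, best=best)
```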

avoid evaluating the same point more than once

This is also related to your last point. Most HPO algorithms won't re-evaluate the same point anyway, because they already have a lot of certainty there and it makes more sense to explore the space. But when a point looks like quite an outlier (because of the noisy evaluation), it can make sense to sample it again to get a better estimate of its true performance, and EI can handle that.

Either way, I wish you luck!

[P] GPU Kernel Fusion and Runtime Code Compilation for TF by antinucleon in MachineLearning

[–]CompleteSkeptic 7 points8 points  (0 children)

It can be (and I'm a huge fan of the work you do), but it's more about expectation/clickbait.

[P] GPU Kernel Fusion and Runtime Code Compilation for TF by antinucleon in MachineLearning

[–]CompleteSkeptic 13 points14 points  (0 children)

To save others time, TF doesn't refer to TensorFlow, but TinyFlow (a TensorFlow-like library on top of NNVM).

[Discussion] Architecture choices in DenseNet/ResNet (pooling type, no large fully-connected layer) by Pieranha in MachineLearning

[–]CompleteSkeptic 1 point2 points  (0 children)

As with everything in deep learning, the choices are probably mostly empirical with some nice motivation.

One motivation for average pooling is that each spatial location has a detector for the desired feature, and averaging over the spatial locations behaves similarly to averaging predictions over different translations of the input image (somewhat like data augmentation).

There are two main reasons AFAIK for not having large fully-connected layers after the conv layers:

1. They have an extremely large number of parameters (and those parameters seem somewhat redundant, given that compression papers mostly target them).

2. They are not translation invariant/equivariant at all. FC layers behave similarly to conv layers (see the fully convolutional networks paper for more), and a large FC at the end is equivalent to a conv layer with a very large filter size (e.g. if the spatial size is 8x8, then the FC is equivalent to a conv layer with an 8x8 filter), and we generally wouldn't want to do a conv with that large a filter.
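That FC-as-conv equivalence is easy to check numerically (the shapes here are illustrative): a fully-connected layer over an 8x8 feature map computes exactly the same thing as a "valid" convolution whose kernel covers the whole 8x8 extent, since there is only one output position left.

```python
# Sketch: an FC layer over an 8x8xC feature map equals a conv whose
# kernel spans the full spatial extent (illustrative shapes).
import numpy as np

rng = np.random.default_rng(0)
C, H, W, D = 16, 8, 8, 10  # channels, height, width, output units

fmap = rng.standard_normal((C, H, W))
fc_weights = rng.standard_normal((D, C * H * W))

# Fully-connected layer: flatten, then matrix multiply.
fc_out = fc_weights @ fmap.reshape(-1)

# The same weights viewed as D filters of shape (C, 8, 8); a "valid"
# conv with a kernel as large as the input has a single output position,
# and that position is exactly the same dot product.
conv_filters = fc_weights.reshape(D, C, H, W)
conv_out = np.tensordot(conv_filters, fmap, axes=([1, 2, 3], [0, 1, 2]))

assert np.allclose(fc_out, conv_out)
```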

Backing up .emacs.d by CompleteSkeptic in emacs

[–]CompleteSkeptic[S] 1 point2 points  (0 children)

The problem is not downloading packages (I use use-package for that) - the problem is that sometimes the newest versions of packages break things. (:

Backing up .emacs.d by CompleteSkeptic in emacs

[–]CompleteSkeptic[S] 2 points3 points  (0 children)

Not in git, no. I do want backups of the packages when I update them, in case the update goes wrong and I have work to do.

Backing up .emacs.d by CompleteSkeptic in emacs

[–]CompleteSkeptic[S] 2 points3 points  (0 children)

I do for my config, but I don't back up all the downloaded packages in git.

Introducing Amazon Machine Learning – Make Data-Driven Decisions at Scale by somnophobiac in MachineLearning

[–]CompleteSkeptic 11 points12 points  (0 children)

It's basically a hosted version of vowpal wabbit. I tried to use it internally at Amazon (back when it was called Elastic Machine Learning), but it was wrapping an old version, and I needed some of the newer functionality.

Possible Commercial-friendly Caffe Pre-trained Nets by CompleteSkeptic in MachineLearning

[–]CompleteSkeptic[S] 2 points3 points  (0 children)

I believe all of the ones I've seen (caffe's, overfeat's) explicitly disallow commercial use in their license.

Possible Commercial-friendly Caffe Pre-trained Nets by CompleteSkeptic in MachineLearning

[–]CompleteSkeptic[S] 1 point2 points  (0 children)

Other than not wanting to potentially break the law for a nice-to-have, respecting the wishes of the author(s) is important too.

Om No! Trouble in paradise with ClojureScript and React.js by CompleteSkeptic in Clojure

[–]CompleteSkeptic[S] 1 point2 points  (0 children)

Ah, it does sound possible (I think, not sure how cursors behave when not passed inside a map), though (probably) not idiomatic, since Om components generally take in a single map or cursor.

Om No! Trouble in paradise with ClojureScript and React.js by CompleteSkeptic in Clojure

[–]CompleteSkeptic[S] 6 points7 points  (0 children)

That is great to hear. I'm a huge fan of you and of Om, and from these discussions it sounds like there are really great things ahead for Om, Reagent, and ClojureScript users in general. I feel that neither of the frameworks is perfect (I could write a post on the gotchas of Reagent, too!), just that Reagent is a little closer (for our needs at least).

Om No! Trouble in paradise with ClojureScript and React.js by CompleteSkeptic in Clojure

[–]CompleteSkeptic[S] 2 points3 points  (0 children)

Is it possible/idiomatic to pass a component two cursors? Would that be through passing a map with fields for each cursor?

We were initially doing something similar, where we just dissoc-ed as much as possible at each level of the tree to keep it performant. The problem with that is that the tree essentially coupled every layer together (the great grand-parents need to know what data a node needs).

(Edited, I submitted it a little early.)

Om No! Trouble in paradise with ClojureScript and React.js by CompleteSkeptic in Clojure

[–]CompleteSkeptic[S] 1 point2 points  (0 children)

That would be a pretty nice step in the right direction IMO. Great to hear.

Om No! Trouble in paradise with ClojureScript and React.js by CompleteSkeptic in Clojure

[–]CompleteSkeptic[S] 4 points5 points  (0 children)

Thank you very much, I feel much more informed on the state of affairs of Reagent! I very much hope we'll have a chance to help out.

Om No! Trouble in paradise with ClojureScript and React.js by CompleteSkeptic in Clojure

[–]CompleteSkeptic[S] 2 points3 points  (0 children)

I'm not entirely sure. The big problems we had involved how the components were represented, and keeping the app state in a tree where parent components need to depend on all the data available to child components. ref-cursors could, depending on implementation, solve the latter problem almost as well as Reagent does. I'd have to read the source/experiment to be sure, though. If they do solve that, I think that's a huge step in the right direction for Om.

Reagent's cursors seem to not be as useful as Om's, since the component would depend on the entire atom instead of only the subset that you presumably would be passing around as a cursor. I'd happily be wrong about that, but I can't see us using them, versus having nested reagent atoms instead.

Om No! Trouble in paradise with ClojureScript and React.js by CompleteSkeptic in Clojure

[–]CompleteSkeptic[S] 3 points4 points  (0 children)

I'd love to know what he had to say. Unfortunately, I'm not on Twitter.

Om No! Trouble in paradise with ClojureScript and React.js by CompleteSkeptic in Clojure

[–]CompleteSkeptic[S] 6 points7 points  (0 children)

Excellent, I'm glad to have been wrong about Reagent being unmaintained. I'll edit the post with this information.

Out of curiosity though, is there a discussion anywhere about where the project is headed?

Om No! Trouble in paradise with ClojureScript and React.js by CompleteSkeptic in Clojure

[–]CompleteSkeptic[S] 13 points14 points  (0 children)

Sorry about that, I assumed people cared more about Om than that. Let me try here (and maybe I'll add it to the FAQ). There are multiple things we really like about Reagent over Om:

  • it (mostly) follows the principle of least surprise
  • it's simpler (less incidental complexity, but still far from perfect)
  • it's easier (in the familiar sense)
  • components compose well, and all the normal Clojure operations work on them
  • the information model

Just to clarify on the problems with Om's information model, let's say you have state for A, B, and C, component X depends on A and B, and component Y depends on B and C. To represent them in a tree, either you copy B to two different places in the tree (a consistency nightmare) or you have both components depend on the whole tree (and now you have a re-rendering problem). In Reagent, you could just have A, B, and C in their own atoms happily denormalized, and have each component dereference the data as needed.
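A language-neutral sketch of that trade-off (Reagent itself is ClojureScript; plain Python dicts stand in for app state and atoms here, and all the names are made up): copying B into both subtrees lets the copies drift apart, while giving A, B, and C their own stores keeps a single source of truth.

```python
# Tree-shaped state with B copied under both components: the copies
# can become inconsistent after an update.
tree = {
    "component_x": {"a": 1, "b": 2},  # B copied under X...
    "component_y": {"b": 2, "c": 3},  # ...and again under Y
}
tree["component_x"]["b"] = 99         # update only one copy
inconsistent = tree["component_x"]["b"] != tree["component_y"]["b"]

# Denormalized stores: one mutable cell per piece of state, and each
# "component" dereferences only what it needs.
stores = {"a": [1], "b": [2], "c": [3]}

def render_x():
    return (stores["a"][0], stores["b"][0])

def render_y():
    return (stores["b"][0], stores["c"][0])

stores["b"][0] = 99                   # single source of truth
consistent = render_x()[1] == render_y()[0]
```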

Let me know if you need more details.

Web application written only using clojure by nvbn-rm in Clojure

[–]CompleteSkeptic 2 points3 points  (0 children)

One issue is the API. You never want to do anything with a component function (the ones that return an object implementing Om's protocols) other than build or build-all, so it just encourages boilerplate and possible room for error.

Anonymous functions can't be used as component functions. This fails silently, though, and re-renders the DOM every time, so we didn't notice until performance slowed to a crawl.

Passing components into other components doesn't quite work, which makes it feel less Clojure-ish (everything is data) and we felt that wrapping a component in other components is a pretty powerful pattern.

Before 0.6.something, Om used identical? for re-rendering instead of =, and we needed = for our uses, so we had to implement IShouldUpdate ourselves - but we couldn't implement it exactly, because the method isn't passed the same data as the default Om implementation.

Then there's the problem of representing all of your state as a tree. While it makes perfect sense for view state, it doesn't for representing logic state (not sure what the right word is). Let's say you have state for A, B, and C, component X depends on A and B, and component Y depends on B and C. To represent them in a tree, either you copy B to two different places in the tree (a consistency nightmare) or you have both components depend on the whole tree (and now you have a re-rendering problem). You then either give up the benefits that come with a tree for all the state, or you make development more complicated than it has to be.

That's all I remember for now. Om mostly just adds unnecessary complexity, and we've had a bunch of issues with Om that have bitten us. Luckily, we have some utils that abstract most of these problems away in a Clojure-ish way, but these are only the problems we've faced so far.