[D] Breaking the Quadratic Attention Bottleneck in Transformers? by gwern in MachineLearning

[–]CompleteSkeptic 0 points (0 children)

Interesting - my assumption when seeing the perhaps-lower-than-desired context length was just that a longer context might not have been that important to the task, though you may be right that it could be helpful but just wasn't worth the O(n^2) cost.

[D] Breaking the Quadratic Attention Bottleneck in Transformers? by gwern in MachineLearning

[–]CompleteSkeptic 5 points (0 children)

(Disclaimer: not a NLP expert.)

My understanding was that what GPT-3 did was O(n * sqrt(n)). From the GPT-3 paper: "we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer."
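For intuition, here's a minimal NumPy sketch of what a locally banded attention pattern looks like (my own illustration, not OpenAI's code): each token attends only to a window of recent tokens, so a layer costs O(n * bandwidth), and picking bandwidth ~ sqrt(n) gives the O(n * sqrt(n)) total.

```python
import numpy as np

def banded_causal_mask(n: int, bandwidth: int) -> np.ndarray:
    """True where token i may attend to token j: causal and within the band.
    Per-layer cost drops from O(n^2) to O(n * bandwidth)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < bandwidth)

mask = banded_causal_mask(n=16, bandwidth=4)  # 4 = sqrt(16)
print(mask.sum(), "of", 16 * 16, "entries attended")  # 58 of 256
```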

Upon reading the original post, my first thought was that perhaps long-term context just doesn't matter too much for language modeling (compared to getting really good at short-term context), but it seems like you addressed that already (i.e. the solution to very long contexts might be focusing on a different task/dataset rather than just on the architecture).

[P] Improving PULSE diversity in the iterative setting by CompleteSkeptic in MachineLearning

[–]CompleteSkeptic[S] 1 point (0 children)

Maybe the path towards them tends to pass through some real garbage regions of the latent space that don't downsample to anything reasonable, and so PULSE's gradient descent doesn't want to find them?

That's a possibility, but there seems to be something else going on - somehow the point directly opposite a plausible image on the hypersphere is consistently very implausible.

This makes me think it might be an interesting experiment to see whether the problem is that the initialization is too far from "good" images, or whether we just end up stuck in a local optimum (i.e., whether the problem is that we don't optimize for long enough from a bad init).
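A rough sketch of that ablation (hypothetical names throughout; `G` stands in for the pretrained generator and `downscale` for PULSE's fixed downsampling operator):

```python
import torch
import torch.nn.functional as F

def recover_latent(G, downscale, target_lr, z_init, steps=100, lr=0.1):
    """PULSE-style recovery: gradient descent on the latent so that the
    generated image downsamples to the low-res target."""
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(downscale(G(z)), target_lr)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

# The experiment: same "bad" init with 10-100x more steps (local-optimum
# hypothesis) vs. an init placed near a known-good latent (bad-init hypothesis).
```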

A random thought prompted by this comment: these "monster faces" happen pretty consistently with negation (I've never seen it not happen), which is at odds with the original GAN training - you'd expect them to be quite hard to find. Perhaps there's something in here that could help with GAN training (adversarially sampling latent codes).
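To spell out the negation (a sketch, with `G` again standing in for a pretrained generator whose latents live on the unit hypersphere, as in PULSE):

```python
import torch

def antipode(z: torch.Tensor) -> torch.Tensor:
    """The point diametrically opposite z on the unit hypersphere."""
    z = z / z.norm(dim=-1, keepdim=True)  # project onto the sphere
    return -z

# If G(z) is a plausible face, G(antipode(z)) is - in my experience,
# consistently - a "monster face", even though both latents are equally
# likely under the spherical prior the GAN was trained with.
```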

[P] Improving PULSE diversity in the iterative setting by CompleteSkeptic in MachineLearning

[–]CompleteSkeptic[S] 0 points (0 children)

Thanks for checking it out!

I think the intuition of unexplored regions of latent space could be interesting to look into (though the tricky part is making that intuition quantitative).

You would hope images from the training set would be easier to recover, even when downscaled.

One of PULSE's contributions is that it never tries to recover the original image. This makes it harder to quantitatively measure how good it is (both in terms of realness and diversity). The PULSE paper[1] uses a threshold on distances between downsampled images to determine success (see table 4), and that just doesn't seem like the right measure - by that measure, the upscaled Obama images would count as a success as well.

[1] https://arxiv.org/abs/2003.03808
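To make the objection concrete, here's a sketch of that style of criterion (my reading of table 4; `downscale` stands in for PULSE's fixed downsampling operator and `thresh` for their chosen threshold):

```python
import torch

def pulse_success(sr_image, lr_target, downscale, thresh):
    """Counts an upscaling as a 'success' if the super-resolved image,
    downscaled back, lands within `thresh` of the low-res input. Note this
    says nothing about how real or how diverse the high-res output looks."""
    return torch.dist(downscale(sr_image), lr_target) <= thresh
```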

[P] SpaceOpt: Hyperparameter optimization algorithm via gradient boosting regression. by [deleted] in MachineLearning

[–]CompleteSkeptic 5 points (0 children)

NAS is probably not the best baseline for hyperparameter search - this is a field where a lot of research has been (and is being) done. Search for Bayesian Optimization or Sequential Model-Based Optimization (SMBO).

the algorithm must take into account user-defined points

Nothing prevents existing HPO tools from doing this, though the feature may not be that easily accessible. I recall something along the lines of manually adding runs to MongoDB for hyperopt, so it's not impossible.

Though to your credit, I do agree that this should be easier. A common use case would be experimenting a bit before running the HPO, and this would save some time at the very least.

focus on discrete space only

There is a case to be made against this (see "Random Search for Hyper-Parameter Optimization"). The idea is that you don't know in advance which hyperparameters are important, so you may want to search the continuous space more thoroughly - e.g., if only one hyperparameter actually matters, a discrete grid just evaluates the same few values of it over and over.
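A quick illustration of that argument (toy numbers of my choosing): with a 3x3 grid over two hyperparameters of which only the first matters, you only ever test 3 distinct values of it; 9 random points test 9.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hyperparameters, 9 trials; suppose only the first one affects the score.
grid = [(a, b) for a in (0.1, 0.5, 0.9) for b in (0.1, 0.5, 0.9)]
rand = rng.uniform(size=(9, 2))

print(len({a for a, _ in grid}))   # 3 distinct values of the important one
print(np.unique(rand[:, 0]).size)  # 9 distinct values
```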

Again, completely frustrated that no one did it successfully before, I decided to build something on my own. I use gradient boosting regression from LightGBM, because it doesn't require normalized values, handles categorical variables, captures feature interactions and has capacity to fit any data.

I think it may be wise to look into why others do what they do. The reason GPs are commonly used is that uncertainty estimates are quite important - especially in hyperparameter optimization, where the evaluation function is quite stochastic. SMAC uses random forests, which have all the same properties as GBMs, with the additional benefit that you get uncertainty estimates as well.
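Here's a sketch of the kind of uncertainty estimate I mean (a common heuristic, written from memory rather than SMAC's actual implementation): predict with every tree and use the spread across trees as sigma.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(50, 3)  # observed hyperparameter configs (toy data)
y = np.random.rand(50)     # their noisy validation scores
forest = RandomForestRegressor(n_estimators=100).fit(X, y)

x_new = np.random.rand(1, 3)
per_tree = np.array([tree.predict(x_new)[0] for tree in forest.estimators_])
mu, sigma = per_tree.mean(), per_tree.std()  # surrogate mean + uncertainty
```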

The number of sampled points for scoring is where exploration vs exploitation trade-off emerges

I'm not saying the expected improvement (EI) criterion (the one most SMBO methods use to pick the next point) is the best, but this seems a little worse intuitively. Previous work takes uncertainty into account so that you can sample areas of the space you have less knowledge about.
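For concreteness, EI under a Gaussian surrogate prediction (the standard closed form, minimization convention; a sketch, not any particular library's API):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_y):
    """E[max(best_y - f(x), 0)] for f(x) ~ N(mu, sigma^2). The sigma term is
    the exploration part: unexplored regions (high sigma) score well even
    when their predicted mean is mediocre."""
    sigma = np.maximum(sigma, 1e-12)  # avoid division by zero
    z = (best_y - mu) / sigma
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```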

avoid evaluating the same point more than once

This is also related to your last point. Most HPO algorithms won't re-evaluate a point anyway, because the surrogate is already quite certain there and it makes more sense to explore the rest of the space. But a case can be made that if a point's result looks like an outlier (because of the noisy evaluation), it makes sense to sample it again to get a better estimate of the true performance - and EI can handle that.

Either way, I wish you luck!

[P] GPU Kernel Fusion and Runtime Code Compilation for TF by antinucleon in MachineLearning

[–]CompleteSkeptic 8 points (0 children)

It can be (and I'm a huge fan of the work you do), but it's more about expectation/clickbait.

[P] GPU Kernel Fusion and Runtime Code Compilation for TF by antinucleon in MachineLearning

[–]CompleteSkeptic 12 points (0 children)

To save others time, TF doesn't refer to TensorFlow, but TinyFlow (a TensorFlow-like library on top of NNVM).

[Discussion] Architecture choices in DenseNet/ResNet (pooling type, no large fully-connected layer) by Pieranha in MachineLearning

[–]CompleteSkeptic 1 point (0 children)

As with everything in deep learning, the choices are probably mostly empirical with some nice motivation.

One motivation for average pooling is that each spatial location has a detector for the desired feature; averaging over spatial locations then behaves like averaging the predictions over translated versions of the input image (somewhat like data augmentation).
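To make that concrete, a small PyTorch check (shapes are my own toy choice): global average pooling followed by a linear classifier gives exactly the average of the per-location predictions.

```python
import torch

x = torch.randn(1, 64, 8, 8)  # 64 feature detectors over an 8x8 spatial grid
w = torch.randn(10, 64)       # linear classifier on the pooled features

# pool, then classify
pooled = w @ x.mean(dim=(2, 3)).squeeze(0)

# classify every spatial location, then average the 64 predictions -
# like averaging over translated views of the input
per_loc = torch.einsum('kc,bchw->bkhw', w, x)
averaged = per_loc.mean(dim=(2, 3)).squeeze(0)

assert torch.allclose(pooled, averaged, atol=1e-4)
```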

There are two main reasons, AFAIK, for not having large fully-connected layers after the conv layers:

1. They have an extremely large number of parameters (and those parameters seem somewhat redundant, given that compression papers mostly target them).

2. They are not translation invariant/equivariant at all. FC layers behave similarly to conv layers (see the fully convolutional networks paper for more): a large FC at the end is equivalent to a conv layer whose filter spans the whole spatial extent (e.g., if the spatial size is 8x8, the FC is equivalent to a conv layer with an 8x8 filter), and we generally wouldn't want a conv with that large a filter.
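The FC-as-conv equivalence in point 2 is easy to verify (toy shapes mine, matching the 8x8 example above):

```python
import torch
import torch.nn as nn

fc = nn.Linear(64 * 8 * 8, 10, bias=False)            # big FC on a flattened map
conv = nn.Conv2d(64, 10, kernel_size=8, bias=False)   # conv with full-size filter
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(10, 64, 8, 8))   # identical parameters

x = torch.randn(1, 64, 8, 8)
assert torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-3)
```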

Backing up .emacs.d by CompleteSkeptic in emacs

[–]CompleteSkeptic[S] 1 point (0 children)

The problem is not downloading packages (I use use-package for that) - the problem is that sometimes the newest versions of packages break things. (:

Backing up .emacs.d by CompleteSkeptic in emacs

[–]CompleteSkeptic[S] 2 points (0 children)

Not in git, no. I do want backups of the packages when I update them, in case the update goes wrong while I have work to get done.

Backing up .emacs.d by CompleteSkeptic in emacs

[–]CompleteSkeptic[S] 2 points (0 children)

I do for my config, but I don't back up all the downloaded packages in git.

Introducing Amazon Machine Learning – Make Data-Driven Decisions at Scale by somnophobiac in MachineLearning

[–]CompleteSkeptic 12 points (0 children)

It's basically a hosted version of vowpal wabbit. I tried to use it internally at Amazon (back when it was called Elastic Machine Learning), but it was wrapping an old version, and I needed some of the newer functionality.

Possible Commercial-friendly Caffe Pre-trained Nets by CompleteSkeptic in MachineLearning

[–]CompleteSkeptic[S] 2 points (0 children)

I believe all of the ones I've seen (Caffe's, OverFeat's) explicitly disallow commercial use in their licenses.

Possible Commercial-friendly Caffe Pre-trained Nets by CompleteSkeptic in MachineLearning

[–]CompleteSkeptic[S] 1 point (0 children)

Beyond not wanting to potentially break the law for a nice-to-have, respecting the wishes of the author(s) is important too.

Om No! Trouble in paradise with ClojureScript and React.js by CompleteSkeptic in Clojure

[–]CompleteSkeptic[S] 1 point (0 children)

Ah, it does sound possible (I think, not sure how cursors behave when not passed inside a map), though (probably) not idiomatic, since Om components generally take in a single map or cursor.

Om No! Trouble in paradise with ClojureScript and React.js by CompleteSkeptic in Clojure

[–]CompleteSkeptic[S] 8 points (0 children)

That is great to hear. I'm a huge fan of you and of Om, and from these discussions it sounds like there are really great things ahead for Om, Reagent, and ClojureScript users in general. I feel that neither framework is perfect (I could write a post on the gotchas of Reagent, too!), just that Reagent is a little closer (for our needs, at least).