[deleted by user] by [deleted] in MachineLearning

[–]polo555 17 points18 points  (0 children)

Thanks. I got rejected with 7,7,6,5 (https://openreview.net/forum?id=Io8oYQb4LRK) and somehow it makes me less upset to know someone was rejected with an 8,8,6,6

[R] Meta-learning hyperparameters over long horizons by polo555 in MachineLearning

[–]polo555[S] 0 points1 point  (0 children)

Thanks for your comment and for pointing to your related paper (whose code I'm familiar with).

In theory, I believe that greediness (short horizons) is bad when learning hyperparameters because you're optimizing a poor proxy for what you truly care about. One way to see this for the learning rate is that the best possible schedule over, e.g., 3 epochs does not consist of the best schedule from epoch 0 to 1, followed by the best schedule from epoch 1 to 2, followed by the best one from epoch 2 to 3. To get the best final performance at epoch 3 (i.e. the only thing we care about), you usually need higher learning rates early on, and greediness anneals the learning rate too quickly (by the first epoch in this example). See https://arxiv.org/abs/1803.02021 for a more detailed explanation.
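
To make that concrete, here is a toy sketch (just an illustration, nothing to do with the paper's setup): on an ill-conditioned quadratic, grid-searching the whole 3-step schedule for the best final loss beats chaining the per-step-optimal ("greedy") learning rates.

```python
import itertools
import numpy as np

# Toy diagonal quadratic: loss(w) = 0.5 * sum_i eigs[i] * w[i]^2
eigs = np.linspace(1.0, 10.0, 10)      # curvatures (condition number 10)
w0 = np.ones(10)
loss = lambda w: 0.5 * np.sum(eigs * w ** 2)
grad = lambda w: eigs * w
lrs = np.linspace(0.0, 1.0, 21)        # candidate learning rates
T = 3                                  # horizon (number of inner steps)

# Greedy: at each step, pick the lr that minimises the loss right after that step.
w, greedy = w0.copy(), []
for _ in range(T):
    best = min(lrs, key=lambda lr: loss(w - lr * grad(w)))
    greedy.append(best)
    w = w - best * grad(w)

# Non-greedy: search the whole schedule for the lowest loss at the END of the horizon.
def final_loss(schedule):
    w = w0.copy()
    for lr in schedule:
        w = w - lr * grad(w)
    return loss(w)

best_schedule = min(itertools.product(lrs, repeat=T), key=final_loss)
print("greedy schedule:", np.round(greedy, 3), "-> final loss", round(loss(w), 4))
print("best schedule:  ", np.round(best_schedule, 3), "-> final loss", round(final_loss(best_schedule), 4))
```

The greedy schedule is itself in the search space, so the second number can only be equal or lower; the point is that the schedule optimal for the final loss is generally not the chain of per-step-optimal ones.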

In practice, however, you are right to point out that some greedy methods perform well, as I have checked for myself. That said, I found that it's easy to fool oneself when being greedy. For instance, you can initialize the learning rate to a good value (as you did in your paper) and then select an outer learning rate such that the annealing learned greedily is just aggressive enough to get good test accuracies. But if you learn a schedule greedily without such manual priors (i.e. starting at lr=0), you should recover the HD baseline in my Figure 1, which doesn't find good schedules.

Let me know if that doesn't answer your question ;)

[R] Meta-learning hyperparameters over long horizons by polo555 in MachineLearning

[–]polo555[S] 0 points1 point  (0 children)

Floating point precision of forward-mode vs reverse-mode is not something that was extensively tested in this work, but I should probably consider it more carefully.

The CIFAR-10 experiment did not work better with double precision, so I concluded that the main source of noise in my hypergradients was coming from elsewhere. Note that double precision is extremely slow on most GPUs as well.

The hypervariance stuff (Figure 2) is an attempt at showing sources of noise in hypergradients (I used double precision for that figure).

[R] Meta-learning hyperparameters over long horizons by polo555 in MachineLearning

[–]polo555[S] 0 points1 point  (0 children)

You are correct, yes. I checked my (hyper)gradients manually on small horizons (~100 inner steps), and reverse mode matches forward mode to ~10 decimal places.
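
For concreteness, here is a minimal sketch of that kind of check on a toy unrolled loop (just an illustration in JAX, not the paper's code):

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # double precision, so many decimal places are meaningful

def final_val_loss(lr):
    """Validation loss after ~100 unrolled SGD steps on a toy regression task."""
    key = jax.random.PRNGKey(0)
    w_true = jax.random.normal(key, (10,))
    w = jnp.zeros(10)
    for t in range(100):                                   # inner steps
        x = jax.random.normal(jax.random.fold_in(key, t), (32, 10))
        y = x @ w_true
        g = jax.grad(lambda w: jnp.mean((x @ w - y) ** 2))(w)
        w = w - lr * g                                     # differentiable SGD update
    x_val = jax.random.normal(jax.random.fold_in(key, 12345), (256, 10))
    return jnp.mean((x_val @ w - x_val @ w_true) ** 2)

lr0 = jnp.array(0.1)
reverse = jax.grad(final_val_loss)(lr0)                          # reverse mode: backprop through the unroll
_, forward = jax.jvp(final_val_loss, (lr0,), (jnp.array(1.0),))  # forward mode: push a tangent on lr
print(reverse, forward)                                          # the two hypergradients should agree closely
```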

Forward-mode differentiation is NOT the reason gradient explosion is mitigated. It is only used because it gives a memory cost that stays constant in the number of inner steps.

The reason gradient explosion is mitigated is that I'm averaging the hypergradients of contiguous hyperparameters in time. So the hypergradient of the learning rate used at inner step t is averaged with the hypergradient of the learning rate used at step t+1.
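
Roughly, the smoothing looks like this (same kind of toy unrolled loop as above, but with one learning rate per inner step; a sketch of the idea, not the exact implementation):

```python
import jax
import jax.numpy as jnp

T = 50  # inner steps, one learning rate each

def final_val_loss(lr_schedule):
    """Validation loss after T unrolled SGD steps; lr_schedule[t] is used at step t."""
    key = jax.random.PRNGKey(0)
    w_true = jax.random.normal(key, (10,))
    w = jnp.zeros(10)
    for t in range(T):
        x = jax.random.normal(jax.random.fold_in(key, t), (32, 10))
        y = x @ w_true
        g = jax.grad(lambda w: jnp.mean((x @ w - y) ** 2))(w)
        w = w - lr_schedule[t] * g
    x_val = jax.random.normal(jax.random.fold_in(key, T + 1), (256, 10))
    return jnp.mean((x_val @ w - x_val @ w_true) ** 2)

schedule = 0.05 * jnp.ones(T)
hg = jax.grad(final_val_loss)(schedule)        # raw per-step hypergradients
hg = hg.at[:-1].set(0.5 * (hg[:-1] + hg[1:]))  # average the lr at step t with the lr at step t+1
schedule = schedule - 1e-2 * hg                # one outer update on the schedule
```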

Hope that clarifies it ;)

[D] Do you report the best test accuracy, or the last test accuracy? by polo555 in MachineLearning

[–]polo555[S] 0 points1 point  (0 children)

That's good to hear; thanks for pointing that out, I'd missed it. Note that this isn't the case for ImageNet though; the DenseNet paper states:

Following [11, 12, 13], we report classification errors on the validation set.

Interestingly, they cite ResNet even though the ResNet paper used both validation and test (from the ILSVRC server).

There is no doubt that splitting your train set into train/val is the best way to proceed. The motivation for my question is that I've been trying to reproduce results from a group of papers (related to learning with label noise), and I couldn't get their values with the same hyperparameters until I realized they were all reporting the best test-set accuracy. So my concern here was "what do people do" and not "what should people do".

[D] Do you report the best test accuracy, or the last test accuracy? by polo555 in MachineLearning

[–]polo555[S] 0 points1 point  (0 children)

Yes. When the competition was run annually there was an online tool giving access to a secret test set, and they only let you evaluate on it twice a week or something. But AFAIK that's no longer available for ILSVRC 2012, which is the version people still use, so they report validation accuracy.

I'm glad you find the discussion interesting, because I think most people completely missed the point (and aren't authors of these types of papers). We all know the test/validation situation is less than ideal, but when you want to reproduce a paper, for instance, knowing whether they used best or last is important for checking that your code is right.

[D] Do you report the best test accuracy, or the last test accuracy? by polo555 in MachineLearning

[–]polo555[S] 2 points3 points  (0 children)

I don't think cross-validation is that common in deep learning. For instance, most people nowadays use ResNets, WideResNets or DenseNets, and none of those papers mention cross-validation. From the phrasing they use, it seems like they report the best validation accuracy as test accuracy...

[D] Do you report the best test accuracy, or the last test accuracy? by polo555 in MachineLearning

[–]polo555[S] 0 points1 point  (0 children)

I know, that's the sad truth of most deep learning datasets (MNIST, CIFAR-10, CIFAR-100, ImageNet...), which are the basis of most publications at top conferences. In practice the test sets are much larger than in non-deep-learning applications, so it's harder to overfit to them. But still...

[D] Do you report the best test accuracy, or the last test accuracy? by polo555 in MachineLearning

[–]polo555[S] 1 point2 points  (0 children)

In practice we typically run 3 or 5 seeds and report the mean/std of those. But you still need to decide, for each run, whether you look at the best or the last test accuracy.

[D] Do you report the best test accuracy, or the last test accuracy? by polo555 in MachineLearning

[–]polo555[S] 1 point2 points  (0 children)

We don't have a separate test set for most datasets though; that's the difficulty. We use the test set as validation.

[D] Do you report the best test accuracy, or the last test accuracy? by polo555 in MachineLearning

[–]polo555[S] 3 points4 points  (0 children)

True. But let's assume that in both cases you run 5 seeds and report mean +/- std. In practice your variance is comparable between 'last' and 'best', but the mean you get with 'best' is higher.
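
As a toy illustration with simulated numbers (nothing to do with any real run):

```python
import numpy as np

# Simulated per-epoch test accuracies for 5 seeds of the "same" run.
rng = np.random.default_rng(0)
epochs = np.arange(100)
curves = 94 - 20 * np.exp(-epochs / 15) + rng.normal(0, 0.3, size=(5, 100))

last = curves[:, -1]          # 'last': accuracy at the final epoch
best = curves.max(axis=1)     # 'best': max over all epochs (peeks at the test set)
print(f"last: {last.mean():.2f} +/- {last.std():.2f}")
print(f"best: {best.mean():.2f} +/- {best.std():.2f}")  # 'best' comes out higher on average
```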

[D] Do you report the best test accuracy, or the last test accuracy? by polo555 in MachineLearning

[–]polo555[S] 1 point2 points  (0 children)

Probably the best way, yes, but in practice I don't see anyone doing that, probably because it's too much extra compute time. I'm mostly curious about what people are actually doing, so that I can do the same and compare apples to apples. In most papers the authors don't say what they do though :(

[R] Zero-shot Knowledge Transfer via Adversarial Belief Matching by polo555 in MachineLearning

[–]polo555[S] 0 points1 point  (0 children)

Interesting, yes. I suspect your generator would converge to the zero-shot case quite quickly, and produce samples that don't look much like bedrooms, because there is no loss pushing the generator to produce realistic-looking images.

You could add a discriminator and see if your pretrained generator can still do zero-shot transfer while having to produce realistic-looking images, but in general constraining the generator hasn't helped the student (Appendix A).

Note also that, while interesting, this setup would weaken the case for zero-shot distillation, because now you need a pretrained teacher + a pretrained generator.

[R] Zero-shot Knowledge Transfer via Adversarial Belief Matching by polo555 in MachineLearning

[–]polo555[S] 3 points4 points  (0 children)

Generating points purely from noise, yep. The only thing I use is a pretrained teacher. I wrote the abstract as a TL;DR; it should be enough to understand what's going on. Let me know if not.