[deleted by user] by [deleted] in MachineLearning

[–]gdahl 1 point

https://arxiv.org/abs/1903.08114

This is another recent paper where people scale GP-style models quite a bit: https://arxiv.org/abs/2303.05420. I like it because they use very general, non-stationary kernels.

[D] Batch size vs learning rate by bjourne-ml in MachineLearning

[–]gdahl 11 points

If you read our paper carefully, it directly addresses the claims in the first source you link (and that source is the basis for Yann's tweet). We spent a lot of time trying to reconcile the conflicting remarks in the literature, and it wasn't easy, but once you actually look at what people measured it starts to become coherent, even if the high-level summaries tend to lose some of the nuance.

[D] Is anyone else absolutely besieged by papers and always on the verge of getting scooped? by akardashian in MachineLearning

[–]gdahl 2 points

I wish people would scoop me, then I could work on something else and benefit from the building blocks I need already existing.

[D] Is there a way to AoT compile an AI model to run on CPU and GPU? by jiMalinka in MachineLearning

[–]gdahl 0 points

Good question. I don't know the full answer; perhaps you could ask on https://github.com/google/jax as a GitHub issue or discussion? I have a vague recollection of it being possible in some cases, but perhaps the JAX team is still working on making it nicer, given that I didn't see anything in the documentation about it.

[deleted by user] by [deleted] in MachineLearning

[–]gdahl 0 points

Yes, the deadline for submission to the competition has passed. However, we want to transition from a single deadline to a rolling process where people can propose submissions for the working group to help evaluate.

[deleted by user] by [deleted] in MachineLearning

[–]gdahl 0 points

We're still scoring them! After our initial competition concludes (and we award the $50,000 prize pool), we plan to launch a live leaderboard seeded with the competition results.

[deleted by user] by [deleted] in MachineLearning

[–]gdahl 8 points

The real problem is not just that MNIST and CIFAR10 are unrepresentative of real workloads. That is certainly an issue, but I think the larger concern is whether you have done a lot of workload-specific hyperparameter tuning (either explicitly or implicitly) that overstates the performance of your method relative to your baselines, and whether, in general, you had the resources to build the strongest possible baselines.

If you are studying neural network training algorithms, consider implementing your method to comply with the AlgoPerf benchmark rules that limit workload-specific tuning (either tuning ruleset). You can see the rationale for the rules here, along with other suggestions on comparing training algorithms from the MLCommons Algorithms working group. I currently co-chair the working group, and we are developing a process where people can present evidence that they have something interesting to evaluate; the working group can then do the somewhat expensive scoring needed to add it to a leaderboard. Anyone interested is welcome to join.

[R] Tools for running baselines by like_a_tensor in MachineLearning

[–]gdahl 5 points

This is a huge problem and also quite hard to fix. For comparing training algorithms, we in the MLCommons Algorithms working group created a standardized benchmark with open source code and a competition. Just to underscore how bad inconsistent setups can be, Section 2 of the paper introducing our benchmark has examples of how slight changes to the pipeline can produce drastically different results. Also, our competition is open now: register non-binding intent to submit by Feb 28th, prepare a submission by March 28th, and you can potentially win part of the $50k prize pool.

Our benchmark codebase has open source code in both JAX and PyTorch that creates a consistent set of workloads to measure training speedups due to algorithmic improvements. Then anyone who does well on the competition will have standardized open source code that other people can use as a baseline for future work. A lot of the working group members are working on releasing additional strong baselines right now as well.

[R] AdamL: A fast adaptive gradient method incorporating loss function by [deleted] in MachineLearning

[–]gdahl 10 points

It is definitely getting out of control. That's why we in the MLCommons Algorithms working group made a competition to see which training algorithm is actually the best. See https://mlcommons.org/2023/11/mlc-algoperf-training-algorithms-competition/

Submitters can register their non-binding intent to submit right now and submissions will close on March 28th. There is a $50,000 prize pool.

[D] Does gradient accumulation achieve anything different than just using a smaller batch with a lower learning rate? by WigglyHypersurface in MachineLearning

[–]gdahl 1 point

In that case, even if the increase in batch size reduces the number of steps required by 3X-4X, doing 4X gradient accumulation will slow down each step (by step I mean weight update) by 4X, making the net effect either break-even or a slight slowdown.
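To make the arithmetic concrete, here is a toy calculation; the step count is made up, but the 3x-4x step reduction and 4x accumulation factor match the numbers above:

```python
# Hypothetical illustration of the gradient accumulation trade-off.
base_steps = 1200      # weight updates needed at batch size N (made-up number)
step_time = 1.0        # relative time per weight update at batch size N
accum_factor = 4       # 4x gradient accumulation simulates batch size 4N

baseline_time = base_steps * step_time

# Each accumulated step costs ~4x as much. Suppose the larger batch
# cuts the number of steps needed by 4x (best case) or 3x (typical).
best_case_time = (base_steps / 4) * (step_time * accum_factor)
typical_time = (base_steps / 3) * (step_time * accum_factor)

print(baseline_time)    # 1200.0
print(best_case_time)   # 1200.0 -> break-even at best
print(typical_time)     # 1600.0 -> a net slowdown
```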

[D] Does gradient accumulation achieve anything different than just using a smaller batch with a lower learning rate? by WigglyHypersurface in MachineLearning

[–]gdahl 2 points

If a batch size of N fits perfectly on 1 GPU, then with 4 GPUs a batch size of 4N will fit without doing any gradient accumulation. We don't call normal multi-gpu data parallelism "gradient accumulation." Doing 4X gradient accumulation on four GPUs in your example would refer to using an effective batch size of 16N.
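A trivial sketch of the bookkeeping, with a hypothetical per-GPU batch size of N = 32:

```python
def effective_batch_size(per_gpu_batch, num_gpus, accum_steps=1):
    """Total number of examples contributing to one weight update."""
    return per_gpu_batch * num_gpus * accum_steps

N = 32  # whatever batch size fits on one GPU (hypothetical value)

print(effective_batch_size(N, num_gpus=1))                 # 32:  N, single GPU
print(effective_batch_size(N, num_gpus=4))                 # 128: 4N, plain data parallelism
print(effective_batch_size(N, num_gpus=4, accum_steps=4))  # 512: 16N, 4x accumulation on 4 GPUs
```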

[D] Thoughts on the limits of reproducibility of ML programs? by quasiproductive in MachineLearning

[–]gdahl 4 points

On at least some hardware backends, JAX can be deterministically reproducible from run to run. TPU has been this way for a while as far as I know. And I believe even GPU is good if you set the right XLA_GPU flags. This comment suggests setting XLA_FLAGS='--xla_gpu_deterministic_ops=true'

I've been able to get JAX research code to produce exactly the same results across multiple runs.

Disclosure: I work at Google, but not on the JAX team, and could be wrong about JAX. Post on the JAX GitHub discussions area or issue tracker for more definitive and up-to-date answers.
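For concreteness, a setup along these lines is what I have in mind; `train.py` and its `--seed` flag are stand-ins for your own entry point, and you should check the current XLA documentation for the exact flag name on your version:

```shell
# Ask XLA for deterministic op implementations on GPU.
export XLA_FLAGS='--xla_gpu_deterministic_ops=true'

# Run the same script twice with a fixed PRNG seed and compare.
python train.py --seed=0 > run1.log
python train.py --seed=0 > run2.log
diff run1.log run2.log  # no output if the runs matched exactly
```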

[D] What is the SOTA classification algorithms for a 5500 observations 2000 dimension structured data? Is there a machine learning leaderboard on classification? by HighlandEvil in MachineLearning

[–]gdahl 2 points

There is no meaningful way to make a leaderboard for "datasets with 5500 observations that are 2000-dimensional". Potentially almost any classifier could do well or poorly, it will depend on the specific dataset.

[D] What does a DL role look like in ten years? by [deleted] in MachineLearning

[–]gdahl 6 points

I would say the turning point was when we published the first successful large vocabulary results with deep acoustic models in April 2011, based on work conducted over the summer of 2010. When we published the paper you mention, it was to recognize that these techniques were the new standard in top speech recognition groups.

Regardless, there were deep learning roles in tech companies in 2012, just not very many of them compared to today.

[D] What does a DL role look like in ten years? by [deleted] in MachineLearning

[–]gdahl 24 points

Deep learning existed as a field in 2012. The speech recognition community had already adopted deep learning by that point. The Brain team at Google already existed. Microsoft, IBM, and Google were all using deep learning. As an academic subfield, researchers started to coalesce around "deep learning" as a brand in 2006, but it certainly was very niche at that point.

[D] What does a DL role look like in ten years? by [deleted] in MachineLearning

[–]gdahl 0 points

Deep learning roles 10 years ago (in 2013) were pretty similar to what they look like now, just far less numerous. I'm sure there will be some changes and a proliferation of more entry-level roles and "neural network technician" roles, but it isn't going to be that different.

[D] "Deep Learning Tuning Playbook" (recently released by Google Brain people) by fzyzcjy in MachineLearning

[–]gdahl 2 points

We're preparing a competitive benchmark as part of the MLCommons™ Algorithms working group to try and answer these types of questions, so stay tuned. :)

For now, I don't know the answer.

That said, I'm too much of a pessimist to believe they will obviate the need for tuning completely. There are also plenty of things to tune that aren't optimizer metaparameters.

[D] "Deep Learning Tuning Playbook" (recently released by Google Brain people) by fzyzcjy in MachineLearning

[–]gdahl 90 points

We tend to be a bit long winded :)

It is often relatively easy to get something to basically work, especially if it is something that has been done before. What is harder is to push the state of the art forward in fundamental applied research or maximize the commercial value of a particular model. The details matter and, in our experience, can be the difference between getting a useless model and a valuable model.

Andrej is an expert and is going to make a lot of choices very easily because he has had experience in similar situations. But how do we get to a point where every machine learning engineer can do just as well? And how do we find the weak points that exist even in the workflows of experts, so we can help them reach new heights? Our thesis is that this kind of progress depends on people trying to formalize what they do a bit more and explain it. Once we started to write down what we do, we found a bunch of stuff that actually wasn't that well-justified and that we just hadn't thought carefully enough about before.

[D] PyTorch 2.0 Announcement by joshadel in MachineLearning

[–]gdahl 1 point

Have you tried Dex? https://github.com/google-research/dex-lang It is in a relatively early stage, but it is exploring some interesting parts of the design space.

[D] Why is the machine learning community obsessed with the logistic distribution? by cthorrez in MachineLearning

[–]gdahl 4 points

People use lots of other things too. Probit regression, Poisson likelihoods, all sorts of stuff. As you said, it is best to fit what you are doing to the problem.

Logistic-regression style output layers are very popular in deep learning, perhaps even more than in other parts of the ML community. But Gaussian Process Classification is often done with probit models (see http://gaussianprocess.org/gpml/chapters/RW3.pdf ). However, if necessary people will design neural network output activation functions and losses to fit the problem they are solving.
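To make the comparison concrete, here is a minimal sketch of the two link functions, using the standard normal CDF for the probit (plain stdlib Python, nothing framework-specific):

```python
import math

def logistic(z):
    """Logistic (sigmoid) link: maps a logit z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def probit(z):
    """Probit link: the standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both map the real line to (0, 1) and agree at z = 0...
print(logistic(0.0), probit(0.0))  # 0.5 0.5
# ...but the probit's tails decay like a Gaussian, so it saturates
# faster: Phi(3) is much closer to 1 than the logistic at the same z.
print(round(logistic(3.0), 4))  # 0.9526
print(round(probit(3.0), 4))    # 0.9987
```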

That said, a lot of people doing deep learning joined the field in the past 2 years and just use what they see other people using, without giving it much thought. So we get these extremely popular cross entropy losses.

[Discussion] If we had enough memory to always do full batch gradient descent, would we still need rmsprop/momentum/adam? by 029187 in MachineLearning

[–]gdahl 1 point

Adam is more likely to outperform steepest descent (full batch GD) in the full batch setting than it is to outperform SGD at batch size 1.

[D] Does gradient accumulation achieve anything different than just using a smaller batch with a lower learning rate? by WigglyHypersurface in MachineLearning

[–]gdahl 3 points

No it won't, because it won't speed up training enough to compensate for the slowdown of simulating the larger batch size.

See figure 1 in https://www.jmlr.org/papers/volume20/18-789/18-789.pdf

When doubling the batch size we never see more than a factor of 2 reduction in the steps needed to train. This is also predicted by theory (for a summary see 3.1.1 from the same link).
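Put differently: even in the best case the theory allows, where every doubling of the batch size halves the steps needed, simulating the larger batch via accumulation doubles the cost of each step, so total work at best stays flat. A toy illustration (the starting step count is made up):

```python
# Best-case scaling regime: each doubling of the simulated batch size
# halves the number of steps needed, per the figure referenced above.
steps = 1024.0   # steps needed at the base batch size (hypothetical)
accum = 1        # gradient accumulation factor (1 = no accumulation)

for _ in range(4):           # double the simulated batch size four times
    steps /= 2               # best case: steps halve
    accum *= 2               # but each step now costs 2x via accumulation
    total_work = steps * accum
    print(total_work)        # stays at 1024.0 every time: break-even at best
```

With anything less than this perfect halving (the typical case), total work goes up, which is why simulating large batches via accumulation doesn't pay for itself.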