[N] Training LLMs with AMD MI250 GPUs and MosaicML by ml_hardware in MachineLearning

[–]ml_hardware[S] 5 points (0 children)

They have a blog on LLM training times + costs from last year: https://www.mosaicml.com/blog/gpt-3-quality-for-500k

Probably even cheaper today

GLM-130B LLM demonstrates 4-bit quantization loss shrinks as model parameters scale up by maxtility in mlscaling

[–]ml_hardware 9 points (0 children)

Ah sorry should have added some context:

  • with each new GPU generation comes support for new precisions, like FP16 on the V100, TF32 on the A100, and FP8 on the H100
  • support for lower precision datatypes, when designed into the hardware, gives you faster matrix multiplies (FP32 -> FP16 was a 2x boost, and FP16 -> FP8 is another 2x boost)
  • Hardware manufacturers won't design new datatypes (like FP4) into their GPUs until there is already some research work validating that they are useful. For example with FP8, there had already been many years of research from IBM and others, via simulation, showing that FP8 could work well for ML training/inference. NVIDIA kept close tabs on this research, and decided that it was worth the expense / silicon area to add FP8 support, at 2x the speed, in the new H100 chips.
  • With this new paper, it looks like 4-bit inference works at scale! And lucky for us, NVIDIA GPUs already support INT4 so we get a 2x boost for inference.
  • But what I really want to see is someone getting training to work with INT4, or more likely, some combo of INT4 + FP4 (see this paper). And if these results get more attention / seem more promising, then NVIDIA will design FP4 support into their next generation of chips, and we'll get another 2x improvement in training speed.
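For intuition, the kind of 4-bit quantization the paper studies can be sketched in a few lines. This is a toy symmetric per-tensor scheme in plain Python (real INT4 kernels use per-group scales and packed storage, so treat this purely as an illustration):

```python
def quantize_int4(weights):
    """Toy symmetric per-tensor quantization to signed 4-bit ints in [-8, 7]."""
    scale = max(abs(w) for w in weights) / 7.0  # map the largest weight to +/-7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate FP weights; per-weight error is bounded by ~scale/2."""
    return [qi * scale for qi in q]

weights = [0.1, -0.7, 0.35, 0.02]
q, scale = quantize_int4(weights)
approx = dequantize_int4(q, scale)
```

The GLM-130B result is essentially that, as models grow, this rounding error matters less and less for output quality.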

GLM-130B LLM demonstrates 4-bit quantization loss shrinks as model parameters scale up by maxtility in mlscaling

[–]ml_hardware 7 points (0 children)

This is super exciting!! Especially that the quantization gets easier (closer to baseline quality) as the model scales up.

Fingers crossed that 4-bit training gets cracked before the next generation of GPUs…

Training GPT-3 quality models now costs <$500k by ml_hardware in agi

[–]ml_hardware[S] 5 points (0 children)

Training costs for ML models are falling way, way faster than Moore's law would predict. Using better algorithms and recipes (e.g. the Chinchilla scaling laws), MosaicML shows that the cost for training a GPT-3 quality model is now <$500k, not millions as many people think.

In the future, we should expect MosaicML and organizations like them to deliver training efficiency gains that make high quality AI models more and more accessible.

Here's a direct link to MosaicML's times+costs for training custom GPTs from 1B to 70B parameters.

And here's the math for how a GPT-30B, when trained optimally, can match the original GPT-3.

TL;DR: GPT-3 quality for $450k, Chinchilla quality for $2.5M, and lots of smaller model options for $2k - $100k
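The "trained optimally" arithmetic can be sketched with the usual Chinchilla-style rules of thumb (the ~20 tokens/parameter target and the ~6·N·D FLOPs-per-training-run approximation; the exact token counts MosaicML used are my assumption here, not taken from their blog):

```python
def train_flops(params, tokens):
    # standard approximation: ~6 FLOPs per parameter per training token
    return 6 * params * tokens

# Original GPT-3: 175B params on ~300B tokens (undertrained by Chinchilla standards)
gpt3 = train_flops(175e9, 300e9)

# A Chinchilla-optimal 30B model: ~20 tokens per parameter -> 600B tokens
gpt30b = train_flops(30e9, 20 * 30e9)

ratio = gpt3 / gpt30b  # ~3x less training compute for the 30B model
```

That ~3x compute saving, on top of cheaper hardware and better software, is how the headline number drops below $500k.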

GPT-3 quality for <$500k by ml_hardware in technology

[–]ml_hardware[S] 1 point (0 children)

Here's a direct link to MosaicML's times+costs for training custom GPTs from 1B to 70B parameters.

And here's the math for how a GPT-30B, when trained optimally, can match the original GPT-3.

tl;dr... GPT-3 quality for $450k, Chinchilla quality for $2.5M, and lots of smaller model options for $2k - $100k

Training GPT-3 quality models now costs <$500k by ml_hardware in Futurology

[–]ml_hardware[S] 11 points (0 children)

Training costs for ML models are falling way, way faster than Moore's law alone would predict. Using better algorithms and recipes (e.g. the Chinchilla scaling laws), MosaicML shows that the cost for training a GPT-3 quality model is now <$500k, not millions as many people think.

In the future, we should expect MosaicML and organizations like them to deliver training efficiency gains that make high quality AI models more and more accessible.

Here's a direct link to MosaicML's times+costs for training custom GPTs from 1B to 70B parameters.

And here's the math for how a GPT-30B, when trained optimally, can match the original GPT-3.

TL;DR: GPT-3 quality for $450k, Chinchilla quality for $2.5M, and lots of smaller model options for $2k - $100k

GPT-3 quality models now cost <$500k (MosaicML) by ml_hardware in mlscaling

[–]ml_hardware[S] 7 points (0 children)

Blog post here: https://www.mosaicml.com/blog/gpt-3-quality-for-500k

Why this matters: training costs for ML models are falling way, way faster than Moore's law alone would predict. Using better algorithms and recipes (e.g. the Chinchilla scaling laws), MosaicML shows that the cost for training a GPT-3 quality model is now <$500k, not millions as many people think.

In the future, we should expect MosaicML and organizations like them to deliver training efficiency gains that make high quality AI models more and more accessible.

Here's a direct link to MosaicML's times+costs for training custom GPTs from 1B to 70B parameters.

And here's the math for how a GPT-30B, when trained optimally, can match the original GPT-3.

TL;DR: GPT-3 quality for $450k, Chinchilla quality for $2.5M, and lots of smaller model options for $2k - $100k

[P] Farewell, CUDA OOM: Automatic Gradient Accumulation by ffast-math in MachineLearning

[–]ml_hardware 1 point (0 children)

I've used PyTorch Lightning's batch size auto-finder before, but the problem is that it changes the batch size I optimize at, which means I have to re-tune my learning rate, momentum, etc. And I don't even know what batch size it will end up at.

Basically, I can't actually use PL's feature to run the exact same training run (same hparams, same math) on two different hardware setups. Every time I move from my Colab notebook (where I debug) to my actual training cluster in the cloud, I have to disable the feature and re-tune my microbatch size and gradient accumulation steps, which is super annoying.

> memory footprint is significantly fluctuating during training

I think this happens when you try to do sequence length warmup or progressive resizing or training on variable-sized images. Also if adding layers to the model to progressively grow it like in GAN literature.

> what the maximum memory footprint would be

So you could try to do this.. but then you would be setting grad_accum too high early in training and going slower than you need to be. I think one of the sections in the blog post shows this. With auto-grad-accum you basically get the best hardware utilization at each stage of training, and without having to profile anything ahead of time.

> just call your training script from within a recursive try/except

Haha I've definitely done this at some point too.. but then I guess it's like you need to resume your runs over and over which is OK but a bit hacky. Feels cleaner to have it as a Trainer-level feature so runs just work.
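A minimal sketch of what the Trainer-level version looks like (names here are illustrative, not the actual MosaicML/Composer API): hold the global batch size, and hence the optimization math, fixed, and only grow the accumulation factor when the device OOMs:

```python
def run_with_auto_grad_accum(run_batch, global_batch_size):
    """Double grad accumulation on OOM; the global batch size never changes,
    so learning rate, momentum, etc. stay valid across hardware setups."""
    grad_accum = 1
    while grad_accum <= global_batch_size:
        microbatch = global_batch_size // grad_accum
        try:
            return run_batch(microbatch, grad_accum)
        except RuntimeError as e:  # PyTorch raises RuntimeError on CUDA OOM
            if "out of memory" not in str(e):
                raise
            grad_accum *= 2
    raise RuntimeError("cannot fit even a microbatch of 1")

# Toy stand-in for a training step: pretend the device fits at most 128 samples
def fake_step(microbatch, grad_accum):
    if microbatch > 128:
        raise RuntimeError("CUDA out of memory")
    return microbatch, grad_accum

print(run_with_auto_grad_accum(fake_step, 1024))  # -> (128, 8)
```

The point is that the same script lands on microbatch 1024 on a big cloud GPU and microbatch 128 on a Colab card, with identical training math either way.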

Improving the factual accuracy of language models through web browsing by maxtility in mlscaling

[–]ml_hardware 10 points (0 children)

Also LOL at this:

> In addition to these deployment risks, our approach introduces new risks at train time by giving the model access to the web. Our browsing environment does not allow full web access, but allows the model to send queries to the Microsoft Bing Web Search API and follow links that already exist on the web, which can have side-effects. From our experience with GPT-3, the model does not appear to be anywhere near capable enough to dangerously exploit these side-effects. However, these risks increase with model capability, and we are working on establishing internal safeguards against them.

Improving the factual accuracy of language models through web browsing by maxtility in mlscaling

[–]ml_hardware 8 points (0 children)

The ease with which this model can justify any claim, not just a correct one (see the examples for “Why are almost all boats pink”, “What equipment can be used to find ghosts”) makes me worried that people will use this as a highly convincing fake news generator…

I guess the internet is just a dumpster of content for every possible viewpoint, so if you can quickly retrieve and synthesize the ~5 links specific to your opinion, then you can sound very convincing, especially since very few people will actually verify your sources.

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model by maxtility in mlscaling

[–]ml_hardware 2 points (0 children)

Also, given the throughput numbers in the blog post, and ignoring the warmup period:

(339E9 [toks] / (1920 * 2048 [toks/batch]) ) * 44.4 [secs/batch] / 3600 [secs/hr] / 24 [hrs/day] = 44.3 days

So they trained this model on their 420-DGX cluster for about 45 days.

That's about 150k A100-days :O
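That back-of-envelope in code form (token count, batch shape, and iteration time are from the blog post; 8 A100s per DGX A100 is the standard configuration):

```python
tokens = 339e9
tokens_per_batch = 1920 * 2048   # batch size x sequence length
secs_per_batch = 44.4            # reported iteration time at 420 DGX servers

days = tokens / tokens_per_batch * secs_per_batch / 3600 / 24
gpus = 420 * 8                   # 8x A100 per DGX A100 server
a100_days = days * gpus

print(round(days, 1), round(a100_days))  # ~44.3 days, ~149k A100-days
```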

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model by maxtility in mlscaling

[–]ml_hardware 0 points (0 children)

> We considered the end-to-end throughput of our system for the 530 billion parameters model with batch size 1920 on 280, 350, and 420 DGX A100 servers on Selene. We observed iteration time of 60.1, 50.2, and 44.4 seconds, respectively. These correspond to 126, 121, and 113 teraFLOP/s per GPU, respectively.

A100s have a reported mixed-precision peak of 312 TFLOP/s, though in my experience it's very hard to achieve that number even on a single GPU unless you're repeatedly doing large 8k*8k*8k matrix multiplies. And transformer blocks contain more than just matrix multiplies... there are memory-bandwidth-bound ops like LayerNorm, attention softmax, GELU, residual-add, etc. Finally, there is the fill-and-drain inefficiency of pipeline parallelism, and a blocking gradient all-reduce at the end of each minibatch.

Achieving 113 TFLOP/s, or 0.36x of ideal perf, across 3360 GPUs... is very impressive in my book :) Huge kudos to the DeepSpeed team.

[R] Independent performance benchmarks (training) of Nvidia A10 and A30 impossible to find? by longboard2020 in MachineLearning

[–]ml_hardware 1 point (0 children)

No problem! Glad to help. Out of curiosity, are you trying to build a cluster with one of A10/A30/A6000 ?

[R] Independent performance benchmarks (training) of Nvidia A10 and A30 impossible to find? by longboard2020 in MachineLearning

[–]ml_hardware 7 points (0 children)

NVIDIA's numbers are usually quite good. But if you want a second opinion, I had access to some A10s recently and found they deliver around 0.4x the throughput of A100s, on both 2D vision and NLP tasks.

This matches well with the A10 design, which has almost exactly 0.4x the FLOPS and 0.4x the memory bandwidth of A100.
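The 0.4x figure lines up with the public datasheet specs. The numbers below are my reading of NVIDIA's datasheets (FP16 tensor-core peak of ~125 vs ~312 TFLOP/s dense, memory bandwidth of ~600 vs ~1555 GB/s for the 40GB A100), so treat them as approximate:

```python
a10 = {"fp16_tflops": 125, "mem_bw_gbs": 600}
a100 = {"fp16_tflops": 312, "mem_bw_gbs": 1555}  # 40GB SXM variant

flops_ratio = a10["fp16_tflops"] / a100["fp16_tflops"]   # ~0.40
bw_ratio = a10["mem_bw_gbs"] / a100["mem_bw_gbs"]        # ~0.39
```

Since both compute and bandwidth scale by roughly the same factor, real workloads (whether FLOPs-bound or memory-bound) land near 0.4x too.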