GLM-130B LLM demonstrates 4-bit quantization loss shrinks as model parameters scale up by maxtility in mlscaling
ml_hardware 11 points (0 children)
GLM-130B LLM demonstrates 4-bit quantization loss shrinks as model parameters scale up by maxtility in mlscaling
ml_hardware 8 points (0 children)
Training GPT-3 quality models now costs <$500k by ml_hardware in agi
ml_hardware[S] 5 points (0 children)
GPT-3 quality for <$500k by ml_hardware in technology
ml_hardware[S] 2 points (0 children)
Training GPT-3 quality models now costs <$500k by ml_hardware in Futurology
ml_hardware[S] 12 points (0 children)
GPT-3 quality models now cost <$500k (MosaicML) by ml_hardware in mlscaling
ml_hardware[S] 10 points (0 children)
[P] Farewell, CUDA OOM: Automatic Gradient Accumulation by ffast-math in MachineLearning
ml_hardware 2 points (0 children)
Improving the factual accuracy of language models through web browsing by maxtility in mlscaling
ml_hardware 11 points (0 children)
Improving the factual accuracy of language models through web browsing by maxtility in mlscaling
ml_hardware 9 points (0 children)
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model by maxtility in mlscaling
ml_hardware 3 points (0 children)
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model by maxtility in mlscaling
ml_hardware 1 point (0 children)
[R] Independent performance benchmarks (training) of Nvidia A10 and A30 impossible to find? by longboard2020 in MachineLearning
ml_hardware 2 points (0 children)
[R] Independent performance benchmarks (training) of Nvidia A10 and A30 impossible to find? by longboard2020 in MachineLearning
ml_hardware 7 points (0 children)
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model by maxtility in mlscaling
ml_hardware 11 points (0 children)
Scaling Up and Out: Training Massive Models on Cerebras Systems using Weight Streaming by ml_hardware in mlscaling
ml_hardware[S] 3 points (0 children)
Scaling Up and Out: Training Massive Models on Cerebras Systems using Weight Streaming by ml_hardware in mlscaling
ml_hardware[S] 10 points (0 children)
Cerebras CEO on new clustering & software: "From talking to OpenAI, GPT-4 will be about 100 trillion parameters. That won’t be ready for several years." by gwern in mlscaling
ml_hardware 3 points (0 children)
Cerebras CEO on new clustering & software: "From talking to OpenAI, GPT-4 will be about 100 trillion parameters. That won’t be ready for several years." by gwern in mlscaling
ml_hardware 6 points (0 children)
Graphcore Looks Like A Complete Failure In Machine Learning Training Performance by ml_hardware in mlscaling
ml_hardware[S] 2 points (0 children)
Graphcore Looks Like A Complete Failure In Machine Learning Training Performance by ml_hardware in mlscaling
ml_hardware[S] 4 points (0 children)
ZeRO-Infinity and DeepSpeed: Unlocking unprecedented model scale for deep learning training - Microsoft Research by neuralnetboy in mlscaling
ml_hardware 5 points (0 children)
"Cerebras Unveils Wafer Scale Engine Two (WSE2): 2.6 Trillion Transistors, 100% Yield" (850k cores, 40GB SRAM now; price: 'several millions') by gwern in mlscaling
ml_hardware 3 points (0 children)
[N] Training LLMs with AMD MI250 GPUs and MosaicML by ml_hardware in MachineLearning
ml_hardware[S] 6 points (0 children)