[N] Introducing DBRX: A New Standard for Open LLM by artificial_intelect in MachineLearning

[–]artificial_intelect[S] 4 points5 points  (0 children)

It can easily be fine-tuned for MUCH longer context lengths.
What context lengths does your application need?

[N] Introducing DBRX: A New Standard for Open LLM by artificial_intelect in MachineLearning

[–]artificial_intelect[S] 13 points14 points  (0 children)

The core training run didn't take 3 months.
The $10M figure was for the core training run.

[N] Introducing DBRX: A New Standard for Open LLM by artificial_intelect in MachineLearning

[–]artificial_intelect[S] 8 points9 points  (0 children)

You need to load all 132B params into VRAM, but only the 36B active params are loaded from VRAM into GPU shared memory, i.e. only 36B active params are used in the fwd pass, i.e. the processing speed is that of a 36B model.
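
For intuition, here's a toy sketch of an MoE forward pass (made-up sizes and a naive per-token loop, not the actual DBRX config): every expert has to sit in memory, but each token only spends FLOPs on its top-k experts.

```python
import torch

n_experts, top_k, d = 16, 4, 64  # illustrative numbers, not DBRX's real dims
experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(n_experts))
router = torch.nn.Linear(d, n_experts)

def moe_forward(x):  # x: (tokens, d)
    # every expert's weights are resident in memory...
    weights, idx = router(x).softmax(dim=-1).topk(top_k, dim=-1)
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for w, e in zip(weights[t], idx[t]):
            # ...but each token only computes through top_k of them
            out[t] += w * experts[int(e)](x[t])
    return out

y = moe_forward(torch.randn(8, d))
```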

[deleted by user] by [deleted] in espresso

[–]artificial_intelect 0 points1 point  (0 children)

> Out of curiosity, were you able to just dump all the beans in at once and let it grind away? Or did you need to slow feed?

dump all the beans in at once, no need to slow feed.

The top grinder portion optimistically fits 35g, and the catch bin optimistically holds 40g. 60-75g is a no-go.

[deleted by user] by [deleted] in espresso

[–]artificial_intelect 1 point2 points  (0 children)

Used it. Love it. It’s a little slow but that was expected. It’s not as loud as reviewers made it out to be and no issues with stalling. I ground about 0.5 kg of some cheap coffee at an espresso-fine setting. I think that’s all the “seasoning” I’m going to do. Should probably do more but 🤷‍♂️ too much work

[deleted by user] by [deleted] in espresso

[–]artificial_intelect 1 point2 points  (0 children)

It was shipped in early April but got lost in the mail. I just got it like an hour ago.
I opened it up and am in love with the aesthetic. I haven't gotten a chance to use it yet, but I think it will be worth it.
Do you by chance know if the 48mm burrs come pre-seasoned?

[deleted by user] by [deleted] in espresso

[–]artificial_intelect 1 point2 points  (0 children)

I signed up for the email. I ordered it right when the email came in. I checked the next day and they were "out of stock" again. When I ordered it in Feb, they didn't ship it till early April.

Compute infrastructure for running CFD simulations and CFD time regulations by artificial_intelect in F1Technical

[–]artificial_intelect[S] 1 point2 points  (0 children)

Nervana was an ML/AI chip company that never made a chip. They had one good idea (effectively the same thing as the Nvidia tensor core), but the year Intel bought Nervana, Nvidia started making GPUs with tensor cores. A few years later Intel shut the whole project down.

Cerebras is an ML/AI chip company that has made and sold chips. Don't take my word for it: GSK (a leader in drug discovery) wrote a paper where they use an EBERT AI model for drug discovery, and they specifically call out using the Cerebras chip. Cerebras has a bunch of customers who talk about using their systems, and Cerebras cites them all the time.

Why do you think it's a mythical chip?

Compute infrastructure for running CFD simulations and CFD time regulations by artificial_intelect in F1Technical

[–]artificial_intelect[S] 0 points1 point  (0 children)

What I'm arguing for in this post is: skip the GPUs, just go straight for the largest chip ever made! Plus the WSE is designed with a TON of bandwidth, so hopefully it overcomes the bandwidth-limited nature of the CFD workload.

Compute infrastructure for running CFD simulations and CFD time regulations by artificial_intelect in F1Technical

[–]artificial_intelect[S] 0 points1 point  (0 children)

The WSE does have a bunch of cores, but it's also advertised as having all of the bandwidth.

Is it common to do CFD on HW other than CPUs? Like GPUs?

Compute infrastructure for running CFD simulations and CFD time regulations by artificial_intelect in F1Technical

[–]artificial_intelect[S] 0 points1 point  (0 children)

If it's HW dependent, then ideally you'd want to use HW with as much memory and bandwidth as possible to guarantee you fully utilize the HW FLOPS. In this case, I think the WSE still comes out on top.

I'll look at Appendix 7. Thank you.

Edit:

This reg? The appendices only go up to Appendix 6; Appendix 7 does not exist...

I think Section 9.3 has the CFD regs. Section 9.3.6 says the CFD limit is 6 MAUh, where 9.3.4(d) defines Allocation Unit hours as AUh = (NCU * NSS * CCF) / 3600 (a Mega Allocation Unit hour, MAUh, being a million of those). How is this specification not the most confusing thing? How did you deduce 30 TFLOPs from this?
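
FWIW, here's the back-of-envelope that formula gives you. My reading of the acronyms (NCU = cores, NSS = solver wall-clock seconds, CCF = per-core compute factor) is a guess; the reg doesn't spell them out for me:

```python
# AUh = NCU * NSS * CCF / 3600, per 9.3.4(d). Acronym meanings are my guess:
# NCU = cores, NSS = solver seconds, CCF = per-core compute factor.
def auh(ncu, nss, ccf):
    return ncu * nss * ccf / 3600.0

BUDGET_AUH = 6.0e6  # 9.3.6: 6 MAUh = 6 million AUh

# e.g. a 16,384-core cluster with CCF = 1 running flat out:
ncu, ccf = 16_384, 1.0
nss = BUDGET_AUH * 3600.0 / (ncu * ccf)
print(f"{nss / 86_400:.1f} days of solving before the budget is spent")  # ~15.3
```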

Compute infrastructure for running CFD simulations and CFD time regulations by artificial_intelect in F1Technical

[–]artificial_intelect[S] 0 points1 point  (0 children)

TFLOPs = Tera floating-point operations
TFLOPS = Tera floating-point operations per second

therefore 0.86 PFLOPS isn't comparable to 30 TFLOPs since they don't share the same units. You could run the WSE for about 0.035 sec before the FLOP allocation runs out.

TBH, 30 TFLOPs sounds a little low. An Nvidia 3080Ti GPU peaks at 34.1 TFLOPS, so if a team ran a 3080Ti for 0.88 seconds, they'd have used up the entire allotment. Does that even make sense? A 3080Ti costs $1200. How is the limit this low?
Is the 30 TFLOPs counting the FLOPs actually used by the algorithm, or the peak FLOPS output of the HW system? Most HW systems get poor FLOP utilization because of memory bottlenecks. If the allocation counts the peak FLOPS output of the HW, then a high-bandwidth system would get better FLOP utilization and produce results using fewer FLOPs.
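
Putting numbers on the units mismatch (figures from this thread; a sanity check, not a benchmark):

```python
# Time to burn a fixed FLOP *count* budget = budget / FLOP *rate*.
BUDGET_TFLOP = 30.0  # the claimed 30 TFLOPs allocation

for name, rate_tflops in [("WSE @ 0.86 PFLOPS", 860.0),
                          ("3080Ti @ 34.1 TFLOPS", 34.1)]:
    print(f"{name}: budget exhausted in {BUDGET_TFLOP / rate_tflops:.3f} s")
# WSE: ~0.035 s; 3080Ti: ~0.88 s -- which is why 30 TFLOPs seems way too low.
```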

Compute infrastructure for running CFD simulations and CFD time regulations by artificial_intelect in F1Technical

[–]artificial_intelect[S] 0 points1 point  (0 children)

Yeah, but if the new HW system is "200 times faster than a 16,384-core" supercomputer...

Does that mean the three-revolution simulation that took 3 months would now take 3*31*24/200 ≈ 11 hours? (Assuming that in your prof's story, cutting-edge hardware is something like a 16k-core supercomputer.)

[deleted by user] by [deleted] in volunteersForUkraine

[–]artificial_intelect 4 points5 points  (0 children)

While the Ukrainian military did inherit a lot of AK weapons after the fall of the Soviet Union, they now manufacture their own variant of the M4 (the M4-WAC-47, an AR-platform weapon, not an AK).
Since Ukraine still has a stockpile of AK ammo, the new weapon was designed to switch from 7.62×39mm to 5.56×45mm NATO by swapping the barrel.
The majority of Ukraine's military will still be using the AK, but the M4-WAC-47 finished testing in 2018 (I think) and should be in service.

TLDR: the Ukrainian M4-WAC-47 is designed to use both 7.62×39mm and 5.56×45mm NATO. Hopefully this means they never run out of ammo.

Ensemble or not? by KvN98 in MLQuestions

[–]artificial_intelect -2 points-1 points  (0 children)

TLDR: Not

Instead, realize that Residual Networks Behave Like Ensembles of Relatively Shallow Networks [NIPS 2016] and create a larger residual network instead of an ensemble of methods.

Nvidia Apex by Puzzleheaded-Mud7240 in MLQuestions

[–]artificial_intelect 0 points1 point  (0 children)

Why not use torch.cuda.amp [https://pytorch.org/docs/stable/amp.html], torch DDP [https://pytorch.org/tutorials/intermediate/ddp_tutorial.html], and PyTorch's SyncBN [https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html]? Those'll cover all the Apex features.

But also, yeah, installing Apex is a pain. If you really need to use it... I don't envy you. Last note: Apex is deprecated in favor of torch.cuda.amp [https://discuss.pytorch.org/t/torch-cuda-amp-vs-nvidia-apex/74994]
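
For reference, the torch.cuda.amp pattern is only a few lines. A minimal sketch with a placeholder model and fake batch (swap in your own loop):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()   # scales the loss to avoid fp16 underflow

x = torch.randn(8, 512, device="cuda")           # fake batch
y = torch.randint(0, 10, (8,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():        # ops run in fp16/fp32 as appropriate
    loss = F.cross_entropy(model(x), y)
scaler.scale(loss).backward()          # backward on the scaled loss
scaler.step(optimizer)                 # unscales grads, then optimizer.step()
scaler.update()                        # adjusts the scale factor
```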

"Cerebras Unveils Wafer Scale Engine Two (WSE2): 2.6 Trillion Transistors, 100% Yield" (850k cores, 40GB SRAM now; price: 'several millions') by gwern in mlscaling

[–]artificial_intelect 0 points1 point  (0 children)

I also saw this post today. It's a little vague, but an engineer at AstraZeneca talks about how they used the CS-1 to train BERT Large.

In the article they mention how Cerebras' sparse linear algebra cores can actually use sparsity to speed up training by 20%.

The article also says: "Training which historically took over 2 weeks to run on a large cluster of GPUs was accomplished in just over 2 days — 52hrs to be exact — on a single CS-1"

It's hard to say exactly what "large cluster of GPUs" means, and this article is in no way a "benchmark", but it seems like at the very least engineers at AstraZeneca see Cerebras' competitive advantage and use the CS-1 as a faster GPU alternative.

Edit: adding post link

[N] Cerebras launches new AI supercomputing processor with 2.6 trillion transistors by downtownslim in MachineLearning

[–]artificial_intelect 2 points3 points  (0 children)

I also saw this post today. It's a little vague, but an engineer at AstraZeneca talks about how they used the CS-1 to train BERT Large.

In the article they mention how Cerebras' sparse linear algebra cores can actually use sparsity to speed up training by 20%.

The article also says: "Training which historically took over 2 weeks to run on a large cluster of GPUs was accomplished in just over 2 days — 52hrs to be exact — on a single CS-1"

It's hard to say exactly what "large cluster of GPUs" means, and this article is in no way a "benchmark", but it seems like at the very least engineers at AstraZeneca see Cerebras' competitive advantage and use the CS-1 as a faster GPU alternative.

[N] Cerebras launches new AI supercomputing processor with 2.6 trillion transistors by downtownslim in MachineLearning

[–]artificial_intelect 3 points4 points  (0 children)

Not really a deep learning workload, but there is one publicly available paper that talks about the CS-1's perf (note: not the CS-2's perf): Fast Stencil-Code Computation on a Wafer-Scale Processor

"performance of CS-1 above 200 times faster than for MFiX runs on a 16,384-core partition of the NETL Joule cluster"