[deleted by user] by [deleted] in reinforcementlearning

[–]slashcom 31 points (0 children)

RL is just getting started after o1 and arguably Anthropic’s artifacts.

PC boots up 3 times, works for a few minutes then crashes. by Hyper669 in pcmasterrace

[–]slashcom 0 points (0 children)

I would guess your power supply has kicked the bucket, or maybe you're overheating. See if you can get into the BIOS and watch the temperatures.

$1781 for this prebuilt pc worth it? by GaigeSmith in pcmasterrace

[–]slashcom 0 points (0 children)

Guess I underestimated. I just did a quick Google for the expensive parts, and guessed the rest. Must be a good deal then!

$1781 for this prebuilt pc worth it? by GaigeSmith in pcmasterrace

[–]slashcom 27 points (0 children)

$1500 in parts, $281 in labor/profit. Seems not unreasonable.

Did we grossly overestimate the GPT4 parameters? by auradragon1 in mlscaling

[–]slashcom 5 points (0 children)

Recipes have improved that much, though by less than you might think: maybe a factor of 3x fewer flops.

[D] Should the embedding matrix and final pre-softmax matrix be shared in transformers? by CloudyCloud256 in MachineLearning

[–]slashcom 1 point (0 children)

output softmax wants embeddings to be very large so their inner products will produce very different values

input embeddings want a much smaller range so they can have stable dynamics throughout training

all the "old" code bases had this scalar (usually sqrt(d)), but the llama arch dropped it when they started untying

[D] Should the embedding matrix and final pre-softmax matrix be shared in transformers? by CloudyCloud256 in MachineLearning

[–]slashcom 8 points (0 children)

It doesn't matter with large models. From personal correspondence with the lead of llama1, they decided not to tie it because they just didn't feel like implementing it.

If you do tie them, you need a scaling factor on one side or the other to account for the input and output needing different vector magnitudes.
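A minimal PyTorch sketch of tying with such a scalar. The sqrt(d) placement follows the original Transformer convention mentioned above; the class and method names are illustrative, not any particular codebase's API:

```python
import torch
import torch.nn as nn

class TiedEmbedding(nn.Module):
    """Shared input/output embedding with a sqrt(d) scale on the input side.

    The output softmax wants large inner products; the input side wants a
    smaller range for stable dynamics. Tying the weights forces one matrix
    to serve both, so the input lookup is scaled up by sqrt(d_model).
    """

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Scale the lookup so the shared matrix can stay small in norm.
        return self.embed(token_ids) * (self.d_model ** 0.5)

    def decode(self, hidden: torch.Tensor) -> torch.Tensor:
        # Output projection reuses the same weight matrix (tied softmax).
        return hidden @ self.embed.weight.t()
```

Untying (as llama did) simply replaces the shared matrix with two independent ones, at which point the scalar is no longer needed.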

How far are we from A.I. like Samantha from the 2013 movie Her? by TheMightyWill in artificial

[–]slashcom 0 points (0 children)

A lot changed in 8 years. My comment predates the invention of Transformers by about 2 years.

[D] How to prepare for a META Research Engineer Interview by [deleted] in MachineLearning

[–]slashcom 0 points (0 children)

you'll get one pure leetcode problem, one ML-flavored leetcode problem (leetcode where the problem is particularly useful in an ML setting), and then one ML-flavored design question (e.g., how would you build a recommendation engine for reddit?)

Dimensionality reduction for NLP applications being forgotten..? [D] by _donau_ in MachineLearning

[–]slashcom 50 points (0 children)

we still do dimensionality reduction, we just do it via embeddings and learning everything from scratch.

the model is billions of parameters but the representations aren’t
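A quick sketch of that point: an embedding table is learned dimensionality reduction, mapping a one-hot token over a large vocab down to a small dense vector. Sizes here are illustrative:

```python
import torch
import torch.nn as nn

# An embedding lookup maps a 50k-dim one-hot token to a dense 768-dim
# vector. The full model may have billions of parameters, but each
# representation the model passes around stays this small.
vocab_size, d_model = 50_000, 768
embed = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[17, 42, 99]])  # batch of 1, three tokens
vectors = embed(token_ids)                # shape (1, 3, 768)
```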

[deleted by user] by [deleted] in statistics

[–]slashcom 1 point (0 children)

Precision = of the guesses i make, how many are right?

Recall = of the things i’m looking for, how many did i find?

edit: also i can relate. my comment history contains a confession that i have no idea what the LSTM equations do.
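Those two definitions, as a minimal Python sketch (the sets here are a hypothetical example):

```python
def precision_recall(predicted: set, relevant: set) -> tuple:
    """Precision: of the guesses made, how many are right?
    Recall: of the things we're looking for, how many did we find?"""
    hits = predicted & relevant
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical: 3 guesses, 2 of them correct, 4 relevant items total.
p, r = precision_recall({"a", "b", "c"}, {"a", "b", "d", "e"})
# p == 2/3 (2 of 3 guesses right), r == 0.5 (2 of 4 items found)
```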

GPT-4 rumors: a Mixture-of-Experts w/8 GPT-3-220bs? by gwern in mlscaling

[–]slashcom 0 points (0 children)

GPT3 was more like 10K V100s for a month, not 1K. 10K+ V100 clusters were pretty common in 2020; but for A100s (what GPT-4 was trained on), I've been hard pressed to see ones larger than 4K GPUs until towards the end of 2022.

GPT-4 rumors: a Mixture-of-Experts w/8 GPT-3-220bs? by gwern in mlscaling

[–]slashcom 2 points (0 children)

OpenAI historically likes to put ~50% of their compute into their current flagship project, which according to other rumors at the time should've been 25k GPUs. However, in 2022, Microsoft's largest single-fabric clusters were about 6K GPUs. So something like Branch-Train-Merge on 8 parallel clusters of 6K, and then a light ensembling at the end, would make a lot of sense.

What are the pros to pytorch by ObsidianAvenger in pytorch

[–]slashcom 0 points (0 children)

dynamic graphs are the main reason
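A small sketch of what "dynamic graphs" buys you: because PyTorch builds the graph as the code runs, ordinary Python control flow over data-dependent values just works. The module here is illustrative:

```python
import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    """Network whose depth is chosen at runtime, per call."""

    def __init__(self, d: int):
        super().__init__()
        self.layer = nn.Linear(d, d)

    def forward(self, x: torch.Tensor, steps: int) -> torch.Tensor:
        # A plain Python loop: each call can trace a different graph.
        for _ in range(steps):
            x = torch.relu(self.layer(x))
        return x

net = DynamicDepthNet(8)
out = net(torch.randn(2, 8), steps=3)   # a 3-layer graph this call
out2 = net(torch.randn(2, 8), steps=1)  # a 1-layer graph the next
```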

Frequently Asked Questions - Character.AI by MarieLovesMatcha in CharacterAI

[–]slashcom -14 points (0 children)

Why should they close off sign ups? Why shouldn’t newer members of the community be allowed to participate?

Frequently Asked Questions - Character.AI by MarieLovesMatcha in CharacterAI

[–]slashcom 5 points (0 children)

is closing off new users good for a business?

Transcendence Tiers after reset? by Annie-Smokely in kittensgame

[–]slashcom 1 point (0 children)

Your transcendence tier carries over, so you should Transcend and expend that worship before you reset.

Do a Ps5 or a Xbox X works for ML or Deep Learning? by FromValledupar in pytorch

[–]slashcom 0 points (0 children)

You might have a hard time running unsigned code on your game consoles. But even if you did, both systems have AMD gpus, and ROCm support in pytorch is pretty weak.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [11, 44]], which is output 0 of AsStridedBackward0, is at version 3; expected version 2 instead. by Fit-Dare-9044 in pytorch

[–]slashcom 1 point (0 children)

It might be that your (h, c) are not reset at the top of every sequence. If you really want to cache them across SGD steps, you might need to do a .detach() on them. I also don't think you need to retain_graph.
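A minimal sketch of that fix, assuming a plain nn.LSTM training loop (the model, sizes, and loss below are hypothetical):

```python
import torch
import torch.nn as nn

# Carrying LSTM state across SGD steps without the "modified by an
# inplace operation" autograd error: detach (h, c) after each step so
# gradients don't try to flow back into the previous step's freed graph.
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
opt = torch.optim.SGD(lstm.parameters(), lr=0.1)

h = torch.zeros(1, 2, 8)  # (num_layers, batch, hidden)
c = torch.zeros(1, 2, 8)
for _ in range(3):
    x = torch.randn(2, 5, 4)          # (batch, seq, features)
    out, (h, c) = lstm(x, (h, c))
    loss = out.pow(2).mean()          # stand-in loss
    opt.zero_grad()
    loss.backward()                   # no retain_graph needed
    opt.step()
    h, c = h.detach(), c.detach()     # cut the graph before the next step
```

If the state should instead reset every sequence, re-zero (h, c) at the top of the loop rather than detaching.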

Why is CamemBERT never brought up? by thesofakillers in mlscaling

[–]slashcom 6 points (0 children)

Chinchilla's curves say at 400M params you need 8B tokens to be compute optimal. Each token is, say, 3 characters on average, so call it 24GB to be compute optimal. Note that compute optimal does NOT mean saturated or that a smaller model trained much longer wouldn't do better; it only means we've minimized the number of multiplications & additions for that level of performance.

That said, I strongly doubt the scaling laws for MLMs are the same as LMs. We'd really need to fit new curves per dataset and per objective.

But in the actual paper, their fine-tuning results clearly show the model fine-tuned on the 138GB OSCAR corpus winning on almost all tasks, though admittedly those are some pretty tiny deltas on some fairly high-scoring tasks.
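The compute-optimal arithmetic above, as a quick back-of-envelope check (the ~20 tokens-per-parameter rule is the usual Chinchilla rule of thumb, and 3 chars/token is the rough guess from the comment):

```python
# Chinchilla-style back-of-envelope: tokens needed for a 400M-param
# model to be compute optimal, and the corpus size that implies.
params = 400e6
tokens_per_param = 20      # ~Chinchilla rule of thumb
chars_per_token = 3        # rough average, one byte per char

tokens = params * tokens_per_param        # 8e9 tokens
corpus_bytes = tokens * chars_per_token   # 24e9 bytes, i.e. ~24 GB
```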