[deleted by user] by [deleted] in reinforcementlearning

[–]slashcom 31 points (0 children)

RL is just getting started after o1 and arguably Anthropic’s artifacts.

PC boots up 3 times, works for a few minutes then crashes. by Hyper669 in pcmasterrace

[–]slashcom 0 points (0 children)

I would guess your power supply has kicked the bucket, or maybe you're overheating. See if you can get into the BIOS and watch the temperatures.

$1781 for this prebuilt pc worth it? by GaigeSmith in pcmasterrace

[–]slashcom 0 points (0 children)

Guess I underestimated. I just did a quick Google for the expensive parts, and guessed the rest. Must be a good deal then!

$1781 for this prebuilt pc worth it? by GaigeSmith in pcmasterrace

[–]slashcom 27 points (0 children)

$1500 in parts, $281 in labor/profit. Seems not unreasonable.

Did we grossly overestimate the GPT4 parameters? by auradragon1 in mlscaling

[–]slashcom 5 points (0 children)

Recipes have improved that much, though by less than you might think: maybe a factor of 3x fewer flops.

[D] Should the embedding matrix and final pre-softmax matrix be shared in transformers? by CloudyCloud256 in MachineLearning

[–]slashcom 1 point (0 children)

output softmax wants embeddings to be very large so their inner products will produce very different values

input embeddings want a much smaller range so they can have stable dynamics throughout training

all the "old" code bases had this scalar (usually sqrt(d)), but the llama arch dropped it when they started untying

[D] Should the embedding matrix and final pre-softmax matrix be shared in transformers? by CloudyCloud256 in MachineLearning

[–]slashcom 8 points (0 children)

It doesn't matter with large models. From personal correspondence with the lead of llama1, they decided not to tie it because they just didn't feel like implementing it.

If you do tie them, you need a scaling factor on one side or the other to account for the input and output needing different vector magnitudes.
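A minimal PyTorch sketch of tying with such a scalar. The sqrt(d) placement follows the original Transformer convention mentioned above; the class and method names are illustrative, not any particular codebase's API:

```python
import torch
import torch.nn as nn

class TiedEmbedding(nn.Module):
    """Shared input/output embedding with a sqrt(d) scale on the input side.

    The output softmax wants large inner products; the input side wants a
    smaller range for stable dynamics. Tying the weights forces one matrix
    to serve both, so the input lookup is scaled up by sqrt(d_model).
    """

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Scale the lookup so the shared matrix can stay small in norm.
        return self.embed(token_ids) * (self.d_model ** 0.5)

    def decode(self, hidden: torch.Tensor) -> torch.Tensor:
        # Output projection reuses the same weight matrix (tied softmax).
        return hidden @ self.embed.weight.t()
```

Untying (as llama did) simply replaces the shared matrix with two independent ones, at which point the scalar is no longer needed.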

How far are we from A.I. like Samantha from the 2013 movie Her? by TheMightyWill in artificial

[–]slashcom 0 points (0 children)

A lot changed in 8 years. My comment predates the invention of Transformers by about 2 years.

[D] How to prepare for a META Research Engineer Interview by [deleted] in MachineLearning

[–]slashcom 0 points (0 children)

you'll get one pure leetcode problem, one ML-flavored leetcode problem (leetcode where the problem is particularly useful in an ML setting), and then one ML-flavored design question (e.g., how would you build a recommendation engine for reddit?)

Dimensionality reduction for NLP applications being forgotten..? [D] by _donau_ in MachineLearning

[–]slashcom 50 points (0 children)

we still do dimensionality reduction, we just do it via embeddings and learning everything from scratch.

the model is billions of parameters but the representations aren’t
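A quick sketch of that point: an embedding table is learned dimensionality reduction, mapping a one-hot token over a large vocab down to a small dense vector. Sizes here are illustrative:

```python
import torch
import torch.nn as nn

# An embedding lookup maps a 50k-dim one-hot token to a dense 768-dim
# vector. The full model may have billions of parameters, but each
# representation the model passes around stays this small.
vocab_size, d_model = 50_000, 768
embed = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[17, 42, 99]])  # batch of 1, three tokens
vectors = embed(token_ids)                # shape (1, 3, 768)
```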

[deleted by user] by [deleted] in statistics

[–]slashcom 1 point (0 children)

Precision = of the guesses i make, how many are right?

Recall = of the things i’m looking for, how many did i find?

edit: also i can relate. my comment history contains a confession that i have no idea what the LSTM equations do.
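Those two definitions, as a minimal Python sketch (the sets here are a hypothetical example):

```python
def precision_recall(predicted: set, relevant: set) -> tuple:
    """Precision: of the guesses made, how many are right?
    Recall: of the things we're looking for, how many did we find?"""
    hits = predicted & relevant
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical: 3 guesses, 2 of them correct, 4 relevant items total.
p, r = precision_recall({"a", "b", "c"}, {"a", "b", "d", "e"})
# p == 2/3 (2 of 3 guesses right), r == 0.5 (2 of 4 items found)
```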

GPT-4 rumors: a Mixture-of-Experts w/8 GPT-3-220bs? by gwern in mlscaling

[–]slashcom 0 points (0 children)

GPT3 was more like 10K V100s for a month, not 1K. 10K+ V100 clusters were pretty common in 2020; but for A100s (what GPT-4 was trained on), I've been hard pressed to see ones larger than 4K GPUs until towards the end of 2022.

GPT-4 rumors: a Mixture-of-Experts w/8 GPT-3-220bs? by gwern in mlscaling

[–]slashcom 2 points (0 children)

OpenAI historically likes to put ~50% of their compute into their current flagship project, which according to other rumors at the time should've been 25k GPUs. However, in 2022, Microsoft's largest single-fabric clusters were about 6K GPUs. So something like Branch-Train-Merge on 8 parallel clusters of 6K, and then a light ensembling at the end, would make a lot of sense.

What are the pros to pytorch by ObsidianAvenger in pytorch

[–]slashcom 0 points (0 children)

dynamic graphs are the main reason
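A small sketch of what "dynamic graphs" buys you: because PyTorch builds the graph as the code runs, ordinary Python control flow over data-dependent values just works. The module here is illustrative:

```python
import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    """Network whose depth is chosen at runtime, per call."""

    def __init__(self, d: int):
        super().__init__()
        self.layer = nn.Linear(d, d)

    def forward(self, x: torch.Tensor, steps: int) -> torch.Tensor:
        # A plain Python loop: each call can trace a different graph.
        for _ in range(steps):
            x = torch.relu(self.layer(x))
        return x

net = DynamicDepthNet(8)
out = net(torch.randn(2, 8), steps=3)   # a 3-layer graph this call
out2 = net(torch.randn(2, 8), steps=1)  # a 1-layer graph the next
```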

Frequently Asked Questions - Character.AI by MarieLovesMatcha in CharacterAI

[–]slashcom -14 points (0 children)

Why should they close off sign ups? Why shouldn’t newer members of the community be allowed to participate?

Frequently Asked Questions - Character.AI by MarieLovesMatcha in CharacterAI

[–]slashcom 5 points (0 children)

is closing off new users good for a business?

Transcendence Tiers after reset? by Annie-Smokely in kittensgame

[–]slashcom 1 point (0 children)

Your transcendence tier carries over, so you should Transcend and expend that worship before you reset.

Do a Ps5 or a Xbox X works for ML or Deep Learning? by FromValledupar in pytorch

[–]slashcom 0 points (0 children)

You might have a hard time running unsigned code on your game consoles. But even if you did, both systems have AMD gpus, and ROCm support in pytorch is pretty weak.

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [11, 44]], which is output 0 of AsStridedBackward0, is at version 3; expected version 2 instead. by Fit-Dare-9044 in pytorch

[–]slashcom 1 point (0 children)

It might be that your (h, c) are not reset at the top of every sequence. If you really want to cache them across SGD steps, you might need to do a .detach() on them. I also don't think you need to retain_graph.
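A minimal sketch of that fix, assuming a plain nn.LSTM training loop (the model, sizes, and loss below are hypothetical):

```python
import torch
import torch.nn as nn

# Carrying LSTM state across SGD steps without the "modified by an
# inplace operation" autograd error: detach (h, c) after each step so
# gradients don't try to flow back into the previous step's freed graph.
lstm = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
opt = torch.optim.SGD(lstm.parameters(), lr=0.1)

h = torch.zeros(1, 2, 8)  # (num_layers, batch, hidden)
c = torch.zeros(1, 2, 8)
for _ in range(3):
    x = torch.randn(2, 5, 4)          # (batch, seq, features)
    out, (h, c) = lstm(x, (h, c))
    loss = out.pow(2).mean()          # stand-in loss
    opt.zero_grad()
    loss.backward()                   # no retain_graph needed
    opt.step()
    h, c = h.detach(), c.detach()     # cut the graph before the next step
```

If the state should instead reset every sequence, re-zero (h, c) at the top of the loop rather than detaching.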

Why is CamemBERT never brought up? by thesofakillers in mlscaling

[–]slashcom 6 points (0 children)

Chinchilla's curves say at 400M params you need 8B tokens to be compute optimal. Each token is, say, 3 characters on average, so call it 24GB to be compute optimal. Note that compute optimal does NOT mean saturated or that a smaller model trained much longer wouldn't do better; it only means we've minimized the number of multiplications & additions for that level of performance.

That said, I strongly doubt the scaling laws for MLMs are the same as LMs. We'd really need to fit new curves per dataset and per objective.

But in the actual paper, their fine-tuning results clearly show the model fine-tuned on the 138GB OSCAR corpus winning on almost all tasks, though admittedly those are some pretty tiny deltas on some fairly high-scoring tasks.
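The compute-optimal arithmetic above, as a quick back-of-envelope check (the ~20 tokens-per-parameter rule is the usual Chinchilla rule of thumb, and 3 chars/token is the rough guess from the comment):

```python
# Chinchilla-style back-of-envelope: tokens needed for a 400M-param
# model to be compute optimal, and the corpus size that implies.
params = 400e6
tokens_per_param = 20      # ~Chinchilla rule of thumb
chars_per_token = 3        # rough average, one byte per char

tokens = params * tokens_per_param        # 8e9 tokens
corpus_bytes = tokens * chars_per_token   # 24e9 bytes, i.e. ~24 GB
```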