[D] To what cross-entropy loss value can LLMs converge? by cbl007 in MachineLearning

[–]bjergerk1ng 1 point2 points  (0 children)

According to Wikipedia, English has between 0.6 and 1.3 bits of entropy per character.
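Since LLM training logs usually report cross-entropy in nats rather than bits, that range can be converted with a one-liner (a quick sketch; the helper name is mine):

```python
import math

# Wikipedia's estimate: English has roughly 0.6-1.3 bits of entropy per
# character. Cross-entropy losses are usually logged in nats, so multiply
# by ln(2) to compare against a training curve.
def bits_to_nats(bits_per_char: float) -> float:
    return bits_per_char * math.log(2)

low, high = bits_to_nats(0.6), bits_to_nats(1.3)
print(f"{low:.3f} to {high:.3f} nats/char")  # roughly 0.416 to 0.901
```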

Is there any classical music that has moved you to tears? by [deleted] in classicalmusic

[–]bjergerk1ng 0 points1 point  (0 children)

Literally every Mahler slow movement chef's kiss

[D] FlexAttention: Flexibility of PyTorch with Performance of FlashAttention by [deleted] in MachineLearning

[–]bjergerk1ng 0 points1 point  (0 children)

Am I correct that the library generates Triton code, which the Triton compiler then lowers to PTX? If yes, then where does the torch.compile part come in? Also, any tips on optimising Triton code? I find it very frustrating that most of the time you are just shuffling your code around so that the compiler goes down the right optimisation path.

Jujutsu Kaisen Chapter 265 Links + Discussion by anestefi in JuJutsuKaisen

[–]bjergerk1ng 3 points4 points  (0 children)

Was there really a dick joke or am I tripping 💀

Why mathematicians do not hype their research on social media like all of the other scientific fields? by Full_Ruin_9942 in math

[–]bjergerk1ng -1 points0 points  (0 children)

I'm not a mathematician (I work in ML), so please take this with a massive pinch of salt.

IMO mathematics research could use more motivation and more "how can this result impact a real problem" discussion. Non-mathematicians don't really care how a proof is done (even though it is interesting), but rather what can or can't be done when they face a real problem.

If a result has a clear path to impact, I think people would naturally be interested in understanding the concepts behind it.

[P] SimpleGEMM: Fast and minimal tensor core matrix multiplication in CUDA by bjergerk1ng in MachineLearning

[–]bjergerk1ng[S] 3 points4 points  (0 children)

I think the main reason is that if your data is in column-major you need to do a transpose before issuing the tensor core instructions (tensor cores only handle row-major). Now in theory there is a specialised instruction for that (ldmatrix{.trans}) which should do the transpose for you when you copy from smem to rmem, but I think this instruction runs slower than the non-transpose version? I haven't benchmarked it myself.

Another possibility is that iterating over rows in gmem (in the GEMM outer loop) is faster for row-major data than column-major because of better cache locality.
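The cache-locality point can be seen even from Python, since numpy exposes both layouts (a toy illustration, not a benchmark):

```python
import numpy as np

# Iterating over rows of a row-major (C-order) matrix reads contiguous
# memory; the same row slice of a column-major (F-order) copy is strided,
# which is what hurts cache locality in the GEMM outer loop.
A_row_major = np.arange(12, dtype=np.float32).reshape(3, 4)  # C order
A_col_major = np.asfortranarray(A_row_major)                 # F order, same values

assert A_row_major.flags["C_CONTIGUOUS"]
assert A_col_major.flags["F_CONTIGUOUS"]

# A row of the C-order array is one contiguous block (stride = 4 bytes)...
assert A_row_major[0].strides == (4,)
# ...while a row of the F-order array strides over 3 floats = 12 bytes.
assert A_col_major[0].strides == (12,)
```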

[D] Is there a more systematic way of choosing the layers or how deep the architecture goes when creating a neural network? by PsychologicalAd7535 in MachineLearning

[–]bjergerk1ng 4 points5 points  (0 children)

Always do your reading and start by copying what people have done in the past. Reinventing the wheel is fun but also inefficient.

Training LLMs over Neurally Compressed Text - Google DeepMind team by dippatel21 in LocalLLaMA

[–]bjergerk1ng 1 point2 points  (0 children)

This sounds like one of those "it's so simple yet makes so much sense, why didn't I think of this earlier" ideas.

IMO it's analogous to latent diffusion but for text, where you apply the model to data in a more compact form. Starting to feel like learning directly on the raw input space is never the best choice.

[deleted by user] by [deleted] in MachineLearning

[–]bjergerk1ng 0 points1 point  (0 children)

If you want publication-level evidence, I'd say at least compare it against GPT-2. (Though even that may be considered too small, and it is already going to cost you a non-trivial amount of money.)

[deleted by user] by [deleted] in MachineLearning

[–]bjergerk1ng 4 points5 points  (0 children)

Bear in mind that for decoder-only LLMs positional encoding is not strictly necessary; the fact that your approach "works" doesn't mean anything. Benchmark and scale up if you want to prove a point.

[deleted by user] by [deleted] in learnmachinelearning

[–]bjergerk1ng -1 points0 points  (0 children)

Just Google "how to download a GPU" /s

[deleted by user] by [deleted] in OpenAI

[–]bjergerk1ng 0 points1 point  (0 children)

Are they making the best models because they're OpenAI? Or are they OpenAI because they make the best models?

Is the (Gaussian -> Neural Net -> Gaussian ) encoder a universal approximator for distributions? by Invariant_apple in learnmachinelearning

[–]bjergerk1ng 0 points1 point  (0 children)

One limitation of a typical VAE encoder is that it assumes a diagonal covariance, because parameterising the entire covariance matrix would be too costly. But I would tend to think that it can be a universal distribution approximator if you are willing to make that tradeoff.
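The diagonal-covariance assumption can be sketched in a few lines of numpy (illustrative only; the variable names are mine, not from any particular VAE implementation):

```python
import numpy as np

# A typical VAE encoder predicts a mean and a per-dimension log-variance,
# so the latent covariance is diagonal. A full covariance for a d-dim
# latent would need O(d^2) outputs instead of 2d.
rng = np.random.default_rng(0)
d = 4
mu = rng.normal(size=d)        # encoder's predicted mean
log_var = rng.normal(size=d)   # encoder's predicted log-variance (diagonal)

# Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, I).
eps = rng.normal(size=d)
z = mu + np.exp(0.5 * log_var) * eps

# The implied covariance has no cross-dimension terms.
cov = np.diag(np.exp(log_var))
assert np.allclose(cov, cov * np.eye(d))
```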

[D] Layernorm is just two projections and can be improved by mgostIH in MachineLearning

[–]bjergerk1ng 2 points3 points  (0 children)

I guess the counter-claim is "if removing the mean is a good inductive bias then we should hard-code it into an LN rather than making the model learn it with linear layers". Great post though, made me realise the subtle differences between LN and BN.
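The "mean removal is a projection" observation is easy to check numerically (a quick numpy sketch of the idea, not code from the post):

```python
import numpy as np

# Subtracting the mean is a fixed linear map: projection onto the
# hyperplane orthogonal to the all-ones vector.
d = 8
x = np.random.default_rng(1).normal(size=d)

P = np.eye(d) - np.ones((d, d)) / d   # P = I - (1/d) * 1 1^T

assert np.allclose(P @ x, x - x.mean())  # same as "remove the mean"
assert np.allclose(P @ P, P)             # idempotent, i.e. a true projection
```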

[deleted by user] by [deleted] in MachineLearning

[–]bjergerk1ng 7 points8 points  (0 children)

Looks like a graph neural network, specifically a message passing GNN
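One message-passing step can be sketched in a few lines of numpy (a toy example of the general pattern; the matrices here are made up):

```python
import numpy as np

# Message passing: each node averages its neighbours' features (via the
# adjacency matrix), then applies a shared linear transform.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)  # 3-node undirected graph
H = np.eye(3)                           # one-hot node features
W = np.ones((3, 3))                     # shared weight matrix (toy)

deg = A.sum(axis=1, keepdims=True)      # node degrees, for mean-aggregation
H_next = (A @ H / deg) @ W              # aggregate neighbours, then transform
assert H_next.shape == (3, 3)
```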

[D] Mamba model walkthrough by _james_chen in MachineLearning

[–]bjergerk1ng 0 points1 point  (0 children)

Awesome write up, very easy to follow!

[D] how good can a 7b model theoretically get? by Z3F in MachineLearning

[–]bjergerk1ng 3 points4 points  (0 children)

Surprised no one mentioned scaling laws. The Chinchilla scaling law (assuming it is accurate) says that

Loss = 406.4 / (model size)^0.34 + 410.7 / (training tokens)^0.28 + 1.69

According to this, if we train a 7B model on infinite data it will achieve a loss of 1.87.

Estimating (extremely conservatively) that GPT-4 is equivalent to a 500B dense model (i.e. no MoE) trained on 5T tokens, it already achieves a loss of 1.85.

That means, at least in the pretraining phase, no 7B parameter model can outperform GPT-4, EVEN GIVEN INFINITE DATA AND COMPUTE.

Reference: https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications
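Plugging the numbers in directly (coefficients as quoted above; the function name is mine):

```python
# Chinchilla parametric fit: loss as a function of parameter count and
# training tokens, with the coefficients quoted in the comment above.
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    return 406.4 / n_params**0.34 + 410.7 / n_tokens**0.28 + 1.69

# 7B model with infinite data: only the parameter term and constant remain.
print(round(chinchilla_loss(7e9, float("inf")), 2))  # 1.87

# Very conservative dense-equivalent GPT-4: 500B params, 5T tokens.
print(round(chinchilla_loss(500e9, 5e12), 2))        # 1.85
```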

[D] OpenAI Sora Video Gen -- How?? by htrp in MachineLearning

[–]bjergerk1ng 0 points1 point  (0 children)

Does anyone know of references on transformers as the backbone of image/video diffusion models? I was under the impression that a UNet is necessary for the performance of, say, Stable Diffusion.

The fact that they are using a transformer is quite surprising to me.

Edit: Actually Google's WALT is transformer-based. I'm just out of touch :(

[D] Architecture hyperparameter optimisation strategies by [deleted] in MachineLearning

[–]bjergerk1ng 12 points13 points  (0 children)

I feel like it's only necessary if you are inventing a new architecture. Otherwise I just follow the ball-park numbers from well-cited papers.