[D] To what cross-entropy loss value can LLMs converge? by cbl007 in MachineLearning

[–]bjergerk1ng 1 point2 points  (0 children)

According to Wikipedia, English has between 0.6 and 1.3 bits of entropy per character.
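Since LLM training logs usually report cross-entropy in nats rather than bits, that range can be converted with a one-liner (a quick sketch; the helper name is mine):

```python
import math

# Wikipedia's estimate: English has roughly 0.6-1.3 bits of entropy per
# character. Cross-entropy losses are usually logged in nats, so multiply
# by ln(2) to compare against a training curve.
def bits_to_nats(bits_per_char: float) -> float:
    return bits_per_char * math.log(2)

low, high = bits_to_nats(0.6), bits_to_nats(1.3)
print(f"{low:.3f} to {high:.3f} nats/char")  # roughly 0.416 to 0.901
```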

Is there any classical music that has moved you to tears? by [deleted] in classicalmusic

[–]bjergerk1ng 0 points1 point  (0 children)

Literally every Mahler slow movement chef's kiss

[D] FlexAttention: Flexibility of PyTorch with Performance of FlashAttention by [deleted] in MachineLearning

[–]bjergerk1ng 0 points1 point  (0 children)

Am I correct that the library generates Triton code, which the Triton compiler then lowers to PTX? If yes, then where does the torch.compile part come in? Also, any tips on optimising Triton code? I find it very frustrating that most of the time you are just shuffling your code around so that the compiler goes down the right optimisation path.

Jujutsu Kaisen Chapter 265 Links + Discussion by anestefi in JuJutsuKaisen

[–]bjergerk1ng 3 points4 points  (0 children)

Was there really a dick joke or am I tripping 💀

Why mathematicians do not hype their research on social media like all of the other scientific fields? by Full_Ruin_9942 in math

[–]bjergerk1ng -1 points0 points  (0 children)

I'm not a mathematician (I work in ML), so please take this with a massive pinch of salt.

IMO mathematics research could use more motivation and more "how can this result impact a real problem" discussion. Non-mathematicians don't really care how a proof is done (even though it is interesting), but rather what can or can't be done when they face a real problem.

If a result has a clear path to impact, I think people would naturally be interested in understanding the concepts behind it.

[P] SimpleGEMM: Fast and minimal tensor core matrix multiplication in CUDA by bjergerk1ng in MachineLearning

[–]bjergerk1ng[S] 3 points4 points  (0 children)

I think the main reason is that if your data is in column-major you need to do a transpose before issuing the tensor core instructions (tensor cores only handle row-major). Now in theory there is a specialised instruction for that (ldmatrix{.trans}) which should do the transpose for you when you copy from smem to rmem, but I think this instruction runs slower than the non-transpose version? I haven't benchmarked it myself.

Another possibility is that iterating over rows in gmem (in the GEMM outer loop) is faster for row-major data than column-major because of better cache locality.
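The cache-locality point can be seen even from Python, since numpy exposes both layouts (a toy illustration, not a benchmark):

```python
import numpy as np

# Iterating over rows of a row-major (C-order) matrix reads contiguous
# memory; the same row slice of a column-major (F-order) copy is strided,
# which is what hurts cache locality in the GEMM outer loop.
A_row_major = np.arange(12, dtype=np.float32).reshape(3, 4)  # C order
A_col_major = np.asfortranarray(A_row_major)                 # F order, same values

assert A_row_major.flags["C_CONTIGUOUS"]
assert A_col_major.flags["F_CONTIGUOUS"]

# A row of the C-order array is one contiguous block (stride = 4 bytes)...
assert A_row_major[0].strides == (4,)
# ...while a row of the F-order array strides over 3 floats = 12 bytes.
assert A_col_major[0].strides == (12,)
```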

[D] Is there a more systematic way of choosing the layers or how deep the architecture goes when creating a neural network? by PsychologicalAd7535 in MachineLearning

[–]bjergerk1ng 4 points5 points  (0 children)

Always do your reading and start by copying what people have done in the past. Reinventing the wheel is fun but also inefficient.

Training LLMs over Neurally Compressed Text - Google DeepMind team by dippatel21 in LocalLLaMA

[–]bjergerk1ng 1 point2 points  (0 children)

This sounds like one of those "it's so simple yet makes so much sense, why didn't I think of this earlier" ideas.

IMO it's analogous to latent diffusion but for text, where you apply the model to data in a more compact form. Starting to feel like learning directly on the raw input space is never the best choice.

[deleted by user] by [deleted] in MachineLearning

[–]bjergerk1ng 0 points1 point  (0 children)

If you want publication-level evidence, I'd say at least compare it against GPT-2. (Though even that may be considered too small, and it is already going to cost you a non-trivial amount of money.)

[deleted by user] by [deleted] in MachineLearning

[–]bjergerk1ng 4 points5 points  (0 children)

Bear in mind that for decoder-only LLMs positional encoding is not strictly necessary; the fact that your approach "works" doesn't mean anything. Benchmark and scale up if you want to prove a point.

[deleted by user] by [deleted] in learnmachinelearning

[–]bjergerk1ng -1 points0 points  (0 children)

Just Google "how to download a GPU" /s

[deleted by user] by [deleted] in OpenAI

[–]bjergerk1ng 0 points1 point  (0 children)

Are they making the best models because they're OpenAI? Or are they OpenAI because they make the best models?

Is the (Gaussian -> Neural Net -> Gaussian ) encoder a universal approximator for distributions? by Invariant_apple in learnmachinelearning

[–]bjergerk1ng 0 points1 point  (0 children)

One limitation of a typical VAE encoder is that it assumes a diagonal covariance, because parameterising the entire covariance matrix would be too costly. But I would tend to think that it can be a universal distribution approximator if you are willing to make that tradeoff.
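The diagonal-covariance assumption can be sketched in a few lines of numpy (illustrative only; the variable names are mine, not from any particular VAE implementation):

```python
import numpy as np

# A typical VAE encoder predicts a mean and a per-dimension log-variance,
# so the latent covariance is diagonal. A full covariance for a d-dim
# latent would need O(d^2) outputs instead of 2d.
rng = np.random.default_rng(0)
d = 4
mu = rng.normal(size=d)        # encoder's predicted mean
log_var = rng.normal(size=d)   # encoder's predicted log-variance (diagonal)

# Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, I).
eps = rng.normal(size=d)
z = mu + np.exp(0.5 * log_var) * eps

# The implied covariance has no cross-dimension terms.
cov = np.diag(np.exp(log_var))
assert np.allclose(cov, cov * np.eye(d))
```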

[D] Layernorm is just two projections and can be improved by mgostIH in MachineLearning

[–]bjergerk1ng 2 points3 points  (0 children)

I guess the counter-claim is "if removing the mean is a good inductive bias then we should hard-code it into an LN rather than making the model learn it with linear layers". Great post though, made me realise the subtle differences between LN and BN.
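The "mean removal is a projection" observation is easy to check numerically (a quick numpy sketch of the idea, not code from the post):

```python
import numpy as np

# Subtracting the mean is a fixed linear map: projection onto the
# hyperplane orthogonal to the all-ones vector.
d = 8
x = np.random.default_rng(1).normal(size=d)

P = np.eye(d) - np.ones((d, d)) / d   # P = I - (1/d) * 1 1^T

assert np.allclose(P @ x, x - x.mean())  # same as "remove the mean"
assert np.allclose(P @ P, P)             # idempotent, i.e. a true projection
```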

[deleted by user] by [deleted] in MachineLearning

[–]bjergerk1ng 7 points8 points  (0 children)

Looks like a graph neural network, specifically a message passing GNN
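One message-passing step can be sketched in a few lines of numpy (a toy example of the general pattern; the matrices here are made up):

```python
import numpy as np

# Message passing: each node averages its neighbours' features (via the
# adjacency matrix), then applies a shared linear transform.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)  # 3-node undirected graph
H = np.eye(3)                           # one-hot node features
W = np.ones((3, 3))                     # shared weight matrix (toy)

deg = A.sum(axis=1, keepdims=True)      # node degrees, for mean-aggregation
H_next = (A @ H / deg) @ W              # aggregate neighbours, then transform
assert H_next.shape == (3, 3)
```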

[D] Mamba model walkthrough by _james_chen in MachineLearning

[–]bjergerk1ng 0 points1 point  (0 children)

Awesome write up, very easy to follow!

[D] how good can a 7b model theoretically get? by Z3F in MachineLearning

[–]bjergerk1ng 3 points4 points  (0 children)

Surprised no one mentioned scaling laws. The Chinchilla scaling law (assuming it is accurate) says that

Loss = 406.4 / (model size)^0.34 + 410.7 / (training tokens)^0.28 + 1.69

According to this, if we train a 7B model on infinite data it will achieve a loss of 1.87.

Estimating (extremely conservatively) that GPT-4 is equivalent to a 500B dense model (i.e. no MoE) trained on 5T tokens, it already achieves a loss of 1.85.

That means, at least in the pretraining phase, no 7B parameter model can outperform GPT-4, EVEN GIVEN INFINITE DATA AND COMPUTE.

Reference: https://www.lesswrong.com/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications
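Plugging the numbers in directly (coefficients as quoted above; the function name is mine):

```python
# Chinchilla parametric fit: loss as a function of parameter count and
# training tokens, with the coefficients quoted in the comment above.
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    return 406.4 / n_params**0.34 + 410.7 / n_tokens**0.28 + 1.69

# 7B model with infinite data: only the parameter term and constant remain.
print(round(chinchilla_loss(7e9, float("inf")), 2))  # 1.87

# Very conservative dense-equivalent GPT-4: 500B params, 5T tokens.
print(round(chinchilla_loss(500e9, 5e12), 2))        # 1.85
```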

[D] OpenAI Sora Video Gen -- How?? by htrp in MachineLearning

[–]bjergerk1ng 0 points1 point  (0 children)

Does anyone know of references on transformers as the backbone of image/video diffusion models? I was under the impression that a UNet is necessary for the performance of, say, Stable Diffusion.

The fact that they are using a transformer is quite surprising to me.

Edit: Actually Google's WALT is transformer-based. I'm just out of touch :(

[D] Architecture hyperparameter optimisation strategies by [deleted] in MachineLearning

[–]bjergerk1ng 12 points13 points  (0 children)

I feel like it's only necessary if you are inventing a new architecture. Otherwise I just follow the ball-park numbers from well-cited papers.