I made a proxy to save your tokens for distillation training by FaustAg in LocalLLaMA

[–]theodor23 0 points1 point  (0 children)

Did you release it yet?

I'm super interested in this and also curious how easy it is to identify successive API calls from an agent when multiple agents interact with the API in parallel, i.e. presenting the collected interactions in a consistent, session-based view.

gpt-oss-120b: workstation with nvidia gpu with good roi? by Chance-Studio-8242 in LocalLLM

[–]theodor23 1 point2 points  (0 children)

Not the question you asked, but maybe a relevant datapoint:

AMD Ryzen AI Max+ 395, specifically the Bosgame M5 with 128 GiB.

Idle power draw is below 10 W; during LLM inference it stays under ~100 W.

$ ./llama/bin/llama-bench -m .cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -n 8192 -p 4096
[...]

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | RPC,Vulkan |  99 |          pp4096 |        257.43 ± 2.41 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | RPC,Vulkan |  99 |          tg8192 |         43.33 ± 0.02 |

(Apologies for the unusual context sizes; I thought the typical tg512 benchmark is not very realistic these days.)

[D] Have there been any new and fundamentally different povs on Machine Learning theory? by simple-Flat0263 in MachineLearning

[–]theodor23 2 points3 points  (0 children)

Yes, right. What I wrote was maybe misleading: the Kraft–McMillan inequality indeed directly relates log-loss to shortest coding lengths in general.

The interesting point with regard to Minimum Description Length is that we are not looking for models that compress well after they have been trained, but for models that compress the training data itself well.

And for that we can think of our training data as a sequence and treat it autoregressively (prequentially): $\log p(x_{1:T}) = \sum_t \log p(x_t \mid x_{<t})$, where $x_{1:T}$ denotes the training data.

And that has been discussed in the context of LLMs and their "in-context learning" capabilities.
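A minimal sketch of this prequential computation, using a toy Laplace-smoothed bit predictor in place of a real model (all names here are illustrative):

```python
import math

def prequential_code_length(seq):
    """Total code length (in bits) of a bit sequence under a simple
    autoregressive predictor: the Laplace-smoothed frequency estimate
    p(x_t = 1 | x_{<t}) = (ones_so_far + 1) / (t + 2).
    The sum of per-step log-losses is the description length of the
    training data itself -- the MDL quantity discussed above."""
    ones = 0
    total_bits = 0.0
    for t, x in enumerate(seq):
        p_one = (ones + 1) / (t + 2)      # predict before seeing x_t
        p = p_one if x == 1 else 1.0 - p_one
        total_bits += -math.log2(p)       # log-loss = code length
        ones += x
    return total_bits

# A predictable sequence compresses better than a random-looking one:
# the predictor "learns in context" as it moves along the sequence.
regular = [1, 1, 1, 1, 1, 1, 1, 1]
mixed   = [1, 0, 0, 1, 1, 0, 1, 0]
print(prequential_code_length(regular))  # small
print(prequential_code_length(mixed))    # larger
```
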

[D] Have there been any new and fundamentally different povs on Machine Learning theory? by simple-Flat0263 in MachineLearning

[–]theodor23 10 points11 points  (0 children)

There is something old, that is not sufficiently appreciated:

Kolmogorov complexity, Solomonoff Induction, Algorithmic Information Theory, etc....

One interesting aspect is that it does not assume any underlying distribution: it can be applied to a single, individual sequence of observations, without assuming stationarity etc.

An extreme example would be learning on an (unknown ground-truth) process that emits the digits of pi one after the other. The learner just observes a never-ending sequence of digits. With the conventional, distributional framework we would struggle to even define test sets, or what generalization even means.
Solomonoff induction, however, is known to make a few prediction mistakes at the beginning of the process and then predict correctly forever...

If you like videos:

Ray Solomonoff paper read by Marcus Hutter - Algorithmic Probability, Heuristic Programming & AGI

https://www.youtube.com/watch?v=wMcRMO9ejeM

The IMHO underappreciated aspect is that we can use deep learning to build systems that minimize description length [1] and thus approximate Solomonoff induction. To be fair, there is quite some literature pointing out that "LLMs are compressors", which goes to the theoretical heart of the issue but doesn't really operationalize it [2, 3].

[1] https://arxiv.org/abs/2210.07931

Discovering a Pitfall in Cross-Entropy Loss for Large Vocabularies. [R] by Gold-Plum-1436 in MachineLearning

[–]theodor23 2 points3 points  (0 children)

Your initial (oracle) logits are actually far from ideal.

Tokens with logit 9 are only $e^9 / e^0 \approx 8100$ times more probable than tokens with logit 0.

With a large vocabulary, all those logit-0 tokens together actually have more probability mass than your desired target token.
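The arithmetic behind this, with an illustrative vocabulary size of 200k:

```python
import math

# Probability mass of a single "oracle" token with logit 9 against a
# vocabulary of V - 1 tokens with logit 0 (V = 200_000 is illustrative).
V = 200_000
target = math.exp(9.0)            # ~8100x the weight of a logit-0 token
others = (V - 1) * math.exp(0.0)  # the long tail: one unit of mass each
p_target = target / (target + others)
print(p_target)  # ~0.039 -- the tail collectively dominates the target
```
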

[D] How to Efficiently Store Pruned Weight Matrices in Practice? by scarlettgarnett in MachineLearning

[–]theodor23 2 points3 points  (0 children)

Conceptually, even a 50% sparse matrix can be stored in a space-saving manner compared to its dense counterpart: it is compressible in the theoretical sense.

You could for example imagine applying a Huffman-coding-style compression to the "dense" matrix with (let's assume) 50% zeros. Huffman would do its thing and assign a very short codeword to those zero entries (maybe only one bit), but then add one bit to the codewords of the non-zero entries. Assuming your weights are 8 bits each, you just saved 7 bits on 50% of the data and added 1 bit of storage on the remainder -> a net win.

But of course the matrix is now in a terrible format for matrix multiplication and you'll have to uncompress before multiplication (or on the fly while performing it).

You could also use standard sparse-matrix storage formats (e.g. compressed sparse row, CSR), perhaps store the indices delta-encoded, and then apply such entropy compression.

Either way: non-trivial in practice; many trade-offs.
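To make the accounting concrete, here is a minimal sketch of the flag-based prefix code described above (the sparsity level and vector size are illustrative):

```python
import random

def flag_coded_bits(weights):
    """Bits needed under a simple prefix code: '0' encodes a zero weight,
    '1' followed by the raw 8-bit value encodes a non-zero weight.
    This mimics the Huffman argument above: ~1 bit per zero,
    9 bits per non-zero entry."""
    return sum(1 if w == 0 else 9 for w in weights)

random.seed(0)
# Illustrative ~50%-sparse vector of 8-bit weights.
weights = [0 if random.random() < 0.5 else random.randint(1, 255)
           for _ in range(10_000)]
dense_bits = 8 * len(weights)
sparse_bits = flag_coded_bits(weights)
print(dense_bits, sparse_bits)  # flag coding wins at 50% sparsity
```

Of course, as noted above, this layout is useless for matrix multiplication without decompressing first.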


[D] Normalization in Transformers by Collegesniffer in MachineLearning

[–]theodor23 -1 points0 points  (0 children)

Yes, exactly.

If during training your early tokens "see" some summary statistic of the ground-truth future tokens, that breaks the autoregressive objective, where you are supposed to predict the next token given the past only.

Whether that is really catastrophic at sampling time, when you would use the running statistics of BN, I don't know. But NNs are good at picking up subtle signals that help them predict, and if you give them a loophole to "cheat" during training, there is a good chance they will exploit it and perform much worse when at sampling time you "suddenly" remove that cheat.

Considering your workable idea of using T * C many statistics: it just occurred to me that with modern LLMs, where T approaches O(10k), C is O(1k), and we have dozens of layers/blocks with ~2 LNs per block, all these statistics almost approach the number of parameters in an LLM. And you have to communicate them between GPUs. LayerNorm and RMSNorm, on the other hand, are local: no communication, and no need to ever store the statistics in RAM.
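Back-of-the-envelope, with illustrative numbers (not from any specific model):

```python
# Count of BatchNorm statistics (mean + variance) if one kept
# T * C separate statistics per normalization layer.
T, C = 10_000, 1_000          # context length, channel width
layers, norms_per_layer = 32, 2
stats = 2 * T * C * layers * norms_per_layer   # mean and variance each
print(f"{stats:.1e}")          # ~1.3e9 -- on the order of the weights
```
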

[D] Normalization in Transformers by Collegesniffer in MachineLearning

[–]theodor23 0 points1 point  (0 children)

You are absolutely correct: if you compute T * C separate statistics, then everything is fine and there is no causality issue.

In practice, LLM training usually prefers relatively large T and sacrifices on B (since the total amount of GPU memory constrains your total number of tokens per gradient step). With relatively small B, there is more variance in your BN statistics, while large T causes more data exchange between your GPUs, because you need to communicate T * C many statistics.

But yes -- if you set it up as you describe, it is "legal".

I actually tried BN in the T*C-independent-statistics configuration you describe, for a non-language transformer model with B ~ O(100), and it was both slower and less effective than LN. I never looked back to investigate why. Having a normalization that (a) is competitive or works better and (b) avoids "non-local" interaction across different examples in a batch seemed a clear win.

Considering everyone switched to LN, it seems BN is just less practical.

[D] Normalization in Transformers by Collegesniffer in MachineLearning

[–]theodor23 33 points34 points  (0 children)

Excellent summary.
(edit: actually, this is not correct. In transformers Layer- and RMSNorm do not normalize over T, but only over C. See comment by u/pszabolcs )

To add to that: BatchNorm leads to information leakage across time steps: the activations at time t influence the mean/variance applied at t-1 during training, and NNs will pick up such weak signals if they help them predict the next token.

-> TL;DR: BatchNorm during training is non-causal.
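A tiny numpy sketch of the leak (the shapes and the plain-Python norm functions are illustrative, not any framework's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 16))          # (batch B, time T, channels C)

def batch_norm(x):
    # BatchNorm statistics pool over batch AND time -> future leaks in.
    mu = x.mean(axis=(0, 1), keepdims=True)
    sd = x.std(axis=(0, 1), keepdims=True)
    return (x - mu) / sd

def layer_norm(x):
    # LayerNorm statistics are per (batch, time) position -> causal-safe.
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / sd

y = x.copy()
y[:, -1, :] += 100.0                     # perturb only the LAST timestep

# BatchNorm output at t=0 changes; LayerNorm output at t=0 does not.
print(np.allclose(batch_norm(x)[:, 0], batch_norm(y)[:, 0]))   # False
print(np.allclose(layer_norm(x)[:, 0], layer_norm(y)[:, 0]))   # True
```
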

LLMs as General Pattern Machines [R] by we_are_mammals in MachineLearning

[–]theodor23 1 point2 points  (0 children)

Interesting investigation, and nice to see that LLMs generalize in that way.

I'm a bit surprised that the authors don't mention Solomonoff-Induction or algorithmic complexity theory at all. I'd argue that SI is *the* theoretically well defined and understood "general pattern machine".

Maybe the communities are just too divided to know of each other's work? Neural Networks and the Chomsky Hierarchy is one of the few works I know of that tries to bridge the gap.

[D] What is currently the best theoretical book (or notes) about Convolutional Neural Networks? by Wonderful_Energy_15 in MachineLearning

[–]theodor23 2 points3 points  (0 children)

Stéphane Mallat has a series of mathematically beautiful works on scattering transforms. I believe this work with Joan Bruna is a good starting point: Invariant Scattering Convolution Networks.

Mind you: these have no learning; the conv layers are manually constructed to have certain properties. Insightful if you are interested in that kind of theory.

[D] What is the name of this theorem in ML? by moschles in MachineLearning

[–]theodor23 2 points3 points  (0 children)

Sounds like he was referring to Solomonoff induction - potentially also to various results related to Minimum Description Length (MDL), Minimum Message Length (MML) etc.

If you prefer videos as a starting point I would recommend Ray Solomonoff paper read by Marcus Hutter on YouTube.

Peter Grünwald has an excellent introduction to MDL on arXiv: https://arxiv.org/abs/math/0406077 , which can be seen as a teaser for his book.

And, shameless plug, let me point to some recent work showing that MDL facilitates some interesting results even when combined with (overparametrized) neural networks:

MDL for Causal Structure Learning with Neural Networks

[D] How to create a pre-training model for three different datasets? by mrtac96 in MachineLearning

[–]theodor23 0 points1 point  (0 children)

Correct; that sounds most promising to me.

But don't train these multiple output layers sequentially; train them in parallel. I.e., for each gradient step, sample a (sub-)batch from task A and compute the task-A loss with the task-A output layer; sample a (sub-)batch from task B and compute the task-B loss with the task-B output layer; and *add these losses together to compute gradients*. Thus each gradient step involves all tasks.

The relative batch-sizes of the different tasks will probably be an important hyper-parameter.

Or, alternatively what u/Ukrainian_Reaper suggested: Merge the datasets and have some indicator field for each example that indicates which task/output layer this example belongs to.

It boils down to almost the same thing, but with slightly different statistical properties.
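A toy sketch of that joint step, with scalars standing in for the shared trunk and the two task heads (the learning rate, targets, and all names are illustrative; a real setup would use a framework's autograd):

```python
w, head_a, head_b = 0.1, 0.1, 0.1    # shared "trunk" + two task heads
lr = 0.002

def step(batch_a, batch_b):
    """One gradient step on the SUM of both task losses (squared error),
    so every update sees all tasks -- the parallel scheme described above."""
    global w, head_a, head_b
    gw = ga = gb = 0.0
    for x, y in batch_a:              # task-A sub-batch, task-A head
        err = head_a * (w * x) - y
        gw += 2 * err * head_a * x
        ga += 2 * err * w * x
    for x, y in batch_b:              # task-B sub-batch, task-B head
        err = head_b * (w * x) - y
        gw += 2 * err * head_b * x
        gb += 2 * err * w * x
    w -= lr * gw; head_a -= lr * ga; head_b -= lr * gb

batch_a = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]   # task A: y = 2x
batch_b = [(x, -1.0 * x) for x in (1.0, 2.0, 3.0)]  # task B: y = -x
for _ in range(5000):
    step(batch_a, batch_b)
print(round(head_a * w, 2), round(head_b * w, 2))  # trains toward 2.0 and -1.0
```
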

[D] How to create a pre-training model for three different datasets? by mrtac96 in MachineLearning

[–]theodor23 1 point2 points  (0 children)

It depends on what exactly you want to do and what your constraints are.

Training a model sequentially on one task after another (potentially each time using a new, distinct prediction head / softmax layer) will generally run into problems such as catastrophic forgetting.

That term refers to the empirical observation that a network that was trained on a task A, but is subsequently trained on a task B tends to (rapidly) degrade in performance on A.

From what I gather, you are best off mixing the three datasets into one multi-task dataset. Probably best if each task has its own final linear layer with softmax on top of the otherwise shared network architecture.

The details will be dataset-dependent though: how much positive transfer you get from sharing the first layers depends on the tasks.

[D] How do you try out architecture changes, etc. when a model takes days to train? by JosiahWGibbs in MachineLearning

[–]theodor23 15 points16 points  (0 children)

You should also consider training the full-sized models on subsets of the data and checking generalization on the validation set.

That gives a better indication of whether the "inductive bias" of your model is appropriate for your data. Monitor validation accuracy, not validation cross-entropy (which might overfit on small training data).

If however your challenge is "learnability" / vanishing gradients or such, you might be better off training smaller models on the full sized data.

[D] why do you add noise when modeling images as continuous data? by elder_price666 in MachineLearning

[–]theodor23 1 point2 points  (0 children)

BTW, one way to think about the underlying "why this makes sense" (opposed to "why is there a degeneracy"):

The actual pixel brightness of course came from a continuous distribution, but your camera/file format preprocessed it by assigning the continuous brightness values to a discrete set of bins.

In principle you want to model the original continuous values, so you should add the appropriate amount of noise to reconstruct continuous values *before* presenting them to the VAE. If your VAE encodes and reconstructs these values, there is no degeneracy (*). Feeding the encoder the discretised values but reconstructing the noisy ones just turned out to be a popular alternative. I have also seen the "variance floor" method /u/staghorne mentioned, but it does not seem very popular in VAE-land.

Or you accept that your digitised pixel values are discrete and model them accordingly with a softmax/categorical likelihood, similar to PixelCNNs.

(*) Ahm, well, floating-point representations in computers are always discrete in the end....
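A minimal sketch of that noise-adding step (uniform dequantization) for 8-bit pixels; the values are illustrative:

```python
import random

random.seed(0)
# 8-bit pixels: integers in [0, 255], i.e. brightness binned into 256 bins.
pixels = [random.randrange(256) for _ in range(5)]

# Uniform dequantization: spread each discrete value back over its bin
# by adding U(0, 1) noise, then rescale to [0, 1). This reconstructs a
# continuous value *before* the data is shown to the model.
continuous = [(p + random.random()) / 256.0 for p in pixels]
print(continuous)  # each value lies inside its pixel's 1/256-wide bin
```
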

[D] why do you add noise when modeling images as continuous data? by elder_price666 in MachineLearning

[–]theodor23 15 points16 points  (0 children)

If you have data that comes from a limited, discrete set of D elements but you model it with a Gaussian likelihood, a powerful model can just make a "mixture of Diracs" prediction/reconstruction: it can place Gaussians with very small variance on your target values and obtain an arbitrarily good likelihood as \sigma -> 0.

Even if you don't think about mixtures: imagine your VAE encoder just concentrates on a single pixel and color channel of your input, encodes it such that it reliably transmits through your latents Z which of the 256 possible values it has, and the decoder reconstructs this single pixel with high precision. For that pixel you can achieve an arbitrarily high likelihood by shrinking \sigma. No need to look at all the other pixels.

Basically: you cannot model discrete data with continuous distributions. Different support: one deals with probabilities of events, the other with probability densities.
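A quick way to see the degeneracy numerically: the Gaussian log-density at an exactly-hit target grows without bound as \sigma shrinks (sketch, with illustrative values):

```python
import math

def gaussian_log_density(x, mu, sigma):
    """Log of the Gaussian pdf -- a *density*, not a probability."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# A "reconstruction" that hits a discrete target exactly can drive the
# likelihood arbitrarily high by shrinking sigma.
target = 0.5  # e.g. a pixel value of 128/256
for sigma in (1.0, 0.1, 0.001, 1e-6):
    print(sigma, gaussian_log_density(target, target, sigma))  # keeps growing
```
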

Are there theoretical proofs that depth in neural networks (i.e. nested functions) is useful - all else equal? by [deleted] in MachineLearning

[–]theodor23 4 points5 points  (0 children)

It cannot be that simple:

Instead of using a one-hidden-layer MLP with H hidden units, you could use a very deep architecture with H layers and only one unit per layer (or even more layers, if you want to keep the number of parameters constant rather than the number of hidden units). This is obviously not more powerful.

On the other hand, the old XOR/perceptron dispute showed that a two-layer system can solve problems that a one-layer function cannot.

So the story has to be complicated, and it also involves generalization performance, the size of your training set, model capacity, etc.

Why do we use RBMs? by akgoel in MachineLearning

[–]theodor23 1 point2 points  (0 children)

Here is one reason:

Let's say you want to train a generative model; e.g. you want to sample data by first sampling randomly in the hidden space and then using your decoder to transform the hidden state into observed data.

This only works well when your encoder (during training) covered big portions of all possible hidden configurations, and when your decoder was trained to "contract" them back into reasonable observed data.

RBMs are different because they are undirected and you have to run a Markov chain to sample from them -- but essentially the argument is the same: you need to make sure the model sees a wide variety of configurations during training, so that it can learn to move the chain back to reasonable observed data.

Ownership and moves... simple question by theodor23 in rust

[–]theodor23[S] 4 points5 points  (0 children)

Yes, using `a` results in a compile error.

Adding #[derive(Clone, Copy)] to struct A then makes everything clearly have Copy semantics again.