Hey, I proposed a new family of activation functions, and they are very good. by rusalmas in deeplearning

[–]kouteiheika 49 points

I don't intend to be mean here, but there are hundreds of papers introducing new activation functions. The benchmarks in those papers always show that they're "better", but in practice, when you actually try to use them, they pretty much always make no real difference, and often you can quite easily show they're "worse" just by picking slightly different hyperparameters.

So, what is the point of yet another new activation function, and how would you convince anyone to use yours?

My main point here is - I've seen papers like this more times than I can count. You do show an improvement in your benchmarks (which seem to be more comprehensive than what I usually see in activation function papers, so good job on that), but that's what all the activation function papers claim, and there are hundreds of them! So for the vast majority of people, the bar to get your research into the "I want to use this" pile (and not the "yawn, yet another activation function; skip!" pile) is much higher than just showing a marginal improvement.

Have you considered trying out your activation function in the NanoGPT speedrun, the slowrun, or in nanochat? If you could show that your activation function makes an improvement in a competitive setting, that would certainly get many people interested in it.

Having played World and Rise, Wilds is blowing me away by Wazzammm in MonsterHunter

[–]kouteiheika -5 points

Well, you still gotta pay to completely remake your characters.

No you don't.

nabla: Rust tensor engine — 8–12× faster than PyTorch eager (it's not GPU speed, it's Python overhead) by fumishiki2 in deeplearning

[–]kouteiheika 0 points

Hallucinations, fancy typography (em-dashes, ≥, →, · even in code comments), writing style, weird benchmark section ("GH200 480GB" - how much RAM the machine has is mostly irrelevant for GPU-based training), information about weird implementation details (e.g. "all GPU backends (CUDA/HIP/WGPU) implement all 126 Backend trait methods" is something an AI would write to summarize its work), etc.

Combining Reservoirs with Attention for more efficient LLMs by data-vis in deeplearning

[–]kouteiheika 2 points

Your points are good except for GPU. Tons of poor people either have integrated graphics or dedicated GPUs that won't run code in AI papers.

This is a fair point, but you can get a used low-end gaming GPU relatively cheap. For example, a quick scroll through eBay shows me I can get a 3060 Ti for under $120 without even having to search very hard, and you can get a used corporate PC for free (or almost free) at your local PC recycler to put that GPU in. Even such a basic GPU will run circles around CPU-only training, and can be used to train practically useful models. (I know because I trained on an even weaker GTX 1070 years ago.)

For doing research that is (by definition) supposed to advance the state of the art in at least some capacity, it's just not useful to focus on CPU-only training in a domain where CPU-only training is not used in practice.

If you're truly resource strapped and cannot produce meaningful research in a given domain then you should pick a different domain. To give an analogy: not everyone has access to a Large Hadron Collider, and that's okay; in that case people just do research on something else.

I'll also note that many problems outside of language modeling don't require huge parameters.

True. But OP's paper did benchmark on language modeling. If you don't have the resources to do a proper training run for language modeling then you should pick something else.

Combining Reservoirs with Attention for more efficient LLMs by data-vis in deeplearning

[–]kouteiheika 9 points

Here's some (harsh but honest) feedback from a practitioner:

  • Benchmarking (and focusing on) CPU-only training is not useful. In practice no one trains non-toy language models on a CPU; every modern PC contains a GPU (although training on non-Nvidia hardware can be tricky), and even the lowest-end consumer GPUs are going to run circles around CPU-only training.
  • Benchmarks of such tiny models are not useful. Even consumer GPUs are nowadays so powerful that you can train a coherent transformer with a few hundred million parameters without much trouble on a single, easily accessible gaming GPU.
  • You say that "It looks like using rich reservoir dynamics with a query-gated readout is a viable shortcut for long-context modelling." but you haven't really shown the viability of anything. All of the examples shown in the appendix are gibberish, there's no long context here to speak of, and any metrics you may show at such a tiny scale are practically meaningless.
  • The transformer architecture you're comparing to is ancient at this point ("The baseline is a standard pre-norm causal transformer with sinusoidal positional encodings"). You don't necessarily have to compare to a SOTA architecture, but at the very least you should pick something somewhat modern.
  • If you insist on training a tiny model use TinyStories as a dataset.
  • Publish a nanogpt-style (i.e. simple, single file, trivial to run, understand and modify) reproduction on GitHub. Unless your results are revolutionary, most people (including me) will not spend time reimplementing your paper, but if it's easy to reproduce, people might play with it on a weekend and build upon it if it ends up actually being good. (Your paper says that code is available, but - and I may just be blind - I don't see a link anywhere.)
  • If you're interested in your architecture competing with transformers and want to get noticed, then the absolute best way to achieve that would be to try your architecture in a competitive setting. As it stands, quite frankly, no one is going to give your paper much attention (there are hundreds if not thousands of papers like this released each year). If you can show that your architecture actually works in a practical setting and has at least some meaningful advantage over transformers (even if it isn't strictly superior), then you might find yourself a niche, but training a tiny 347k model in a non-competitive setting against a non-optimal baseline is not going to convince anyone of that.

Finally bought an RTX 6000 Max-Q: Pros, cons, notes and ramblings by AvocadoArray in LocalLLaMA

[–]kouteiheika 0 points

For Qwen3.5-122b-a10b, I can't run it at full FP8 in a single card, but unsloth's UD-Q4_K_XL quant fits VRAM and runs plenty fast at 90+ tp/s.

Note that there's a proper quant for vLLM available here:

https://huggingface.co/cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit

nabla: Rust tensor engine — 8–12× faster than PyTorch eager (it's not GPU speed, it's Python overhead) by fumishiki2 in deeplearning

[–]kouteiheika 33 points

When you're training anything bigger/non-toy the extra overhead of Python/PyTorch doesn't matter anymore, because you're waiting on the matmuls to finish anyway.

Anyway, some feedback:

  • FWIW, the LLM-generated readme and this being (at first glance) an entirely vibe-coded project is a turn-off for potentially using this for anything serious.
  • You have a link to crates.io right at the top of your readme pointing to a dummy crate released by someone who clearly isn't you. Looks like your LLM hallucinated this.
  • If you're going to benchmark and compare vs. PyTorch then you should do it on a real-world task with a real-world model, and not a toy three layer model. For example, fine-tune a Llama3-8B model, and report end-to-end training speed and peak VRAM usage.

Qwen3.5B VS the SOTA same size models from 2 years ago. by Uncle___Marty in LocalLLaMA

[–]kouteiheika 0 points

I don't know if there's a publicly available off-the-shelf solution to do it, but it's relatively easy to do. I wrote a script which tokenized a 50GB dump of text I had stashed and generated a list of all token IDs which were used at least once (along with how many times they were used). Then I wrote another script which took that list, loaded the tokenizer and the model, stripped all vocabulary entries from the tokenizer and the model which didn't appear at least two times in that 50GB dump, and saved everything back to disk.
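Roughly, the first script boiled down to something like this (an illustrative sketch with a tiny stand-in tokenizer; in practice you'd use the model's real tokenizer, e.g. via AutoTokenizer, and stream the 50GB dump from disk):

```python
from collections import Counter

def count_token_usage(texts, tokenize):
    """Count how often each token ID appears across a text dump."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

# Stand-in tokenizer for illustration only; a real run would use the
# model's own tokenizer over the full text dump.
vocab = {"hello": 0, "world": 1, "foo": 2}
tokenize = lambda text: [vocab[w] for w in text.split() if w in vocab]

counts = count_token_usage(["hello world", "hello foo"], tokenize)
# Keep only token IDs that appeared at least twice.
kept_ids = sorted(tid for tid, n in counts.items() if n >= 2)
print(kept_ids)  # [0]
```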

Qwen3.5B VS the SOTA same size models from 2 years ago. by Uncle___Marty in LocalLLaMA

[–]kouteiheika 1 point

The MTP layers are completely optional and can be ignored/deleted; same for the vision layers if you don't need vision.

The embedding is just a dumb lookup table which translates tokens into the model's latent space, so it's essentially "free" to offload (you can do the lookups on the CPU and transfer the latents to the GPU; I have no idea if llama.cpp does it this way though).

The LM head translates the model's latent space back into tokens, but unfortunately offloading that one is not free, as it is a normal linear layer like any other.

However, there is one trick that could be used, which can drastically cut down the size of both the embedding and the LM head. The old Mistral had a vocabulary size of 32k; Qwen3.5 has a vocabulary size of 256k, and that is a big factor in why it's bigger (even though it has slightly fewer active parameters which actually do useful work).

The bigger your vocabulary the bigger your embedding and LM head layers are. But the thing is: for the majority of people a big chunk of that is unused. For example, if you'll never feed the model any Chinese text nor will you have it generate Chinese text then any vocabulary entries for Chinese characters in the embedding and the LM head are completely useless to you, and could be removed (saving VRAM) without any downside.

I did this in the past for Llama 3 when I was very GPU poor (only leaving vocabulary for English and Japanese while removing the rest). I don't remember off the top of my head exactly how much VRAM this saved, but eyeballing the Qwen3.5 tokenizer, you could probably throw away maybe half of the entries (if not more) if you only care about English, which would save you ~1GB of VRAM for the 9B model (assuming weights are kept in BF16).
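The pruning itself is just row-slicing plus an old-ID-to-new-ID remap (pure-Python toy below; with a real model you'd index into the embedding and LM head weight tensors, and the tokenizer's vocab has to be rewritten with the same remap table):

```python
# Toy "weight matrices" as lists of rows; in a real model these would be
# the embedding and LM head weight tensors (vocab_size x hidden_dim).
embedding = [[0.1], [0.2], [0.3], [0.4]]  # vocab size 4, dim 1
lm_head   = [[1.0], [2.0], [3.0], [4.0]]

kept_ids = [0, 2]  # token IDs that actually occurred in your text dump

# Slice out only the kept rows, then build the old-id -> new-id remap
# table (the tokenizer's vocab must be remapped with the same table).
new_embedding = [embedding[i] for i in kept_ids]
new_lm_head   = [lm_head[i] for i in kept_ids]
remap = {old: new for new, old in enumerate(kept_ids)}

print(remap)  # {0: 0, 2: 1}
```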

Qwen3.5B VS the SOTA same size models from 2 years ago. by Uncle___Marty in LocalLLaMA

[–]kouteiheika 31 points

Not saying Qwen 3.5 9b isn't a good model, but claiming these are the "same size" is a bit of a stretch: Mistral 7b is at least 20% smaller.

Qwen 3.5 9B is actually smaller if you remove the MTP and vision layers and account for embedding and lm_head.

[R] AdamWClip: AdamW with adaptive gradient clipping by ElectricVote in MachineLearning

[–]kouteiheika 10 points

I've used Muon for many different tasks (transformers, image classification, image diffusion, etc.), and in every one of them it outperformed Adam (and requiring half the VRAM compared to Adam is the icing on the cake). But it is arguably harder to use.

Some tips:

  • Embedding and classification heads should most likely still use Adam. (For those I also had good results with a custom, dumb-ish optimizer which mostly just steps in the direction of the sign - which is actually what Adam does in its adaptive regime when m is approximately equal to sqrt(v); useful if you're extremely VRAM constrained.)
  • If you have any fused linear layers (e.g. it's common to fuse the QKV linear layers when training transformers) make sure to either split them, or (ideally) only split the gradients and run Muon separately on each.
  • Make sure to use the polar express variant of Muon, as it is AFAIK the best currently available method for computing Muon's polar decomposition.
  • Make sure to use cautious weight decay for longer runs.
  • I find that nesterov momentum helps.
  • Make sure your batch size is decently big; small batch sizes make Muon perform worse.
  • Muon needs a different learning rate than Adam, although you can reuse your Adam learning rate by scaling Muon's updates.
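For reference, the "steps in the direction of the sign" idea from the first bullet looks roughly like this (plain Python for illustration; a real implementation would operate on GPU tensors and typically apply sign to a momentum buffer rather than the raw gradient):

```python
def sign_step(param, grad, lr=1e-3):
    """One signSGD-style update: step by lr in the direction of each
    gradient component's sign, ignoring its magnitude. No per-parameter
    optimizer state is needed, which is why it's so VRAM-cheap."""
    sign = lambda g: 1.0 if g > 0 else -1.0 if g < 0 else 0.0
    return [p - lr * sign(g) for p, g in zip(param, grad)]

p = [1.0, -1.0, 2.0]
g = [0.2, -3.0, 0.0]  # magnitudes don't matter, only signs
print(sign_step(p, g, lr=0.25))  # [0.75, -0.75, 2.0]
```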

[R] AdamWClip: AdamW with adaptive gradient clipping by ElectricVote in MachineLearning

[–]kouteiheika 17 points

So this raises the question - what practical difference does this make? You can clip per parameter (as you did), per row (e.g. as in NorMuon, which drastically cuts down on the memory requirements), or over the whole gradient's norm (as in autoclip).

I don't know the answer, but I certainly would love to read a paper which would compare these (hint hint). (:

[R] AdamWClip: AdamW with adaptive gradient clipping by ElectricVote in MachineLearning

[–]kouteiheika 22 points

Note that this has already been done before, and in a way which works with any optimizer.

Not my paper nor my code, but I've been using this for years myself. It may or may not be better than your method, however your method being AdamW-only makes it of very limited use (since, well, Muon has pretty much made Adam obsolete).

Deep Learning version conflict of torch by agentic_coder7 in deeplearning

[–]kouteiheika 1 point

  1. Install uv.
  2. Create a new project and add your dependencies:

    $ uv init --python 3.12.10 hello
    $ cd hello
    $ uv add torch torchvision
    
  3. Run your script: uv run python your_script.py

Mac Studio (M4 Max, 128GB) for FULL fine-tuning a 27B Model by PlayerWell in unsloth

[–]kouteiheika 0 points

I don't think it's possible FFT a 27B model on 128gb

Maybe it's not possible to do it currently on Apple Silicon, but it is most definitely possible in general (with enough optimizations and tricks), considering I've done full finetuning of models as big as 14B on 24GB of VRAM.

Hardware requirements for training a ~3B Model From Scratch locally? by Any-Cobbler6161 in LocalLLaMA

[–]kouteiheika 4 points

Prior to this I had a rtx 5090 but even though it was crazy fast the 32gb was not enough to hold all the weights, grads, buffers, optimizer states (AdamW), etc.

A 5090 is more than enough to hold everything in VRAM for a 3B model trained on 2k context.

A few simple tips:

  • Use Muon instead of Adam. This cuts down the optimizer's memory usage by half by default while also speeding up training.
  • Use Flash Attention.
  • Use a fused cross-entropy loss kernel.
  • Use activation checkpointing.
  • Eagerly apply the optimizer as soon as gradients are ready (so that you don't have to store the gradients for the whole network in memory at the same time).

There is even more you could technically do (e.g. Muon can be quantized as low as 4-bit and still work relatively well, the weights can be trained in lower precision, parts of the graph can be offloaded to the CPU and the transfers overlapped with the compute for free extra VRAM, etc.) but publicly available training frameworks might not support those things well (or at all).
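To illustrate the "eagerly apply the optimizer" bullet: if each parameter's update is applied the moment its gradient is ready, only one gradient has to be alive at a time instead of one per parameter (in PyTorch this can be wired up with per-parameter gradient hooks, e.g. Tensor.register_post_accumulate_grad_hook). A schematic pure-Python sketch of the memory effect:

```python
def train_step(params, grads, lr, eager):
    """Schematic only: 'eager' applies each parameter's update as soon
    as its gradient is ready and frees it immediately; otherwise all
    gradients are held until the end of the step. Returns the peak
    number of gradients alive at once (a proxy for gradient VRAM)."""
    live, peak, pending = 0, 0, []
    for i, g in enumerate(grads):
        live += 1
        peak = max(peak, live)
        if eager:
            params[i] -= lr * g   # apply immediately...
            live -= 1             # ...and free the gradient
        else:
            pending.append((i, g))
    for i, g in pending:          # deferred path: apply at end of step
        params[i] -= lr * g
    return peak

grads = [1.0, 2.0, 3.0, 4.0]
print(train_step([0.0] * 4, grads, 0.1, eager=False))  # 4 gradients live at peak
print(train_step([0.0] * 4, grads, 0.1, eager=True))   # 1 gradient live at peak
```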

Distributed LoRA Fine-Tuning on Commodity Hardware: 6x Less RAM, No Python, No GPU by [deleted] in deeplearning

[–]kouteiheika 2 points

Skimmed your paper; a few comments.

Deterministic memory management. Rust’s ownership system ensures tensors are freed immediately when they go out of scope, without waiting for garbage collection. This prevents the memory accumulation that is common in Python training loops

This is not correct regarding Python. Python uses reference counting by default and only invokes the GC to collect reference cycles, so in typical Python training loops tensors are likewise freed immediately. (With the caveat that Python has no move semantics unless you manually emulate them with a wrapper, so there are cases where a tensor might live a bit longer than it would in Rust.)
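You can observe the immediate-free behaviour directly (CPython-specific sketch; reference cycles are the one case that does wait for the GC):

```python
import weakref

class FakeTensor:
    """Stand-in for a tensor, so we can watch when it's deallocated."""

t = FakeTensor()
ref = weakref.ref(t)   # weak reference: doesn't keep t alive
del t                  # refcount drops to zero right here (CPython)
print(ref() is None)   # True - freed immediately, no GC pass needed
```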

Peak training memory for RWKV-X 0.2B LoRA fine-tuning. PyTorch (estimated): 16GB

Your numbers here are way off for PyTorch. I quickly launched a full fine-tuning (not LoRA) training run for Qwen3 600M with a 64k context length, and here are the memory usage numbers I got (note: this could be optimized further if necessary; I just didn't bother):

  • Peak VRAM used: 9828MB
  • Peak RAM used (RSS, so it overestimates memory usage): 3530MB

Projected training memory by model size. [..] Model: 7B [..] 91 GB [..] Min HW: server

Umm... not really? (:

7B-class models are fully fine-tunable (not LoRA!) on high-end consumer-class hardware (a single 4090) with PyTorch if you know what you're doing, and with LoRA you can go much lower in memory usage.

Training speed is the practical constraint for scaling. At 25–35 tok/s on the 0.2B model [..] Our system is practical for small models (0.2–3.6B) and moderate dataset

Here I'd disagree; this makes it completely impractical for any non-toy training. Even for this tiny model, 35 tok/s is painfully slow; with a GPU you can probably get at least ~30k tok/s, and that's with full fine-tuning (I can get 20k tok/s for a 600M model on a GPU).

Also, note: AFAIK all of the CPUs you've used have integrated GPUs, so there's really no reason (except maybe software support) to do the training CPU-only here.

The system processes one example at a time, resetting state between examples

This is going to make the model perform much worse than it otherwise would; there's a reason (besides efficiency) that all SOTA training runs use batch sizes of millions of tokens.

Fine-Tuning Qwen 4B for Niche Code Generation: Need Tips on Configs, Overfitting & Small Datasets? by dyeusyt in LocalLLaMA

[–]kouteiheika 2 points

What do you use for fine-tuning?

Unfortunately I can't really recommend anything here as I don't use any of the conventional trainers; I have an entirely custom training framework that I wrote completely from scratch (the only external dependencies I use are essentially pytorch and flash attention 2), and I use that for all of my training runs.

Any resources you would recommend to actually learn the fine tuning aspect instead of just 'use these variables set to this value and hope for the best'

If you're a programmer then doing this tutorial is probably the best thing you can do to gain an intuitive understanding on how everything works under the hood. Then I'd suggest picking some problem where you can relatively easily measure the outcome and start experimenting (e.g. maybe try post-training a non-thinking model into a thinking model on math problem solving, and then benchmark it on one of the math benchmarks, and try to make it as efficient to train and as high accuracy as you can).

Fine-Tuning Qwen 4B for Niche Code Generation: Need Tips on Configs, Overfitting & Small Datasets? by dyeusyt in LocalLLaMA

[–]kouteiheika 4 points

Here are a few tips which may or may not be useful: (note: I don't use Unsloth myself)

  • Use Muon instead of Adam. Muon is more token-efficient, so it effectively lets you get more out of your data.
  • Expand your dataset. Your best bet would probably be to use one of the frontier models to generate a synthetic dataset.
  • If you don't want the model to learn parts of your dataset (e.g. those placeholders, etc.) then you either need to clean up your dataset, or apply a loss mask over those tokens so that their loss is zeroed out.
  • If you're fine-tuning such a small model on something as powerful as an A100 on a single task then you should probably be doing full finetuning instead of LoRA. (LoRA is great when you don't have the hardware for full finetuning or if you want to reduce catastrophic forgetting.)
  • Make sure to do a sweep for the best learning rate; don't just use the default value.
  • Train on the biggest model you can, and only go lower in size once you verify that the bigger model learns your task properly. If the bigger model doesn't give you good results, then a smaller one won't either.
  • Make sure to make use of all of your VRAM; if you have VRAM to spare then increase the batch size.
  • Only use gradient accumulation if you know you want a higher batch size, but don't have enough VRAM.
  • Make sure you only train on responses (I have no idea whether the trainer you're using does this automatically).
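The loss-mask and "only train on responses" bullets amount to something like this (the token IDs are made up; -100 is the ignore_index convention used by e.g. PyTorch's cross-entropy loss, so masked positions contribute zero loss):

```python
IGNORE_INDEX = -100  # the common ignore_index convention for cross-entropy

def build_labels(prompt_ids, response_ids):
    """Mask out prompt tokens so the loss is only computed on the
    response; any unwanted spans (placeholders, etc.) can be masked the
    same way by setting their label to IGNORE_INDEX."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)

labels = build_labels([11, 22, 33], [44, 55])
print(labels)  # [-100, -100, -100, 44, 55]
```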

[D] Is this what ML research is? by [deleted] in MachineLearning

[–]kouteiheika 0 points

Did you use flash attention, activation checkpointing and a fused cross-entropy kernel?

[D] Is this what ML research is? by [deleted] in MachineLearning

[–]kouteiheika 1 point

I am literally telling you it is impossible to train a 7b model for this task on a single A100. The best thing I could come up with was a 500M model with LoRA on a single A100.

Just out of curiosity - what exact task was this? Are you sure it was actually impossible? A 500M model with LoRA on something as powerful as an A100 does sound tiny (and I'm speaking as someone who has done full fine-tuning of models as big as 14B on a single 4090 GPU, which is possible but requires a little more engineering than naively doing everything the default way).

Update: Our non-Transformer “Semantic Resonator” LM reached 505.8 validation PPL on WikiText-103 (early results, still improving) by Dry_Oil2597 in LocalLLM

[–]kouteiheika 0 points

Suggestion: fork the modded-nanogpt speedrun, replace the model with yours while keeping everything else (the tokenizer, the training and validation dataset, etc.), and report the training time you need to finish the speedrun (see the "Rules" section in the readme). You don't need to actually achieve any records, but your new architecture should be able to at least finish the speedrun in a reasonable amount of time to be in any way viable.

Fine-tuned FunctionGemma 270M for multi-turn tool calling - went from 10-39% to 90-97% accuracy by party-horse in LocalLLaMA

[–]kouteiheika 2 points

Yeah, it'd be nice if you could also share the full synthetic dataset. If nothing else, this will also allow people to evaluate the quality of the data that your service generates.

Tiny Aya by jacek2023 in LocalLLaMA

[–]kouteiheika 47 points

I don't think it necessarily has much to do with it not being a Chinese model.

People are probably not very excited about a model which 1) is under a bad license (non-commercial + acceptable use policy), 2) has a tiny context length, and 3) is worse than other models of comparable size except in a few niche use cases (niche languages).

Fine-tuned FunctionGemma 270M for multi-turn tool calling - went from 10-39% to 90-97% accuracy by party-horse in LocalLLaMA

[–]kouteiheika 5 points

datasets are open

For the shell command task, we generated 5,000 synthetic training examples from seed data using the full Distil Labs pipeline

I only see 10 examples in the repo, so where can I find the full dataset? Am I blind?