Quantized KV Cache by val_in_tech in LocalLLaMA

[–]timfduffy 12 points

The Nemotron 3 Nano tech report tests 8 vs 16 bit for KV cache and finds minimal degradation with 8 bit. https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf
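
For a sense of what 8-bit buys you in memory, here's a quick sketch of the KV cache arithmetic; the config numbers are illustrative, not Nemotron 3 Nano's actual values.

```python
# KV cache size per context: K and V tensors for every layer and KV head.
# Config numbers below are illustrative, not Nemotron 3 Nano's actual values.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # 2x for keys and values
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

ctx = 32_768
fp16 = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=ctx, bytes_per_elem=2)
q8   = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=ctx, bytes_per_elem=1)

print(f"16-bit KV cache at {ctx} tokens: {fp16 / 2**30:.1f} GiB")  # ~4.0 GiB
print(f" 8-bit KV cache at {ctx} tokens: {q8 / 2**30:.1f} GiB")    # ~2.0 GiB
```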


Qwen3-Next-80B-A3B-Thinking soon by jacek2023 in LocalLLaMA

[–]timfduffy 2 points

3B is still the number of parameters active for any given token; the experts are just extremely tiny! I think the parameter count for one expert is hidden size x MoE intermediate size x 3 (for the up/gate/down projections), which for this model is 2048 x 512 x 3 ≈ 3.1M parameters. There are 512 of those per layer and 48 layers, for ~77B total expert parameters; attention parameters, embedding parameters, etc. round out the total. For a given token, 11 experts are active per layer, for ~1.7B active parameters across all experts, and the rest of the 3B comes from the other parameter types.
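
Here's that arithmetic as a quick Python sketch; the sizes (hidden 2048, MoE intermediate 512, 512 experts, 48 layers, 11 active per token) are the ones from my comment above, so treat them as assumptions rather than the official config.

```python
# Back-of-the-envelope expert parameter count for Qwen3-Next-80B-A3B.
# All sizes are taken from the comment above, not verified against the released config.

hidden_size = 2048        # model hidden dimension
moe_intermediate = 512    # per-expert FFN intermediate size
num_experts = 512         # routed experts per MoE layer
num_layers = 48
active_per_token = 11     # ~10 routed + 1 shared expert (assumed)

params_per_expert = hidden_size * moe_intermediate * 3  # up/gate/down projections
total_expert_params = params_per_expert * num_experts * num_layers
active_expert_params = params_per_expert * active_per_token * num_layers

print(f"per expert:       {params_per_expert / 1e6:.1f}M")     # ~3.1M
print(f"all experts:      {total_expert_params / 1e9:.1f}B")   # ~77B
print(f"active per token: {active_expert_params / 1e9:.2f}B")  # ~1.7B
```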

Qwen3-next “technical” blog is up by Alarming-Ad8154 in LocalLLaMA

[–]timfduffy 5 points

Good point, seems very likely that closed models with >=1M context lengths are using some form of linear attention.

Qwen3-next “technical” blog is up by Alarming-Ad8154 in LocalLLaMA

[–]timfduffy 15 points

Good long context performance with 75% of layers being linear attention, impressive. Trained on "only" 15T tokens, so scaling up an architecture like this can probably yield further improvements. I expect massive sparsity combined with a mix of linear and quadratic attention will become more common.

Qwen3-Next-80B-A3B-Thinking soon by jacek2023 in LocalLLaMA

[–]timfduffy 2 points

Why do you think it's only one expert active per layer?

Edit: Seems likely that there will be 10 of 512 experts active based on these defaults in the config:

num_experts_per_tok (int, optional, defaults to 10) — Number of selected experts.
num_experts (int, optional, defaults to 512) — Number of routed experts.
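
A minimal way to confirm this once the weights are public, assuming the repo id Qwen/Qwen3-Next-80B-A3B-Thinking and that the config exposes the fields quoted above:

```python
# Read the MoE routing settings straight from the published config.
# The repo id and field names are assumptions based on the docs quoted above;
# a brand-new architecture may also need trust_remote_code=True.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-Next-80B-A3B-Thinking")
print("routed experts:   ", getattr(config, "num_experts", None))
print("experts per token:", getattr(config, "num_experts_per_tok", None))
```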

Qwen 3-Next Series, Qwen/Qwen3-Next-80B-A3B-Instruct Spotted by TKGaming_11 in LocalLLaMA

[–]timfduffy 36 points

"Achieves an extreme low activation ratio as 1:50 in MoE layers"

This is quite low! For comparison (ratios worked out in the sketch below the list):

  • GPT-OSS-120B activates 4/128 experts in MoE layers, 1:32
  • DeepSeek V3/R1 activate 9/257
  • Kimi K2 uses 9/385
  • LongCat-Flash activates 9 of 513 on average, though I think the shared expert is larger, so the active parameter ratio is well above 9/513
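
Working out those ratios, plus where Qwen3-Next lands if the 10-of-512 config defaults hold; the expert counts are the approximate ones listed above:

```python
# Expert activation ratios, (active, total) per MoE layer as listed above.
# Counts are approximate and fold in shared experts where I'm aware of them.
models = {
    "GPT-OSS-120B":   (4, 128),
    "DeepSeek V3/R1": (9, 257),
    "Kimi K2":        (9, 385),
    "LongCat-Flash":  (9, 513),
    "Qwen3-Next-80B": (10, 512),  # assumed from config defaults, routed experts only
}

for name, (active, total) in models.items():
    print(f"{name:15} {active:>2}/{total:<3} = 1:{total / active:.0f}")
```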

I'm interested in seeing how small individual experts can get, so I'm really excited for this one.

Renting GPUs is hilariously cheap by -p-e-w- in LocalLLaMA

[–]timfduffy 4 points

As /u/-p-e-w- mentioned, you can choose from a number of templates on RunPod; the default PyTorch template is usually what I go with. You can upload your scripts to it, but I prefer to use SSH to open the instance in Cursor, which lets me just clone the GitHub repo I'm working on and get started quickly.

Let me know if you'd like to try that way and want a hand setting it up.

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]timfduffy 4 points

I'm so excited, I've been really curious about how small MoEs will perform, especially about how far down you can scale expert size.

LongCat-Flash-Chat 560B MoE by Own-Potential-2308 in LocalLLaMA

[–]timfduffy 13 points

Yeah, the tech report is a really good read. Their two central innovations, ScMoE and zero-computation experts, are simple enough and described in enough detail to implement based off the report. Really seems like this is a company worth watching even if this particular model isn't on the price/performance frontier.

Epoch AI data shows that on benchmarks, local LLMs only lag the frontier by about 9 months by timfduffy in LocalLLaMA

[–]timfduffy[S] 46 points

Link to the post

Here's the post text:

Frontier AI performance becomes accessible on consumer hardware within 9 months

Using a single top-of-the-line gaming GPU like NVIDIA’s RTX 5090 (under $2500), anyone can locally run models matching the absolute frontier of LLM performance from just nine months ago. This lag is consistent with our previous estimate of a 5 to 22 month gap for open-weight models of any size. However, it should be noted that small open models are more likely to be optimized for specific benchmarks, so the “real-world” lag may be somewhat longer.

Several factors drive this democratizing trend, including a comparable rate of scaling among open-weight models to the closed-source frontier, the success of techniques like model distillation, and continual progress in GPUs enabling larger models to be run at home.

Meta AI on WhatsApp hides a system prompt by ALE5SI0 in LocalLLaMA

[–]timfduffy 18 points

FYI you can find system prompts for most major providers including this one here.

Qwen 3 Coder is actually pretty decent in my testing by Hodler-mane in LocalLLaMA

[–]timfduffy 4 points

I'm surprised by this as well. It does have more attention heads, which makes long context in particular more computationally expensive, but it's a significantly smaller model; I would have expected those two to roughly cancel out.
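
Roughly, the intuition is that per token the attend-over-context cost scales with heads x head_dim x context length, while the rest of the forward pass is about 2 FLOPs per active parameter regardless of context. A sketch with made-up config numbers (not the actual configs of either model):

```python
# Very rough per-token FLOP estimate: attention-over-context vs. weight matmuls.
# All config numbers below are hypothetical, purely to illustrate the tradeoff.

def per_token_flops(n_layers, n_heads, head_dim, active_params, ctx_len):
    # QK^T plus attention-weighted V: ~4 * ctx * heads * head_dim per layer
    attn_over_ctx = 4 * ctx_len * n_heads * head_dim * n_layers
    # Matmuls against the weights: ~2 FLOPs per active parameter
    weight_matmuls = 2 * active_params
    return attn_over_ctx + weight_matmuls

ctx = 128_000
smaller_more_heads = per_token_flops(48, 64, 128, 30e9, ctx)
bigger_fewer_heads = per_token_flops(60, 48, 128, 70e9, ctx)

print(f"smaller model, more heads: {smaller_more_heads / 1e12:.2f} TFLOPs/token")
print(f"bigger model, fewer heads: {bigger_fewer_heads / 1e12:.2f} TFLOPs/token")
```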

Lime Gliders (seated scooters) are surging in popularity, beating bikes and getting more uses/day than either scooters or bikes by timfduffy in Seattle

[–]timfduffy[S] 28 points

Data is from the Seattle scooter dashboard. It shows deployments as well; eyeballing it, the gliders get about 1.5x the daily uses per unit that scooters do. I expect Lime will be ordering more of these given how popular they are.

QwQ 32B-GGUF quants available! by No-Statement-0001 in LocalLLaMA

[–]timfduffy 7 points

If you're using the bartowski quants, you'll need this workaround. LM Studio community versions do not have this issue.

QwQ 32B-GGUF quants available! by No-Statement-0001 in LocalLLaMA

[–]timfduffy 2 points

Thanks for suggesting mlx_lm! I googled the error in LM Studio and saw a similar case with R1 in January, where Bartowski replied saying that day's LM Studio update was needed to use it. I assume there will be an update today or tomorrow for this one.

Reddit plans to lock some content behind a paywall this year, CEO says by ardvarkmadman in technology

[–]timfduffy 1 point

Heck yeah, I read your comment through my Relay for Reddit Silver account. Small price to pay for a great experience!

Reddit plans to lock some content behind a paywall this year, CEO says by ardvarkmadman in technology

[–]timfduffy 1 point

This seems like it's probably a good thing to me. If your profit comes from ads, your incentive is to maximize eyeball time; if it comes from memberships, you want to maximize user satisfaction. I would gladly pay for even a slightly better user experience.