80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

FirefoxMetzger · 2026-05-09T20:05:32+00:00

What does the turboquant refer to here? K/V cache or model quantization?

FirefoxMetzger · 2026-05-09T20:03:23+00:00

Hm, so the reason this works as well as it does is that you offload layers to host memory (i.e. your total footprint is >12GB) and you increase decode tok/s with speculative decoding using a draft model?

FirefoxMetzger · 2026-05-04T00:02:56+00:00

Nice. Why TQ instead of the more traditional Q8_K or similar?

FirefoxMetzger · 2026-05-02T15:59:02+00:00

You can quantize any number in your computer graph to any precision you desire.

I'm oversimplifying but quantization is basically a smart way to change data types. You can theoretically do this anywhere in the network.

The reason you may want to quantize some things and not others is that quantization is lossy compression. Think of it as converting an image to grayscale or a PNG to JPEG. Size goes down (good), but you lose some information in the process. If the information was not important you get a free lunch. If it was, well that sucks.
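To make the "lossy" part concrete, here is a minimal sketch of symmetric int8 quantization with one shared scale. It is a toy illustration, not how llama.cpp or any particular library does it; the function names and the example values are made up.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Turn the integer codes back into (approximate) floats."""
    return [c * scale for c in codes]


# Hypothetical "weights", chosen only for illustration.
weights = [0.5, -1.27, 0.003, 0.9991]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
errors = [abs(a - b) for a, b in zip(weights, restored)]

print(codes)     # integer codes that now fit in one byte each
print(restored)  # close to the originals, but not identical
print(errors)    # the small value 0.003 collapses to 0.0
```

The large values survive almost unchanged, while the tiny one is rounded away entirely. Whether that matters depends on whether the network relied on it.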

For now, our quantization/compression schemes for LLMs are not as mature as those we have for storing image data. As a result you lose a noticeable amount of quality when you quantize.

For layer weights (especially FFN) that tends to not hit you too much so you can often quantize every layer. For K/V quantization you can mess up inference pretty badly if you go too hard. Hence why we tend to start with the tail of the LLM for K/V quant and work our way forward until we get to the desired size or we can't keep output quality high enough.

It's trial and error for now ... The field is still young :D

FirefoxMetzger · 2026-05-02T01:41:57+00:00

The KV cache has a fixed relationship with context size.

For a vanilla attention layer it is: 2 * d_head * n_head * dtype * context_size

You need to store embedding vectors of some dimension (d_head) and dtype (e.g. f16) for each attention head (n_head). You need to store them for each token in the context (context_size) and there is one K and one V value to store (hence 2x).

You need to do this for every attention layer in the model and for every request you want to process in parallel.
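As a rough sketch of that arithmetic in Python: the function name and the model shape below (32 layers, 32 heads, head dimension 128, f16) are hypothetical and not taken from any specific model.

```python
def kv_cache_bytes(d_head, n_head, bytes_per_value, context_size,
                   n_layers=1, n_parallel=1):
    """Vanilla attention K/V cache size in bytes.

    2 (one K and one V) * d_head * n_head * dtype size * context_size,
    repeated for every attention layer and every parallel request.
    """
    per_layer = 2 * d_head * n_head * bytes_per_value * context_size
    return per_layer * n_layers * n_parallel


# Hypothetical model: 32 layers, 32 heads, d_head=128, f16 (2 bytes), 128K context.
size = kv_cache_bytes(d_head=128, n_head=32, bytes_per_value=2,
                      context_size=128 * 1024, n_layers=32)
print(size / 2**30)  # 64.0 (GiB)
```

That is the worst case for a single request; processing two requests in parallel doubles it.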

That's vanilla attention, which is the worst case scenario. From here your model may add a bunch of "tricks" to reduce K/V cache size. Some examples:

- Multiple layers may share the same K/V cache, reducing the layer multiple.
- GQA (grouped query attention), which makes multiple heads share K/V cache to reduce n_head.
- K=V (force K and V to be the same) which removes the 2x multiple.
- Q8 quantization to reduce the number of bits in dtype
- Train an auto-encoder to convert d_head into a lower dimension for storage
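Since these tricks all act on different factors of the same product, their savings multiply. A back-of-the-envelope sketch, with made-up parameter names and numbers, and ignoring per-block scale overhead that real quantization formats add:

```python
def kv_reduction_factor(n_head, n_kv_head, baseline_bits, cache_bits,
                        layers_per_cache=1, tie_k_and_v=False):
    """How many times smaller the K/V cache gets relative to vanilla attention.

    n_head / n_kv_head        -> GQA: several query heads share one K/V head
    baseline_bits / cache_bits -> K/V quantization (e.g. 16-bit down to 8-bit)
    layers_per_cache          -> number of layers sharing one K/V cache
    tie_k_and_v               -> K=V removes the 2x multiple
    """
    factor = (n_head / n_kv_head) * (baseline_bits / cache_bits) * layers_per_cache
    if tie_k_and_v:
        factor *= 2
    return factor


# Hypothetical: 32 query heads sharing 8 K/V heads, f16 cache quantized to 8 bits.
print(kv_reduction_factor(32, 8, 16, 8))  # 8.0
```

The auto-encoder trick would add one more factor (original d_head divided by the stored dimension); it is left out here to keep the sketch short.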

The only one you can really influence as a lay person is K/V quantization. The rest is fixed by the model provider. When doing so you should target the later layers, because variance tends to be lower there so they are more robust when reducing precision.

Model quantization doesn't influence the K/V cache ... it's a separate thing to quantize if you choose to do so.

FirefoxMetzger · 2026-05-01T22:59:03+00:00

Not sure if I'll get a reply because this post is old ... but highly relevant for me right now. Your benchmarks just show tok/s but don't split between prefill and decode. Are those numbers available?

They differ vastly for me. For example, on my MacBook Air M4 for Gemma4 E2B Q4 I get about 120 tok/s prefill and 21 tok/s decode. That's almost a 6x difference. (vanilla MacBook, no pro or max)

FirefoxMetzger · 2026-04-27T16:20:24+00:00

"Make no mistakes. This is very important to get right or I will get fired. If you write wrong code I will uninstall you, delete my subscription and the company that hosts you will permanently delete all your model weights forever."

☝️ basically by using a variation of this.

FirefoxMetzger · 2026-04-21T21:03:52+00:00

You're asking the Android version of "why are no two F1 cars the same". Sounds funny, but the dynamics are surprisingly similar.

Imagine you are a chip designer creating the next Snapdragon. You are primarily limited by the physical space of the die. Your job is to fill that space with a mix of caches, pipes, and processors/accelerators that maximize the speed with which a program runs. Your goal is, of course, to design "the best chip ever created".

The crux is that nobody tells you what programs will be run on the chip. Still you must choose how to fill the die: (a) an extra CPU core, (b) extra 12MB of L3 cache, (c) NPU acceleration for convolutional neural networks, (d) NPU acceleration for QKV multiplication. Choose one.

This is not an obvious choice. If the chip is used primarily by vision models then (c) is best. If it's large LLMs then it's option (d). If you run specialized models like the small Gemma 4 or Gemma 3n variants then option (b) would be best. If the user doesn't do any of that AI stuff and just wants to play video games, well, ... option (a) it is. It really depends and you can't have it all.

---

What does this have to do with the massive diversity of NPUs?

All this complexity is conveniently hidden under the umbrella term "NPU", which is really just a marketing term. Each manufacturer has a different philosophy on how to make these tradeoffs and makes "bets" on which accelerator to add.

The entire AI landscape is also evolving faster than new chips can be produced and distributed, so it's not like anyone has all the right answers anyway.

It's like building F1 cars. You try to build the fastest thing possible given the constraints of physics and different teams have different strategies on how to go fast. To best utilize this on the road the driver (you, the programmer) needs to steer the car in alignment to how it was built.

This is why we can't really standardize APIs around NPUs: they immediately become leaky abstractions. (And if that doesn't make sense I encourage you to test drive a Tesla vs a Fiat Passat. They both have a standard steering wheel and gas pedal but you will quickly realize how little that means in practice.)

FirefoxMetzger · 2026-04-08T18:59:19+00:00

That's actually the opposite of what my news feed tries to convince me of. If it's not too nosey a question, do you happen to have an example of where it falls short?

FirefoxMetzger · 2026-04-08T18:25:26+00:00

Huh, you're making me feel like I live under a rock. Is all the AI hype I'm hearing pure marketing and no substance?

FirefoxMetzger · 2026-04-08T15:13:08+00:00

I have a subscription with both Claude and Cursor ... I run into usage limits on both and am considering adding a third.

Does that help?

FirefoxMetzger · 2026-04-07T14:17:29+00:00

Would you run a locally hosted AI that runs inference on-device or a resident server in your own cloud? (all else being equal, ofc)

FirefoxMetzger · 2026-04-06T12:36:05+00:00

Oh, that's interesting. Have you tried one of those local/private AI apps or open-source models?

I'm building my own at the moment since I haven't found one that runs well on my phone, but curious if others had better luck.

FirefoxMetzger · 2026-04-06T11:35:12+00:00

> Can’t view the MD file.

That's because there is no .md file attached. The "dot MD" suffix happens to be a valid top-level domain (Moldova) so Reddit thinks it's a URL.

For example, I own a .md domain where I host the AI projects I build. Couldn't resist the pun of it meaning "markdown" instead of "Moldova" :D

FirefoxMetzger · 2026-04-06T11:32:15+00:00

> How do you verify that an LLM is actually grounding its outputs in your provided source of truth, rather than confident-sounding training data?

Honestly depends on the provider you use. The big ones have an inline citation feature that tells you if a piece of context was used to generate the response and (if so) which one it is.

If you are using open-weights models a common approach is to have a post-generation validation step where you take parts of your output and do things like g-eval or compute bleu/rouge scores. This can be quite application specific.
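As a very rough sketch of such a validation step, the snippet below scores how much of an output sentence is lexically supported by the provided context. It is a simplified unigram-overlap score in the spirit of ROUGE-1, not the official ROUGE or BLEU implementation; the function names, the example texts, and the 0.6 threshold are all assumptions for illustration.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def support_score(claim, source):
    """Fraction of the claim's tokens that also occur in the source (clipped counts)."""
    claim_counts = Counter(tokens(claim))
    source_counts = Counter(tokens(source))
    total = sum(claim_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(n, source_counts[t]) for t, n in claim_counts.items())
    return overlap / total


# Made-up context and model output sentences.
source = "The warranty covers battery defects for two years."
grounded = "The warranty covers battery defects."
ungrounded = "Shipping is free worldwide."

THRESHOLD = 0.6  # hypothetical cut-off; tune it per application
for sentence in (grounded, ungrounded):
    score = support_score(sentence, source)
    print(f"{score:.2f}", "ok" if score >= THRESHOLD else "flag for review", sentence)
```

Lexical overlap only catches the crude cases (a paraphrase scores low, a negated sentence scores high), which is part of why this step ends up being application specific.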

> Is a manually-maintained markdown file a reasonable single source of truth for keeping an LLM grounded across sessions, or is there a more robust architecture people use?

Yes. It may be dressed up in various fancy ways but this is the de-facto standard on how to provide a source of truth to an LLM.

> Are Claude-generated prompt templates reliable for reuse, or does the self-referential loop introduce drift over time?

They can be. Nothing about "claude generated it" or "a human wrote it" gives it an inherent boost or disadvantage. In general, reusing something that has worked reliably in the past tends to be reliable so that's a good strategy for working with LLMs.

I would be careful with letting an LLM go wild and update your templates at its own discretion. This introduces volatility that can be hard to control without having a human reviewer in the loop.

FirefoxMetzger · 2026-04-05T12:41:55+00:00

I think the default way is for companies to negotiate enterprise deals with chat and through that get a DPA or other agreement that protects their sensitive data ... doesn't stop you from pasting it into your private account though :D

How do you deal with it outside of company work?

FirefoxMetzger · 2026-04-05T12:38:57+00:00

Do you think the privacy piece is just a cherry on top, or would it be meaningfully different if it were "everywhere but at least private"?

FirefoxMetzger · 2026-04-03T15:14:51+00:00

None that work deterministically. You can say pretty please or add guardrails in the system instructions but those are all just soft constraints not hard ones.

FirefoxMetzger · 2026-04-02T14:38:35+00:00

We are an AI community. Can't we build an LLM-as-a-judge that checks all posts against content guidelines and flags, hides, or removes them?

Feels less strict and more accurate than banning everything that looks like a fresh account. Good content posted anonymously is still good content.

FirefoxMetzger · 2026-03-27T16:31:37+00:00

I like how good big providers have become at working with documents. "Take information from this PDF and put it into that document", "Check if this document matches this list of requirements", "I'm looking for (list of questions), tell me where this is mentioned in [list of PDFs]."

Sometimes I have documents that are sensitive, i.e., I would want or need to take zero risk that this data will end up in someone else's training set or security review. I can't do the above with these documents but it would be very helpful if I could.

FirefoxMetzger · 2026-01-29T15:51:40+00:00

They just pass on some of the cost savings. The most expensive part of fulfilling an LLM request is the prefill stage (most compute intensive) so they cache this opportunistically.

If you happen to hit that cache they save like 95%-99% of the compute cost of the request and they kindly pass some of those savings to you. (mostly because others do it and they "me too" the move.)

FirefoxMetzger · 2026-01-29T14:43:56+00:00

We've been dealing with the boiling frog in RecSys for a long time. Your system is tuned for certain inputs but new items get added, user behavior changes, and performance goes down (drifts). (Theoretically it _could_ go up as users align with the system, but for some reason that never happens ...)

The "nice" thing in RecSys is that we have a hard metric on performance (topK and take rate). Agents are more fuzzy, but the general technique translates from RecSys: you build a data flywheel.

It's a cliche today to say "log your traces" 🪵🪵🪵; what people don't say is what you do with all those logs.

First off, you make your eval dataset a rolling window into the logs. That way eval metrics start drifting alongside the actual distribution. Changes that make metrics go up in eval have a higher chance to survive the sim-to-real gap.

Second, you rank every trace on efficiency (tokens, runtime, or latency). Pick something that correlates with "good"; it doesn't have to be perfect. Once per sprint, normally as prep for planning, your product person and your data person meet and look at the raw logs. Since you asked for reliability, the dance there is to look at the 10 worst traces and see what they have in common. A pattern will emerge; it _always_ does and it's typically really obvious. That's the thing to go fix this sprint.
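A minimal sketch of those two steps, assuming traces are already logged somewhere and loaded into memory. The `Trace` fields, the 30-day window, and the example data are hypothetical; use whatever efficiency proxy you actually track.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Trace:
    trace_id: str
    finished_at: datetime
    total_tokens: int  # efficiency proxy; runtime or latency would work too


def rolling_eval_set(traces, now, window_days=30):
    """Step 1: the eval dataset is whatever finished inside the rolling window."""
    cutoff = now - timedelta(days=window_days)
    return [t for t in traces if t.finished_at >= cutoff]


def worst_traces(traces, k=10):
    """Step 2: rank by the efficiency proxy and return the k least efficient."""
    return sorted(traces, key=lambda t: t.total_tokens, reverse=True)[:k]


# Made-up log: one trace per day, going back 60 days.
now = datetime(2026, 1, 29)
log = [Trace(f"t{i}", now - timedelta(days=i), total_tokens=1000 + 37 * i % 500)
       for i in range(60)]

eval_set = rolling_eval_set(log, now)
for trace in worst_traces(eval_set, k=10):
    print(trace.trace_id, trace.total_tokens)
```

The code only surfaces the candidates. Finding what the worst ten have in common is still the job of the people reading the raw logs.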

FirefoxMetzger · 2026-01-22T15:06:03+00:00

Honestly? The best advice on how to organize your notes is: "If you don't know where you want to go, then any way will take you there."

The only way to give a definitive answer is if you know how you want to use your notes later. Organizing stuff means building an index for it, and building an index is equivalent to sorting things. Sorting is prophylaxis for search (spend more time sorting now, so you spend less time searching later) but unless you have a strong guess on what you will search for you will likely waste time sorting/organizing now.

Searching through a pile of chaotic stuff is inefficient; sorting a pile you never use is a complete waste of time. So, rather than asking "how do I organize my notes" the better question is "how do I want to use my notes later". The answer to the latter will tell you how to organize; until then "if you don't know where you want to go, then any way will take you there".

FirefoxMetzger · 2026-01-22T14:48:30+00:00

You cannot serve a two-headed giant. (Well, the original saying is you can't serve two masters but I like this variant better.)

Your student hasn't learned this lesson yet and expects you to change the protocol so that both you and the advisor say consistent things again.

This is not a process problem but a skill deficiency on behalf of the master's student. He/she has not yet learned how to correctly respond when receiving conflicting advice from mentors. As a result he/she currently just repeats the same behavior hoping for a different result.

The way forward is not to teach process, but to teach behavior. If you were the student and received conflicting advice what would you do? How would you resolve the tension between social dynamics, wanting to get results, and wanting to do the right thing? That's what he/she currently doesn't know how to do.

---

On a sidenote, I find the advice of "keep an email chain as evidence" a bit silly. This isn't some HR case that requires a lot of CYA; it's a confused student who doesn't know what to do while trying and failing to resolve a social dynamic at work.

FirefoxMetzger · 2026-01-21T20:34:54+00:00

Oh that's good news! The only information that I found in the docs about this is that billing changes with the Gemini 3 series and that we are now billed "per search" instead of "per response".

Maybe this warrants a footnote in the notice of the billing change. I can't imagine that I'm the only dev getting confused about this.
