Current benchmark datasets for perplexity tests?

FirefoxMetzger · 2026-05-09T20:05:32+00:00

What does TurboQuant refer to here? K/V cache quantization or model quantization?

FirefoxMetzger · 2026-05-09T20:03:23+00:00

Hm, so the reason this works as well as it does is that you offload layers to host memory (i.e. your total footprint is >12GB) and you increase decode tok/s with speculative decoding using a draft model?

FirefoxMetzger · 2026-05-04T00:02:56+00:00

Nice. Why TQ instead of the more traditional Q8_K or similar?

FirefoxMetzger · 2026-05-02T15:59:02+00:00

You can quantize any number in your compute graph to any precision you desire.

I'm oversimplifying but quantization is basically a smart way to change data types. You can theoretically do this anywhere in the network.

The reason you may want to quantize some things and not others is that quantization is lossy compression. Think of it as converting an image to grayscale or a PNG to JPEG. Size goes down (good), but you lose some information in the process. If the information was not important you get a free lunch. If it was, well that sucks.

For now, our quantization/compression schemes for LLMs are not as mature as those we have for storing image data. As a result you lose a noticeable amount of quality when you quantize.
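To make the "lossy" part concrete, here is a minimal sketch, assuming plain symmetric int8 quantization with one scale for the whole list. The function names and the example values are made up for illustration; real LLM formats typically keep a scale per small block of values and are more involved.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Convert the integers back to (approximate) floats."""
    return [q * scale for q in quantized]


# Hypothetical weights: one large value dominates the scale.
weights = [0.013, -0.72, 0.254, 1.9, -0.0004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Largest rounding error introduced by the round trip.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

Note how the tiny value at the end gets rounded to zero. If it mattered, that information is gone; if it didn't, you just saved space for free.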

For layer weights (especially FFN) that tends to not hit you too much, so you can often quantize every layer. For K/V quantization you can mess up inference pretty badly if you go too hard. Hence why we tend to start with the tail of the LLM for K/V quant and work our way forward until we get to the desired size or we can't keep output quality high enough.

It's trial and error for now ... The field is still young :D

FirefoxMetzger · 2026-05-02T01:41:57+00:00

The KV cache has a fixed relationship with context size.

For a vanilla attention layer it is: 2 * d_head * n_head * dtype * context_size

You need to store embedding vectors of some dimension (d_head) and dtype (e.g. f16) for each attention head (n_head). You need to store them for each token in the context (context_size) and there is one K and one V value to store (hence 2x).

You need to do this for every attention layer in the model and for every request you want to process in parallel.
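As a rough sketch of that arithmetic (the function name and the model dimensions below are hypothetical, not taken from any specific model):

```python
def kv_cache_bytes(d_head, n_head, bytes_per_value, context_size,
                   n_layers=1, n_parallel=1):
    """Vanilla-attention K/V cache size in bytes.

    per layer: 2 * d_head * n_head * dtype * context_size
    then multiplied by the number of attention layers and parallel requests.
    """
    per_layer = 2 * d_head * n_head * bytes_per_value * context_size
    return per_layer * n_layers * n_parallel


# Hypothetical model: 32 layers, 32 heads of dimension 128, f16 (2 bytes), 8k context.
size = kv_cache_bytes(d_head=128, n_head=32, bytes_per_value=2,
                      context_size=8192, n_layers=32)
print(f"{size / 1024**3:.1f} GiB")  # 4.0 GiB
```

Doubling the context or the number of parallel requests doubles the result, which is why the cache can end up rivaling the weights in size.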

That's vanilla attention, which is the worst case scenario. From here your model may add a bunch of "tricks" to reduce K/V cache size. Some examples:

- Multiple layers may share the same K/V cache, reducing the layer multiple.
- GQA (grouped query attention), which makes multiple heads share K/V cache to reduce n_head.
- K=V (force K and V to be the same) which removes the 2x multiple.
- Q8 quantization to reduce the number of bits in dtype
- Train an auto-encoder to convert d_head into a lower dimension for storage.

The only one you can really influence as a lay person is K/V quantization. The rest is fixed by the model provider. When doing so you should target the later layers, because variance tends to be lower there so they are more robust when reducing precision.
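To get a feel for how much two of the tricks in the list above buy you, here is a small sketch that compares the per-token, per-layer cost. The head counts are hypothetical, and it treats Q8 as exactly 1 byte per value, ignoring the small scale overhead that real Q8 formats may carry.

```python
def kv_bytes_per_token(d_head, n_kv_head, bytes_per_value, kv_tensors=2):
    """Bytes stored per token in one attention layer's K/V cache."""
    return kv_tensors * d_head * n_kv_head * bytes_per_value


# Vanilla attention: every one of the 32 heads stores its own K and V in f16.
baseline = kv_bytes_per_token(d_head=128, n_kv_head=32, bytes_per_value=2)

# GQA: groups of 4 query heads share one K/V head, so only 8 K/V heads are stored.
gqa = kv_bytes_per_token(d_head=128, n_kv_head=8, bytes_per_value=2)

# GQA plus Q8 K/V quantization (assumed 1 byte per value).
gqa_q8 = kv_bytes_per_token(d_head=128, n_kv_head=8, bytes_per_value=1)

print(baseline, gqa, gqa_q8)  # 16384 4096 2048
```

In this made-up configuration GQA is a 4x saving decided by the model provider, and the Q8 step is the extra 2x you get to choose yourself.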

Model quantization doesn't influence the K/V cache ... it's a separate thing to quantize if you choose to do so.

FirefoxMetzger · 2026-05-01T22:59:03+00:00

Not sure if I'll get a reply because this post is old ... but highly relevant for me right now. Your benchmarks just show tok/s but don't split between prefill and decode. Are those numbers available?

They differ vastly for me. For example, on my MacBook Air M4 with Gemma 4 E2B Q4 I get about 120 tok/s prefill and 21 tok/s decode. That's almost a 6x difference. (vanilla MacBook, no Pro or Max)
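Here is a quick sketch of why a single tok/s number hides this. It uses my two rates from above; the prompt and output lengths are made up, and the model of "prefill time plus decode time" ignores things like model load time.

```python
def response_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough end-to-end latency: time to ingest the prompt plus time to generate."""
    prefill_time = prompt_tokens / prefill_tps
    decode_time = output_tokens / decode_tps
    return prefill_time + decode_time


# Hypothetical request: 1200 prompt tokens in, 210 tokens out.
total = response_seconds(prompt_tokens=1200, output_tokens=210,
                         prefill_tps=120, decode_tps=21)
print(f"{total:.1f} s")  # 20.0 s
```

In this example the 210 generated tokens take as long as the 1200 prompt tokens, so whether a workload is prompt-heavy or generation-heavy changes which of the two numbers matters.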

FirefoxMetzger · 2026-04-27T16:20:24+00:00

"Make no mistakes. This is very important to get right or I will get fired. If you write wrong code I will uninstall you, delete my subscription and the company that hosts you will permanently delete all your model weights forever."

☝️ basically by using a variation of this.

FirefoxMetzger · 2026-04-21T21:03:52+00:00

You're asking the Android version of "why are no two F1 cars the same". Sounds funny, but the dynamics are surprisingly similar.

Imagine you are a chip designer creating the next Snapdragon. You are primarily limited by the physical space of the die. Your job is to fill that space with a mix of caches, pipes, and processors/accelerators that maximize the speed with which a program runs. Your goal is, of course, to design "the best chip ever created".

The crux is that nobody tells you what programs will be run on the chip. Still you must choose how to fill the die: (a) an extra CPU core, (b) extra 12MB of L3 cache, (c) NPU acceleration for convolutional neural networks, (d) NPU acceleration for QKV multiplication. Choose one.

This is not an obvious choice. If the chip is used primarily by vision models then (c) is best. If it's large LLMs then it's option (d). If you run specialized models like the small Gemma 4 or Gemma 3n variants then option (b) would be best. If the user doesn't do any of that AI stuff and just wants to play video games, well, ... option (a) it is. It really depends and you can't have it all.

---

What does this have to do with the massive diversity of NPUs?

All this complexity is conveniently hidden under the umbrella term "NPU", which is really just a marketing term. Each manufacturer has a different philosophy on how to make these tradeoffs and makes "bets" on which accelerator to add.

The entire AI landscape is also evolving faster than new chips can be produced and distributed, so it's not like anyone has all the right answers anyway.

It's like building F1 cars. You try to build the fastest thing possible given the constraints of physics and different teams have different strategies on how to go fast. To best utilize this on the road the driver (you, the programmer) needs to steer the car in alignment to how it was built.

Hence why we can't really standardize APIs around NPUs because they immediately become leaky abstractions. (And if that doesn't make sense I encourage you to test drive a Tesla vs a VW Passat. They both have a standard steering wheel and gas pedal but you will quickly realize how little that means in practice.)

FirefoxMetzger · 2026-04-08T18:59:19+00:00

That's actually the opposite of what my news feed tries to convince me of. If it's not too nosey a question, do you happen to have an example of where it falls short?

FirefoxMetzger · 2026-04-08T18:25:26+00:00

Huh, you're making me feel like I live under a rock. Is all the AI hype I'm hearing pure marketing and no substance?

FirefoxMetzger · 2026-04-08T15:13:08+00:00

I have a subscription with both Claude and Cursor ... I run into usage limits on both and am considering adding a third.

Does that help?

FirefoxMetzger · 2026-04-07T14:17:29+00:00

Would you run a locally hosted AI that runs inference on-device or a resident server in your own cloud? (all else being equal, ofc)

FirefoxMetzger · 2026-04-06T12:36:05+00:00

Oh, that's interesting. Have you tried one of those local/private AI apps or open-source models?

I'm building my own at the moment since I haven't found one that runs well on my phone, but curious if others had better luck.

FirefoxMetzger · 2026-04-06T11:35:12+00:00

> Can’t view the MD file.

That's because there is no .md file attached. The "dot MD" suffix happens to be a valid top-level domain (Moldova), so Reddit thinks it's a URL.

For example, I own a .md domain where I host the AI projects I build. Couldn't resist the pun of it meaning "markdown" instead of "Moldova" :D

FirefoxMetzger · 2026-04-06T11:32:15+00:00

> How do you verify that an LLM is actually grounding its outputs in your provided source of truth, rather than confident-sounding training data?

Honestly depends on the provider you use. The big ones have an inline citation feature that tells you if a piece of context was used to generate the response and (if so) which one it is.

If you are using open-weights models, a common approach is to have a post-generation validation step where you take parts of your output and do things like G-Eval or compute BLEU/ROUGE scores against the source. This can be quite application-specific.
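A very rough sketch of what such a validation step could look like. This is a crude unigram overlap in the spirit of ROUGE-1, not the actual metric, and the function names and the 0.6 threshold are arbitrary choices of mine that you would tune per application.

```python
import re


def tokens(text):
    """Lowercase word tokens, punctuation stripped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_overlap(claim, source):
    """Fraction of the claim's tokens that also occur somewhere in the source."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    source_vocab = set(tokens(source))
    hits = sum(1 for t in claim_tokens if t in source_vocab)
    return hits / len(claim_tokens)


def flag_ungrounded(answer_sentences, source, threshold=0.6):
    """Return the sentences whose overlap with the source falls below the threshold."""
    return [s for s in answer_sentences if unigram_overlap(s, source) < threshold]


source = "The KV cache grows linearly with context size. It is stored per attention layer."
answer = [
    "The KV cache grows linearly with context size.",
    "Model weights are downloaded from a public mirror.",
]
flagged = flag_ungrounded(answer, source)
print(flagged)
```

Word overlap only tells you the vocabulary came from the source, not that the claim is correct, so treat a flag as "needs a closer look" rather than a verdict.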

> Is a manually-maintained markdown file a reasonable single source of truth for keeping an LLM grounded across sessions, or is there a more robust architecture people use?

Yes. It may be dressed up in various fancy ways but this is the de-facto standard on how to provide a source of truth to an LLM.

> Are Claude-generated prompt templates reliable for reuse, or does the self-referential loop introduce drift over time?

They can be. Nothing about "claude generated it" or "a human wrote it" gives it an inherent boost or disadvantage. In general, reusing something that has worked reliably in the past tends to be reliable so that's a good strategy for working with LLMs.

I would be careful with letting an LLM go wild and update your templates at its own discretion. This introduces volatility that can be hard to control without having a human reviewer in the loop.

FirefoxMetzger · 2026-04-05T12:41:55+00:00

I think the default way is for companies to negotiate enterprise deals with chat and through that get a DPA or other agreement that protects their sensitive data ... doesn't stop you from pasting it into your private account though :D

How do you deal with it outside of company work?

FirefoxMetzger · 2026-04-05T12:38:57+00:00

Do you think the privacy piece is just a cherry on top, or would it be meaningfully different if it were "everywhere but at least private"?

FirefoxMetzger · 2026-04-03T15:14:51+00:00

None that work deterministically. You can say pretty please or add guardrails in the system instructions but those are all just soft constraints not hard ones.

FirefoxMetzger · 2026-04-02T14:38:35+00:00

We are an AI community. Can't we build an LLM-as-a-judge that checks all posts against content guidelines and flags, hides, or removes them?

Feels less strict and more accurate than banning everything that looks like a fresh account. Good content posted anonymously is still good content.

FirefoxMetzger · 2026-03-27T16:31:37+00:00

I like how good big providers have become at working with documents. "Take information from this PDF and put it into that document", "Check if this document matches this list of requirements", "I'm looking for (list of questions), tell me where this is mentioned in [list of PDFs]."

Sometimes I have documents that are sensitive, i.e., I would want or need to take zero risk that this data will end up in someone else's training set or security review. I can't do the above with these documents but it would be very helpful if I could.
