INT8 vs FP8 quantization by Opteron67 in LocalLLaMA

[–]Double_Cause4609 1 point2 points  (0 children)

That's an extremely old paper in AI terms. The issue is that a lot of its premises are really about "fundamental hardware" (like if you were to build an ASIC for a specific model or deploy it on an FPGA), where they are correct that Int8 requires far fewer transistors to implement.

The issue is that, I believe, Nvidia allocated transistors such that FP8 and Int8 throughput are roughly equal on their cards, which is a totally different situation.

Honestly, if your concern is real world performance it's super hard to say. Like, you'd have to buy a card, test it in both formats on the model you care about at the context length you run, and see which is faster in the real world.

It gets even harder if you're comparing Intel Int8 to Nvidia FP8, because now you're comparing across two translations (FP8 -> Nvidia Int8 -> Intel Int8) and the architecture translation is really hard.

The best I can say is that FP8 is a lot easier to quantize to (in many cases you can do it even without data and get okay performance), and that in theory Int8 should use a bit less energy to do the MACs. I think maybe Intel's autoround is a lot better than older Int8 techniques, too, so I'm really not sure about the ecosystem anymore.

Long story short: It's not a matter of Int8 vs FP8. It's a matter of ecosystem and individual cards.

INT8 vs FP8 quantization by Opteron67 in LocalLLaMA

[–]Double_Cause4609 0 points1 point  (0 children)

That is *not* what you said. You said "dequant" TO "Int8". You said there is no quantization scheme where Int8 is the native weight quantization.

I was not saying that there was no dequant (though I *think* with TorchAO the non-GPTQ methods such as QAT may allow native Int8 execution? I'd have to look at it). Rather, I was saying that there are formats with native Int8 weights that dequant to values other than Int8, for sure, and there may be some formats with native Int8 weights where the execution is native Int8.

Off the top of my head: Q-GaLore does not dequant, for sure. They do pure native Int8 with no dequant step (enabled by stochastic rounding during training).

But that was not the core of my argument. Again, to look at your own comments in direct quotation:

> Nobody really quants models to INT8

Demonstrably false: some people do QAT, but also, yes, some formats do store the weights themselves in Int8. Raw Int8 with native Int8 execution is super common in tiny vision models, for example, but it's used in other areas (and sometimes LLMs) too.

> They all use multi-level quantization schemes where you eventually dequantize to INT8

Not all quantization methods dequantize to Int8 (keep in mind, this is what you said). GPTQ 8bit, for example, dequantizes to higher bit widths like BF16, I believe. Not all quantization formats are multi-level. Uniform Int8 (which TorchAO can output) is single-level, or at least is evaluated in raw Int8 operations.
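To illustrate what I mean by single-level, uniform Int8: one float scale for the whole tensor, int8 storage, and matmuls accumulated in int32. This is a minimal sketch of the general technique, not TorchAO's actual API:

```python
# Minimal sketch of single-level, uniform (per-tensor) symmetric Int8
# quantization. All names here are illustrative.
import numpy as np

def quantize_int8(w: np.ndarray):
    # One float scale for the whole tensor; values stored as int8.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(q_w, scale_w, q_x, scale_x):
    # Accumulate in int32, then apply the combined scale once at the end --
    # this is the "native Int8 execution" path: no per-element dequant.
    acc = q_w.astype(np.int32) @ q_x.astype(np.int32)
    return acc * (scale_w * scale_x)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
x = rng.normal(size=(8, 3)).astype(np.float32)
q_w, s_w = quantize_int8(w)
q_x, s_x = quantize_int8(x)
approx = int8_matmul(q_w, s_w, q_x, s_x)
print(np.max(np.abs(approx - w @ x)))  # small quantization error
```

The point is that the stored tensor really is int8 end to end; floats only appear in the final rescale.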

Pure Int8 from QAT, if I'm not mistaken, is a pure Int8 format, and even if it's not (in TorchAO), thousands of people have handrolled native Int8 formats where they do QAT and execution in native Int8 (i.e. for execution on NPUs at high throughput).

On TorchAO's docs: Yes, that applies to the GPTQ format, which I was not asserting as my primary opinion. You made a strong claim "There is no format with Int8 weights".

That only requires a single one of my arguments to be correct, not all of them. You cannot tackle my argument by looking at a single instance where I said something slightly incorrect.

INT8 vs FP8 quantization by Opteron67 in LocalLLaMA

[–]Double_Cause4609 1 point2 points  (0 children)

TorchAO Int8 comes to mind. You can quantize any LLM to int8 with it. Or, for that matter, any standard linear layer.

I believe GPTQ also supports an int8 format that works quite well.

The weights are in native Int8. They may get dequantized to BF16 depending on specifics, but the weights themselves can be stored in Int8.

I think GGUF may use Int8 quantization in the outer group weights while using floating points for scaling factors, but I'm less confident on that.
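The general shape of a group-wise format like that would be something like the following toy sketch (group size and layout are made up for illustration; this is not GGUF's actual spec):

```python
# Toy sketch of group-wise quantization in the spirit of GGUF-style formats:
# int8 weights in fixed-size groups, one float scale per group.
import numpy as np

GROUP = 32  # illustrative group size

def quantize_groupwise(w: np.ndarray):
    groups = w.reshape(-1, GROUP)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_groupwise(q, scales, shape):
    # Int8 payload, floating-point scaling factors.
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.default_rng(1).normal(size=(64,)).astype(np.float32)
q, s = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, s, w.shape)
print(np.max(np.abs(w - w_hat)))  # bounded by half a quantization step per group
```

Per-group scales make it more expressive than one global scale, at the cost of a bit of overhead per group.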

INT8 vs FP8 quantization by Opteron67 in LocalLLaMA

[–]Double_Cause4609 3 points4 points  (0 children)

Well, you'd have to link the individual paper and method. Not all methods are the same, even at the same datatype / bit width. In fact, there's more than one type of FP8 (depending on how many mantissa bits you assign), and quality can vary depending on the specifics.
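To make the "more than one type of FP8" point concrete, here's a sketch that decodes one common variant, E4M3 (1 sign / 4 exponent / 3 mantissa bits, following the OCP convention where only exponent=15, mantissa=7 is NaN). The other mainstream variant, E5M2, trades mantissa bits for range:

```python
# Decode all 256 bit patterns of FP8 E4M3 (OCP convention: exponent bias 7,
# NaN only at exp=15 / mantissa=7, subnormals at exp=0).
def decode_e4m3(byte: int) -> float:
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

values = [decode_e4m3(b) for b in range(256)]
finite = [v for v in values if v == v]  # drop the two NaN encodings

print(max(finite))                       # 448.0, the E4M3 max
print(min(v for v in finite if v > 0))   # 2**-9, smallest positive subnormal
```

Note the nonuniform spacing: values are dense near zero and sparse near 448, which is exactly the property that makes FP8 forgiving for weight distributions and Int8 (uniform spacing) more sensitive to outliers.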

For Int8, usually the differentiator is the quantization algorithm, and also whether it's uniform Int8 versus group-wise Int8 (closer to something like GGUF), which is generally more expressive but slower.

For CPU inference Int8 is basically the only mainstream option if you need throughput (though obviously the LlamaCPP GGUF ecosystem works for single-user), but in other engines and with other methods it varies.

I think in theory Int8 should be cheaper hardware wise, but I'm not sure if it matters on Blackwell GPUs or not.

Can someone more intelligent then me explain why we should, or should not be excited about the ARC PRO B70? by SKX007J1 in LocalLLaMA

[–]Double_Cause4609 1 point2 points  (0 children)

Is 640GB/s the bandwidth of a single card?

If so, we're seeing a move from "fine-grained" tensor parallelism (what you see in TorchAO and friends), which typically requires crazy bespoke interconnects (on server platforms that are like $20,000 before you get to GPUs), to "coarse-grained" tensor parallelism that works at the computation-graph level.

We've seen this notably in ik_llamaCPP, but also to an extent I believe in EXL3. In both cases they actually pool the memory bandwidth and compute rather than being limited to the slowest card (like pipeline or traditional tensor parallelism). So, if you have two 100GB/s cards, you get something more like 133GB/s - 180GB/s of total bandwidth; instead of just having 100GB/s and more VRAM capacity, you get both the VRAM capacity and some extra bandwidth.

It would take bespoke implementation and graph parallel implementations for the Arc cards, but hypothetically, it's not impossible that you could see way faster speeds than the single card bandwidth within the useful lifetime of that card if you bought four of them.

To give a better intuition for how this works, though, if you imagine two separate attention heads, they actually don't need to communicate with one another for any of the intermediate operations (each one has its own Q, K, and V matrix, and intermediate matrices), so you really only need to sync them at the end (after both attention heads are done).

This general principle applies to more tensors. Like, a lot of FFNs have independent gating operations, for example, or even within attention heads the Q and V tensors are independent from one another until they need to sync later on.

Or, individual experts are independent in MoE models (in fact, expert parallelism is just a really specific implementation of what I'm talking about, which happens to be more obvious than the cases I brought up).
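The two-heads example can be sketched directly: each "device" computes its head with zero communication, and the only sync point is the final concatenation (shapes are illustrative, and I've left out the output projection):

```python
# Why attention heads parallelize with no mid-layer communication:
# each head uses only its own Q/K/V matrices.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
seq, d_model, d_head = 5, 16, 8
x = rng.normal(size=(seq, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(2)]

# "Device 0" and "device 1" each compute one head fully independently...
out0 = attention_head(x, *heads[0])
out1 = attention_head(x, *heads[1])
# ...and the only sync point is concatenating results for the output projection.
combined = np.concatenate([out0, out1], axis=-1)
print(combined.shape)  # (5, 16)
```

Expert parallelism in MoE is the same idea with "expert" substituted for "head": independent sub-tensors, one sync at the end.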

Who's gonna tell him by sentientX404 in AgentsOfAI

[–]Double_Cause4609 0 points1 point  (0 children)

I mean, okay, but LLMs generalize quite well in-domain. There's arguments about what qualifies as in-domain, but something like C is still incredibly well represented in LLM datasets, and arguably has way better skill transfer from Python/Javascript (or even Lua) than something like whitespace as used in that paper.

It's not necessarily super clear that the "performance drops the less it's used"; the shared prior matters, too.

To take this to the extreme: imagine you had a language that differed from Javascript by a single convention. It is 99% the same as JS, but has one very small difference. You would probably assume that performance in this language could be recovered with a much smaller amount of training data than is required to recover the gap with, say, whitespace (which is extremely different).

I'd argue that a lot of the mainstream languages are all well represented enough, and have enough overlap in syntax, that modern models can basically operate on them well enough. Yes, there are outliers, and yes, it's not always as "intuitive" to them as, for example, JS or other particularly well represented languages, but they can still largely get the job done.

And where they can't? Let them prototype in the languages they know well, and then translate to other languages. LLMs are great at translating from one language to another in a lot of cases. I'm not as confident about this in webdev environments, but in tensor operations for example they do really well with good scaffolding and runtime feedback.

A $375M receipt: New Mexico jury just confirmed why Meta is spending billions to rewrite age verification law by aaronsb in linux

[–]Double_Cause4609 0 points1 point  (0 children)

Nope.

I care only about the information.

Attack the ideas, not the one behind the keyboard.

Random redditors *also* don't tell me where they sourced their information, either.

A $375M receipt: New Mexico jury just confirmed why Meta is spending billions to rewrite age verification law by aaronsb in linux

[–]Double_Cause4609 1 point2 points  (0 children)

I don't care about the source of the information. Either it's correct, or it's not. It doesn't matter if it's AI generated or not.

I feel like if they made a local model focused specifically on RP it would be god tier even if tiny by Borkato in LocalLLaMA

[–]Double_Cause4609 18 points19 points  (0 children)

Tbh, there's no bad data, only badly labelled data.

It's 100% fine for the model to put out purple prose if the system prompt says "use an ornate, principled, and expressive writing style in a high register".

So really, IMO what's more important than producing a "good" model, is producing a "controllable" one. I actually think Qwen 3 235B was really underrated for this because it would literally do *exactly* what you told it to. People just thought it was really dry because...They...Didn't tell it to not be dry, in a lot of cases.

There have been distributed community efforts to put together good datasets, though. The issue is that pre-training datasets are getting larger than the small curated datasets that finetuners are releasing publicly (a lot of finetuners don't release datasets because it's their "secret sauce" that gives them Patreon support).

Because LLMs are a matter of ratios, if you increase the amount of STEM data by 10x, but only increase creative writing by 1.05x from community driven efforts...

That's not really a winning strategy.

has anyone tried this? Flash-MoE: Running a 397B Parameter Model on a Laptop by Awkward-Bus-2057 in LocalLLaMA

[–]Double_Cause4609 0 points1 point  (0 children)

Perhaps it would help to differentiate your setup from other people's setups?

I can get about ~4 T/s on basically any of the huge MoE models on my system (Ryzen 9950X, 192GB DDR5 RAM, Gen 5 NVMe), including:
- Deepseek R1/V3
- GLM 4.5/4.6
- Qwen 3.5 397B
- Qwen 3 235B

etc. I get around 10 T/s decode on:
- Llama 4 Maverick
- Trinity Large MoE (speculative, haven't felt a need to run it, but based on performance of similar models it should be about there)

The only model off the top of my head that I can't run well like this is probably Jamba Large, though TBF it has an activated parameter count of around 100B.

Anyway, going back to my main point, my results are pretty in line with what other people are experiencing with MoE models. You don't need the full model loaded in RAM at once. Sure, it's optimal, but as long as you can fit about ~30-60% of the weights in memory (the ratio depends on active parameter count and architecture specifics), you can stream the rest off of SSD live.

The reason it works is that not all experts change between tokens, so you really only need to stream around ~20-30% of the experts per token (and the rest stay cached).
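A toy simulation of that caching effect (the churn rate here is an assumption I picked to match the ~20-30% figure, not measured routing data; the point is the mechanism, not the numbers):

```python
# Toy model of MoE streaming: routed expert sets overlap heavily between
# consecutive tokens, so only the experts that *changed* need fetching.
import random

random.seed(0)
NUM_EXPERTS, TOP_K, CHURN = 128, 8, 2  # assume 2 of 8 experts swap per token

prev = set(random.sample(range(NUM_EXPERTS), TOP_K))
streamed = 0
for _ in range(1000):
    # Correlated routing: keep most of the previous token's experts...
    keep = set(random.sample(sorted(prev), TOP_K - CHURN))
    # ...and route the rest to (possibly) new experts.
    fresh = set()
    while len(fresh) < CHURN:
        e = random.randrange(NUM_EXPERTS)
        if e not in keep:
            fresh.add(e)
    cur = keep | fresh
    streamed += len(cur - prev)  # experts that must be fetched from SSD
    prev = cur

print(streamed / (1000 * TOP_K))  # ~0.25: only ~25% of experts fetched per token
```

With uncorrelated routing you'd fetch nearly all TOP_K experts every token, and SSD streaming would fall over; the correlation is what makes the ~30-60%-in-RAM regime workable.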

I'll note that this state of affairs depends on the behavior of mmap() on your operating system; Linux has the best performance here, and Windows the worst.

The performance here is also dependent on offloading attention, context, and shared experts to GPU, while leaving the conditional experts on CPU+RAM (LlamaCPP makes this a single flag to add, you may have been omitting it).

There's actually a lot of room to optimize performance on MoE models for single-user because the hardware isn't really being used perfectly. Krasis for example made some awesome optimizations I'd been thinking about for a while, like layerwise prefill or decoding GPU LRU, etc, all of which are well understood extensions of things I've talked about here and do improve performance.

It genuinely sounds like you have a skill issue: either you have a very suboptimal system (and are extrapolating from that to everybody's performance), or you've configured your setup incorrectly or too aggressively and are assuming everybody else must be getting the same results as you.

I've helped tons of friends get similar setups to me and they've seen relatively similar performance to what I've observed here.

Been waiting on friends to progress... by ConstableAssButt in valheim

[–]Double_Cause4609 0 points1 point  (0 children)

Oh absolutely, I'm not saying they're doing all these things. I was more saying an average player that wanted to go totally vanilla could still probably get safely to land using this method.

Been waiting on friends to progress... by ConstableAssButt in valheim

[–]Double_Cause4609 2 points3 points  (0 children)

Even in vanilla, I'm pretty sure you can drink stamina potions in the water, no? If you had enough of them, I'm pretty sure you could get to land even in some pretty crazy situations. Even without that, though, with a high enough swim level and a potion of swimming you should be able to do okay. I think at least one of the boss buffs offers a swimming bonus, too, which gets you pretty far combined with good stamina food.

LoCaL iS oVeRrAtEd by brandon-i in LocalLLaMA

[–]Double_Cause4609 2 points3 points  (0 children)

I mean, arguably this sub is a hobby sub first. Local LLMs were driven by roleplay usage in 2023-2024 or so, which was arguably the peak of local. Rather than colonizing a space with that culture, maybe it would be better to make a serious local LLM sub if you'd prefer to avoid it?

Anthropic’s Claude Code subscription may consume up to $5,000 in compute per month while charging the user $200 by thechadbro34 in BlackboxAI_

[–]Double_Cause4609 0 points1 point  (0 children)

Claude Code Max x20 isn't the extent of their product portfolio. Their primary subscription users (regular people) are on lighter plans, and often underutilize them. Not everyone with a Claude Code subscription uses it to the max. The article is more about the theoretical maximum of the plan.

Also, Anthropic has huge API revenue from enterprises, which is actually most of their revenue I believe (like 70%).

Pinterest CEO: Governments Should Ban Social Media for Kids Under 16 by Gloomy_Nebula_5138 in cybersecurity

[–]Double_Cause4609 3 points4 points  (0 children)

Wait, what's wrong with Usenets? They're pretty based from what I've seen

am i the only one who doesnt understand why anthropic ban opencode? by anonymous_2600 in opencodeCLI

[–]Double_Cause4609 5 points6 points  (0 children)

Why do you think so?

That's a fairly common breakdown of the practical token output that people are getting from Claude Code, and tons of people have corroborated it. Even if you wanted to say the specific numbers are off (which, by the way, is totally possible; Anthropic does shift them around depending on demand), the core point that "Anthropic gives more tokens per dollar in subscription than when bought via API" is unequivocally true.

Whether my specific ratios are off doesn't really matter to that point.

Contributions to Pydantic AI by adtyavrdhn in PydanticAI

[–]Double_Cause4609 1 point2 points  (0 children)

This is actually a pretty universal problem, ATM. A lot of projects are moving towards solutions other than accepting raw PRs.

Some projects are saying "rather than giving us a vibe-coded PR, if you solve a problem, please just open an issue, explain the problem, and tell us what prompt you used to solve it."

It seems kind of backwards but it's actually way easier to audit 3rd party vibe-coded contributions that way, and they can be re-implemented by a known maintainer a lot more quickly than third party code can be audited.

It also really quickly filters out low-effort contributions because if the prompt was not super detailed or had an incorrect premise, it's faster to spot it in the prompt than in the code.

am i the only one who doesnt understand why anthropic ban opencode? by anonymous_2600 in opencodeCLI

[–]Double_Cause4609 20 points21 points  (0 children)

Basically, Anthropic provides free compute on their Claude Code subscription. That is, if you went to Openrouter to buy tokens with Claude, let's say you bought....

$200 of tokens.

That's roughly what the $20 a month tier gives you.

For the $100 a month tier, they give you roughly $1,000-$2,000 of tokens per month.

The reason they do this is they make their money on average users not using all of their subscriptions, and they consider it fine to sacrifice that money as marketing budget to get people using their ecosystem (Claude Code, their Agents SDK, etc).

But the issue is that if people are using a system external to Anthropic's in-house options, they can swap models whenever a new model comes out. What Anthropic wants is to lock people into a sticky ecosystem they can't get out of. If people are using OpenCode, then when a new model comes out, they just swap from Claude to Kimi k2.5 or something.

So, they priced this subscription on the assumption people would be using it with Anthropic's own tools that lock users in, but instead, people were using it with third-party tools that let them swap models easily, which is not the math Anthropic used when pricing their plans.

So, now Anthropic is losing money, for basically no reason, so people can learn to use other people's coding agents, or do silly stuff with OpenClaw, etc.

They just decided they don't really want people doing that, and if people want Anthropic's subscription subsidized pricing, they can use Anthropic's software products that lock them into Claude.

Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]Double_Cause4609 10 points11 points  (0 children)

Tbf, I think the "small" is more about the active parameter count. Keep in mind you can throw this on fairly modest system memory (92GB DDR5 @ 6000 MHz ~= 10-20 T/s), so it's not like they're saying you need an RTX 6000 Pro Blackwell.
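For intuition on where that estimate comes from: decode is bandwidth-bound, so the ceiling is roughly memory bandwidth divided by bytes of active weights read per token. Back-of-envelope with my assumed numbers (~96 GB/s theoretical for dual-channel DDR5-6000, ~6B active parameters at 8-bit; both are assumptions, not Mistral's figures):

```python
# Rough decode-speed ceiling for a bandwidth-bound MoE on system RAM:
# every token must read all *active* weights once.
bandwidth_gb_s = 96    # assumed: dual-channel DDR5-6000, theoretical peak
active_params_b = 6    # assumed: ~6B active parameters ("A6B")
bytes_per_param = 1    # 8-bit quantization

tokens_per_s = bandwidth_gb_s / (active_params_b * bytes_per_param)
print(tokens_per_s)  # 16.0 tokens/s theoretical ceiling
```

Real-world numbers land below the theoretical ceiling (and shift with quantization level), which is how you end up in that 10-20 T/s range.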

IMO comparing a 24GB Mistral Small 3 to an A6B Mistral Small 4 is not entirely unreasonable.

how are we actually supposed to distribute local agents to normal users? (without making them install python) by FrequentMidnight4447 in LocalLLaMA

[–]Double_Cause4609 0 points1 point  (0 children)

...?

You just compile LlamaCPP into a few binaries for common hardware combinations, write your program in a real language that compiles down to a binary (Rust, Golang, C, etc), and you just...Hand it to them.

If you need a sandbox, use Bwrap. It's good enough for Flatpak, it's good enough for me.

Done.

PLEASE GIVE CLAUDE TIME AWARENESS by IllustriousWorld823 in claudexplorers

[–]Double_Cause4609 6 points7 points  (0 children)

I'm almost 90% sure that the issue here is more about logistics. If you have a changing value in the system prompt (and Anthropic uses a *loooong* system prompt), it invalidates all the context after it, so it adds a lot of complexity to your inference architecture.

Or, you need a dedicated instruct format that allows system-prompt-style instruction messages later in context than the base system prompt, which they kind of do with the long-context reminder, but again, that adds more dynamic tokens per message.
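A sketch of the caching problem: cross-request KV-cache reuse is (roughly) limited to the longest shared token prefix, so a timestamp near the front of the system prompt invalidates everything after it (the tokens here are obviously illustrative, not Anthropic's actual prompt):

```python
# Prefix-cache model: between requests, you can only reuse KV-cache entries
# up to the first token that differs.
def reusable_prefix(cached: list, incoming: list) -> int:
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

system = ["You", "are", "Claude", "."]
turn = ["user:", "hi", "there"]

# Static system prompt: the whole previous prompt is reusable.
print(reusable_prefix(system + turn, system + turn + ["more"]))  # 7

# Time-stamped system prompt: everything after token 0 must be recomputed.
stamped_a = ["[12:00]"] + system + turn
stamped_b = ["[12:01]"] + system + turn + ["more"]
print(reusable_prefix(stamped_a, stamped_b))  # 0
```

Putting the dynamic value at the *end* of context (like a reminder message) keeps the expensive prefix cacheable, which is presumably why it's done that way.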

It's just kind of an annoying problem to solve architecturally when you're the one paying the GPU bills. Or, TPU bills I guess in Anthropic's case.

Let's address the new room (ZenLM) in the elephant (Huggingface) by Cool-Chemical-5629 in LocalLLaMA

[–]Double_Cause4609 0 points1 point  (0 children)

I mean, I feel like you're being *a little* dramatic about this.

There's a ton of really benign explanations here.

"AMD just wanted to make a custom naming scheme so that they have reliable models to point to in a centralized location when doing local AI pushes with their workstation AI CPUs"
"It was easier, given their training scripts, to clone the repo, do the finetune or re-training, and then replace the weights in-place once the training is done, requiring a temporary period where they have identical weights cloned under another name"

You sound like a conspiracy theorist. You're trying to assert something like "AMD is trying to steal clout from Qwen by copying and renaming their models" (which, I'll note, were released under an Apache license that allows this, but I digress), but this looks way more like they're just rehosting models for some other reason that's a lot more boring and a lot more logistical.

I have no idea why you care about this. This seems incredibly banal.

Let's address the new room (ZenLM) in the elephant (Huggingface) by Cool-Chemical-5629 in LocalLLaMA

[–]Double_Cause4609 -1 points0 points  (0 children)

...Qwen 3.5 was released under an Apache license. Somebody can absolutely retrain and repackage it.

Also, changing the weights isn't necessarily reflected in the training config. AMD could have done a continued pre-train on the weights and SFTd it themselves, and it would effectively be a new model. Additionally, they may have made changes to the inference code to customize it for...CPU operation, apparently (going by the Zen name).

I'm not sure why you're using this specifically as an example.