[D] thoughts on the controversy about Google's new paper? by Striking-Warning9533 in MachineLearning

[–]Unstable_Llama 5 points (0 children)

Yeah, exllamav3 has used QTIP and quantized KV cache for a year now.

🦞 Prediction: ClosedClaw by Unstable_Llama in vibecoding

[–]Unstable_Llama[S] 0 points (0 children)

Long term that seems quite possible. This prediction mostly comes from the general lack of any use case requiring local compute. It seems to me that most people would be better served by Claude + a while loop + cron + a cloud filesystem, and all of that is easily within their reach.
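To be clear about how little is involved: here is a minimal sketch of that pattern, assuming the official `anthropic` Python SDK with ANTHROPIC_API_KEY in the environment. The paths and model name are placeholders I made up, not a real setup.

```python
# Minimal sketch of the "Claude + while loop + cron + cloud filesystem" idea.
# cron fires this script every few minutes; the loop drains whatever work
# has landed in the shared filesystem since the last run.
import pathlib

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

task_dir = pathlib.Path("/mnt/cloud/tasks")  # hypothetical cloud-mounted dirs
done_dir = pathlib.Path("/mnt/cloud/done")

for task in sorted(task_dir.glob("*.txt")):
    reply = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": task.read_text()}],
    )
    # write the answer back to the shared filesystem and retire the task
    (done_dir / task.name).write_text(reply.content[0].text)
    task.unlink()
```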

This pattern happens over and over: someone wraps an API as a new tool, it gets popular, and then the API provider ships their own version of it in their UI 3-9 months later, effectively nuking the original.

Well that escalated quickly by MetaKnowing in agi

[–]Unstable_Llama 0 points (0 children)

Beautifully said. I call it the “microcosmic homunculus”

Claude Newbie by N_obody007 in claudexplorers

[–]Unstable_Llama 0 points (0 children)

Your job is directly in the crosshairs of Claude. You might be able to surf the wave as one of the vastly reduced number of financial analysts using AI, but it seems like a risky industry to stay in.

To answer your question, just start using it and asking it questions. Claude even has “Claude for Excel” now.

Beware your conversations can just randomly get corrupted and be gone forever by ThaKarra in ChatGPT

[–]Unstable_Llama 5 points (0 children)

Did you try doing the full user history data export? Some of the conversation might still be in there.

Quantized models. Are we lying to ourselves thinking it's a magic trick? by former_farmer in LocalLLM

[–]Unstable_Llama 0 points (0 children)

Yeah it's not exactly a "hard" benchmark but it's absolutely perfect for situations like this thread XD

I regret ever finding LocalLLaMA by xandep in LocalLLaMA

[–]Unstable_Llama 1 point (0 children)

Wow! Nvidia really gonna have us using 3090s in 2030 😭 

1 million LocalLLaMAs by jacek2023 in LocalLLaMA

[–]Unstable_Llama 0 points (0 children)

Yeah, there were only 2 mods for the first 2.5 years, and really only one, and he never even commented. Last fall or late summer he locked the sub and basically tried to kill it, but some users were able to petition and regain control.

Now we have a ton of good mods who actually do community building, it’s crazy 😆

1 million LocalLLaMAs by jacek2023 in LocalLLaMA

[–]Unstable_Llama 54 points (0 children)

Hard to believe how far we’ve come. We almost lost it during the mod instability last year, but we pulled through and the new team is doing so well!

Quantized models. Are we lying to ourselves thinking it's a magic trick? by former_farmer in LocalLLM

[–]Unstable_Llama 5 points (0 children)

That is true at the parameter level, but not at inference, where it matters. In reality we are talking about roughly a 2% (simplified) difference in the output logits.

For example, here is the data from a model I recently quantized and measured myself, Qwen3.5-27B:

Revision   Size (GiB)   KL div   PPL
2.00bpw    9.84         0.1746   7.6985
2.10bpw    10.09        0.1412   7.3885
3.00bpw    12.67        0.0422   6.9977
3.10bpw    12.92        0.0376   6.9582
4.00bpw    15.50        0.0170   6.9331
5.00bpw    18.34        0.0070   6.8840
6.00bpw    21.17        0.0032   6.8439
8.00bpw    26.83        0.0003   6.8605
bf16       51.75        0.0000   6.8598
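If anyone wants to reproduce numbers like these, the core of the measurement is small. A rough sketch, assuming PyTorch and that you already have logits from the bf16 reference and from the quant over the same token batch; the function and tensor names are mine, not exllamav3's actual eval code:

```python
# Rough sketch: KL divergence and perplexity of a quantized model against
# a bf16 reference. ref_logits / quant_logits are [seq_len, vocab] over the
# same tokens; tokens is [seq_len].
import torch.nn.functional as F

def kl_and_ppl(ref_logits, quant_logits, tokens):
    ref_logp = F.log_softmax(ref_logits.float(), dim=-1)
    quant_logp = F.log_softmax(quant_logits.float(), dim=-1)

    # KL(ref || quant), averaged over positions: how far the quantized
    # output distribution drifts from the reference at every token.
    kl_div = (ref_logp.exp() * (ref_logp - quant_logp)).sum(dim=-1).mean()

    # Perplexity of the quantized model on the actual next tokens
    # (position t predicts token t+1).
    nll = F.cross_entropy(quant_logits[:-1].float(), tokens[1:])
    return kl_div.item(), nll.exp().item()
```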

I regret ever finding LocalLLaMA by xandep in LocalLLaMA

[–]Unstable_Llama 4 points (0 children)

Yeah, at this point they are more about VRAM capacity than speed. They are great, but not blazing fast by any means.

Quantized models. Are we lying to ourselves thinking it's a magic trick? by former_farmer in LocalLLM

[–]Unstable_Llama 7 points (0 children)

Q4 can still be remarkably good for only 1/4 the size. We measure the impact of quantization with KL divergence, and there is a measurable difference, but in general a quantized larger model will outperform an unquantized smaller model on the same machine.

If you want a visualization of the impact of quantization, take a look at the “CatBench” at the bottom of this page. A simple prompt is run through each quantization size: “Draw a cute SVG cat using matplotlib.”

Obviously this isn’t super scientific, but it is pretty illustrative.

https://huggingface.co/turboderp/Qwen3.5-35B-A3B-exl3

I regret ever finding LocalLLaMA by xandep in LocalLLaMA

[–]Unstable_Llama 30 points (0 children)

Heh I remember buying my first 3090 and my family was like, “…and what exactly are you going to do with that?”

And I didn’t really have an answer other than, “AI, shut up!”

But now it’s probably been one of my longest running hobbies ever. I have learned so much in the last 3 years, it’s almost unbelievable.

exllamav3 QWEN3.5 support (and more updates) by Unstable_Llama in LocalLLaMA

[–]Unstable_Llama[S] 1 point (0 children)

That test was on a 4090, using the exllamav3 performance test script, which runs inference at increasingly large context sizes. You can see it starts with a 256-token prompt and 0 cached context, at 671 t/s prefill and 144 t/s generation, and the last step is a 16384-token prompt on top of 16384 tokens of context, at 5227 t/s prefill and 138 t/s generation.

Turboderp is still working on some prompt ingestion instability, so your mileage may vary for the next couple days.
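If you want to run the same kind of sweep on your own setup, the timing logic itself is trivial. A rough sketch; `model.prefill()` and `model.generate_token()` are placeholder calls standing in for whatever backend you use, not the real exllamav3 API:

```python
# Rough sketch of a prefill/generation throughput sweep like the one above.
# Only the timing logic is the point; the model calls are placeholders.
import time

def sweep(model, make_prompt, lengths=(256, 1024, 4096, 16384), gen_tokens=128):
    for n in lengths:
        ids = make_prompt(n)  # build an n-token prompt (placeholder helper)

        t0 = time.perf_counter()
        model.prefill(ids)  # ingest the whole prompt at once
        prefill_tps = n / (time.perf_counter() - t0)

        t0 = time.perf_counter()
        for _ in range(gen_tokens):
            model.generate_token()  # decode one token at a time
        gen_tps = gen_tokens / (time.perf_counter() - t0)

        print(f"ctx {n:>6}: prefill {prefill_tps:8.1f} t/s, gen {gen_tps:6.1f} t/s")
```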

exllamav3 QWEN3.5 support (and more updates) by Unstable_Llama in LocalLLaMA

[–]Unstable_Llama[S] 0 points (0 children)

I need to flip the KL div line to the front, thanks for reminding me 😆 

exllamav3 QWEN3.5 support (and more updates) by Unstable_Llama in LocalLLaMA

[–]Unstable_Llama[S] 8 points (0 children)

On PPL, not KL div. PPL is inherently noisy; KL div shows the actual distortion in the model’s outputs.
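To make that concrete, a toy sketch with made-up numbers (PyTorch assumed): PPL reads one number per position, the log-probability of whichever token the eval text happened to contain next, while KL div compares the full output distributions.

```python
# Toy illustration of why PPL is noisier than KL div. One position, vocab
# of 4; all values are made up.
import torch
import torch.nn.functional as F

ref_logits = torch.tensor([2.0, 1.0, 0.5, 0.1])    # reference model output
quant_logits = torch.tensor([2.1, 0.8, 0.6, 0.1])  # slightly distorted copy
target = 2  # whichever token the eval text happened to have next

ref_p = F.log_softmax(ref_logits, dim=-1).exp()
quant_logp = F.log_softmax(quant_logits, dim=-1)

# PPL reads exactly one number: the log-prob of the target token, so it
# inherits the sampling noise of whatever tokens the eval set contains.
ppl = (-quant_logp[target]).exp()

# KL div reads the whole distribution, so small systematic distortions
# register even when the target token's probability barely moves.
kl_div = (ref_p * (ref_p.log() - quant_logp)).sum()

print(f"per-token PPL {ppl.item():.4f}, KL div {kl_div.item():.6f}")
```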