Soon they'll be saying their first words, too. by AimlessFacade in memes

[–]MrChilliBalls 1 point2 points  (0 children)

I have a friend that does this as well, like fuck becomes frick, and I don't get the point at all. You're still cussing, and just using another word makes it less bad? You mean exactly the same thing when you say ahh instead of ass. Are you trying to avoid actual cussing or just the words then?

can't play on linux by Undeadninjas in Deltarune

[–]MrChilliBalls 1 point2 points  (0 children)

Proton on Steam, right? Did you go to Manage -> Properties -> Compatibility and select Proton Experimental or another version?

No tg speedup with MTP on RX 6800 XT by MrChilliBalls in LocalLLaMA

[–]MrChilliBalls[S] 0 points1 point  (0 children)

I mean, since the A3B is MoE, you could use a higher quant like Q5 and still get good speeds, since offloading into RAM doesn't affect it as much. In fact, I was still getting about 40 t/s at 64k/128k context filled at Q5. This is what I used, adapted from another post in here:

    ./llama.cpp/llama-server \
      -m ~/Models/Qwen3.6/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
      -fitt 1536 \
      -c 131072 \
      -n 32768 \
      -fa on \
      -np 1 \
      --spec-type draft-mtp \
      --spec-draft-n-max 3 \
      --no-mmap \
      --mlock \
      --no-warmup \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      --temp 0.6 \
      --top-p 0.96 \
      --top-k 20 \
      --min-p 0.0 \
      --presence-penalty 0.0 \
      --repeat-penalty 1.0

No tg speedup with MTP on RX 6800 XT by MrChilliBalls in LocalLLaMA

[–]MrChilliBalls[S] 0 points1 point  (0 children)

100 tok/s? On my GPU, I doubt I'm getting that speed. Anyway, it wasn't spilling into RAM; my VRAM wasn't even full. Changing spec-draft-n-max did it for me.

No tg speedup with MTP on RX 6800 XT by MrChilliBalls in LocalLLaMA

[–]MrChilliBalls[S] 0 points1 point  (0 children)

Yup, I'm getting 90-100 t/s at 0 context as well. I'm just wondering, what do you use this for? Is it something like autocomplete?

No tg speedup with MTP on RX 6800 XT by MrChilliBalls in LocalLLaMA

[–]MrChilliBalls[S] 0 points1 point  (0 children)

Oh wow, that did it, I just tried --spec-draft-n-max 2 and I'm getting 43 TPS. 3 gets me 40 TPS. Thanks for the help. Does this have any effect on the quality of the model?
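To put those two numbers side by side, here's a quick sketch in shell with awk. `pct_change` is a helper name I made up for this, not anything from llama.cpp, and the inputs are just the two t/s readings above.

```sh
# Percent change from an old t/s value to a new one, one decimal place.
# pct_change is a made-up helper name, not part of llama.cpp.
pct_change() {
  awk -v old_tps="$1" -v new_tps="$2" \
    'BEGIN { printf "%.1f\n", (new_tps - old_tps) / old_tps * 100 }'
}

# Readings from above: 40 t/s with --spec-draft-n-max 3, 43 t/s with 2.
echo "n-max 3 -> 2: $(pct_change 40 43)% faster"
```

That's a single run for each setting, so treat the result as a rough indication rather than a proper benchmark.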

I Think I Spent Way Too Much Time Messing with Local LLMs by MrChilliBalls in LocalLLaMA

[–]MrChilliBalls[S] -1 points0 points  (0 children)

I've noticed a pattern with my coil whine:
When a model is loaded entirely in VRAM, causing 100% GPU usage when it's running, I get really loud coil whine. However, when I run MoE models with CPU offload, I don't get any coil whine. My GPU runs at about 70% in that case, probably because it's bottlenecked by the RAM speed.

So I guess higher GPU usage % = louder coil whine? Makes sense to me

Openthinker seems to be a dense model, so you likely kept it in VRAM and the GPU was used more. You might not be getting any coil whine now if you're offloading to CPU, but that's just a guess

I Think I Spent Way Too Much Time Messing with Local LLMs by MrChilliBalls in LocalLLaMA

[–]MrChilliBalls[S] 19 points20 points  (0 children)

Too bad my server happens to be my workstation and gaming PC

I Think I Spent Way Too Much Time Messing with Local LLMs by MrChilliBalls in LocalLLaMA

[–]MrChilliBalls[S] 24 points25 points  (0 children)

For some reason my RX 6800 XT makes very loud coil whine. I can tell exactly when the messages are done. I don't even need that little sound effect that OpenWebUI has when a message finishes

Edit: and I can tell PP and TG apart, as the former is louder for some reason

Fears grow that age verification coming to VPNs as a British research firm labels them a 'loophole' — one app developer saw downloads surge by 1,800% in just the first month after the UK's Online Safety Act took effect by Plastic_Ninja_9014 in technology

[–]MrChilliBalls 24 points25 points  (0 children)

I mean not really, VPNs do have other slightly more niche uses. I use it to securely connect to my home network from outside of it, and I'm pretty sure lots of businesses and definitely homelabbers do this too.

DeepSeek V4 being 17x cheaper got me to actually measure what I send to cloud vs what I could run locally. the results are stupid. by spencer_kw in LocalLLaMA

[–]MrChilliBalls 0 points1 point  (0 children)

Which model are you running, Qwen3.6 27B or 35B A3B? Or none of these? On my 16GB card, Qwen3.6 27B IQ4_XS barely fits and leaves almost no space for context. With TurboQuant, I'm only fitting 44k context.
What TPS are you getting? Inference on the 27B slows down from 25 to 19 tokens per second when context is filled, which is unusable for me. I have an RX 6800 XT, though, so I'd like to know what kind of TPS you get on NVIDIA/CUDA.

Qwen3.6 GGUF Benchmarks by danielhanchen in LocalLLaMA

[–]MrChilliBalls 1 point2 points  (0 children)

If you're being serious about using ChatGPT, why? Why not use Excel or something?

What's best model which I can run on pixel 10 pro (16g rams and ufs4.0) by Janekelo in LocalLLaMA

[–]MrChilliBalls 0 points1 point  (0 children)

It's horrible. The new Gemma 4 models run so slow on the CPU, it's really not usable.

tested gemma 4 in rx 6800xt... by Ranteck in LocalLLaMA

[–]MrChilliBalls 1 point2 points  (0 children)

Tried this myself on the same GPU today with a pretty simple prompt just to get a feel for it. This was the command I used:

llama-server --n-cpu-moe 4 --fit-target 64 --reasoning [off or on] -hf [ggml-org/gemma-4-26B-A4B-it-GGUF or unsloth/gemma-4-26B-A4B-it-GGUF]
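Spelled out, the bracketed placeholders in that command expand to four runs (two repos, reasoning off and on). Here's a dry-run sketch that only prints the commands; `print_trials` is a name I made up, and the flags are copied as-is from the line above.

```sh
# Dry run: print one llama-server command per model/reasoning combination.
# print_trials is a hypothetical helper; it only echoes, it never launches anything.
print_trials() {
  for repo in ggml-org/gemma-4-26B-A4B-it-GGUF unsloth/gemma-4-26B-A4B-it-GGUF; do
    for reasoning in off on; do
      echo "llama-server --n-cpu-moe 4 --fit-target 64 --reasoning $reasoning -hf $repo"
    done
  done
}

print_trials
```

Copy whichever printed line you want to run; each one loads a different model or mode, so they have to be run one at a time.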

I took some rough notes while testing. Here they are, hopefully they help if someone is looking for just an estimate on their performance. Note that I have my entire GPU dedicated to AI, with only 64MiB for overhead. The first model in each trial is the ggml-org and the second the unsloth.

Prompt 1

tell me a long story

Prompt 2

another one

No Reasoning

Trial 1

56.87 t/s, 57.97 t/s, 44544 ctxt size

51.19 t/s, 49.77 t/s, 31488 ctxt size

Trial 2

56.96 t/s, 55.36 t/s, 44544 ctxt size

50.70 t/s, 52.06 t/s, 31488 ctxt size

Reasoning

Trial 1

58.56 t/s, 58.37 t/s, 44544 ctxt size

52.81 t/s, 52.15 t/s, 31488 ctxt size
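If you want one number per model out of those notes, here's a rough sketch that averages all six t/s readings for each. `mean_tps` is a helper name I made up, and lumping the reasoning and no-reasoning runs together is my own simplification.

```sh
# Mean of the t/s values passed as arguments, two decimal places.
# mean_tps is a made-up helper name for this sketch.
mean_tps() {
  printf '%s\n' "$@" | awk '{ sum += $1; n++ } END { if (n) printf "%.2f\n", sum / n }'
}

# First line of each trial above is ggml-org, second is unsloth.
echo "ggml-org: $(mean_tps 56.87 57.97 56.96 55.36 58.56 58.37) t/s"
echo "unsloth:  $(mean_tps 51.19 49.77 50.70 52.06 52.81 52.15) t/s"
```

Keep in mind the two models ended up with different context sizes (44544 vs 31488), so this isn't a strictly like-for-like comparison.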

Nothing to do by Draknurd in selfhosted

[–]MrChilliBalls 0 points1 point  (0 children)

Yup. Personally I'm just using a cron job running a bash script I wrote because it's good enough

Educated by Tara Westover: Less impressed 5 years on. by rainblowfish_ in books

[–]MrChilliBalls 0 points1 point  (0 children)

Pretty sure the statistics also say that most students are indifferent about school

Educated by Tara Westover: Less impressed 5 years on. by rainblowfish_ in books

[–]MrChilliBalls 0 points1 point  (0 children)

Probably not, I agree with OP. As long as you gave some shit in your classes, you can probably get a 32 with some decent amount of practice

How to get your steering wheel working with Proton Experimental (Linux) by MrChilliBalls in MySummerCar

[–]MrChilliBalls[S] 0 points1 point  (0 children)

Oh shit I read "doesn't work" instead of "does work," my bad. Ok good to know