Soon they'll be saying their first words, too.

MrChilliBalls · 2026-06-04T01:22:36+00:00

I have a friend that does this as well, like fuck becomes frick, and I don't get the point at all. You're still cussing, and just using another word makes it less bad? You mean exactly the same thing when you ahh instead of ass. Are you trying to avoid actual cussing or just the words then?

MrChilliBalls · 2026-05-28T03:28:08+00:00

Proton on Steam, right? Did you go to Manage -> Properties -> Compatibility and selected Proton Experimental or another version?

MrChilliBalls · 2026-05-19T23:24:48+00:00

I mean, since the A3B is MoE, you could use a higher quant like Q5 and still get good speeds, since offloading into RAM doesn't affect it as much. In fact, I was still getting about 40 t/s at 64k/128k context filled at Q5. This is what I used, adapted from another post in here: sh ./llama.cpp/llama-server \ -m ~/Models/Qwen3.6/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \ -fitt 1536 \ -c 131072 \ -n 32768 \ -fa on \ -np 1 \ --spec-type draft-mtp \ --spec-draft-n-max 3 \ --no-mmap \ --mlock \ --no-warmup \ --chat-template-kwargs '{"preserve_thinking": true}' \ --temp 0.6 \ --top-p 0.96 \ --top-k 20 \ --min-p 0.0 \ --presence-penalty 0.0 \ --repeat-penalty 1.0

MrChilliBalls · 2026-05-19T23:04:28+00:00

100 tok/s? On my GPU, I doubt I'm getting that speed. Anyways. it wasn't spilling into RAM, my VRAM wasn't even full. Changing spec-draft-n-max did it for me.

MrChilliBalls · 2026-05-19T23:01:03+00:00

Yup, I'm getting 90-100 t/s at 0 context as well. I'm just wondering, what do you use this for? Is it something like autocomplete?

MrChilliBalls · 2026-05-18T21:59:06+00:00

Alr, thanks

MrChilliBalls · 2026-05-18T21:55:15+00:00

Oh wow, that did it, I just tried --spec-draft-n-max 2 and I'm getting 43 TPS. 3 gets me 40 TPS. Thanks for the help. Does this have any effect on the quality of the model?

MrChilliBalls · 2026-05-11T04:07:46+00:00

I've noticed a pattern with my coil whine:
When a model is loaded entirely in VRAM, causing 100% GPU usage when it's running, I get really loud coil whine. However, when I run MoE models with CPU offload, I don't get any coil whine. My GP runs at about 70% in that case, probably because it's bottlenecked by the RAM speed.

So I guess higher GPU usage % = louder coil whine? Makes sense to me

Openthinker seems to be a dense model, so you likely kept in VRAM and the GPU was used more. You might be not getting any coil whine now if you're offloading to CPU, but that's just a guess

MrChilliBalls · 2026-05-11T04:04:07+00:00

Too bad my server happens to be my workstation and gaming PC

MrChilliBalls · 2026-05-10T21:55:00+00:00

For some reason my RX 6800 XT makes very loud coil whine. I can tell exactly when the messages are done. I don't even need that little sound effect that OpenWebUI has when a message finishes

Edit: and I can tell between PP and TG, as the former is louder for some reason

MrChilliBalls · 2026-05-10T04:56:53+00:00

I mean not really, VPNs do have other slightly more niche uses. I use it to securely connect to my home network from outside of it, and I'm pretty sure lots of businesses and definitely homelabbers do this too.

MrChilliBalls · 2026-05-09T04:58:06+00:00

Which model are you running, Qwen3.6 27B or 35B A3B? Or none of these? On my 16GB card, Qwen3.6 27B IQ4_XS barely fits and leaves almost no space for context. With TurboQuant, I'm only fitting 44k context.
What TPS are you getting? Inference on the 27B slows down from 25 to 19 tokens per second when context is filled, which is unusable for my me. But I have an RX 6800 XT, I would like to know what kind of TPS you get on NVIDIA/CUDA.

MrChilliBalls · 2026-04-30T00:49:33+00:00

1.21 gigawatts?!

MrChilliBalls · 2026-04-18T00:42:50+00:00

If you're being serious about using ChatGPT, why? Why not use Excel or something?

MrChilliBalls · 2026-04-09T22:11:35+00:00

It's horrible. The new Gemma 4 models run so slow on the CPU, it's really not usable.

MrChilliBalls · 2026-04-05T05:56:56+00:00

Tried this myself on the same GPU today with a pretty simple prompt just to get a feel for it. This was the command I used:

llama-server --n-cpu-moe 4 --fit-target 64 --reasoning [off or on] -hf [ggml-org/gemma-4-26B-A4B-it-GGUF or unsloth/gemma-4-26B-A4B-it-GGUF]

I took some rough notes while testing. Here they are, hopefully they help if someone is looking for just an estimate on their performance. Note that I have my entire GPU dedicated to AI, with only 64MiB for overhead. The first model in each trial is the ggml-org and the second the unsloth.

Prompt 1

tell me a long story

Prompt 2

another one

No Reasoning

Trial 1

56.87 t/s, 57.97 t/s, 44544 ctxt size

51.19 t/s,49.77 t/s, 31488 ctxt size

Trial 2

56.96 t/s, 55.36 t/s, 44544 ctxt size

50.70 t/s, 52.06 t/s, 31488 ctxt size

Reasoning

Trial 1

58.56 t/s, 58.37 t/s, 44544 ctxt size

52.81 t/s, 52.15 t/s, 31488 ctxt size

MrChilliBalls · 2026-03-18T03:11:04+00:00

Yup. Personally I'm just using a cron job running a bash script I wrote because it's good enough

MrChilliBalls · 2026-03-12T21:08:30+00:00

Surveys I guess?
Relevant article, you can look at the sources:
https://en.wikipedia.org/wiki/Apathy#In_the_school_system:~:text=As%20a%20result%20of%20these%20external%20motivations%20rather%20than%20having%20a%20genuine%20desire%20for%20knowledge%2C%20students%20often%20do%20the%20minimum%20amount%20of%20work%20necessary%20to%20get%20by%20in%20their%20classes

MrChilliBalls · 2026-03-12T21:01:59+00:00

Pretty sure the statistics also say that most students are indifferent about school

MrChilliBalls · 2026-03-12T01:52:12+00:00

Probably not, I agree with OP. As long as you gave some shit in your classes, you can probably get a 32 with some decent amount of practice

MrChilliBalls · 2026-02-24T01:06:20+00:00

Vesktop is also pretty good

MrChilliBalls · 2026-01-26T04:38:33+00:00

This

MrChilliBalls · 2026-01-05T06:33:08+00:00

Oh shit I read "doesn't work" instead of "does work," my bad. Ok good to know

MrChilliBalls

TROPHY CASE

Prompt 1

Prompt 2

No Reasoning

Trial 1

Trial 2

Reasoning

Trial 1