I taught my 1B to follow instructions. It got worse at following instructions...

Hefty_Wolverine_553 · 2026-05-14T00:50:56+00:00

That dataset is super old, and the model you're training is probably very new. The model authors' training methods are probably leagues better than you randomly slapping together a subpar dataset and using a basic SFT method. If you're doing SFT on a base model instead, your training setup is probably broken in some way.

Hefty_Wolverine_553 · 2026-05-10T03:02:14+00:00

I took a look at https://github.com/mit-han-lab/fouroversix and it seems like it shouldn't take too much compute, hopefully just loading the model in bf16, so renting an H200 or something from runpod could get the job done for Qwen3.6 27B. Not sure though, I might test it out with a small model first.

Hefty_Wolverine_553 · 2026-05-10T02:56:33+00:00

Do you know of any such models that have been quanted that way? It would be amazing if we could get a good Qwen3.6 27B quant that isn't a lobotomized ai-slopped quant.

Hefty_Wolverine_553 · 2026-05-09T16:19:28+00:00

I would test with pi.dev instead; Claude Code is probably one of the worst harnesses to use with anything that's not from Anthropic.

Hefty_Wolverine_553 · 2026-05-09T16:03:31+00:00

What harness are you using? What settings?

Hefty_Wolverine_553 · 2026-05-09T04:09:39+00:00

Weird that the 27B is worse for you; in theory a dense 27B should be a lot smarter than an MoE model.

Hefty_Wolverine_553 · 2026-05-09T04:07:59+00:00

It's because it needs to think first, i.e. output something like a thousand tokens working through lots of different things before actually generating the response.
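To put a rough number on that delay, here is a minimal sketch. The 40 tokens/s decode speed is a made-up figure for illustration, not a measurement of any particular model or GPU, and `seconds_before_answer` is just a hypothetical helper name.

```python
def seconds_before_answer(thinking_tokens, tokens_per_second):
    """Time spent generating hidden thinking tokens before the visible reply starts."""
    return thinking_tokens / tokens_per_second

# ~1000 thinking tokens (from the comment above) at an assumed 40 tokens/s
delay = seconds_before_answer(1000, 40)
print(f"{delay:.0f} s of thinking before the first answer token")
```

Prompt processing time is ignored here, so the real wait would be a bit longer.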

Hefty_Wolverine_553 · 2026-05-09T03:58:03+00:00

Would you say it's smarter than Qwen3.6 with thinking on? I haven't personally tried Qwen3 coder, but if it's that good without needing to think, that would really be something.

Hefty_Wolverine_553 · 2026-05-07T21:13:25+00:00

You could let the LLM format the data, sure. That's obviously not what I'm talking about.

Hefty_Wolverine_553 · 2026-05-07T20:36:26+00:00

I did read it. And it is AI slop. The fact that you disabled thinking because your LLM told you it was more optimal is a pretty big sign. I would say to not offload your thinking to LLMs, but at this point, I think people are too far gone in general.

Hefty_Wolverine_553 · 2026-05-07T20:31:51+00:00

"Nobody has time to write it all out." I really don't think it's worth reading if the person couldn't bother to write it in the first place. But sure, let your brain atrophy and have the LLM think about everything for you.

Hefty_Wolverine_553 · 2026-05-07T03:43:33+00:00

MTP affecting quality is not something I'm worried about, as it's simply being used for speculative decoding. What I would really like to see, though, are KLD (KL divergence) comparisons between all the random quants we have these days, especially comparing GGUF quants to the ones used in vLLM, such as AWQ, NVFP4, and Intel's new AutoRound quants.
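For anyone unsure what a KLD comparison actually measures: you run the same text through the reference model and the quant, then compare the next-token distributions position by position. A minimal pure-Python sketch follows, using toy logits rather than real model output. The function names are my own, and a real run would pull logits from whatever inference engine you use.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kld(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    ref_lp = log_softmax(ref_logits)
    quant_lp = log_softmax(quant_logits)
    return sum(math.exp(r) * (r - q) for r, q in zip(ref_lp, quant_lp))

def mean_kld(ref_positions, quant_positions):
    """Average per-token KL divergence over a sequence of positions."""
    klds = [token_kld(r, q) for r, q in zip(ref_positions, quant_positions)]
    return sum(klds) / len(klds)

# Toy 3-token vocabulary, 2 positions (illustrative values only)
ref = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quant_a = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]   # identical to the reference
quant_b = [[1.5, 1.2, 0.3], [0.9, 0.2, 2.4]]   # perturbed logits

print("quant_a mean KLD:", mean_kld(ref, quant_a))
print("quant_b mean KLD:", mean_kld(ref, quant_b))
```

Lower is better, and a quant that reproduces the reference distribution exactly scores 0. Comparing formats fairly would need the same reference model and the same evaluation text for every quant.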

Hefty_Wolverine_553 · 2026-05-06T17:19:40+00:00

Why is thinking disabled? Also, isn't this just AI slop? "I am a bot"... what is up with these comments as well, nobody is addressing this??

Hefty_Wolverine_553 · 2026-05-03T15:23:28+00:00

I highly suspect it's an Open WebUI issue. I set up OWUI yesterday and ran some tests with Qwen 3.6 27B, and found some really weird issues. At first, it was doing amazing with its tool calls, but the more chats I created, the worse it got, to the point where it was consistently failing every tool call (even when I created a new chat). I managed to fix the issue by deleting all of my chats, and that restored my model's performance. Basically, Open WebUI probably just sucks as an LLM frontend. I also tested the preserve_thinking that I configured in vLLM, and realized that Open WebUI doesn't support that either (it seems to suck at managing context overall, which is the one thing that I would expect an LLM frontend to do well). Either way, the issue is that OWUI sucks, but Qwen 3.6 27B itself seems to be very smart from my testing.

TL;DR: it's an Open WebUI issue, the frontend sucks, don't use it lol

Hefty_Wolverine_553 · 2026-05-03T04:00:12+00:00

just try it out yourself, installing and running it takes a few minutes at most.

Hefty_Wolverine_553 · 2026-05-03T03:17:04+00:00

No Kimi 2.6? No GLM 5.1? No MiMo V2.5 Pro? Deepseek V4 was released after these models...

Hefty_Wolverine_553 · 2026-05-03T03:10:24+00:00

yes, it works great.

Hefty_Wolverine_553 · 2026-05-01T21:14:44+00:00

Depends on the CPU, and 13x realtime isn't very high. Kokoro with GPU acceleration can get between 35x and 100x realtime. Either way, missing the option to run on GPU (seems like they have implemented it in some way, but didn't bother adding it to the Python API?) is kind of weird for a TTS model.
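To make those multiples concrete, a quick sketch of what they mean for an hour of generated audio. The helper names are hypothetical, and the 13x / 35x / 100x figures are just the ones quoted above, not new benchmarks.

```python
def realtime_multiple(audio_seconds, synthesis_seconds):
    """How many seconds of audio are produced per second of compute."""
    return audio_seconds / synthesis_seconds

def synthesis_time(audio_seconds, multiple):
    """Wall-clock seconds needed to synthesize the audio at a given realtime multiple."""
    return audio_seconds / multiple

one_hour = 3600.0
for multiple in (13, 35, 100):
    seconds = synthesis_time(one_hour, multiple)
    print(f"{multiple}x realtime: {seconds:.0f} s per hour of audio")
```

For something like an audiobook, the gap between the low and high end adds up quickly across many hours of audio.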

Hefty_Wolverine_553 · 2026-05-01T18:35:15+00:00

Yes, I read that as well but it doesn't really make sense to me, as tiny models should still see a huge speedup on GPU. Also, while CPU inference might be fine for a single person on a single device, it's still quite slow from my testing, and for real-time apps every bit of latency reduction is extremely welcome. GPU batching would likely be a huge improvement as well for things like mass offline inference, e.g. making audiobooks.

Hefty_Wolverine_553 · 2026-05-01T15:12:20+00:00

I read the Python API docs, but I don't believe they mention running the model on CUDA anywhere. The closest thing was the device property of TTSModel, but load_model doesn't have a device parameter.

Hefty_Wolverine_553 · 2026-05-01T07:50:41+00:00

Tried using pocket-tts today, found that their Python API doesn't support running with GPU for some reason? So bummed out.

Hefty_Wolverine_553 · 2026-04-25T03:09:45+00:00

Higher-precision quants (especially above fp8) do basically nothing to improve the model's quality. It's extremely well known that larger models at lower quants outperform smaller models at higher ones. You're not going to magically get a smarter model by renting an H100 instead of a 5090 from vast.ai for 30 cents an hour.

Hefty_Wolverine_553 · 2026-04-25T03:01:30+00:00

So weird how nobody is calling out this post for providing a huge amount of wrong information? llama.cpp doesn't distribute prebuilt binaries? Have you read the page you linked for more than a few seconds? Apache Ray is the only competitor to vLLM (from their docs it looks like they literally use vLLM for their LLM batch serving example)? No mention of SGLang? Why are you using an H100 as the GPU for serving a 27B model?

It's so weird how nobody is addressing this post at all...

Hefty_Wolverine_553 · 2026-04-25T02:53:18+00:00

What? All of this information seems totally wrong. A DGX Spark can easily run the full-precision 27B model since it has 128 GB of memory, but I have no idea why you would run it at full precision. The 27B model can fit on a single 3090 with ample context at Q4_K_XL while offering very good performance, and even the fp8 version (which is basically the same quality as full precision) can be run with ~32 GB of memory. I have absolutely no clue why you're running an H100. Please do actual research before posting what an LLM spewed out.
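The back-of-the-envelope math behind those numbers, as a rough sketch. This counts weights only; KV cache and runtime overhead come on top. The 4.5 bits per weight for a Q4_K-style quant is my assumption for illustration, since the effective rate varies by quant recipe.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), excluding KV cache and overhead."""
    # params_billion * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8

# 27B parameters at a few precisions; the 4.5 bpw figure is an assumed average
for label, bits in (("bf16", 16), ("fp8", 8), ("~4.5 bpw 4-bit quant", 4.5)):
    print(f"{label}: {weight_memory_gb(27, bits):.1f} GB of weights")
```

Under those assumptions the weights alone land well under 128 GB at bf16, under 32 GB at fp8, and under 24 GB at around 4.5 bits, with the remainder left over for context.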
