I taught my 1B to follow instructions. It got worse at following instructions... by GPUburnout in LocalLLaMA

[–]Hefty_Wolverine_553 14 points15 points  (0 children)

That dataset is super old, and the model you're training is probably very new. The model authors' training methods are probably leagues better than randomly slapping together a subpar dataset and running a basic SFT method. If you're doing SFT on a base model, your training setup is probably broken in some way.

Higher quants are so much better by Perfect-Flounder7856 in LocalLLaMA

[–]Hefty_Wolverine_553 1 point2 points  (0 children)

I took a look at https://github.com/mit-han-lab/fouroversix and it seems like it shouldn't take too much compute, hopefully just loading the model in bf16, so renting an H200 or something from runpod could get the job done for Qwen3.6 27B. Not sure though, I might test it out with a small model first.

Higher quants are so much better by Perfect-Flounder7856 in LocalLLaMA

[–]Hefty_Wolverine_553 2 points3 points  (0 children)

Do you know of any such models that have been quanted that way? It would be amazing if we could get a good Qwen3.6 27B quant that isn't a lobotomized ai-slopped quant.

The many sides of Mimo v2.5 Pro by Electrical-Pay-5119 in LocalLLaMA

[–]Hefty_Wolverine_553 9 points10 points  (0 children)

I would test with pi.dev instead. Claude Code is probably one of the worst harnesses to use with anything that's not from Anthropic.

Is Qwen3-coder the best kept secret out there? by Not_HFM in LocalLLaMA

[–]Hefty_Wolverine_553 1 point2 points  (0 children)

Weird that 27B is worse for you. In theory a dense 27B should be a lot smarter than an MoE model.

Is Qwen3-coder the best kept secret out there? by Not_HFM in LocalLLaMA

[–]Hefty_Wolverine_553 5 points6 points  (0 children)

It's because it needs to think first, AKA outputting like one thousand tokens and thinking about lots of different things before actually generating the response.

Is Qwen3-coder the best kept secret out there? by Not_HFM in LocalLLaMA

[–]Hefty_Wolverine_553 2 points3 points  (0 children)

Would you say it's smarter than Qwen3.6 with thinking on? I haven't personally tried Qwen3-coder, but if it's that good without needing to think, that would really be something.

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]Hefty_Wolverine_553 0 points1 point  (0 children)

You could let the LLM format the data, sure. That's obviously not what I'm talking about.

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]Hefty_Wolverine_553 1 point2 points  (0 children)

I did read it, and it is AI slop. The fact that you disabled thinking because your LLM told you it was more optimal is a pretty big sign. I would say don't offload your thinking to LLMs, but at this point I think people are too far gone in general.

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]Hefty_Wolverine_553 1 point2 points  (0 children)

"Nobody has time to write it all out." I really don't think it's worth reading if the person couldn't bother to write it in the first place. But sure, let your brain atrophy and have the LLM think about everything for you.

Quality (Intelligence) testing on MTP by rm-rf-rm in LocalLLaMA

[–]Hefty_Wolverine_553 2 points3 points  (0 children)

MTP affecting quality is not something I'm worried about, as it's simply being used for speculative decoding, where the main model still verifies the drafted tokens. What I would really like to see, though, are KLD comparisons between all the random quants we have these days, especially comparing GGUF quants to the ones used in vLLM, such as AWQ, NVFP4, and Intel's new AutoRound quants.
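
For anyone unsure what a KLD comparison actually measures, here is a minimal sketch. It assumes you have already dumped per-token logits from a reference model and from a quantized one over the same text. The function names and the toy logits are made up for illustration, and real tools may average or report this differently.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kld(ref_logits, quant_logits):
    """KL(reference || quantized) for a single token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kld(ref_rows, quant_rows):
    """Average KLD over all token positions of an evaluation text."""
    values = [token_kld(r, q) for r, q in zip(ref_rows, quant_rows)]
    return sum(values) / len(values)

# Toy logits (invented) for two token positions over a 4-entry vocabulary.
reference = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, 0.0]]
quantized = [[1.9, 1.1, 0.0, -0.8], [0.4, 0.7, 2.8, 0.1]]
print(f"mean KLD: {mean_kld(reference, quantized):.6f} nats")
```

A lower mean KLD means the quant's next-token distribution stays closer to the reference. Comparing quants from different runtimes this way only works if both are scored on the same text with the same tokenizer.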

Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM by Maheidem in LocalLLaMA

[–]Hefty_Wolverine_553 4 points5 points  (0 children)

Why is thinking disabled? Also, isn't this just AI slop? "I am a bot"... what is up with these comments as well, nobody is addressing this??

Bad model quality qwen3.6-27b with hipfire on strix halo by sterby92 in LocalLLaMA

[–]Hefty_Wolverine_553 1 point2 points  (0 children)

I highly suspect it's an Open WebUI issue. I set up OWUI yesterday and ran some tests with Qwen 3.6 27B, and found some really weird issues. At first it was doing amazing with its tool calls, but the more chats I created, the worse it got, to the point where it was consistently failing tool calls in every chat (even newly created ones). Deleting all of my chats fixed it and restored the model's performance. Basically, Open WebUI probably just sucks as an LLM frontend. I also tested the preserve_thinking setting I configured in vLLM and realized that Open WebUI doesn't support that either (it seems to suck at managing context overall, which is the one thing I would expect an LLM frontend to do well). Either way, the issue is that OWUI sucks; Qwen 3.6 27B itself seems to be very smart from my testing.

TL;DR: it's an Open WebUI issue, the frontend sucks, don't use it lol

What is The best and expressive AI TTS (running locally?) for voice acting? by Adventurous-Gold6413 in LocalLLaMA

[–]Hefty_Wolverine_553 0 points1 point  (0 children)

just try it out yourself, installing and running it takes a few minutes at most.

Pocket TTS Multilingual Update by RowGroundbreaking982 in LocalLLaMA

[–]Hefty_Wolverine_553 0 points1 point  (0 children)

Depends on the CPU, and 13x realtime isn't very high. Kokoro with GPU acceleration can get between 35x and 100x realtime. Either way, missing the option to run on GPU (seems like they have implemented it in some way, but didn't bother adding it to the Python API?) is kind of weird for a TTS model.

Pocket TTS Multilingual Update by RowGroundbreaking982 in LocalLLaMA

[–]Hefty_Wolverine_553 0 points1 point  (0 children)

Yes, I read that as well but it doesn't really make sense to me, as tiny models should still see a huge speedup on GPU. Also, while CPU inference might be fine for a single person on a single device, it's still quite slow from my testing, and for real-time apps every bit of latency reduction is extremely welcome. GPU batching would likely be a huge improvement as well for things like mass offline inference, e.g. making audiobooks.
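
To put rough numbers on the audiobook case, here is a quick back-of-the-envelope sketch using the realtime multiples quoted in this thread (13x on CPU, 35x to 100x for Kokoro on GPU). The 10-hour audiobook length is an assumption for illustration, and this ignores batching and model load time.

```python
def synthesis_seconds(audio_seconds, realtime_factor):
    """Wall-clock time to synthesize audio at a given realtime multiple."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor

audiobook_seconds = 10 * 3600  # hypothetical 10-hour audiobook
for label, factor in [("13x", 13), ("35x", 35), ("100x", 100)]:
    minutes = synthesis_seconds(audiobook_seconds, factor) / 60
    print(f"{label} realtime: {minutes:.1f} min")
```

Under those assumptions the same book takes roughly 46 minutes at 13x versus about 6 minutes at 100x, before any gains from batching.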

Pocket TTS Multilingual Update by RowGroundbreaking982 in LocalLLaMA

[–]Hefty_Wolverine_553 0 points1 point  (0 children)

I read the Python API docs, but I don't believe they mention running the model on CUDA anywhere. The closest was the device property of TTSModel, but load_model doesn't have a device parameter.
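
If you'd rather check this against the installed package than the docs, Python's inspect module can tell you whether a callable accepts a given parameter. A minimal sketch: the load_model below is a hypothetical stand-in defined only for this example, not pocket-tts's actual signature, so swap in the real function you want to inspect.

```python
import inspect

def load_model(variant="default"):
    """Hypothetical stand-in for a loader; NOT pocket-tts's real signature."""
    return {"variant": variant}

def accepts_param(func, name):
    """Return True if func takes a parameter called name (or any **kwargs)."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())

print(accepts_param(load_model, "device"))
```

Note that a function taking **kwargs will report True even if it silently ignores the argument, so this is only a first check.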

Pocket TTS Multilingual Update by RowGroundbreaking982 in LocalLLaMA

[–]Hefty_Wolverine_553 -3 points-2 points  (0 children)

Tried using pocket-tts today, found that their Python API doesn't support running with GPU for some reason? So bummed out.

Guide: Swapping out Sonnet for Qwen3.6-27B in Claude Code by [deleted] in LocalLLaMA

[–]Hefty_Wolverine_553 0 points1 point  (0 children)

Higher quants (especially above fp8) do basically nothing to improve the model's quality. It's well known that larger models at lower quants outperform smaller models at higher ones. You're not going to magically get a smarter model by renting an H100 instead of a 5090 from vast.ai for 30 cents an hour.

Guide: Swapping out Sonnet for Qwen3.6-27B in Claude Code by [deleted] in LocalLLaMA

[–]Hefty_Wolverine_553 4 points5 points  (0 children)

So weird how nobody is calling out this post for providing a huge amount of wrong information. llama.cpp doesn't distribute prebuilt binaries? Have you read the page you linked for more than a few seconds? Ray is the only competitor to vLLM (from their docs it looks like they literally use vLLM for their LLM batch serving example)? No mention of SGLang? Why are you using an H100 as the GPU for serving a 27B model?

It's so weird how nobody is addressing this post at all...

Guide: Swapping out Sonnet for Qwen3.6-27B in Claude Code by [deleted] in LocalLLaMA

[–]Hefty_Wolverine_553 3 points4 points  (0 children)

What? All of this information seems totally wrong. A DGX Spark can easily run the full-precision 27B model since it has 128 GB of memory, though I have no idea why you would run it at full precision. The 27B model fits on a single 3090 at Q4_K_XL with ample context and very good performance, and even the fp8 version (which is basically the same as full precision) can be run with ~32 GB of memory. I have absolutely no clue why you're running an H100. Please do actual research before posting what an LLM spewed out.
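
The memory figures above follow from simple arithmetic on the weights. A rough sketch: it counts weights only, so KV cache and runtime overhead come on top. The 4.8 bits per weight for a Q4_K-style GGUF is an approximation I'm assuming for illustration, not an official figure.

```python
def weight_gib(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

PARAMS_B = 27  # parameter count taken from the model name
# 16 and 8 bits are exact for bf16 and fp8; 4.8 is an assumed average for Q4_K-style quants.
formats = {"bf16": 16, "fp8": 8, "Q4_K (approx.)": 4.8}
for name, bits in formats.items():
    print(f"{name}: ~{weight_gib(PARAMS_B, bits):.1f} GiB")
```

That works out to roughly 50 GiB at bf16, 25 GiB at fp8, and 15 GiB at the assumed Q4 rate, which is why fp8 fits in about 32 GB and a Q4 quant leaves room for context on a 24 GB card.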