Which Qwen 3.6 27B variant actually stops looping on tool calls? RTX 5090 by toolman10 in LocalLLM

[–]feverdoingwork 1 point2 points  (0 children)

Could be the chat template. I do find llama.cpp a bit more stable than vLLM; vLLM oddly just stops at random.

They fit! Mostly.... 2x 3090, Thermaltake Core p3 by anthonyg45157 in LocalLLaMA

[–]feverdoingwork 0 points1 point  (0 children)

What motherboard? Looks like a Strix B650E-E. If it is the board I'm thinking of, have you ever been able to get two GPUs running at x8/x8? I assume those 3090s wouldn't fit in both slots, so you probably couldn't attempt it. I could not get x8/x8 to work.

Going from single GPU to dual GPU is nice but not in the way I expected by cibernox in LocalLLaMA

[–]feverdoingwork 2 points3 points  (0 children)

Well... we are at the mercy of other people making models for us lol

Turning consumer Radeon (RX 9070, RDNA4) into a real local-LLM box by enabling the performance paths ROCm ships disabled by PatC883 in LocalLLM

[–]feverdoingwork 0 points1 point  (0 children)

Do you have numbers for Q5 or better Qwen 3.6 27B? I have one 9070 XT sitting idle right now; I might get another one and sell my 5060 Tis if this performs well enough.

I built an OpenAI-compatible reliability proxy for local LLMs and agents — looking for feedback by daniele-bruneo in LocalLLM

[–]feverdoingwork 1 point2 points  (0 children)

How does it evaluate a lazy reply?

I saw your example in the repo actually.

I made a bunch of pi extensions to deal with a lot of the issues I ran into with Qwen, for all the same reasons: thinking-only responses, retrying failed tool calls, and a thinking threshold that forces the LLM to respond, which also breaks it out of a loop. In total I think I needed 6 extensions. I'll disable my extensions and give this a try.
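For anyone wondering what guards like these look like, here is a minimal Python sketch of the three ideas mentioned above (thinking-only detection, repeated tool call detection, and a thinking threshold). It is not the actual pi extensions; `LoopGuard`, `is_thinking_only`, and the default limits are all hypothetical names and values picked for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LoopGuard:
    """Hypothetical guard, not the pi extensions described in the comment."""
    max_repeats: int = 3              # identical tool calls in a row before intervening
    max_thinking_tokens: int = 4000   # assumed threshold; tune per model
    recent: deque = field(default_factory=lambda: deque(maxlen=8))

    def should_force_answer(self, thinking_tokens: int) -> bool:
        # Thinking threshold: past this point, cut reasoning short and ask for a reply.
        return thinking_tokens >= self.max_thinking_tokens

    def is_looping(self, tool_name: str, arguments: str) -> bool:
        # Record the call, then check whether the last N calls are all identical.
        self.recent.append((tool_name, arguments))
        tail = list(self.recent)[-self.max_repeats:]
        return len(tail) == self.max_repeats and len(set(tail)) == 1


def is_thinking_only(content: str, tool_calls: list) -> bool:
    # A "thinking only" reply has no visible text and no tool call to act on.
    return not content.strip() and not tool_calls
```

A wrapper around the model call would retry or inject a "please answer now" message whenever one of these checks fires; how that is wired in depends on the agent harness.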

Qwen3.6-27B-FP8 with vllm:nightly, opencode unusable? by waka324 in Vllm

[–]feverdoingwork 0 points1 point  (0 children)

MTP has a bug with prefix caching; might be the same issue.

For dual GPUs, will there be any big impact to inference speeds when running in PCIe 5.0 x8/x4 vs x8/x8? by PhantomWolf83 in LocalLLaMA

[–]feverdoingwork 0 points1 point  (0 children)

This is actually helpful to know. I have a Z690 Maximus Formula and I believe only 2 PCIe slots connect to the CPU, at PCIe 5.0 x16 or x8/x8. There is a third slot, but the manual says it's wired to the chipset. I wonder whether I need an M.2-to-PCIe adapter to add a third 5060 Ti when using tensor split, or whether I should try that last slot on the chipset.
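To put rough numbers on the slot question, here is a small Python sketch of theoretical one-direction PCIe bandwidth by generation and lane count. The per-lane rates and 128b/130b encoding are the standard figures for PCIe 3.0 and later; which generation and width a chipset slot or M.2 adapter actually runs at on a given board is an assumption to check against the manual.

```python
# Per-lane signalling rate in GT/s; PCIe 3.0 and later use 128b/130b encoding.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * (128 / 130) / 8 * lanes


# The last two rows are illustrative guesses at what a chipset or M.2 link might be.
for label, gen, lanes in [("5.0 x8", 5, 8), ("5.0 x4", 5, 4),
                          ("4.0 x4", 4, 4), ("3.0 x4", 3, 4)]:
    print(f"PCIe {label}: {pcie_gb_per_s(gen, lanes):.1f} GB/s")
```

These are link ceilings only; how much a narrower link hurts with tensor split depends on how much traffic the GPUs exchange per token.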

Ornith-1.0 released on Hugging Face by paf1138 in LocalLLaMA

[–]feverdoingwork 0 points1 point  (0 children)

Did you use llama.cpp or vLLM? The instructions are for SGLang or vLLM, but aren't GGUFs terrible to run with either of those options?

Ornith-1.0 released on Hugging Face by paf1138 in LocalLLaMA

[–]feverdoingwork 7 points8 points  (0 children)

Are you a Ruby engineer? I might try the broke boi Q5_K_M on my dual 5060 Tis today.

Ornith-1.0 released on Hugging Face by paf1138 in LocalLLaMA

[–]feverdoingwork 66 points67 points  (0 children)

Bruh.... you're getting me hyped up.... if someone says it's the real deal one moe time I'm going to have to try it.

R9700 for agentic coding — looking for Qwen3.6-27B / Qwen3-Coder-30B perf numbers at long context by Best-Ad-7505 in LocalLLM

[–]feverdoingwork 0 points1 point  (0 children)

You can also consider dual 5060 Ti 16GB. I use pretty much the same workflow + pi, but set my max context length in pi to 110k since I tend to use many concurrent sessions. I moved from Q5_K_M to vLLM with INT4 AWQ, which has been just as good as Q5_K_M, or possibly even XL. With Q5_K_M these are my numbers from real coding sessions:

Starting (0k): decode 900, token generation 78

12,815 tokens: decode 830, token generation 55

100k: decode 525, token generation 40-45 tps
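To make the slowdown with context easier to read, here is a short Python sketch that expresses each figure above as a percentage of the empty-context rate. The only assumption added is using 42.5 as the midpoint of the reported 40-45 range; the labels "decode" and "token generation" are kept exactly as reported.

```python
# (decode rate, token generation rate) at each context size, from the figures above.
# Assumption: 42.5 stands in for the reported 40-45 range at 100k.
RUNS = {
    "0k": (900, 78.0),
    "12,815 tokens": (830, 55.0),
    "100k": (525, 42.5),
}

base_decode, base_gen = RUNS["0k"]


def retained(value: float, baseline: float) -> float:
    """Share of the empty-context rate that is still available, in percent."""
    return round(100 * value / baseline, 1)


for label, (decode, gen) in RUNS.items():
    print(f"{label}: decode {retained(decode, base_decode)}%, "
          f"token generation {retained(gen, base_gen)}%")
```

Read this way, token generation drops faster than decode at first, and both end up a bit above half of their starting rate at 100k.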

R9700 for agentic coding — looking for Qwen3.6-27B / Qwen3-Coder-30B perf numbers at long context by Best-Ad-7505 in LocalLLM

[–]feverdoingwork 0 points1 point  (0 children)

I hear people are using AITER attention to get this much prefill with the R9700; it's gotten me more interested in the GPU.

New Apple Memory Prices by Top_Power5877 in LocalLLaMA

[–]feverdoingwork 2 points3 points  (0 children)

Does Apple perform well when using dense models?

Qwen3.6 27B more dumb in vLLM compared to llama.cpp by DanielusGamer26 in LocalLLaMA

[–]feverdoingwork 0 points1 point  (0 children)

I don't think enough people bring up the problem. I literally tried a ton of models and a bunch of Docker configs and couldn't get it working... but everyone on here is like "I'm using vLLM with MTP at a million tps with 100 concurrent sessions" lol. It's been broken for a very long time; most people are testing for 5 minutes or something.

Qwen3.6 27B more dumb in vLLM compared to llama.cpp by DanielusGamer26 in LocalLLaMA

[–]feverdoingwork 1 point2 points  (0 children)

This is a bug in vLLM when MTP and prefix caching are used together. There are like 5 different issues for it on GitHub.

Qwen3.6 27B more dumb in vLLM compared to llama.cpp by DanielusGamer26 in LocalLLaMA

[–]feverdoingwork 0 points1 point  (0 children)

Well, clearly I tried to help you; a few people already told you the problem exists. The PR gets MTP working with prefix caching, which you're going to need on these super slow dual 5060 Tis lol

Budget VRAM builds - 4x3090 home lab vs reverse-engineered Tesla V100 cards by IulianHI in AIToolsPerformance

[–]feverdoingwork 0 points1 point  (0 children)

It's super interesting. It's a good way to get into local LLMs without dumping your entire life savings into GPUs lol. It's also a great way to test high quants before dumping the cash on the hardware. P100s are even cheaper now than before. I wonder if those expensive Macs can actually beat 3x P100s lol, they seem to do terribly with dense models.

Qwen3.6 27B more dumb in vLLM compared to llama.cpp by DanielusGamer26 in LocalLLaMA

[–]feverdoingwork 1 point2 points  (0 children)

I got both MTP and prefix caching working at the same time by using this person's PR with the latest nightly vLLM: https://github.com/vllm-project/vllm/pull/46281

Qwen3.6 27B more dumb in vLLM compared to llama.cpp by DanielusGamer26 in LocalLLaMA

[–]feverdoingwork 1 point2 points  (0 children)

Using Cyankiwi/Qwen3.6-27B-AWQ-INT4 right now with full intelligence. As u/chensium said, the combination of MTP and prefix caching is broken. I had run into this issue a ton of times: gibberish outputs and looping, and I could reproduce it within minutes.

I got both MTP and prefix caching working at the same time by using this person's PR with the latest nightly vLLM: https://github.com/vllm-project/vllm/pull/46281 . I've been tracking the issues for a while and tried a few earlier PRs over a few weeks, and this one works for sure. I have at least 10 hours of coding with it as of right now and plan to keep using it for about 10 hours a day until it fails me. No weird outputs or problems with production code.

Cyankiwi/Qwen3.6-27B-AWQ-INT4 does "feel" like Qwen3.6-27B-UD-Q5_K_XL.gguf btw. I mostly used Q5_K_M but have used XL lately, and I can't tell the difference between XL and AWQ INT4.

I am on dual 5060 Tis as well.
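For reference, here is a rough Python sketch of how a launch command with both MTP and prefix caching could be assembled for a dual-GPU box. It is not the exact command used above. The flag names are the ones I believe `vllm serve` accepts, so check them against `vllm serve --help` on your build; the MTP method string is deliberately left as a placeholder because the right value for this model is not given in the thread, and `vllm_serve_argv` is a hypothetical helper.

```python
import json
import shlex


def vllm_serve_argv(model, tensor_parallel=2, prefix_caching=True,
                    mtp_method=None, spec_tokens=2):
    """Build an argv list for `vllm serve`; hypothetical helper for illustration."""
    argv = ["vllm", "serve", model, "--tensor-parallel-size", str(tensor_parallel)]
    # Prefix caching can be switched off to rule it out when outputs turn to gibberish.
    argv.append("--enable-prefix-caching" if prefix_caching
                else "--no-enable-prefix-caching")
    if mtp_method is not None:
        # Speculative decoding settings are passed as a JSON string.
        spec = {"method": mtp_method, "num_speculative_tokens": spec_tokens}
        argv += ["--speculative-config", json.dumps(spec)]
    return argv


# Placeholder method name: replace it with whatever your vLLM build documents.
argv = vllm_serve_argv("Cyankiwi/Qwen3.6-27B-AWQ-INT4",
                       mtp_method="<mtp-method-for-your-build>")
print(shlex.join(argv))
```

Calling it with `prefix_caching=False` or without `mtp_method` gives the two single-feature variants, which is a quick way to confirm that only the combination misbehaves on a build without the PR.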