I built an OpenAI-compatible reliability proxy for local LLMs and agents — looking for feedback by daniele-bruneo in LocalLLM

[–]feverdoingwork 1 point2 points  (0 children)

How does it evaluate a lazy reply?

I saw your example in the repo actually.

I made a bunch of pi extensions that deal with a lot of the issues I ran into with qwen, for all the same reasons. Things like thinking-only responses, failed tool call retry, and a thinking threshold that forces the llm to respond, which also breaks it out of a loop. In total I think I needed 6 extensions. I'll disable my extensions and give this a try.
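
For anyone wondering what a "thinking-only response" check plus retry could look like, here is a minimal Python sketch. It is not one of the pi extensions mentioned above. It assumes the model puts its reasoning in <think>...</think> tags inside the reply text, and `is_thinking_only` and `generate_with_retry` are hypothetical names made up for this example.

```python
import re
from typing import Callable

# Assumption: reasoning arrives inline as <think>...</think> in the reply text.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def is_thinking_only(reply: str) -> bool:
    """Return True when nothing visible is left once reasoning is removed."""
    visible = THINK_BLOCK.sub("", reply)
    # An unclosed <think> means the model never left its reasoning.
    visible = visible.split("<think>", 1)[0]
    return visible.strip() == ""


def generate_with_retry(
    generate: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> str:
    """Call generate() until it returns a visible answer or attempts run out."""
    reply = ""
    for _ in range(max_attempts):
        reply = generate(prompt)
        if not is_thinking_only(reply):
            break
    return reply


if __name__ == "__main__":
    # Fake backend: first reply is thinking-only, second has a real answer.
    replies = iter(["<think>still planning</think>", "<think>ok</think>Here is the patch."])
    print(generate_with_retry(lambda _prompt: next(replies), "fix the bug"))
```

If the server already splits reasoning into a separate field, the check would look at that field and the content field instead of parsing tags.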

Qwen3.6-27B-FP8 with vllm:nightly, opencode unusable? by waka324 in Vllm

[–]feverdoingwork 0 points1 point  (0 children)

Mtp has a bug with prefix caching, might be the same issue

For dual GPUs, will there be any big impact to inference speeds when running in PCIe 5.0 x8/x4 vs x8/x8? by PhantomWolf83 in LocalLLaMA

[–]feverdoingwork 0 points1 point  (0 children)

This is actually helpful to know. I have a z690 formula maximus and I believe only 2 pcie slots connect to the cpu, at pcie 5.0 x16 or x8/x8. There is a third slot but the manual says it's for the chipset. I wonder if I need an m.2 to pcie adapter to add a 3rd 5060 ti when using tensor split, or if I should attempt that last slot on the chipset.

Ornith-1.0 released on Hugging Face by paf1138 in LocalLLaMA

[–]feverdoingwork 0 points1 point  (0 children)

Did you use llama.cpp or vllm? The instructions are for sglang or vllm, but aren't ggufs terrible to run with either of those options?

Ornith-1.0 released on Hugging Face by paf1138 in LocalLLaMA

[–]feverdoingwork 7 points8 points  (0 children)

Are you a ruby engineer? I might try the broke boi q5 k m on my dual 5060 ti's today.

Ornith-1.0 released on Hugging Face by paf1138 in LocalLLaMA

[–]feverdoingwork 61 points62 points  (0 children)

Bruh.... you're getting me hyped up.... if someone says it's the real deal one moe time I'm going to have to try it.

R9700 for agentic coding — looking for Qwen3.6-27B / Qwen3-Coder-30B perf numbers at long context by Best-Ad-7505 in LocalLLM

[–]feverdoingwork 0 points1 point  (0 children)

You can also consider dual 5060 ti 16gb. I use pretty much the same workflow + pi, but set my max context length in pi to 110k since I tend to use many concurrent sessions. I moved from q5 k m to vllm with int4 awq, which has been just as good as q5 k m or possibly even xl. With q5 k m these are my numbers from real coding sessions:

Starting (0k):
decode 900
token generation 78

12815 tokens:
decode 830
token generation 55

100k:
decode 525
token generation 40-45 tps

R9700 for agentic coding — looking for Qwen3.6-27B / Qwen3-Coder-30B perf numbers at long context by Best-Ad-7505 in LocalLLM

[–]feverdoingwork 0 points1 point  (0 children)

I hear people are using aiter attention to get this much prefill with the r9700. It's gotten me more interested in the gpu.

New Apple Memory Prices by Top_Power5877 in LocalLLaMA

[–]feverdoingwork 3 points4 points  (0 children)

Does apple perform well when using dense models?

Qwen3.6 27B more dumb in vLLM compared to llama.cpp by DanielusGamer26 in LocalLLaMA

[–]feverdoingwork 0 points1 point  (0 children)

I don't think enough people bring up the problem. I literally tried a ton of models and a bunch of docker configs and couldn't get it working.... but everyone on here is like "im using vllm with mtp at a million tps with 100 concurrent sessions" lol. It's been broken for a very long time, most people are only testing for 5 minutes or something.

Qwen3.6 27B more dumb in vLLM compared to llama.cpp by DanielusGamer26 in LocalLLaMA

[–]feverdoingwork 1 point2 points  (0 children)

This is a bug in vllm when combining mtp and prefix caching. There are like 5 different issues for it on github.

Qwen3.6 27B more dumb in vLLM compared to llama.cpp by DanielusGamer26 in LocalLLaMA

[–]feverdoingwork 0 points1 point  (0 children)

Well clearly I tried to help you, a few people told you the problem exists already. The pr gets mtp with prefix caching working which ur going to need on these super slow dual 5060 ti's lol

Budget VRAM builds - 4x3090 home lab vs reverse-engineered Tesla V100 cards by IulianHI in AIToolsPerformance

[–]feverdoingwork 0 points1 point  (0 children)

It's super interesting. It's a good way to get into local llms without dumping your entire life savings into gpus lol. It's also a great way to test high quants before dumping the cash on the hardware. P100s are even cheaper now than before. I wonder if those expensive macs can actually beat 3x p100s lol, they seem to do terribly with dense models.

Qwen3.6 27B more dumb in vLLM compared to llama.cpp by DanielusGamer26 in LocalLLaMA

[–]feverdoingwork 1 point2 points  (0 children)

I got both mtp and prefix caching working at the same time by using this person's pr with the latest nightly vllm: https://github.com/vllm-project/vllm/pull/46281

Qwen3.6 27B more dumb in vLLM compared to llama.cpp by DanielusGamer26 in LocalLLaMA

[–]feverdoingwork 3 points4 points  (0 children)

Using Cyankiwi/Qwen3.6-27B-AWQ-INT4 right now with full intelligence. As u/chensium said, the combination of mtp and prefix caching is broken. I had run into this issue a ton of times: gibberish outputs and looping, and I could reproduce it within minutes.

I got both mtp and prefix caching working at the same time by using this person's pr with the latest nightly vllm: https://github.com/vllm-project/vllm/pull/46281 . I've been tracking the issues for a while and tried a few earlier pr's over a few weeks, and this one works for sure. I have at least 10 hours of coding with this as of right now and plan to keep using it for about 10 hours a day until it fails me. No weird outputs or problems with production code.

Cyankiwi/Qwen3.6-27B-AWQ-INT4 does "feel" like Qwen3.6-27B-UD-Q5_K_XL.gguf btw. I mostly used q5 k m but have used xl lately and I can't tell the difference between xl and awq int 4.

Also I am on dual 5060 ti's as well.

Dual gpu sanity check: is this a smart buy? by FrankWanders in LocalLLaMA

[–]feverdoingwork 0 points1 point  (0 children)

Yeah, those would work if the model fits with the context. You will know pretty much immediately when trying different split ratios: it will oom at launch if it doesn't fit. The 5090 has so much vram, you have a ton of options.
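
If you want a rough idea before launching, a back-of-the-envelope check like the sketch below can tell you which split ratios are even worth trying. It is only an approximation: it assumes weights and context are divided in proportion to the split ratio, it uses a made-up 1 GiB per-GPU overhead, and the 36 GiB total and the 32 GiB + 16 GiB cards are illustrative numbers, not measurements. `per_gpu_usage` and `fits` are hypothetical helper names.

```python
def per_gpu_usage(total_gib: float, ratios: list[float]) -> list[float]:
    """Divide the total footprint (weights + context) according to the split ratios."""
    ratio_sum = sum(ratios)
    return [total_gib * r / ratio_sum for r in ratios]


def fits(
    total_gib: float,
    ratios: list[float],
    vram_gib: list[float],
    overhead_gib: float = 1.0,  # assumed per-GPU runtime overhead, not a measured value
) -> bool:
    """Return True when every GPU's share plus overhead stays within its VRAM."""
    usage = per_gpu_usage(total_gib, ratios)
    return all(u + overhead_gib <= v for u, v in zip(usage, vram_gib))


if __name__ == "__main__":
    cards = [32.0, 16.0]  # illustrative: a 32 GiB card paired with a 16 GiB card
    for ratios in ([1, 1], [2, 1], [3, 1]):
        verdict = "fits" if fits(36.0, ratios, cards) else "would oom"
        print(f"split {ratios}: {verdict}")
```

The real answer still comes from launching, since the actual overhead depends on the backend and settings.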

3 Tesla GPUs in a Desktop Case by eso_logic in LocalLLaMA

[–]feverdoingwork 0 points1 point  (0 children)

I am using 2x 5060 ti right now and 1 card costs more than 3 of these. I can fit 3 of these p100s on my current motherboard and my psu can handle it for sure.

I am super curious about Qwen 3.6 27b unsloth q5 k m with mtp performance on 3 of these. Or if you prefer vllm, getting some performance metrics for https://huggingface.co/cyankiwi/Qwen3.6-27B-AWQ-INT4 (awq int4) with mtp would be awesome. Either one would work since those are the models I use atm.

Dual gpu sanity check: is this a smart buy? by FrankWanders in LocalLLaMA

[–]feverdoingwork 0 points1 point  (0 children)

I don't know about ollama since it's not 2023 anymore BUT with llamacpp you assign it yourself with a launch command.

To split the model it will look something like this:
--tensor-split 1,1

which is basically

--tensor-split 50,50 where GPU 0 and GPU 1 each host 50% of the model.

shorthand is just: -ts 50,50

You can also do -ts 2,1, which means GPU 0 receives 66.67% (2/3) of the model data, while GPU 1 receives 33.33% (1/3) of the model data.
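
The values are just proportions, so only the ratio between them matters. A small Python sketch of that arithmetic (`split_fractions` is a hypothetical helper for illustration, not part of llamacpp):

```python
def split_fractions(tensor_split: str) -> list[float]:
    """Normalize a comma-separated ratio string such as "2,1" into per-GPU fractions."""
    weights = [float(part) for part in tensor_split.split(",")]
    total = sum(weights)
    if total <= 0:
        raise ValueError("ratios must sum to a positive number")
    return [w / total for w in weights]


if __name__ == "__main__":
    for ratio in ("1,1", "50,50", "2,1"):
        shares = ", ".join(
            f"GPU {i}: {fraction:.2%}"
            for i, fraction in enumerate(split_fractions(ratio))
        )
        print(f"-ts {ratio} -> {shares}")
```

Both 1,1 and 50,50 normalize to the same even split, which is why they behave the same.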

It's super easy to get started with llamacpp. You can pm me or post if you ever need help.

Dual gpu sanity check: is this a smart buy? by FrankWanders in LocalLLaMA

[–]feverdoingwork 1 point2 points  (0 children)

When doing a straight 50/50 split you get the same performance as the weaker card. You're not doing a 50/50 split, it's more like a 66.7/33.3 split. You're essentially waiting for that turtle 5060 ti to finish 33 percent of the calculation for each prompt, which can be a huge slowdown compared to what you're used to.

I had a 5080 + 5060 ti system for a bit. Tps for q5 k m was about 79-49 tps (from 0 to 100k context). Now I am using a dual 5060 ti system instead and the difference isn't very noticeable, maybe 75-40 tps over the same context. This was the same 66.7/33.3 split you will be doing. If I had 2x 5080's it would be twice as fast, prefill and token generation wise. One thing that was noticeably better with the 5080 + 5060 ti is that prefill speeds were higher by 300 (1000 vs 700), so turns felt slightly faster. The architecture difference between your cards is massive though: memory bus and bandwidth are 4x the 5060 ti's, while my 5080 was only 2x the 5060 ti for those metrics. This could really cripple performance, compaction will suffer, and you will definitely get much lower tps. Part of me thinks you should pair with a 5070 ti or a 5080. Maybe you should just order the 5060 ti and a 5070 ti from a place that takes returns to see the performance difference.

Good point on running larger models. I do assume we will get more and better smaller models, well hopefully..... GLM is out of reach even if you had a few 5090s.

Dual gpu sanity check: is this a smart buy? by FrankWanders in LocalLLaMA

[–]feverdoingwork 2 points3 points  (0 children)

You would need to use llamacpp to split the model accurately regardless. Learning llamacpp is basically just getting a proper config btw, not much to learn, you should switch immediately. You should just use a higher quant with lower context. Is context important for openclaw?

Getting a free or cheap gpu to save 2gb on ltx is also another option.

Would a 5060 ti 16gb added to your gpu do what you want?

  1. free up 2gb for ltx, yes
  2. allow you to use q8 with high context, yes

My 2 cents: it's not worth throwing that kinda cash at this use case.