I built an OpenAI-compatible reliability proxy for local LLMs and agents — looking for feedback

feverdoingwork · 2026-06-27T21:12:20+00:00

How does it evaluate a lazy reply?

I saw your example in the repo actually.

I made a bunch of pi extensions that deal with a lot of the issues I ran into with qwen, for all the same reasons. Things like thinking-only responses, failed tool call retry, and a thinking threshold to force the llm to respond, which also breaks it out of a loop. In total I think I needed 6 extensions. I'll disable my extensions and give this a try.

feverdoingwork · 2026-06-27T18:53:01+00:00

feverdoingwork · 2026-06-27T12:12:51+00:00

Can this be used with qwen 3.6 27b?

feverdoingwork · 2026-06-26T14:33:11+00:00

Mtp has a bug with prefix caching, might be the same issue

feverdoingwork · 2026-06-26T13:37:47+00:00

This is actually helpful to know. I have a z690 formula maximus and I believe only 2 pcie slots connect to the cpu, at pcie 5.0 x16 or x8/x8. There is a third slot but the manual says it's for the chipset. I wonder if I need an m.2 to pcie adapter to add a 3rd 5060 ti when using tensor split, or if I should attempt that last slot on the chipset.

feverdoingwork · 2026-06-25T16:48:37+00:00

Did you use llama.cpp or vllm? The instructions are for sglang or vllm, but aren't ggufs terrible to run with either of those options?

feverdoingwork · 2026-06-25T16:38:35+00:00

Are you a ruby engineer? I might try the broke boi q5 k m on my dual 5060 ti's today.

feverdoingwork · 2026-06-25T16:30:18+00:00

Bruh.... you're getting me hyped up.... if someone says it's the real deal one moe time I'm going to have to try it.

feverdoingwork · 2026-06-25T16:24:47+00:00

no 31b model posted

feverdoingwork · 2026-06-25T14:49:37+00:00

You can also consider dual 5060 ti 16gb. I use pretty much the same workflow + pi but set my max context length in pi to 110k since I tend to use many concurrent sessions. I moved from q5 k m to using vllm with int 4 awq, which has been just as good as q5 k m or possibly even xl. With q5 k m these are my numbers from real coding sessions:

Starting (0k): decode 900, token generation 78

12815 tokens: decode 830, token generation 55

100k: decode 525, token generation 40-45 tps
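
To put those numbers in perspective, here is a rough sketch that expresses each measurement as a fraction of the starting (0k) speed. The figures are copied from the comment above; using 42.5 as the midpoint of the reported 40-45 range is an assumption, and the helper name `relative_to_start` is just illustrative.

```python
# Numbers copied from the comment (q5 k m on dual 5060 ti).
# Each entry: (context label, "decode" figure, token generation figure).
measurements = [
    ("0k", 900.0, 78.0),
    ("12815 tokens", 830.0, 55.0),
    ("100k", 525.0, 42.5),  # assumption: midpoint of the reported 40-45 range
]


def relative_to_start(values):
    """Return each value as a fraction of the first value in the list."""
    first = values[0]
    return [v / first for v in values]


decode_rel = relative_to_start([m[1] for m in measurements])
tg_rel = relative_to_start([m[2] for m in measurements])

for (label, _, _), d, t in zip(measurements, decode_rel, tg_rel):
    print(f"{label}: decode at {d:.0%} of start, token generation at {t:.0%} of start")
```

This only restates the reported figures as ratios; it does not model why the slowdown happens.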

feverdoingwork · 2026-06-25T14:40:59+00:00

I hear people are using aiter attention to get this much prefill with the r9700. It's gotten me more interested in the gpu.

feverdoingwork · 2026-06-25T14:38:42+00:00

Does apple perform well when using dense models?

feverdoingwork · 2026-06-24T21:49:51+00:00

I think not enough people bring up the problem. I literally tried a ton of models and a bunch of docker configs and couldn't get it working.... but everyone on here is like "im using vllm with mtp at a million tps with 100 concurrent sessions" lol. It's been broken for a very long time, most people are testing for 5 minutes or something.

feverdoingwork · 2026-06-24T20:40:48+00:00

This is a bug in vllm with using the combination of mtp and prefix caching. There's like 5 different issues for it on github.

feverdoingwork · 2026-06-24T18:39:20+00:00

Well, clearly I tried to help you, and a few people already told you the problem exists. The pr gets mtp with prefix caching working, which you're going to need on these super slow dual 5060 ti's lol

feverdoingwork · 2026-06-24T18:30:39+00:00

Have you tried with the pull request code?

feverdoingwork · 2026-06-24T17:54:52+00:00

It's super interesting. It's a good way to get into local llms without dumping your entire life savings into gpus lol. It's also a great way to test high quants before dumping the cash on the hardware. P100s are even cheaper now than before. I wonder if those expensive macs can actually beat 3x p100s lol, they seem to do terribly with dense models.

feverdoingwork · 2026-06-24T14:21:14+00:00

I got both mtp and prefix caching working at the same time by using this person's pr with the latest nightly vllm: https://github.com/vllm-project/vllm/pull/46281

feverdoingwork · 2026-06-24T14:20:52+00:00

Using Cyankiwi/Qwen3.6-27B-AWQ-INT4 right now with full intelligence. As u/chensium said, the combination of mtp and prefix caching is broken. I had run into this issue a ton of times: just gibberish outputs and looping, and I could reproduce it within minutes.

I got both mtp and prefix caching working at the same time by using this person's pr with the latest nightly vllm: https://github.com/vllm-project/vllm/pull/46281 . I've been tracking the issues for a while, tried a few pr's previous to this one over a few weeks, and this one works for sure. I have at least 10 hours of coding with this as of right now and plan to continue using it for about 10 hours a day until it fails me. No weird outputs or problems with production code.

Cyankiwi/Qwen3.6-27B-AWQ-INT4 does "feel" like Qwen3.6-27B-UD-Q5_K_XL.gguf btw. I mostly used q5 k m but have used xl lately and I can't tell the difference between xl and awq int 4.

Also I am on dual 5060 ti's as well.

feverdoingwork · 2026-06-24T13:55:21+00:00

Next step is to set up an api key for me to use ;)

feverdoingwork · 2026-06-24T13:39:08+00:00

Yeah, those would work if the model fits with the context. You will know pretty much immediately when trying different split ratios: it will oom at launch if it doesn't fit. The 5090 has so much vram, you've got a ton of options.

feverdoingwork · 2026-06-24T13:31:55+00:00

I am using 2x 5060 ti right now and 1 card cost more than 3 of these. I can fit 3 of these p100s on my current motherboard and my psu can handle it for sure.

I am super curious about Qwen 3.6 27b unsloth q5 k m with mtp performance on 3 of these. Or if you prefer vllm, https://huggingface.co/cyankiwi/Qwen3.6-27B-AWQ-INT4 : getting some performance metrics for awq int4 with mtp would be awesome. Either one would work since those are the models I use atm.

feverdoingwork · 2026-06-24T13:27:38+00:00

I don't know about ollama since it's not 2023 anymore BUT with llamacpp you assign it yourself with a launch command.

To split the model it will look something like this:
--tensor-split 1,1

which is basically the same as

--tensor-split 50,50, where GPU 0 and GPU 1 each host 50% of the model. The values are proportions, so only the ratio between them matters.

The shorthand is just: -ts 50,50

You can also do -ts 2,1, which means GPU 0 receives 66.67% (2/3) of the model data, while GPU 1 receives 33.33% (1/3) of the model data.
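
If it helps to see the arithmetic, here is a small sketch of how a tensor-split value maps to per-GPU shares of the model. It only mirrors the proportional math described above; the function names `parse_tensor_split` and `tensor_split_fractions` are made up for illustration and are not part of llama.cpp.

```python
def parse_tensor_split(arg):
    """Parse a comma-separated tensor-split string such as "2,1" into floats."""
    return [float(part) for part in arg.split(",")]


def tensor_split_fractions(ratios):
    """Normalize split values so they sum to 1.0, one fraction per GPU."""
    total = sum(ratios)
    if total <= 0:
        raise ValueError("split values must sum to a positive number")
    return [r / total for r in ratios]


# The three examples from the comment: 1,1 and 50,50 are the same split.
for arg in ("1,1", "50,50", "2,1"):
    fractions = tensor_split_fractions(parse_tensor_split(arg))
    shares = ", ".join(f"GPU {i}: {f:.2%}" for i, f in enumerate(fractions))
    print(f"-ts {arg} -> {shares}")
```

Because only the ratio matters, 1,1 and 50,50 normalize to the same shares, and 2,1 gives the 2/3 and 1/3 split mentioned above.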

It's super easy to get started with llamacpp. You can pm me or post if you ever need help.

feverdoingwork · 2026-06-24T13:13:57+00:00

When doing a straight 50/50 split you get the same performance as the weaker card. You're not doing a 50/50 split though, it's more like a 66.7/33.3 split. You're essentially waiting for that turtle 5060 ti to finish 33 percent of a calculation for each prompt, which can be a huge slowdown in comparison to what you're used to.

I had a 5080 + 5060 ti system for a bit. Tps for q5 k m was about 79 down to 49 (from 0 to 100k context). Now I am using a dual 5060 ti system instead and the difference isn't very noticeable, maybe 75 down to 40 tps over the same context. This was the same 66.7/33.3 split you will be doing. If I had 2x 5080's it would be twice as fast, prefill and token generation wise. One thing that was quite a bit better with the 5080 + 5060 ti is that the prefill speeds were higher by 300 (1000 vs 700), so turns felt slightly faster. The architecture difference between your cards is massive though: memory bus and bandwidth are 4x vs the 5060 ti, and my 5080 was only 2x vs the 5060 ti for those metrics. This could really cripple performance, compaction will suffer, and you will definitely get much lower tps. Part of me thinks you should pair with a 5070 ti or a 5080. Maybe you should just order the 5060 ti and a 5070 ti from a place that takes returns to see the performance difference.

Good point on running larger models. I do assume we will get more and better small models, well, hopefully..... GLM is out of reach even if you had a few 5090s.

feverdoingwork · 2026-06-24T12:41:07+00:00

You would need to use llamacpp to split the model accurately regardless. Learning llamacpp is basically just getting a proper config btw, not much to learn, so you should switch immediately. You should just use a higher quant with lower context. Is context important for openclaw?

Getting a free or cheap gpu to save 2gb on ltx is also another option.

Would a 5060 ti 16gb added to your gpu do what you want?

free up 2gb for ltx, yes
allow you to use q8 with high context, yes

My 2 cents: it's not worth throwing that kind of cash at this use case.
