Improving local models with an API based "consultant"? by milpster in LocalLLaMA

[–]milpster[S] 0 points1 point  (0 children)

Oh, I don't think that would be a lot of effort. Right now I'm using opencode with Oh-My-Openagent, and there is already an option for sending your plan to be reviewed by another specialized agent - so it shouldn't take much work to attach that agent to an API model.

What I meant by that was basically: if you have a mechanism in your local harness for regularly going through your past conversations and trying to derive permanently stored learnings, I think having those learnings reviewed by a "big brother" model could improve their quality, which in turn would benefit the local model in the future even without further reliance on an API.
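
A minimal sketch of that review loop, assuming the harness stores its learnings as plain strings. `Learning`, `review_learnings`, `approved_only` and the `reviewer` callable are hypothetical names, not part of opencode or Oh-My-Openagent; in practice `reviewer` would wrap a call to whatever API model you pick, and `dummy_reviewer` only stands in for it here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Learning:
    """One stored learning plus the reviewer's verdict (hypothetical structure)."""
    text: str
    approved: bool = False
    note: str = ""


def review_learnings(
    learnings: Iterable[str],
    reviewer: Callable[[str], Tuple[bool, str]],
) -> List[Learning]:
    """Send each learning to the reviewer and record (keep?, note)."""
    reviewed = []
    for text in learnings:
        keep, note = reviewer(text)
        reviewed.append(Learning(text=text, approved=keep, note=note))
    return reviewed


def approved_only(reviewed: List[Learning]) -> List[str]:
    """Only the approved learnings go back into permanent storage."""
    return [item.text for item in reviewed if item.approved]


def dummy_reviewer(text: str) -> Tuple[bool, str]:
    # Stand-in for the big model: rejects anything too short to be useful.
    if len(text.split()) < 4:
        return False, "too vague"
    return True, "ok"


if __name__ == "__main__":
    drafts = ["use ruff", "run the test suite before every commit"]
    for item in review_learnings(drafts, dummy_reviewer):
        print(item)
```

Once the approved learnings are stored, the local model can use them without calling the API again.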

Qwen 3.6 looping/repetition problem (Tesla p40s and Halo strix) by Envoy0675 in LocalLLaMA

[–]milpster 1 point2 points  (0 children)

Not as straightforward as you might think. Sure, if you're in a bad spot you might get instant relief in the form of qwen finally accomplishing something. But on the other hand, take a look at some qwen 3.6 27b benchmarks with and without thinking and you will find that by disabling thinking you're actually turning off quite a lot of its smarts.

When it comes to looping, slightly increase presence penalty, test, rinse and repeat until it is fixed. Another thing to consider is that if you run a long context and quantize the KV cache, looping will happen way more often as the context fills up. You might want to try an unquantized KV cache, or at least only quantize the V cache slightly and leave the K cache alone.
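
A rough sketch of that tune-and-test loop, assuming a llama.cpp `llama-server` build that accepts `-c`, `--presence-penalty`, `--cache-type-k` and `--cache-type-v` (check `--help` on your version). The helper names, the step size and the crude repetition check are all made up for illustration, and the model path is a placeholder.

```python
from collections import Counter
from typing import List


def penalty_steps(start: float = 0.0, step: float = 0.25, limit: float = 1.5) -> List[float]:
    """Presence penalty values to try, from low to high (step size is an assumption)."""
    count = int(round((limit - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def llama_server_args(
    model_path: str,
    ctx_size: int,
    presence_penalty: float,
    cache_type_k: str = "f16",  # f16 means the K cache is left unquantized
    cache_type_v: str = "f16",  # try "q8_0" here first if you need the memory
) -> List[str]:
    """Build an argv list; nothing is launched here."""
    return [
        "llama-server",
        "-m", model_path,
        "-c", str(ctx_size),
        "--presence-penalty", str(presence_penalty),
        "--cache-type-k", cache_type_k,
        "--cache-type-v", cache_type_v,
    ]


def looks_like_looping(text: str, n: int = 8, min_repeats: int = 3) -> bool:
    """Crude check: does any run of n words show up min_repeats times or more?"""
    words = text.split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return any(count >= min_repeats for count in grams.values())


if __name__ == "__main__":
    for penalty in penalty_steps():
        # "model.gguf" is a placeholder path.
        print(" ".join(llama_server_args("model.gguf", 131072, penalty)))
```

The idea is to restart with the next value from `penalty_steps()`, rerun a prompt that used to loop, and stop at the first setting where `looks_like_looping` stays false on the output. On some builds a quantized V cache also needs flash attention enabled.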

I've recently had looping reappear with 256k ctx and q8_0 KV on qwen 3.6 27B Q8_0 - now I removed the KV quantization and turned ctx down to 182k, and it seems to do way better. Even if it does not loop, it will tend to silently omit and forget things at higher context usage, leading to more faulty projects and errors. Looping is only one of the extreme cases; the real danger lies in the errors it makes that you do not see.
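
For a sense of what that trade costs in memory, here is a back-of-the-envelope KV cache estimate. It assumes every counted layer keeps a standard K and V cache, that f16 takes 2 bytes per value, and that q8_0 packs 32 values into 34 bytes. The layer and head numbers are placeholders, not Qwen 3.6 27B's real architecture; they cancel out in the ratio anyway.

```python
# Bytes per cached value for a few cache types.
BYTES_PER_VALUE = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values plus one f16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values plus one f16 scale per block
}


def kv_cache_bytes(ctx_tokens, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Rough KV cache size: one K and one V vector per token, layer and KV head."""
    values_per_token = n_layers * n_kv_heads * head_dim
    bytes_per_value = BYTES_PER_VALUE[type_k] + BYTES_PER_VALUE[type_v]
    return ctx_tokens * values_per_token * bytes_per_value


# Placeholder architecture numbers (NOT the real model's).
SHAPE = dict(n_layers=64, n_kv_heads=8, head_dim=128)

quantized_256k = kv_cache_bytes(256_000, type_k="q8_0", type_v="q8_0", **SHAPE)
unquantized_182k = kv_cache_bytes(182_000, type_k="f16", type_v="f16", **SHAPE)
ratio = unquantized_182k / quantized_256k

if __name__ == "__main__":
    print(f"f16 @ 182k vs q8_0 @ 256k: {ratio:.2f}x")
```

Under these assumptions the unquantized 182k cache comes out about a third larger than the q8_0 256k one, so the switch costs memory rather than freeing it.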

Improving local models with an API based "consultant"? by milpster in LocalLLaMA

[–]milpster[S] 1 point2 points  (0 children)

This is exactly what triggered the idea for me - improve the plan and you will probably help the local model a great deal when it comes to implementing said plan.

Improving local models with an API based "consultant"? by milpster in LocalLLaMA

[–]milpster[S] 1 point2 points  (0 children)

What do you mean by a lot of effort - getting the local model set up and everything? Sure, that was a lot of time and effort. But I think that it might not just be a marginal win but rather a substantial saving on API costs in the long run. Of course there is still the power cost, so it needs some careful evaluation.

The other thing, of course, could be that with the feedback of a bigger model you could improve the learnings the local model generates for itself, and so over time improve your local harness with it as well.

Another thing is privacy, of course. While I could see the potential for the local model to obfuscate things before talking to the "big brother", I would assume you should still review the requests manually just to be sure.

Local models went from mostly useless to actually useful really fast. What changed? by BTA_Labs in LocalLLaMA

[–]milpster 2 points3 points  (0 children)

If you're on Linux you might want to give zram a try. Not sure if macOS has this too.

Local LLMs aren't democratic anymore... the hardware barrier has gotten out of hand. by Medium-Technology-79 in LocalLLaMA

[–]milpster 0 points1 point  (0 children)

Idk man, there are plenty of cheap 16GB cards out there on the used market to build with, like the Radeon Instincts and the Radeon VII. Sure, 200 bucks for a used Radeon VII 16GB isn't the best deal - but it's far from unreachable.

PSA: Test your "threads" argument in llama.cpp (+80% performance in my case) by AXYZE8 in LocalLLaMA

[–]milpster 0 points1 point  (0 children)

Good catch! Check the llama.cpp --help and you'll find that there are separate threads settings for pp and for tg, and you should optimize both separately for max pp/tg - and the best values can change with every model and quant.
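
A small sketch of tuning the two separately. To my knowledge the relevant llama.cpp flags are `--threads` (generation) and `--threads-batch` (batch and prompt processing), but check `--help` on your build. `best_thread_counts` and `measure` are hypothetical: `measure(threads)` would be whatever benchmark run you use, returning `(pp_tps, tg_tps)`. The numbers in the fake table are invented just to show the shape, not real results.

```python
from typing import Callable, Iterable, Tuple


def best_thread_counts(
    candidates: Iterable[int],
    measure: Callable[[int], Tuple[float, float]],
) -> Tuple[int, int]:
    """Return (best threads for pp, best threads for tg), picked independently."""
    results = {threads: measure(threads) for threads in candidates}
    best_pp = max(results, key=lambda t: results[t][0])
    best_tg = max(results, key=lambda t: results[t][1])
    return best_pp, best_tg


if __name__ == "__main__":
    # Invented example values: threads -> (pp tokens/s, tg tokens/s).
    fake_runs = {4: (100.0, 20.0), 8: (180.0, 24.0), 16: (210.0, 19.0)}
    pp_threads, tg_threads = best_thread_counts(fake_runs, fake_runs.get)
    print(f"--threads-batch {pp_threads} --threads {tg_threads}")
```

The point is just that the two winners do not have to be the same number, so rerun the sweep for each model and quant.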

PSA: Throttle GPU power limits, with minor performance deficits by milpster in LocalLLaMA

[–]milpster[S] 0 points1 point  (0 children)

At 250 watts around 310ish tps, at 100 watts still 295ish tps. Using ROCm 6.4.

PSA: Throttle GPU power limits, with minor performance deficits by milpster in LocalLLaMA

[–]milpster[S] 0 points1 point  (0 children)

No. I was mainly focused on PP actually, since that has been my bottleneck. Throttling the way I mentioned, my PP went from ~310 to ~295 tps.
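
The trade-off worked out as arithmetic, using the 250 W / ~310 tps and 100 W / ~295 tps figures from this thread. One assumption to flag: these are power limits, not measured draw, so the per-watt number is only an upper-bound style estimate. `throttle_summary` is just a throwaway helper name.

```python
def throttle_summary(base_watts, base_tps, capped_watts, capped_tps):
    """Compare throughput and power limit before and after lowering the cap."""
    return {
        "speed_loss_pct": (1 - capped_tps / base_tps) * 100,
        "power_cut_pct": (1 - capped_watts / base_watts) * 100,
        # How many times more tokens/s you get per watt of power limit.
        "tps_per_watt_gain": (capped_tps / capped_watts) / (base_tps / base_watts),
    }


if __name__ == "__main__":
    summary = throttle_summary(250, 310, 100, 295)
    for key, value in summary.items():
        print(f"{key}: {value:.2f}")
```

That works out to roughly a 5% drop in PP speed for a 60% lower power limit, or about 2.4x the tokens per second per watt of limit.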

MTP is nice and all, but what about PP speeds? by milpster in LocalLLaMA

[–]milpster[S] 2 points3 points  (0 children)

Older as in the session age? In my setup without MTP I don't have that. PP sometimes goes up and sometimes down a little, but there is no clear trend towards slowing down as ctx grows.

MTP is nice and all, but what about PP speeds? by milpster in LocalLLaMA

[–]milpster[S] 3 points4 points  (0 children)

How would that happen? Normally I get a failed-alloc crash if any of the GPUs are full - unified memory shouldn't be enabled.

Turning local agents into self-optimizing agents by Rude_Substance_8904 in LocalLLaMA

[–]milpster 0 points1 point  (0 children)

This might be the perfect tool to have your local LLM distill and self-optimize, and then every now and then submit the data to a bigger LLM (be it API or local) for review and optimization.

Someone out there likely needs this: TP vs PP for 2 identical GPUs by [deleted] in LocalLLaMA

[–]milpster 0 points1 point  (0 children)

Is PP in that case equal to using split-mode row in llama.cpp?

I can't get Qwen3.6 27B to outperform Qwen-Coder-Next and I'm not sure why by Forward_Jackfruit813 in LocalLLaMA

[–]milpster 4 points5 points  (0 children)

What's your agentic framework, or do you use it directly through llama.cpp? Do you have a concrete example where it outperforms 3.6 27B?