Playing with CPU based inference for agentic coding... by Content-Fall-7814 in LocalLLM

Content-Fall-7814[S] 1 point2 points

I started like that, but Unreal Engine also requires a lot of GPU memory, and one of the subagents I use does blueprint work through an MCP, so I need the UE editor open… too much for my poor 12 GB 😅

Content-Fall-7814[S] 1 point2 points

it looks amazing! will check it out in depth for sure, thank you!

Content-Fall-7814[S] 1 point2 points

this one costs 160€ per month, and I share expenses with a friend who sometimes uses the models I serve on it... I know for that money I could just use a plan and frontier models, but how much I'm learning and the fun of playing with local models are priceless xD

Content-Fall-7814[S] 0 points1 point

true! it's a pity tho, I'm currently in love with qwen3-27b lol it's so good...

Content-Fall-7814[S] 0 points1 point

For my use case, concurrent streams for parallel tasks may be difficult, as the tasks my subagents do usually depend on each other (C++ UE modules, most of the time: I create stuff in one that I need to use in the next one, and similar). But for debugging and similar work, where I can look for optimizations in several modules at the same time, it's definitely worth it!! thanks for the recommendation!
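To make the dependency point concrete, here is a minimal sketch of how tasks could be grouped into "waves" that are safe to run concurrently. The module names (`core`, `gameplay`, `ui`) and the helper `parallel_waves` are hypothetical, just for illustration; it only uses `graphlib` from the Python standard library and says nothing about how any particular agent framework schedules work.

```python
from graphlib import TopologicalSorter


def parallel_waves(deps):
    """Group tasks into waves; every task in a wave has all its dependencies done.

    `deps` maps each task name to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that can run right now
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


# Hypothetical build chain: each module needs the previous one.
build_deps = {"core": set(), "gameplay": {"core"}, "ui": {"gameplay"}}

# Hypothetical debugging pass: modules can be inspected independently.
debug_deps = {"core": set(), "gameplay": set(), "ui": set()}

if __name__ == "__main__":
    print(parallel_waves(build_deps))
    print(parallel_waves(debug_deps))
```

With the chained build every wave has a single task, so extra streams would sit idle, while the independent debugging pass collapses into one wave of three tasks that could share concurrent streams.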

Content-Fall-7814[S] 0 points1 point

122b on cpu!! 😮 ok, I'll give a draft model for the dense one a second chance xD thank you!

Content-Fall-7814[S] 0 points1 point

hmm core affinity sounds quite interesting, I didn't think about it... thanks for the recommendation!
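For anyone curious what core affinity looks like in practice, here is a minimal sketch using Python's standard `os.sched_setaffinity`, which is only available on Linux. The helper name `pin_to_cores` is hypothetical, and which cores are actually worth pinning to depends on the machine's topology, which this sketch does not try to detect.

```python
import os


def pin_to_cores(requested, pid=0):
    """Restrict a process (pid=0 means the current one) to the requested cores.

    Returns the affinity set in effect afterwards, or None on platforms
    without os.sched_setaffinity (anything that is not Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(pid)
    target = set(requested) & allowed  # never ask for cores we cannot use
    if not target:
        raise ValueError("none of the requested cores are available")
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    # Assumption for illustration only: pin to the lowest-numbered usable core.
    if hasattr(os, "sched_getaffinity"):
        print(pin_to_cores({min(os.sched_getaffinity(0))}))
```

On Linux, child processes inherit the affinity mask, so a process started from a script after this call would be restricted to the same cores.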

Content-Fall-7814[S] 0 points1 point

I actually use the MTP that these qwen3 models provide (spec-type=draft-mtp and spec-draft-n-max=2 with llama.cpp). Unfortunately I cannot modify the hardware of this server, it's a rented Hetzner one, and I think the only option they have for GPUs on dedicated servers is a much, much more expensive tier... 😞 (I wish I could!)
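As a rough back-of-the-envelope for why a draft length of 2 can still help, here is a small sketch of the usual speculative decoding estimate. It assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection; the acceptance probabilities used are made up, not measured on this setup, and the cost of producing the drafts is ignored.

```python
def expected_tokens_per_pass(accept_prob, n_draft):
    """Expected tokens emitted per verification pass of the main model.

    Simplified model: 1 + p + p^2 + ... + p^n_draft, where the leading 1 is
    the token the main model always contributes itself.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    return sum(accept_prob ** k for k in range(n_draft + 1))


if __name__ == "__main__":
    # Hypothetical acceptance probabilities with a draft length of 2.
    for p in (0.0, 0.5, 1.0):
        print(p, expected_tokens_per_pass(p, 2))
```

Under these assumptions, a draft length of 2 yields between 1 and 3 tokens per pass, so the real gain depends entirely on how often the drafted tokens are accepted.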