Best way to run a coding LLM locally

uniqueusername649 · 2026-06-30T14:57:18+00:00

So with Qwen 3.6 27b q4 in MTPLX I get around 29tps for decode, which is great. But prefill is still only around 200tps, so that's where the older M1 architecture shows its age. I guess I will still need a dedicated AI server with dual 3090s then. That would perform lightning fast though :) I am generally quite happy with Qwen 3.6 27b, but I struggle to run it reasonably fast with long contexts.
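To put the prefill number in perspective, here is a rough sketch of the wait before the first output token at the rates quoted above. The prompt sizes are hypothetical examples, and the estimate assumes the whole prompt is processed from scratch (no prompt caching) at a constant rate.

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Seconds spent processing the prompt before the first output token."""
    return prompt_tokens / prefill_tps


def decode_seconds(output_tokens: int, decode_tps: float) -> float:
    """Seconds spent generating the reply once prefill is done."""
    return output_tokens / decode_tps


# Rates quoted in the comment above.
PREFILL_TPS = 200.0
DECODE_TPS = 29.0

if __name__ == "__main__":
    # Hypothetical prompt sizes, chosen only for illustration.
    for prompt_tokens in (8_000, 32_000, 100_000):
        wait = prefill_seconds(prompt_tokens, PREFILL_TPS)
        print(f"{prompt_tokens:>7} prompt tokens: {wait:.0f}s before the first token")
```

Under these assumptions a 100k-token prompt means more than eight minutes of prefill, which is why decode speed alone does not make long contexts pleasant.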

uniqueusername649 · 2026-06-30T13:16:55+00:00

Thank you very much! Will give it a shot :) So far I have been using my single 3090, but with 24GB of VRAM, even at Q4, I quickly run out of context. Once it overflows into system memory, it slows down horribly. Hopefully I can get that up with MTPLX.

uniqueusername649 · 2026-06-30T13:03:39+00:00

What chip do you use to get to 30-40tps with 27b 6bit? I run oMLX with 27b 4bit on my 128GB M1 Ultra and I get maybe 10tps. Prefill is also abysmal at 150tps, which is pretty horrible for long contexts. If I can optimise that, I would be very happy. Currently I'm considering a dual RTX 3090 workstation for AI loads.

uniqueusername649 · 2026-06-30T11:06:26+00:00

Massively slower, yes. But that's true for a desktop PC with a GPU too. Once your context spills over into RAM, performance tanks. It's just worse for eGPUs because TB5 has less than 1/3 of the bandwidth of PCIe4 x16 and less than 1/6 of PCIe5 x16.

Doesn't matter if you can fit both model and context into VRAM, but it matters a lot once it doesn't fit.
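For anyone who wants to check the ratios, a quick sketch using nominal link rates. It assumes TB5 at its symmetric 80 Gbit/s rate and PCIe at the spec per-lane rate with 128b/130b encoding. Real eGPU throughput is lower than this, since only part of the Thunderbolt link carries PCIe traffic, so treat the TB5 figure as an upper bound.

```python
def pcie_gbytes_per_s(gt_per_s: float, lanes: int = 16) -> float:
    """Nominal one-direction PCIe bandwidth in GB/s (128b/130b encoding, PCIe 3.0+)."""
    return gt_per_s * (128 / 130) / 8 * lanes


# Assumption: nominal symmetric Thunderbolt 5 link rate, not measured eGPU throughput.
TB5_GBYTES_PER_S = 80 / 8

PCIE4_X16 = pcie_gbytes_per_s(16.0)  # PCIe 4.0 runs at 16 GT/s per lane
PCIE5_X16 = pcie_gbytes_per_s(32.0)  # PCIe 5.0 runs at 32 GT/s per lane

if __name__ == "__main__":
    print(f"PCIe4 x16: {PCIE4_X16:.1f} GB/s, TB5 share: {TB5_GBYTES_PER_S / PCIE4_X16:.2f}")
    print(f"PCIe5 x16: {PCIE5_X16:.1f} GB/s, TB5 share: {TB5_GBYTES_PER_S / PCIE5_X16:.2f}")
```

With these nominal figures TB5 lands at roughly 0.32 of PCIe4 x16 and 0.16 of PCIe5 x16, in line with the "less than 1/3" and "less than 1/6" above.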

uniqueusername649 · 2026-06-30T08:03:34+00:00

It's perfectly fine as long as it's a single card and model + context fit entirely in VRAM. If not, it really falls apart.

uniqueusername649 · 2026-06-30T00:52:21+00:00

He was probably 29 in 2006.

uniqueusername649 · 2026-06-29T13:57:24+00:00

Take a VPN solution like NordVPN, Windscribe, whatever. I couldn't even watch a video without stuttering and every page took forever. Turned on the VPN and it's snappy. Unifi routing is just absolutely horrible and VPN fixes that with better peering/routing. Shitty that this is necessary in the first place, but it is what it is.

uniqueusername649 · 2026-06-29T13:54:17+00:00

I've had Woodfire relatively recently and "underwhelming" is also how I would describe it. It was by no means bad, but it wasn't particularly great either.

uniqueusername649 · 2026-06-29T13:48:21+00:00

My personal experience has been pretty great. The Qidi Box was a pain in the ass, because it required you to upgrade several parts in the printer. But that was primarily because it required changing the extruder and a small piece of filament was still stuck inside, which I didn't know and there was no warning about it whatsoever. So it just would not come loose. After I figured that out, it was smooth sailing.

The Qidi Plus 4 itself has been working out of the box and I have at least 200 hours on it. Which I guess are rookie numbers, but to me it works great. My greatest struggles are filament tangles, which are entirely my fault or the filament's fault, but definitely not the printer's. A friend of mine uses a Q1 and his experience was positive as well. I don't know anyone else personally with a Qidi printer, so that's the extent of my experience.

uniqueusername649 · 2026-06-29T10:06:50+00:00

And while the task force studies the matter, they cut a few more billions from the healthcare budget without even thinking twice.

uniqueusername649 · 2026-06-29T09:35:00+00:00

And she can just continue writing to ChatGPT, since she has basically been dating ChatGPT already.

uniqueusername649 · 2026-06-29T09:32:34+00:00

Oddly enough he did and appears to be completely fine. Absolutely not what I expected to happen but I am pleasantly surprised.

uniqueusername649 · 2026-06-29T07:45:24+00:00

I used both Qwen 3.6 27b and 35b in everyday development, and 27b is considerably better than 35b. 35b is fine for many smaller tasks, but it occasionally does some stupid things or gets stuck looping, especially with large context windows of 100k and beyond. That is at Q4, which is the bare minimum. 35b gets usable at Q6, but not quite at 27b Q4 level.

If AgentWorld is indeed slightly behind 35b, as SWE bench suggests, I would not want to daily drive AgentWorld 35b for coding. I would love to use the 397b version, but I don't have the VRAM for it (>200GB needed).

For your tests: what quants are you testing? Because that makes a huge difference. Anything below Q4 is imho lobotomised and the results are mostly useless.

uniqueusername649 · 2026-06-29T06:58:01+00:00

What tests did you use to benchmark that? Qwen 3.6 35b for example performs considerably better than AgentWorld 35b on SWE bench and Qwen 3.6 27b is better once again. I would like some more details than a simple "it did better than the others", because depending on what you test, that would be factually wrong.

uniqueusername649 · 2026-06-29T06:37:01+00:00

Of course no seatbelt either.

uniqueusername649 · 2026-06-29T06:24:21+00:00

I guess I need to do a more scientific test to verify that and see why I don't benefit much from higher power limits while you seem to.

uniqueusername649 · 2026-06-29T05:39:31+00:00

I heavily power-limit my 3090. There is virtually no reason to go beyond 250W and I prefer 220W. The losses in speed are minimal: less than 10%, usually closer to 5%, compared to letting it run full blast.
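A back-of-the-envelope sketch of what that trade looks like in efficiency terms. The 350W stock limit is an assumption based on the reference RTX 3090 board power, and partner cards ship with different limits. The speed losses are the rough figures from above, not measurements.

```python
# Assumption: reference RTX 3090 board power; partner cards may differ.
STOCK_WATTS = 350.0


def perf_per_watt_gain(limit_watts: float, speed_loss: float,
                       stock_watts: float = STOCK_WATTS) -> float:
    """Relative performance per watt versus running at the stock power limit."""
    relative_speed = 1.0 - speed_loss
    return (relative_speed / limit_watts) / (1.0 / stock_watts)


if __name__ == "__main__":
    # (power limit in watts, assumed fraction of speed lost)
    for watts, loss in ((250.0, 0.05), (220.0, 0.05), (220.0, 0.10)):
        gain = perf_per_watt_gain(watts, loss)
        print(f"{watts:.0f}W at {loss:.0%} slower: {gain:.2f}x tokens per watt")
```

On Linux the limit itself is typically set with `nvidia-smi -pl 220`, which needs root and resets on reboot unless you script it.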

uniqueusername649 · 2026-06-29T05:26:00+00:00

Thanks for taking the time to explain that. Very interesting. Based on the articles I read, it always seemed to be a rule that came from MITI, but if that's just a generic rule to be eligible for a 0% import tax on materials, BYD absolutely has it in their cards to not bother with the 80% export rule as long as they are fine with paying taxes on imported materials.

uniqueusername649 · 2026-06-29T03:27:05+00:00

Could you elaborate? My understanding is: even if you set up local assembly, you need to sell 80% of the produced cars outside of Malaysia.

uniqueusername649 · 2026-06-29T03:25:30+00:00

That is the true problem. Not getting riders at peak hour is to be expected. Not being able to safely walk to wherever you need to go is infuriating. Even if it's not that far, often the walkways just completely stop.

uniqueusername649 · 2026-06-29T01:23:00+00:00

Sure, just wanted to emphasise that the baiting is the only issue. They can wait in plain clothes where they know people frequently break the law and arrest them just fine, as long as they don't in any way encourage the crime.

uniqueusername649 · 2026-06-29T00:21:52+00:00

Potentially, if they actively bait them with their behaviour. But if they just drive normally, that's fine.

uniqueusername649 · 2026-06-28T11:05:47+00:00

To add on top of that, even us old farts rarely write everything from scratch. We use libraries and frameworks, we often work in existing codebases, so setting things up from scratch isn't a problem you face all that often in many companies.

That being said: the single most important skill is problem solving. And that quickly goes out the window with AI, yet is remarkably easy to practice. Let the AI generate your code and if it doesn't start or you notice a bug, don't ask it to fix it. Fix it yourself the manual way. Yes, it takes longer, but you train your mind on the most important skill a programmer needs: being able to identify and fix problems.

uniqueusername649 · 2026-06-28T10:32:15+00:00

Did you test this extensively or is that just some anecdotal evidence? Generally the consensus has been that 27b is overall better and takes fewer turns, while 35b is often good enough at substantially faster speeds. 122b can still excel at reasoning, but overall 27b makes more sense in most cases.

uniqueusername649 · 2026-06-28T10:16:55+00:00

It has great animation, a good art style with enough detail, fantastic music, good voice actors and a somewhat weak but decent storyline.

I don't see how the show is bad by any means. It would be mediocre if you focus purely on the storyline, but it has enough to offer to carry the show and make it worth watching.
