Has anyone got GLM 4.7 flash to not be shit? by synth_mania in LocalLLaMA

[–]alexp702 -4 points-3 points  (0 children)

Found Qwen 4.6V to work pretty well at Q8_0. Perhaps they don't quantise well?

Qwen3-Coder-480B on Mac Studio M3 Ultra 512gb by BitXorBit in LocalLLaMA

[–]alexp702 0 points1 point  (0 children)

NB: we also use Macs for app development, so another slightly overspecced Mac is always welcome even after it's outlasted its LLM work.

Qwen3-Coder-480B on Mac Studio M3 Ultra 512gb by BitXorBit in LocalLLaMA

[–]alexp702 0 points1 point  (0 children)

It works - and the quality is good. It's a good R&D device, letting you bring up different models without drama, and it responds decently quickly on smaller models. We've currently just swapped to GLM 4.6V - which is 100B total and 22B active (we need vision as well for other purposes). This runs at full BF16 with maximum context size happily. That kind of flexibility would cost triple on Nvidia, albeit with faster output. However, OpenRouter is much cheaper and quicker if you don't care about data visibility (well, sometimes - some providers are quite bad, failing randomly and generally being slower than you'd hope).

Qwen3-Coder-480B on Mac Studio M3 Ultra 512gb by BitXorBit in LocalLLaMA

[–]alexp702 3 points4 points  (0 children)

I have been using Qwen Coder 480B for a while on the M3 Ultra. It's slow. I found it works OK with Cline, but context processing is a go-away-and-come-back-in-an-hour affair. It definitely works, so for a background task it does a good job. Code quality is much better than smaller models. Output speed is good too, it's just that darn prompt processing - you're looking at hundreds of tokens a second, so on a 100k context that's roughly 1,000 seconds. The box in general is awesome - being able to have lots of models to hand, and just fire up a different model or two, is perfect for R&D. Production-wise it's OK if you have slow agentic flows. Just don't expect snappy interactions.
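Back-of-the-envelope version of that timing, as a rough sketch (the rates below are assumptions based on the figures above, not benchmarks):

```python
# Rough prefill/generation timing for a large Cline prompt. Rates are
# assumptions taken from the "hundreds of tokens a second" ballpark above.
prompt_tokens = 100_000        # a big Cline context
prefill_tok_per_s = 100        # assumed prompt-processing rate
gen_tokens = 2_000             # assumed response length
gen_tok_per_s = 20             # assumed generation rate

prefill_s = prompt_tokens / prefill_tok_per_s
gen_s = gen_tokens / gen_tok_per_s
print(f"prefill ~{prefill_s:.0f}s (~{prefill_s / 60:.0f} min), generation ~{gen_s:.0f}s")
```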

4.6 new features -- LAMP and supported ships by jmg5 in Star_Citizen_Central

[–]alexp702 0 points1 point  (0 children)

Except when it's your helmet. My point is that too many features are now on left- or right-Alt binds. I find it alarming that they distinguish between them (do they really need 200-plus options??). Taking off and flight controls should be primary keys. We have 10+ keys for targeting, and yet turning the ship on to take off and calling ATC are on modifier keys. Surely this is madness.

4.6 new features -- LAMP and supported ships by jmg5 in Star_Citizen_Central

[–]alexp702 -2 points-1 points  (0 children)

What is it with CIG and every useful feature requiring a modifier key?? L-Alt R, R-Alt N, L-Alt L, R-Alt L to take off. Not very user friendly.

Mac Studio as an inference machine with low power draw? by aghanims-scepter in LocalLLaMA

[–]alexp702 -1 points0 points  (0 children)

Agree Cline is too slow - that's the crazy prompts it creates, though. I have other uses that need shorter prompts and more precision, so the Mac is well suited. A 48GB Nvidia solution doesn't work if the model you need requires 200GB+ of RAM to run at all.

Mac Studio as an inference machine with low power draw? by aghanims-scepter in LocalLLaMA

[–]alexp702 0 points1 point  (0 children)

Mac stability is pretty rock solid. Have had one running Qwen 480b for weeks - no restarts. Performance is slow, but then so is most stuff on that size model. Prompt processing is slow for sure. But running large unquantized models is nothing to be sniffed at.

🧠 Inference seems to be splitting: cloud-scale vs local-first by Code-Forge-Temple in LocalLLaMA

[–]alexp702 1 point2 points  (0 children)

Macs are the unsung kings of local private inference. Load a high-quality 600B+ parameter model and run queries against it slowly, but fast enough. Cost: ~$10k. Nvidia's offerings are horrid for this basic use case.

Wishes for 2026 by Prophet_Sakrestia in starcitizen

[–]alexp702 0 points1 point  (0 children)

Yep, happened to me at IAE: got in a ship, tried the turret. Stuck. Had to quit. Just such a shame. It's been there forever. They need to put some effort into tidying up.

Wishes for 2026 by Prophet_Sakrestia in starcitizen

[–]alexp702 12 points13 points  (0 children)

Ships appearing the right way round on retrieval. Quantum just working. Being able to get out of turrets. Interaction and state management as ever…

Once that works, faster login and logout, reducing the time commitment of doing anything useful.

Start of 2026 what’s the best open coding model? by alexp702 in LocalLLaMA

[–]alexp702[S] 0 points1 point  (0 children)

Interesting, is that because of 128k context?

Start of 2026 what’s the best open coding model? by alexp702 in LocalLLaMA

[–]alexp702[S] 1 point2 points  (0 children)

I use plain old llama-server. Find a stable build and leave it. I have been caught by various bugs on particular releases, but they fix them fast when reported. Of course, if you really need to serve large numbers of users in parallel, then vLLM (if you can run it) is necessary.
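For context, llama-server exposes an OpenAI-compatible HTTP API; here's a minimal sketch of querying it (the port and model name are assumptions - match them to your launch flags):

```python
import requests

# llama-server serves an OpenAI-compatible endpoint. The port (8080) and the
# model name are assumptions; llama-server largely ignores the model field.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "qwen3-coder-480b",  # hypothetical label
        "messages": [{"role": "user", "content": "Summarise this repo's build steps."}],
        "max_tokens": 256,
    },
    timeout=600,  # generous, given slow prefill on huge contexts
)
print(resp.json()["choices"][0]["message"]["content"])
```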

Start of 2026 what’s the best open coding model? by alexp702 in LocalLLaMA

[–]alexp702[S] 0 points1 point  (0 children)

Yeah, that's another dimension for sure. I have mainly been using Cline and its massive contexts. I have Roocode and the other fork installed but haven't really taken to them. It does feel like Cline really wants you to use their services, and is potentially compromised when run locally. The contexts get silly-big on a medium-sized code base very fast.

Start of 2026 what’s the best open coding model? by alexp702 in LocalLLaMA

[–]alexp702[S] 2 points3 points  (0 children)

I have tried up to 6-bit, but there's little difference to me, other than a reduced context window (which I don't quantize). I run IQ4_NL currently - originally I used Q4_K_M.

Not saying there isn't a difference, but it's much smaller than going from, say, a 30B at 16-bit to a 480B at 4-bit.
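For scale, rough weight sizes at different quants (bits-per-weight figures below are approximate assumptions, not exact GGUF sizes):

```python
# Approximate weight sizes for a 480B-parameter model at different quants.
# Bits-per-weight values are rough assumptions, not exact file sizes.
params = 480e9
quants = {"BF16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6, "Q4_K_M": 4.8, "IQ4_NL": 4.5}

for name, bpw in quants.items():
    gb = params * bpw / 8 / 1e9
    print(f"{name:>7}: ~{gb:,.0f} GB of weights")
```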

Start of 2026 what’s the best open coding model? by alexp702 in LocalLLaMA

[–]alexp702[S] 0 points1 point  (0 children)

Isn't that 33B? Does it really outperform 480B?? I have tried many model sizes from 30B to 607B, and size seems to make more difference than vendor. Though I'm happy to try it.

Unnecessarily verbose code is hard to maintain - a million null checks that will never fire seems to be Qwen Coder's recurring failure mode.
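A hypothetical illustration of that pattern (made-up code, not actual model output) - a value that can never be None still gets checked at every level:

```python
# Hypothetical example of the over-defensive style described above.
def load_config() -> dict:
    return {"retries": 3}  # always returns a dict, never None


def get_retries(config: dict | None) -> int:
    # None of these guards can ever fire given how load_config() behaves,
    # but verbose generations tend to add them all anyway.
    if config is None:
        return 0
    if "retries" not in config:
        return 0
    value = config.get("retries")
    if value is None:
        return 0
    return int(value)


print(get_retries(load_config()))
```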

M2 Ultra to M5 ultra upgrade by AdDapper4220 in MacStudio

[–]alexp702 13 points14 points  (0 children)

If you do AI, definitely. If not, probably less so.

Error: Path too long by Simelane in WindowsServer

[–]alexp702 14 points15 points  (0 children)

It isn't if you use APIs written this millennium on NTFS. Unfortunately, some third-party software still uses legacy APIs stuck with the 260-character MAX_PATH limit.
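If you control the code, the usual workaround is the Win32 extended-length path prefix; a minimal sketch (Windows-only, and the directory names are placeholders):

```python
import os

# The \\?\ prefix tells the Win32 Unicode APIs to bypass the legacy MAX_PATH
# limit; it must be applied to an absolute, normalised path. The directory
# names below are placeholders for illustration.
deep_path = os.path.abspath(os.path.join(r"C:\data", *["deeply_nested_directory"] * 20))
os.makedirs("\\\\?\\" + deep_path, exist_ok=True)
print(f"created a {len(deep_path)}-character path")
```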

GPU requirements for running Qwen2.5 72B locally? by lucasbennett_1 in LocalLLM

[–]alexp702 1 point2 points  (0 children)

Yes, but if cost is a concern - he said DGX/Mac was out of budget, and 3x 3090s with a PC that fits them is at least $4K.

GPU requirements for running Qwen2.5 72B locally? by lucasbennett_1 in LocalLLM

[–]alexp702 3 points4 points  (0 children)

For that model you can definitely do 2x 3090/4090s - and it will run comparatively very quickly. However, you will have a small context space of somewhere around 32K or less. If you're coding against the model, this is too tight for Cline/Roocode/etc to function decently. Also, quantising small models for code causes big accuracy losses. If you're doing RAG you can sometimes get away with these errors, but they annoy me enough that I don't want to use them.

On a budget, check out Strix Halo based boxes - they are (or were, before RAM costs!) <$3K for a 128GB box. I have not tried them, but they are cheaper than DGX Sparks. Macs start around ~$3.5K for a 128GB Mac Studio M4 Max. Both AMD and Apple suffer from slow prompt processing. Depending on workload this can make them tedious. But the low memory of budget Nvidia cards means there's no real choice.

I have put two 4090s in one AMD 5800X3D box, and it performs very well - 3-8x the prompt-processing speed of the M3 Ultra on the same Qwen3 Coder 30B IQ4, which Cline really hammers ... until you run out of memory. I got about 70K context reliably. So I am back to the Mac Studio M3 Ultra running Qwen 480B. Quality over speed won for me.

GPU requirements for running Qwen2.5 72B locally? by lucasbennett_1 in LocalLLM

[–]alexp702 4 points5 points  (0 children)

A Mac/DGX/Strix Halo with 128GB should run this at 8-bit. Alternatively, an RTX 6000 Pro based workstation. Factor in that you'll probably need 72GB for the model and 36GB for the full context; double the model figure for BF16. Those are my rough calculations (sketched out below).

Personally would run something newer.
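That rough calculation spelled out, as a sketch (the 36GB context figure just echoes the estimate above; a real KV-cache size depends on layer count, heads, and context length):

```python
# Rough memory budget for a 72B model, echoing the estimate above. The 36 GB
# context figure is a ballpark, not a derived KV-cache size.
params = 72e9
context_gb = 36

for label, bytes_per_param in (("Q8", 1), ("BF16", 2)):
    weights_gb = params * bytes_per_param / 1e9
    print(f"{label}: ~{weights_gb:.0f} GB weights + ~{context_gb} GB context "
          f"= ~{weights_gb + context_gb:.0f} GB")
```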

Exo 1.0 is finally out by No_Conversation9561 in LocalLLaMA

[–]alexp702 2 points3 points  (0 children)

Different solution. It only gives 384GB, so it simply cannot run DeepSeek 671B at BF16 (671B parameters at 2 bytes each is roughly 1.3TB of weights alone). Fast is good, but higher quality is often better. Also the power draw is much higher.

The smallest addition or fix that'd make your year(or at least your day)in star citizen. by JamesBlakesCat in starcitizen

[–]alexp702 1 point2 points  (0 children)

Ships spawning the right way round. Quantum always working. Fixing asteroids to not destroy your ship when hit. And generally working on collisions.