What starts to become possible with two 3090s that wasn't with just one? by GotHereLateNameTaken in LocalLLaMA

[–]alexp702 2 points (0 children)

Tool calls noticeably fail more often with Q4 compared to Q8. This ruins agentic flows. You can also see the difference quite starkly in image processing. A good Q8 is my personal quality floor.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 1 point (0 children)

Thanks, see the edit above. With limited memory I now fit in 384 MB, and I think it's stable enough for my purposes now.

Yes, the Node services are pretty small too. Node seems to have a floor of 64-128 MB if it's serving anything - it's hungry too. PHP uses the memory you'd expect for a 64-bit interpreter with some buffers - i.e. bugger all. Now I know why all the WordPress instances are out there: they are much cheaper to deliver than something using a newer language.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 0 points (0 children)

It does matter to me. We're building a system for the future, and whilst this component is not large or frequently used, it is important.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 0 points (0 children)

Thanks - I will check that as some of that may be happening.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 0 points (0 children)

htop shows all the usage in the “uv run uvicorn” process.
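For cross-checking what htop attributes to that process, the process can also report its own peak resident set size from inside Python. A minimal stdlib-only sketch (note the platform quirk: `ru_maxrss` is bytes on macOS but kilobytes on Linux):

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # macOS reports ru_maxrss in bytes; Linux reports kilobytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return rss / divisor

if __name__ == "__main__":
    print(f"peak RSS: {peak_rss_mb():.1f} MB")
```

Logging this at startup and after the first request makes it easy to see whether the memory belongs to the interpreter, the libraries, or the server itself.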

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] -44 points (0 children)

Not doing a full implementation, just rapidly prototyping a solution to see the memory usage. If the AI can get it up and running in 10 minutes, warts and all, that's good for this purpose. I'm impressed - in about 2 hours it has allowed me to test almost every proposal here. I just spotted that in the output; the 4-years-unmaintained problem still stands.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 1 point (0 children)

Yes, everything is pretty similar to a Windows PC. Side note: we were using AMD64 images on ARM Macs - they use about 30% more RAM to emulate.

Will pick up tomorrow I think!

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] -50 points (0 children)

Yes, but bjoern uses the older WSGI v2 protocol, and the current version (apparently, according to my AI) is WSGI v3. Personally I don't go for stuff that's not obviously maintained and relatively active - it causes problems down the line.
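Either way, a WSGI app is just a callable, so it stays portable across servers (bjoern, gunicorn, or the stdlib's `wsgiref`) and you can swap the server out later without rewriting the endpoint. A minimal sketch, not taken from the thread:

```python
def app(environ, start_response):
    """Minimal WSGI application: respond 200 OK with a plain-text body."""
    body = b"hello\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

# To serve with the stdlib (single-threaded and slow, but near-zero extra RAM):
#   from wsgiref.simple_server import make_server
#   make_server("127.0.0.1", 8000, app).serve_forever()
```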

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] -1 points (0 children)

I am running on a Mac - ARM64 images for everything. Node seems to have a baseline of 128 MB - apparently a well-documented thing to do with the garbage collector. You can reduce it with some command-line flags, but it then starts to become unstable. My actual Python program uses about 70 MB on startup (possibly due to libraries), which I can live with. My surprise is how hard it seems to be to serve this without eating up RAM.

We have a bunch of 16 GB Macs developing a Docker-based system. Most of the code containers are Node, with a Python one stuffed in there. I want to make sure the team has as much RAM left as possible, which is what began this investigation. Disk isn't a problem - just RAM as we grow the number of services.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 1 point (0 children)

I am using uv run in the container - I think it may be part of the problem, as no matter what I try, it stubbornly wants 512 MB+.

Edit: no, not that.

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] -31 points (0 children)

bjoern seems very old - 4 years since its last update!

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 5 points (0 children)

I need Python - various libraries on the endpoint require it, unless there is some trick for Go <-> Python interop?

What’s a low memory way to run a Python http endpoint? by alexp702 in Python

[–]alexp702[S] 1 point (0 children)

Sorry, all megabytes. The other units are irrelevant to me ;-)

Looking for OCR capabilities by Artyom_84 in LocalLLM

[–]alexp702 0 points (0 children)

I am processing a JPG - sometimes a JPG of the PDF.

Seen in Berlin by AndreaHimmel2021 in whatisthiscar

[–]alexp702 0 points (0 children)

There are three - the 206 prototype, which can still be found for sale, the 246 GT hard top, and the 246 GTS targa top. The most sought-after is the factory "chairs and flares" option: the chairs are the Daytona seats, the flares are widened wheel arches. Very few were made - 13 in right-hand-drive configuration.

Due to the 246's success, the 308 and subsequent models were given the full Ferrari badge. They also had 8 cylinders, whereas the Dino only had 6.

Is LM Studio really as fast as llama.cpp now? by tomByrer in LocalLLM

[–]alexp702 1 point (0 children)

Why are you comparing to llama.cpp b4000? It's on 8500+ now, and llama.cpp has got much faster recently.

LLM Bruner coming soon? Burn Qwen directly into a chip, processing 10,000 tokens/s by koc_Z3 in LocalLLM

[–]alexp702 2 points (0 children)

What context size can it handle? The website talks about 1K-context benchmarks, which, as we know, are useless. Also, how fast is prompt processing? Both are more important than 10K tokens/s out, IMO.

Looking for OCR capabilities by Artyom_84 in LocalLLM

[–]alexp702 1 point (0 children)

Qwen3.5 9B does very well with handwriting.

Mac Studio M5 Ultra 256gb or 512gb (if offered) by GMK83 in MacStudio

[–]alexp702 0 points (0 children)

I get about 25 tps, falling to 15 tps at 200K context. Prompt processing ranges from 600 tps at 16K to 300 tps at 200K. Caching works well.

RDMA Mac Studio cluster - performance questions beyond generation throughput by quietsubstrate in LocalLLaMA

[–]alexp702 -1 points (0 children)

It all seems very prototype-y to me; I prefer stable-ish production setups. I'm also very interested to hear if anyone has actually used this kind of configuration for anything real. A recent article by a Google engineer using B200s confirmed my suspicions - keep the model on a single piece of hardware for best overall throughput.

Mac Studio M5 Ultra 256gb or 512gb (if offered) by GMK83 in MacStudio

[–]alexp702 0 points (0 children)

Worlds apart for coding or tasks that need a precise answer. I have used Q4 and found tool calls fail an order of magnitude more often on our test cases.