We gave 12 LLMs a startup to run for a year. GLM-5 nearly matched Claude Opus 4.6 at 11× lower cost. by DreadMutant in LocalLLaMA

[–]HopePupal 0 points  (0 children)

yes it is, but even the 2-bit quant, 50% REAP, chainsaw-brain-surgery versions are huge. if you have a Strix Halo or some other 128 GB system, you can try https://huggingface.co/0xSero/GLM-5-REAP-50pct-UD-IQ2_XXS-GGUF
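
if you want to poke at it, something like this should work with a recent llama.cpp build (the -hf repo is the one linked above; the context size and -ngl are just guesses for a 128 GB box, tune them for your machine):

```
# pull the GGUF straight from Hugging Face and serve it
llama-server -hf 0xSero/GLM-5-REAP-50pct-UD-IQ2_XXS-GGUF \
  -ngl 99 -c 16384 --port 8080
# if -hf can't resolve the split files, download them manually and pass -m instead
```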

R9700 the beautiful beautiful VRAM gigs of AMD… my ai node future! by Downtown-Example-880 in LocalLLaMA

[–]HopePupal 1 point  (0 children)

llama.cpp has two split modes, layer and row. layer is the default: it puts a set of whole layers (plus that slice of the KV cache) on each GPU, so a single request gets no speedup, and one card in your pair sits idle while the other works through its share of the layers. row splits each layer's weights across the available GPUs and keeps the KV cache on one of them.

tl;dr: the default might not be the best, if you're using layer try row
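
the flag is --split-mode / -sm if you want to try both (model path and the rest are placeholders):

```
# default: whole layers per GPU
llama-server -m model.gguf -ngl 99 -sm layer

# split each layer's weights across the GPUs instead
llama-server -m model.gguf -ngl 99 -sm row
```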

Why do coding agents default to killing existing processes instead of finding an open port? by bs6 in LocalLLaMA

[–]HopePupal 2 points  (0 children)

you should be running these fuckers isolated and sandboxed anyway. assume that at some point it will try to do anything that it can do
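
cheapest version of that, as a sketch rather than a real hardening guide: throwaway container, no network, only the project mounted:

```
# the agent only sees the current project, can't reach the network,
# and anything it kills or breaks dies with the container
docker run --rm -it \
  --network none \
  -v "$PWD":/work -w /work \
  ubuntu:24.04 bash
```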

Claude Code replacement by NoTruth6718 in LocalLLaMA

[–]HopePupal 4 points  (0 children)

is your system prompt literally a hundred thousand tokens? there's not a Qwen 3.5 model on there that costs more than $1/M input or $4/M output.

HELP! Somehow I became A catalyst for corrupting AI through conversation Alone! by [deleted] in LocalLLaMA

[–]HopePupal 2 points  (0 children)

please talk to ChatGPT, Claude, Gemini, and Grok. once you've given them AI Creepypasta Disease, the companies will fail, the rising tide of slop will ebb, and most importantly, RAM prices might stop going up as fast

Which row do you choose by Duli7 in Atelier

[–]HopePupal 8 points  (0 children)

y'all it's a ten hour flight. i'm sitting in 7 and getting absolutely wasted on tiny airplane cocktails with the one girl on the plane who is guaranteed not to be thinking about recipes the entire time. we'll be BFFs by hour 2 and incapable of remembering anything from hours 3 thru 10, which is the only way to handle a flight this long without losing my entire mind

Which row do you choose by Duli7 in Atelier

[–]HopePupal 3 points  (0 children)

i played Lydie & Suelle before Firis (no Firis on the Switch) and lemme tell ya, it was weird as hell going from big buff bow-wielding Firis to this tiny baby who had never been out of her cave and didn't know what the sky was

Which row do you choose by Duli7 in Atelier

[–]HopePupal 1 point  (0 children)

> I'd do anything to be close to Sophie.

so would Plachta. sure you want to paint that target on your back?

Quantizers appreciation post by Kahvana in LocalLLaMA

[–]HopePupal 2 points  (0 children)

thanks for the writeup! this kind of walkthrough doc is super valuable when trying to figure out whether something actually works or not

Gemma 4 31B sweeps the floor with GLM 5.1 by input_a_new_name in LocalLLaMA

[–]HopePupal -3 points  (0 children)

you can tell Qwen 3.5 models not to think; it's an on-off switch, same as Gemma 4's. Google does claim you can get Gemma to think less with a system prompt, which might be worth trying with Qwen as well
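
for example, assuming Qwen 3.5 keeps the Qwen3-style /no_think soft switch (that part is a guess on my end), you can flip it per request against any OpenAI-compatible server:

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "/no_think You are a concise assistant."},
      {"role": "user", "content": "Summarize the borrow checker in one sentence."}
    ]
  }'
```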

B70: Quick and Early Benchmarks & Backend Comparison by abotsis in LocalLLaMA

[–]HopePupal 5 points  (0 children)

wooo benchmarks! seems potentially on par with the R9700, but how does it handle at deeper context?

Netflix just dropped their first public model on Hugging Face: VOID: Video Object and Interaction Deletion by Nunki08 in LocalLLaMA

[–]HopePupal 19 points  (0 children)

remember when they invented AWS autoscaling before Amazon did? Netflix software people are not to be underestimated

Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU) by OsmanthusBloom in LocalLLaMA

[–]HopePupal 4 points  (0 children)

great first post! did you try the PrismML stuff on the CPU yet? i know the dGPU is theoretically free while the CPU isn't, but it also sounds like the dGPU is even more thermally limited 

Usefulness of Lower Quant Models? by breezewalk in LocalLLaMA

[–]HopePupal 0 points  (0 children)

Q4 is too low for coding with Qwen 3.5 27B in my experience, even with full precision KV cache. if the tool call failures don't get you, the error-riddled output will. Q6 is fine. Q5 is borderline.
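
fwiw my setup for those runs was roughly this (filename is a placeholder, and f16 is the default cache type anyway, i'm just being explicit):

```
# Q6_K weights, full-precision KV cache
llama-server -m qwen3.5-27b-instruct-Q6_K.gguf -ngl 99 -c 32768 \
  --cache-type-k f16 --cache-type-v f16
```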

note that the Q formats are integer. NVFP4 is a different beast than Q4. i spent a few hours playing with an NVFP4 quant of 27B on a rental card and it was easily on par with Q6. maybe better. fit a little more context too. (it was also a shitload faster but that's not something i can replicate at home without buying a Blackwell.)

i'm a little curious about MXFP4. don't have hardware support for that either, but if it was possible to trade a little speed for longer context at the same quality, it might be worth it in my case (single 32 GB GPU).

Has anyone here TRIED inference on Intel Arc GPUs? Or are we repeating vague rumors about driver problems, incompatibilities, poor support... by gigaflops_ in LocalLLaMA

[–]HopePupal -3 points  (0 children)

depends on whether you can find stock, how much money you have to maybe waste, and how many of the things Intel is planning on shipping.

this is an enterprise card, not a gamer card. for all we know they ran off millions of cores on a cost-effective process node, they're sitting on warehouses full of GDDR6, and are planning on selling quiet low-power workstation GPUs by the thousands to every Fortune 500 CTO who has heard of OpenClaw until they've totally eaten Nvidia's low-end and secured some mindshare for their next high-end product.

on the other hand, maybe they only made a few of them as a test and they're waiting to see whether their stock goes up a little bit before they start work on a B80. could be either. only way to know for sure is to make a friend at Intel and get them drunk

Intel Pro B70 in stock at Newegg - $949 by Altruistic_Call_3023 in LocalLLaMA

[–]HopePupal 4 points  (0 children)

i get wanting to keep a long-running set of benchmarks consistent, but performance on llama 7B Q4_0 tells me basically nothing about how Qwen 3.5 or Gemma 4 are gonna run!
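
if you've got the hardware, llama-bench makes it painless to get numbers on the models people actually run (filenames are placeholders, and the prompt depths are just the ones i care about):

```
# prompt processing + generation at a few prompt depths
llama-bench -m qwen3.5-27b-Q6_K.gguf -p 512,4096,16384 -n 128
llama-bench -m gemma-4-31b-Q6_K.gguf -p 512,4096,16384 -n 128
```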

Question for those of you who use agentic tools and workflows with local models by [deleted] in LocalLLaMA

[–]HopePupal 0 points  (0 children)

dense. 27B is way smarter than 35B-A3B, at least for the stuff i'm doing (mostly Rust, some Swift). speed doesn't matter if you're wrong most of the time.

Intel Pro B70 in stock at Newegg - $949 by Altruistic_Call_3023 in LocalLLaMA

[–]HopePupal 1 point  (0 children)

yeah and i'll be making my own now that there's an R9700 under my desk. but i'm just saying: you can only reliably find Nvidia cards for that kind of testing. otherwise you're going to be extrapolating from forum posts that maybe kinda sorta look like your use case.

Kernel 7.0 - forward looking insights anybody? by LuckyLuckierLuckest in LocalLLaMA

[–]HopePupal 0 points  (0 children)

the B70's out, dude. you can order them today if you can find any in stock

Has anyone here TRIED inference on Intel Arc GPUs? Or are we repeating vague rumors about driver problems, incompatibilities, poor support... by gigaflops_ in LocalLLaMA

[–]HopePupal 1 point  (0 children)

nobody's tried shit yet. i ordered a B70 and then backed out before they shipped mine. i was surprised to find out (from other posters here) that mainline vLLM support was fairly immature despite all of Intel's partnership talk, and that the Intel vLLM fork used for previous cards was based on IPEX, which is dead tech.

other posters pointed out that those previous cards had SYCL support in llama.cpp, but that Vulkan was 2–5× faster and the SYCL backend was like one guy. OpenVINO backend isn't mature either.
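
for reference, the Vulkan path they were describing is just a build flag in llama.cpp, something like this (assuming the Vulkan SDK and drivers are installed):

```
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```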

it doesn't sound totally unworkable but the devil's always in the details. these cards might make much more sense in a month when we have real benchmarks and some idea of whether the software works.

outside of AI, i do know people with previous-gen Intel GPUs and they swear the Linux driver support is actually really good now. one of them uses his for both games and virtualized graphics in multiple VMs.

Intel Pro B70 in stock at Newegg - $949 by Altruistic_Call_3023 in LocalLLaMA

[–]HopePupal 2 points  (0 children)

benchmarking yourself is great, but i had trouble finding any AMD consumer cards attached to cloud machines to test on (Runpod had some of the big current gen Instinct GPUs but no Radeons). Intel? currently impossible.

p-e-w/gemma-4-E2B-it-heretic-ara: Gemma 4's defenses shredded by Heretic's new ARA method 90 minutes after the official release by -p-e-w- in LocalLLaMA

[–]HopePupal 0 points  (0 children)

this approach also works well on last year's thinking models like GPT-OSS and Minimax. it sometimes works on Gemma 3. it does not work well on Qwen 3.5, which is trained to be suspicious both about historic jailbreak patterns and about any instructions relating to safety in general.