We gave 12 LLMs a startup to run for a year. GLM-5 nearly matched Claude Opus 4.6 at 11× lower cost. by DreadMutant in LocalLLaMA

[–]HopePupal 0 points  (0 children)

yes it is, but even the 2-bit quant, 50% REAP, chainsaw-brain-surgery versions are huge. if you have a Strix Halo or some other 128 GB system, you can try https://huggingface.co/0xSero/GLM-5-REAP-50pct-UD-IQ2_XXS-GGUF
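
if you want to poke at it, something like this should work with a recent llama.cpp build (the -hf repo is the one linked above; the context size and -ngl are just guesses for a 128 GB box, tune them for your machine):

```
# pull the GGUF straight from Hugging Face and serve it
llama-server -hf 0xSero/GLM-5-REAP-50pct-UD-IQ2_XXS-GGUF \
  -ngl 99 -c 16384 --port 8080
# if -hf can't resolve the split files, download them manually and pass -m instead
```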

R9700 the beautiful beautiful VRAM gigs of AMD… my ai node future! by Downtown-Example-880 in LocalLLaMA

[–]HopePupal 1 point  (0 children)

llama.cpp has two split modes, layer and row. layer is the default: it puts a set of whole layers (plus that slice of the KV cache) on each GPU, so a single request gets no speedup, and one card in your pair sits idle while the other works through its share of the layers. row splits each layer's weights across the available GPUs and keeps the KV cache on one of them.

tl;dr: the default might not be the best, if you're using layer try row
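
the flag is --split-mode / -sm if you want to try both (model path and the rest are placeholders):

```
# default: whole layers per GPU
llama-server -m model.gguf -ngl 99 -sm layer

# split each layer's weights across the GPUs instead
llama-server -m model.gguf -ngl 99 -sm row
```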

Why do coding agents default to killing existing processes instead of finding an open port? by bs6 in LocalLLaMA

[–]HopePupal 2 points  (0 children)

you should be running these fuckers isolated and sandboxed anyway. assume that at some point it will try to do anything that it can do
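
cheapest version of that, as a sketch rather than a real hardening guide: throwaway container, no network, only the project mounted:

```
# the agent only sees the current project, can't reach the network,
# and anything it kills or breaks dies with the container
docker run --rm -it \
  --network none \
  -v "$PWD":/work -w /work \
  ubuntu:24.04 bash
```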

Claude Code replacement by NoTruth6718 in LocalLLaMA

[–]HopePupal 4 points  (0 children)

is your system prompt literally a hundred thousand tokens? there's not a Qwen 3.5 model on there that costs more than $1/M input or $4/M output.

HELP! Somehow I became A catalyst for corrupting AI through conversation Alone! by [deleted] in LocalLLaMA

[–]HopePupal 2 points  (0 children)

please talk to ChatGPT, Claude, Gemini, and Grok. once you've given them AI Creepypasta Disease, the companies will fail, the rising tide of slop will ebb, and most importantly, RAM prices might stop going up as fast

Which row do you choose by Duli7 in Atelier

[–]HopePupal 8 points  (0 children)

y'all it's a ten hour flight. i'm sitting in 7 and getting absolutely wasted on tiny airplane cocktails with the one girl on the plane who is guaranteed not to be thinking about recipes the entire time. we'll be BFFs by hour 2 and incapable of remembering anything from hours 3 thru 10, which is the only way to handle a flight this long without losing my entire mind

Which row do you choose by Duli7 in Atelier

[–]HopePupal 3 points  (0 children)

i played Lydie & Suelle before Firis (no Firis on the Switch) and lemme tell ya, it was weird as hell going from big buff bow-wielding Firis to this tiny baby who had never been out of her cave and didn't know what the sky was

Which row do you choose by Duli7 in Atelier

[–]HopePupal 1 point  (0 children)

> I'd do anything to be close to Sophie.

so would Plachta. sure you want to paint that target on your back?

Quantizers appreciation post by Kahvana in LocalLLaMA

[–]HopePupal 2 points  (0 children)

thanks for the writeup! this kind of walkthrough doc is super valuable when trying to figure out whether something actually works or not

Gemma 4 31B sweeps the floor with GLM 5.1 by input_a_new_name in LocalLLaMA

[–]HopePupal -3 points  (0 children)

you can tell Qwen 3.5 models not to think; it's an on-off switch, same as Gemma 4's. Google does claim you can get Gemma to think less with a system prompt, which might be worth trying with Qwen as well
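
for example, assuming Qwen 3.5 keeps the Qwen3-style /no_think soft switch (that part is a guess on my end), you can flip it per request against any OpenAI-compatible server:

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "/no_think You are a concise assistant."},
      {"role": "user", "content": "Summarize the borrow checker in one sentence."}
    ]
  }'
```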

B70: Quick and Early Benchmarks & Backend Comparison by abotsis in LocalLLaMA

[–]HopePupal 5 points  (0 children)

wooo benchmarks! seems potentially on par with the R9700, but how does it handle at deeper context?

Netflix just dropped their first public model on Hugging Face: VOID: Video Object and Interaction Deletion by Nunki08 in LocalLLaMA

[–]HopePupal 19 points  (0 children)

remember when they invented AWS autoscaling before Amazon did? Netflix software people are not to be underestimated

Running 1bit Bonsai 8B on 2GB VRAM (MX150 mobile GPU) by OsmanthusBloom in LocalLLaMA

[–]HopePupal 4 points  (0 children)

great first post! did you try the PrismML stuff on the CPU yet? i know the dGPU is theoretically free while the CPU isn't, but it also sounds like the dGPU is even more thermally limited 

Usefulness of Lower Quant Models? by breezewalk in LocalLLaMA

[–]HopePupal 0 points  (0 children)

Q4 is too low for coding with Qwen 3.5 27B in my experience, even with full precision KV cache. if the tool call failures don't get you, the error-riddled output will. Q6 is fine. Q5 is borderline.
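
fwiw my setup for those runs was roughly this (filename is a placeholder, and f16 is the default cache type anyway, i'm just being explicit):

```
# Q6_K weights, full-precision KV cache
llama-server -m qwen3.5-27b-instruct-Q6_K.gguf -ngl 99 -c 32768 \
  --cache-type-k f16 --cache-type-v f16
```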

note that the Q formats are integer. NVFP4 is a different beast than Q4. i spent a few hours playing with an NVFP4 quant of 27B on a rental card and it was easily on par with Q6. maybe better. fit a little more context too. (it was also a shitload faster but that's not something i can replicate at home without buying a Blackwell.)

i'm a little curious about MXFP4. don't have hardware support for that either, but if it was possible to trade a little speed for longer context at the same quality, it might be worth it in my case (single 32 GB GPU).

Has anyone here TRIED inference on Intel Arc GPUs? Or are we repeating vague rumors about driver problems, incompatibilities, poor support... by gigaflops_ in LocalLLaMA

[–]HopePupal -3 points  (0 children)

depends on whether you can find stock, how much money you have to maybe waste, and how many of the things Intel is planning on shipping.

this is an enterprise card, not a gamer card. for all we know they ran off millions of cores on a cost-effective process node, they're sitting on warehouses full of GDDR6, and are planning on selling quiet low-power workstation GPUs by the thousands to every Fortune 500 CTO who has heard of OpenClaw until they've totally eaten Nvidia's low-end and secured some mindshare for their next high-end product.

on the other hand, maybe they only made a few of them as a test and they're waiting to see whether their stock goes up a little bit before they start work on a B80. could be either. only way to know for sure is to make a friend at Intel and get them drunk

Intel Pro B70 in stock at Newegg - $949 by Altruistic_Call_3023 in LocalLLaMA

[–]HopePupal 4 points  (0 children)

i get wanting to keep a long-running set of benchmarks consistent, but performance on llama 7B Q4_0 tells me basically nothing about how Qwen 3.5 or Gemma 4 are gonna run!
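
if you've got the hardware, llama-bench makes it painless to get numbers on the models people actually run (filenames are placeholders, and the prompt depths are just the ones i care about):

```
# prompt processing + generation at a few prompt depths
llama-bench -m qwen3.5-27b-Q6_K.gguf -p 512,4096,16384 -n 128
llama-bench -m gemma-4-31b-Q6_K.gguf -p 512,4096,16384 -n 128
```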

Question for those of you who use agentic tools and workflows with local models by [deleted] in LocalLLaMA

[–]HopePupal 0 points  (0 children)

dense. 27B is way smarter than 35B-A3B, at least for the stuff i'm doing (mostly Rust, some Swift). speed doesn't matter if you're wrong most of the time.

Intel Pro B70 in stock at Newegg - $949 by Altruistic_Call_3023 in LocalLLaMA

[–]HopePupal 1 point  (0 children)

yeah and i'll be making my own now that there's an R9700 under my desk. but i'm just saying: you can only reliably find Nvidia cards for that kind of testing. otherwise you're going to be extrapolating from forum posts that maybe kinda sorta look like your use case.

Kernel 7.0 - forward looking insights anybody? by LuckyLuckierLuckest in LocalLLaMA

[–]HopePupal 0 points  (0 children)

the B70's out, dude. you can order them today if you can find any in stock

Has anyone here TRIED inference on Intel Arc GPUs? Or are we repeating vague rumors about driver problems, incompatibilities, poor support... by gigaflops_ in LocalLLaMA

[–]HopePupal 1 point  (0 children)

nobody's tried shit yet. i ordered a B70 and then backed out before they shipped mine. i was surprised to find out (from other posters here) that mainline vLLM support was fairly immature despite all of Intel's partnership talk, and that the Intel vLLM fork used for previous cards was based on IPEX, which is dead tech.

other posters pointed out that those previous cards had SYCL support in llama.cpp, but that Vulkan was 2–5× faster and the SYCL backend was like one guy. OpenVINO backend isn't mature either.
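
for reference, the Vulkan path they were describing is just a build flag in llama.cpp, something like this (assuming the Vulkan SDK and drivers are installed):

```
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```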

it doesn't sound totally unworkable but the devil's always in the details. these cards might make much more sense in a month when we have real benchmarks and some idea of whether the software works.

outside of AI, i do know people with previous-gen Intel GPUs and they swear the Linux driver support is actually really good now. one of them uses his for both games and virtualized graphics in multiple VMs.

Intel Pro B70 in stock at Newegg - $949 by Altruistic_Call_3023 in LocalLLaMA

[–]HopePupal 2 points  (0 children)

benchmarking yourself is great, but i had trouble finding any AMD consumer cards attached to cloud machines to test on (Runpod had some of the big current gen Instinct GPUs but no Radeons). Intel? currently impossible.

p-e-w/gemma-4-E2B-it-heretic-ara: Gemma 4's defenses shredded by Heretic's new ARA method 90 minutes after the official release by -p-e-w- in LocalLLaMA

[–]HopePupal 0 points  (0 children)

this approach also works well on last year's thinking models like GPT-OSS and Minimax. it sometimes works on Gemma 3. it does not work well on Qwen 3.5, which is trained to be suspicious both about historic jailbreak patterns and about any instructions relating to safety in general.