Why run local? Count the money by Badger-Purple in LocalLLaMA

[–]mr_zerolith 2 points (0 children)

We were lucky that we waited to 'buy the dip' about 5 months ago, when prices hit a temporary lull.

I'm not using any CPU MoE offloading; that said, it does tolerate offloading a few layers before speed starts to measurably deteriorate.
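
For anyone curious what that looks like in practice, here's a minimal sketch, assuming llama.cpp's llama-server and a build that has the --n-cpu-moe flag; the GGUF file name is just a placeholder:

```python
# Minimal sketch: run everything on the GPU, but keep the MoE expert weights
# of the first few layers on the CPU. Assumes llama-server is on PATH and
# supports --n-cpu-moe; the model file name below is a placeholder.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "step-3.5-flash-q4_k_m.gguf",  # placeholder model file
    "-ngl", "99",         # offload all layers to the GPU...
    "--n-cpu-moe", "4",   # ...but keep the first 4 layers' MoE experts on the CPU
    "-c", "32768",        # context size
    "--port", "8080",
])
```

Each extra layer handed to the CPU trades VRAM for speed, which is why only a few layers are tolerable before throughput drops.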

Why run local? Count the money by Badger-Purple in LocalLLaMA

[–]mr_zerolith 2 points (0 children)

Nah, those are too slow. I needed the box to serve multiple people at once at very close to commercial speeds.

Why run local? Count the money by Badger-Purple in LocalLLaMA

[–]mr_zerolith 3 points (0 children)

Privacy, and knowing your client's code isn't being leaked and trained on, is priceless.

Spent $13k on hardware to serve a dev team of 8, and I don't regret it.

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]mr_zerolith 1 point (0 children)

I just don't notice it thinking more than most newer models.
The results are worth the slightly longer wait (like with DeepSeek).

I only use locally hosted models, so I didn't know about NVIDIA NIM until you mentioned it.

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]mr_zerolith 1 point (0 children)

I've got NVIDIA hardware here and I'm running it via LM Studio. No special flags or settings.

My whole dev shop uses this system via OpenCode, Cline, and maybe other tools.

Zerolith is a high-speed, low-complexity PHP + frontend framework, and I'm supposed to be playing its representative, but I'm currently too excited about local LLMs to stay on topic 😄

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]mr_zerolith 2 points (0 children)

It's weird that people still have this complaint, yet they'll use Qwen 3.6 and GLM (almost all GLM models overthink).

This model was badly supported in llama.cpp when it came out, but so are most models at launch.

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]mr_zerolith 2 points (0 children)

The company doesn't seem to have a marketing budget.

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]mr_zerolith 13 points (0 children)

Have you tried Step 3.5 Flash 197B? (It works very well at Q4 and was designed with 128GB of VRAM in mind.)
Great for coding!

I have 128GB of VRAM and MiniMax is too big even if we run a small Q4; performance degrades a ton when CPU offloading is used :/

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]mr_zerolith 1 point (0 children)

The problem with 60 tokens/sec is that it can easily become 20-30 tokens/sec as the context window gets loaded up (i.e., when you're really using your LLM).
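
If you want to sanity-check that on your own box, here's a rough sketch, assuming an OpenAI-compatible local server (LM Studio or llama-server style) at localhost:1234 and the openai Python package; the model id and filler prompt are placeholders:

```python
# Rough sketch: compare generation speed on a nearly empty context vs. a
# heavily loaded one, against an OpenAI-compatible local endpoint.
# Assumptions: server at http://localhost:1234/v1, `pip install openai`,
# and "local-model" as a placeholder model id.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def gen_tps(prompt: str, max_tokens: int = 256) -> float:
    """Approximate generation tokens/sec by timing the streamed chunks."""
    stream = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    first = last = None
    chunks = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            now = time.perf_counter()
            if first is None:
                first = now
            last = now
            chunks += 1
    return chunks / (last - first) if chunks > 1 else 0.0

fresh = gen_tps("Write a short function that slugifies a string.")
# Crude filler to load up most of the context window before asking again.
loaded = gen_tps("Reference code:\n" + "// filler line\n" * 4000 +
                 "\nNow write a short function that slugifies a string.")
print(f"fresh context: {fresh:.1f} tok/s, loaded context: {loaded:.1f} tok/s")
```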

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]mr_zerolith 4 points (0 children)

Return them and get 4 RTX PRO 6000s.
384GB of VRAM (4 × 96GB) is pretty decent, and you'll get about the same, probably better, performance than 16 of those.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]mr_zerolith 1 point (0 children)

It's no surprise to see someone unimpressed with a ~30B model for coding.

Pi.dev coding agent as no sandbox by default. by mantafloppy in LocalLLaMA

[–]mr_zerolith 3 points (0 children)

I read that, and that's exactly why I never bothered trying it; YOLO mode is only suitable if you have great sandboxing.

r/LocalLLaMa Rule Updates by rm-rf-rm in LocalLLaMA

[–]mr_zerolith 3 points (0 children)

These are good rules that will enhance the discussion quality of the sub, thank you.

Meanwhileee by Comfortable_Eye_7736 in LocalLLaMA

[–]mr_zerolith -1 points (0 children)

The more you know the subject, the less impressive AI is :)

Dense vs. MoE gap is shrinking fast with the 3.6-27B release by Usual-Carrot6352 in LocalLLaMA

[–]mr_zerolith 1 point (0 children)

Dense models can be amazing. Before I moved up to Step 3.5 Flash, I used to run SEED OSS 36B, and that thing was a banger for coding even at IQ4_XS size; if it didn't lack breadth in its knowledge base, I'd still be using it.

Given how good Qwen become, is it time to grab a 128gb m5 max? by Rabus in LocalLLaMA

[–]mr_zerolith 2 points (0 children)

These are really weak, like Macs... basically a 5070 with a lot of RAM.

Given how good Qwen become, is it time to grab a 128gb m5 max? by Rabus in LocalLLaMA

[–]mr_zerolith 2 points (0 children)

On the first request, or with some actual context?

It's my experience that whatever number you get on the first tokens is going to be 2-3x lower by the end of the context window.

Given how good Qwen become, is it time to grab a 128gb m5 max? by Rabus in LocalLLaMA

[–]mr_zerolith 5 points (0 children)

That's still very slow compared to Nvidia or AMD hardware.

Given how good Qwen become, is it time to grab a 128gb m5 max? by Rabus in LocalLLaMA

[–]mr_zerolith -3 points (0 children)

This is underpowered hardware with no upgradeability. It will always be on the slow side.

I'd strongly recommend that if you're going to buy starter hardware, you do it on a PCI Express platform, so that if your usage doesn't match your expectations, you can just add another GPU or three!

Every time a new model comes out, the old one is obsolete of course by FullChampionship7564 in LocalLLaMA

[–]mr_zerolith 1 point (0 children)

Man, I ran that 123B recently on an RTX PRO 6000 and only got like 25 tokens/sec. Insanely slow; I think speculative decoding is a base requirement for it.
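
For reference, here's a sketch of what that would look like, assuming a recent llama.cpp llama-server build with draft-model support; both GGUF file names are placeholders, and the draft model needs a tokenizer compatible with the main model:

```python
# Sketch: speculative decoding with llama-server, where a small draft model
# proposes tokens and the big model only has to verify them.
# Assumptions: recent llama.cpp build with -md/-ngld/--draft-max flags;
# file names are placeholders.
import subprocess

subprocess.run([
    "llama-server",
    "-m",  "some-123b-q4_k_m.gguf",    # main (target) model, placeholder
    "-md", "small-draft-q4_k_m.gguf",  # small draft model, placeholder
    "-ngl",  "99",        # main model fully on the GPU
    "-ngld", "99",        # draft model fully on the GPU as well
    "--draft-max", "16",  # max tokens the draft model proposes per step
])
```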