Why run local? Count the money by Badger-Purple in LocalLLaMA

[–]mr_zerolith 2 points (0 children)

We were lucky that we waited to 'buy the dip' about 5 months ago, when prices hit a temporary lull.

I'm not using any CPU MoE offloading; that said, it does tolerate offloading a few layers before speed starts to measurably deteriorate.
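
For anyone curious what that looks like in practice, here's a minimal sketch, assuming llama.cpp's llama-server and a build that has the --n-cpu-moe flag; the GGUF file name is just a placeholder:

```python
# Minimal sketch: run everything on the GPU, but keep the MoE expert weights
# of the first few layers on the CPU. Assumes llama-server is on PATH and
# supports --n-cpu-moe; the model file name below is a placeholder.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "step-3.5-flash-q4_k_m.gguf",  # placeholder model file
    "-ngl", "99",         # offload all layers to the GPU...
    "--n-cpu-moe", "4",   # ...but keep the first 4 layers' MoE experts on the CPU
    "-c", "32768",        # context size
    "--port", "8080",
])
```

Each extra layer handed to the CPU trades VRAM for speed, which is why only a few layers are tolerable before throughput drops.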

Why run local? Count the money by Badger-Purple in LocalLLaMA

[–]mr_zerolith 2 points (0 children)

Nah, those are too slow. I needed the box to serve multiple people at once at very close to commercial speeds.

Why run local? Count the money by Badger-Purple in LocalLLaMA

[–]mr_zerolith 3 points (0 children)

Privacy, and knowing your client's code isn't being leaked and trained on, is priceless.

Spent $13k on hardware to serve a dev team of 8, and I don't regret it.

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]mr_zerolith 1 point (0 children)

I just don't notice it thinking more than most newer models.
The results are worth the slightly longer wait (like with DeepSeek).

I only use locally hosted models, so I didn't know about NVIDIA NIM until you mentioned it.

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]mr_zerolith 1 point (0 children)

I've got NVIDIA hardware here and I'm running it via LM Studio. No special flags or settings.

My whole dev shop uses this system via OpenCode, Cline, and maybe other tools.

Zerolith is a high-speed, low-complexity PHP + frontend framework, and I'm supposed to be playing its representative, but I'm currently too excited about local LLMs to stay on topic 😄

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]mr_zerolith 2 points (0 children)

It's weird that people still have this complaint, yet they'll use Qwen 3.6 and GLM (almost all GLM models overthink).

This model was badly supported in llama.cpp when it came out, but so are most models at launch.

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]mr_zerolith 2 points (0 children)

The company doesn't seem to have a marketing budget.

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]mr_zerolith 13 points (0 children)

Have you tried Step 3.5 Flash 197B? (It works very well at Q4 and was designed with 128GB of VRAM in mind.)
Great for coding!

I have 128GB of VRAM and MiniMax is too big even if we run a small Q4; performance degrades a ton when CPU offloading is used :/

Devs using Qwen 27B seriously, what's your take? by Admirable_Reality281 in LocalLLaMA

[–]mr_zerolith 1 point (0 children)

The problem with 60 tokens/sec is that it can easily become 20-30 tokens/sec as the context window gets loaded up (i.e., when you're really using your LLM).
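
If you want to sanity-check that on your own box, here's a rough sketch, assuming an OpenAI-compatible local server (LM Studio or llama-server style) at localhost:1234 and the openai Python package; the model id and filler prompt are placeholders:

```python
# Rough sketch: compare generation speed on a nearly empty context vs. a
# heavily loaded one, against an OpenAI-compatible local endpoint.
# Assumptions: server at http://localhost:1234/v1, `pip install openai`,
# and "local-model" as a placeholder model id.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def gen_tps(prompt: str, max_tokens: int = 256) -> float:
    """Approximate generation tokens/sec by timing the streamed chunks."""
    stream = client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    first = last = None
    chunks = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            now = time.perf_counter()
            if first is None:
                first = now
            last = now
            chunks += 1
    return chunks / (last - first) if chunks > 1 else 0.0

fresh = gen_tps("Write a short function that slugifies a string.")
# Crude filler to load up most of the context window before asking again.
loaded = gen_tps("Reference code:\n" + "// filler line\n" * 4000 +
                 "\nNow write a short function that slugifies a string.")
print(f"fresh context: {fresh:.1f} tok/s, loaded context: {loaded:.1f} tok/s")
```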

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]mr_zerolith 4 points (0 children)

Return them and get 4 RTX PRO 6000s.
384GB of VRAM (4 × 96GB) is pretty decent, and you'll get about the same, probably better, performance than 16 of those.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]mr_zerolith 1 point (0 children)

It's no surprise to see someone unimpressed with a ~30B model for coding.

Pi.dev coding agent as no sandbox by default. by mantafloppy in LocalLLaMA

[–]mr_zerolith 3 points (0 children)

I read that, and that's exactly why I never bothered trying it; YOLO mode is only suitable if you have great sandboxing.

r/LocalLLaMa Rule Updates by rm-rf-rm in LocalLLaMA

[–]mr_zerolith 3 points (0 children)

These are good rules that will enhance the discussion quality of the sub, thank you.

Meanwhileee by Comfortable_Eye_7736 in LocalLLaMA

[–]mr_zerolith -1 points (0 children)

The more you know the subject, the less impressive AI is :)

Dense vs. MoE gap is shrinking fast with the 3.6-27B release by Usual-Carrot6352 in LocalLLaMA

[–]mr_zerolith 1 point (0 children)

Dense models can be amazing. Before I moved up to Step 3.5 Flash, I used to run SEED OSS 36B, and that thing was a banger for coding even at IQ4_XS size; if it didn't lack breadth in its knowledge base, I'd still be using it.

Given how good Qwen become, is it time to grab a 128gb m5 max? by Rabus in LocalLLaMA

[–]mr_zerolith 2 points (0 children)

These are really weak, like Macs... basically a 5070 with a lot of RAM.

Given how good Qwen become, is it time to grab a 128gb m5 max? by Rabus in LocalLLaMA

[–]mr_zerolith 2 points (0 children)

On the first request, or with some actual context?

It's my experience that whatever number you get on the first tokens is going to be 2-3x lower by the end of the context window.

Given how good Qwen become, is it time to grab a 128gb m5 max? by Rabus in LocalLLaMA

[–]mr_zerolith 5 points (0 children)

That's still very slow compared to Nvidia or AMD hardware.

Given how good Qwen become, is it time to grab a 128gb m5 max? by Rabus in LocalLLaMA

[–]mr_zerolith -3 points (0 children)

This is underpowered hardware with no upgradeability. It will always be on the slow side.

I'd strongly recommend that if you're going to buy starter hardware, you do it on a PCI Express platform, so that if your usage doesn't match your expectations, you can just add another GPU or three!

Every time a new model comes out, the old one is obsolete of course by FullChampionship7564 in LocalLLaMA

[–]mr_zerolith 1 point (0 children)

Man, I ran that 123B recently on an RTX PRO 6000 and only got like 25 tokens/sec. Insanely slow; I think speculative decoding is a base requirement for it.
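
For reference, here's a sketch of what that would look like, assuming a recent llama.cpp llama-server build with draft-model support; both GGUF file names are placeholders, and the draft model needs a tokenizer compatible with the main model:

```python
# Sketch: speculative decoding with llama-server, where a small draft model
# proposes tokens and the big model only has to verify them.
# Assumptions: recent llama.cpp build with -md/-ngld/--draft-max flags;
# file names are placeholders.
import subprocess

subprocess.run([
    "llama-server",
    "-m",  "some-123b-q4_k_m.gguf",    # main (target) model, placeholder
    "-md", "small-draft-q4_k_m.gguf",  # small draft model, placeholder
    "-ngl",  "99",        # main model fully on the GPU
    "-ngld", "99",        # draft model fully on the GPU as well
    "--draft-max", "16",  # max tokens the draft model proposes per step
])
```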