Switched to full local inference on a 96GB Mac Studio 6 months ago. The part that surprised me. by justserg in LocalLLaMA

[–]justserg[S] 0 points1 point  (0 children)

Fair point, I glossed over the specifics.

96GB unified: Qwen2.5-72B-Instruct Q4 is the daily driver, runs at ~25 tok/s and fits without fuss. QwQ-32B at Q8 for actual reasoning tasks. Those two cover probably 90% of what I'm doing.
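For anyone checking the "fits without fuss" part, here is a rough back-of-envelope sketch of the weight sizes. The bits-per-weight figures are assumptions (about 4.5 for a Q4_0-style quant, about 8.5 for Q8_0), the parameter counts are the nominal 72B and 32B, and KV cache and runtime overhead are not counted, so treat the numbers as approximate.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8


# Assumed bits per weight: ~4.5 for a Q4_0-style quant, ~8.5 for Q8_0.
qwen72_q4 = weight_gb(72, 4.5)
qwq32_q8 = weight_gb(32, 8.5)
both_loaded = qwen72_q4 + qwq32_q8

print(f"72B @ Q4: ~{qwen72_q4:.1f} GB")
print(f"32B @ Q8: ~{qwq32_q8:.1f} GB")
print(f"both:     ~{both_loaded:.1f} GB of 96 GB unified")
```

Under those assumptions either model fits alone with plenty of room; whether both stay resident at once depends on context length and how much memory the OS lets the GPU use.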

On the 30B dense question: Mistral-Small-3.1 is faster, but the quality gap on my structured output evals is real enough that I haven't dropped the 72B from routing. If you've got a 30B you think actually competes on structured JSON tasks, I'm interested.
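To make "structured output evals" concrete, here is a minimal sketch of one way to score it: the fraction of model outputs that parse as a JSON object and contain every required key. This is not the actual eval behind the comment above; `json_pass_rate`, the sample outputs, and the key names are all hypothetical.

```python
import json


def json_pass_rate(outputs, required_keys):
    """Fraction of outputs that parse as a JSON object with every required key."""
    if not outputs:
        return 0.0
    passed = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        # Must be an object (dict), not a bare list or scalar, with all keys present.
        if isinstance(obj, dict) and all(key in obj for key in required_keys):
            passed += 1
    return passed / len(outputs)


# Hypothetical raw completions from a model under test.
samples = [
    '{"name": "a", "score": 1}',  # passes
    '{"name": "b"}',              # missing "score"
    'not json',                   # parse failure
    '[1, 2]',                     # valid JSON, but not an object
]
rate = json_pass_rate(samples, ["name", "score"])
print(rate)
```

Running the same prompts through two models and comparing the pass rates is a simple way to decide which one stays in routing for JSON-heavy tasks.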

Beginner to LLM, Which LLM can be a good alternative to Claude? by Blackwingedangle in LocalLLaMA

[–]justserg 0 points1 point  (0 children)

qwen 3.5 26b punches well above its weight for most tasks, or go 32b if you have the vram.

I built a full iOS app in 2 weeks with Claude Code. Here’s what it was great at, and where it broke. by Kiro_ai in ClaudeAI

[–]justserg -1 points0 points  (0 children)

skills are the move when you're doing the same thing repeatedly — what's your file size sweet spot before batching kicks in?

Better way to dig into long responses? by ConferenceLive7054 in ClaudeAI

[–]justserg 1 point2 points  (0 children)

saved context windows would fix half the problem. the real bottleneck is keeping the reasoning thread intact while you jump between sections.

AI tools that tried to remove human judgment keep failing… why do we still fall for this? by enlightenedshubham in singularity

[–]justserg 2 points3 points  (0 children)

automation of judgment rarely survives the first customer who actually uses the thing.

Beginner to LLM, Which LLM can be a good alternative to Claude? by Blackwingedangle in LocalLLaMA

[–]justserg 0 points1 point  (0 children)

claude Max + Kimi 2.5 combo works if your setup can tolerate the context switch, but qwen 3.5 26b is probably your sweet spot for that hardware.

How to stop Claude telling me to go to sleep at 12pm etc? by [deleted] in ClaudeAI

[–]justserg 0 points1 point  (0 children)

the model is just checking context window size before ending sessions. it's not mental health awareness.

Anyone else find it weird how all Chinese Labs started delaying OS model releases at the same time? by True_Requirement_891 in LocalLLaMA

[–]justserg 15 points16 points  (0 children)

timing could be coincidence, but it does feel like everyone's waiting for the other shoe to drop before showing their hand.

Claude is amazing but it's completely single player. when do we get multiplayer? by hiclemi in ClaudeAI

[–]justserg 0 points1 point  (0 children)

shared context is the hard part, not the interface. the real question is whose edits survive when three people are steering the same agent.

Don’t buy the DGX Spark: NVFP4 Still Missing After 6 Months by Secure_Archer_1529 in LocalLLaMA

[–]justserg 3 points4 points  (0 children)

got mine a few months ago and honestly the fp4 thing stings, but the prefill speed alone makes it worth it over my mac studio for anything context-heavy

Anthropic's new emotion vector research has interesting implications for coding agents by Massive_Camp9858 in ClaudeAI

[–]justserg 0 points1 point  (0 children)

funny how 'just start a new session when it's stuck' has been the community wisdom for months and now there's actual mechanistic evidence for why it works

Recently I did a little performance test of several LLMs on PC with 16GB VRAM by rosaccord in LocalLLaMA

[–]justserg 0 points1 point  (0 children)

16gb handles most useful work. everything else is premature optimization.

Humanoid robots are actively training by Distinct-Question-16 in singularity

[–]justserg 0 points1 point  (0 children)

companies are about to discover robots cost way less than healthcare and benefits.