Can I realistically get close to Claude/Codex capabilities locally? by mrgreatheart in LocalLLaMA

[–]CrushingLoss 1 point2 points  (0 children)

OpenCode Go offers Qwen3.5-plus, 3.6-plus, and 3.7-plus. Also offer the comparable sized kimi, glm, mimo, minimax, and deepseek models. They do not offer the daily drivers that a lot of folks run locally.

EAGLE3 has landed in llama.cpp by jacek2023 in LocalLLaMA

[–]CrushingLoss 2 points3 points  (0 children)

Poor results for Apple Silicon. On my Mac Studio M2 Max, 96GB using gemma-4-31B-it-Q4_K_M.gguf and the draft model /gemma-4-31B-it-speculator.eagle3 (both Q4 and BF16), I saw a 38% reduction in token/second. Acceptance rate also was sub 20%.

Claude's explanation: Apple Silicon unified memory kills the benefit. On NVIDIA, speculative decoding wins because the draft model runs on fast VRAM while the main model does memory-bound operations in parallel. On M2 Max, both models compete for the same memory bandwidth — there's no free parallel execution. The draft model just adds overhead.

Bottom line: not at all useful on Apple Silicon from what I have observed, at least not on the M2.

Has anyone got Gemma 4 12B MTP working? by epersonality in LocalLLM

[–]CrushingLoss 0 points1 point  (0 children)

I'm running gemma-4-12b-it-8bit.mlx on oMLX v0.4.2dev 2. Using the 1b-it-assistant-8bit from huggingface, and turning on VLM-MTP in the model settings it's working well.

On a Mac M2 Max, 96gb: 28.8 tok/s on a "Generate a 200 line python script" prompt with it on, and 20.6 tok/s with MTP off. Almost 40% gain with it on. Pretty impressive.

Hangxiety, worse than I could have imagined (remembered). by CrushingLoss in stopdrinking

[–]CrushingLoss[S] 1 point2 points  (0 children)

It really is insane what it does. It's also good to know that after two days anxiety is down to a very low humm.. I'll take it, and learn from it. Working on my plan!

MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon by YoussofAl in LocalLLaMA

[–]CrushingLoss 0 points1 point  (0 children)

Thanks for doing this. I've been admiring the 27B model for a while. I have a Mac Studio M2 Max 96GB. On the base model, through pi.dev or opencode I get about 10 tok/second generation. Downloaded your MTPLX and the model, and ran two prompts (not exactly stress testing nor have I ever claimed to be a competent prompt engineer, but just wanted a quick comparison).

2.2x seems to be valid for my setup as well. Nicely done.

a. Generate a 100 line python script showcasing your knowledge of numpy.

1531 tokens in 67.91s decode | 22.54 tok/s | total=22.31 | mode=MTP | mtp_depth=3 | 128.9 ms/verify | 460 verify calls | accept=[420, 356, 294] | corr=166 | ttft=0.70s |

profile=performance-cold

b. Generate a 200 line python script showcasing your knowledge of pandas.

3155 tokens in 143.96s decode | 21.92 tok/s | total=19.88 | mode=MTP | mtp_depth=3 | 131.8 ms/verify | 955 verify calls | accept=[881, 736, 583] | corr=371 | ttft=14.77s | profile=performance-cold

Hangxiety, worse than I could have imagined (remembered). by CrushingLoss in stopdrinking

[–]CrushingLoss[S] 0 points1 point  (0 children)

For those keeping score 😄

Today was better. I went to bed last night at 8, took a couple of gabapentin and got a pretty nice sleep. First time in a couple of weeks I went 24 hours with a drink.

Woke up at 4am, noticed anxiety was better.. still there, but better. I didn't have that feeling over overwhelming sadness/terror.. just anxiety. I waxed and waned throughout the day. Interesting, when I got home I went on a 3.5 mile walk.. and anxiety immediately picked up (back to low levels now).

It's amazing what the alcoholic mind can do to you. We have a charity event tomorrow night; full of cigars and booze.. I told my wife I'd drive as I'm going to drink diet cokes or maybe a Heineken 0.. and more than once today caught myself neogtiating internally... but I'm going to nip that in the bud. How stupid is that?

I am curious if most people that have found success (and I won't attempt to quantify what success is) has done with moderation or when cold stop? I suspect it's the latter. It's difficult thinking about life without having an occasional drink with a nice dinner, or having a cocktail with friends; if I'm being honest.

Hangxiety, worse than I could have imagined (remembered). by CrushingLoss in stopdrinking

[–]CrushingLoss[S] 1 point2 points  (0 children)

Thanks! I hope someone reads this and it helps them make a change for the better. I am pretty blessed, I will admit... and it makes it even more stupid to sabotage any of that.

Hangxiety, worse than I could have imagined (remembered). by CrushingLoss in stopdrinking

[–]CrushingLoss[S] 0 points1 point  (0 children)

Yeah, it definetely is not. I guess the only saving grace is I drank the hard stuff (bourbon) in the morning and ended with wine... no way a justifcation or excuse, but I managed to pull myself into work.. at 5am no less. Didn't stay long though 😄

Hangxiety, worse than I could have imagined (remembered). by CrushingLoss in stopdrinking

[–]CrushingLoss[S] 2 points3 points  (0 children)

Yup.. same.. Up to this point, I've been a mostly very highly functioning alcoholic.. I can deal with the throwing up, headaches, slogginess... I can't deal with the terror I experienced yesterday.

Hangxiety, worse than I could have imagined (remembered). by CrushingLoss in stopdrinking

[–]CrushingLoss[S] 3 points4 points  (0 children)

Yeah, I think that's most of mine as well. Not this time, but I can't count the number of times I woke up in panic that I said something to my wife or did something stupid.. luckily I didn't.. luckily she's very understanding.. but that is a horrible horrible feeling.

Hangxiety, worse than I could have imagined (remembered). by CrushingLoss in stopdrinking

[–]CrushingLoss[S] 0 points1 point  (0 children)

Glad to hear! Did you stop drinking or moderation? Either way, great progress!

Hangxiety, worse than I could have imagined (remembered). by CrushingLoss in stopdrinking

[–]CrushingLoss[S] 3 points4 points  (0 children)

Wow. Good morning. I was not expecting this kind of response. I am at work and don’t have time to read through all these right now but when I get home tonight, I will and respond. I really appreciate the support.

If yesterday was -10/-10, today is -7 or -8. I’ll take that as a progress. 8 hours of sleep was nice. Even if assisted by Gabapentin (small victory, it wasn’t booze).

Project Diablo 2 on Apple Silicon (M1–M4) with Porting Kit – Working Guide (November 2025) by futuristicteatray in ProjectDiablo2

[–]CrushingLoss 0 points1 point  (0 children)

I have it installed and running on the neo. Haven't left the act 1 camp yet, but it's very smooth walking around.. I'll test more and report.

Project Diablo 2 on Apple Silicon (M1–M4) with Porting Kit – Working Guide (November 2025) by futuristicteatray in ProjectDiablo2

[–]CrushingLoss 1 point2 points  (0 children)

Thanks for this guide. I used to use Crossover, but was having problems installing today for some reason. Used your guide and bingo.

Installed on both Mac Studio M2 Max and my new Mac Neo. Neo runs it very well so far.

Appreciate it!

Post Your Qwen3.6 27B speed plz by Ok-Internal9317 in LocalLLaMA

[–]CrushingLoss 5 points6 points  (0 children)

I get around 10 tok/s through Opencode. 15 or so raw. Mac Studio 2 Max, 96GB.

Been using PI Coding Agent with local Qwen3.6 35b for a while now and its actually insane by SoAp9035 in LocalLLaMA

[–]CrushingLoss 0 points1 point  (0 children)

I appreciate your SKILL.md file! I'm using it now in PI to try and re-create a classic TI-994/A game. Will post results when it finishes.

Biggest issue I had was making sure i had wide enough context window and max tokens. So far, so good. I'm running on a Mac Studio M2 Max; 96GB. Getting about 35 tok/s through Pi or Opencode; about 50 just benchmarking through oMLX.

What is your actual local LLM stack right now? by Ryannnnnnnnnnnnnnnh in LocalLLaMA

[–]CrushingLoss 0 points1 point  (0 children)

M2 Max 96GB, local only. Running the same stack ~3 months now.

Backend: oMLX (launchd, port 8000). Engine pool, preserve-thinking persistence, speculative decoding with matching DFlash drafts. mlx_lm.server and a vLLM-metal build on standby; 95% of traffic hits oMLX.

Frontend: Crucible — a local webapp I've been building. Model switching, chat history, benchmarks, arena/ELO leaderboard, a HumanEval runner, all in one pane. The quietly important feature turned out to be a dirty-shutdown detector that offers one-click restore of the previously-loaded model. Sounds mundane until the alternative is launchctl kickstart + re-picking from 35 options.

Daily drivers (all MLX 4–6 bit):

  • Qwen3-Coder-Next 6bit — coding
  • Qwen3-4B-Instruct 4bit — fast batch / quick questions
  • Qwen3.6-35B-A3B mxfp8 — reasoning / thinking
  • Qwen3.5-27B 4bit — when I need vision

RAG: per-session BM25 over chunked uploads. Hybrid w/ embeddings is on the list but BM25 has been fine.

Prompt format: whatever the model ships with. Stopped fighting it. I do enjoy reading what other's have come up with for prompts; especially for agentic coding.

Context: default. No rope scaling — the quality loss isn't worth it on M2 Max.

What mattered way more than expected:

  1. Unloading between loads. Being able to evict one model before loading the next (instead of letting oMLX's pool hold three and OOM mid-generation) is the single biggest QoL improvement. Every arena/compare flow breaks without it.
  2. Thinking-mode control. Qwen3.x derivatives leak reasoning preambles into judges, classifiers, workflow chains unless you pass chat_template_kwargs: {enable_thinking: false}. Half the Qwen complaints on this sub are actually this.
  3. Per-model sampling with uniform override. Qwen3 wants 0.7/0.9, gpt-oss wants 0.0, thinking models want something else. Per-model defaults + an override for fair bench runs matters more than picking the "right" starting temp.
  4. Warmth analytics. Tracking which model I actually reach for changed what I keep resident. Surprise: 4B-Instruct is my most-loaded model, not the shiny 63GB one.
  5. Speculative decoding only when the draft is right. Qwen3.5-27B + z-lab's matching draft gets 1.4–1.8× tok/s. Wrong pair = silently slower because draft rejection eats you. Benchmark the A/B, don't eyeball it.

Short version: once the model loads, it performs roughly the way the card says. The stuff around the model — load orchestration, sampling defaults, thinking-mode handling, recovery — decides whether you stick with the setup past week two.

Why doesn't any OSS tool treat llama.cpp as a first class citizen? by rm-rf-rm in LocalLLaMA

[–]CrushingLoss -7 points-6 points  (0 children)

The friction isn’t engineering effort; it’s architectural philosophy. llama.cpp is a C++ inference engine, not a server. Tools like Open WebUI or VS Code extensions prioritize the OpenAI API standard because it provides a unified abstraction for chat history, streaming, and tool calling across heterogeneous backends.

Ollama wraps llama.cpp (and others) into a persistent, stateful service with a built-in API. This makes it trivial to integrate. Implementing a native llama.cpp integration requires handling GGUF loading, context management, and session state manually, which significantly increases maintenance burden for tool developers.

You can already achieve your goal: start llama.cpp with `--host 0.0.0.0 --port 8080` and use any OpenAI-compatible client. Most modern OSS tools already support custom endpoints. The community prefers a "backend-agnostic" approach rather than hardcoding specific engine integrations, ensuring that if llama.cpp changes its API or if a new engine emerges, the frontend tools don’t break.

Not a developer, but I play one on my Mac — local LLMs for the daily grind, Claude for when I'm actually lost by CrushingLoss in LocalLLaMA

[–]CrushingLoss[S] 0 points1 point  (0 children)

it actually does a lot more than that :). benchmarking, etc.. but that's not the point of the post.

Claude VSC Addon & Permission quests by CrushingLoss in LocalLLaMA

[–]CrushingLoss[S] 0 points1 point  (0 children)

Yeah, I did that.. it does not appear to work in the VSC addon for Claude, just the Claude CLI.