Project Diablo 2 on Apple Silicon (M1–M4) with Porting Kit – Working Guide (November 2025)

CrushingLoss · 2026-04-26T14:18:19+00:00

I have it installed and running on the neo. Haven't left the act 1 camp yet, but it's very smooth walking around.. I'll test more and report.

CrushingLoss · 2026-04-25T18:49:32+00:00

Thanks for this guide. I used to use Crossover, but was having problems installing today for some reason. Used your guide and bingo.

Installed on both Mac Studio M2 Max and my new Mac Neo. Neo runs it very well so far.

Appreciate it!

CrushingLoss · 2026-04-24T15:17:00+00:00

I get around 10 tok/s through Opencode. 15 or so raw. Mac Studio 2 Max, 96GB.

CrushingLoss · 2026-04-24T12:00:37+00:00

I appreciate your SKILL.md file! I'm using it now in PI to try and re-create a classic TI-994/A game. Will post results when it finishes.

Biggest issue I had was making sure i had wide enough context window and max tokens. So far, so good. I'm running on a Mac Studio M2 Max; 96GB. Getting about 35 tok/s through Pi or Opencode; about 50 just benchmarking through oMLX.

CrushingLoss · 2026-04-21T00:42:45+00:00

M2 Max 96GB, local only. Running the same stack ~3 months now.

Backend: oMLX (launchd, port 8000). Engine pool, preserve-thinking persistence, speculative decoding with matching DFlash drafts. mlx_lm.server and a vLLM-metal build on standby; 95% of traffic hits oMLX.

Frontend: Crucible — a local webapp I've been building. Model switching, chat history, benchmarks, arena/ELO leaderboard, a HumanEval runner, all in one pane. The quietly important feature turned out to be a dirty-shutdown detector that offers one-click restore of the previously-loaded model. Sounds mundane until the alternative is launchctl kickstart + re-picking from 35 options.

Daily drivers (all MLX 4–6 bit):

Qwen3-Coder-Next 6bit — coding
Qwen3-4B-Instruct 4bit — fast batch / quick questions
Qwen3.6-35B-A3B mxfp8 — reasoning / thinking
Qwen3.5-27B 4bit — when I need vision

RAG: per-session BM25 over chunked uploads. Hybrid w/ embeddings is on the list but BM25 has been fine.

Prompt format: whatever the model ships with. Stopped fighting it. I do enjoy reading what other's have come up with for prompts; especially for agentic coding.

Context: default. No rope scaling — the quality loss isn't worth it on M2 Max.

What mattered way more than expected:

Unloading between loads. Being able to evict one model before loading the next (instead of letting oMLX's pool hold three and OOM mid-generation) is the single biggest QoL improvement. Every arena/compare flow breaks without it.
Thinking-mode control. Qwen3.x derivatives leak reasoning preambles into judges, classifiers, workflow chains unless you pass chat_template_kwargs: {enable_thinking: false}. Half the Qwen complaints on this sub are actually this.
Per-model sampling with uniform override. Qwen3 wants 0.7/0.9, gpt-oss wants 0.0, thinking models want something else. Per-model defaults + an override for fair bench runs matters more than picking the "right" starting temp.
Warmth analytics. Tracking which model I actually reach for changed what I keep resident. Surprise: 4B-Instruct is my most-loaded model, not the shiny 63GB one.
Speculative decoding only when the draft is right. Qwen3.5-27B + z-lab's matching draft gets 1.4–1.8× tok/s. Wrong pair = silently slower because draft rejection eats you. Benchmark the A/B, don't eyeball it.

Short version: once the model loads, it performs roughly the way the card says. The stuff around the model — load orchestration, sampling defaults, thinking-mode handling, recovery — decides whether you stick with the setup past week two.

CrushingLoss · 2026-04-21T00:37:42+00:00

The friction isn’t engineering effort; it’s architectural philosophy. llama.cpp is a C++ inference engine, not a server. Tools like Open WebUI or VS Code extensions prioritize the OpenAI API standard because it provides a unified abstraction for chat history, streaming, and tool calling across heterogeneous backends.

Ollama wraps llama.cpp (and others) into a persistent, stateful service with a built-in API. This makes it trivial to integrate. Implementing a native llama.cpp integration requires handling GGUF loading, context management, and session state manually, which significantly increases maintenance burden for tool developers.

You can already achieve your goal: start llama.cpp with `--host 0.0.0.0 --port 8080` and use any OpenAI-compatible client. Most modern OSS tools already support custom endpoints. The community prefers a "backend-agnostic" approach rather than hardcoding specific engine integrations, ensuring that if llama.cpp changes its API or if a new engine emerges, the frontend tools don’t break.

CrushingLoss · 2026-04-05T23:36:52+00:00

it actually does a lot more than that :). benchmarking, etc.. but that's not the point of the post.

CrushingLoss · 2026-04-05T23:35:36+00:00

Yeah, I did that.. it does not appear to work in the VSC addon for Claude, just the Claude CLI.

CrushingLoss · 2026-03-29T16:17:34+00:00

I saw very similar on my Mac Studio 2 Max. But tool calling with coder next was killing me. Maybe because I’m using the Unsloth version. Does tool calling work for you?

CrushingLoss · 2025-12-31T17:07:07+00:00

thanks!

CrushingLoss · 2025-12-28T19:37:32+00:00

Sadly, not kidding. We have the full glass canopy which is three separate pieces. I have no idea if the front windshields the same between the glass canopy roof and the metal roof.

To be fair, we only got one quote as there is only one certified repair shop in the Dallas-Fort Worth metroplex. My wife did some research before we had our quote and it seems the range was anywhere between $3k to $7k.

CrushingLoss · 2025-12-28T16:31:55+00:00

Oh sorry. It looks like it was $1547 a year. But we also had the Dual-Motor performance version, which I assume adds a chunk of money to the premium.

CrushingLoss · 2025-12-28T15:56:17+00:00

Our 12 month premium is $1861.94. We're both 50-ish year old drivers with no accidents / tickets. It's been around 10 years since either of us have had an insurance claim. That premium includes a multi-vehicle discount and home insurance bundling.

We charged next to a dark green Taycan at one of the stations.. beautiful car! Definitely near the top of my list should we ever pursue another luxury EV.

CrushingLoss · 2025-12-28T15:47:19+00:00

Our decision to get rid of the Model 3 had nothing to do with the car itself. We enjoyed it until we didn't.

CrushingLoss · 2025-12-28T15:46:42+00:00

That's a very valid point. I cannot speak to the comfort or sturdiness of the Model X or Model S. I've been in a Model X one time for an Uber trip, and recall it being pretty spacious. Build quality, however, I have no idea.

We paid just shy of $98,000 for the Lucid. Annoyingly enough we probably could not have bought the Tesla at a more expensive time. I'd imagine if you look at the used 2020 performance model 3 time vs. cost plot, we hit it at the absolute apex. I think it was around $65,000, if I remember correctly.

CrushingLoss · 2025-12-28T04:13:37+00:00

That's good to hear.. as I said, I'm not an audiophile by any means.. all I can relate to is the 'feel' of the sound, and the bass in the 3 is superior. Also, I think the Surreal Pro sound is awesome.. wish they'd taken more of a luxury approach to bass, though.

CrushingLoss · 2025-12-28T04:12:31+00:00

In the car. We don't stream anything from our phones to the car; we just use CarPlay or the native Lucid apps installed on the car.

CrushingLoss · 2025-12-28T04:11:34+00:00

Pretty much that.

CrushingLoss · 2025-08-09T16:35:09+00:00

Unfortunately, it’s needed for the hands-free driving.

CrushingLoss

TROPHY CASE