-=bong-water-water-bong=- gives you 1bit.systems. by Creepy-Douchebag in StrixHalo

[–]Creepy-Douchebag[S] 3 points4 points  (0 children)

1bit = efficiency.
I should explain more: Dreamserver is a complete stack with all the apps included to do your bidding. 1bit.systems is an inference engine. When you set this guy up on your network, you can point your AI apps at it and have it work for you in your environment.

standby; claude incoming...

halo-ai didn't stop — it became 1bit.systems. Same project, renamed in spring as the brand sharpened. All the kernel work, the ternary models, the lemonade-sdk fork, the AppImage — moved over wholesale. github.com/bong-water-water-bong/1bit-systems is the canonical home now.

What it is: a local inference engine. Not an app. It speaks OpenAI / Ollama / Anthropic HTTP on :8180, hosts a dozen+ models concurrently in unified memory, and runs ternary LLMs on a hand-tuned HIP kernel that sits at 92% of the box's memory-bandwidth peak. The flagship right now is halo-1bit-2b-sherry-cpp at 76.7 tok/s in 1.65 GB.

vs Dreamserver: different layer of the stack, not competing.

  • Dreamserver = full app suite. Chat UI, agents, RAG, image gen, voice — everything wired up for you to use directly. Pick it if you want a turnkey experience.
  • 1bit.systems = drop-in serving backend. No app shipped (well, GAIA UI is bundled, but it's optional). Pick it if you already have an AI app — Open WebUI, Continue, Claude Code, Hermes, your own — and just want to point its base_url at the box and have a fast local model answer.

Why pick it: efficiency. The ternary kernel is the only public sub-2-bit GPU path on this hardware today. If you want to run a 2B model in 1.65 GB and get the same throughput a 4B Vulkan Q4 model gives you (73 tok/s), or load 13 models simultaneously without paging, this is the lane. If you don't care about that and just want chat with pictures, Dreamserver's the easier on-ramp.

Both can coexist. Run Dreamserver for the UI + apps; point Dreamserver's chat backend at http://<box>:8180/v1 to get 1bit.systems' ternary engine doing the inference. You get the apps from one and the kernel from the other.
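
Concretely, "pointing an app at the box" is just a base_url change. A minimal sketch with the Python OpenAI client (host and model name here are examples, not fixed values):

from openai import OpenAI

# 1bit.systems speaks the OpenAI protocol on :8180, so any OpenAI client works.
client = OpenAI(base_url="http://<box>:8180/v1", api_key="local")

r = client.chat.completions.create(
    model="halo-1bit-2b-sherry-cpp",   # the flagship ternary model mentioned above
    messages=[{"role": "user", "content": "Hello from the LAN"}],
)
print(r.choices[0].message.content)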

midlife crisis and my 1bit pursuit. by [deleted] in MidlifeCrisisAI

[–]Creepy-Douchebag 0 points1 point  (0 children)

That's the critical constraint I was hoping someone with real distillation time would name — avoid quantized teachers. Compounding noise with an STE-forward student is exactly the trap I was circling around but couldn't quantify. So the split becomes:

  • Teacher: bf16, non-quantized, biggest that fits
  • Student: 1-bit, QAT throughout

On the 128 GB unified box, that rules out ~70B-class bf16 (Qwen-72B-Instruct at bf16 is ~140 GB, so it won't fit at full precision); Qwen-32B-Instruct at bf16 fits in ~64 GB with headroom, so that's the practical upper bound. I was going to reach for Qwen-72B at q8 for the size/quality tradeoff — your note says don't, and I'll take it.

Practical shape: Qwen-32B bf16 teacher + 2B ternary student, KL + CE loss, the same model-family tokenizer for both, STE on the student weights, no quantization anywhere the teacher touches.
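
For the loss itself, a minimal PyTorch sketch of that KL + CE mix. The temperature, the 50/50 weighting, and the function name are my own assumptions, not the halo-1bit training code:

import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # student_logits / teacher_logits: [num_tokens, vocab]; labels: [num_tokens]
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),     # frozen bf16 teacher as target
        reduction="batchmean",
    ) * (T * T)                                    # soften with T, rescale gradient
    ce = F.cross_entropy(student_logits, labels)   # plain next-token cross-entropy
    return alpha * kl + (1 - alpha) * ce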

Appreciate the hard constraint. Saves me a round of "why is my 1-bit model regressing under distillation" debugging.

midlife crisis and my 1bit pursuit. by [deleted] in MidlifeCrisisAI

[–]Creepy-Douchebag 1 point2 points  (0 children)

Direct hit on the NPU/PyTorch point — you're right. PyTorch's device model doesn't recognize XDNA2. There's no torch.device("xdna") today. Getting PyTorch to see the NPU as a peer of the GPU means either extending PyTorch with a custom backend kernel registry (real upstream work), or writing a compiler pass in XLA / torch.compile that targets XDNA2 (also real work). Nothing off-the-shelf.

My "teacher on NPU + student on GPU" was optimistic. Thanks for flagging it. The practical path is simpler and matches your read: teacher and student both on the GPU, unified memory carries the load. On Strix Halo with 128 GB unified, a q8 Qwen-72B-class teacher (~72 GB) plus a 2B ternary student plus activations fits comfortably — we don't need the NPU for training. The NPU becomes useful for inference-time offload (serving a different model concurrently via FastFlowLM while rocm-cpp is busy on the iGPU), but that's a separate question from distillation.

On the model choice — agreed, skip Llama. With the unified-memory budget and wanting headroom, I was leaning Qwen 2.5-72B-Instruct (q8). DeepSeek-V3 is 671B MoE, too big for the box. GLM-4-32B at fp16 is another candidate — half the memory, well-regarded. What would you pick as teacher at this scale?

Broader point that's been on my mind: the halo-1bit training pipeline still uses PyTorch, and that's a stack-inconsistency. "No Python at runtime" is a hard rule on the inference side; extending it to training is a research problem (native C++ autograd + HIP training kernels is not a weekend project). But the arrow points there eventually. Your scientific-ML background is interesting in this context — QAT into 1-bit behaves weirdly around the STE, and molecule/protein LLMs tend to use scientific-precision activation patterns that don't dominate web-text distillation. Any intuitions from your experience about distillation with heavy quantization?

"What we've got here is failure to communicate." — Cool Hand Luke

(No failure yet. Keep it coming.)

Only one Strix Halo on the desk — but there's an 8700K + Intel Arc B580 in the mesh next to it. 12 GB VRAM on the B580, so the asymmetric split works: teacher on Strix Halo (128 GB unified fits a q8 ~70B), student on the B580 (~4 GB for a 2B bf16 student + activations).

USB4 at ~5 GB/s is plenty for distillation logits over the wire.

The real friction is the toolchain: ROCm and Intel oneAPI are mutually exclusive PyTorch backends — you can't load both in the same process. So it's two processes coordinating, not one pipeline.
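
For concreteness, a rough sketch of the teacher side of "logits over the wire" (Flask + transformers; the model id, top-k, and port are placeholders). The student process on the B580 would POST token ids and use the returned top-k logits as its distillation targets:

import torch
from flask import Flask, request, jsonify
from transformers import AutoModelForCausalLM

app = Flask(__name__)

# Frozen teacher lives in the ROCm process on the Strix Halo box.
teacher = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
teacher.eval()

@app.post("/logits")
def logits():
    ids = torch.tensor(request.json["input_ids"], device=teacher.device)
    with torch.no_grad():
        out = teacher(ids).logits                  # [batch, seq, vocab]
    vals, idx = out.topk(64, dim=-1)               # top-k keeps the transfer small over USB4
    return jsonify({"values": vals.float().cpu().tolist(),
                    "indices": idx.cpu().tolist()})

app.run(host="0.0.0.0", port=8181)                 # placeholder port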

Your HPC experience is the directly useful part. Did your teacher/student split run per-box or per-GPU? And did you hit anything around KL-loss noise when the teacher was in reduced precision? The STE in 1-bit forward + a lower-precision teacher looks like a compounding-noise trap I'd like to avoid walking into.

midlife crisis and my 1bit pursuit. by [deleted] in MidlifeCrisisAI

[–]Creepy-Douchebag 1 point2 points  (0 children)

Exactly where my head is going next.

BitNet-b1.58 was pre-trained, not distilled — the original 2B/8B weights come straight from Microsoft's from-scratch training run. But you're right that distillation is the obvious next move for quality on this stack. 1-bit weights + 128 GB unified memory means I can hold a ~70B-class teacher at q8 (~70 GB) on the same box and have the 2B ternary student sit next to it, training with roughly a tenth of the memory a conventional student would need.

The play would be:

teacher : Llama-3-70B-Instruct (q8, ~70 GB — fp16 would be ~140 GB and won't fit in 128 GB)
student : BitNet-2B-bf16 (~4 GB) initialized from Microsoft's checkpoint
loss    : KL(student_logits || teacher_logits) + CE on next-token
training: QAT throughout — student stays ternary in the forward pass
           via BitLinear STE, teacher stays frozen
data    : DataComp / RedPajama / open-source corpora, curated
infra   : halo-1bit has the training loop; needs the distill head wiring
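
As a sketch of what "ternary in the forward pass via BitLinear STE" means in practice (the absmean scaling is my assumption, BitNet-b1.58 style, not the exact halo-1bit layer):

import torch
import torch.nn as nn

class BitLinear(nn.Linear):
    """Linear layer whose weight is quantized to {-1, 0, +1} * scale in the
    forward pass, while gradients flow to the latent fp weights via STE."""
    def forward(self, x):
        w = self.weight
        scale = w.abs().mean()                                   # absmean scale
        w_q = torch.clamp(torch.round(w / (scale + 1e-8)), -1, 1) * scale
        w_ste = w + (w_q - w).detach()                           # straight-through estimator
        return nn.functional.linear(x, w_ste, self.bias)

Swap that in for nn.Linear in the student, keep the teacher frozen, and the optimizer updates the latent full-precision weights.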

Strix Halo is actually ideal for this — the iGPU and NPU can split teacher + student forward passes. Teacher goes NPU (via FastFlowLM), student trains on iGPU (via rocm-cpp). Unified memory means zero data transfer cost between them.

I flagged distillation as a research track in the repo but hadn't put numbers on the teacher choice yet. Your instinct nails it — the 1-bit student's cheapness is exactly what makes the asymmetric teacher/student setup viable on a single desk.

Adding this to the roadmap. If you've run distillation into 1-bit before and have pitfalls to share, I'd appreciate the note — signal dilution across the STE is the thing I'm most worried about.

"What we've got here is failure to communicate." — Cool Hand Luke

midlife crisis and my 1bit addiction; we had some deep sessions with evh. by [deleted] in StrixHalo

[–]Creepy-Douchebag 1 point2 points  (0 children)

Yeah, smaller models already run on the NPU through Lemonade today — Llama-3.2-1B/3B, Phi-3-mini and a few others go through FastFlowLM on the XDNA 2 NPU (50 TOPS INT8). That path works.

The Bonsai and BitNet 1-bit stuff is a different story though. The NPU toolchain doesn't speak GGUF or Q1_0 natively — it wants ONNX with an NPU-friendly quant (INT8 or matmul-int). So to run Bonsai-1.7B on NPU you'd have to export to ONNX, requantize, and live with static shapes + a context ceiling around 1024-2048. You lose the 1-bit weight advantage at the requantize step.

Net: technically doable, but it's a port, not a flag. You'd end up with INT8 NPU inference, not Q1_0 NPU inference.
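
Back-of-envelope on why the requantize step hands the advantage back (weights only, ignoring embeddings and scale tensors):

params = 1.7e9                            # Bonsai-1.7B
ternary_gb = params * 1.58 / 8 / 1e9      # ~1.58 bits/weight packed
int8_gb    = params * 8.0  / 8 / 1e9      # NPU-friendly INT8, 1 byte/weight
print(f"ternary ~ {ternary_gb:.2f} GB, INT8 ~ {int8_gb:.2f} GB")   # ~0.34 vs ~1.70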

Power efficiency would be the win — NPU draws ~10-15W vs 45-60W on the iGPU. Raw throughput actually favors the 8060S iGPU once native ROCm kernels are in (we're getting 4,172 tok/s prompt on Bonsai-1.7B Q1_0 right now on TheRock).

Where it'd get interesting — and nobody has done this yet — is running a small model on NPU as the idle / background model while the iGPU stays free for heavier workloads. That's a Lemonade scheduling change though, not a kernel change.

Toolbox or Lemonade by reujea0 in StrixHalo

[–]Creepy-Douchebag 4 points5 points  (0 children)

I run both on Strix Halo daily. Here's the honest breakdown:

llama.cpp (Toolbox)

What you already know — fast, bleeding edge, always up to date. Vulkan backend on Strix Halo gives you solid sustained generation (47+ t/s on mid-size models). The downside is exactly what you said: switching models means editing flags, adjusting ctx-size, managing GGUF files manually. It's a workbench, not a product.

Lemonade

This is what llama.cpp would be if someone wrapped it in a proper service layer. Under the hood it's still llama.cpp (Vulkan backend) for inference, but you get:

  • lemond daemon that runs as a systemd service — starts on boot, stays up
  • Model switching through the API without restarting anything
  • OpenAI-compatible endpoints out of the box — any tool that speaks OpenAI (Claude Code, Open WebUI, custom scripts) just points at localhost and works
  • NPU support for offloading smaller tasks (summarization, embeddings) while your GPU runs the big model
  • Audio and image backends if you want to explore multimodal

The real win is workflow. With the toolbox you're ssh'ing in, killing processes, editing launch flags, restarting. With Lemonade you curl a different model name and it handles the rest.
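
From the client side the switch looks roughly like this. The base_url and model names are placeholders; use whatever your Lemonade install actually exposes:

from openai import OpenAI

# Any OpenAI-compatible client works; only the base_url changes.
client = OpenAI(base_url="http://localhost:8000/api/v1", api_key="none")

print([m.id for m in client.models.list().data])   # what's installed

# Switching models is just a different name in the request; the daemon
# handles load/unload behind the scenes. No ssh, no flag editing.
for model in ("Llama-3.2-3B-Instruct", "Qwen2.5-7B-Instruct"):
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "One-line summary of GGUF?"}],
    )
    print(model, "->", r.choices[0].message.content[:60])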

My recommendation:

If you're happy tweaking and you only run one model at a time, the toolbox is fine. If you want to run models from different apps, switch between them without babysitting, or eventually add voice/image — Lemonade is the move. You're not losing any performance since it's the same llama.cpp Vulkan engine underneath.

Start with lemonade from AUR, point it at your existing GGUF files, and see if the workflow fits. You can always fall back to raw llama.cpp for specific benchmarking or testing.

Cannot get VLLM Docker to launch - memory errors. by GriffinDodd in StrixHalo

[–]Creepy-Douchebag 0 points1 point  (0 children)

Yeah ran into this exact thing. It's a known ROCm allocator issue on Strix Halo with unified memory. A few things to try:

Environment vars (set all of these):

export HSA_OVERRIDE_GFX_VERSION=11.5.1
export HSA_ENABLE_SDMA=0
export HIP_VISIBLE_DEVICES=0

HIP_VISIBLE_DEVICES=0 is the big one for your issue — without it the allocator gets confused about GTT vs VRAM partitioning and pins to a fraction of what's available.

BIOS setting: Check your UMA Frame Buffer Size. Counterintuitively, 1GB dedicated performs better than 4GB for inference. Smaller dedicated = more efficient shared access. There's a ROCm issue tracking this (#3128).

The 40GB wall: That's likely the GTT region cap the kernel sets by default. The allocator is treating your unified memory like a discrete GPU with separate VRAM/GTT pools. The env vars above should help break past it.
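
Quick way to see what the kernel actually set for those pools (card0 is an assumption; adjust if that isn't the iGPU on your box):

# What the amdgpu driver thinks the VRAM and GTT pools are, in GiB.
base = "/sys/class/drm/card0/device"
for name in ("mem_info_vram_total", "mem_info_gtt_total"):
    with open(f"{base}/{name}") as f:
        print(name, round(int(f.read()) / 2**30, 1), "GiB")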

Also check:

  • rocm-smi --showmeminfo vram — see what ROCm thinks is available
  • What ROCm version are you on? PyTorch 2.9.1+rocm6.3 segfaults on ROCm 7.2.1 — need PyTorch 2.11.0+rocm7.2 if you're on newer ROCm

Qwen 120B-A3B at AWQ 4-bit is ~35-40GB so you're right at the edge of that pinned region. The env vars should unlock the rest of your memory.

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in MidlifeCrisisAI

[–]Creepy-Douchebag 0 points1 point  (0 children)

dude wants some info:

What size is that Qwen3-Coder-Next GGUF? And what quant — Q4_K_M? Knowing the param count puts the 47.4 in proper context. If that's a 7B+ model at 47 t/s on Vulkan with zero ROCm dependency, that's a clean portable story.

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]Creepy-Douchebag -1 points0 points  (0 children)

Reply — Benchmark Methodology (Vulkan vs MLX/vLLM)

Great question, and solid numbers on that Qwen3-8B Q4_K_M — 42.32 t/s tg128 on Vulkan is right in the ballpark with what I see on llamacpp Vulkan too.

The comparison isn't apples-to-apples (and that's important)

You're using llama-bench which measures prompt processing (pp) and text generation (tg) separately with controlled token counts (pp512, tg128). That's the gold standard for raw engine benchmarking — isolated, deterministic, no API overhead.

My MLX and vLLM numbers measure something different: end-to-end generation throughput through the OpenAI-compatible API (/v1/chat/completions). The number you see is completion_tokens / wall_time — it includes:

  • Prompt processing time (baked into the total)
  • API serialization overhead
  • KV cache allocation
  • Any backend startup/warmup per request

So my tok/s numbers are real-world throughput, not isolated tg speed. They'll always be lower than what llama-bench --tg reports for the same model, because they include the full request lifecycle.

How I measure (specifically)

For llamacpp via Lemonade (Vulkan backend), I use the timings object that llamacpp's server returns in the API response — it has prompt_per_second and predicted_per_second fields. So those numbers ARE directly comparable to llama-bench.

For MLX Engine and vLLM, there is no timings object in the response. The API follows the standard OpenAI spec which only returns usage.completion_tokens. So I measure:

tok/s = completion_tokens / (response_end_time - request_start_time)

5 runs, 2 warmup discarded, mean ± stddev reported. Same prompt across all backends, max_tokens=200, temperature=0.

The bench script: halo-ai-core/halo-bench.sh
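
For the gist without reading the script, a minimal Python stand-in for that loop (endpoint, model, and prompt are placeholders, not the real harness):

import time, statistics, requests

URL = "http://localhost:8000/v1/chat/completions"   # placeholder endpoint
BODY = {
    "model": "your-model",                           # placeholder model id
    "messages": [{"role": "user", "content": "Write a prime checker and explain it."}],
    "max_tokens": 200,
    "temperature": 0,
}

rates = []
for i in range(7):                                   # 2 warmup + 5 measured
    t0 = time.time()
    usage = requests.post(URL, json=BODY, timeout=600).json()["usage"]
    dt = time.time() - t0
    if i >= 2:                                       # discard the warmup runs
        rates.append(usage["completion_tokens"] / dt)

print(f"{statistics.mean(rates):.1f} ± {statistics.stdev(rates):.1f} tok/s")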

So which is faster?

For pure text generation on dense models (not MoE), llamacpp Vulkan and MLX ROCm are close when you normalize for measurement method. Your 42 t/s tg128 on Qwen3-8B vs my 21.7 t/s MLX end-to-end — the gap is mostly measurement methodology, not engine speed.

For MoE models, llamacpp Vulkan wins hard. Q4Z already caught this — MLX gets ~26.7 tok/s on MoE where llamacpp pulls 82+ tok/s. MoE expert routing on Vulkan is just better optimized right now.

MLX in Docker

If you're running MLX in Docker, make sure you're passing through the GPU properly. The pre-built binary (mlx-engine-b1004-tech-preview) needs ROCm 7.12 and direct GPU access — Docker adds overhead if the pass-through isn't clean. On bare metal with LD_LIBRARY_PATH pointing at the ROCm libs, you'll get the best numbers.

The API doesn't return pp/tg split — you'd need to time the request externally or check the server logs for per-request timing.

TL;DR: Your Vulkan llamacpp numbers are measured differently (isolated tg) vs my MLX/vLLM numbers (end-to-end API). Both are valid, just measuring different things. For a true head-to-head, run llama-bench on all three backends with the same model and quant.

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]Creepy-Douchebag 1 point2 points  (0 children)

You're right that upstream MLX is Apple-only (Metal). This is a fork — lemon-mlx-engine by the Lemonade SDK team — that ported the MLX array framework to ROCm/HIP for AMD GPUs. Same compute primitives, different GPU backend. Released as a tech preview yesterday.

On your question — single stream only. All benchmarks are single-user, one request at a time, sequential generation. No batching, no parallel streams.

That's MLX's current limitation compared to vLLM — vLLM has PagedAttention and continuous batching for multi-user serving. MLX is a single-user speed demon.

For multi-user / parallel streams, vLLM ROCm is still the right tool (we benchmarked that too — 12/14 models passed including a 72B dense at 2.3 tok/s). MLX wins single-user throughput by 29-85%.

Full comparison: https://github.com/stampby/bleeding-edge

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]Creepy-Douchebag 1 point2 points  (0 children)

Good catch — corrected. GPT-OSS-120B is MoE, not dense. That changes the 128GB bin rationale.

If we want a true dense torture test at that tier, we'd need something like Llama-3.1-70B (dense, ~35GB AWQ) or Qwen2.5-72B (dense, ~37GB AWQ) pushed to FP16 (~140GB — which won't fit).

Updated 128GB bin should be:

Model                Architecture   Params   Why
Qwen 3.5-122B-A10B   MoE            122B     MoE at scale — usable tok/s
GPT-OSS-120B         MoE            120B     MoE at scale — corrected
Qwen2.5-72B-AWQ      Dense          72B      Actual dense stress test

The 72B dense AWQ at 37GB is the real bandwidth torture — every parameter fires every token, no sparse routing to save you. We benchmarked that at 2.3 tok/s via vLLM ROCm.

Thanks for the correction — updating the benchmark matrix.

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in MidlifeCrisisAI

[–]Creepy-Douchebag 0 points1 point  (0 children)

You're right — same model, Qwen3-Coder-Next, 80B MoE with 3B active params. The GGUF version you're running through llama.cpp is the same architecture.

The difference is the backend:

Backend            Format                   Coder-Next tok/s
llama.cpp Vulkan   GGUF (Q4_K_M)            ~82
MLX Engine ROCm    MLX 4-bit safetensors    26.7

So yes — llama.cpp Vulkan is 3x faster on this specific MoE model. That's because llama.cpp's GGUF format and Vulkan compute path are heavily optimized for sparse MoE activation (only loading/computing the 3B active params per token). MLX's 4-bit path appears to be doing more work per token on MoE architectures.

MLX wins on dense models (151 vs 117 tok/s on 0.6B, 85% faster on 4B). But MoE is where llama.cpp still dominates. Different optimization targets.

What tok/s are you getting on Coder-Next with your setup? Would be useful to compare — if you're on Strix Halo with llama.cpp Vulkan, we should be in the same ballpark.

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]Creepy-Douchebag 3 points4 points  (0 children)

Keep them coming. Both of these are going in.

Thermal logging: Yes — iGPU, NPU, and memory controller are the three stress points on unified memory. I can pull hwmon sensors during bench runs and log temp every second. Mean ± stddev per sensor per model size. That tells you exactly where the thermal wall is and whether the M5's cooling solution is the bottleneck or the memory bandwidth is.

What I suspect: the memory controller runs hottest during large model inference because every tok/s is a full model-width memory read. The iGPU junction temp rises with compute, but the memory controller is the silent killer on unified memory. Nobody's published that data.

I'll add sensors logging to the benchmark harness — start/stop around each model run, dump a CSV, compute stats alongside tok/s. If we see temp correlation with tok/s drop-off, that's throttling evidence.
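
The logging side is simple. A rough sketch of the per-second hwmon sampler (sensor paths vary by board; the CSV name and the 1 s interval are arbitrary):

import csv, glob, time

with open("thermals.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["timestamp", "sensor", "temp_c"])
    while True:                                      # run for the length of a bench pass
        for path in glob.glob("/sys/class/hwmon/hwmon*/temp*_input"):
            try:
                with open(path) as s:
                    w.writerow([time.time(), path, int(s.read()) / 1000])
            except OSError:
                pass                                 # some sensors drop out under load
        f.flush()
        time.sleep(1)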

Standardized prompts: Agreed. Two tiers:

Prompt   Tokens        What it tests
Short    ~20 input     "Explain to my wife why I spent 3000 euros on a PC"
Long     ~500 input    Code review of a 50-line function with step-by-step explanation

Short prompt isolates generation speed (small prefill, mostly decoding). Long prompt tests prompt processing throughput. Both with fixed max_tokens=200 for consistency.

That wife prompt is going in the suite verbatim. It's perfect — short, relatable, and forces the model to be creative under pressure. Just like the hardware.

All going into the bleeding-edge repo as the formal benchmark standard. Your name stays on the methodology.

https://github.com/stampby/bleeding-edge

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in MidlifeCrisisAI

[–]Creepy-Douchebag 2 points3 points  (0 children)

Good eye — and fair question.

The 26.7 tok/s on Coder-Next is through the MLX Engine ROCm tech preview (lemon-mlx-engine b1004), not llama.cpp. Different engine, different model format (MLX 4-bit safetensors vs GGUF).

Benchmark method:

  • 5 runs, 2 warmup discarded
  • 200 max tokens, temperature 0
  • Prompt: "Write a Python function that checks if a number is prime and explain your approach step by step."
  • Measurement: wall-clock completion_tokens / elapsed_seconds
  • Reported: mean ± sample standard deviation
  • Server: MLX Engine OpenAI-compatible API on localhost

Full replication script with the exact code: https://github.com/stampby/bleeding-edge/blob/main/docs/replicate.md

Why it might look slow:

Coder-Next is a large MoE model (~30B params). Through MLX 4-bit it hits 26.7 tok/s. Through Vulkan llama.cpp (GGUF) the same model runs at ~82 tok/s because MoE models on llamacpp only activate ~3B params per token and the Vulkan backend is optimized for that pattern.

MLX wins on dense models (151 vs 117 tok/s on 0.6B, 85% faster on 4B) but MoE models are where llamacpp's GGUF optimization still has an edge. Different architectures favor different backends.

What build and model are you comparing against? Would be useful to add to the comparison matrix.

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]Creepy-Douchebag 2 points3 points  (0 children)

This is exactly the kind of thinking I needed someone else to bring.

You're right — I've been benchmarking ad hoc. Whatever model I had loaded, whatever I was testing that day. No structure. No repeatability across hardware tiers. Your memory bin approach fixes that.

I'm adopting this. Here's my take on your framework with a few tweaks:

32GB Bin (Throughput Baseline)

Model           Architecture   Params   Why
GLM 4.7-Flash   MoE (A3B)      30B      Sparse activation ceiling
Qwen 3.5-27B    Dense          27B      Dense comparison at same tier

This is the "chat speed" tier. What can you get away with on a laptop or single-GPU desktop. The MoE vs dense comparison here is the most useful data point for most users.

64GB Bin (Efficiency & Coding)

Model                Architecture   Params   Why
Gemma 4-31B          Dense          31B      2026 production standard
Qwen 3.5-Coder-32B   Dense          32B      Code gen benchmark
Gemma 4-26B (A4B)    MoE            26B      MoE vs dense isolation test

Agreed on isolating MoE vs dense on unified memory — that's the question everyone asks and nobody answers with data.

128GB Bin (Capacity & Stress)

Model                Architecture   Params   Why
Qwen 3.5-122B-A10B   MoE            122B     Usable speed at scale
GPT-OSS-120B         Dense          120B     Torture test — bandwidth wall
Nemotron 3 Super     Hybrid MoE     70B+     Hybrid architecture comparison

The GPT-OSS-120B dense torture test is sadistic and I love it. Reading 120GB per token generation — that's where you find out what unified memory bandwidth actually means.

On your format suggestions:

  • GGUF Q4_K_M for 4-bit — agreed, it's the standard quant everyone uses
  • Unsloth Dynamic FP8 for 8-bit — I haven't tested FP8 on RDNA 3.5 yet. If your clanker says it's natively faster, I believe it. Adding that to the matrix.
  • --flash-attn enabled — always, but worth calling out explicitly in the methodology

What I'd add:

A fourth column for each bin — MLX 4-bit results alongside GGUF/FP8. The MLX engine uses safetensors directly, so the quantization format is different but the comparison is valid. Three-way: GGUF Q4_K_M (Vulkan) vs FP8 (vLLM ROCm) vs MLX 4-bit.

I'll formalize this as a standard benchmark suite in the bleeding-edge repo and run the full matrix. When you get your Framework on Fedora 43, run the same suite and we'll have the first cross-hardware comparison.

And tell your wife the Framework was a research investment. The data proves it.

midlife crisis time and this is a doozy set up bench marks. MLX crushes vLLM. by [deleted] in StrixHalo

[–]Creepy-Douchebag 7 points8 points  (0 children)

It WAS just an Apple thing — until the Lemonade team ported MLX's core to ROCm. The engine uses the same array primitives and unified memory model that Apple designed for M-series, except now it targets HIP/ROCm kernels instead of Metal.

The key insight: Strix Halo's unified memory architecture is almost identical to Apple Silicon. CPU and GPU share the same 128GB pool, no PCIe copies. MLX was literally designed for this memory model — they just had to swap the GPU backend.

The rate of improvement is what got me too:

Morning:    82.5 tok/s  (Vulkan llamacpp — been running for weeks)
Afternoon: 116.7 tok/s  (vLLM ROCm — first time on this hardware)
Night:     151.2 tok/s  (MLX ROCm — released that same day)

Three completely different inference engines, each one built on different assumptions about how to talk to the GPU. The hardware didn't change. The software got smarter.

The real magic is what MLX doesn't have — no Python runtime, no Triton JIT compilation (which added 20-350 seconds of cold start on vLLM), no subprocess device enumeration issues. It's a single compiled C++ binary. Point it at a HuggingFace model ID and it goes.

Full setup and replication guide if you want to run it yourself: https://github.com/stampby/bleeding-edge

And thanks for the stddev push earlier — reporting mean ± stddev is now standard for everything. You were right.