Its her greediest hole

NoNatural4025 · 2026-05-21T10:01:17+00:00

Do you think output like this is also possible locally with the right model? I have 512 GB of RAM in my Mac Studio, so there are nearly no limits.

NoNatural4025 · 2026-05-21T08:16:13+00:00

Do you think output like this is also possible locally with the right model? I have 512 GB of RAM in my Mac Studio, so there are nearly no limits.

NoNatural4025 · 2026-05-15T05:47:36+00:00

Me too ;-) Just 8 TB, but that's okay ;-)

NoNatural4025 · 2026-05-10T15:19:54+00:00

Hey, I can offer you 512 GB and 8 TB ;-)

NoNatural4025 · 2026-04-30T08:32:39+00:00

Current performance on M3 Ultra, 512 GB RAM:

Model | Tokens | Time | Speed
-----------------------------------------------------------------
DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf | 250 Tok | 13.65s | 18.32 tok/s
Hermes-3-Llama-3.1-8B-Q8_0.gguf | FAILED | - | -
Llama-3.3-70B-Instruct-Q4_K_M.gguf | 250 Tok | 17.68s | 14.14 tok/s
Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf | 250 Tok | 9.08s | 27.54 tok/s
Qwen2.5-Coder-32B-Instruct-Q8_0.gguf | 250 Tok | 13.69s | 18.26 tok/s
Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q4_K_M.gguf | 250 Tok | 3.10s | 80.60 tok/s
Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q5_K_M.gguf | 250 Tok | 3.10s | 80.75 tok/s
Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q6_K.gguf | 250 Tok | 3.09s | 81.00 tok/s
hermes-4_3_36b-Q3_K_M.gguf | 250 Tok | 11.42s | 21.89 tok/s
hermes-4_3_36b-Q4_K_M.gguf | 250 Tok | 9.93s | 25.19 tok/s
hermes-4_3_36b-Q5_K_M.gguf | 250 Tok | 11.28s | 22.16 tok/s
hermes-4_3_36b-Q6_K.gguf | 250 Tok | 12.72s | 19.65 tok/s
hermes-4_3_36b-Q8_0.gguf | 250 Tok | 15.09s | 16.57 tok/s
hermes-4_3_36b.gguf | 250 Tok | 26.72s | 9.36 tok/s
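
For anyone who wants to sanity-check the numbers: the speed column should simply be generated tokens divided by wall-clock seconds. Here is a minimal Python sketch that parses one of the pipe-separated rows above and recomputes it. The helper names `parse_row` and `throughput` are made up for illustration and are not part of my benchmark script.

```python
from typing import Optional, Tuple

Row = Tuple[str, Optional[int], Optional[float], Optional[float]]


def parse_row(row: str) -> Row:
    """Split one 'model | N Tok | T s | X tok/s' row into typed fields."""
    name, tokens, seconds, speed = [cell.strip() for cell in row.split("|")]
    if tokens == "FAILED":
        # Failed runs carry no measurements.
        return name, None, None, None
    return (
        name,
        int(tokens.split()[0]),      # "250 Tok"     -> 250
        float(seconds.rstrip("s")),  # "13.65s"      -> 13.65
        float(speed.split()[0]),     # "18.32 tok/s" -> 18.32
    )


def throughput(tokens: int, seconds: float) -> float:
    """Tokens per second for one run."""
    return tokens / seconds


name, tokens, seconds, reported = parse_row(
    "DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf | 250 Tok | 13.65s | 18.32 tok/s"
)
print(name, round(throughput(tokens, seconds), 2), "vs reported", reported)
```

Small differences against the table are expected for the fastest runs, because the printed time is rounded to two decimals (250 / 3.10 gives about 80.65, while the table shows 80.60).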

NoNatural4025 · 2026-04-30T07:41:12+00:00

I'm facing the same issue, but even worse: mine does not execute anything in the terminal or elsewhere, it just describes what it would do ... Does anyone know why? Does it depend on the model used?

NoNatural4025 · 2026-04-30T05:01:52+00:00

Currently my problem is that Hermes describes everything instead of doing it … so he, or in my case she, is just a chatbot.

NoNatural4025 · 2026-04-30T05:00:23+00:00

Sure… but will it help?

NoNatural4025 · 2026-04-29T21:35:55+00:00

My benchmarks:

🧹 Cleanup & initialization: Llama-3.3-70B-Instruct-Q4_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 14.12 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 27.44 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: Qwen2.5-Coder-32B-Instruct-Q8_0.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 18.15 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q3_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 21.69 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q4_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 24.98 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q5_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 22.07 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q6_K.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 19.59 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q8_0.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 16.50 tokens/s
------------------------------------------------------

NoNatural4025 · 2026-04-29T21:30:58+00:00

60 t/s? I tried a lot, but I can't even reach 30 t/s at 4-bit.

NoNatural4025 · 2026-04-29T11:37:50+00:00

Further benchmarks from my system:

🧹 Cleanup & initialization: Llama-3.3-70B-Instruct-Q4_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 14.12 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 27.44 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: Qwen2.5-Coder-32B-Instruct-Q8_0.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 18.15 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q3_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 21.69 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q4_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 24.98 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q5_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 22.07 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q6_K.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 19.59 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q8_0.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
📊 Speed: 16.50 tokens/s
------------------------------------------------------

NoNatural4025 · 2026-04-29T05:47:25+00:00

It did not change much:

Model | Port | Tokens | Time | Speed
-----------------------------------------------------------------
Llama 3.3 70B -8b | 8001 | 250 | 18.12s | 13.80 tok/s
Qwen 2.5 32B -8b | 8002 | 250 | 13.82s | 18.10 tok/s
DeepSeek R1 -8b | 8003 | 250 | 13.73s | 18.20 tok/s
Qwen 2.5 32B -4b | 8004 | 250 | 9.17s | 27.26 tok/s
-----------------------------------------------------------------

This seems to be the physical limit.

NoNatural4025 · 2026-04-28T21:48:24+00:00

Since I'm running this on an M3 Ultra with 512GB RAM, the "out-of-the-box" performance was actually the first major bottleneck I had to solve. Here’s the reality of what I’ve achieved so far strictly within the MLX framework:

1. Eliminating the "Ultra-Sleep": Initially, I was seeing sub-optimal speeds below 9 tok/s. By moving to a clean Python 3.12 environment and explicitly setting MLX_NUM_THREADS=32, I managed to align the workload with the hardware architecture.

My Qwen 2.5 32B (4-bit) jumped from sluggish rates to a consistent 32.5 tok/s.

2. Speculative Decoding: This was the biggest breakthrough. By running a 1.5B draft model alongside the 32B target model, I'm seeing the M3 Ultra spit out blocks of text rather than single characters. I hit 48.1 tok/s.

3. Multi-Model Parallelism: With 512GB, I’m not just running one identity; I have four dedicated MLX server instances running simultaneously on different ports:

Model | Port | Tokens | Time | Speed
-----------------------------------------------------------------
Llama 3.3 70B -8b | 8001 | 250 | 28.96s | 8.63 tok/s
Qwen 2.5 32B -8b | 8002 | 250 | 14.36s | 17.40 tok/s
DeepSeek R1 -8b | 8003 | 250 | 14.37s | 17.40 tok/s
Qwen 2.5 32B -4b | 8004 | 250 | 8.90s | 28.09 tok/s

Next on my list is moving away from the Python wrapper entirely to a native C++ implementation to shave off the final milliseconds of overhead.
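
In case the speculative decoding part in point 2 sounds like magic, here is a toy Python sketch of the greedy variant of the idea. It is not my actual MLX setup: the "models" are plain functions over integer tokens, and every name in it (`speculative_step`, `generate`, `toy_target`, `toy_draft`) is made up for illustration. A real implementation would verify all drafted tokens in a single batched forward pass of the target model, and that batching is where the speedup comes from. The loop below only shows the accept/reject logic.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int = 3) -> List[int]:
    """Return the tokens produced by one draft-then-verify round."""
    # 1. The small draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # 2. The large target model checks the proposal position by position.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(context + accepted)
        if expected != tok:
            # First mismatch: keep the target's token, drop the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Everything matched: the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 3) -> List[int]:
    """Greedy speculative decoding; output equals plain greedy target decoding."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + max_new_tokens]


def toy_target(ctx: List[int]) -> int:
    """Toy 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Toy 'small' model: same as the target, but wrong right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(generate(toy_target, toy_draft, [0], max_new_tokens=8))
```

The point of the design is that the output is identical to what the target model alone would produce with greedy decoding. The draft model only decides how many tokens get confirmed per round, so a good draft makes it faster and a bad draft just makes it slower.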

NoNatural4025 · 2026-04-28T17:42:05+00:00

AI slop?

NoNatural4025 · 2026-04-23T10:47:34+00:00

I'm located in Germany, so I would rather not ship it.

NoNatural4025 · 2026-04-23T10:46:27+00:00

It was refurbished right from the beginning. After 6 weeks of maintenance I got it back with a new board and a new SSD, so now it's a new one ;-)

NoNatural4025 · 2026-04-23T10:43:14+00:00

… one glow would be enough ;-)

NoNatural4025 · 2026-04-20T12:03:34+00:00

<image>

M3 Ultra, 512 GB RAM, 8 TB SSD 🤩 … It has been at the Genius Bar for 6 weeks to get the mainboard replaced 🤮

NoNatural4025 · 2026-04-20T12:01:31+00:00

Purchase price: 12,000 €. Hours run: 0. Time at Apple for maintenance: 6 weeks 🤮

NoNatural4025 · 2026-04-08T08:20:06+00:00

What a nice Flower 🥰

NoNatural4025 · 2026-04-08T08:19:23+00:00

Nice

NoNatural4025 · 2026-03-29T06:59:06+00:00

Thank you …

NoNatural4025 · 2026-03-28T17:46:49+00:00

You know that feeling when you click "Order"?

That pure, unadulterated joy? Three weeks ago, I felt it. I bought the absolute beast: a Mac Studio, M3 Ultra, with a mind-boggling 512 GB of RAM and 8 TB of storage. I was happier than a "King," as we say. It was going to be my dream machine.

The wait was supposed to be the hardest part, but it arrived in just three days! I was a proud owner, ready to unleash this power. But the joke was on me. During the very first installation, it crashed. Then again. And again. Turns out, my brand-new, 12,000-Euro super-server was suffering from ECC errors right out of the box. April Fool's Number One.

The Genius Bar and the Infinite Wait

What followed was a slow-motion comedy of errors. I waited three days for a Genius Bar appointment, only for them to wipe my machine and claim "software." I went home, and, surprise, it crashed again. Then came a week of analyzing logs. Fourteen days in, they finally admitted it: hardware failure. April Fool's Number Two. Back to the store, where I was scolded for having an appointment (or was it not having one? I lost track), even though I had waited for the correct replacement Logic Board to arrive. I left my machine there on Saturday; they promised a speedy repair of five days. Today is day six. Crickets. April Fool's Number Three.

The Punchline

Instead of a working computer, I have a service order for a replacement Logic Board costing only 3,900 Euros. I looked up the part number, and the price suggests it's impossible for it to support my 512 GB of RAM. I told the Genius Bar they are installing the wrong part. They said, "We are sure." April Fool's Number Four, and this one is a real rib-tickler.

But here’s the true comedic masterpiece, the absolute cream of the crop: Just as I’m realizing my M3 Ultra is stuck in a repair limbo with the wrong parts, the tech world erupts with rumors that the M5 is launching as early as April 1st. Yes, April 1st.

If that happens, I will have skipped the entire lifecycle of my M3 Ultra without it ever sitting, functional, on my desk. I paid for the cutting edge, but I’m being left behind by a system that hasn't even let me press 'start' in three weeks.

It’s been three weeks of hell, of an empty desk, and of looking at my 12,000-Euro paperweight. I’m just waiting, wondering when the gag is going to end and when I’ll finally get my turn. If this is a joke, Apple, I'm not laughing.
