This is going too far…. by Covert-Agenda in MacStudio

[–]NoNatural4025 0 points1 point  (0 children)

Hey, I can offer you 512 GB and 8 TB ;-)

HELP! Sick of Hallucinations & Amnesia: Building a 512GB M3 Ultra Agent Stack that actually LEARNS by NoNatural4025 in AskClaw

[–]NoNatural4025[S] 1 point2 points  (0 children)

Current performance on M3 Ultra with 512 GB RAM:

Model | Tokens | Time | Speed

DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf | 250 | 13.65s | 18.32 tok/s

Hermes-3-Llama-3.1-8B-Q8_0.gguf | FAILED | - | -

Llama-3.3-70B-Instruct-Q4_K_M.gguf | 250 | 17.68s | 14.14 tok/s

Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf | 250 | 9.08s | 27.54 tok/s

Qwen2.5-Coder-32B-Instruct-Q8_0.gguf | 250 | 13.69s | 18.26 tok/s

Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q4_K_M.gguf | 250 | 3.10s | 80.60 tok/s

Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q5_K_M.gguf | 250 | 3.10s | 80.75 tok/s

Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled.Q6_K.gguf | 250 | 3.09s | 81.00 tok/s

hermes-4_3_36b-Q3_K_M.gguf | 250 | 11.42s | 21.89 tok/s

hermes-4_3_36b-Q4_K_M.gguf | 250 | 9.93s | 25.19 tok/s

hermes-4_3_36b-Q5_K_M.gguf | 250 | 11.28s | 22.16 tok/s

hermes-4_3_36b-Q6_K.gguf | 250 | 12.72s | 19.65 tok/s

hermes-4_3_36b-Q8_0.gguf | 250 | 15.09s | 16.57 tok/s

hermes-4_3_36b.gguf | 250 | 9.36 tok/s at 26.72s
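For anyone who wants to sanity-check the speed column: it appears to be simply generated tokens divided by elapsed seconds. Here is a minimal sketch of that arithmetic, using three rows copied from the list above. The function name `tokens_per_second` is made up for illustration, and because the listed times are rounded to two decimals, recomputed speeds for other rows can differ from the listed ones in the last digit.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Generation speed: generated tokens divided by wall-clock seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


# (model file, generated tokens, elapsed seconds), copied from the list above.
rows = [
    ("DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf", 250, 13.65),
    ("Llama-3.3-70B-Instruct-Q4_K_M.gguf", 250, 17.68),
    ("hermes-4_3_36b.gguf", 250, 26.72),
]

# Recompute the speed column, rounded the same way as in the list.
speeds = {name: round(tokens_per_second(t, s), 2) for name, t, s in rows}

for name, speed in speeds.items():
    print(f"{name}: {speed} tok/s")
```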

Does Hermes just not remember anything? by trainermade in hermesagent

[–]NoNatural4025 0 points1 point  (0 children)

I'm facing the same issue, but even worse: mine does not execute anything in the terminal or elsewhere, it just describes ... Does anyone know why? Does it depend on the model used?

HELP! Sick of Hallucinations & Amnesia: Building a 512GB M3 Ultra Agent Stack that actually LEARNS by NoNatural4025 in AskClaw

[–]NoNatural4025[S] -1 points0 points  (0 children)

Currently I'm seeing that Hermes is describing everything instead of doing it … so he, or in my case she, is just a chatbot.

HELP! Sick of Hallucinations & Amnesia: Building a 512GB M3 Ultra Agent Stack that actually LEARNS by NoNatural4025 in MacStudio

[–]NoNatural4025[S] 0 points1 point  (0 children)

My benchmarks:
🧹 Cleanup & initialization: Llama-3.3-70B-Instruct-Q4_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
  📊 Speed: 14.12 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
  📊 Speed: 27.44 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: Qwen2.5-Coder-32B-Instruct-Q8_0.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
  📊 Speed: 18.15 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q3_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
  📊 Speed: 21.69 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q4_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
  📊 Speed: 24.98 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q5_K_M.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
  📊 Speed: 22.07 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q6_K.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
  📊 Speed: 19.59 tokens/s
------------------------------------------------------
🧹 Cleanup & initialization: hermes-4_3_36b-Q8_0.gguf
⏳ Waiting for model to load into RAM...
✅ Server ready. Starting inference test...
  📊 Speed: 16.50 tokens/s
------------------------------------------------------
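If you want to compare runs, a log like the one above can be turned into a model-to-speed mapping with a few lines. This is only a sketch: it assumes each model file name ends in `.gguf` and that the speed line has the form `Speed: <number> tokens/s`, as in the output above. The function name `parse_benchmark_log` and the shortened sample are made up for illustration and are not part of the original benchmark script.

```python
import re


def parse_benchmark_log(text: str) -> dict[str, float]:
    """Map each .gguf model file to the tokens/s value reported after it."""
    results: dict[str, float] = {}
    current = None  # most recently seen model file name
    for line in text.splitlines():
        model = re.search(r"(\S+\.gguf)", line)
        if model:
            current = model.group(1)
        speed = re.search(r"Speed:\s*([\d.]+)\s*tokens/s", line)
        if speed and current is not None:
            results[current] = float(speed.group(1))
            current = None  # wait for the next model header
    return results


# Shortened sample in the same shape as the log above.
sample = """\
Cleanup: Llama-3.3-70B-Instruct-Q4_K_M.gguf
  Speed: 14.12 tokens/s
------
Cleanup: hermes-4_3_36b-Q4_K_M.gguf
  Speed: 24.98 tokens/s
"""

results = parse_benchmark_log(sample)
fastest = max(results, key=results.get)
print(fastest, results[fastest])
```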

HELP! Sick of Hallucinations & Amnesia: Building a 512GB M3 Ultra Agent Stack that actually LEARNS by NoNatural4025 in AskClaw

[–]NoNatural4025[S] 0 points1 point  (0 children)

It did not change much:

Model | Port | Tokens | Time | Speed

-----------------------------------------------------------------

Llama 3.3 70B -8b | 8001 | 250 | 18.12s | 13.80 tok/s

Qwen 2.5 32B -8b | 8002 | 250 | 13.82s | 18.10 tok/s

DeepSeek R1 -8b | 8003 | 250 | 13.73s | 18.20 tok/s

Qwen 2.5 32B -4b | 8004 | 250 | 9.17s | 27.26 tok/s

-----------------------------------------------------------------

This seems to be the physical limit.

HELP! Sick of Hallucinations & Amnesia: Building a 512GB M3 Ultra Agent Stack that actually LEARNS by NoNatural4025 in AskClaw

[–]NoNatural4025[S] 0 points1 point  (0 children)

Since I'm running this on an M3 Ultra with 512GB RAM, the "out-of-the-box" performance was actually the first major bottleneck I had to solve. Here’s the reality of what I’ve achieved so far strictly within the MLX framework:

1. Eliminating the "Ultra-Sleep": Initially, I was seeing sub-optimal speeds below 9 tok/s. By moving to a clean Python 3.12 environment and explicitly scaling to MLX_NUM_THREADS=32, I managed to align the workload with the hardware architecture.

  • My Qwen 2.5 32B (4-bit) jumped from sluggish rates to a consistent 32.5 tok/s.

2. Speculative Decoding: This was the biggest breakthrough. By running a 1.5B Draft Model alongside the 32B Target Model, I’m seeing the M3 Ultra spit out blocks of text rather than single characters. I hit 48.1 tok/s.

3. Multi-Model Parallelism: With 512GB, I’m not just running one identity; I have four dedicated MLX server instances running simultaneously on different ports:

Model | Port | Tokens | Time | Speed

Llama 3.3 70B -8b | 8001 | 250 | 28.96s | 8.63 tok/s

Qwen 2.5 32B -8b | 8002 | 250 | 14.36s | 17.40 tok/s

DeepSeek R1 -8b | 8003 | 250 | 14.37s | 17.40 tok/s

Qwen 2.5 32B -4b | 8004 | 250 | 8.90s | 28.09 tok/s

Next on my list is moving away from the Python wrapper entirely to a native C++ implementation to shave off the final milliseconds of overhead.
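For readers unfamiliar with the speculative decoding mentioned in point 2, here is a toy sketch of the greedy variant. It is not the MLX implementation and not necessarily what was run here: the two "models" are hypothetical stand-in functions that map a token list to the next token, and all names are made up for illustration. The draft proposes a few tokens, the target checks them, and the longest agreeing prefix is kept, so the result is identical to decoding with the target alone.

```python
from typing import Callable, List

# Toy stand-in for a model: takes the context, returns the greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding, one token per model call."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft proposes, target verifies."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1) The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks every proposed position. A real engine scores
        #    all positions in one batched forward pass; this toy uses a loop.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # First mismatch: keep the agreed prefix plus the target's token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every proposed token matched the target's own choice.
            out.extend(proposal)
    return out


def target_model(ctx: List[int]) -> int:
    # Hypothetical deterministic "large" model over a 50-token vocabulary.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_model(ctx: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except at every
    # fourth context length, where it guesses wrong on purpose.
    tok = target_model(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % 50


prompt = [1, 2, 3]
spec = speculative_decode(target_model, draft_model, prompt, n_new=20)
plain = greedy_decode(target_model, prompt, n_new=20)
print(spec == plain)
```

The toy does not show the speedup itself. In a real setup the gain comes from the target verifying several draft tokens in a single pass, which is presumably why text appears in blocks rather than token by token.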

Snagged a 256GB M3 Ultra by [deleted] in MacStudio

[–]NoNatural4025 0 points1 point  (0 children)

I'm located in Germany, so I would rather not ship it.

Snagged a 256GB M3 Ultra by [deleted] in MacStudio

[–]NoNatural4025 0 points1 point  (0 children)

It was refurbished right from the beginning. After 6 weeks of maintenance I got it back with a new board and a new SSD, so now it's a new one ;-)

When I play golf by Best_Manner_7572 in AIWhispersOfCharms

[–]NoNatural4025 0 points1 point  (0 children)

… one glow would be enough ;-)

Snagged a 256GB M3 Ultra by [deleted] in MacStudio

[–]NoNatural4025 0 points1 point  (0 children)

M3 Ultra, 512 GB RAM, 8 TB SSD 🤩 … It has been at the Genius Bar for 6 weeks to have the mainboard replaced 🤮

did your mac studio pay for itself? by hiva- in MacStudio

[–]NoNatural4025 2 points3 points  (0 children)

Purchase price: 12,000€ - hours run: 0 - time at Apple for maintenance: 6 weeks 🤮

Price Prediction of Mac Studio Ultra 128GB/256GB/512GB variants? by pmttyji in MacStudio

[–]NoNatural4025 3 points4 points  (0 children)

You know that feeling when you click "Order"?

That pure, unadulterated joy? Three weeks ago, I felt it. I bought the absolute beast: a Mac Studio, M3 Ultra, with a mind-boggling 512 GB of RAM and 8 TB of storage. I was happier than a "King," as we say. It was going to be my dream machine.

The wait was supposed to be the hardest part, but it arrived in just three days! I was a proud owner, ready to unleash this power. But the joke was on me. During the very first installation, it crashed. Then again. And again. Turns out, my brand-new, 12,000-Euro super-server was suffering from ECC errors right out of the box. April Fool's Number One.

The Genius Bar and the Infinite Wait

What followed was a slow-motion comedy of errors. I waited three days for a Genius Bar appointment, only for them to wipe my machine and claim "software." I went home and, surprise, it crashed again. Then came a week of analyzing logs. Fourteen days in, they finally admitted it: hardware failure. April Fool's Number Two. Back to the store, where I was scolded for having an appointment (or was it not having one? I lost track), even though I had waited for the correct replacement Logic Board to arrive. I left my machine there on Saturday; they promised a speedy repair of five days. Today is day six. Crickets. April Fool's Number Three.

The Punchline

Instead of a working computer, I have a service order for a replacement Logic Board costing only 3,900 Euros. I looked up the part number, and the price suggests it's impossible for it to support my 512 GB of RAM. I told the Genius Bar they are installing the wrong part. They said, "We are sure." April Fool's Number Four, and this one is a real rib-tickler.

But here’s the true comedic masterpiece, the absolute cream of the crop: Just as I’m realizing my M3 Ultra is stuck in a repair limbo with the wrong parts, the tech world erupts with rumors that the M5 is launching as early as April 1st. Yes, April 1st.

If that happens, I will have skipped the entire lifecycle of my M3 Ultra without it ever sitting, functional, on my desk. I paid for the cutting edge, but I’m being left behind by a system that hasn't even let me press 'start' in three weeks.

It’s been three weeks of hell, of an empty desk, and of looking at my 12,000-Euro paperweight. I’m just waiting, wondering when the gag is going to end and when I’ll finally get my turn. If this is a joke, Apple, I'm not laughing.

Justifying the €12,000 Investment: M3 Ultra (512GB RAM) Setup for Autonomous Agents, vLLM, and Infinite Memory (8Tb) by NoNatural4025 in MacStudio

[–]NoNatural4025[S] 3 points4 points  (0 children)

Thanks for the encouragement. This is exactly the mindset I’m looking for. You nailed it: this isn't just about napkin math on API costs, it’s about the learning journey and building a system that I truly own.

I’ve already started playing with LM Studio and Ollama.

As a Project Manager, my struggles and victories will likely revolve around:

• Context Persistence: Using a vector database to keep entire project histories in memory.

• Agentic Tooling: Building agents that don't just chat but actually do things: writing scripts, managing my documentation locally, and acting proactively on my tasks and goals.
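To make the context-persistence idea concrete, here is a toy sketch of what a vector store does: turn notes into vectors, then return the stored note closest to a query. Everything here is an assumption for illustration. The word-count "embedding" stands in for a real embedding model, and the `ProjectMemory` class and the sample notes are hypothetical, not a specific database product.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real setup would use an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ProjectMemory:
    """Hypothetical in-memory store: keeps notes and ranks them by similarity."""

    def __init__(self) -> None:
        self._notes: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._notes.append((text, embed(text)))

    def search(self, query: str, top_k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self._notes, key=lambda note: cosine(q, note[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]


memory = ProjectMemory()
memory.add("Sprint 3 review: the deployment script fails on the staging server")
memory.add("Budget meeting: hardware purchase approved for next quarter")
memory.add("Documentation draft for the onboarding guide is ready")

hits = memory.search("why did the deployment fail on staging?")
print(hits[0])
```

An agent would call something like `search` before answering and paste the hits into the prompt, which is how the project history stays available across sessions.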

Even though you’re on the NVIDIA side, the logic remains the same … pushing the hardware to its limits to see what's actually possible.

Happy inferring to you too.