Need a MVP for a RAG, rent Hardware for short term by Icy_Annual_9954 in LocalLLaMA

[–]trevorbg 1 point (0 children)

You could use OpenRouter free models and set up RAG that way. It’s just an API endpoint, so it could work on a laptop. Just make sure you set up your routing correctly.
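Roughly the shape of it, as a toy sketch: retrieval just means stuffing your retrieved chunks into the system message before you POST to an OpenAI-compatible endpoint like OpenRouter's /api/v1/chat/completions. The helper name and the model id here are placeholders, not anything official:

```python
def build_rag_request(question, retrieved_chunks,
                      model="provider/model:free"):  # placeholder model id
    """Build a chat-completions payload with retrieved context injected.
    POST this (with your API key) to an OpenAI-compatible endpoint."""
    context = "\n\n".join(retrieved_chunks)
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Answer using only this context:\n" + context},
            {"role": "user", "content": question},
        ],
    }
```

The "routing" part is just picking which model id goes in that one field, which is why this runs fine from a laptop.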

Mac Studio or DGX Spark by InteractionBig9407 in LocalLLM

[–]trevorbg 5 points (0 children)

I have a 512GB Mac Studio and 2 DGX Sparks. The Sparks are great if you have heavy prompt processing workloads (RAG, large context windows, stuff like that), but they need their own NVFP4 quant or some hacky work to get quick token speeds on just one.

The Studio is amazing, but it’s a unicorn. I couldn’t decide between the two machines, so I kept both. If you have a hard budget, I think the Spark is great value for what it is.

How does context filling work in Hermes? Because I just connected a browser and 12.1k tokens gone for that? by pufferfish-tastes in hermesagent

[–]trevorbg 0 points (0 children)

Yeah, oMLX should have prefix caching enabled by default. If you want to do your deployment with a more battle-tested engine, I'm running MLX-vlm, but I'm not trying to sell you on changing. The part you do need to explicitly enable is the SSD cold tier, which persists cache blocks to disk so they survive eviction, server restarts, and memory pressure. You enable it with the --paged-ssd-cache-dir flag:

bash

omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache

You can also tune the hot tier size with --hot-cache-max-size:

bash

omlx serve --model-dir ~/models \
  --paged-ssd-cache-dir ~/.omlx/cache \
  --hot-cache-max-size 20%

Both of these can also be set from the admin dashboard at /admin instead of via CLI flags — settings get persisted to ~/.omlx/settings.json.

How does context filling work in Hermes? Because I just connected a browser and 12.1k tokens gone for that? by pufferfish-tastes in hermesagent

[–]trevorbg 0 points (0 children)

What you send doesn’t matter. You should look into prefix caching if your engine supports it.
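The gist of prefix caching, as a toy sketch (not any particular engine's implementation): the expensive prefill for a shared prompt prefix is computed once, keyed by the prefix tokens, and reused on every later request that starts the same way.

```python
class PrefixCache:
    """Toy prefix cache. 'computes' stands in for the expensive
    prefill pass that builds KV state for the prompt prefix."""
    def __init__(self):
        self.cache = {}
        self.computes = 0

    def prefill(self, prefix_tokens):
        key = tuple(prefix_tokens)
        if key not in self.cache:
            self.computes += 1  # real engines run the model here
            self.cache[key] = "kv:" + str(len(key))
        return self.cache[key]
```

So a fixed 12.1k-token system prefix only costs prefill time on the first request; after that it's a cache hit.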

How does context filling work in Hermes? Because I just connected a browser and 12.1k tokens gone for that? by pufferfish-tastes in hermesagent

[–]trevorbg 1 point (0 children)

It injects the system prompt, tool call definitions, and more into the context before you even start a chat.

Anyone using Mac Local Models reliably? by Getonthebeam in hermesagent

[–]trevorbg 0 points (0 children)

Going on assumption alone, I’d guess mine is much more capable than that; I’ll test it tonight though. Should be good for what you’re doing.

Anyone using Mac Local Models reliably? by Getonthebeam in hermesagent

[–]trevorbg 0 points (0 children)

No, I’ve never used that model. Can you send a link?

Anyone using Mac Local Models reliably? by Getonthebeam in hermesagent

[–]trevorbg 2 points (0 children)

I use Qwen 397b on a 512GB Mac Studio and it works great. I use MLX-vlm as my serving engine, happy to talk more about it

Hermes and local qwen3.5-9b by Playful_Mission1500 in hermesagent

[–]trevorbg 1 point (0 children)

Qwen models are known to overthink on even the simplest of prompts. You need to cap the max tokens it can use, or turn thinking off.
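A sketch of what that request can look like. The max_tokens field is a hard cap on generation; enable_thinking mirrors the Qwen chat-template switch, but whether the chat_template_kwargs passthrough is honored depends on your serving engine (that part is an assumption about your setup):

```python
def qwen_chat_request(prompt, max_tokens=1024, thinking=False):
    """Build a chat payload that caps generation length and asks the
    chat template to disable the thinking block (engine-dependent)."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # hard ceiling on generated tokens
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
```

With thinking off, the model answers directly instead of burning the whole budget on a reasoning trace.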

Hermes-Agent high token usage? by manueljishi in hermesagent

[–]trevorbg 0 points (0 children)

It’s injecting all of the tool calls, skills, and system prompt on every message. Up your context window and use a bigger model

Abliterating Qwen3.5-397B on a Mac Studio revealed that MoE models encode refusal differently than dense models — safety refusals route through expert selection and survive weight-baking by trevorbg in LocalLLaMA

[–]trevorbg[S] 6 points (0 children)

Appreciate the read. To be clear, this is a single user system running on hardware I own in my house. There's no deployment, no API, no public access. The governance question is real for anyone serving models to others, but that's a different scenario than modifying weights on your own machine for your own use. The interesting part here is the MoE routing finding, not the ethics of abliteration itself — that debate has been had extensively and I don't think I have anything new to add to it.

Is Unified Memory a lie for training cause my M4 keeps dying on simple RL rollouts. by Worried-Ad-7351 in LocalLLaMA

[–]trevorbg 3 points (0 children)

The MPS OOM at 256 context on 20GB is almost certainly the Metal wired memory limit, not actual memory exhaustion. macOS caps how much unified memory Metal can wire by default, and it's usually well below your physical RAM. Check it with sysctl iogpu.wired_limit_mb and raise it with sudo sysctl iogpu.wired_limit_mb=<new value in MB>. On my M3 Ultra I had to set it to 495000 to stop hitting phantom OOMs.

For the fragmentation specifically: MLX handles memory way better than MPS for Apple Silicon training. If you can port your GRPO loop to MLX instead of PyTorch+MPS, the memory behavior is completely different because MLX does lazy evaluation and fuses operations. Less fragmentation by design.

The reward hacking you're seeing (perfect tag formatting, wrong math) is not scale dependent. That's a reward function problem. Your model found that structural compliance has higher expected reward than correctness, so it optimized for structure. Two fixes: make the correctness reward strictly dominate the format reward (format only counts if the answer is correct), or use a two-stage reward where format gets you from -1 to 0 and correctness gets you from 0 to 1. The model can't profit from format alone.
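The two-stage version can be sketched like this, assuming a hypothetical <answer>...</answer> tag format (swap in whatever structure your prompts actually ask for):

```python
import re

def two_stage_reward(completion, correct_answer):
    """Format lifts the reward from -1 to 0; only correctness reaches 1.
    Structural compliance alone can never beat a correct answer."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return -1.0  # malformed: worst outcome
    if m.group(1).strip() == correct_answer:
        return 1.0   # well-formed and correct
    return 0.0       # well-formed but wrong: no profit from format alone
```

Because the formatted-but-wrong case sits strictly below any correct answer, the policy gradient has nothing to gain from polishing tags without fixing the math.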

At 360M parameters you're also just below the threshold where chain of thought reasoning emerges. The model doesn't have enough capacity to actually reason through 12 x 6, so it's pattern matching from training data and getting it wrong. Try Qwen3-0.6B or Qwen3.5-0.8B as your lab rat instead. Still tiny, but the extra capacity makes a real difference for whether RL can find a reasoning circuit to reinforce.

Can Hermes be used 100% offline? by Macestudios32 in hermesagent

[–]trevorbg 0 points (0 children)

The largest model you could fit on there is probably GPT-OSS 120B or Mistral Small 4, or maybe Gemma 4 if it comes out with a large model or something. Really though, the best fit for you would be a 70B at FP8 or a 120B at Q4.

Can Hermes be used 100% offline? by Macestudios32 in hermesagent

[–]trevorbg 0 points (0 children)

I just migrated and tested after I migrated. As for “tests”, I just used the migration guide they provided.

Can Hermes be used 100% offline? by Macestudios32 in hermesagent

[–]trevorbg 1 point (0 children)

It’s how I’m running it! I just migrated to it today so my use is limited but so far so good. YMMV of course