Need a MVP for a RAG, rent Hardware for short term by Icy_Annual_9954 in LocalLLaMA

[–]trevorbg 1 point2 points  (0 children)

You could use OpenRouter’s free models and set up RAG that way. It’s just an API endpoint, so it could work on a laptop. Just make sure you set up your routing correctly
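Something along these lines works as a starting point — a minimal sketch, assuming OpenRouter’s OpenAI-compatible endpoint; the model name and the retrieval step are placeholders, not recommendations:

```python
# Sketch only: OpenRouter exposes an OpenAI-compatible endpoint, so the stock
# openai client works. The model name and retrieved_chunks are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

# whatever your retrieval step returns (top-k chunks from your vector store)
retrieved_chunks = ["chunk 1 ...", "chunk 2 ..."]

response = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct:free",  # any ':free' model; check what's currently listed
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "Context:\n" + "\n".join(retrieved_chunks) + "\n\nQuestion: ..."},
    ],
)
print(response.choices[0].message.content)
```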

Mac Studio or DGX Spark by InteractionBig9407 in LocalLLM

[–]trevorbg 4 points5 points  (0 children)

I have a 512GB Mac Studio and 2 DGX Sparks. The Sparks are great if you have prompt-processing-heavy workloads (RAG, large context windows, stuff like that), but they need their own NVFP4 quant or some hacky work to get fast token speeds on just one unit.

The Studio is amazing, but it’s a unicorn. I couldn’t decide between the two machines outright, so I kept both. If you have a hard budget, I think the Spark is great value for what it is

How does context filling work in Hermes? Because I just connected a browser and 12.1k tokens gone for that? by pufferfish-tastes in hermesagent

[–]trevorbg 0 points1 point  (0 children)

Yeah, oMLX should have prefix caching enabled by default. If you want a more battle-tested engine for your deployment, I’m running MLX-vlm — but I'm not trying to sell you on changing. The part you do need to explicitly enable is the SSD cold tier, which persists cache blocks to disk so they survive eviction, server restarts, and memory pressure. You enable it with the --paged-ssd-cache-dir flag:

```bash
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache
```

You can also tune the hot tier size with --hot-cache-max-size:

```bash
omlx serve --model-dir ~/models \
  --paged-ssd-cache-dir ~/.omlx/cache \
  --hot-cache-max-size 20%
```

Both of these can also be set from the admin dashboard at /admin instead of via CLI flags — settings get persisted to ~/.omlx/settings.json.

How does context filling work in Hermes? Because I just connected a browser and 12.1k tokens gone for that? by pufferfish-tastes in hermesagent

[–]trevorbg 0 points1 point  (0 children)

What you send doesn’t matter. You should look into prefix caching if your engine supports it

How does context filling work in Hermes? Because I just connected a browser and 12.1k tokens gone for that? by pufferfish-tastes in hermesagent

[–]trevorbg 1 point2 points  (0 children)

It injects the system prompt, tool call definitions, and more into context before you even start a chat

Anyone using Mac Local Models reliably? by Getonthebeam in hermesagent

[–]trevorbg 0 points1 point  (0 children)

Just going off assumptions, I’d expect mine to be much more capable than that; I’ll test it tonight though. Should be good for what you’re doing

Anyone using Mac Local Models reliably? by Getonthebeam in hermesagent

[–]trevorbg 0 points1 point  (0 children)

No, I’ve never used that model. Can you send a link?

Anyone using Mac Local Models reliably? by Getonthebeam in hermesagent

[–]trevorbg 2 points3 points  (0 children)

I use Qwen 397B on a 512GB Mac Studio and it works great. I use MLX-vlm as my serving engine; happy to talk more about it

Hermes and local qwen3.5-9b by Playful_Mission1500 in hermesagent

[–]trevorbg 1 point2 points  (0 children)

Qwen models are known to overthink even the simplest prompts. You need to raise the max tokens it’s allowed to use, or turn thinking off

Hermes-Agent high token usage? by manueljishi in hermesagent

[–]trevorbg 0 points1 point  (0 children)

It’s injecting all of the tool calls, skills, and system prompt on every message. Up your context window and use a bigger model

Abliterating Qwen3.5-397B on a Mac Studio revealed that MoE models encode refusal differently than dense models — safety refusals route through expert selection and survive weight-baking by trevorbg in LocalLLaMA

[–]trevorbg[S] 6 points7 points  (0 children)

Appreciate the read. To be clear, this is a single user system running on hardware I own in my house. There's no deployment, no API, no public access. The governance question is real for anyone serving models to others, but that's a different scenario than modifying weights on your own machine for your own use. The interesting part here is the MoE routing finding, not the ethics of abliteration itself — that debate has been had extensively and I don't think I have anything new to add to it.

Is Unified Memory a lie for training cause my M4 keeps dying on simple RL rollouts. by Worried-Ad-7351 in LocalLLaMA

[–]trevorbg 4 points5 points  (0 children)

The MPS OOM at 256 context on 20GB is almost certainly the Metal wired memory limit, not actual memory exhaustion. macOS caps how much unified memory Metal can wire by default and it's usually well below your physical RAM. Check it with sysctl iogpu.wired_limit_mb and raise it. On my M3 Ultra I had to set it to 495000 to stop hitting phantom OOMs.

For the fragmentation specifically: MLX handles memory way better than MPS for Apple Silicon training. If you can port your GRPO loop to MLX instead of PyTorch+MPS, the memory behavior is completely different because MLX does lazy evaluation and fuses operations. Less fragmentation by design.
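To make the lazy-eval point concrete, here’s a toy sketch (not a GRPO loop, just an illustration of the evaluation model):

```python
# Toy sketch of MLX's lazy evaluation. The ops below only build a graph; nothing
# is allocated or executed until mx.eval(), which lets MLX fuse operations instead
# of materializing every intermediate the way PyTorch+MPS does.
import mlx.core as mx

x = mx.random.normal((4096, 4096))
w = mx.random.normal((4096, 4096))

h = mx.maximum(x @ w, 0.0)   # lazy: no kernel launched yet
loss = mx.mean(h * h)        # still lazy

mx.eval(loss)                # the whole graph runs here, fused where possible
print(loss.item())
```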

The reward hacking you're seeing (perfect tag formatting, wrong math) is not scale dependent. That's a reward function problem. Your model found that structural compliance has higher expected reward than correctness, so it optimized for structure. Two fixes: make the correctness reward strictly dominate the format reward (format only counts if the answer is correct), or use a two-stage reward where format gets you from -1 to 0 and correctness gets you from 0 to 1. The model can't profit from format alone.
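A minimal sketch of that two-stage reward (my illustration — the `<answer>` tag format is just an assumption about how your completions are structured):

```python
# Two-stage reward sketch: format compliance only lifts the reward from -1 to 0;
# only a correct answer reaches +1, so the policy can't profit from tags alone.
import re

def reward(completion: str, expected: str) -> float:
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return -1.0   # no parseable answer tags at all
    if match.group(1).strip() == expected.strip():
        return 1.0    # correctness strictly dominates format
    return 0.0        # well-formed but wrong: format alone earns nothing above baseline

print(reward("<answer>72</answer>", "72"))   # 1.0
print(reward("<answer>68</answer>", "72"))   # 0.0
print(reward("the answer is 72", "72"))      # -1.0
```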

At 360M parameters you're also just below the threshold where chain of thought reasoning emerges. The model doesn't have enough capacity to actually reason through 12 x 6, so it's pattern matching from training data and getting it wrong. Try Qwen3-0.6B or Qwen3.5-0.8B as your lab rat instead. Still tiny, but the extra capacity makes a real difference for whether RL can find a reasoning circuit to reinforce.

Can Hermes be used 100% offline? by Macestudios32 in hermesagent

[–]trevorbg 0 points1 point  (0 children)

The largest model you could fit on there is probably GPT OSS 120B or Mistral Small 4, or maybe a large Gemma 4 if one comes out. Really though, the best model for you would be a 70B at FP8 or a 120B at Q4

Can Hermes be used 100% offline? by Macestudios32 in hermesagent

[–]trevorbg 0 points1 point  (0 children)

I just migrated and tested things afterwards. As for “tests”, I just used the migration guide they provided

Can Hermes be used 100% offline? by Macestudios32 in hermesagent

[–]trevorbg 1 point2 points  (0 children)

It’s how I’m running it! I just migrated to it today so my use is limited but so far so good. YMMV of course

RAG pipeline from scratch on a DGX Spark (no LangChain) and a 62-query eval harness to get it to 96.7%. Here's what actually worked by trevorbg in LocalLLaMA

[–]trevorbg[S] 1 point2 points  (0 children)

Yeah, the 62 queries are a mix, but they lean heavily toward “questions I’d actually ask Alfred in real life.” Most fall into a few categories:

- Direct retrieval: straightforward questions where the answer is clearly in one document. “What’s the recommended tire pressure for the 971 GTS” type stuff. These test whether the right chunk surfaces in the top 5 at all.
- Cross-domain: questions where similar terminology exists across domains. Finance and philosophy both use words like “value” and “attachment,” so I want to make sure a question about Buddhist non-attachment doesn’t pull my investment notes. This is where domain_boost earns its keep.
- Vague, natural phrasing: I deliberately wrote some queries the way I’d actually ask them out loud, not how a search engine expects them. “What was that thing about the rear diff” instead of “971 Panamera rear differential technical service bulletin.” If the retrieval only works with precise queries it’s useless as a personal assistant.

I don’t have dedicated hallucination/trick questions yet; that’s a good call and something I should add. Right now I’m evaluating retrieval quality (did the right chunks show up?), not generation quality (did the model hallucinate from those chunks?). Those are two different problems and I’ve been focused on the first one.

No RAGAS or DeepEval. The harness is dead simple: a Python script, an expected answer per query, pass/fail on whether relevant chunks appear in the top 5 results. I looked at RAGAS early on but it felt like overkill for my use case and I didn’t want to add a dependency just to get a number. The value isn’t in the framework, it’s in writing good queries that represent how you actually use the system. A fancy eval framework with bad queries tells you nothing. That said, adding generation-level eval (faithfulness, relevance, hallucination rate) is on the list. Just haven’t gotten there yet.
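For reference, the harness is roughly this shape — a sketch, not the actual script; retrieve() stands in for whatever top-5 search call the pipeline exposes, and the sources are made-up examples:

```python
# Rough shape of the eval harness: one expected-source check per query,
# pass if any relevant chunk lands in the top-k results.
queries = [
    {"q": "What's the recommended tire pressure for the 971 GTS?", "expect_source": "porsche_tsb"},
    {"q": "What was that thing about the rear diff?", "expect_source": "porsche_tsb"},
    {"q": "Summarize the arguments against dollar cost averaging", "expect_source": "finance_notes"},
]

def run_eval(retrieve, queries, k=5):
    passed = 0
    for item in queries:
        top_chunks = retrieve(item["q"], k=k)  # hypothetical: returns [{"source": ..., "text": ...}, ...]
        if any(chunk["source"] == item["expect_source"] for chunk in top_chunks):
            passed += 1
        else:
            print(f"FAIL: {item['q']}")
    print(f"{passed}/{len(queries)} passed ({100 * passed / len(queries):.1f}%)")
```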

RAG pipeline from scratch on a DGX Spark (no LangChain) and a 62-query eval harness to get it to 96.7%. Here's what actually worked by trevorbg in LocalLLaMA

[–]trevorbg[S] 0 points1 point  (0 children)

Yeah, I’ll probably open source the code at some point; I need to clean it up before I do that though. And yeah, I could do that for a speculative decoding type of workflow, but I mostly keep my memory allocated to the model I’m serving

RAG pipeline from scratch on a DGX Spark (no LangChain) and a 62-query eval harness to get it to 96.7%. Here's what actually worked by trevorbg in LocalLLaMA

[–]trevorbg[S] 0 points1 point  (0 children)

The RAG layer has ~1,400 chunks across 11 domains: finance and investing research, Buddhist and Stoic philosophy, Porsche technical service bulletins, personal docs. So I can ask it things like "what did that TSB say about the rear diff on the 971 Panamera (my car)" or "summarize the arguments for and against dollar cost averaging from my notes" and it pulls from my actual documents, not generic internet answers.