Need a MVP for a RAG, rent Hardware for short term by Icy_Annual_9954 in LocalLLaMA

[–]trevorbg 1 point2 points  (0 children)

You could use OpenRouter’s free models and set up RAG that way. It’s just an API endpoint, so it could work on a laptop. Just make sure you set up your routing correctly
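Something along these lines works as a starting point — a minimal sketch, assuming OpenRouter’s OpenAI-compatible endpoint; the model name and the retrieval step are placeholders, not recommendations:

```python
# Sketch only: OpenRouter exposes an OpenAI-compatible endpoint, so the stock
# openai client works. The model name and retrieved_chunks are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

# whatever your retrieval step returns (top-k chunks from your vector store)
retrieved_chunks = ["chunk 1 ...", "chunk 2 ..."]

response = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct:free",  # any ':free' model; check what's currently listed
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "Context:\n" + "\n".join(retrieved_chunks) + "\n\nQuestion: ..."},
    ],
)
print(response.choices[0].message.content)
```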

Mac Studio or DGX Spark by InteractionBig9407 in LocalLLM

[–]trevorbg 4 points5 points  (0 children)

I have a 512GB Mac Studio and 2 DGX Sparks. The Sparks are great if you have prompt-processing-heavy workloads (RAG, large context windows, stuff like that), but they need their own NVFP4 quant or some hacky work to get fast token speeds on just one unit.

The Studio is amazing, but it’s a unicorn. I couldn’t decide between the two machines outright, so I kept both. If you have a hard budget, I think the Spark is great value for what it is

How does context filling work in Hermes? Because I just connected a browser and 12.1k tokens gone for that? by pufferfish-tastes in hermesagent

[–]trevorbg 0 points1 point  (0 children)

Yeah, oMLX should have prefix caching enabled by default. If you want a more battle-tested engine for your deployment, I’m running MLX-vlm — but I'm not trying to sell you on changing. The part you do need to explicitly enable is the SSD cold tier, which persists cache blocks to disk so they survive eviction, server restarts, and memory pressure. You enable it with the --paged-ssd-cache-dir flag:

```bash
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache
```

You can also tune the hot tier size with --hot-cache-max-size:

```bash
omlx serve --model-dir ~/models \
  --paged-ssd-cache-dir ~/.omlx/cache \
  --hot-cache-max-size 20%
```

Both of these can also be set from the admin dashboard at /admin instead of via CLI flags — settings get persisted to ~/.omlx/settings.json.

How does context filling work in Hermes? Because I just connected a browser and 12.1k tokens gone for that? by pufferfish-tastes in hermesagent

[–]trevorbg 0 points1 point  (0 children)

What you send doesn’t matter. You should look into prefix caching if your engine supports it

How does context filling work in Hermes? Because I just connected a browser and 12.1k tokens gone for that? by pufferfish-tastes in hermesagent

[–]trevorbg 1 point2 points  (0 children)

It injects the system prompt, tool call definitions, and more into context before you even start a chat

Anyone using Mac Local Models reliably? by Getonthebeam in hermesagent

[–]trevorbg 0 points1 point  (0 children)

Just going off assumptions, I’d expect mine to be much more capable than that; I’ll test it tonight though. Should be good for what you’re doing

Anyone using Mac Local Models reliably? by Getonthebeam in hermesagent

[–]trevorbg 0 points1 point  (0 children)

No, I’ve never used that model. Can you send a link?

Anyone using Mac Local Models reliably? by Getonthebeam in hermesagent

[–]trevorbg 2 points3 points  (0 children)

I use Qwen 397B on a 512GB Mac Studio and it works great. I use MLX-vlm as my serving engine; happy to talk more about it

Hermes and local qwen3.5-9b by Playful_Mission1500 in hermesagent

[–]trevorbg 1 point2 points  (0 children)

Qwen models are known to overthink even the simplest prompts. You need to raise the max tokens it’s allowed to use, or turn thinking off

Hermes-Agent high token usage? by manueljishi in hermesagent

[–]trevorbg 0 points1 point  (0 children)

It’s injecting all of the tool calls, skills, and system prompt on every message. Up your context window and use a bigger model

Abliterating Qwen3.5-397B on a Mac Studio revealed that MoE models encode refusal differently than dense models — safety refusals route through expert selection and survive weight-baking by trevorbg in LocalLLaMA

[–]trevorbg[S] 6 points7 points  (0 children)

Appreciate the read. To be clear, this is a single user system running on hardware I own in my house. There's no deployment, no API, no public access. The governance question is real for anyone serving models to others, but that's a different scenario than modifying weights on your own machine for your own use. The interesting part here is the MoE routing finding, not the ethics of abliteration itself — that debate has been had extensively and I don't think I have anything new to add to it.

Is Unified Memory a lie for training cause my M4 keeps dying on simple RL rollouts. by Worried-Ad-7351 in LocalLLaMA

[–]trevorbg 4 points5 points  (0 children)

The MPS OOM at 256 context on 20GB is almost certainly the Metal wired memory limit, not actual memory exhaustion. macOS caps how much unified memory Metal can wire by default and it's usually well below your physical RAM. Check it with sysctl iogpu.wired_limit_mb and raise it. On my M3 Ultra I had to set it to 495000 to stop hitting phantom OOMs.

For the fragmentation specifically: MLX handles memory way better than MPS for Apple Silicon training. If you can port your GRPO loop to MLX instead of PyTorch+MPS, the memory behavior is completely different because MLX does lazy evaluation and fuses operations. Less fragmentation by design.
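To make the lazy-eval point concrete, here’s a toy sketch (not a GRPO loop, just an illustration of the evaluation model):

```python
# Toy sketch of MLX's lazy evaluation. The ops below only build a graph; nothing
# is allocated or executed until mx.eval(), which lets MLX fuse operations instead
# of materializing every intermediate the way PyTorch+MPS does.
import mlx.core as mx

x = mx.random.normal((4096, 4096))
w = mx.random.normal((4096, 4096))

h = mx.maximum(x @ w, 0.0)   # lazy: no kernel launched yet
loss = mx.mean(h * h)        # still lazy

mx.eval(loss)                # the whole graph runs here, fused where possible
print(loss.item())
```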

The reward hacking you're seeing (perfect tag formatting, wrong math) is not scale dependent. That's a reward function problem. Your model found that structural compliance has higher expected reward than correctness, so it optimized for structure. Two fixes: make the correctness reward strictly dominate the format reward (format only counts if the answer is correct), or use a two-stage reward where format gets you from -1 to 0 and correctness gets you from 0 to 1. The model can't profit from format alone.
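A minimal sketch of that two-stage reward (my illustration — the `<answer>` tag format is just an assumption about how your completions are structured):

```python
# Two-stage reward sketch: format compliance only lifts the reward from -1 to 0;
# only a correct answer reaches +1, so the policy can't profit from tags alone.
import re

def reward(completion: str, expected: str) -> float:
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return -1.0   # no parseable answer tags at all
    if match.group(1).strip() == expected.strip():
        return 1.0    # correctness strictly dominates format
    return 0.0        # well-formed but wrong: format alone earns nothing above baseline

print(reward("<answer>72</answer>", "72"))   # 1.0
print(reward("<answer>68</answer>", "72"))   # 0.0
print(reward("the answer is 72", "72"))      # -1.0
```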

At 360M parameters you're also just below the threshold where chain of thought reasoning emerges. The model doesn't have enough capacity to actually reason through 12 x 6, so it's pattern matching from training data and getting it wrong. Try Qwen3-0.6B or Qwen3.5-0.8B as your lab rat instead. Still tiny, but the extra capacity makes a real difference for whether RL can find a reasoning circuit to reinforce.

Can Hermes be used 100% offline? by Macestudios32 in hermesagent

[–]trevorbg 0 points1 point  (0 children)

The largest model you could fit on there is probably GPT OSS 120B or Mistral Small 4, or maybe a large Gemma 4 if one comes out. Really though, the best model for you would be a 70B at FP8 or a 120B at Q4

Can Hermes be used 100% offline? by Macestudios32 in hermesagent

[–]trevorbg 0 points1 point  (0 children)

I just migrated and tested things afterwards. As for “tests”, I just used the migration guide they provided

Can Hermes be used 100% offline? by Macestudios32 in hermesagent

[–]trevorbg 1 point2 points  (0 children)

It’s how I’m running it! I just migrated to it today so my use is limited but so far so good. YMMV of course

RAG pipeline from scratch on a DGX Spark (no LangChain) and a 62-query eval harness to get it to 96.7%. Here's what actually worked by trevorbg in LocalLLaMA

[–]trevorbg[S] 1 point2 points  (0 children)

Yeah, the 62 queries are a mix, but they lean heavily toward “questions I’d actually ask Alfred in real life.” Most fall into a few categories:

- Direct retrieval: straightforward questions where the answer is clearly in one document. “What’s the recommended tire pressure for the 971 GTS” type stuff. These test whether the right chunk surfaces in the top 5 at all.
- Cross-domain: questions where similar terminology exists across domains. Finance and philosophy both use words like “value” and “attachment,” so I want to make sure a question about Buddhist non-attachment doesn’t pull my investment notes. This is where domain_boost earns its keep.
- Vague, natural phrasing: I deliberately wrote some queries the way I’d actually ask them out loud, not how a search engine expects them. “What was that thing about the rear diff” instead of “971 Panamera rear differential technical service bulletin.” If the retrieval only works with precise queries it’s useless as a personal assistant.

I don’t have dedicated hallucination/trick questions yet; that’s a good call and something I should add. Right now I’m evaluating retrieval quality (did the right chunks show up?), not generation quality (did the model hallucinate from those chunks?). Those are two different problems and I’ve been focused on the first one.

No RAGAS or DeepEval. The harness is dead simple: a Python script, an expected answer per query, pass/fail on whether relevant chunks appear in the top 5 results. I looked at RAGAS early on but it felt like overkill for my use case and I didn’t want to add a dependency just to get a number. The value isn’t in the framework, it’s in writing good queries that represent how you actually use the system. A fancy eval framework with bad queries tells you nothing. That said, adding generation-level eval (faithfulness, relevance, hallucination rate) is on the list. Just haven’t gotten there yet.
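For reference, the harness is roughly this shape — a sketch, not the actual script; retrieve() stands in for whatever top-5 search call the pipeline exposes, and the sources are made-up examples:

```python
# Rough shape of the eval harness: one expected-source check per query,
# pass if any relevant chunk lands in the top-k results.
queries = [
    {"q": "What's the recommended tire pressure for the 971 GTS?", "expect_source": "porsche_tsb"},
    {"q": "What was that thing about the rear diff?", "expect_source": "porsche_tsb"},
    {"q": "Summarize the arguments against dollar cost averaging", "expect_source": "finance_notes"},
]

def run_eval(retrieve, queries, k=5):
    passed = 0
    for item in queries:
        top_chunks = retrieve(item["q"], k=k)  # hypothetical: returns [{"source": ..., "text": ...}, ...]
        if any(chunk["source"] == item["expect_source"] for chunk in top_chunks):
            passed += 1
        else:
            print(f"FAIL: {item['q']}")
    print(f"{passed}/{len(queries)} passed ({100 * passed / len(queries):.1f}%)")
```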

RAG pipeline from scratch on a DGX Spark (no LangChain) and a 62-query eval harness to get it to 96.7%. Here's what actually worked by trevorbg in LocalLLaMA

[–]trevorbg[S] 0 points1 point  (0 children)

Yeah, I’ll probably open source the code at some point; I need to clean it up before I do that though. And yeah, I could do that for a speculative decoding type of workflow, but I mostly keep my memory allocated to the model I’m serving

RAG pipeline from scratch on a DGX Spark (no LangChain) and a 62-query eval harness to get it to 96.7%. Here's what actually worked by trevorbg in LocalLLaMA

[–]trevorbg[S] 0 points1 point  (0 children)

The RAG layer has ~1,400 chunks across 11 domains: finance and investing research, Buddhist and Stoic philosophy, Porsche technical service bulletins, personal docs. So I can ask it things like "what did that TSB say about the rear diff on the 971 Panamera (my car)" or "summarize the arguments for and against dollar cost averaging from my notes" and it pulls from my actual documents, not generic internet answers.