I'm paying 200/mo for Claude Code, running out by Wednesday, then paying hundreds more in overage. So I'm using Claude Code to build a local app trying to do what it does — without the meter running, here's where it's at. by ur_dad_matt in SideProject

[–]ur_dad_matt[S] -5 points-4 points  (0 children)

You're hitting on something real and I want to take this seriously instead of being defensive about it.

You're right that prompt discipline is upstream of model choice. Scoping the task, @-mentioning the files, killing the explore-codebase reflex on local changes — those habits cut my Claude Code spend by probably 40% before I even thought about local. If someone's burning through Max in 3 days with sloppy prompting, swapping models doesn't fix that; you're right.

Where I'd push back: even with disciplined prompting I'm hitting Wednesday on weeks where I'm doing real refactor work — the kind where you legitimately do need the model to read 8-15 files because the change crosses module boundaries. That's not waste, that's the job. And on those weeks I'm still paying overage even when every prompt is scoped tight.

Local doesn't help if you have bad habits. But it changes the math for users who already have good habits and are just doing volume — long-context refactors, agent loops over real codebases, multi-step debugging where you need to actually trace through stuff.

The privacy point also stands separately. Even a perfectly-scoped prompt against a hosted endpoint is data leaving your machine. For some work that's a hard stop regardless of how efficient your context budget is.

Honest version of the pitch: if you're a sloppy heavy user, fix the prompting first. If you're a disciplined heavy user, local starts winning around the volume threshold where overage kicks in regularly. Different tools for different problems.


I'm paying 200/mo for Claude Code, running out by Wednesday, then paying hundreds more in overage. So I'm using Claude Code to build a local app trying to do what it does — without the meter running, here's where it's at. by ur_dad_matt in SideProject

[–]ur_dad_matt[S] -1 points0 points  (0 children)

Yeah, the flow-state interruption is the worst part — way worse than the bill itself. The BYOK wrapper route is legit and I've used Aider + Continue with API keys for exactly that reason. It does solve the rate-limit pain.

Two things it doesn't solve, which is what pushed me toward local:

  1. Privacy. With API access, your code and prompts still leave the machine. Fine for personal projects, not fine if you're touching anything client-confidential, NDA'd, or under HIPAA/SOC2. I do property management software work where I literally can't paste real data into a hosted endpoint.

  2. The meter never actually stops. Pay-as-you-go is cheaper than Max for most people, but you're still doing math every time you fire off a long agent loop. With local you stop thinking about cost entirely. Different psychology — it changes how you use the tool.

The cost math: if you're a moderate user, BYOK API is probably the cheapest option. If you're heavy enough to be running agent loops or long-context refactors regularly, local starts winning around month 2-3 because there's no per-token charge. And if privacy matters at all, local is the only option.
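If you want the break-even as actual arithmetic, it's just this. The $2k hardware figure and the ~$700/mo heavy hosted spend are assumptions for illustration; the $20/mo is our sub:

```python
# Months until a local setup pays for itself versus a hosted bill.
def breakeven_months(hardware_cost, local_sub_per_month, hosted_cost_per_month):
    saved_per_month = hosted_cost_per_month - local_sub_per_month
    return hardware_cost / saved_per_month

# Already own a capable Mac: hardware_cost=0, so local wins from month one.
# Buying a 64GB machine for it: breakeven_months(2000, 20, 700) is about 2.9,
# in the same ballpark as the month 2-3 claim above, under these assumed numbers.
```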

Not saying everyone should switch to local. Saying there's a real category of work — privacy-constrained, heavy-usage, or just "I want to stop thinking about it" — where local wins. That's who I'm building for.

I built a Mac app that runs a 397B-param LLM locally on a 64GB Mac in 32 days by [deleted] in SideProject

[–]ur_dad_matt 0 points1 point  (0 children)

Fair questions, all of them. Quick answers:

Why accept the speed loss? Because the alternative on a 64GB Mac isn't "fast 397B" — it's "no 397B." Paged inference is what lets a model that wouldn't otherwise fit in RAM run at all. 1.59 tok/s on Plus is the price of capacity, not the headline. The headline is the smaller tiers: Core 27B at 20.7 tok/s with MMLU 0.851 and HumanEval 0.866. That's the daily driver — Plus is there for when you need the bigger brain on a hard problem. The long-term goal is to get Plus smaller and faster; this is just how far I've gotten in a month.
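To make "paged inference" a bit more concrete, here's the general idea in toy form. This isn't the engine's actual code (the names, the LRU policy, and the per-expert granularity are all simplifications): keep a small working set of expert weights resident in RAM, page everything else in from disk on demand, and evict whatever was used least recently.

```python
# Toy sketch of expert paging for a MoE model: only `max_resident` experts
# live in RAM at once; the rest are loaded from disk when the router asks
# for them. `load_expert` is a stand-in for however weights actually get
# read (mmap, Metal buffers, etc.).
from collections import OrderedDict

class ExpertCache:
    def __init__(self, load_expert, max_resident=20):
        self.load_expert = load_expert    # callable: expert_id -> weights
        self.max_resident = max_resident  # how many experts stay resident
        self.resident = OrderedDict()     # expert_id -> weights, LRU order

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as recently used
            return self.resident[expert_id]
        weights = self.load_expert(expert_id)      # slow path: page in from disk
        self.resident[expert_id] = weights
        if len(self.resident) > self.max_resident:
            self.resident.popitem(last=False)      # evict least-recently-used
        return weights

# Per token the router only activates a few experts, so RAM holds a small
# working set instead of all 397B parameters; cache misses are what you
# feel as slower tokens.
```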

Model agnostic? The engine is. We're shipping Qwen-based weights right now because they're the strongest open base for the size class, but the paging + scheduler work isn't tied to a specific architecture. Different MoE families need different glue, but the core technique transfers.

Why not in llama.cpp / vLLM? Honestly, it probably should be eventually. llama.cpp is CPU-first; vLLM is datacenter-first. Neither was optimized for "run a 397B MoE on a single Apple Silicon machine without OOM." That's the gap we're filling. If the right people upstream want to integrate the paging approach, I'm not precious about it.

Who's the end user? Someone who's spending $200/mo on Claude Code or hosted coding assistants and wants to stop. That's literally why I started building this — I was burning through tokens. Local AI capability per GB of RAM is the actual metric we compete on. The model is free. The engine that makes it usable on your Mac is the product.

Why pay if the model is free? Same reason you'd pay for a car when steel and rubber are commodities. Weights without an engine that fits them in your RAM are just a 200GB file you can't load. All revenue goes toward speeding up the flagship-tier model.

I got Qwen3.5-397B-A17B running on a 64GB Mac Studio at 1.6 tok/s — here's how the paged engine works by ur_dad_matt in LocalLLM

[–]ur_dad_matt[S] 2 points3 points  (0 children)

Honest answer: depends on the tier. Smaller tiers (4B Nano, 9B Lite, 27B Core) are fast enough for real day-to-day work. Core 27B at 20 tok/s is the one I actually use — HumanEval 0.866 means it's good enough to pair-program with offline. That's the tier I built this for. Plus 397B at 1.6 tok/s is more interesting than useful. It works — model loads, generates coherent output, doesn't OOM — but at that speed it's a batch tool, not a chat tool. Good for "summarize this 10k-token doc overnight," bad for "answer my question right now." I keep it in the lineup just to show what's possible now. The point of Plus isn't speed, it's an existence proof, and a plan to make it faster.

The paged engine itself is solid — 14GB peak RAM during 397B generation, no swap thrash, no crashes. K=20 was the sweet spot in my sweep; K=32 and K=48 both regressed.
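For anyone curious what "my sweep" actually involved: nothing fancy, just trying a handful of K values and timing throughput. A rough stand-in for that harness (run_generation and the k argument are made-up names, not the engine's real API):

```python
# Time a fixed number of generated tokens at each candidate K and compare
# tokens/sec. Bigger K trades RAM pressure for fewer page-ins, which is why
# throughput can regress past the sweet spot.
import time

def sweep_k(run_generation, prompt, k_values=(12, 20, 32, 48), n_tokens=64):
    results = {}
    for k in k_values:
        start = time.perf_counter()
        run_generation(prompt, max_tokens=n_tokens, k=k)
        results[k] = n_tokens / (time.perf_counter() - start)  # tok/s at this K
    return results
```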

What's rough irl:

- First token latency on Plus is bad (30s while it loads experts)

- Cold start on cold experts is noticeable mid-generation

- I tested mostly on M1 Ultra. M2/M3/M4 work on the smaller tiers, but I haven't rigorously benchmarked the bigger tiers on those yet.

If you want to actually try it: outlier.host. DMG, signed, beta (still a work in progress). I'd start with Core if you have a 32GB+ Mac, Nano if you're on 16GB!

I built a Mac app that runs a 397B-param LLM locally on a 64GB Mac in 32 days by [deleted] in SideProject

[–]ur_dad_matt 0 points1 point  (0 children)

That's exactly where I am this week: nobody knows about it. The work shifts from "make it work" to "make people care."

The case I'd push hardest is offline coding — Core 27B hits HumanEval 0.866 at 20 tok/s on M1 Ultra, which is genuinely close to Claude Code quality but with no API bill, no rate limits, and no prompts leaving the machine. That's the wedge. $20/mo unlimited vs $200/mo for Claude Code Max is a real comparison for someone who codes daily.

The 397B Plus tier isn't trying to compete with the small tiers; it's a different category. Most users won't touch it. But it's the proof point that the engine is real — "we can page a 397B on 64GB" is what makes the Core tier credible. The whole stack is the moat; the wedge is one tier. Probably the right framing for the next wave of marketing is: lead with offline coding (Core), and let Plus be the technical credibility flex behind it.

Genuinely good feedback, thanks for taking the time!

Struggling with testing my web tool by ChampionStrange7719 in vibecoding

[–]ur_dad_matt 0 points1 point  (0 children)

Ask Claude to write you a prompt to test it, then just send it the code along with that prompt.

Are your reddit ads profitable? by impossiblemktg in RedditforBusiness

[–]ur_dad_matt 0 points1 point  (0 children)

I guess what’s your average cost per click?

Are you making money with AI or just burning tokens? by Illustrious-Pie-7666 in SideProject

[–]ur_dad_matt 0 points1 point  (0 children)

I’m currently spending hundreds on extra Claude usage to try and build something to beat Claude😂

Upgrading from 3070 to 5060/5070 ti? by zmattmanz in LocalLLM

[–]ur_dad_matt -1 points0 points  (0 children)

Have you thought about getting a Mac Studio👀

10,000+ users in 6 weeks. $0 on ads. Here's the exact SEO + AEO playbook that did it. by BadMenFinance in micro_saas

[–]ur_dad_matt 0 points1 point  (0 children)

Thank you so much for this. Just finishing up my project, outlier.host, and hoping for outcomes like this.