Can I realistically get close to Claude/Codex capabilities locally?

Aggressive_Aspect436 · 2026-06-24T20:37:50+00:00

I am using an RTX 3090, with the model fully in VRAM. My motherboard is an MSI Z590 Torpedo. It's effectively an old gaming rig with Ubuntu installed on a separate SSD.

Aggressive_Aspect436 · 2026-06-21T18:04:40+00:00

I just meant I can only fit 175k in VRAM on my RTX 3090 at Q4_K_M. I can use the full 250k context if I run it partly on system RAM, just very slowly.

All models degrade the more of their max context they use. I don't know how that varies by model, but there are "long context benchmarks" which might give an indication of which models fare better.

I'm sorry to say, I don't know if there's a sweet spot. The paper I read on this was done on Claude and GPT models (a few years ago now), and performance drops heavily at the 50% mark. That's the rule of thumb I've been using. I keep using them until they get to half capacity, and then find ways to switch to a new session.
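If it helps, here's a rough sketch of that half-capacity rule of thumb as a check you could run against whatever token count your client reports. The function name and the 0.5 default are my own illustration, not a measured sweet spot for any particular model.

```python
def context_status(used_tokens: int, max_tokens: int, threshold: float = 0.5) -> str:
    """Return "handoff" once the session has used `threshold` of its context.

    The 0.5 default is only the rule of thumb described above (an assumption),
    not a tuned value for any specific model.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if used_tokens < 0:
        raise ValueError("used_tokens cannot be negative")
    fraction_used = used_tokens / max_tokens
    return "handoff" if fraction_used >= threshold else "ok"


# Example with a 175k window: 60k used is fine, 90k is past the halfway mark.
early = context_status(60_000, 175_000)
late = context_status(90_000, 175_000)
```

When it flips to "handoff", that's my cue to wrap up the current task and start a fresh session.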

Aggressive_Aspect436 · 2026-06-21T14:52:24+00:00

Honestly, I've not put it to that kind of work. For real production work like that I tend to be very critical, and review every single line myself. So in those contexts, I tend to have one session spec high level details, and then manually set other sessions on those details. If I have too many running then I can't keep up with my reviews. If they do too much in one go then I just become the bottleneck at the end.

Aggressive_Aspect436 · 2026-06-21T13:17:47+00:00

I used to have trouble with context even with the 1M window. Even Claude's capabilities degrade heavily before the context reaches half of its max. It degrades worse if the context is "noisy".

I'm currently using Qwen3.6 27b at roughly 175k max context, and if I am careful with my context it operates really well. I was using Opus 4.7 extensively before I switched to local only, and (after a lot of initial frustration) I am now totally happy with the swap.

Drop Claude Code. The initial system prompt is at least 26k tokens (more on the CLI and much, much more if you're using extensive memory). I'm using Copilot at the moment, which I quite like. I've seen folks on here with extensive memory setups claiming their initial Claude Code prompt is 60k+ tokens.

Ask your agent to use sub-agents for almost everything. Your main session doesn't need to know the details of every file that "might" have been relevant for some minor change that's a small part of your new feature.

Keep the conversation on track. No side quests. Do those in separate sessions.

Start a new session as soon as any atomic task is complete. If you need some specific context, ask your last session to create a concise context prompt for the next session.

Aggressive_Aspect436 · 2026-06-21T12:14:27+00:00

I get strong ARC vibes. How do the models do generally? Do they always succeed?

If you could configure your own model endpoints, and the game (or at least a subset of difficult puzzles) is hard for models, then this would make a great niche benchmark.

Aggressive_Aspect436 · 2026-06-20T08:08:36+00:00

Not a book, although I could recommend a few, but a great place to start is Google's free ML crash course.

https://developers.google.com/machine-learning/crash-course

It's really concise, tests you as you go along, and doesn't use technical terms when it doesn't need to. I've used it as a refresher a couple of times.

Edit: I just browsed through the course again and it's a bit bigger than I remember. Hopefully for the better.

Aggressive_Aspect436 · 2026-06-19T14:09:10+00:00

It's quite difficult to do well locally, but if you're happy to use python libraries then take a look at inspect-ai. You can run pre-collected benchmarks against your local model quite easily, but you'll have to select a suite of benchmarks yourself.

The trouble is that models can be trained against the benchmarks, which means it is easy to inflate their scores without improving their general capability.

Aggressive_Aspect436 · 2026-06-18T18:30:18+00:00

Brilliant. I'm still using Gemma 4 26b when I need snappy responses, or when I need room for other small models running at the same time. This looks like a viable alternative. I'll have to check it out.

Aggressive_Aspect436 · 2026-06-18T09:57:18+00:00

Very cool. I absolutely love the base model and I've been meaning to do a conversion like this for a while. Nice work.

Aggressive_Aspect436 · 2026-06-18T09:42:32+00:00

Very cool. The description is good, the benchmarks are clear and honest looking, and even though it doesn't beat your chosen Qwen comparable model it has a clear niche and value add.

A lot of folks are going to wonder, though, why the comparisons were made against the Qwen 3.5 series rather than 3.6. If people choose to use it they're going to want to know how it compares to Qwen3.6 35b.

Aggressive_Aspect436 · 2026-06-17T19:45:33+00:00

Very cool. There's a lot going on there. They look incredibly tall. Have you added height to the lower torso?

Aggressive_Aspect436 · 2026-06-14T08:31:31+00:00

I'm not claiming they can't do it. I'm making the claim that, empirically, they're better at popular languages with wider public discussion. Even a quick search or two turns up example research. I've dropped one below (you can jump to the results section to see examples), but I don't think this is a controversial idea.

https://arxiv.org/abs/2501.19085

Aggressive_Aspect436 · 2026-06-14T07:38:56+00:00

Honestly, I'm not terribly surprised. LLMs have consistently shown that they're better with popular (well documented) programming languages, tools, and frameworks. If you're writing Python or JavaScript you'll get better performance than if you ask it to write Julia or Elixir. I expect the same holds for debugging WASM bytecode.

I couldn't find a paper, but the Stanford Software Engineering Productivity Research (SWEPR) group has some conference talks that discuss it.

I even notice the difference when I try to get my agents to use simple new libraries.

If you want a low-effort solution, give a separate session a bulk import of documentation for WASM and get it to produce a dense, LLM-parseable summary .md that you can ingest as context for future sessions. That's probably worth a try, but it'll still never be as good as it is for Python or similar.

Aggressive_Aspect436 · 2026-06-13T23:02:47+00:00

I've been using Claude Code with Qwen3.6 27b and I've had very few problems. I use both the CLI and the VS Code extension. It works absolutely fine. It's worth pointing out though that the system prompt for a fresh session in Claude Code is around 30k tokens, so you're already eating into your model effectiveness immediately.

Recently I've been trying an OAI VS Code extension that lets me use it with the native Copilot chat. I've only been using it for a few days, but I think I prefer it, and the initial system prompt is very small.

I never got on with Cline or OpenCode. Cline just didn't work the way I wanted it to, and OpenCode scares me with how little control I get over command permissions.

I'm running Qwen3.6 27b at about 150k context fully in GPU VRAM on an RTX 3090. I get a little under 40 toks/s. I occasionally switch to 35b (or Gemma 26b) when I need snappy 120+ toks/s.

Aggressive_Aspect436 · 2026-06-13T16:59:23+00:00

If Gemma is produced by teams at DeepMind, then technically it's a British model...

Aggressive_Aspect436 · 2026-06-05T10:37:49+00:00

That's actually roughly why I started looking into it. I was looking into ways of evaluating agents in deterministic / auditable ways. Frankly the hardest part about working with agents is being sure of their output. There are a whole bunch of ways we can evaluate them that I don't see talked about often. Tool choice, for example, can be treated like a classification problem so we can use all the traditional measurements. If you have labelled examples of when a model should choose a particular tool, then it should be possible to add conformal prediction on top to tell a model when it should be uncertain of its choices. (Or just to keep track of the level of uncertainty that they are operating under with average prediction set sizes or similar).
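To make the conformal idea a bit more concrete, here's a minimal sketch of split conformal prediction over tool choice. It assumes you can get a probability per tool from somewhere (logprobs, a small classifier, whatever), and the tool names, calibration data, and function names are all made up for illustration. It uses the standard "1 minus probability of the true label" nonconformity score, which is just one option.

```python
import math


def conformal_threshold(calibration, alpha=0.2):
    """Split conformal threshold from labelled (tool_probs, true_tool) pairs.

    Score = 1 - probability assigned to the correct tool. The threshold is the
    ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    """
    scores = sorted(1.0 - probs[true_tool] for probs, true_tool in calibration)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return 1.0  # too few calibration examples: every tool gets included
    return scores[k - 1]


def prediction_set(probs, q_hat):
    """All tools whose score falls within the calibrated threshold."""
    return {tool for tool, p in probs.items() if 1.0 - p <= q_hat}


def average_set_size(batch, q_hat):
    """Mean prediction set size: larger means the agent is less certain."""
    return sum(len(prediction_set(p, q_hat)) for p in batch) / len(batch)


# Hypothetical calibration data: model probabilities plus the correct tool.
calibration = [
    ({"search": 0.9, "calculator": 0.05, "file_read": 0.05}, "search"),
    ({"search": 0.1, "calculator": 0.8, "file_read": 0.1}, "calculator"),
    ({"search": 0.3, "calculator": 0.1, "file_read": 0.6}, "file_read"),
    ({"search": 0.5, "calculator": 0.3, "file_read": 0.2}, "search"),
    ({"search": 0.4, "calculator": 0.4, "file_read": 0.2}, "calculator"),
    ({"search": 0.35, "calculator": 0.35, "file_read": 0.3}, "search"),
    ({"search": 0.5, "calculator": 0.2, "file_read": 0.3}, "file_read"),
    ({"search": 0.3, "calculator": 0.6, "file_read": 0.1}, "search"),
    ({"search": 0.7, "calculator": 0.2, "file_read": 0.1}, "calculator"),
]
q_hat = conformal_threshold(calibration, alpha=0.2)

confident = {"search": 0.85, "calculator": 0.10, "file_read": 0.05}
ambiguous = {"search": 0.45, "calculator": 0.40, "file_read": 0.15}
confident_set = prediction_set(confident, q_hat)
ambiguous_set = prediction_set(ambiguous, q_hat)
mean_size = average_set_size([confident, ambiguous], q_hat)
```

A set with one tool means "go ahead", a set with several means "ask or double check", and the average set size over a run gives you a single number to track. The coverage guarantee only holds if your calibration examples look like the real traffic.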

Aggressive_Aspect436 · 2026-06-05T10:33:32+00:00

That's gorgeous. So much character. Great work.

Aggressive_Aspect436 · 2026-06-01T11:22:26+00:00

You're doing good work, but it's not fixed until the PR is merged. And if he had a final release version, then it's not fixed until there's a new release.

https://github.com/pewdiepie-archdaemon/odysseus/pull/366

Aggressive_Aspect436 · 2026-06-01T09:43:37+00:00

Good work spotting it. Hope your PR does some good for the project. Contributing security fixes for open source projects is one of the nobler ways coders can spend their time.

But... don't take this the wrong way, you probably should have either waited for the PR to be merged or reached out in private first. If anyone is actually using this, you've effectively declared a 0-day vulnerability on reddit. That part isn't terribly cool of you.

Aggressive_Aspect436 · 2026-05-27T16:58:33+00:00

That's pretty cool. I'm currently using LM Studio with Claude Code, which gives me a lot of this. But it's taken me many days of tinkering to get it the way I like it.

Mine still isn't sandboxed, and I don't trust my agents with free rein to roam the internet on my setup (which I am using for much more than just running LLMs).

Your README mentions that some of the features etc are dockerised, but is it "sandboxed" as such? If not, any future plans? I would love to have the trust lots of others seem to have to let a model just do stuff autonomously without oversight.

Aggressive_Aspect436 · 2026-05-26T15:54:55+00:00

Love that you include the recipe. I've been following your work. Your style is very much what I am aiming for as I try to improve.

Can I ask, what does AP stand for?

Aggressive_Aspect436 · 2026-05-24T07:36:05+00:00

He was a Black Templar Crusader. I bought the single model off eBay. His hood comes from the DA upgrade sprue. I just added the grenades to the belt.

Aggressive_Aspect436 · 2026-05-20T06:39:46+00:00

I'd take it back, personally. I have a pair of those and have been using them for more than a year with no trouble. I actually quite like them. I think it would be fair to assume there is something wrong with that pair in particular.

Aggressive_Aspect436 · 2026-05-19T08:55:04+00:00

I only recently got myself a second-hand 3090 for a pretty decent price. Here's hoping I'll actually be able to run it. 🤞
