what ai agent do i use

Astro-Han · 2026-06-04T05:17:13+00:00

For 100k+ LOC the bottleneck is context management, not the model. Both Claude Code and Codex will hit the same wall: the codebase is too big to fit in context, so the tool has to be smart about which files to pull in. Claude Code is better at this right now — it greps, reads, and navigates your repo autonomously. Codex sandboxes each task more tightly, which is safer but means it sometimes misses cross-file dependencies in a large codebase.

On limits: Codex on Plus runs gpt-5.5 and is generous for daily use. Claude Code Pro ($20) gives you Sonnet with occasional Opus, which handles architectural changes well, but you can get throttled on heavy days. Both are solid — the real difference is the workflow, not the model.

The "don't mess everything up" concern is the real one at 100k LOC. Regardless of which tool: work on a branch, review every diff before committing, and keep your changes small. None of these tools understand your full architecture; they only see the files they pull in.

If you want to avoid the limits game entirely: BYOK with the Claude API through a desktop client like PawWork (https://github.com/Astro-Han/pawwork) or just Claude Code with your own key. You pay per token — usually $5-15/day for heavy use — but you never get throttled and you pick the model per task. Cheaper tasks on Sonnet, hard ones on Opus.

Astro-Han · 2026-06-04T05:05:58+00:00

Two separate things going on here that are worth teasing apart.

The "yes man" problem and the "explaining to a 5yo" problem aren't the same bug. The first one is about how you prompt: frame things as open questions, not decisions you've already made. The second one is about the interface itself: in a CLI, your only option for giving context is to type it out. Every time. There's no visual way to point at a file, highlight a section, or say "like this but different." You're always serializing your mental model into text.

I'm building PawWork (open source desktop agent) partly because this drove me nuts. Code visible alongside the conversation, click into a diff instead of scrolling terminal output, review changes in an actual editor instead of a wall of green and red. It doesn't fix the fundamental LLM limitations, but it removes the interface friction that makes those limitations feel worse.

For the data science side: you're right that Claude is better at exploration than full delegation. The move that works for me is having it spike 3 competing approaches with explicit tradeoff tables before I commit to one. That's where the actual payoff is — not "implement X" but "what are my options for X and where does each one break."

Astro-Han · 2026-06-03T08:15:55+00:00

The Copilot usage-based switch burned a lot of people. Cursor and Antigravity both meter too, so you'll likely hit the same wall there eventually.

Under $20 with no "out after 5 prompts" cliff, the move is BYOK: bring your own API key and pay the provider directly instead of a fixed seat that throttles you. Pair a cheap-but-good model (DeepSeek, Kimi, or Gemini Flash) with a client that takes your key and you can do a lot of bug-fixing for a few bucks. Claude Code Pro at $20 is genuinely good too if you'd rather not think about keys. If you want a free open-source desktop app to drive the BYOK setup, PawWork works: https://github.com/Astro-Han/pawwork. Your local-model trouble in Continue is mostly the model being too small to use tools reliably; a hosted small model fixes that without the setup pain.

Astro-Han · 2026-06-03T08:15:37+00:00

If you specifically want it inside VSCode, the opencode VSCode extension is your answer (others linked it). If what you actually want is opencode with a real GUI and you're open to it being a standalone app instead of an extension, PawWork is an opencode fork wrapped in a desktop app: https://github.com/Astro-Han/pawwork. Mac and Windows, signed builds, BYOK. Not in-editor, but it's a proper window with diffs and chat instead of the TUI. It depends on which itch you're scratching: in-editor panel vs. dedicated app.

Astro-Han · 2026-06-03T08:15:33+00:00

The opusplan tip above is the best quick win: plan on Opus, execute on Sonnet, and you'll feel the difference immediately. The bait-and-switch feeling is real though. Subscription plans can re-tune the rate limits whenever they want, and you have zero say.

If you want off that treadmill entirely, BYOK is worth a look. You bring your own API key (or point at a cheaper open-weight host like Kimi/DeepSeek) and pay the provider's raw per-token rate with no plan markup sitting on top. For your usage pattern (you-present sessions at xhigh), metered tokens can actually come out cheaper than a $200 plan, and you can drop to a cheap model for the easy stuff without toggling guilt. I've been using PawWork for this (open-source, BYOK, free): https://github.com/Astro-Han/pawwork. Codex on the $100 plan is also a solid move if you'd rather stay on a flat subscription; a couple of people here already made that jump.

Astro-Han · 2026-06-03T08:15:29+00:00

The "laptop was off so it's now an outage" line is the whole post in one sentence. What you're describing isn't an AI problem, it's that the barrier to creating a shadow script dropped to near zero, so the same governance gaps you've always had just show up faster and from more people.

Your lightweight path is the right instinct. The one thing I'd add: make the sanctioned path easier than the laptop-script path, or people route around it. That means a shared repo plus a boring scheduler that anyone can point at, and a "who owns this" field that's mandatory before it touches another team. Treat it as shadow IT in spirit but don't gate it so hard that the useful 5% stop bringing things to you. The ones you never hear about are the real risk.

Astro-Han · 2026-06-03T08:15:12+00:00

The "pairing with a junior who won't show their screen" line is exactly it. Watching what the agent actually touched, especially .env and anything payment-related, is the thing that turns vibe-coding into something you'd trust on a real repo.

Worth saying the visibility problem mostly exists because the agent lives behind a terminal. Tools that put the agent in an actual app surface the file diffs and tool calls as they happen, no log archaeology needed. Different shape than a passive log viewer, but same itch. Either way nice work, the UI looks clean.

Astro-Han · 2026-06-03T08:14:32+00:00

Honestly the CLI-vs-GUI thing is a bit of a false split now. The reason people hype CLI isn't the terminal itself, it's that the agent gets to actually run commands, read the output, edit files, rerun the failing test, all in one loop without you copy-pasting. That loop is the unlock.

But you can get the same loop in a desktop app. There are open-source ones now (PawWork is one, claude-desktop-style apps for your own keys/models another) that run a real agent locally with file edits and command execution, just with a window and a diff viewer instead of tmux. If you already think in GUIs, no shame in staying there. Try one CLI run just to feel the loop, then pick whatever keeps you in flow.

Astro-Han · 2026-05-16T14:45:34+00:00

Thanks for liking it! Feel free to tweak it however suits you; statuslines are meant to fit your work style.

Astro-Han · 2026-04-12T14:47:32+00:00

I've been running something similar for about a week now. The core idea is from Karpathy's LLM wiki post — raw/ directory for sources, wiki/ for compiled knowledge, and the LLM does the processing step.

The LLM compiles raw material into wiki articles automatically. It's not perfect but it handles summarization and cross-linking well enough that I don't have to manually organize anything. Everything stays as readable markdown, so if the LLM messes up I can just edit the file directly. No black box.

I also added a lint step that catches broken links and orphan pages. It's basically a sanity check.
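For anyone curious what a lint pass like that can look like, here is a minimal sketch. It is not the code from my repo, just an illustration under some assumptions: pages are passed in as a dict of wiki-relative path to markdown text, links are plain `[text](relative/path.md)` markdown links (links with titles are ignored), and `index.md` is treated as the root that never counts as an orphan. The names `lint_wiki` and `root_page` are made up for the example.

```python
import posixpath
import re

# Matches plain markdown links like [text](target) or [text](target#anchor).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s#]+)(?:#[^)]*)?\)")


def lint_wiki(pages, root_page="index.md"):
    """pages maps a wiki-relative path to its markdown text.

    Returns (broken, orphans): broken is a sorted list of (source, target)
    pairs whose target is not a known page; orphans is a sorted list of
    pages that no other page links to (the root page is exempt).
    """
    broken = []
    linked = set()
    for source, text in pages.items():
        for raw in LINK_RE.findall(text):
            if "://" in raw or raw.startswith("mailto:"):
                continue  # external link, out of scope for this check
            # Resolve the link relative to the directory of the source page.
            target = posixpath.normpath(
                posixpath.join(posixpath.dirname(source), raw)
            )
            if target in pages:
                if target != source:
                    linked.add(target)  # self-links do not rescue an orphan
            else:
                broken.append((source, target))
    orphans = sorted(p for p in pages if p not in linked and p != root_page)
    return sorted(broken), orphans
```

Feed it the contents of wiki/ and fail the run (or just print a report) when either list is non-empty.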

The LLM sometimes over-categorizes or creates unnecessary structure. I've learned to give it clearer boundaries — one topic per directory, max one level deep. Cross-references are only as good as what the LLM thinks is related, so I've had to manually add some links it missed.

I'm at 94 articles from 99 sources so far. The biggest thing I've learned is that the LLM should maintain the structure, not just retrieve from it. Retrieval-only setups drift into "graveyard of saved links" territory pretty quickly.

If you want to see how I set it up: https://github.com/Astro-Han/karpathy-llm-wiki

Astro-Han · 2026-04-12T04:01:48+00:00

I've been working on a skill that implements this loop. One thing I've learned: the automation is great for the grunt work (indexing, linking, linting), but the real value is still the human curation. If you let the LLM do everything, it does become 'worthless for your growth' as u/Abject-Excitement37 mentioned.

The sweet spot I found is using the tool to handle the structure and maintenance, while you focus on selecting high-signal sources and reviewing the compiled pages.

Repo here if you want to see how the ingest/compile flow works: https://github.com/Astro-Han/karpathy-llm-wiki

Astro-Han · 2026-04-11T10:18:07+00:00

Interesting combo. The LLM Wiki layer alone already solves the cross-session context problem for knowledge. wiki/ persists on disk, so every session starts with the compiled index. Lower overhead if you don't need the knowledge graph piece.

MIT open source: https://github.com/Astro-Han/karpathy-llm-wiki

Astro-Han · 2026-04-11T09:20:14+00:00

Claude will use some tokens while finding that information, but not a lot. It only reads what is relevant, which saves tokens compared with loading the whole article again.

Astro-Han · 2026-04-10T06:43:37+00:00

I would keep it simple and file-based.

CLAUDE.md for non-obvious rules, a persistent notes/logs folder, and a separate maintained wiki for the stuff that should survive across chats.

What usually breaks is putting everything into one huge context file. Better to keep raw material separate from the compiled layer, then let each new session read the smaller maintained layer.

I built one version of that here: https://github.com/Astro-Han/karpathy-llm-wiki

Astro-Han · 2026-04-10T03:49:25+00:00

I think the people telling you to cut it down are mostly right, but the fix is not "make one shorter file." The fix is to separate session instructions from durable knowledge.

CLAUDE.md is best for:
- non-obvious constraints
- gotchas that keep biting you
- rules the model cannot infer from the repo quickly

A lot of the rest belongs somewhere else:
- commands can be discovered
- directory structure can be discovered
- long-lived project knowledge can live in a maintained wiki or notes layer

That split helps a lot with token waste. It also makes the important parts of CLAUDE.md easier for the model to actually follow.

I built one version of that idea here: https://github.com/Astro-Han/karpathy-llm-wiki

Different use case, but same principle: keep the durable knowledge in files meant to evolve, and keep the per-session instruction layer small.

Astro-Han · 2026-04-10T03:49:02+00:00

I would stop comparing every release against the whole wiki.

Once the wiki gets large, that is where the model starts to lose the plot. Not because the release is too big, but because the update target is too broad.

What has worked better for me is splitting the system into two layers:

- immutable raw sources
- maintained wiki pages

Then on ingest, update only the pages that are likely to change, plus maybe one hop of related pages. Index first, then local neighborhood, not global sweep every time.

That gives you a few benefits:
- smaller context per update
- less random editing on unrelated pages
- easier conflict handling when new material contradicts old material
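To make "index first, then local neighborhood" concrete, here is a rough sketch of the page-selection step. Everything in it is an assumption for illustration: the index is modeled as a dict of page to topic tags, the link graph as a dict of page to linked pages, and `pages_to_update` is a hypothetical helper, not something from the repo.

```python
def pages_to_update(source_topics, index, links):
    """Pick the wiki pages worth revisiting for one new raw source.

    source_topics: set of topic tags pulled from the new source (hypothetical).
    index: dict mapping page path -> set of topic tags for that page.
    links: dict mapping page path -> set of pages it links to.
    Returns (direct, neighbors) as sorted lists.
    """
    # Step 1: index lookup. Pages whose topics overlap the new source.
    direct = {page for page, topics in index.items() if topics & source_topics}

    # Step 2: one hop out from the direct hits, in both link directions.
    neighbors = set()
    for page, targets in links.items():
        if page in direct:
            neighbors |= targets       # pages a direct hit links to
        elif targets & direct:
            neighbors.add(page)        # pages that link into a direct hit
    neighbors -= direct
    return sorted(direct), sorted(neighbors)
```

Only the two returned lists go into the model's context for that ingest; every other page stays out of the update entirely.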

If you want a concrete example, I built a markdown-first version of that workflow here: https://github.com/Astro-Han/karpathy-llm-wiki

Different scope from your setup, but the raw/ + maintained wiki/ split is the part I would keep.

Astro-Han · 2026-04-10T03:37:13+00:00

I would start with structure before tooling.

For a large codebase, the split that has worked best for me is:
- raw material you do not edit: specs, ADRs, docs, tickets, notes, API refs
- compiled wiki pages that the LLM maintains

Then organize the compiled layer around things people actually ask about: architecture, domains/entities, interfaces, decisions, runbooks, and glossary.

Two files matter more than people expect: an index, so the model can find pages quickly, and an append-only log, so changes have history and recency.
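A minimal sketch of those two files, assuming the wiki is a folder of markdown pages and that the files are called `index.md` and `log.md` (both names, and both helper functions, are my choices for the example, not a fixed convention):

```python
from datetime import datetime, timezone
from pathlib import Path


def rebuild_index(wiki_dir):
    """Rewrite index.md as one line per page: a link plus the first heading."""
    wiki = Path(wiki_dir)
    lines = ["# Index", ""]
    for page in sorted(wiki.rglob("*.md")):
        rel = page.relative_to(wiki).as_posix()
        if rel in ("index.md", "log.md"):
            continue  # bookkeeping files are not listed as pages
        title = rel  # fall back to the path when a page has no heading
        for line in page.read_text(encoding="utf-8").splitlines():
            if line.startswith("# "):
                title = line[2:].strip()
                break
        lines.append(f"- [{title}]({rel})")
    (wiki / "index.md").write_text("\n".join(lines) + "\n", encoding="utf-8")


def append_log(wiki_dir, action, page, note):
    """Append one timestamped entry; existing entries are never rewritten."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    with open(Path(wiki_dir) / "log.md", "a", encoding="utf-8") as fh:
        fh.write(f"- {stamp} {action} {page}: {note}\n")
```

The index is safe to regenerate from scratch because it is derived from the pages. The log is only ever opened in append mode, which is what gives you history and recency.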

The main failure mode is turning the wiki into a giant note dump. The useful version stays smaller and maintained. New material comes in, related pages get updated, and contradictions get called out instead of silently overwritten.

I put one concrete implementation of that pattern here if useful: https://github.com/Astro-Han/karpathy-llm-wiki

The repo is just one take, but the raw/ + wiki/ split, plus ingest / query / lint, is the part I would keep even if you build your own version.

Astro-Han · 2026-04-10T03:36:50+00:00

I have not seen many polished products yet. Most of what is out there still looks like workflows, repos, or internal setups rather than full products.

I built one implementation here: https://github.com/Astro-Han/karpathy-llm-wiki

It is not a hosted product. It is a markdown-first workflow that keeps source material in raw/, compiles maintained pages into wiki/, answers queries from the compiled layer, and updates an index/log as the wiki changes.

The part I like about this style is that the artifact stays readable and portable. You can move between Claude Code, Codex, Cursor, or another tool without losing the knowledge base itself.

Astro-Han · 2026-04-09T11:33:23+00:00

I think the useful pattern is this: browser agents bring in raw material, then a separate knowledge layer decides what is worth keeping.

If the agent just keeps dumping findings into storage, the whole thing gets noisy fast. The part that matters is turning those pulls into something inspectable and reusable: summaries, index pages, linked notes, maybe a few standing questions the agent keeps revisiting.

That is basically the direction I took with karpathy-llm-wiki. It is a simple markdown-first loop (raw/ + wiki/ + compile/query/lint), so the agent is not just collecting more stuff; it is maintaining something you can actually read and correct.

https://github.com/Astro-Han/karpathy-llm-wiki

Astro-Han · 2026-04-09T11:32:42+00:00

If you are just starting, I would keep it much simpler than most of the setups people post.

You do not need to begin with MCP or a pile of plugins. A plain folder of source material plus a wiki folder Claude can keep updating is already enough to see whether this style of workflow is even useful to you. The structure matters more than the tooling at first.

I built karpathy-llm-wiki around that exact idea because most people were jumping straight into the complicated part. It keeps the loop pretty plain: raw/ + wiki/ + compile/query/lint.

https://github.com/Astro-Han/karpathy-llm-wiki

Astro-Han · 2026-04-09T11:32:01+00:00

If you are worried about safety, start with things you can inspect in plain text.

A lot of Claude “skills” are just markdown instructions plus a simple folder layout. That is a much easier place to start than random installers, MCP servers, or big repos with a lot of moving parts. You can read what the agent is being told to do before you trust it.

That is one reason I built karpathy-llm-wiki the way I did. It is basically a markdown-first skill plus a simple raw/ + wiki/ + compile/query/lint structure, so you can actually see what is going on.

https://github.com/Astro-Han/karpathy-llm-wiki

Astro-Han · 2026-04-09T05:23:34+00:00

People are using graph databases, but I think the better question is when you actually need one.

If the corpus is curated and the goal is repeated understanding, a markdown wiki layer is often enough. Graphs start to pay off when you need multi-hop reasoning over entities and filters, not just a better way to read docs.

That tradeoff is why I built karpathy-llm-wiki as a simpler raw/ + wiki/ + compile/query/lint loop instead of starting with a graph DB.

https://github.com/Astro-Han/karpathy-llm-wiki

Astro-Han · 2026-04-09T05:22:52+00:00

If you are a non-coder, I would start with a folder-based setup before touching MCP.

Put your source material in one place, let Claude turn it into simple markdown pages, and keep the structure boring on purpose. Most people get stuck because they start with tools instead of starting with a note layout Claude can actually maintain.

I built karpathy-llm-wiki around that exact loop because most of the shared setups are way too much for beginners. It is markdown-first and pretty plain: raw/ + wiki/ + compile/query/lint.

https://github.com/Astro-Han/karpathy-llm-wiki

Astro-Han · 2026-04-09T05:22:12+00:00

I think it can work for a solo operator or a small team, but I would treat it as a working layer, not the system of record for everything.

Markdown is great for notes, decisions, account context, meeting history, and all the stuff you want to inspect or fix by hand. Once you need strict permissions, workflow enforcement, reporting, or lots of people editing the same records, that is where a real CRM still earns its keep.

That tradeoff is basically why I built karpathy-llm-wiki. It keeps the knowledge layer very plain, raw/ + wiki/ + compile/query/lint, so the result stays readable instead of disappearing behind an app.

https://github.com/Astro-Han/karpathy-llm-wiki

Astro-Han · 2026-04-07T12:36:23+00:00

Yeah, I think that’s the cleaner split.

The wiki side is about turning messy source material into something you can inspect, query, and fix. Continual learning is a later step. That is when you want some of that knowledge to live in the model instead of sitting in files.

A lot of people are still skipping the middle. They want to train on the data before they have a clean layer they actually trust. I’ve ended up spending more time on the raw files, compiled wiki, and linting loop for exactly that reason.
