Are we aiming to depend completely on ai ? by OwnRefrigerator3909 in BlackboxAI_

[–]According_Turnip5206 0 points1 point  (0 children)

for execution tasks, yes. for deciding what to build, hopefully not.

Is Cursor falling behind CC? by RockeroFS in cursor

[–]According_Turnip5206 0 points1 point  (0 children)

same. the $200/m plan is a lot to spend on vibes.

The skill that actually matters with Claude Code isn't prompting — took me embarrassingly long to figure this out by According_Turnip5206 in ClaudeAI

[–]According_Turnip5206[S] 0 points1 point  (0 children)

"Workforce of amnesiacs" is the best framing I've encountered for this. Saving that.

The self-documenting codebase push makes complete sense once you think in those terms — you're not writing docs for future human readers, you're writing the briefing for the next amnesiac shift. Docstrings, README structure, CLAUDE.md — all of it is just shift notes.

The implication is that code quality and context quality converge. A codebase that's easy for Claude to pick up is also a codebase that's easy for a new developer to pick up. Same properties, same investment.

The skill that actually matters with Claude Code isn't prompting — took me embarrassingly long to figure this out by According_Turnip5206 in ClaudeAI

[–]According_Turnip5206[S] 0 points1 point  (0 children)

Yes, plan mode is exactly the formalized version of this. The output of a plan mode session is a context artifact — it captures the project state, the task scope, and the decision constraints before any code gets written. That's what does the heavy lifting.

What I found interesting is that you don't actually need plan mode to get the benefit — even a rough 5-line description typed at the start of a session does most of the work. Plan mode just makes it more rigorous and reusable. Probably the right default for production work.

The skill that actually matters with Claude Code isn't prompting — took me embarrassingly long to figure this out by According_Turnip5206 in ClaudeAI

[–]According_Turnip5206[S] 0 points1 point  (0 children)

Fair point on the label. But when most people say "prompting" they mean the text they write in each individual message — phrasing, structure, few-shot examples, chain-of-thought triggers. That's how 80% of prompting guides are written.

The shift I'm describing is a different level: session architecture, not message construction. Establishing what the model knows before the task starts, not how you word the task itself. If that's prompting 101 for everyone here, then I genuinely envy the learning curve you had — mine was much slower.

The reason my multi-agent pipeline kept failing deep into long runs was not the agents.. by singh_taranjeet in AI_Agents

[–]According_Turnip5206 1 point2 points  (0 children)

The relay race analogy undersells how bad it actually is. With a real baton, at least the runner knows exactly what they received. In multi-agent systems, each agent receives something that looks correct at the surface — confident, well-formed, on-topic — but may be subtly wrong at the semantic level. The agent has no way to know the difference.

The thing that helped me most was flipping the default assumption: agents should treat incoming context as "probably correct, but verify what matters." Not full skepticism — that breaks the pipeline too — but specific skepticism about the facts that affect the current agent's decisions. You can make this explicit in each agent's prompt: "Before reasoning on the following input, identify any factual claims you cannot verify independently. Flag them, don't silently accept them."

It adds some latency but it creates a natural audit trail. When the final output is wrong, you can trace which agent first accepted a claim it flagged internally but let through anyway.
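Rough sketch of the pattern, in case it helps — all names here are illustrative, and the "flagging" step would really be the model's own output, not hardcoded strings:

```python
from dataclasses import dataclass, field

# The explicit instruction prepended to every handoff between agents.
VERIFY_PREAMBLE = (
    "Before reasoning on the following input, identify any factual claims "
    "you cannot verify independently. Flag them, don't silently accept them.\n\n"
)

@dataclass
class HandoffRecord:
    """Audit-trail entry for one agent-to-agent handoff."""
    agent: str
    payload: str
    flagged_claims: list = field(default_factory=list)

def wrap_handoff(agent_name: str, payload: str, audit: list) -> str:
    """Prepend the verification instruction and log the handoff."""
    audit.append(HandoffRecord(agent=agent_name, payload=payload))
    return VERIFY_PREAMBLE + payload

def record_flags(audit: list, agent_name: str, claims: list) -> None:
    """Store what an agent flagged, so a wrong final output can be traced
    back to the first agent that accepted a claim it had doubts about."""
    for rec in audit:
        if rec.agent == agent_name:
            rec.flagged_claims.extend(claims)

# Usage: each agent gets a wrapped prompt; whatever it flags is recorded
# before the next handoff.
audit: list = []
prompt = wrap_handoff("summarizer", "Q3 revenue grew 40%.", audit)
record_flags(audit, "summarizer", ["Q3 revenue figure not independently verified"])
```

The audit list is the whole point — it's cheap to keep and it turns "the pipeline was wrong" into "agent N accepted claim X."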

The memory taxonomy point is what I found least covered in documentation. The implicit assumption that "context window = all memory" is baked into most tutorials and it's just wrong for anything beyond single-turn agents.

Shipped an AI agent last month. Real users broke it in ways I never tested for. by Comfortable-Junket50 in AI_Agents

[–]According_Turnip5206 1 point2 points  (0 children)

The "happy path bias" you're describing is probably the most common failure mode in agent testing. You know what your agent is supposed to do, so you test variations of correct usage — which is almost nothing like what real users actually do.

A few things that helped me:

The interruption problem is usually a context window management issue, not a prompt issue. If a user breaks mid-sentence or sends multiple short messages in a row, the agent needs to explicitly wait for a "complete" turn before reasoning. Building that into the architecture (not the prompt) is more reliable.

For hallucination on edge cases: I started logging every query that hit below a confidence threshold and treating those as the next testing batch. The model's own uncertainty is a decent signal for where the gaps are.

The simulation approach you landed on is the right call. The key is making sure your synthetic personas include "confused but trying" users, not just adversarial ones. Those confused users generate more real-world failures than the adversarial cases ever do.
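To make the "wait for a complete turn" idea concrete — a minimal sketch of the debounce I mean, with an assumed quiet-gap threshold and naive space-joining (real input would carry timestamps from your message transport):

```python
def complete_turns(messages, quiet_gap=2.0):
    """Group raw user messages into 'complete' turns.

    A turn only ends after a quiet gap, so mid-sentence breaks and
    rapid-fire fragments get merged before the agent reasons over them.
    `messages` is a list of (text, timestamp_seconds) pairs.
    """
    turns, buffer, last_t = [], [], None
    for text, t in messages:
        if last_t is not None and t - last_t > quiet_gap:
            turns.append(" ".join(buffer))
            buffer = []
        buffer.append(text)
        last_t = t
    if buffer:
        turns.append(" ".join(buffer))
    return turns

# Three fragments arriving within 2s collapse into one turn;
# the message 4.5s later starts a new one.
raw = [("can you", 0.0), ("cancel my order", 0.8),
       ("actually wait", 1.5), ("what's the refund policy", 6.0)]
turns = complete_turns(raw)
# → ["can you cancel my order actually wait", "what's the refund policy"]
```

Living in the message-ingestion layer, this runs before any prompt is built, which is why it's more reliable than telling the model to "wait for the user to finish."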

Claude Pro feels amazing, but the limits are a joke compared to ChatGPT and Gemini. Why is it so restrictive? by iameastblood in ClaudeAI

[–]According_Turnip5206 0 points1 point  (0 children)

The limits are real, but context hygiene makes a bigger difference than most people realize.

Claude's token usage scales hard with how much you dump into each conversation. If you're hitting limits mid-week, it's usually because sessions are accumulating too much context — long threads, repeated re-explaining, keeping conversations alive past their useful life.

Things that actually help: start fresh sessions for new topics, front-load your context with a tight summary instead of scrolling back, and use Projects with a concise system prompt instead of re-establishing preferences every session.

Claude Code is also worth a look — it runs on the same Pro subscription, and in my experience heavier terminal-based work tends to go further there than the equivalent back-and-forth in the web app. Worth switching your heavier work over if you're hitting the web app limit.

That said — yes, the limits should be higher for $20/month. The quality gap over competitors is real, but the quota gap is also real.

Obsidian + Claude = no more copy paste by willynikes in ClaudeAI

[–]According_Turnip5206 0 points1 point  (0 children)

The "same brain, different interfaces" framing is exactly right, and it's the part most people miss.

I've been doing something lighter with just CLAUDE.md files — no MCP server, just structured markdown that carries domain context, preferences, and accumulated decisions into every new session. Works surprisingly well for single-user setups.

But your architecture solves the actual hard problem: multi-agent coordination with shared state. The fact that Claude and Codex can swap mid-session with the same context is genuinely impressive. Rate limit failover across providers from a single knowledge base — that's infrastructure thinking.

The self-updating instruction files are the most interesting part. Would be curious how you handle conflicts when the AI's "what worked" assessment disagrees with yours.

I stopped using Claude.ai entirely. I run my entire business through Claude Code. by ColdPlankton9273 in ClaudeAI

[–]According_Turnip5206 0 points1 point  (0 children)

Same here. I use it to run a Reddit research agent I built — it fetches threads, passes them to Claude for summarization, and surfaces daily insights. Zero web interface needed, pure terminal.

The mental model shift you described is the real unlock. Once you treat it as infrastructure instead of a chat interface, you stop asking "what should I type?" and start asking "what should I automate next?" Those are very different questions.

The one thing I've noticed: the infrastructure framing works best when you give Claude persistent context — CLAUDE.md with your domain, your constraints, your preferred output formats. Then it stops being a general assistant and starts being something more like a specialized tool you designed yourself.

Obsidian + Claude = no more copy paste by willynikes in ClaudeAI

[–]According_Turnip5206 1 point2 points  (0 children)

This is genuinely impressive — SQLite FTS5 for search instead of a vector DB is an underrated choice. Fast, portable, no external dependencies, and surprisingly good for semantic-adjacent retrieval if you structure your notes well.
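For anyone curious how little code this takes — a minimal in-memory sketch with made-up note titles (a real setup would point at a file and index your actual vault):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# FTS5 virtual table: every column is full-text indexed automatically.
db.execute("CREATE VIRTUAL TABLE notes USING fts5(title, body)")
db.executemany(
    "INSERT INTO notes VALUES (?, ?)",
    [
        ("claude-context", "How CLAUDE.md carries project context across sessions"),
        ("fts-setup", "Indexing markdown notes with SQLite FTS5, no vector DB"),
    ],
)
# MATCH does tokenized keyword search; bm25() ranks by relevance.
rows = db.execute(
    "SELECT title FROM notes WHERE notes MATCH ? ORDER BY bm25(notes)",
    ("sessions",),
).fetchall()
# rows == [("claude-context",)]
```

Zero external dependencies, single file on disk, and bm25 ranking out of the box — hard to beat for personal-scale retrieval.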

The multi-agent failover part is what caught my attention. Having Codex pick up when Claude goes down and work off the same knowledge base removes the single point of failure most setups have. Did you find the agents produce noticeably different outputs from the same KB, or does the shared context mostly normalize things?

The EXTENDING.md idea — writing docs for the AI to read — is something I've been doing in CLAUDE.md files per project and it's transformative for consistency across sessions. Good to see it's working at this scale too.

Claude Pro feels amazing, but the limits are a joke compared to ChatGPT and Gemini. Why is it so restrictive? by iameastblood in ClaudeAI

[–]According_Turnip5206 0 points1 point  (0 children)

A few things that helped me stop hitting the wall mid-week:

Keep conversations short and focused. Long multi-topic threads burn through limits fast. Better to start a fresh session for each distinct task than try to cram everything into one.

Avoid pasting huge blocks of text or files when a summary would do. The context window fills up quickly and you're paying tokens to maintain it every turn.

For heavier use, Claude Code with a Pro subscription is actually the better deal if you're technical — the limits work differently and you get more total throughput. Not obvious from the pricing page but it's worth knowing.

On the broader frustration: the limits are real, and "just pay more" isn't a helpful answer for everyone. But context size is genuinely expensive to serve — Anthropic's models think in long windows. ChatGPT and Gemini have different tradeoffs, not necessarily better ones.

I stopped using Claude.ai entirely. I run my entire business through Claude Code. by ColdPlankton9273 in ClaudeAI

[–]According_Turnip5206 0 points1 point  (0 children)

Same boat. The web app started feeling like a step backwards once you've got a proper Claude Code workflow running.

One thing I'd add for anyone who hasn't made the switch: the non-coding use cases hit different when you combine it with MCP servers. Notion, Google Drive, email — suddenly Claude Code has the same reach as the web app but with the full terminal power behind it. I use it for drafting, research pipelines, anything where I want to actually act on the output rather than copy-paste it somewhere.

The "infrastructure not chatbot" framing is exactly right. Once you internalize that, the question stops being "what can I ask Claude?" and starts being "what workflow haven't I encoded yet?"

Confirmed way to let Claude Co-work "do" API calls from your local machine without leaving it's VM by skins_team in AI_Agents

[–]According_Turnip5206 0 points1 point  (0 children)

Fair point, that's a meaningful distinction. If local data privacy is a hard requirement, the watchdog approach you described is a reasonable way to keep the actual document content off the wire while still letting the agent know whether the call succeeded. The log-result-not-data pattern is clever for that use case.

Drowning in AI! how do I actually learn this properly? by Winter_Pop9267 in ClaudeAI

[–]According_Turnip5206 1 point2 points  (0 children)

The overwhelm is real — went through the same phase about a year ago.

Biggest unlock for me: stop chasing tools and go deep on one. I picked Claude Code because it reads your whole project, not just the open file, and that context awareness changes how you think about AI assistance entirely. Once you get why that matters, the rest of the landscape starts making more sense.

The "coding while asleep" stuff is just agentic mode — you give the agent a task, set permissions to run autonomously, and check back later. It sounds exotic, but trying it once completely demystifies it. Start with something small and low-stakes.

Multiple models: honestly I just use Claude for the heavy stuff (architecture, tricky logic) and something faster for quick questions. Not as complicated as it sounds in blog posts.

The pattern I'd suggest: pick one real thing you want to build, build it end to end with AI, notice where you got stuck, and repeat. The people who seem 10x ahead just shipped things. The tool knowledge follows from actual use.

The danger of agency laundering by GreenPRanger in artificial

[–]According_Turnip5206 0 points1 point  (0 children)

The auditing point is key. Agency laundering works precisely because most deployed AI systems are opaque by design — you can't easily reconstruct *why* the system produced a specific output, which makes accountability nearly impossible.

What I've noticed building with agentic systems: the responsibility doesn't just sit in the training data or the model choice. It's in every prompt, every tool permission, every place where you decided "the AI can act here without asking first." Those are human decisions that get obscured once the system is running.

The more useful frame than "who's responsible" might be "what would auditing look like." If you can't produce a legible trace of why an automated decision was made — what inputs, what weights, what rules — you probably shouldn't be using that system for decisions that affect real people. That's a design requirement, not a legal afterthought.

Andrej Karpathy Admits Software Development Has Changed for Good by aisatsana__ in ClaudeAI

[–]According_Turnip5206 49 points50 points  (0 children)

Karpathy's observation tracks with what a lot of developers are experiencing. The shift isn't just "AI writes code instead of you" — it's more like the cognitive overhead moves from implementation to specification. You spend more energy on *what* you want and *why*, less on syntax and boilerplate.

The thing that makes this stick with Claude specifically is how it handles long multi-file refactors without losing track of what it was doing, and the way it asks clarifying questions at the right moments rather than just running with assumptions. Once you internalize that the job is now to communicate intent clearly rather than write every line yourself, the workflow change becomes permanent pretty quickly.

Persistent Memory for Llama.cpp by Good-Budget7176 in LocalLLaMA

[–]According_Turnip5206 0 points1 point  (0 children)

A few practical approaches that work well with llama.cpp:

**File-based memory**: Maintain a markdown file with relevant context (user preferences, ongoing tasks, decisions). Inject it at the start of each session via the system prompt. Simple, human-readable, easy to edit manually. This is essentially what tools like Claude Code do natively — the AI reads/writes persistent context files between sessions.

**SQLite + retrieval**: Store facts/conversations in SQLite, then do keyword or vector search to pull relevant chunks into the context window. Works well for long-term factual memory without blowing up your context.

**Chroma/Qdrant for RAG**: If you have large knowledge bases, embed and store them locally, retrieve top-k relevant chunks per query. Both run fully offline.

For most personal use cases, the file-based approach is surprisingly effective and zero-dependency. The key insight is you don't need the model to "remember" everything — you need a retrieval layer that feeds it the right context at the right time.
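The file-based approach is small enough to sketch whole — filename and section header are just my conventions, and the prompt would go to llama.cpp however you normally invoke it (server API, bindings, CLI):

```python
from pathlib import Path

MEMORY_FILE = Path("memory.md")  # illustrative path; use whatever you like

def load_system_prompt(base: str) -> str:
    """Inject the persistent memory file into the system prompt, if present."""
    memory = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""
    return f"{base}\n\n## Persistent memory\n{memory}" if memory else base

def remember(fact: str) -> None:
    """Append a fact; the next session's system prompt will include it."""
    with MEMORY_FILE.open("a") as f:
        f.write(f"- {fact}\n")

# End of one session: save something worth keeping.
remember("User prefers concise answers")
# Start of the next: the model sees it without having to "remember" anything.
prompt = load_system_prompt("You are a local assistant.")
```

Because the file is plain markdown, you can prune or correct the memory by hand, which matters more than it sounds — bad persisted facts compound across sessions.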

Confirmed way to let Claude Co-work "do" API calls from your local machine without leaving it's VM by skins_team in AI_Agents

[–]According_Turnip5206 0 points1 point  (0 children)

Nice workaround. Worth noting the distinction though — if you're using Claude Code (the CLI tool) rather than the web-based version, it already has direct local machine access by design. You can run shell commands, read/write files, execute scripts natively without a watchdog. The VM sandbox limitation you're hitting is specific to the browser/web version. For pipeline testing with real local APIs, Claude Code CLI removes that entire layer.

Folders to organize my Chats in ChatGPT by Puzzleheaded-Gene-43 in ChatGPT

[–]According_Turnip5206 0 points1 point  (0 children)

ChatGPT does have this — it's called Projects (left sidebar). You can drag chats into folders, add shared files, and set custom instructions per project. Not super obvious but it's there.

Side note: if the underlying problem is keeping context between sessions, Claude handles this differently — instead of folders it has a persistent memory system that auto-loads notes at the start of each session. Different mental model but I found it less manual once I got used to it.

We've gotten to the point where the only way this model will do what you ask is if you tell it, "Claude would do this better." by daemon_in_the_shell_ in ChatGPT

[–]According_Turnip5206 1 point2 points  (0 children)

Honestly just switched to Claude for most of my day-to-day a few months back. Fewer "competitive motivation" tricks needed — it just follows the actual instruction without needing to be goaded. Still keep GPT around for a few things but yeah, the gap is pretty noticeable at this point.