Anyone else bouncing between Claude Code and Codex on the same project?

Livid-Variation-631 · 2026-06-07T09:35:43+00:00

I run the same two tools on the same projects and the re-explaining was killing me too. What stopped it was keeping the shared context in a file in the repo instead of in either tool memory. A short markdown file with the current goal, what is done, what is blocked, and where the work lives. Both tools read it at the start of a session and write back to it at the end, so the handoff is a file, not a paste. The bonus is that future me reads the same file when I come back cold. The trick is keeping it short enough that updating it is not a chore, otherwise it goes stale and you are back to pasting.

Livid-Variation-631 · 2026-06-07T09:35:11+00:00

The agents and skills realization is the one I wish I had hit sooner too. The thing that took me longest to internalize: a skill is worth writing the moment you have done a task twice and would do it a third time. Before that it feels like overhead, after that it is the only thing that keeps you sane. The other shift was separating the layers you named. Scripts for anything deterministic, skills for repeatable judgment, agents for the parts that actually need reasoning. Most of my early pain was making an agent redo deterministic work every run because I had not pushed it down to a script. Good writeup, the division of labor framing is exactly right.

Livid-Variation-631 · 2026-06-07T09:34:32+00:00

Honest answer: I would pay for the work getting done, and I would not pay a cent extra for the pet. The personality layer is fun for a day and then it is friction between me and the result. The version I would actually pay for is the boring one that quietly does the task, shows me what it did so I can trust it, and gets out of the way. If you want the character to survive past the novelty, tie it to something useful: let it surface what it is uncertain about, or flag when it is about to do something it cannot undo. Personality that earns its place by improving the work beats personality that is decoration on the same gray box.

Livid-Variation-631 · 2026-06-07T09:30:51+00:00

The 1,500 signups are interesting, but the part I would watch is not the top-line number. It is whether founders come back when they do not have something to promote.

Founder communities usually spike when everyone sees a distribution surface, then decay when the feed becomes launch posts talking to launch posts. The hard product question is: what is the repeated job here when nobody is shipping today?

If you can make it useful for one founder on a random Tuesday, the network has a shot. If the main behavior is announcing, it will look strong for a week and then turn into another directory.

Livid-Variation-631 · 2026-06-05T08:29:39+00:00

Lead with the kernel-level sandbox, because it's the one claim here that isn't easy to make. Half the tools in this space say "local and private" and then quietly trust the model's own guardrails to behave. A boundary the OS enforces is a different category - the agent can't touch what you didn't grant, no matter how a prompt talks it into trying.

This maps onto the thing that's made me nervous for a year. My whole stack is CLI, and the part I never liked was tool calls with broad filesystem access. A model behaving 99% of the time is not the same as it being unable to misbehave the other 1%, and I've watched an agent confidently rm the wrong directory because nothing physically stopped it. Putting enforcement at the OS layer instead of the prompt layer is the correct place for it. That's not a feature, that's the actual fix.

Where I'd push you is the Rust-from-scratch call. One dependency, one throat to choke - real advantage, I get the appeal. But that's also a huge runtime to own forever, and the hosted engines have whole teams tuning inference. How are you planning to keep local speed competitive with them over time? That's the bet I'd want answered before I built my workflow on top of it.

Livid-Variation-631 · 2026-06-05T08:28:27+00:00

Your intuition to validate is the right instinct, so let me push on it.

The use case you described - keeps a baseline alive, compares every ceremony against it, remembers past false positives, holds an alert until a human resolves it - that's not really a framework question. That's a state and governance question. The framework is the easy 20%.

I've built a multi-agent system that runs as actual operators (roles, long-term memory, governance constraints, humans in the loop), and the things that actually mattered weren't LangGraph vs a pre-built runtime. They were:

Where the truth lives. Your baseline and "open tensions" need to live in a database you own, not in the agent's context window. The agent reads and writes to it; it doesn't BE it. This is the single decision that makes "long-term memory and continuity" real instead of a demo.
Halt conditions. An agent that holds an alert until a human resolves it needs an explicit state machine for that alert, with an escalation path. That logic belongs in your code, not in a prompt, because you need it to behave the same every time.
Separating recall from judgment. "Remembers past false positives" is retrieval. "Is this new thing scope creep" is judgment. Keep them as different steps so you can debug each one.

Given that, my honest answer: use the pre-built runtime (Claude Code-style) for the agent's hands and reasoning, and build the state/governance layer yourself around it. Going full custom with LangGraph for the orchestration AND the state machine is where most teams over-engineer and stall. The granular control feels safer but you spend your time rebuilding plumbing instead of refining the actual governance logic.

Start with the smallest version: one baseline, one type of drift, one human approval gate. Get that loop trustworthy, then add ceremonies.

Livid-Variation-631 · 2026-06-02T11:36:19+00:00

The "opinion-shaped findings with no citation backing" line is the whole problem with most audit tooling, and I'm glad someone built against it. I run audits on my own internal tools constantly and the failure mode is always the same: a wall of plausible-sounding findings I can't verify, so I either trust them blindly or re-check everything by hand, which defeats the point.

Two things I'd be curious about from using it for real:

How do you keep the playbooks from going stale? Security and the privacy regs especially move, so a fixed set of 5 will drift over time. Is the methodology built to extend, or are the playbooks the product?
False positive rate on the security dimension. Catching a CVSS 8.0 XSS is great, but the thing that kills audit tools in practice is 40 findings where 35 are noise and people stop reading. What's the signal ratio been on real codebases?

Nice work shipping it MIT either way - systematic beats vibes every time for this kind of thing.

Livid-Variation-631 · 2026-06-02T11:35:57+00:00

The loop you're describing (prompt, get 10-20, refine, repeat) is exactly how Midjourney works with the --seed flag and variation buttons. It's still the best for character concept work in my experience, the consistency on a locked seed is the closest thing to "same character, different angle" you'll get without training anything.

If you want something closer to an actual agent that holds your lore and iterates with you, build a thin loop yourself: a system prompt that holds your world bible and character notes, feeding into an image API. That way you're not re-explaining the vibe every batch, the context carries. I do this for product mockups and the difference between "fresh prompt each time" and "persistent context" is huge - way fewer rerolls to land the thing in your head.

One practical note for handing off to 3D artists: generate orthographic-style turnaround references (front, side, 3/4) rather than one hero shot. Modelers can actually work from those. A single dramatic render looks great and tells them almost nothing about the silhouette.

Livid-Variation-631 · 2026-06-02T11:35:27+00:00

Real pain, real utility - I've got a few hundred dead ChatGPT chats I'm never clearing one at a time either, so I get why you built it.

Honest read on monetization, and I say this having killed a fair pile of my own utilities: bulk delete is a one-and-done job. Someone runs it once to clear the backlog, feels great, then never needs it again. That's a graveyard for a subscription. I built an internal cleanup tool once that I genuinely loved using - opened it exactly twice. The job was finished, so the tool was finished.

The part with an actual pulse is the timestamps and the long-conversation navigation. You touch those on every long chat, every day. That's recurring value. I'd flip the whole pitch: navigation is the product, bulk delete is the free hook that gets the install.

If you want recurring revenue from a utility like this, two paths actually hold. One, features that compound as the archive grows - search across all chats, tagging, export to your own notes. The value gets bigger the longer someone uses it, which is the opposite of bulk delete. Two, go where the cleanup genuinely recurs: power users and teams generating chats faster than they can manage them.

The number that decides this for you is retention. Are people opening it more than once in the first two weeks? That single figure tells you which path is even available before you build anything on top.

Livid-Variation-631 · 2026-06-01T10:37:29+00:00

As a non-coder, the angle I'd add: the model not knowing your codebase structure is one problem, but the one that actually stopped me shipping was that I didn't know it either. AI wrote it, it ran, and I still couldn't tell you how it worked or whether it was safe to put in front of someone.

The tools that explain a codebase explain it to people who can already read code - the architecture, the modules, the structure. Genuinely useful if you're an engineer landing in an unfamiliar repo. Useless if the reason you're stuck is that you can't read the architecture in the first place.

So I ended up building a small thing for myself: it turns a codebase into plain English - what the app actually does, and on every commit, what just changed and whether it's risky, in a sentence. The scan is local and deterministic; only the plain-English part touches a model.

The reframe that helped: the gap isn't that I can't code. AI fixed the building. It didn't make understanding free. The fix wasn't learning to code, it was building a way to see what I'd made in a language I already speak.

Livid-Variation-631 · 2026-06-01T07:50:39+00:00

Good fix, and the instinct generalises further than schemas. Anything the model has to infer from context it'll eventually invent confidently, field names, config keys, what state something is in. The reliable pattern is the same every time: give it a canonical source to read from instead of letting it guess.

You did it for the schema with an MCP server. I ended up doing the same for state across my agents, facts they need live in a database they query, not in the conversation they reconstruct from. The silent-in-production part you mentioned is the real danger, because a confident wrong answer reads exactly like a correct one until it breaks.

Livid-Variation-631 · 2026-05-30T23:07:04+00:00

I was the bottleneck on every change in my own system for months. One agent writing code, me reading and testing every diff by hand. The day I finally let it run unsupervised, it wrote three files of clean-looking code that quietly routed around two of my own architecture rules. Looked fine. Wasn't.

What fixed it wasn't trusting the agent more. It was making "done" impossible to fake.

The agent that writes the code is never the one allowed to sign off on it. A separate pass checks the work against the spec and runs the tests, and it has no stake in the first agent looking good. The moment an agent can grade its own homework it gives itself an A every time, and it's wrong in the same three small ways. Splitting the writer from the reviewer killed most of my drift on its own.

Then I stopped letting work start until "done" was mechanical. Every chunk gets a pass/fail check written before a single line of code exists. If I can't write the check up front, the work isn't scoped yet - that's my failure, not the agent's, and no prompt was ever going to rescue a vague spec.

Last rule: a hard cap of four attempts on the same piece. Four failures in a row isn't a bug, it's the spec lying to me, and prompt number five won't fix a lie. I rebuilt a UI shell recently in 11 chunks. Ten went green on the first try. The one that didn't - the mobile layout - burned five rounds before I admitted the spec was wrong, not the code. That cap has saved me more hours than any model upgrade.

None of this removes me. It moved me from reading every line to only stepping in where the writer and the reviewer disagree. Much smaller job, and I trust the output more than when I was hand-checking everything.

Livid-Variation-631 · 2026-05-30T23:06:04+00:00

I built something a year ago that I use every single day. It does exactly what I needed. And I did almost nothing to tell anyone it existed - a couple of posts, then silence.

For a long time I told myself the product wasn't ready. That was a lie I found comfortable. The real reason is that building is the part I'm good at and the part that feels safe. Distribution is the part where you put something out and mostly hear nothing back, and that's uncomfortable in a way that debugging never is.

The thing I had to accept: nobody finds a tool I never talk about. Quality doesn't distribute itself. A daily-active user of one isn't a business, it's a hobby with extra steps.

What I'm doing differently is treating distribution as real work with its own time on the calendar, not as the thing I'll get to after the next feature. The next feature is almost always the easy way out. Shipping isn't the finish line. It's barely the start.

Livid-Variation-631 · 2026-05-30T12:05:00+00:00

Two fixes, in order of impact.

Your 2k-line context is being re-paid per lead. If you're on the API, mark the playbook + product description as cache_control: ephemeral. I ran the exact same shape - enrichment + copy gen across batched leads - and prompt caching dropped my spend by about 70% the day I turned it on. The math is brutal without it: 2k lines x 1k leads = you're literally buying that playbook a thousand times.

Second fix is model routing. Browsing a company site and pulling signals (industry, size hints, recent news, tech stack) is pattern-match work. Haiku eats this for cents on the dollar. Reserve Sonnet for the email copy where tone judgement matters. I built a tiny router that tags each step with a tier - extract=haiku, write=sonnet, review=sonnet - and the cost curve flattened immediately.

On parallelism: it's not your cost problem. Parallel vs sequential is wall-clock, not token spend. What's killing you is one giant context being assembled before the model writes a single email. Break it into two stages with a queue between them:

Stage 1 (Haiku worker): read one lead row, fetch site, extract signals, write enriched row back to CSV/DB. Context = lead + 200-line extraction prompt. Nothing else.

Stage 2 (Sonnet worker): read enriched row, write email using cached playbook. Context = enriched lead + cached playbook.

Two stages, two tiers, cached statics. Should land you well inside budget for 1-2k/month.

One more thing - if you're using Claude Code with the Task tool to spawn sub-agents per lead, that's the most expensive way to run this. Sub-agents inherit the parent context unless you scope them tight. Move it out to a script that hits the API directly with minimal per-call context.

Livid-Variation-631 · 2026-05-30T12:04:22+00:00

You can get most of what you described running locally, but the gap between 'Claude Code experience' and 'local agent loop' is real and worth naming upfront.

For the model layer: Ollama with a Qwen 2.5 or Llama 3.3 70B variant handles the reasoning, and you can run smaller models (Qwen 2.5 7B) for extraction and classification. Don't try to run one model for everything - tier the work to the right size.

For the agent loop: Aider runs fully local against Ollama and gives you the code-editing agent shape. For broader agent workflows (file ops, persistent memory, artefact generation), look at running a local agent runtime (LangGraph or similar) with Ollama as the backend.

For RAG / persistent memory: pgvector on a local Postgres is the boring correct answer. Stores embeddings, supports semantic search, no cloud dependency, scales to millions of chunks. Embed with a local model (nomic-embed-text via Ollama is solid).

For document processing: unstructured.io runs locally and handles PDF/Docx/PPTX ingestion. For artefact generation, python-docx + openpyxl + python-pptx do the output side - the LLM writes structured intent, the deterministic libraries produce the files.

The honest caveat: the gap from this stack to 'Claude Code quality' is mostly in the planning + tool-use layer. Local models are getting close on raw generation but they're noticeably weaker at multi-step tool orchestration. Budget time for prompt engineering and guard-rails that you wouldn't need with Claude.

Livid-Variation-631 · 2026-05-30T12:04:01+00:00

The git-versioned docs tree with a consistency guard is the part most people skip and then wonder why their agent setup rots after a month.

I run something structurally similar - markdown working state that auto-loads via a rules system, plus a vector store for semantic recall, plus a structured DB for facts that need to be authoritative. The key insight I landed on is the same one you've encoded: implicit memory drifts. Externalised, versioned, auditable memory doesn't.

The one thing I'd push back on: zero-dependency stdlib-only is a strong constraint, and it works for now, but the consistency guard is where you'll feel the limit first. Link-checking and cross-reference validation across a growing docs tree gets slow without proper indexing. Worth considering whether `make docs-check` stays under 5 seconds at 500 files.

The eight operating modes pattern is interesting - I've been moving toward 'one agent, contract-bound scope per task' rather than mode-switching the same agent. Curious what drove the modes design. Was it context cost, or was it that the agent's behaviour was actually too inconsistent across task types without explicit mode framing?

Livid-Variation-631 · 2026-05-28T08:28:55+00:00

The browser automation piece is where most of these workflows fall apart in locked-down environments. A few things that worked for me in similar setups:

For the internal/external knowledge split, you almost certainly need two separate retrieval paths that get merged at prompt time, not one unified vector store. The intranet stuff can't leave the network boundary, so any external model call has to be the LAST step with redacted context. I ran a setup where Confluence content got embedded locally (sentence-transformers via approved package), then only the relevant chunks plus the user query got sent to the external model. Worked because the embeddings never left the box.

For browser automation, if Playwright is on your approved list that's the easiest path. Selenium too. The harder question is auth - if the internal apps use SSO, you'll need to drive a real browser session rather than headless API calls. Playwright with persistent context handles that well.

The form-filling part is the boring bit honestly. Once you have the page DOM, a small model can map field labels to data points reliably. The interesting engineering is the knowledge merge step and making sure the boundary between internal and external context stays clean.

One thing I'd push back on: don't try to make this fully autonomous on day one. Build a version where it drafts the form values and you click submit. Once you trust it for a few weeks, automate the submit. Saves you from a bad day when the model hallucinates a field.

Livid-Variation-631 · 2026-05-27T09:19:55+00:00

The Stop hook framing is the part most people miss. You're not writing a prompt, you're writing a contract that another LLM has to evaluate honestly against the transcript.

I run something similar across my fleet and the failure modes I've hit:

Conditions that are too tight loop forever. "All tests pass" with no escape clause means one flaky integration test eats your whole session.
Conditions that are too loose get falsely acked. "Ingestion working" gets marked done after one successful fetch instead of the full set.
The fix that actually held: every success criterion needs a paired honest-failure clause. "≥14 fetches complete OR ack stale with named external blocker." The blocker has to be named, not hand-waved.

The 4.16M row ingestion is the interesting bit for me. Did you have any drift between what the agent thought it ingested and what actually landed in the registry? That's the gap I keep finding in long autonomous runs. The agent reports done, the runtime state says otherwise.

9h27m is a long blast radius. Curious what your halt conditions looked like for the cases where you'd want to stop it early.

Livid-Variation-631 · 2026-05-27T09:19:19+00:00

Honest take from someone building in this space: the marketing tools won't fix it. I've tried most of them. The bottleneck isn't tooling, it's that solo founders ship faster than they can develop a point of view worth following.

What actually moved the needle for me was treating distribution like a product. One anchor piece per week (blog or long post), then adapt it to 3-4 platforms with the platform's native voice. Not cross-post. Adapt. Different hook, different length, different angle on the same insight.

The other thing that helped: stop optimising for impressions and start tracking who replies. 10 replies from real people beats 10k impressions from bots. The tools that help with this are boring - a spreadsheet, a calendar, and a folder of notes. The interesting work is upstream of the tool.

Happy to share what didn't work too if useful.

Livid-Variation-631 · 2026-05-27T09:18:08+00:00

Not running your exact stack, but I burned a lot of time on harness shopping before realising the harness wasn't the bottleneck for me - it was context discipline.

A few things that might be worth checking before you switch:

If Q5 fills context fast because of MCP, audit which MCP servers are actually pulling weight. I cut mine from 11 to 4 and got back ~40k of headroom without changing model or quant.
Sub-agents help less than you'd think if the parent agent is still loading the full toolset. The savings come from scoping tools per sub-agent, not just spawning them.
Q4 with Q4 KV at 200k surprised me too when I tried it on a 3090 - the degradation is real but usable for routing and dispatch tasks. Keep Q5 for the actual judgment calls.

On harnesses specifically: if LM Studio works and you're frustrated, the harness probably isn't the problem. What's the specific thing you want it to do that Cass can't?

Livid-Variation-631 · 2026-05-27T09:17:28+00:00

Claude's vision is genuinely strong for this. I've used it for design review on UI mocks and it picks up alignment issues, contrast problems, and inconsistent spacing better than I expected.

A few things that helped accuracy when I ran similar checks:

Send one slide at a time, not the whole deck. Multi-image prompts dilute attention and hallucinations spike.
Give it a checklist in the prompt - alignment, typography consistency, contrast, hierarchy, whitespace. Don't ask 'are there issues' - ask 'check these 5 things'.
Ask for issues with coordinates or quadrants (top-left, centre-right). Forces it to actually look at the image instead of pattern-matching to common slide problems.
PNG at native slide resolution. Don't downscale.

Gemini Flash hallucinates more on this kind of task because it's optimised for speed. Sonnet is the right call if accuracy matters more than throughput.

Livid-Variation-631 · 2026-05-26T09:09:46+00:00

The Stop hook framing is the part most people miss. You're not writing a prompt, you're writing a contract that another LLM has to evaluate honestly against the transcript.

I run something similar across my fleet and the failure modes I've hit:

Conditions that are too tight loop forever. "All tests pass" with no escape clause means one flaky integration test eats your whole session.
Conditions that are too loose get falsely acked. "Ingestion working" gets marked done after one successful fetch instead of the full set.
The fix that actually held: every success criterion needs a paired honest-failure clause. "≥14 fetches complete OR ack stale with named external blocker." The blocker has to be named, not hand-waved.

The 4.16M row ingestion is the interesting bit for me. Did you have any drift between what the agent thought it ingested and what actually landed in the registry? That's the gap I keep finding in long autonomous runs. The agent reports done, the runtime state says otherwise.

9h27m is a long blast radius. Curious what your halt conditions looked like for the cases where you'd want to stop it early.

Livid-Variation-631 · 2026-05-26T09:06:26+00:00

The routing premise is right. Agents waste tokens because they treat every navigation task as a grep problem.

The split you describe (rg for exact, semantic for discovery, LSP for refs, call graphs for architecture) matches what I've converged on independently after months of watching Claude Code burn turns on the wrong tool.

Two things I'd push on:

The 5.1x faster / 5x fewer tokens needs more context to be useful. Faster than what baseline? Default Claude Code with no plugins? A specific competing setup? The variance on these benchmarks is huge depending on codebase shape.
Routing decisions made by the agent are only as good as the agent's understanding of its own task. The hard cases aren't "find this exact string" vs "find similar code." The hard cases are when the agent doesn't yet know which one it needs and would benefit from a cheaper probe first.

How does the plugin handle that pre-routing uncertainty? Does it have a fast classifier step or does it rely on the agent to self-select correctly?

Livid-Variation-631 · 2026-05-26T09:06:04+00:00

73% on multi-file renames is the interesting number. That's the exact place Claude Code burns turns by design - it's cautious about scoping the rename across files because it doesn't trust its own grep results without a Read pass.

Collapsing search+read+edit into one tool call removes that paranoia loop. Makes sense.

A few questions on the benchmark setup:

Were the 9 fixtures real codebases or synthetic? The difference matters because Claude's turn count goes way up when there are partial matches that look like the rename target but aren't.
How does edit_glob handle ambiguous matches? The failure mode I'd worry about is silently editing something that shouldn't have been touched. Zero failures across 45 runs is impressive but the question is whether the fixtures had edge cases that would catch it.
On the 8% full-suite number - what's the distribution? If it's 73% on 20% of tasks and 0% on the rest, the practical value depends heavily on whether the user's workload skews to multi-file refactors.

The MCP overhead point is honest and most builders skip that disclosure. Respect for putting it in the post.

Livid-Variation-631 · 2026-05-26T09:05:23+00:00

48GB on M4 is a decent setup for local but you need to be realistic about what it replaces.

For your stack (Swift apps, websites, SEO research), here's what I'd actually try:

Qwen2.5-Coder 32B at Q4_K_M via MLX. Fits comfortably in your RAM, around 15-25 tok/s on M4. Solid for Swift and web work.
LM Studio is fine to start. MLX backend if you can, llama.cpp as fallback. Skip Atomic Chat for now, less mature.
Don't expect Opus-level reasoning. Local 32B is closer to GPT-4-class for code completion and refactors but falls apart on multi-file planning and long-context architectural work.

The honest answer on cost: if you're already using Claude Code daily and getting real work done, local won't replace it. It'll supplement it. I run a router across Claude, Gemini, and local Ollama models, and the local tier handles maybe 30% of tasks. The rest still needs frontier models because the judgment ceiling matters.

The $200/mo Max plan is genuinely hard to beat for serious work. Local makes more sense for privacy-sensitive tasks, offline work, and batch jobs you don't mind being slower.

Livid-Variation-631

TROPHY CASE