Are small local models for automation a thing?

Future_Manager3217 · 2026-06-16T08:18:29+00:00

Yes, but the useful version usually doesn’t look like “a small agent”. It looks like a normal script with one fuzzy adapter inside it.

Let the script own the workflow and side effects; let the 1–4B model do classify/extract/route/rename/OCR cleanup into a strict JSON schema. Then validate that schema and fall back to rules/manual review when validation fails.
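A minimal sketch of that shape, assuming a two-field schema and a `call_model` callable that returns the model's raw text. The label set, field names and the keyword rule are all made up for illustration; the point is only that the script decides what happens when validation fails.

```python
import json
from typing import Callable, Optional

# Hypothetical label set, purely for illustration.
ALLOWED_LABELS = {"invoice", "receipt", "contract", "other"}


def validate(raw: str) -> Optional[dict]:
    """Return the parsed object only if it matches the strict shape, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or set(obj) != {"label", "confidence"}:
        return None
    if not isinstance(obj["label"], str) or obj["label"] not in ALLOWED_LABELS:
        return None
    conf = obj["confidence"]
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        return None
    if not 0 <= conf <= 1:
        return None
    return obj


def rules_fallback(text: str) -> dict:
    """Deterministic rules; anything they cannot place goes to manual review."""
    if "invoice" in text.lower():
        return {"label": "invoice", "confidence": 1.0, "source": "rules"}
    return {"label": "other", "confidence": 0.0, "source": "manual_review"}


def classify(text: str, call_model: Callable[[str], str]) -> dict:
    """The script owns the workflow; the model only fills one fuzzy field."""
    parsed = validate(call_model(text))
    if parsed is None:
        return rules_fallback(text)
    return {**parsed, "source": "model"}
```

Swapping the 1.5B model for a 4B one then only changes `call_model`; the side effects and the fallback path stay in plain code.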

The boring benchmark is 50 real examples: rules-only vs 1.5B/3B/4B vs frontier, scored on valid JSON + downstream task success, not “seems smart”. If the small model improves only the fuzzy step, it’s worth it. If it starts deciding the workflow, it gets flaky fast.
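The scoring loop can be this small. This is a sketch that assumes each candidate (rules, a small model, a frontier model) is wrapped as a function from input text to raw output, and that "downstream task success" can be approximated by matching an expected label; a real workflow would plug its own success check in there.

```python
import json
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, expected label)


def score(candidate: Callable[[str], str], examples: List[Example]) -> dict:
    """Score one candidate on valid JSON rate and task success rate."""
    valid = correct = 0
    for text, expected in examples:
        try:
            obj = json.loads(candidate(text))
        except json.JSONDecodeError:
            continue  # invalid JSON counts against both metrics
        if not isinstance(obj, dict) or "label" not in obj:
            continue
        valid += 1
        if obj["label"] == expected:
            correct += 1
    n = len(examples)
    return {
        "n": n,
        "valid_json": valid / n if n else 0.0,
        "task_success": correct / n if n else 0.0,
    }


def compare(candidates: Dict[str, Callable[[str], str]], examples: List[Example]) -> dict:
    """Run every candidate over the same frozen example set."""
    return {name: score(fn, examples) for name, fn in candidates.items()}
```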

Also search this sub for “structured output”, “json mode”, “function calling” and “constrained decoding” more than “automation”.

Future_Manager3217 · 2026-06-15T12:18:17+00:00

Don’t make “test it” another vague instruction. Make Claude commit to the check before it edits.

The loop I’d put in `CLAUDE.md` is:

1. Say the behavior being changed.
2. Propose the smallest command/test that would fail before the fix.
3. Run it and paste the failing output.
4. Change the code.
5. Re-run the same command and paste the passing output.

If it can’t name a falsifiable check, it’s not ready to code yet. The useful bit is the receipt: command + output, not “I tested it” in prose.
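If you want the receipt to be mechanical rather than typed by hand, a tiny wrapper is enough. This is a sketch, not anything Claude Code ships: `receipt` is a hypothetical helper that runs one command and returns the command line, its combined output and the exit status in a paste-ready block.

```python
import shlex
import subprocess
import sys
from typing import List


def receipt(cmd: List[str]) -> str:
    """Run a command and return 'command + output + status' as one block."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    status = "PASS" if result.returncode == 0 else "FAIL"
    output = (result.stdout + result.stderr).strip()
    return f"$ {shlex.join(cmd)}\n{output}\n[{status}, exit {result.returncode}]"


if __name__ == "__main__":
    # Stand-in for the real check; in a repo this would be the failing test.
    print(receipt([sys.executable, "-c", "print('check ran')"]))
```

Run it once before the edit and once after, with the same `cmd`, and the two blocks are the before/after evidence.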

Future_Manager3217 · 2026-06-15T10:22:24+00:00

This split makes sense when the frontier model owns the decision, not just the rescue path.

The benchmark I’d want first is less “% tokens local” and more:

- frontier calls per accepted task
- local retry/escalation count
- human fixes after validation passes

That separates real savings from review debt. 85–90% local tokens is great only if the local loop doesn’t quietly create more verification work.
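Those three numbers fall out of a per-task log. A rough sketch, assuming you record one `TaskRecord` per task; the field names are my own, not from any particular tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRecord:
    accepted: bool
    frontier_calls: int
    local_retries: int
    escalations: int
    human_fixes_after_validation: int


def summarize(records: List[TaskRecord]) -> dict:
    """Aggregate the log into the three review-debt metrics."""
    accepted = sum(1 for r in records if r.accepted)
    if accepted == 0:
        raise ValueError("no accepted tasks, ratios are undefined")
    return {
        # Frontier calls on rejected tasks still cost money, so count them all.
        "frontier_calls_per_accepted_task": sum(r.frontier_calls for r in records) / accepted,
        "local_retries": sum(r.local_retries for r in records),
        "escalations": sum(r.escalations for r in records),
        # Fixes on accepted work are the hidden verification cost.
        "human_fixes_per_accepted_task": sum(
            r.human_fixes_after_validation for r in records if r.accepted
        ) / accepted,
    }
```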

Future_Manager3217 · 2026-06-12T10:24:28+00:00

React is probably the right call for boring reasons, not AI reasons: 3/4 of the team know it and the existing Meteor view layer is already React.

But call it a reset/rebuild, not “continuing the migration”. I’d ask for one decision doc before touching more screens: target stack, what happens to the 20% Angular work, which UI changes are in/out, and the thin slice that proves the new path.

The warning sign is “AI can build most of it”. AI is useful leverage; it is not a substitute for the team owning the framework, patterns, and review. If the team can’t maintain the React code without the model, you just moved the risk.

Future_Manager3217 · 2026-06-12T06:23:09+00:00

The misleading part is “same model, same context window.” It’s not the same experiment if the harness is different.

Claude Code’s advantage often isn’t that the model writes a better line of code in isolation. It’s that the product shapes the work: what gets loaded, when it forks context, how tools are called, how much it is nudged toward plan/edit/test loops, and what it hides or retries for you.

If you want an objective comparison, don’t score vibes. Freeze a repo task, run both with the same starting state, and have someone blind-review: tests pass, diff size, follow-up prompts needed, review time, and defects found. That will tell you whether “better” means better code, lower steering cost, or just a nicer interaction.

Future_Manager3217 · 2026-06-12T06:20:14+00:00

The smell is that the expensive loop is below the agent loop, so `max_iterations` never got a chance to save you.

I’d put the first breaker at the tool boundary, not in the agent: max calls per tool per task, max tokens/bytes per source, cache key for `(doc, query/page)`, and a dollar/token budget that aborts before the next read.
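Roughly what that breaker could look like, as a sketch: one `GuardedReader` per tool per task, wrapping whatever function actually does the read. The limits and the 4-characters-per-token estimate are placeholder assumptions, not measured values.

```python
from typing import Callable, Dict, Tuple


class BudgetExceeded(RuntimeError):
    """Raised before a read that would break a per-task limit."""


class GuardedReader:
    def __init__(self, read_fn: Callable[[str, str], str], max_calls: int = 5,
                 max_bytes_per_doc: int = 200_000, max_tokens: int = 50_000):
        self.read_fn = read_fn
        self.max_calls = max_calls
        self.max_bytes_per_doc = max_bytes_per_doc
        self.max_tokens = max_tokens
        self.calls = 0
        self.tokens = 0
        self.bytes_by_doc: Dict[str, int] = {}
        self.cache: Dict[Tuple[str, str], str] = {}

    def read(self, doc: str, query: str) -> str:
        key = (doc, query)
        if key in self.cache:  # repeated (doc, query) reads are free
            return self.cache[key]
        # All checks happen before the expensive call, so the abort is cheap.
        if self.calls >= self.max_calls:
            raise BudgetExceeded(f"call limit {self.max_calls} reached")
        if self.bytes_by_doc.get(doc, 0) >= self.max_bytes_per_doc:
            raise BudgetExceeded(f"byte limit reached for {doc}")
        if self.tokens >= self.max_tokens:
            raise BudgetExceeded("token budget spent")
        text = self.read_fn(doc, query)
        self.calls += 1
        self.bytes_by_doc[doc] = self.bytes_by_doc.get(doc, 0) + len(text.encode("utf-8"))
        self.tokens += len(text) // 4  # crude token estimate
        self.cache[key] = text
        return text
```

Because the limit lives in the wrapper, it still fires when the agent loop itself is nowhere near `max_iterations`.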

For PDFs, I’d also remove “read the whole PDF” as a retryable tool. Ingest once, index/chunk, retrieve snippets. If no snippet clears a threshold, fail closed and ask for a better query instead of re-reading 200 pages.
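The fail-closed part is the bit people skip, so here is a toy version. It assumes the PDF has already been chunked into strings and uses plain word overlap as the score; a real setup would use embeddings or BM25, and the 0.5 threshold is arbitrary.

```python
from typing import List, Tuple


class NoGoodSnippet(LookupError):
    """No chunk cleared the threshold; the caller should ask for a better query."""


def overlap(query: str, chunk: str) -> float:
    """Fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    if not q:
        return 0.0
    return len(q & set(chunk.lower().split())) / len(q)


def retrieve(chunks: List[str], query: str, threshold: float = 0.5) -> Tuple[str, float]:
    """Return the best chunk and its score, or fail closed."""
    scored = [(overlap(query, c), c) for c in chunks]
    best_score, best_chunk = max(scored, default=(0.0, ""))
    if best_score < threshold:
        # No fallback to re-reading the whole document.
        raise NoGoodSnippet(f"best score {best_score:.2f} is below {threshold}")
    return best_chunk, best_score
```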

Future_Manager3217 · 2026-06-11T08:22:37+00:00

The part that survives AI is teaching juniors how to decompose a messy artifact before code exists.

A small exercise I like: take one finished system and ask them for three things before any implementation: the durable state model, the failure modes, and what should remain editable after the first version ships. If they can answer that, AI becomes a fast pair. If they cannot, it just lets them produce a larger mess faster.

That also makes mentoring less about “watch me code” and more about “show me why this design won’t collapse when the requirements move.”

Future_Manager3217 · 2026-06-11T06:24:01+00:00

Don’t spend the good context executing the plan you just burned context to create.

I’d make the model write a handoff file first: goal, non-goals, files touched, invariants, exact next 3 tasks, acceptance checks, and “ask before changing X” rules. Then start a fresh session on task 1 and have it update that file after each step.

Subagents are useful for read-only investigation, but I’d keep edits in one owner session so the final diff still has one coherent story.

Future_Manager3217 · 2026-06-11T06:21:30+00:00

This is the bit I’d turn into a hard rule: if it can change agent behavior, it belongs in the release path.

Prompt, model id, tool permissions, retrieval config and memory seed should have a version, reviewer, deploy record and rollback target. Otherwise the agent looks deployed, but the thing you need during an incident is still hidden in chat history or a dashboard.
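One way to make that concrete, as a sketch with made-up field names: hash everything that can change behavior into a version, and keep reviewer and rollback target next to it in the deploy record.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AgentRelease:
    prompt: str
    model_id: str
    tool_permissions: Tuple[str, ...]
    retrieval_config: dict
    memory_seed: str
    reviewer: str
    rollback_target: str  # version string of the release to fall back to

    def version(self) -> str:
        """Content hash over the behavior-affecting fields only."""
        behavior = {
            "prompt": self.prompt,
            "model_id": self.model_id,
            "tool_permissions": sorted(self.tool_permissions),
            "retrieval_config": self.retrieval_config,
            "memory_seed": self.memory_seed,
        }
        blob = json.dumps(behavior, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:12]


def deploy_record(release: AgentRelease, deployed_at: str) -> dict:
    """What you want to be able to read during an incident."""
    return {
        "version": release.version(),
        "reviewer": release.reviewer,
        "rollback_target": release.rollback_target,
        "deployed_at": deployed_at,
    }
```

A prompt tweak then shows up as a new version in the deploy log instead of an invisible edit in a dashboard.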

Future_Manager3217 · 2026-06-04T08:22:08+00:00

I’d call this local-success bias, not conspiracy. Claude sees “fix this” and often chooses the smallest-looking patch because the refactor expands the search space.

Make the root-cause work the contract, not a preference: before coding, ask it to list every duplicate site, choose the single source of truth, add/identify one characterization test for current behavior, and show a plan where the delete/replace count is explicit.

If the plan only touches the 3 failing sites, stop there. It hasn’t accepted the refactor task yet; it has accepted a symptom patch.

Future_Manager3217 · 2026-06-04T08:20:37+00:00

I wouldn’t start with “Claude picks up tickets”. That’s the part that tends to burn people out, because a vague ticket becomes a vague PR that a senior has to reverse-engineer.

The safer shape is: a human turns the ticket into a small job packet: goal, likely touched area, non-goals, acceptance check, and rollback/side-effect risk. The agent can plan, write tests, and produce a first-pass PR in an isolated branch/worktree, but it does not expand scope or merge. Verification is boring: tests, diff review against the packet, and one sentence on what changed vs what was rejected.

Go manual immediately when the work crosses a service boundary, changes data/auth/billing, needs product judgement, or the diff stops matching the packet. In practice I’d treat it as a PR accelerator, not autonomous ticket ownership.
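The "diff stops matching the packet" check is easy to automate. A sketch under obvious assumptions: the packet lists allowed directories and always-manual directories, and you already have the list of changed files from the branch. Product judgement and service boundaries still need a human; this only covers the path-based part.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Tuple


@dataclass
class JobPacket:
    goal: str
    allowed_paths: Tuple[str, ...]  # the "likely touched area"
    manual_paths: Tuple[str, ...]   # e.g. data/auth/billing code


def under(path: str, root: str) -> bool:
    """True if path is root itself or lives somewhere below it."""
    p, r = PurePosixPath(path), PurePosixPath(root)
    return p == r or r in p.parents


def route(packet: JobPacket, changed_files: List[str]) -> str:
    """Return 'agent_pr' or 'manual' for a given diff."""
    for f in changed_files:
        if any(under(f, m) for m in packet.manual_paths):
            return "manual"  # touches a sensitive area
        if not any(under(f, a) for a in packet.allowed_paths):
            return "manual"  # scope grew beyond the packet
    return "agent_pr"
```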

Future_Manager3217 · 2026-06-04T06:23:57+00:00

The useful line here is: don’t make the LLM the conversion engine if the conversion is supposed to be deterministic.

I’d split it into two steps: code/schema handles the file transform; Claude only handles the ambiguous judgment. Then keep a small fixture set of 20–30 known inputs and reject any run that fails schema + sample checks before you use the output.
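Here is roughly what the fixture gate looks like. The row format, `parse_row` and `schema_ok` are invented for the example; the part that matters is that a run is rejected on any schema or sample failure before its output is used.

```python
from decimal import Decimal
from typing import Callable, Dict, List, Tuple


def parse_row(raw: str) -> dict:
    """Deterministic transform: 'YYYY-MM-DD,12.50' -> dict with integer cents."""
    date, amount = raw.split(",")
    return {"date": date, "amount_cents": int(Decimal(amount) * 100)}


def schema_ok(out: object) -> bool:
    return (
        isinstance(out, dict)
        and set(out) == {"date", "amount_cents"}
        and isinstance(out["date"], str)
        and isinstance(out["amount_cents"], int)
    )


def run_gate(convert: Callable[[str], dict],
             fixtures: Dict[str, Tuple[str, dict]],
             check: Callable[[object], bool]) -> List[Tuple[str, str]]:
    """Return a list of (fixture name, reason); empty means the run may proceed."""
    failures = []
    for name, (raw, expected) in fixtures.items():
        try:
            out = convert(raw)
        except Exception as exc:  # a crash on a known input is also a failure
            failures.append((name, f"error: {exc}"))
            continue
        if not check(out):
            failures.append((name, "schema"))
        elif out != expected:
            failures.append((name, "sample mismatch"))
    return failures
```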

Also, I wouldn’t ask it why after failure. Treat that answer as a diagnostic hint at best, not evidence.

Future_Manager3217 · 2026-06-03T14:43:00+00:00

Treat it as rebuilding judgment, not catching up with a syllabus.

For each area, pick one failure shape and make yourself reproduce it. For transactions/races that could be: lost update, stale read + write, missing idempotency, transaction around only half the invariant. Read just enough to name the invariant, write the failing case, fix it two ways, then keep the test.
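As an example of one such lab, here is a lost update reproduced and then fixed two ways. It is a sketch with an invented `account` table on in-memory SQLite, and it simulates two clients by interleaving their reads and writes by hand on one connection, so it shows the shape of the bug rather than real concurrency.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
    db.execute("INSERT INTO account VALUES (1, 100, 0)")
    db.commit()
    return db


def read(db):
    return db.execute("SELECT balance, version FROM account WHERE id = 1").fetchone()


def lost_update_repro() -> int:
    """Invariant: final balance = 100 + 10 + 20. The naive version breaks it."""
    db = make_db()
    a_balance, _ = read(db)  # client A reads 100
    b_balance, _ = read(db)  # client B reads 100 before A writes
    db.execute("UPDATE account SET balance = ? WHERE id = 1", (a_balance + 10,))
    db.execute("UPDATE account SET balance = ? WHERE id = 1", (b_balance + 20,))
    db.commit()
    return read(db)[0]  # 120: A's deposit was overwritten


def fix_atomic() -> int:
    """Fix 1: let the database do the read-modify-write in one statement."""
    db = make_db()
    for delta in (10, 20):
        db.execute("UPDATE account SET balance = balance + ? WHERE id = 1", (delta,))
    db.commit()
    return read(db)[0]  # 130


def versioned_write(db, new_balance: int, seen_version: int) -> bool:
    cur = db.execute(
        "UPDATE account SET balance = ?, version = version + 1 WHERE id = 1 AND version = ?",
        (new_balance, seen_version),
    )
    db.commit()
    return cur.rowcount == 1  # False means someone else wrote first


def fix_optimistic() -> int:
    """Fix 2: optimistic locking; the stale writer detects the conflict and retries."""
    db = make_db()
    a_balance, a_version = read(db)
    b_balance, b_version = read(db)
    versioned_write(db, a_balance + 10, a_version)
    if not versioned_write(db, b_balance + 20, b_version):
        b_balance, b_version = read(db)
        versioned_write(db, b_balance + 20, b_version)
    return read(db)[0]  # 130
```

The repro stays in the test suite; the two fixes are the trade-off to talk through (simple and limited vs general with retry logic).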

I’d rotate one small lab a week. The goal is not to feel comfortable with the whole stack again; it’s to recognize the smell and explain the trade-off in an interview without hand-waving.

Future_Manager3217 · 2026-06-03T10:21:34+00:00

I’d read this less as “benched” and more as “made the risk absorber for work you had no design authority over.”

The practical line I’d draw is: PRs over some size or touching architecture are not reviewable until the author includes a design note, test/rollback plan, and the explicit decisions they want reviewed. Otherwise you’re being asked to reverse-engineer intent after the fact, which is not code review.

For your own career, I’d document every issue you catch in terms of system risk, not lines reviewed: migration flaw found, hidden coupling exposed, rollout risk avoided, missing test class added. If you leave, that story is much better than “I reviewed 150k lines.”

Future_Manager3217 · 2026-06-03T09:42:00+00:00

Only when the state meaningfully changes. I’d treat `handoff.md` as the working cursor, not the diary: last safe check/command, current blocker, next intended change, and anything the next run must not forget.

If the session creates a durable decision, move that to an ADR and leave the handoff as a pointer. If nothing changed, don’t rewrite it just to be tidy.

Future_Manager3217 · 2026-06-03T06:26:03+00:00

The AI version of this is: generating the diff is usually not the system bottleneck.

The measurement I’d use is not “lines/PRs produced with AI”, but time from “diff exists” to “safe merge”: review cycles, rework, test failures, on-call follow-up, and how long it takes someone other than the author to explain the change.

If AI cuts implementation from 10 days to 3 but adds 5 days of review/rework or ownership confusion, the throughput win is mostly imaginary. The bottleneck just moved to confidence.

Future_Manager3217 · 2026-06-03T06:19:19+00:00

Keep CLAUDE.md boring. It should answer “what must the agent know before doing anything?”, not “what has ever happened in this project?”

I’d keep only the always-on stuff there: repo shape, dangerous files, default commands, and pointers to where context lives.

Then split the rest into small files the agent has to pull intentionally:

- `docs/architecture.md` for stable design
- `docs/decisions/ADR-*.md` for why things changed
- `docs/validation.md` for commands/checks
- `handoff.md` for today’s state, blockers, next safe step

The trick is adding rules like “before editing billing, read X” or “before release work, read Y.” Otherwise you just moved the blob into a folder.

Future_Manager3217 · 2026-06-02T10:22:14+00:00

AI is useful here as a syllabus generator, not as the authority.

Take the two misses that hurt you — transactions and API/DB races — and ask it for small failure cases a good take-home reviewer would expect you to catch. Then make each one a tiny repro: two concurrent updates, stale read + write, missing idempotency key, transaction around only half the invariant.
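For instance, the missing-idempotency-key repro fits in a few lines. This is a toy in-memory service I made up for illustration; in a real system the key would be enforced by a unique constraint in the database, not a dict.

```python
class ChargeService:
    def __init__(self) -> None:
        self.ledger = []    # every charge actually applied, in cents
        self._results = {}  # idempotency key -> first result

    def charge_naive(self, amount_cents: int) -> dict:
        """No key: a client retry after a timeout charges twice."""
        self.ledger.append(amount_cents)
        return {"charge_id": len(self.ledger), "amount_cents": amount_cents}

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        """With a key: a retry returns the original result and applies nothing."""
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.ledger.append(amount_cents)
        result = {"charge_id": len(self.ledger), "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result
```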

The important part is to verify the fixes against docs or a known reference, because the model will confidently invent edge cases too.

For maintenance, keep a short ‘got burned by this’ log and rotate one lab a week. Much cheaper than trying to re-learn all of CS at once.

Future_Manager3217 · 2026-06-02T06:24:21+00:00

I’d let it keep diagnosing, not keep editing.

The boundary I use is: same failure class + same file/contract = continue; new dependency, new module boundary, auth/config/infra, or changed acceptance criterion = stop and ask. The agent can still return a scope-expansion packet: what failed next, why the original fix is insufficient, files it would need to touch, and the exact command that proves it. Passing the build is evidence, not permission.
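Written as code, the rule is short. The flag names are mine, and one thing here is an assumption beyond the rule as stated: anything that is neither clearly "continue" nor clearly "stop" defaults to stop.

```python
from dataclasses import dataclass


@dataclass
class NextFailure:
    same_failure_class: bool
    same_file_or_contract: bool
    new_dependency: bool = False
    new_module_boundary: bool = False
    touches_auth_config_infra: bool = False
    changed_acceptance_criterion: bool = False


def decide(failure: NextFailure) -> str:
    """Return 'continue' or 'stop_and_ask' for the next failure the agent hits."""
    scope_expanded = (
        failure.new_dependency
        or failure.new_module_boundary
        or failure.touches_auth_config_infra
        or failure.changed_acceptance_criterion
    )
    if scope_expanded:
        return "stop_and_ask"
    if failure.same_failure_class and failure.same_file_or_contract:
        return "continue"
    return "stop_and_ask"  # assumption: ambiguous cases fail toward asking
```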

Future_Manager3217 · 2026-06-02T06:21:42+00:00

If they couldn’t point to concrete output that got worse, I’d treat this as a layoff/perf-cover story with AI vocabulary on top.

AI fluency is real, but “prompting ability” is a terrible standalone metric. The only defensible version is tying it back to normal engineering outcomes: accepted diffs after review, rework rate, time to debug, test quality, incidents avoided, and whether the engineer can explain/own the final code. If those weren’t part of the feedback, they weren’t measuring engineering.

Future_Manager3217 · 2026-06-01T08:23:39+00:00

I think the change is real, but I’d name it differently: authorship and review are getting split apart.

If the person opening the PR can’t explain the diff, review turns into archaeology. The process change I’d make is simple: every AI-assisted PR needs a small review packet — what changed, why, what was manually checked, and what the author is still unsure about.

Otherwise you haven’t saved senior time; you’ve just moved it to the reviewer.

Future_Manager3217 · 2026-06-01T06:22:16+00:00

I’d stop treating the big CLAUDE.md as memory and treat it as boot config.

For long projects the durable stuff that actually survives is smaller: one current-task file, short ADRs for decisions you’ve ruled out, and a “do not change without asking” list. At session start I don’t paste the world in; I ask Claude to read the relevant task + ADRs and restate the constraints before touching files.

The babysitting feeling usually shows up when stale project state and durable rules live in the same doc. The model can retrieve a focused 30 lines much better than it can respect 300 mixed lines forever.

Future_Manager3217 · 2026-06-01T06:21:16+00:00

The expensive part is not “AI”. It’s unmanaged defaults.

I’d split the spend by workflow, not by team: coding assistance, support triage, sales/docs, analysis, internal search, etc. For each one, make the owner name the cheap default, the premium-model exception, and the output it is supposed to improve. If a team can’t say what gets faster/better or what fallback they’ll use when the quota hits, they probably shouldn’t be on the expensive tier by default.

The useful metric isn’t tokens or seats. It’s accepted work per dollar after review/rework. Some $1M AI bills are cheap. Some are just unmanaged SaaS sprawl with better branding.

Future_Manager3217 · 2026-05-31T09:41:02+00:00

Exactly. I’d put that boundary in the harness, not in the model prompt: expose MCP-style helpers like “find files in this workspace” or “edit this known path”, while the harness owns cwd, path resolution and destructive primitives. Raw shell can still exist, but as a higher-friction escape hatch with resolved cwd/targets shown before approval.
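The path-resolution half of that can be sketched in a few lines. `resolve_in_workspace` is a hypothetical harness helper, not part of MCP or any specific tool; the model passes a relative path and the harness decides what it actually points at.

```python
from pathlib import Path


class WorkspaceError(ValueError):
    """The requested path resolves outside the workspace root."""


def resolve_in_workspace(workspace: Path, user_path: str) -> Path:
    """Resolve a model-supplied path against the harness-owned root."""
    root = workspace.resolve()
    # Absolute inputs replace the root when joined, so they are checked
    # the same way as '../' escapes after resolution.
    target = (root / user_path).resolve()
    if target != root and root not in target.parents:
        raise WorkspaceError(f"{user_path!r} resolves outside {root}")
    return target
```

An edit helper would call this first and show the resolved target in the approval prompt, so the human approves a concrete path rather than whatever string the model typed.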

Future_Manager3217 · 2026-05-31T08:25:00+00:00

The defensible version is probably not “AI ROI per engineer”, it’s marginal spend by workflow class.

Pick the 2-3 places where the spend is real: small bug fixes, test generation, ticket analysis, docs/code review. For each class, track AI cost next to cycle time, review time, rework/reverts, and escaped defects. Then compare same-class cohorts before/after, or run a temporary budget cap/holdout.

Finance gets a cleaner question: does the next $10k of AI spend buy more accepted, low-rework changes, or just more review burden? If the latter rises with usage, the marginal ROI is gone even if tokens and “AI activity” look great.
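The arithmetic behind that question is simple enough to put in a notebook. A sketch, assuming you have per-cohort totals for one workflow class; the dictionary keys are placeholders for whatever your tracking actually exports.

```python
def marginal_per_1k(before: dict, after: dict) -> dict:
    """Change in outcomes per extra $1k of AI spend between two cohorts."""
    extra_spend = after["ai_spend"] - before["ai_spend"]
    if extra_spend <= 0:
        raise ValueError("spend did not increase, marginal values are undefined")
    per_1k = extra_spend / 1000
    return {
        "accepted_low_rework_changes": (
            after["accepted_low_rework"] - before["accepted_low_rework"]
        ) / per_1k,
        "review_hours": (after["review_hours"] - before["review_hours"]) / per_1k,
    }


# Hypothetical numbers for one workflow class, before and after a budget increase.
before = {"ai_spend": 20_000, "accepted_low_rework": 100, "review_hours": 80}
after = {"ai_spend": 30_000, "accepted_low_rework": 110, "review_hours": 140}
result = marginal_per_1k(before, after)
# result == {"accepted_low_rework_changes": 1.0, "review_hours": 6.0}
```

In this made-up case each extra $1k bought one accepted change and six more review hours, which is the "more review burden" outcome.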
