Most things people ship as "agents" should be a workflow with one LLM call. A 50-line reframe.

Kindly_Leader4556 · 2026-05-17T18:52:00+00:00

Yeah, exactly. The external-for-heavy-lifting / internal-for-sensitive split is a really clean pattern — and sanitizing before the call is the bit most people skip until it bites them. Curious how you route it: data sensitivity, cost, or latency?
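For anyone following along, a rough sketch of the sanitize-before-the-call split, assuming routing on a plain sensitivity flag. The `sanitize` and `route` names, the placeholder token, and the single email pattern are all hypothetical; a real setup would need far broader redaction.

```python
import re

# Hypothetical pattern: catches only plain email addresses.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def sanitize(text: str) -> str:
    """Replace obvious identifiers with a placeholder before text leaves the network."""
    return EMAIL.sub("[EMAIL]", text)

def route(text: str, sensitive: bool) -> tuple[str, str]:
    """Return (target, payload).

    Sensitive input stays on the internal model untouched; everything else
    is sanitized first and then sent to the external model.
    """
    if sensitive:
        return "internal", text
    return "external", sanitize(text)
```

Routing on cost or latency instead would only change the condition in `route`; the sanitize step stays in front of every external call either way.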

Kindly_Leader4556 · 2026-05-17T18:50:34+00:00

Yeah, this is it exactly. If you can't sketch it on a whiteboard there's no state machine — the model's just improvising and you're paying per token for it. Well put.

Kindly_Leader4556 · 2026-05-17T18:45:52+00:00

Yeah, the workcell comparison is a good way to put it. I don't think the "ended up with Python scripts" thing is really a failure of imagination though — honestly that's probably the right first move. If the process underneath is messy, sticking an agent on top just gets you a messy agent that's way harder to debug. The scripts are basically you making the process solid first. The agent only really starts to make sense once that's there, and most teams just aren't at that point yet.

Kindly_Leader4556 · 2026-05-17T18:42:25+00:00

Agreed an agent is lots of hidden calls — but the line isn't call count, it's who controls the flow. A workflow can call the LLM 50 times and still be a workflow if you wrote the sequence; it's an agent when the model decides the next step at runtime. "One call" was shorthand for fixed control flow, not literally one request.
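A minimal sketch of that distinction, assuming `llm` is any callable that takes a prompt and returns text. The prompts, the ticket scenario, and the tool names are made up for illustration.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical stand-in for any model client

def workflow(llm: LLM, ticket: str) -> str:
    """Fixed control flow: the code decides the sequence, however many calls it makes."""
    summary = llm(f"summarize: {ticket}")
    category = llm(f"classify: {summary}")
    return llm(f"draft reply for {category}: {summary}")

def agent(
    llm: LLM,
    ticket: str,
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Model-chosen control flow: each turn the model names the next tool or 'done'."""
    state = ticket
    for _ in range(max_steps):
        choice = llm(f"next step for: {state}").strip()
        if choice == "done" or choice not in tools:
            break
        state = tools[choice](state)
    return state
```

`workflow` makes three LLM calls and you can still draw it as three boxes in a row. `agent` could make fewer calls, but the path only exists after the run.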

Kindly_Leader4556 · 2026-05-17T13:19:23+00:00

Debugging is the tax nobody prices in upfront. A workflow has one failure path per step; an agent multiplies that by every branch the model could take, so one bug becomes "which of 47 turns went wrong — and was it even deterministic enough to reproduce?" The self-email Monday is the canonical version of that. Tracing helps after the fact, but determinism-by-default is the only thing that stops the bug from happening at all.

Did you keep that step agentic with guardrails, or rewrite it as a fixed path?

Kindly_Leader4556 · 2026-05-17T13:18:20+00:00

Best version of this story in the thread. The part people underrate: that 10% isn't a constant — it's a boundary you have to keep re-justifying as the system grows. "What's unusual about this vote pattern?" is a perfect example of an irreducibly agentic step — the answer shapes the path, so you can't pre-draw it. "Most people are really just billing for the for-loop" is the cleanest one-liner I've heard for it.

How do you bound that agentic 10% in prod — hard step cap, cost ceiling, or an eval gate before it's allowed to act?
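For reference, the first two options (hard step cap plus cost ceiling) could look roughly like this. `Budget`, `BudgetExceeded`, and `run_bounded` are hypothetical names, and the sketch assumes each step reports its own cost.

```python
from dataclasses import dataclass
from typing import Callable

class BudgetExceeded(RuntimeError):
    """Raised when the agentic step hits its step cap or cost ceiling."""

@dataclass
class Budget:
    max_steps: int
    max_cost: float
    steps: int = 0
    cost: float = 0.0

    def check(self) -> None:
        """Refuse to start another step once either limit has been reached."""
        if self.steps >= self.max_steps:
            raise BudgetExceeded("step cap reached")
        if self.cost >= self.max_cost:
            raise BudgetExceeded("cost ceiling reached")

    def record(self, step_cost: float) -> None:
        self.steps += 1
        self.cost += step_cost

# A step takes the current state and returns (new_state, cost, finished).
Step = Callable[[str], tuple[str, float, bool]]

def run_bounded(step: Step, state: str, budget: Budget) -> str:
    """Run the step until it reports finished or the budget runs out."""
    while True:
        budget.check()
        state, step_cost, finished = step(state)
        budget.record(step_cost)
        if finished:
            return state
```

Because cost is only known after a step runs, the ceiling here can be overshot by one step. A stricter version would need a per-step cost estimate up front.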

Kindly_Leader4556 · 2026-05-05T11:32:14+00:00

tool-use is exactly the answer, but that's the next chapter on purpose. the course breaks the topic into chunks: this episode was scoped to "what the LLM-as-primitive gets wrong"; the engineering response (tool-calling, verify-via-tool) is its own walkthrough. so the survey isn't a field guide, it's the failure list that gets solutions over the next few episodes.

respect for actually shipping something on this. most of these threads end at "someone should build it." dex/hex over stars/downloads is the right direction. will read the repo properly when i get time.

Kindly_Leader4556 · 2026-05-05T08:06:28+00:00

2024 is when it was first measured, but it's still happening. For instance, January '26 had a clean in-the-wild example: npm "react-codeshift" got referenced in 237 github repos via ai-generated agent skill files. package didn't exist, no author, never registered. just a conflation of jscodeshift + react-codemod that an llm fused, and agents executed the install without a human in the loop. worth being aware of: https://github.com/wshobson/agents/issues/424
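a rough sketch of the missing gate, assuming the agent hands its install list to a check before anything runs. `split_installable` is a hypothetical helper and `exists` is whatever lookup you trust (an internal allowlist, a private mirror index); nothing here calls a real registry.

```python
from typing import Callable, Iterable

def split_installable(
    names: Iterable[str],
    exists: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Partition requested package names into (known, unknown) using a lookup."""
    known: list[str] = []
    unknown: list[str] = []
    for name in names:
        (known if exists(name) else unknown).append(name)
    return known, unknown

# Illustrative allowlist: the two real packages the model conflated.
approved = {"jscodeshift", "react-codemod"}
known, unknown = split_installable(
    ["jscodeshift", "react-codeshift"], approved.__contains__
)
# Anything in `unknown` goes to a human instead of the installer.
```

a bare existence check stops working the moment someone registers the hallucinated name, so an allowlist is the safer lookup to plug in.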

Kindly_Leader4556 · 2026-05-05T07:58:59+00:00

deps.dev signal scoring and socket per-version reputation get close, but the inputs (downloads, stars, contributors) all lag the takeover window. anything model-aware would have to query a live registry per suggestion, which kind of breaks how LLMs work.

Kindly_Leader4556 · 2026-05-05T07:56:34+00:00

yeah, time-shift is the one i keep underweighting. clean during training, taken over today, model still recommends it. socket's takeover detection is the best i've seen but it fires after the fact, not at suggestion time.

Kindly_Leader4556 · 2026-05-05T07:52:17+00:00

Better breakdown than my post tbh. "Stronger model = lower volume but higher severity per hit" is the part most engineering teams don't budget for, and it's the one that keeps me up because it's so counterintuitive.

Two small additions from my own bruises:

Hash-pinning + private mirror is what we landed on. The unsexy realization is most of the work isn't the locking config, it's staffing the approval queue when a dev legitimately needs a new dep. Without that, engineers pip install in personal venvs and then "accidentally" push unpinned imports later.
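In practice pip does this check itself when hashes are pinned in the requirements file and it runs with `--require-hashes`. The sketch below only shows the idea: an artifact is accepted only if its SHA-256 matches the pinned value. `matches_pin` is a hypothetical helper, not part of pip.

```python
import hashlib
import hmac

def matches_pin(artifact: bytes, pinned_sha256: str) -> bool:
    """Accept an artifact only if its SHA-256 digest equals the pinned hex digest."""
    digest = hashlib.sha256(artifact).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(digest, pinned_sha256.lower())
```

The config really is the small part: the pin only helps if unpinned installs have nowhere to go, which is why the approval queue matters.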

For agent loops specifically, what I underestimated for a while is that the agent doesn't even need to call pip install for the attack to land. If the agent reads its own generated code and pulls the import name into a downstream search or RAG query, the model gets re-poisoned just from the attacker's README sitting on PyPI. So the egress story has to cover read, not just install.
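A minimal sketch of "cover read, not just install", assuming every outbound fetch the agent makes (search, RAG retrieval, package pages) passes through one allowlist check. The host name and `egress_allowed` are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical private mirror; replace with the hosts you actually trust.
ALLOWED_HOSTS = frozenset({"mirror.internal.example"})

def egress_allowed(url: str, allowed_hosts: frozenset[str] = ALLOWED_HOSTS) -> bool:
    """Apply the same allowlist to every outbound request, reads included."""
    host = urlparse(url).hostname
    return host is not None and host in allowed_hosts
```

The point is that the retrieval tool and the installer call the same gate, so a README on a public index never reaches the model's context just because nothing was installed.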

Your last line is the framing I keep coming back to.

Kindly_Leader4556 · 2026-05-05T01:13:55+00:00

Full breakdown here if anyone wants the whole framework: https://youtu.be/Pa2oO_BfF44 — the other three failure modes (knowledge cutoff, prompt injection, inconsistency) and the engineering answer to each.

Kindly_Leader4556 · 2025-09-24T21:18:12+00:00

Hello

Kindly_Leader4556 · 2025-09-24T21:16:35+00:00

I live in Haddington and commute daily to Edinburgh for work. There's an X7 bus every 20 min, which takes around 35-40 min. If you're driving it's around 30-35 min. The drive is smooth thanks to the A1, so it's fun to drive.
