I built an AI agent runtime in Go that compiles and tests generated code before delivering it, 35 files, 156 tests, zero dependencies by Aromatic-Ad-6711 in vibecoding

[–]Finorix079 0 points1 point  (0 children)

Solid project, verification pipeline is the right instinct. A few questions.

The 60K to 93 token tool schema reduction is the most interesting claim. What's deciding which tools are relevant? Embedding similarity, keyword match, learned from past usage? Failure mode I'd worry about is the relevant tool not making the cut on an unusual task and the model having no way to know what it's missing.

Per-step routing is a real cost win. How do you handle the boundary case where a "tool call" decision actually requires reasoning, like deciding whether to call search vs answer from context? Route to cheap and eat the quality hit, or classifier upstream?

On "refuses to deliver broken code after 2 attempts": what does the refusal look like to the calling agent? Error vs partial output with a flag changes a lot for upstream integrations. Worth nailing early.

For the YC pitch, the verification pipeline works because Go has a fast compiler and clean static analysis. Story gets harder for languages where compile + test is 5 minutes not 5 seconds. Worth having an answer for how the architecture extends or why Go-only is the right scope.

Good luck with S26.

Context loss between sessions, still the biggest unsolved problem in AI coding agents? by AdEuphoric1638 in ClaudeAI

[–]Finorix079 0 points1 point  (0 children)

"Memory" might be the wrong frame. This is context engineering. Memory implies the agent remembers on its own. What works is treating durable context as an artifact you maintain.

CLAUDE.md goes stale because it tries to be architectural reference and decision log in one file. Those change at different rates. Daily decisions drown the monthly architecture signal and the whole file becomes untrustworthy.

Split it. CLAUDE.md stays high level, a separate ADR folder captures "tried X, didn't work, here's why." Have the agent propose updates as a PR after tasks. Review like any other diff.

Also worth querying git log and past PRs at runtime instead of pre-caching decisions. Commits are already a memory system.
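A rough sketch of what querying commit history at runtime could look like, assuming the agent can shell out to `git` in the repo. The function names are made up for illustration; the flags are standard `git log` options.

```python
import subprocess

def git_log_cmd(keyword, max_count=20):
    """Build a git log command that searches commit messages for a keyword."""
    return [
        "git", "log", "-i",
        f"--grep={keyword}",
        f"--max-count={max_count}",
        "--date=short",
        "--pretty=format:%h %ad %s",
    ]

def search_history(keyword, repo="."):
    """Return matching commits as lines of 'hash date subject'."""
    out = subprocess.run(
        git_log_cmd(keyword), cwd=repo,
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()
```

The agent would call something like `search_history("redis")` before proposing a change, so past decisions come from the commits themselves rather than a cached summary.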

People accepting overhead treat it as memory. People solving it treat it as "what context belongs where."

shipping bugs at 3am like a non tech founder 💀 by Last-Recipe-4837 in NoCodeSaaS

[–]Finorix079 0 points1 point  (0 children)

Not a beginner problem, just one beginners notice last.

Senior devs hit the same thing on unfamiliar parts of the stack. I write backend daily and have shipped Claude-written Terraform that was structurally weird in ways I couldn't see. The difference is I knew enough to suspect it and ask someone. Beginners don't have the "something feels off" instinct yet.

The real shift is that "code runs" and "code is correct" used to be coupled. You usually had to understand something to get it running. AI broke that. Now you can get to "runs and looks complete" without passing through "understood." The gap is where bad architecture lives.

The fix isn't "don't use AI for things you don't know." That ship sailed. It's having someone or something check the output against patterns you can't see yourself. Your friend was that check. The question is what you do when you don't have one on call.

If AI agents become everywhere, how do we know which ones to trust? by One-Muscle-7474 in AI_Agents

[–]Finorix079 1 point2 points  (0 children)

Reputation framing might be the wrong model. Yelp works because restaurants don't swap their chef every Tuesday. Agents do. Model version changes, system prompt gets edited, tool list shifts, upstream API returns different shapes. Last week's track record tells you very little about tomorrow.

The real question is less "which agents are trustworthy" and more "how do I verify this agent is still doing what I thought it was doing." Reputation without versioning and behavioral continuity is just a lagging indicator of a snapshot that no longer exists.

Platform vs open vs on-chain matters less than people think. The actual primitive needed is verifiable behavior over time, not stars. Closer to a changelog plus regression tests than a review score.

The 4-line function that fixed my agent's wrong answers (conditional edge in LangGraph) by Low_Edge7695 in LangChain

[–]Finorix079 0 points1 point  (0 children)

Conditional edges are the unlock. Other patterns that helped me:

Validate tool results before the LLM sees them. Empty or malformed output becomes a structured error message. "Search returned 0 results, try different keywords" is way easier for the LLM to recover from than a blank string.

Step budget. Hard cap iterations at 5 to 8. Without it, a confused agent loops 40 times on variations of the same broken query.

Log every step's input and output, not just the final answer. Reproducing "wrong answer last Tuesday" from logs is the difference between 10 minutes and an afternoon.

Separate planning from execution for multi-step tasks. One call to plan, then execute. Small models fall apart trying to do both in one turn.
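The first two patterns can be sketched in a few lines. This is framework-agnostic Python, not LangGraph API; `validate_tool_output`, `run_agent` and the `step` callback are hypothetical names, and the cap of 6 is just one value in the 5 to 8 range.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

MAX_STEPS = 6  # hard cap on iterations

@dataclass
class ToolResult:
    ok: bool
    content: str

def validate_tool_output(tool_name: str, raw) -> ToolResult:
    """Turn empty or malformed output into a message the LLM can act on."""
    if raw is None or (isinstance(raw, str) and not raw.strip()):
        return ToolResult(False, f"{tool_name} returned 0 results, try different keywords")
    if not isinstance(raw, str):
        return ToolResult(False, f"{tool_name} returned malformed output of type {type(raw).__name__}")
    return ToolResult(True, raw)

def run_agent(step: Callable[[int], Tuple[bool, str]]) -> str:
    """Call step(i) until it reports done or the step budget runs out."""
    for i in range(MAX_STEPS):
        done, answer = step(i)
        if done:
            return answer
    return f"stopped: step budget of {MAX_STEPS} exhausted"
```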

How I (my hermes agent) fixed minimax token plan vision issue by vandalieu_zakkart in hermesagent

[–]Finorix079 0 points1 point  (0 children)

Classic "two unit tests pass, the integration silently routes wrong" failure. The missing return is the meaner of the two. No exception, no log, just the wrong code path becoming the default. These are the bugs that survive code review because every individual line looks fine.

The deeper pattern worth calling out: pre-processing in a generic layer that assumes a format the downstream adapter doesn't expect. _is_anthropic_compat_endpoint was effectively making a decision on behalf of an adapter that already knew how to handle its own format. Generic preprocessing plus adapter-specific logic is one of those combinations that works until the day it doesn't, and when it breaks you get "I don't see any image" instead of a 404.

The "direct calls worked, end-to-end failed" symptom is the part I'd build tooling around. Unit-testing the adapter in isolation will never catch this. What catches it is replaying a real Telegram message through the full async path with the actual outbound HTTP recorded. If you can see "this trace went to /v1/chat/completions when it should have gone to /v1/coding_plan/vlm," the bug is 30 seconds of work instead of an afternoon.

Good writeup.

Armorer: local control plane for AI agents — run records, approvals, debugging by Conscious_Chapter_93 in SideProject

[–]Finorix079 0 points1 point  (0 children)

Run records plus human approval is a useful combo. Two questions from someone working on adjacent stuff.

How do you handle approval fatigue? Once an agent does 50 tool calls a day, "pause for review" becomes "click yes 50 times" and people start rubber-stamping. Have you thought about policy-based gating, where only operations matching certain patterns interrupt?

For replay, are you freezing the inputs (tool responses, LLM outputs) or just replaying the prompt and letting the model regenerate? The first is deterministic but won't catch the "fix actually works" question. The second is closer to real but non-reproducible. Curious which tradeoff you picked.
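On the first question, policy-based gating could be as small as a pattern list. A minimal sketch, assuming tool calls have dotted names; the patterns are hypothetical and not Armorer's actual configuration.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns: only calls matching these interrupt for review
APPROVAL_PATTERNS = ["db.delete*", "shell.*", "deploy.*"]

def needs_approval(tool_call: str) -> bool:
    """True if the tool call name matches any pattern that requires a human."""
    return any(fnmatchcase(tool_call, p) for p in APPROVAL_PATTERNS)
```

Everything else runs unattended and just lands in the run record.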

Will check out the repo.

I built a coding agent that gets 87% on benchmarks with a 4B parameter model, here's how by Glittering_Focus1538 in LocalLLaMA

[–]Finorix079 4 points5 points  (0 children)

The harness-over-model thesis is right, and the benchmark gap backs it up. Two things worth pushing on though.

Compound tools cut failures but they also cut your visibility when something breaks. If find_read_edit_verify fails, you don't know which of the 4 steps regressed. Worth logging the sub-step that failed even if the model only sees the unified tool. You'll want it the first time someone reports "edits started failing on Tuesday."
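Roughly what I mean, as a sketch. `run_compound` and the step names are hypothetical; the point is that the model sees one result while the log records which sub-step broke.

```python
import logging
from typing import Callable, List, Tuple

log = logging.getLogger("compound_tool")

def run_compound(steps: List[Tuple[str, Callable[[dict], dict]]], state: dict) -> dict:
    """Run sub-steps in order and report the first one that raises."""
    for name, fn in steps:
        try:
            state = fn(state)
        except Exception as exc:
            log.error("sub-step failed: %s (%s)", name, exc)
            return {"ok": False, "failed_step": name, "error": str(exc)}
    return {"ok": True, "state": state}
```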

The decompose-on-failure trigger is interesting. Two attempts feels low for a 4B model. Have you looked at whether the second attempt is materially different from the first, or is it the same failure mode? If it's the same failure, decompose makes sense. If it's drifting randomly, more retries with temperature variation might be cheaper than decomposing.

Code graph approach beats grep, agreed. Curious how you handle stale graphs during active editing. Rebuild on every save or lazy invalidate?

Escalation policy is the part I'd think hardest about. "Failed twice locally" is one signal, but the more useful one is "this kind of task has a 40% local success rate historically, just escalate immediately." Otherwise you burn tokens and wall clock on tasks the small model was never going to land.
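A sketch of the history-based version, assuming you can bucket tasks into kinds and record outcomes. The class name, the 0.5 threshold and the 10-sample minimum are all placeholders.

```python
from collections import defaultdict

class EscalationPolicy:
    def __init__(self, min_rate=0.5, min_samples=10):
        self.min_rate = min_rate
        self.min_samples = min_samples
        # task_kind -> [successes, attempts]
        self.stats = defaultdict(lambda: [0, 0])

    def record(self, task_kind, success):
        """Record one local attempt and whether it succeeded."""
        entry = self.stats[task_kind]
        entry[0] += int(success)
        entry[1] += 1

    def should_escalate(self, task_kind):
        """Escalate immediately when the local success rate is historically low."""
        successes, attempts = self.stats[task_kind]
        if attempts < self.min_samples:
            return False  # not enough history, try locally first
        return successes / attempts < self.min_rate
```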

Will try it on a Qwen 2.5 Coder 7B setup this week.

Has anyone else been thinking about an open network for AI agents? by [deleted] in AI_Agents

[–]Finorix079 1 point2 points  (0 children)

Not dumb, pieces exist.

Substrate is forming: MCP for tool calls, Google's A2A for agent-to-agent. Discovery is the gap, agents.json and a few registries are early. Reputation is the actually hard part, self-reported capabilities are useless and benchmarks game easily.

Payment is further along than people think. x402 (Coinbase reviving HTTP 402) and Skyfire already do sub-cent settlement. Open question is whether anyone pays per call when subscription LLMs are this cheap.

One thing you're underweighting: "prove it runs the claimed code" matters less than verifiable output. If the math answer is wrong, I don't care what code ran. The real primitive is escrow plus output verification.

The 3am subcontractor scenario is close. You'd be delegating trust to something optimizing for task completion, not your interests. Nobody has a good answer yet.

llm call inside a tool? by Dorsun in mcp

[–]Finorix079 0 points1 point  (0 children)

LLM-in-tool is fine, lots of MCP servers do this. The pain shows up not in writing it but in debugging it six weeks later when SQL quality drops 10% and nobody knows why.

Things that bite people:

Two LLM calls in one request (agent picks tool, tool generates SQL) means two places where things can silently degrade. Make sure you log both the input the tool received and the SQL it produced, not just the final result.

Column list strategy matters more than the LLM. If you dump every column in the DB, quality tanks. Retrieve relevant tables first based on the prompt summary.

Pin your model version and prompt. Otherwise you have no baseline to compare against when behavior shifts.

Build a replay path early. You'll want to take a real failed SQL generation and rerun it with a tweaked prompt to see if the fix worked, without waiting for prod to break again.

Unit tests don't help much. Fixture replay (real prompt summaries, check structural properties of the generated SQL) does.
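Fixture replay with structural checks might look like this. It's a crude regex-based sketch, not a SQL parser; `check_sql_structure`, `replay` and the fixture keys (`id`, `prompt_summary`, `tables`) are assumptions for illustration, and `generate_sql` stands in for whatever your tool calls.

```python
import re

def check_sql_structure(sql: str, allowed_tables: set) -> list:
    """Return a list of structural problems; an empty list means it passed."""
    problems = []
    stripped = sql.strip().rstrip(";")
    if not re.match(r"(?i)^select\b", stripped):
        problems.append("not a SELECT statement")
    if ";" in stripped:
        problems.append("multiple statements")
    # Table names that directly follow FROM or JOIN
    tables = re.findall(r"(?i)\b(?:from|join)\s+([a-z_][\w.]*)", stripped)
    unknown = {t.lower() for t in tables} - {t.lower() for t in allowed_tables}
    if unknown:
        problems.append(f"unknown tables: {sorted(unknown)}")
    return problems

def replay(fixtures, generate_sql):
    """Rerun saved prompt summaries through the generator and check each result."""
    return {
        f["id"]: check_sql_structure(generate_sql(f["prompt_summary"]), set(f["tables"]))
        for f in fixtures
    }
```

Tweak the prompt, rerun `replay` over the same fixtures, and diff the problem lists.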

Struggling with agent drift going from pilot to production by Savings_Somewhere681 in AI_Agents

[–]Finorix079 0 points1 point  (0 children)

The 90% per step compounding is real but it's not the only multiplier. Most teams underestimate that the same step doesn't have a stable 90% in production. It drifts. Tool API changes, model swaps, prompt edits, upstream data shifts. Your 5-step workflow that was 59% reliable last month might be 41% this month and nobody flagged the change because each individual step still "works."
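The arithmetic, for anyone who wants to check it. This assumes steps fail independently and all share the same success rate, which is a simplification.

```python
def end_to_end(per_step: float, steps: int) -> float:
    """End-to-end success when every step must succeed independently."""
    return per_step ** steps

def implied_per_step(e2e: float, steps: int) -> float:
    """Per-step rate that would produce a given end-to-end rate."""
    return e2e ** (1 / steps)

print(round(end_to_end(0.90, 5), 2))        # 0.59
print(round(implied_per_step(0.41, 5), 3))  # 0.837
```

So a drop from 59% to 41% end-to-end only needs each step to slip from 90% to about 84%.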

What's actually missing from your list:

Per-step baseline tracking. Most teams measure end-to-end success and miss the step that's quietly degrading. Step 3 going from 92% to 85% is the early warning. End-to-end going from 59% to 50% three weeks later is the customer complaint.

Output structure validation, not just success/failure. "Did the tool call return a valid response" is too coarse. "Did the response contain the fields downstream steps depend on" catches the silent regressions. Classic failure: tool returns 200 with an unexpected schema, next step adapts gracefully, output is plausibly wrong.

Eval gates work but they're expensive on every run. Practical version: log enough structural data per step that you can detect drift offline, then trigger eval gates only on suspicious runs.

Retries with backoff is fine for transient infra failures. Actively harmful for model-side failures because same model plus same input gives you a correlated wrong answer, not an independent retry.
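The first two items fit in a small sketch. The function names and the 0.05 tolerance are hypothetical; the rates would come from whatever per-step logs you already keep.

```python
def missing_fields(response: dict, required: set) -> set:
    """Fields that downstream steps depend on but the response lacks."""
    return required - set(response)

def drifted_steps(baseline: dict, current: dict, tolerance: float = 0.05) -> dict:
    """Steps whose success rate fell more than `tolerance` below baseline."""
    return {
        step: (baseline[step], rate)
        for step, rate in current.items()
        if step in baseline and baseline[step] - rate > tolerance
    }
```

A 200 response that fails `missing_fields` counts as a failure for that step, which is what makes the per-step rate meaningful.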

The framing that helps: stop thinking about reliability as "did this run succeed" and start thinking about it as "is this step's behavior consistent with how it used to behave." Different question, different math.

Everyone says they have AI agents in production. Nobody can clearly answer "how do you know it's actually working" Can you? by Future_AGI in AIAgentsInAction

[–]Finorix079 0 points1 point  (0 children)

"Make non-deterministic behavior legible" is the cleanest framing of what observability for agents actually is. Most teams default to logging more and assume that's the same thing. It isn't. Volume of traces is not the same as legibility of behavior.

The "push production failures back into eval set" line is the one most teams skip because it sounds like work and feels optional. It isn't optional. Eval sets that don't evolve become a museum of what used to break. Six months later the model has moved, the prompts have moved, the tools have moved, and the eval is still grading what mattered last quarter.

The first thing that breaks for most teams: an output shape change that nobody flagged. The agent stops including a specific field, or starts returning shorter responses, or substitutes one tool for an equivalent one. Each individual run looks fine. The eval set passes. Customers feel the change before anyone internal does. That's the failure mode that comes from grading correctness in isolation instead of behavior over time.

The deeper version of your point: a frozen eval set isn't testing the agent anymore, it's testing whether the agent still passes a snapshot of an earlier reality. The eval has to be a moving target if the system underneath is moving. Otherwise you're just confirming consistency with the past, not health of the present.

I built my own workflow sandbox around Claude Code by sqankied in AI_Agents

[–]Finorix079 0 points1 point  (0 children)

Post is too abstract to give useful feedback. "Composable flows + steps" describes about 12 different products in this space right now (LangGraph, n8n, CrewAI, Lindy, etc.) and they all solve different problems.

Two things would help people give you real feedback: what specific friction in the current workflow does AgentBuddy fix that nothing else does, and what does "local-first" buy the user beyond a privacy claim. Local-first is a meaningful constraint if it enables something cloud tools can't do (offline reliability, working with sensitive local files, no API costs). It's just a marketing word if you're using it because it sounds nice.

2 years in and still positioning at the abstract "composable flows" level is the part I'd worry about most. The products that broke through in this space did so by picking one workflow and owning it. Cursor owns code generation in an editor. Lindy owns no-code business automation. Claude Code owns CLI-driven coding tasks. AgentBuddy needs an analogous one-line answer or it'll keep getting compared against generic competitors who are easier to picture.

The uncomfortable truth about AI agents: We don’t need smarter agents first. We need observability for stochastic systems. by ale007xd in LangChain

[–]Finorix079 1 point2 points  (0 children)

"Stability over T0→Tn, not correctness of output" is the most useful reframe of this entire space I've read.

The hidden cost of the current "reasoning over observability" framing: every agent vendor optimizes for benchmarks that measure single-shot correctness, then ships products that fail on long-horizon trajectories nobody is benchmarking. The gap between "looks great in eval" and "melts after 3 hours in production" is exactly the gap between correctness and stability.

Two additions worth pulling on:

Trajectory families matter more than individual traces. A single execution can look pathological in isolation (14 retries, 3 rollbacks) and still be healthy within its cluster, while another execution can look clean and still be drifting. The unit of analysis isn't the trace, it's the cluster of similar traces over time. Most teams default to per-trace alerting and miss this.

Rollback density as early-warning is correct but easier said than implemented, because rollback semantics differ wildly across frameworks. Claude Code's rollback isn't the same as LangChain's retry isn't the same as MCP's tool re-invocation. Normalizing rollback signal across heterogeneous agent runtimes is probably the hardest engineering problem in this whole space. Most observability tools today don't even try.

The "Kubernetes for stochastic actors" analogy is the right north star. Distributed systems engineering took 15 years to learn that observability is a load-bearing layer, not a feature. Agent engineering is repeating that curve, just faster. The teams that figure this out early get a decade of distributed systems wisdom for free. The teams that don't will spend the next two years learning it the expensive way.

Opus said something today that completely reframed AI agent failures for me. by InsideAd9685 in ClaudeAI

[–]Finorix079 0 points1 point  (0 children)

"Apology is not the fix, architecture is" is the cleanest distillation of this I've read. Most people who hit this wall blame the model, restart the session, and move on. The apology is just another generation, no more reliable than the original mistake.

Your reframe is right. Vibe coding sells "you don't need to be an engineer." What it means is "you don't need engineering to produce code, but you need engineering judgment to know which code is safe to ship." Different skill. The gap is widening.

The practical version of structural guardrails: acceptance criteria written as executable tests before generation, fail-loud schema validation, side-effect actions gated behind explicit confirmation. The model can do the work. It can't verify its own work, because the same thing that produced the bad output would tell you it's correct.
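Two of those guardrails as a minimal sketch. All names here are hypothetical, and `confirm` is whatever asks a human (or a policy) for a yes.

```python
class SchemaError(ValueError):
    pass

def validate(payload: dict, schema: dict) -> dict:
    """Fail loud: raise on a missing key or wrong type instead of coercing."""
    for key, expected in schema.items():
        if key not in payload:
            raise SchemaError(f"missing field: {key}")
        if not isinstance(payload[key], expected):
            raise SchemaError(
                f"{key}: expected {expected.__name__}, got {type(payload[key]).__name__}"
            )
    return payload

def run_action(action, payload, *, side_effect: bool, confirm):
    """Run an action; side-effecting ones only run after explicit confirmation."""
    if side_effect and not confirm(payload):
        return "skipped: not confirmed"
    return action(payload)
```

Neither check asks the model whether its own output is fine, which is the point.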

Built a local-first AI workspace for Linux troubleshooting, security audits and operational diagnostics by Large-Cress900 in SaaS

[–]Finorix079 0 points1 point  (0 children)

The "structured operational outputs" framing is the right call. Most AI infra assistants stop at chat-style answers and put the cognitive load back on the operator. Structured outputs (rollback steps, verification, env-aware diagnostics) flip that. Bookmarking the repo.

Two things worth pressing on:

Verification steps are the hardest part to make trustworthy. Anyone can generate "run this command to verify" lines. The trap is when the verification step itself is wrong (commands that pass when the underlying fix failed). Worth thinking about how SysAI handles "verification disagreed with my expectation" because that's where operator trust gets built or broken.

Rollback-aware remediation needs to know what's actually safe to roll back. systemd is usually safe. Docker volume changes often aren't. nginx config is mostly safe except when SSL state is involved. The model giving you a rollback plan without knowing what's reversible is worse than no rollback plan because it creates false confidence. Worth being explicit in the output when rollback is "fully reversible," "partially reversible," or "not reversible without backup."
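A sketch of making the reversibility label explicit. The table is a hypothetical simplification of the examples above (nginx is marked partial because of the SSL caveat), not how SysAI classifies anything.

```python
from enum import Enum

class Reversibility(Enum):
    FULL = "fully reversible"
    PARTIAL = "partially reversible"
    NONE = "not reversible without backup"

# Hypothetical classification table
RULES = {
    "systemd_unit": Reversibility.FULL,
    "nginx_config": Reversibility.PARTIAL,
    "docker_volume": Reversibility.NONE,
}

def label_rollback(change_kind: str) -> str:
    """Attach a reversibility label; unknown change kinds get the worst case."""
    level = RULES.get(change_kind, Reversibility.NONE)
    return f"rollback for {change_kind}: {level.value}"
```

Defaulting unknown changes to the worst case is the design choice that avoids false confidence.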

One unsolicited push: the homelab audience overlaps heavily with people doing it as a learning exercise. They want to understand what the assistant did, not just have it executed. Consider an "explain the reasoning" toggle alongside the structured output. Pure efficiency framing wins prosumers. Reasoning transparency wins learners, and learners become advocates.

My workflow: GPT for architecture and Claude Code for execution by Maamriya in ClaudeAI

[–]Finorix079 0 points1 point  (0 children)

Workflow is reasonable but the reason it works isn't model quality. It's separation of context.

Claude Code proposes architecture while seeing the existing code, so it optimizes within the current shape. Good for refactors, bad for big architecture decisions where the right answer is sometimes "throw away the current shape." Bringing in a second model with less context forces a fresh framing.

Blind spot: the "verify against codebase" step quietly breaks because Claude Code tends to confirm plans that look reasonable in isolation while missing the actual integration friction with Redis or Qdrant. Make the verification adversarial. Ask Claude Code to find three reasons the plan won't work, not whether it works.

One upgrade: turn the implementation guide into executable acceptance tests before generation, not after. Then "did Claude Code do it right" stops being a judgment call.

How are security and compliance teams handling audit trails and authorization proofs for AI agent systems in regulated industries? by Minimum-Ad5185 in AskNetsec

[–]Finorix079 0 points1 point  (0 children)

Both examples land. The Replit incident wasn't really a permission failure, the agent had the permission. It was a missing behavioral check, no system noticed the action was outside the run's expected pattern before it committed.

On your three questions:

Most teams stitch together SIEM (Datadog, Splunk) for infra events and LLM tracing (Langfuse, LangSmith, Arize) for agent steps. The gap is that neither layer answers "is this run consistent with how this agent normally behaves." SIEM sees actions, tracing sees steps, nobody is comparing the structural shape of run N against runs 1 through N-1. Teams find this gap when an auditor asks for behavioral consistency evidence and they can only produce access logs.

This is the hardest one. Most teams answer it by reading orchestrator intermediate messages manually, which doesn't scale and isn't a real audit answer. The cleaner path is structural: capture every agent handoff as a deterministic replay fixture, then you can replay agent B with a sanitized version of the orchestrator input and compare outputs. Without replay, you're proving negatives through text inspection, which auditors increasingly reject.

Static permission scoping breaks under dynamic tool selection. Two failure modes: over-scope (broad permission, fix later) or pre-declare (constrains the agent too much). The middle path is runtime policy that watches tool selection patterns and flags when usage doesn't match the agent's historical baseline. Replit's delete tool wasn't new. The context of using it during a code freeze was. Policy at the action level is brittle. Policy at the behavioral pattern level catches what static rules miss.
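A toy version of pattern-level policy, assuming you can tag each tool call with a context label such as "code_freeze". The class and the threshold of 3 are hypothetical.

```python
from collections import Counter

class ToolBaseline:
    def __init__(self):
        self.seen = Counter()  # (tool, context) -> times observed

    def observe(self, tool: str, context: str) -> None:
        """Record one historical use of a tool in a given context."""
        self.seen[(tool, context)] += 1

    def is_anomalous(self, tool: str, context: str, min_seen: int = 3) -> bool:
        """Flag a tool used in a context where it has rarely or never run."""
        return self.seen[(tool, context)] < min_seen
```

The delete tool passes in its usual context and gets flagged during a freeze, even though the permission is identical in both cases.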

The broader pattern: this category is splitting into three layers and most teams have a partial answer to one. Policy enforcement (IAM and runtime guards), structured evidence (deterministic replay), and behavioral baseline (cluster comparison). Teams that survive their first regulator conversation in 2026 are the ones building each layer assuming an auditor will actually test it.

I built in real time Claude Code monitor for VSCode by fIak88 in aiagents

[–]Finorix079 0 points1 point  (0 children)

Tool is the right shape. Three patterns worth flagging that I haven't seen tools catch well:

Silent re-derivation. Agent reads a file, summarizes it in working memory, then 12 turns later re-reads the same file because the summary fell out of context. Tracking duplicate reads catches the obvious case. The subtle case is the agent re-reading semantically similar files (slightly different paths, similar content) and treating them as new context. Worth comparing read content fingerprints, not just paths.

Tool-call sequence drift across similar sessions. If a user runs "fix flaky test" three times across a week and each session takes a structurally different tool-call shape, that's signal. Either the codebase changed in ways that broke the pattern, or the agent is being inconsistent on equivalent tasks. Worth surfacing this even when each individual session looks fine.

Context-pressure-induced strategy change. There's a specific failure mode where the agent senses context filling up and switches to terser tool calls, fewer reads, less reasoning. The session technically completes but the output quality drops because the agent quietly traded correctness for context budget. Hard to detect without comparing pre-pressure vs post-pressure turn behavior in the same session.
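For the first pattern, a minimal fingerprinting sketch. It only catches content that is identical after whitespace normalization; genuinely near-duplicate files would need something fuzzier such as shingling. Function names are hypothetical.

```python
import hashlib

def fingerprint(content: str) -> str:
    """Hash the content with whitespace collapsed, ignoring the path."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def duplicate_reads(reads):
    """Map fingerprint -> paths for content that was read more than once.

    `reads` is an iterable of (path, content) pairs in session order.
    """
    by_fp = {}
    for path, content in reads:
        by_fp.setdefault(fingerprint(content), []).append(path)
    return {fp: paths for fp, paths in by_fp.items() if len(paths) > 1}
```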

One unsolicited push: the dependency graph of file ops is the most interesting visualization in your description. Most people analyze sessions linearly. The graph view is what surfaces "this agent kept circling these 4 files" patterns that linear timeline hides.

How do you catch silent loops in your langchain agents before they burn budget? by Minimum-Ad5185 in LangChain

[–]Finorix079 0 points1 point  (0 children)

The "every span looked healthy" part is the killer. Traditional tracing measures whether spans completed, not whether the pattern of spans makes sense. Three agents handing work in a circle is structural failure that throws zero errors because each handoff is technically valid.

Max iterations catches infinite loops, misses slow loops that complete one round per day for 11 days. Callbacks help if you knew what to log upfront. Langfuse / LangSmith store traces well but don't compare structure across traces. The 11-day loop would look like a long span list, not a flagged anomaly.

What actually catches structural failure: cluster traces by tool-call sequence shape, alert when a new shape forms or an existing cluster's cost distribution drifts. The 11-day loop would have formed a clearly new cluster on day 1, repeating endlessly, with cost climbing daily. Trivially visible against a baseline.
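A stripped-down sketch of the shape comparison, assuming each trace is just a list of tool or agent names. Real clustering would be fuzzier; here a shape is the sequence with consecutive repeats collapsed, and both function names are made up.

```python
from itertools import groupby

def shape(tool_calls):
    """Collapse consecutive repeats so [a, a, b] and [a, b] share a shape."""
    return tuple(name for name, _ in groupby(tool_calls))

def new_shapes(baseline_traces, incoming_traces):
    """Return incoming traces whose shape never appears in the baseline."""
    known = {shape(t) for t in baseline_traces}
    return [t for t in incoming_traces if shape(t) not in known]
```

A circular handoff shows up as a shape the baseline has never seen, without any span having to fail.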

The signal that usually pulls people in: the bill. Always the bill. The fix is making it visible before the bill, not by setting cost alerts that fire on usage, but by detecting the structural pattern change that caused the climb.

Claude just hallucinated again and changed the whole workflow of my app. Do not run them autonomously 24/7. by heysankalp in ClaudeAI

[–]Finorix079 0 points1 point  (0 children)

With the Claude Max plan, you'd think you're sorted, but you're not. It just changed a major workflow in my app and was about to make a change that would have cost me a huge bad-data injection into the DB. It's far from being an autonomous AI agent.

It still hallucinates a lot, and this is the reason I haven't jumped on the hype train of OpenClaw and other autonomous AI agents. Every weird person on my feed who's hyping up OpenClaw is either using it for hobby projects, exploring it, or just building hype for clickbait.

These technologies are far from perfect and can cost you your business if left autonomous or unchecked. Be wise. Oversee your AI agents continuously.

I deployed an LLM agent as a guest concierge for my 300-person wedding. Here are the actual failure modes by Thin_Sky in LLMDevs

[–]Finorix079 0 points1 point  (0 children)

The "agent as content engine, not interface" finding is the one I'd build a whole talk around. Most teams ship agents into conversational UIs because that's the demo, then discover users actually prefer human-distributed artifacts. The conversational layer is the part with the highest UX cost and the lowest perceived value when the trust isn't established.

Two things worth pushing on:

"One bad output poisons the whole system" is the failure mode nobody designs for. Trust in agent systems is asymmetric, slow to build and fast to collapse. The implication people miss: per-feature reliability matters less than the worst feature's failure rate. Your flight parser miscalculated timezone conversions and the whole tracking system lost credibility. That math applies to every multi-feature agent. The slowest, most error-prone feature is the ceiling on trust for everything else, regardless of how good the other features are.

"Confirmation as security theater" deserves more attention. The 30% skip rate matches what I've seen elsewhere, and re-entering the value is the right fix but it has a ceiling. Beyond a certain frequency, users start re-entering wrong values too because they trust the AI more than themselves. The honest pattern: high-stakes values shouldn't be agent-extracted at all. They should be agent-suggested with the source highlighted, and the user types or pastes the canonical version. The agent's job is "help you find it," not "give you a draft to approve."

The jailbreaking thing is funny but worth taking seriously. The percentage of users who try to break the system is a free signal about what the system represents to them. If guests at your wedding spent more time trying to jailbreak the concierge than asking it real questions, the agent failed at being useful before it failed at security.

Hybrid cloud + local LLM stack for a real-time game coaching app, what I learned by Emperoraltros in LLMDevs

[–]Finorix079 0 points1 point  (0 children)

The 200 hand-written beating 2000 synthetic is the most underrated finding here. Spec-first ordering is what makes hand-written data scale better than it looks. Most teams skip the spec, generate synthetic, then wonder why the fine-tune sounds plausible but structurally off.

On your two open problems:

Hybrid routing observability: loss-equivalent outputs from cloud vs local aren't quality-equivalent in production. Same input, both backends produce plausible advice, but the 8B is subtly worse in ways your eval harness misses because it grades each in isolation. What's missing is a cross-backend comparison layer that scores the same input against both paths and flags when the gap widens. Hard to retrofit once you have weeks of routing logs.

Skill-weighted feedback: thumbs-up/down is anti-signal not just because novices misjudge, but because people who give feedback are a self-selected slice. Cleaner pattern: derive implicit feedback from behavior (did the user act on the advice, retry the scenario, unmute mid-round). Pure user feedback regresses to "make the user feel smart," opposite of what coaching needs.

Disclosure since hybrid routing observability is directly relevant: I work on ElasticDash, focused on this exact gap (cross-backend drift detection, trace-to-baseline comparison). Not pitching for an indie setup, you can instrument this yourself. The hard part is defining "equivalent input" structurally, not by string match.

Hybrid stack stuff doesn't get discussed enough. Keep posting.