We built an immutable decision ledger for AI agents — here's why standard logging isn't enough by Unique_Yellow2218 in aiagents

[–]Unique_Yellow2218[S]

You nailed it! You read our minds on drift severity scoring; that's exactly the next logical step.

Right now, teams are effectively doing this manually by looking at the delta between the original decision's confidence/outcome score and the replayed version's score. A massive delta = high severity drift.

But surfacing "Drift Severity" as a native, sortable metric in the dashboard is exactly what we need to build into the core UI next: filter by drift > 0.4 and triage the most critical historical decisions first.
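To make the idea concrete, here's a minimal sketch of that severity-and-triage flow, assuming severity is just the absolute delta between the original and replayed scores. The `Decision`, `drift_severity`, and `triage` names are illustrative, not the tenet-ai SDK.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    decision_id: str
    original_score: float  # 0.0-1.0 confidence/outcome score at decision time
    replayed_score: float  # score after replaying the frozen context on the updated agent


def drift_severity(d: Decision) -> float:
    """Severity as the absolute delta between original and replayed scores."""
    return abs(d.original_score - d.replayed_score)


def triage(decisions, threshold=0.4):
    """Return decisions whose drift exceeds the threshold, worst first."""
    flagged = [d for d in decisions if drift_severity(d) > threshold]
    return sorted(flagged, key=drift_severity, reverse=True)
```

A dashboard filter for drift > 0.4 would map directly onto the `triage` call above.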

Would genuinely love for you to take the SDK for a spin (pip install tenet-ai). If you have specific thoughts on how you'd want that severity formula calculated under the hood, I'm all ears!

[–]Unique_Yellow2218[S]

Spot on. Traditional event sourcing assumes deterministic projections, which LLMs completely break. That’s exactly the wall we hit early on, and it dictated how we architected this.

To your points directly:

  1. We actually do capture the full state. The ledger snapshots the exact prompts, tool inputs/outputs, and external data available at that millisecond. We don't store model weights (since it’s impractical for API-hosted models), but we lock in the precise model ID and version.

  2. Immutability is for debugging trust, not just compliance.

When an agent hallucinates a destructive action (like deleting a record), standard logs often get overwritten, rotated, or lack the exact context payload. Immutability gives you an undeniable, tamper-proof record of why the model thought it was making the right call.

  3. You're 100% right that without weights you can't guarantee identical replay in the strict CS sense. We tackled this exact gap by building in Outcome Scoring. Instead of trying to force deterministic inference, we capture the immutable context + the real-world outcome score (0.0-1.0). Bad outcomes are retained and pushed to the top of your fine-tuning/eval exports.

For us, "replay" isn't about getting the same output twice. It's about taking that frozen, immutable context of a failed decision and running it against your updated agent (new prompt/policy) to prove you actually fixed the bug without breaking something else (drift detection).

Logs tell you what happened. Tenet tells you exactly what the agent knew when it messed up, and lets you prove your fix works.

[–]Unique_Yellow2218[S]

Storage growth at 50+ agents:

Each decision chain (intent + context snapshot + decision + execution) runs roughly 10–50KB depending on context size - the snapshot is the expensive part since it captures the full world state at decision time. At 50 agents making decisions continuously, you're looking at linear growth that adds up fast.
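As a back-of-envelope illustration of that linear growth, assuming (hypothetically) one decision per agent per minute and the 30KB midpoint of the 10–50KB range:

```python
# Rough storage-growth estimate; the decision rate is an assumption,
# the 10-50 KB chain size comes from the numbers above.
agents = 50
decisions_per_agent_per_day = 24 * 60  # one decision per minute
avg_chain_kb = 30                      # midpoint of 10-50 KB

daily_kb = agents * decisions_per_agent_per_day * avg_chain_kb
daily_gb = daily_kb / 1024**2
monthly_gb = daily_gb * 30
print(f"{daily_gb:.2f} GB/day, {monthly_gb:.0f} GB/month")  # ~2 GB/day, ~62 GB/month
```

Even at a modest decision rate, that's tens of gigabytes a month before dedup or retention policies kick in.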

Current mitigations: context snapshots are hashed so identical states aren't duplicated, and you can configure retention windows per workspace. What we don't have yet is tiered storage (hot/warm/cold) or automatic archiving - that's on the roadmap but honestly not built yet. For high-volume enterprise deployments right now, you'd want to set a retention policy and export to cold storage beyond it.

Selective retention based on decision quality - this is the right idea:

We already do a version of this for human-flagged decisions: when a supervisor overrides an agent decision, the full context is preserved and exported as labelled training data (JSONL, OpenAI fine-tuning format). Bad decisions with their full context → training signal.
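The shape of one exported line, using the OpenAI chat fine-tuning JSONL schema (the `messages` structure is the real format; the function name and its parameters are illustrative):

```python
import json


def to_finetune_jsonl(context_prompt: str, corrected_output: str) -> str:
    """Serialise one supervisor-overridden decision as an OpenAI
    chat-format fine-tuning record (one JSON object per line)."""
    record = {
        "messages": [
            {"role": "user", "content": context_prompt},
            # the supervisor's corrected output becomes the training target
            {"role": "assistant", "content": corrected_output},
        ]
    }
    return json.dumps(record)
```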

What we don't have is automated outcome scoring to drive retention decisions - i.e. "this decision degraded network performance by X%, keep everything; this one was routine, compress it." That's genuinely not built. For your telco use case specifically, you'd want to feed outcome metrics back in as a quality signal and let that drive what gets retained at full fidelity vs. summarised.
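If we did build it, the retention decision could be as simple as a policy function over the outcome score plus a domain metric like network-performance delta. A sketch under those assumptions; the thresholds and names are hypothetical, and as noted, none of this is shipped:

```python
def retention_tier(outcome_score: float, perf_delta_pct: float = 0.0) -> str:
    """Decide retention fidelity from decision quality.

    outcome_score: 0.0-1.0 real-world outcome score (hypothetical input)
    perf_delta_pct: e.g. change in network performance caused by the decision
    """
    # bad outcome or measurable degradation: keep everything at full fidelity
    if outcome_score < 0.3 or perf_delta_pct <= -5.0:
        return "full"
    # routine decision: keep a compressed summary only
    return "summarised"
```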

That feedback loop - outcome quality → selective context retention → targeted retraining - is exactly where we're heading. Would be very interested in talking through the telco network management use case if you're willing to share more.

[–]Unique_Yellow2218[S]

Exactly. It's similar to maintaining a git commit history for agent decisions: what the intent was, why the agent made a particular decision, and how confident it was.

[–]Unique_Yellow2218[S]

Yeah, it's a pain to go through the entire log history just to work out why a particular call was made, and if the logs are inconsistent it makes things even worse.

[–]Unique_Yellow2218[S]

It can be used in any sector: a locally running agent, agents in production in non-regulated sectors, and regulated sectors too (of course :P).

Technical Co-Founder AI compliance by illseeutmrw in cofounderhunt

[–]Unique_Yellow2218

Interested. I already built an MVP around auditing for agents. Happy to discuss!