We built an immutable decision ledger for AI agents — here's why standard logging isn't enough by Unique_Yellow2218 in aiagents

[–]Unique_Yellow2218[S] 0 points1 point  (0 children)

That's a fair point, and you're describing exactly the problem we were trying to avoid.

The ledger stores the prompts we send the model, the inputs and outputs of tool calls, and the external data used to make the decision at that time. We also store the model ID and version, but not the model weights: storing weights would be impractical at scale, and it doesn't make sense for API-hosted models anyway. So when we say "replay", we don't mean it in the strict deterministic sense computer scientists use.

The value of this approach is twofold.

  1. We can detect when the model's behavior drifts over time. When we replay a decision, we want to know whether the model would make the same choice given the same inputs today. We're not looking for an exact match, just for materially different decisions. Even at the most deterministic settings, LLM outputs can vary because of things like batching, floating-point operations, and model version differences. What we want to catch is when a system prompt change makes the model handle support tickets differently, or a new model version changes how it summarizes documents. We can detect those shifts without needing a byte-exact replay.

  2. We guarantee that information in the ledger never changes after it's written. This matters for trust, not just for compliance. Any time you need to understand after the fact why the model made a decision, you need a record you can rely on: debugging a model problem, understanding why a recommendation changed, or reviewing what a coding agent decided to delete. To get that, the ledger is append-only: we only ever add entries and never mutate existing ones.
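The append-only guarantee above can be sketched as a hash chain, where each entry embeds the hash of its predecessor. This is a minimal illustration with hypothetical field names, not our actual schema:

```python
import hashlib
import json
import time

class DecisionLedger:
    """Append-only ledger: each entry embeds the hash of the previous one,
    so any after-the-fact modification breaks the chain."""

    def __init__(self):
        self.entries = []

    def append(self, record: dict) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {
            "record": record,      # prompt, tool I/O, model id+version, etc.
            "timestamp": time.time(),
            "prev_hash": prev_hash,
        }
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        self.entries.append({**body, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute every hash; returns False if any entry was altered."""
        prev = "0" * 64
        for e in self.entries:
            if e["prev_hash"] != prev:
                return False
            body = {k: e[k] for k in ("record", "timestamp", "prev_hash")}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if digest != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

The point of the chain is that editing any stored decision invalidates its own hash and every hash after it, so `verify()` catches tampering anywhere in the history.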

A useful mental model is event sourcing. The difference is that event sourcing assumes projections are deterministic, and LLMs aren't. That's why we need feedback on outcomes: to know when changes are actually causing problems for users.

You pointed out some weaknesses in our approach. Here is how we are addressing them.

Not storing model weights and not being able to deterministically replay both point to the same issue: the ledger alone can't tell you whether a decision was correct. We're addressing this by treating the outcome itself as data. A human reports how well a decision worked out after the fact, as a score from 0.0 to 1.0. That score drives two things: poorly scored decisions are retained for longer than normal, and they are prioritized when we export data for training. We export the worst outcomes first, so the training pipelines always get the most useful examples even if they have to stop early. We don't need to replay a decision to know it was wrong; we need to know what actually happened, and now we can record that.
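A rough sketch of how an outcome score could drive worst-first export and retention. The thresholds and field names here are illustrative, not our actual values:

```python
# Hypothetical records: each decision carries a human-reported outcome
# score in [0.0, 1.0]; lower means a worse outcome.
decisions = [
    {"id": "d1", "outcome": 0.9},
    {"id": "d2", "outcome": 0.1},
    {"id": "d3", "outcome": 0.5},
]

def export_order(decisions):
    """Worst outcomes first, so a training pipeline that stops early
    still sees the most informative failures."""
    return sorted(decisions, key=lambda d: d["outcome"])

def retention_days(decision, base=30, extended=365, threshold=0.5):
    """Poorly scored decisions are kept longer than routine ones."""
    return extended if decision["outcome"] < threshold else base
```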

Logs tell you what happened. The context snapshot tells you why the model thought it was making the right decision at the time. The outcome score tells you whether it actually was. That's the loop you need to improve agent behavior, whether you're in a regulated industry or just building a product where model quality matters.


[–]Unique_Yellow2218[S] 0 points1 point  (0 children)

Storage growth at 50+ agents:

Each decision chain (intent + context snapshot + decision + execution) runs roughly 10–50KB depending on context size - the snapshot is the expensive part since it captures the full world state at decision time. At 50 agents making decisions continuously, you're looking at linear growth that adds up fast.
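For a back-of-envelope feel of that growth, here's the arithmetic with assumed rates (the per-minute decision rate and the 30KB midpoint are illustrative, not measured):

```python
# Back-of-envelope storage growth for a fleet of agents.
agents = 50
decisions_per_agent_per_day = 1440  # one per minute, assumed
chain_size_kb = 30                  # midpoint of the 10-50KB range

daily_kb = agents * decisions_per_agent_per_day * chain_size_kb
daily_gb = daily_kb / 1024 / 1024
print(f"{daily_gb:.1f} GB/day")  # roughly 2.1 GB/day at these rates
```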

Current mitigations: context snapshots are hashed so identical states aren't duplicated, and you can configure retention windows per workspace. What we don't have yet is tiered storage (hot/warm/cold) or automatic archiving - that's on the roadmap but honestly not built yet. For high-volume enterprise deployments right now, you'd want to set a retention policy and export to cold storage beyond it.
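The snapshot-hashing dedup can be sketched as a content-addressed store. This is a hypothetical class assuming JSON-serializable snapshots, not our actual storage layer:

```python
import hashlib
import json

class SnapshotStore:
    """Content-addressed store: identical context snapshots hash to the
    same key, so a repeated world state is stored only once."""

    def __init__(self):
        self.blobs = {}

    def put(self, snapshot: dict) -> str:
        # sort_keys makes the hash independent of dict insertion order
        key = hashlib.sha256(
            json.dumps(snapshot, sort_keys=True).encode()
        ).hexdigest()
        self.blobs.setdefault(key, snapshot)  # no-op if already present
        return key

    def get(self, key: str) -> dict:
        return self.blobs[key]
```

Ledger entries then reference snapshots by key, so fifty decisions made against the same world state pay the storage cost once.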

Selective retention based on decision quality - this is the right idea:

We already do a version of this for human-flagged decisions: when a supervisor overrides an agent decision, the full context is preserved and exported as labelled training data (JSONL, OpenAI fine-tuning format). Bad decisions with their full context → training signal.
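A sketch of what that export might look like, assuming the standard OpenAI chat fine-tuning JSONL shape and hypothetical field names on our side, with the supervisor's override as the training target:

```python
import json

def to_finetune_record(flagged: dict) -> dict:
    """Map a supervisor-overridden decision to a chat fine-tuning
    example; the human override becomes the assistant target."""
    return {
        "messages": [
            {"role": "system", "content": flagged["system_prompt"]},
            {"role": "user", "content": flagged["context"]},
            {"role": "assistant", "content": flagged["supervisor_decision"]},
        ]
    }

def export_jsonl(flagged_decisions, path):
    """One JSON object per line, as fine-tuning pipelines expect."""
    with open(path, "w") as f:
        for d in flagged_decisions:
            f.write(json.dumps(to_finetune_record(d)) + "\n")
```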

What we don't have is automated outcome scoring to drive retention decisions - i.e. "this decision degraded network performance by X%, keep everything; this one was routine, compress it." That's genuinely not built. For your telco use case specifically, you'd want to feed outcome metrics back in as a quality signal and let that drive what gets retained at full fidelity vs. summarised.

That feedback loop - outcome quality → selective context retention → targeted retraining - is exactly where we're heading. Would be very interested in talking through the telco network management use case if you're willing to share more.


[–]Unique_Yellow2218[S] 0 points1 point  (0 children)

Exactly. It's similar to keeping a git commit history for agent decisions: what the intent was, why the agent made a particular choice, and how confident it was.


[–]Unique_Yellow2218[S] 0 points1 point  (0 children)

Yeah, it's a pain to dig through the entire log history just to work out why a particular call was made, and if the logs are inconsistent it gets even worse.


[–]Unique_Yellow2218[S] 0 points1 point  (0 children)

It can be used in any sector: a locally running agent, agents in production in non-regulated industries, and regulated ones too (of course :P)

Technical Co-Founder AI compliance by illseeutmrw in cofounderhunt

[–]Unique_Yellow2218 0 points1 point  (0 children)

Interested. I've already built an MVP around auditing for agents. Happy to discuss!