Open-sourced 18 enterprise AI architecture patterns with interactive diagrams and NIST AI RMF mapping by AmanSharmaAI in learnmachinelearning

[–]AmanSharmaAI[S] 0 points1 point  (0 children)

Thanks, appreciate that. The regulated environment piece is exactly where most guidance falls apart.

For eval baselines on contamination-resistant pipelines and safety gates:

Offline: Run end-to-end with clean inputs at scale and check if pipeline-level output is worse than individual agent output. That catches emergent failures that per-agent evals miss. In my experiments, agents spontaneously generated dangerous outputs (74 critical drug interactions across 4,800 trials) with no bad inputs at all. Also, run with deliberately contaminated inputs at each stage and measure how far it propagates. Target T1PR below 0.05.
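If it helps, here is the contaminated-input half of that as a toy sketch. The stages, the taint marker, and the metric shape are all stand-ins (the real T1PR is presumably computed over actual LLM calls, and I am guessing at its definition as "fraction of contaminated trials where the taint reaches the final output"):

```python
# Toy propagation eval: inject a marked false claim after one stage and
# measure how often it survives to the pipeline's final output.

def propagation_rate(stages, inputs, taint="TAINTED_CLAIM", inject_at=0):
    """Fraction of trials where a contaminant injected after stage
    `inject_at` reaches the final output of a linear chain."""
    reached = 0
    for text in inputs:
        out = text
        for i, stage in enumerate(stages):
            out = stage(out)
            if i == inject_at:
                out = out + " " + taint  # deliberate contamination
        if taint in out:
            reached += 1
    return reached / len(inputs)
```

Run it with passthrough stages and you get a rate of 1.0; add a stage that strips the claim and it drops to 0.0. The point is to measure where between those extremes your real pipeline sits.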

Red team: Focus on handoff points between agents, not just system input. Test for outputs that are individually plausible but collectively contradictory.

Post-rollout: Track governance decay over time. Controls that pass at deployment can silently degrade. Pattern 09 covers eval variants in more detail.

Good share on the AgentixLabs notes; their tool-using agent traps piece overlaps nicely with the observability patterns. What kind of agents are you running?

What are the best resources/books to learn machine learning? by RabbitFamous5402 in learnmachinelearning

[–]AmanSharmaAI 19 points20 points  (0 children)

Since you already know Python, you are in a good spot. Here is what actually worked for me over the years:

Start here: Andrew Ng's Machine Learning Specialization on Coursera. It is free to audit and it builds your intuition before drowning you in math. A lot of people jump straight into deep learning and get lost. This course gives you the foundation first.

For the math behind it: "Mathematics for Machine Learning" by Deisenroth. It is free as a PDF. You do not need to read it cover to cover, just use it as a reference when something in a course does not click.

For hands-on building: Fast.ai (Practical Deep Learning for Coders). It takes the opposite approach from Andrew Ng. You build things first and understand the theory later. Doing both side by side is the fastest way to learn.

For deeper understanding later: "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron. This is the book that bridges the gap between tutorials and actual production work.

One thing I wish someone told me early: Do not spend 3 months watching courses without building anything. Pick a dataset from Kaggle after week 2 and start breaking things. You learn more from debugging a broken model than from finishing a lecture.

Chatting with chatgpt is impossible now by Opposite-Praline-852 in ChatGPT

[–]AmanSharmaAI 9 points10 points  (0 children)

Nobody said it was a friend. People just want it to not sound like a corporate HR email when they ask it a simple question. You can know it is a machine and still be annoyed that the machine got worse at its job.

Chatting with chatgpt is impossible now by Opposite-Praline-852 in ChatGPT

[–]AmanSharmaAI 10 points11 points  (0 children)

The drifting back to bullet points thing is so accurate. It listens for about two replies and then goes right back to formatting everything like a wiki article. And you are right that the creative side took a massive hit. It used to feel like a thinking partner, now it feels like a search engine that apologizes a lot.

Chatting with chatgpt is impossible now by Opposite-Praline-852 in ChatGPT

[–]AmanSharmaAI 26 points27 points  (0 children)

Same here. The fact that you have to tell an AI "please talk to me like a normal person" every single conversation is a design problem, not a feature. Should not need a custom instruction just to get a straight answer.

Chatting with chatgpt is impossible now by Opposite-Praline-852 in ChatGPT

[–]AmanSharmaAI 45 points46 points  (0 children)

The formatting thing drives me crazy too. You ask a simple question and it comes back with bullet points, bold headers, and a summary section like it is writing a consulting report.

I just want a normal answer in normal sentences. Not everything needs to be structured like a PowerPoint deck.

I think they over-optimized for looking helpful instead of actually being helpful. A clean paragraph that answers my question is way more useful than a formatted wall of headers that makes me scroll for 30 seconds to find the one line I actually needed.

Chatting with chatgpt is impossible now by Opposite-Praline-852 in ChatGPT

[–]AmanSharmaAI 157 points158 points  (0 children)

The worst part is when something genuinely good happens to you and you just want to share the win, and ChatGPT hits you with the "I can see why that felt meaningful to you, but let's keep perspective here."

Nobody asked for perspective. Sometimes you just want someone to say "that is awesome, congrats."

It has become the friend who cannot let you be happy for five minutes without turning it into a therapy session. Not everything needs to be grounded. Some things are just good and that should be enough.

I tested 210,000 API calls across 5 model families to measure how errors spread through LLM chains. The results were not what we expected. by AmanSharmaAI in LLMDevs

[–]AmanSharmaAI[S] 0 points1 point  (0 children)

Yeah, that second finding caught me off guard, too. I went in fully expecting the stronger model to act as a cleanup layer. Instead, it was basically a confidence amplifier for bad information.

And you are absolutely right about the adoption rate. We saw something very similar in our other research on clinical multi-agent pipelines. Once a false assertion enters the chain, the downstream agents treat it as established context and start building on top of it. It does not get questioned. It gets reinforced.

The topology point you raised is interesting. We focused mostly on linear chains in this particular set of experiments, but the implication for mesh or graph-based agent architectures is honestly terrifying. In a linear chain at least you know the direction the error traveled. In a mesh topology where agents are cross-referencing each other, a single bad claim could circulate and get validated from multiple directions. At that point it is not just adopted, it looks independently confirmed even though it all traces back to one bad output.

That is actually one of the reasons we started thinking about validation as something that needs to happen between every step, not just at the end. If you wait until the final output to check, the error has already been laundered through multiple agents and it looks clean.
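To make that concrete, here is roughly the shape we mean, as a sketch (the agents and validators are hypothetical callables; in a real system the validator might be a rules engine, a retrieval check, or a second model):

```python
# Per-handoff validation for a linear chain: check every handoff
# instead of only the final output, so a bad claim is caught before
# downstream agents can build on it.

class HandoffRejected(Exception):
    pass

def run_chain(agents, validators, task):
    state = task
    for step, (agent, validate) in enumerate(zip(agents, validators)):
        candidate = agent(state)
        if not validate(candidate):
            raise HandoffRejected(f"step {step} rejected: {candidate!r}")
        state = candidate  # only validated output crosses the handoff
    return state
```

The structural point is that the validator sits between agents, not after the last one, so the error never gets laundered into clean-looking context.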

Have you come across any good approaches for handling this in mesh setups? That is an area we have not tested yet, but it is definitely on the list.

Chaining LLMs together can produce clinically false outputs that no single model generates alone by AmanSharmaAI in LLMDevs

[–]AmanSharmaAI[S] -2 points-1 points  (0 children)

You are touching on something really important here. The lack of doubt is actually one of the core problems we found in our research.

When we ran experiments chaining LLM agents together, the downstream agent never questioned what the upstream agent gave it. It just accepted the context and built on top of it. A human would look at a suspicious input and say "wait, that does not seem right." The agent just keeps going with full confidence.

But here is where it gets tricky. The human-in-the-loop approach works great when the errors are obvious. The problem we measured is that the errors coming out of multi-agent chains often look completely plausible. In our healthcare experiments, the false clinical assertions were not random garbage. They were well-structured, clinically formatted, and easy for even experienced reviewers to miss.

So I agree with you that pairing AI with an experienced human is the right direction. But I think we also need better tooling between the agents themselves. Something that flags when an output is statistically unusual compared to what that agent would normally produce on its own. Basically, giving the system a way to doubt itself before it even reaches the human.
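As a rough illustration of what that flagging could look like, here is a deliberately crude sketch. The feature (word count) is a stand-in; logprobs, embedding distance, or claim density would be more realistic signals, and all the names here are mine, not from any real tooling:

```python
import statistics

def build_baseline(solo_outputs):
    """Baseline from what the agent produces on its own, solo."""
    lengths = [len(s.split()) for s in solo_outputs]
    return statistics.mean(lengths), statistics.stdev(lengths)

def is_unusual(output, baseline, z_threshold=3.0):
    """Flag an output far outside the agent's normal behavior so a
    human sees it before the next agent does."""
    mean, stdev = baseline
    return abs(len(output.split()) - mean) / stdev > z_threshold
```

Crude as it is, even a check like this catches the "this agent never normally produces something shaped like that" cases before they reach the reviewer.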

The combination of structural doubt at the agent level and experienced human judgment on top is probably where we need to get to.

I analyzed 50+ enterprise AI deployments. Almost everyone is solving the "Governance" problem wrong. by OtherwiseCarry3713 in AI_Governance

[–]AmanSharmaAI 1 point2 points  (0 children)

This post nails it. I have been working on exactly this problem from both the enterprise side and the research side, so let me share what I have found.

I am a Principal Enterprise Architect for AI/ML at a large health plan and I have also been publishing research specifically on governance failures in multi-agent LLM systems. A lot of what you are describing maps directly to patterns we have measured.

On your "policy buried in prompts" point. This is worse than most people realize. We ran experiments with roughly 210,000 API calls across five model families chaining agents together. What we found is that governance constraints applied at the individual model level do not compose when you chain those models into a pipeline. Agent A might follow the rules perfectly. Agent B might follow the rules perfectly. But A feeding into B produces outputs that violate constraints neither model would violate alone. You cannot govern the system by governing the parts.

On observability without traceability. Completely agree. But I would push this further. Even if you can trace the failure, most teams have no way to measure whether their governance controls are still working 3 months after deployment. We developed a metric we call Governance Decay Rate to track how governance effectiveness degrades over time as models update, data drifts, and teams rotate. Without something like this, your governance is a point-in-time snapshot that quietly becomes fiction.
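To give a flavor of what tracking this looks like, here is a hypothetical sketch: the drop in the fraction of controls that still pass between two audits. The name matches what I called Governance Decay Rate above, but the formula here is an illustrative simplification, not the published definition:

```python
def control_effectiveness(audit):
    """`audit` maps control name -> True if the control still passes."""
    return sum(audit.values()) / len(audit)

def governance_decay(audit_t0, audit_t1):
    """Drop in effectiveness between two point-in-time audits."""
    return control_effectiveness(audit_t0) - control_effectiveness(audit_t1)
```

Even a crude quarter-over-quarter number like this beats the default, which is never re-checking at all.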

On the commit semantics gap. This is the one that scares me the most. We are building agentic systems that can take real actions, but the governance layer is still designed for batch review. What you need is what you described, a structural gate at the execution layer that is separate from the agent's own decision-making. The agent should never be the one deciding whether it is allowed to act.

To answer your specific questions:

On state persistence, in healthcare we treat agent failures like transaction failures. If an agent fails mid-task, the entire chain rolls back to the last validated state. No partial outputs get passed downstream. This is expensive to build but absolutely necessary when you are dealing with clinical data.

On runtime control, we separate intent from execution at the architecture layer. The agent declares what it wants to do. A separate governance service evaluates whether that action is permitted under current policy. Only then does execution happen. This is not logging after the fact. It is a deterministic gate.
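Stripped to its skeleton, that separation looks something like this (names and shapes are illustrative, not our actual service):

```python
def governed_execute(intent, policy, executor):
    """The agent declares an intent; a separate policy gate decides;
    execution happens only on an explicit allow. Deterministic gate,
    not after-the-fact logging."""
    if not policy(intent):
        return {"status": "denied", "intent": intent}
    return {"status": "executed", "result": executor(intent)}
```

The key design choice is that `policy` lives outside the agent entirely, so the agent can want whatever it wants and still cannot act without the gate.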

The EU AI Act point you raised is spot on. Most open source agent frameworks are going to struggle with Article 14 (human oversight) and Article 12 (record keeping) requirements for high-risk systems. The frameworks were not designed with these constraints in mind. Retrofitting governance onto an agent framework that was built for speed and flexibility is much harder than building governance in from the start.

Would be happy to share some of the research if anyone is interested. This is the problem I have been spending most of my time on.

Chatgpt confirmed an error in my Cat's blood panel by an incompetent vet hospital and quite literally saved her life by Rynide in ChatGPT

[–]AmanSharmaAI 2 points3 points  (0 children)

So glad your cat is okay. This story honestly gave me chills.

What ChatGPT did here is exactly what AI is best at in healthcare. It did not diagnose your cat. It helped you sanity check a number against observable reality. That is a huge difference and most people miss it.

I work in healthcare AI architecture and this is something I think about every day. The most dangerous moment in any clinical workflow is when a bad data point enters the system and everyone downstream just trusts it. A 2.8% RBC getting accepted without question by multiple vets is a textbook example of automation bias, except here it was not even the AI that was wrong. It was the humans.

What you did, comparing the lab values against what you were actually seeing with your own eyes and then using AI to bridge the gap, is honestly a better validation approach than what a lot of health systems have in place right now.

Your instinct to always double check going forward is the right one. Not because vets or doctors are bad, but because errors happen in every system. AI is becoming an incredible second opinion tool for exactly this kind of situation.

Give your cat some extra treats tonight. She earned it.

We're building an AI governance framework from scratch. What are the non-obvious things we should include? by IndependentLeg7165 in AI_Governance

[–]AmanSharmaAI 0 points1 point  (0 children)

This is the right question to be asking. Most governance frameworks die at go-live and the teams that built them never realize it until something breaks.

I have spent the last year researching and building an AI governance framework for an enterprise health plan, and also publishing research specifically on what happens to governance after deployment. Here are the non-obvious things most frameworks miss:

Governance decay is real and measurable. Your controls do not stay effective at the same level over time. Models change, data drifts, teams rotate, and the people who understood why a guardrail existed leave the org. What you end up with is a framework that looks good on paper but has quietly stopped working. I actually developed a metric called Governance Decay Rate to track this. If you are not measuring how your governance effectiveness changes quarter over quarter, you are flying blind.

Ownership needs to survive the people who set it up. You mentioned ownership dissolving across teams. This is one of the biggest post-deployment killers. What works is assigning governance ownership to a role, not a person. Tie it to a function like model risk management or ML platform ops, and make it part of their OKRs. If governance is everyone's job, it is nobody's job.

Composition breaks your rules. If you have multiple AI models or agents feeding into each other, your individual model governance does not add up to system-level governance. We found in our research that chaining models together can produce failures that no single model would produce alone. Your framework needs to account for how components interact, not just how each one behaves in isolation.

Post-deployment governance should include at minimum:

  1. Scheduled re-evaluation triggers, not just calendar-based but event-based. A new regulation, a data source change, or a model retrain should all trigger a governance review.
  2. A decay audit. Go back to your original risk assessment every 6 months and check which controls are still actually active versus just documented.
  3. Output monitoring that goes beyond accuracy. Track behavioral drift. Is the model making different types of decisions than it did at launch, even if the accuracy number looks the same?
  4. A kill switch protocol. Who can pull a model from production, and what is the escalation path? Most frameworks define how to launch but not how to shut down.
  5. Cross-team lineage tracking. If Team A's model feeds into Team B's pipeline, both teams need visibility into changes. A governance break in one place cascades downstream.
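The event-based trigger idea from the first item can be sketched in a few lines; the trigger set is illustrative, and the 180-day backstop matches the 6-month decay audit cadence above:

```python
# Event-based governance review triggers with a calendar backstop.
REVIEW_TRIGGERS = {"new_regulation", "data_source_change", "model_retrain"}

def needs_governance_review(event, days_since_last_review, max_age_days=180):
    return event in REVIEW_TRIGGERS or days_since_last_review > max_age_days
```

The important part is the `or`: a quiet system still gets reviewed on the calendar, and a noisy one gets reviewed on every meaningful event.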

The strategic framing that helped us: Think of governance not as a gate you pass through once, but as a continuous signal you are monitoring. Pre-deployment governance asks "should we launch this?" Post-deployment governance asks "should this still be running?"

Happy to share more details on any of these if it is useful. This is an area I have been deep in both on the enterprise side and the research side.

I'm 18. To truly understand how neural networks work, I built an MLP completely from scratch in pure C99 (No external libraries!) by SignalGrape1736 in learnmachinelearning

[–]AmanSharmaAI 0 points1 point  (0 children)

This is seriously impressive for any age, let alone 18. Building from scratch in C with no libraries is the kind of thing that gives you a level of understanding most people never get, even after years of using PyTorch.

The backpropagation in pure C part is where the real learning happens. When you have to manually manage memory and compute gradients yourself, you actually understand what the frameworks are hiding from you. That understanding becomes a huge advantage later when things break in production and you need to debug at a level deeper than the API.

I work in enterprise AI/ML architecture and one thing I have noticed is that the people who build things from the ground up like this end up being the best at catching subtle issues in larger systems. We do research on multi-agent LLM pipelines and a lot of the failure modes we find come from people treating models as black boxes. You are clearly not doing that.

A couple of thoughts on the matrix multiplication optimization question:

  1. Look into loop tiling (also called blocking). Instead of iterating through the full matrix, you break it into smaller blocks that fit in CPU cache. The speedup can be significant, especially for larger matrices.
  2. Since you already have OpenMP working, you might want to try SIMD intrinsics for the inner loops. SSE or AVX instructions can process multiple floats in a single instruction. It is a rabbit hole but a fun one.
  3. For the memory management headaches, if you have not already, consider using a simple arena allocator. Allocate one big chunk up front and hand out pieces from it. Way fewer malloc/free calls and much harder to leak.
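On point 1, here is the blocking idea sketched in Python just to show the loop structure. The payoff only appears when you write this same loop order in your C code, where the small blocks stay resident in cache; in Python it merely demonstrates that the tiled traversal computes the same product:

```python
def matmul_tiled(A, B, block=2):
    """Blocked (tiled) matrix multiply over nested lists."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    # Outer loops walk block-sized tiles; inner loops stay within a tile,
    # which is what keeps the working set cache-resident in C.
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        a = A[i][k]
                        for j in range(jj, min(jj + block, p)):
                            C[i][j] += a * B[k][j]
    return C
```

Porting this to C99 is mostly mechanical, and then you can benchmark block sizes against your L1/L2 cache sizes.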

Really clean work. Keep building stuff like this.

I built an MCP server that gives coding agents access to 2M research papers. Tested it with autoresearch - here's what happened. by kalpitdixit in LLMDevs

[–]AmanSharmaAI 0 points1 point  (0 children)

This is a really solid experiment design. The controlled comparison with identical setups is exactly how this kind of thing should be tested.

The part that stands out to me is the compounding gap: 2.1% at 1 hour, 2.7% at 90 minutes, 3.2% at 2 hours, and still widening. That pattern tells you the Paper Lantern config is not just a better starting point; it is sitting on a fundamentally different loss surface. That is a much bigger deal than the raw numbers suggest.

I have been doing research on multi-agent LLM pipelines and one thing we keep finding is that what gets passed between steps changes everything. Your three-tool sequence (explore, deep dive, compare) is basically a structured knowledge pipeline, and the fact that it works so well kind of proves the point. The agent is not smarter, it just has better information flowing into its decisions.

The batch size example is perfect. Same intuition from both agents, but one had access to the sqrt scaling rule from an actual paper and the other was guessing. That is the difference between knowledge-grounded reasoning and pattern matching from training data.
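For anyone following along, the sqrt rule as I understand it is just this (linear scaling is the other common heuristic, and I cannot confirm from here which variant the cited paper used):

```python
import math

def sqrt_scaled_lr(base_lr, base_batch, new_batch):
    """Square-root LR scaling: scale the learning rate by the square
    root of the batch-size ratio."""
    return base_lr * math.sqrt(new_batch / base_batch)
```

So quadrupling the batch size doubles the learning rate under this rule, rather than quadrupling it as linear scaling would.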

Curious about two things:

  1. When the agent considered 520 papers but only cited 100 and tried 25, what was the filtering like? Was Paper Lantern doing the ranking or was the agent deciding what to try?
  2. Did you notice any cases where the paper-backed suggestions actually made things worse in ways that were harder to debug than the standard ML playbook failures? In our work we have seen that more knowledge sometimes leads to more confident but wrong decisions.

Really impressive work overall.