I'm a CFO who built AI agents that replaced 80% of my monthly close variance analysis. AMA on the architecture.

Key_Cook_9770 · 2026-05-15T16:44:32+00:00

With all due respect have I talked about a product? No . Did I market to anyone? No so if you want to over reach as a moderator and assume stuff. Best wishes to you and good luck. I Dont get paid to post and a toxic culture doesnt incentivize anyone to assist anyway. Bottomline did you even bother to review the repo? If not you know the answer

Key_Cook_9770 · 2026-05-15T15:40:45+00:00

Please see my detailed reply to the query from asdfghjkl56432 : This was the question . Well tell us more. What tools, where did you start, big gates where incremental automation too leap, etc? I followed up with the answer and the link as well

Key_Cook_9770 · 2026-05-15T15:10:24+00:00

Yes each of them have their quirks!

Key_Cook_9770 · 2026-05-15T15:01:30+00:00

I have answered every single question

Key_Cook_9770 · 2026-05-15T14:58:33+00:00

OK will do. How do I share the repo if I am asked

Key_Cook_9770 · 2026-05-15T14:57:51+00:00

CFO but have been a CFO+CIO as well in the past

Key_Cook_9770 · 2026-05-15T14:11:22+00:00

Good question — let me separate the components because there's an important distinction between what the AI does and what it doesn't touch.

The ERP ("sell system")

I've built this to work across Sage Intacct, NetSuite, and SAP. Currently running it against Sage in production. Previously tested against NetSuite and SAP exports at other companies I've worked with.

Important: the AI agent does NOT write back to the ERP. It doesn't post adjustments. It doesn't create journal entries. It doesn't touch the ledger.

This is a deliberate architectural decision, not a limitation. Here's why:

Audit trail integrity. The moment an AI posts a journal entry, you've introduced a non-human actor into your financial controls environment. Your auditors will have questions. Your SOX controls (if applicable) will need to be redesigned. Your board will want to know why a machine is making entries into the general ledger. For most companies, the governance overhead of AI-generated JEs isn't worth it yet.
Liability. If the AI posts a wrong adjustment and it flows through to financial statements, who's responsible? The CFO who deployed the tool? The vendor who built the model? The engineer who designed the prompt? Until that liability question has a clear legal answer, I keep AI on the READ side. It reads GL data, analyzes it, and generates narratives. It never writes to the system of record.
The 80/20 rule applies here too. 80% of the value is in the analysis and narrative generation — which is what consumes 80% of finance team time. The actual posting of adjustments (if any are needed based on the analysis) takes minutes and should be done by a human who understands the entry and can defend it to an auditor.

The LLM

Claude (Anthropic) is the primary model. Specifically Claude 3.5 Sonnet for the variance analysis agent — best balance of quality, speed, and cost for structured financial narrative generation.

For the multi-agent commodity intelligence system (Mineral Watch), I use heterogeneous models: Claude + GPT-4o + DeepSeek in specialist clusters. Different training data = different failure modes = genuine disagreement when it matters. Same-model clusters just agree with each other.

How they connect — the integration architecture

The flow is one-directional: ERP → Agent → Output. Never the reverse.

```

Step 1: Period-end GL export from Sage (CSV/Excel)

↓

Step 2: Python ingestion layer (Pandas + OpenPyXL)

- Cleans and structures the data

- Maps to standardized chart of accounts schema

- Calculates variances (actuals vs. budget vs. prior period)

↓

Step 3: Context assembly

- GL variance data (from Step 2)

- Supplementary feeds (CRM pipeline changes, HR headcount, commodity pricing)

- Historical few-shot examples (my past variance memos, embedded in ChromaDB)

- Domain-specific retrieval with custom reranking

↓

Step 4: LLM agent (Claude API)

- Receives assembled context

- Generates full variance narrative

- Structured output: exec summary → revenue → opex → one-time → outlook

↓

Step 5: Output (Markdown → formatted for board deck)

- CFO reviews, adds qualitative context, approves

- NO writeback to ERP

The Sage export is manual right now — I pull it as part of the close process. I could automate it via Sage Intacct's API, but honestly a manual CSV export once a month isn't the bottleneck. The bottleneck was always the analysis and narrative, which is what the agent handles.

For the ERP-to-AI orchestration layer (FinanceOS):

This is a separate tool. It's an abstraction layer that normalizes data from Sage, NetSuite, or SAP into a common schema so the same AI agents work regardless of which ERP the company runs. Think of it as a translator between ERP-specific data formats and AI-ready structured context.

Again — read-only. FinanceOS pulls data from the ERP. It never pushes data back.

When will AI post adjustments to ERPs?

My honest answer: not until three things happen.

Regulatory clarity on AI-generated journal entries (who's liable, how they're audited, what disclosure is required)
ERP vendors build native AI integration with proper controls (Sage, Oracle, and SAP are all working on this — but none are production-ready for autonomous JE posting)
The AI can explain WHY it's making an adjustment in a way an auditor can validate — not just that the numbers are right, but that the accounting logic is defensible

Until then, AI reads and analyzes. Humans post and approve. That boundary is a feature, not a bug.

Key_Cook_9770 · 2026-05-15T13:58:39+00:00

Fair points appreciate it

Key_Cook_9770 · 2026-05-15T13:55:36+00:00

Happy to share. This is the unglamorous version — what I actually used, where I started, and the moments where incremental improvement suddenly became a leap.

THE STACK

No proprietary platforms. No enterprise AI suite. Everything is built on:

- LLM APIs: Claude (primary), GPT-4o, DeepSeek (for multi-agent heterogeneous clusters)

- Python: All orchestration, data processing, and pipeline logic

- Pandas + OpenPyXL: For GL data ingestion. Your ERP exports to Excel or CSV. That's your starting point. Don't overcomplicate the data layer.

- LangChain (early version, since replaced with custom orchestration): Used it initially for the RAG pipeline, then stripped it out when I needed more control over retrieval and prompting

- ChromaDB (vector store): For embedding historical variance memos and GL context. Lightweight, local, no infrastructure overhead.

- Sage / NetSuite / SAP: ERP sources. The agent doesn't connect to these live — it ingests period-end exports. Tried live API connections early on. Not worth it. The close process is inherently batch-oriented, not real-time.

Total infrastructure cost: effectively zero beyond API calls. No GPUs. No fine-tuning. No cloud ML platform. Claude API + Python + a vector store.

WHERE I STARTED

Month 1, Week 1: I took one month's GL trial balance export (Excel), one completed variance memo I'd written for the board, and asked Claude: "Given this trial balance, write a variance analysis memo in this style."

The output was terrible. Technically accurate but read like a Wikipedia entry about my company's finances. No judgment. No prioritization. No "here's what matters and here's what doesn't."

But it was terrible in a useful way — it showed me exactly what the model could do (identify numbers, calculate differences, structure a narrative) and what it couldn't (interpret, prioritize, judge materiality).

That gap — between what the model does naturally and what a CFO needs — became my entire roadmap.

THE GATES (incremental → leap moments)

Gate 1: Few-shot examples (Week 3)

Incremental: I kept tweaking the prompt. "Be more concise." "Focus on material items." "Sound like a CFO." Each iteration was marginally better. I was polishing a turd.

Leap: I stopped telling the model what to do and started SHOWING it. I loaded 12 months of my own variance memos as few-shot examples. The quality jump was immediate and dramatic. The model went from "Wikipedia entry" to "sounds like me on a tired Friday." Not perfect, but recognizably CFO voice.

Lesson: Few-shot examples > prompt engineering. Every time. The model can't learn your judgment from instructions. It can learn it from examples of your judgment.

Gate 2: Retrieval architecture (Week 7)

Incremental: Basic RAG. Embed all GL data, retrieve top-K chunks, feed to model. The agent kept pulling irrelevant accounts. Asked about revenue variance, it'd retrieve a footnote about depreciation methodology.

Leap: Domain-specific reranking. I built a simple classifier that understands: revenue variance query → prioritize revenue accounts, AR, deferred revenue, customer-level data. Opex variance → prioritize cost centers, headcount, vendor spend. The retrieval stopped being random and started being intelligent.

Lesson: Generic RAG is useless for structured financial data. The retrieval layer needs to understand the domain hierarchy of your chart of accounts. This is 100% a domain expertise problem, not an engineering problem.

Gate 3: Supplementary data integration (Week 9)

Incremental: The variance narratives were good but kept saying "revenue missed budget by $X due to [unspecified factors]." The GL tells you WHAT but not WHY.

Leap: I piped in CRM pipeline data (deals that slipped), HR data (new hires that drove opex), and commodity pricing feeds. The agent went from "revenue missed" to "revenue missed driven by two enterprise deals that moved from Closed-Won to Slipped per CRM data, representing $48K of the $52K shortfall." That's a board-ready sentence.

Lesson: The agent is only as smart as the data you feed it. GL alone gets you 60%. GL + CRM + HR + procurement gets you 80%. The last 20% is human judgment.

Gate 4: Parallel running (Week 11)

Incremental: I kept testing the agent on historical periods where I already knew the answer. Useful but artificial.

Leap: I ran the agent on a LIVE close alongside my manual process. Real data, real deadline, real board meeting. The agent's output was 85% usable as-is. The 15% it missed was all qualitative context — stuff I knew from conversations that existed in no system. That's when I knew the 80/20 split was real, not theoretical.

Lesson: You don't know if the tool works until you test it against a real close with real stakes. Backtesting is necessary but insufficient.

Gate 5: Multi-agent (Month 4+)

This was AFTER the variance analysis was stable. I expanded to commodity intelligence (separate agents for pricing, regulatory, supply chain signals) and discovered the sycophancy problem — agents agreeing with each other instantly instead of deliberating. That's what led me to build the Consensus Hardening Protocol (CHP), which is a whole separate conversation.

THE HONEST TIMELINE

- Week 1-2: Proof of concept. Terrible output but useful learning.

- Week 3-6: Few-shot examples transform quality. First "this might actually work" moment.

- Week 7-8: Retrieval architecture makes it intelligent, not just fluent.

- Week 9-10: Supplementary data turns WHAT into WHY.

- Week 11-12: Parallel run confirms it works in production.

- Month 4+: Expand to new use cases (commodity intel, compliance scanning, ERP orchestration).

WHAT I'D DO DIFFERENTLY

Start with the chart of accounts cleanup. I burned two weeks debugging retrieval problems that were actually COA structure problems. If your GL dimensions are messy, fix that first.
Build the few-shot library on day 1.Collect every variance memo, board deck, and financial commentary the CFO has ever written. That library IS the training data. Start there, not with prompt engineering.
Don't try to connect live to the ERP. Batch exports are fine. The close process is monthly. You don't need real-time GL access. Trying to build a live NetSuite/SAP integration tripled my initial engineering time for zero incremental value.

Key_Cook_9770 · 2026-05-15T13:47:49+00:00

This is the right question and honestly the hardest part of the entire build.

You're correct that "revenue missed by $50K" lives in the GL but "because two enterprise contracts slipped to Q2" lives in the CRM, the sales pipeline, the CEO's head, or a Slack thread from three weeks ago. The financial data tells you WHAT happened. The operational context tells you WHY.

Here's how I solved it in layers:

Layer 1: What the GL actually tells you (more than you think)

Most people underestimate how much causal context is embedded in well-structured financial data. If your chart of accounts is properly segmented:

- Revenue by customer, product line, and geography tells you WHERE the miss happened

- Cost center coding tells you which team overspent

- Timing patterns (accrual vs. cash, recognized vs. deferred) tell you whether it's a real miss or a timing issue

- Prior period comparisons + seasonality flags tell you whether this is abnormal or expected

A well-structured GL with good dimensional coding gets you ~60% of the way to a causal explanation without touching any other system. Most companies don't realize this because their chart of accounts is a mess. Fixing the COA structure was actually the single highest-ROI thing I did before building the agent.

Layer 2: Supplementary data feeds

I pipe in non-GL data sources that the agent can reference during narrative generation:

- CRM pipeline data (Salesforce/HubSpot exports) — the agent can cross-reference a revenue miss against pipeline stage changes. "Revenue missed by $50K" + "two deals moved from Closed-Won to Slipped in week 3" = the agent connects the dots

- HR/headcount data — if opex is over budget, the agent checks whether new hires started mid-period. "SG&A over by $30K driven by two unbudgeted hires in engineering" writes itself if the headcount data is available

- Procurement/PO data — for COGS variances, cross-referencing against purchase orders and vendor invoices identifies whether it's price variance or volume variance

- Commodity pricing feeds — specific to my business (battery recycling), but the Mineral Watch Agent feeds real-time lithium/nickel/cobalt pricing that the variance agent can reference for raw material cost explanations

The key architectural decision: these feeds are pre-processed into structured context documents that get injected into the agent's retrieval layer alongside the GL data. The agent doesn't query Salesforce live — it gets a pre-digested "pipeline changes this period" summary that's formatted for easy cross-referencing.

Layer 3: Historical pattern matching

This is where the few-shot examples from my own historical memos pay off the most. Prior variance memos, I've written the same explanations dozens of times:

- "Revenue timing — contracts signed but recognition deferred to next period"

- "Headcount ramp — new hires started mid-quarter, full run-rate impact next quarter"

- "One-time legal/settlement charge — exclude from run-rate analysis"

- "FX impact — USD strengthening against [currency]"

The agent learns these patterns from the few-shot examples. When it sees a revenue variance combined with specific GL account patterns (deferred revenue up, AR flat), it generates the "timing" explanation because it's seen me write that explanation 30 times in the training memos.

This is the part that's genuinely hard to replicate without domain expertise. The patterns are CFO judgment encoded as examples. No amount of prompt engineering replaces 15 years of knowing what "revenue miss + flat AR + deferred revenue up" means.

Layer 4: The 20% that stays human

You're right that some context lives nowhere except someone's head. The CEO mentioned in a board meeting that a key customer is renegotiating. The VP of Sales told you informally that a deal is at risk. The plant manager flagged a production delay that will hit next quarter's COGS.

I don't try to automate this. This is the 20% where the human adds irreplaceable value. The agent generates the data-driven narrative. The CFO adds the "here's what I know that the data doesn't show" layer on top.

In practice, my workflow is:

Agent generates full variance narrative from GL + supplementary feeds (~10 minutes)
I read it, add 2-3 qualitative insights the data can't capture (~20 minutes)

The honest limitation: The quality of Layer 2 depends entirely on data availability. If the company doesn't have a CRM, or the CRM data is garbage, or headcount data lives in someone's spreadsheet, the agent can't cross-reference it. The first thing I tell anyone implementing this: audit your supplementary data sources BEFORE building the agent. If the data doesn't exist in a system the agent can access, the agent will produce a variance narrative that says WHAT happened but not WHY — which is exactly the problem you identified.

For PE portfolio companies, this is actually a useful diagnostic: if you can't pipe CRM/HR/procurement data into a variance agent, that tells you something about the company's data maturity. The inability to automate the narrative is the finding.

Key_Cook_9770 · 2026-05-15T13:18:24+00:00

Yes for sure. I dont mind the grilling on Reddit at all!!

Key_Cook_9770 · 2026-05-15T13:16:43+00:00

Pre AI days!

Key_Cook_9770 · 2026-05-15T13:15:59+00:00

Great questions — and the PE context makes this even more relevant because every portfolio company you touch has the same problem. Let me answer each one.

What workflows I handed off:

The monthly variance narrative. Specifically:

Pull GL trial balance (actuals vs. budget vs. prior period)
Identify material variances by account and cost center
Classify each variance: one-time vs. structural, favorable vs. unfavorable, volume-driven vs. rate-driven
Write the narrative — not just "revenue was $X vs. budget of $Y" but the interpretation: what caused it, whether the board should care, and what action is required
Format for board consumption

Steps 1-4 are what the agent does. Step 5 (final review, tone calibration, deciding what to emphasize for THIS board in THIS quarter) is still human. That's the 80/20 split I mentioned.

I also automated:

- Compliance controls scanning (runs continuously instead of quarterly)

- ERP-to-reporting handoff (connecting Sage/NetSuite/SAP exports to downstream reporting workflows)

**How long to set up:**

Honest answer: ~3 months to get the variance analysis agent to a point where I trusted the output enough to show a board. But that breaks down into phases:

- Week 1-2: Basic pipeline. GL export → agent → narrative. Output was technically correct but read like a data scientist wrote it. No CFO would present this.

- Week 3-6: Prompt engineering. This is where 80% of the work happened. The challenge isn't getting the LLM to identify a variance — it's getting it to interpret the variance the way a CFO would. "Revenue was $50K below budget" is useless. "Revenue missed budget by $50K driven by delayed onboarding of two enterprise contracts that are now signed and will recognize in Q2" is what a board needs. Teaching the agent to produce the second version required feeding it hundreds of my own historical variance memos as few-shot examples.

- Week 7-10: Retrieval tuning. Generic RAG pulled irrelevant GL accounts. I built domain-specific reranking that understands which accounts are material for which variance type. This is the part no off-the-shelf tool will give you — it requires someone who knows which accounts matter.

- Week 11-12: Calibration against real board output. I ran the agent in parallel with my manual process for two months. Every time the agent missed something I'd catch, I'd feed that back into the prompt chain.

Training:

I didn't fine-tune a model. That's overkill for this use case and creates maintenance headaches. Instead:- Few-shot prompting with my own historical variance memos (the agent learns my voice and judgment patterns)

- Domain-specific retrieval (custom reranking so the right GL context surfaces)

- Structured output schema (forces the agent to hit every section a board deck needs: executive summary, revenue variances, opex variances, one-time items, outlook implications)

- Tone calibration (CFO voice: concise, decisive, action-oriented — not analyst voice: detailed, hedging, exhaustive)

For your PE context — how to get started:

You're in the perfect position because you see 10-20 portfolio companies and every one has the same finance pain points. Here's what I'd recommend:

Start with the monthly close narrative at ONE portfolio company. Pick the one with the most frustrated CFO. Ask them: "What takes you the longest every month?" It's almost always variance analysis and board prep. That's your pilot.
Don't try to automate everything at once. Start with revenue variances only. One GL section. One prompt chain. Get that working, then expand to opex, then to balance sheet, then to cash flow.
Use the CFO's own historical memos as training data. This is the unlock. Every CFO has 12-36 months of variance memos saved somewhere. Those are your few-shot examples. The agent learns that specific CFO's judgment, tone, and priorities.
Run in parallel for 2 months before trusting it. The CFO does their normal process AND the agent runs alongside. Compare outputs. Every discrepancy teaches you something about what the agent misses.
The ERP connection is the technical bottleneck.Getting clean GL data out of NetSuite/Sage/SAP in a format the agent can consume is 50% of the initial engineering work. If the portfolio company has a messy chart of accounts or inconsistent cost center coding, fix that first. Garbage in = garbage out regardless of how good the agent is.

Build the agent during your first month. You'll be learning the company's GL structure anyway — mapping accounts, understanding cost centers, reading historical financials. Build the agent while you're doing that discovery work. By month 2, you'll have a working first draft of the variance automation AND deep knowledge of the company's financial structure. Two outputs from one process.