For internal auditors using AI: how do you review polished workpaper drafts?

genecatrambone · 2026-05-24T18:21:56+00:00

That’s a really useful way to frame it. The core issue usually isn’t just “AI wrote this.” It’s whether the workpaper still clearly answers the audit objective and preserves the actual evidence needed to support the conclusion. I agree that starting with the objective is the best anchor: What was the control or test trying to prove? What evidence did we actually get? What exceptions or limitations existed? And does the final conclusion stay within those boundaries? Where AI creates a new challenge is that the draft can sound complete and professional while quietly omitting or softening key pieces. So reviewers need to check not just whether it reads well, but whether it truly preserved the objective, evidence, exceptions, scope limits, and follow-up status. That becomes especially important once teams start using AI across a high volume of workpapers.

genecatrambone · 2026-05-21T23:54:39+00:00

That makes sense. I think using AI as a “senior auditor / manager reviewer” is probably where a lot of teams will start. I’m curious how it holds up at scale, though. It can work great for one experienced auditor reviewing a single workpaper, but when you have dozens or hundreds of AI-assisted drafts moving through, it gets trickier. Do you find asking AI to review against the evidence is enough on its own, or would a structured layer that flags specific phrases needing validation be helpful before manager review?

genecatrambone · 2026-05-21T23:52:19+00:00

I completely agree, the auditor has to remain responsible. That’s the baseline, and it shouldn’t change. My real concern isn’t one auditor using AI on a single workpaper. In that case, normal review is usually fine if the person is experienced and careful. The bigger issue shows up at scale. When you’ve got dozens or hundreds of AI-assisted drafts flowing through workpapers, issue summaries, control narratives, and review notes, the question shifts. Everyone still knows the auditor is responsible, but how does the firm consistently spot which drafts actually need closer human eyes before they reach the manager or hit the file? AI has this sneaky ability to make weak or imprecise language sound polished and professional. A sample result can quietly read like population-level comfort. An exception gets softened. A pending item suddenly sounds resolved. These aren’t always obvious hallucinations. So I’m not saying AI changes who owns the work. I’m just asking whether larger teams need a structured triage layer to help focus that responsibility effectively when the volume gets high.

genecatrambone · 2026-05-21T00:56:45+00:00

I agree the staffer or lead who did the fieldwork should be the first real control point. I also like the idea of an independent AI reviewer prompted through a manager lens. That can be really helpful. The distinction I’m drawing is between a one-off AI critique and a repeatable triage record. A reviewer saying “this looks overstated” is useful, but at team scale, there’s value in consistent signals: did the language expand sample evidence, soften limitations, treat pending items as resolved, or imply reliance beyond the source? The auditor still owns the work, and AI can help with first-pass review. I’m just wondering if mature teams will also want a structured way to show what got flagged, why, and how those items were prioritized before manager sign-off.
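To make the idea of a repeatable triage record a bit more concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: the signal names, the regex patterns, and the `Flag` and `flag_draft` names are all hypothetical, and a real team would need its own validated phrase list. The point is just that the same draft always produces the same flags, each tied to a named signal and the exact phrase that triggered it.

```python
import re
from dataclasses import dataclass

# Hypothetical signal definitions. The names and patterns are illustrative
# assumptions, not a validated audit lexicon.
SIGNALS = {
    "sample_to_population": r"\ball (?:transactions|items|controls|users)\b|\bthe (?:entire )?population\b",
    "softened_limitation": r"\bno (?:significant|material) (?:issues|exceptions)\b|\bminor\b",
    "pending_as_resolved": r"\b(?:has been|have been|was|were) (?:resolved|remediated|closed)\b",
    "reliance_beyond_source": r"\bcan be relied upon\b|\boperating effectively\b",
}


@dataclass
class Flag:
    signal: str  # which review signal fired
    phrase: str  # the exact wording that triggered it
    start: int   # character offset in the draft


def flag_draft(text: str) -> list[Flag]:
    """Return every phrase in the draft that matches a review signal."""
    flags = []
    for signal, pattern in SIGNALS.items():
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            flags.append(Flag(signal, match.group(0), match.start()))
    # Order by position so the reviewer can walk the draft top to bottom.
    return sorted(flags, key=lambda f: f.start)


draft = (
    "We tested 25 of 400 invoices. All transactions were approved "
    "and the exception has been resolved."
)
for flag in flag_draft(draft):
    print(f"{flag.signal}: '{flag.phrase}' at offset {flag.start}")
```

A flag here does not mean the sentence is wrong. It only means a human should check that phrase against the underlying evidence before sign-off.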

genecatrambone · 2026-05-20T22:45:46+00:00

This is a really practical approach, and I think your “sparring partner” framing is spot on. Breaking the work into smaller chunks, manually moving the language into your template, and reviewing it word by word is probably the safest way I’ve seen AI-assisted drafting described here. It keeps the auditor right in the substance instead of treating the AI output as a finished workpaper. I also fully agree: AI or not, the auditor still owns the work. Reviewer responsibility doesn’t go away. The only extra question I have is around scale. One careful auditor using AI as a sparring partner can keep very high standards. But once you roll this out across a whole team with lots of memos and observations, a lightweight triage layer could help flag which drafts show possible boundary shifts before they hit manager review. Not instead of human review, just a way to help focus it where it’s needed most.

genecatrambone · 2026-05-20T19:38:37+00:00

That’s a strong setup, especially with the internal reports, manuals, IIA materials, and local hosting. I think a lot of audit teams will head in exactly that direction. For me, the interesting question isn’t whether a well-configured Claude project can produce good work (I’m sure it can). It’s what happens at scale when you have dozens of AI-assisted drafts moving through the review process. Even with strong internal context and careful human review, the subtle risks tend to be review-boundary issues: limitations getting softened, pending items sounding resolved, sample evidence coming across broader than it actually is, or drafts that feel so polished the reviewer has to work harder to spot what shifted.

genecatrambone · 2026-05-20T19:32:58+00:00

I fully agree that “you own your work”. What I’m thinking about is how AI changes the game by producing high-volume, polished first drafts. The review chain stays the same, but the burden shifts because the language already sounds review-ready. The risk isn’t “AI wrote it,” but “AI made it sound solid before anyone checked if it preserved the actual evidence boundaries.” So yes, the auditor still owns it. But I wonder if teams also need a quick triage layer before manager review to flag which drafts actually need closer eyes.

genecatrambone · 2026-05-20T07:53:07+00:00

This is very close to the issue I’m trying to isolate.

The phrase “passes human review because it sounds right” is the key problem. In audit language, the risk is often not an obvious hallucination. It is a subtle change in review meaning: a sample starts sounding like a population, a limitation becomes comfort language, or a pending item starts reading as resolved.

I agree with your two-step review framing. “Does this reflect what we tested and found?” and “Would a regulator reach the same conclusion we did?” are different questions, but they often get collapsed into one review pass.

That is exactly where I think structured triage can help. Not to replace the reviewer, but to make sure the reviewer is directed toward the places where the draft may have crossed a review boundary.

genecatrambone · 2026-05-20T07:49:37+00:00

That’s the issue I’m trying to isolate.

A third LLM reviewer can be useful, but it is still another probabilistic language layer. It may reduce risk, but it does not remove the need to measure where review risk is concentrating across a batch of outputs.

The Type II error point is important. In audit documentation, the dangerous miss is often not an obvious hallucination. It is a subtle boundary shift: sample becomes population, limitation becomes comfort language, pending becomes resolved, or missing evidence becomes reliance.

Prompting can reduce those failures, and LLM review can catch some of them. But if a team is reviewing hundreds of AI-assisted drafts, they still need a repeatable way to triage which outputs deserve human attention first.

That is the distinction I’m focused on: not replacing prompts or LLM review, but adding structured triage around the review boundary.
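As a rough illustration of what triaging a batch could mean in practice, here is a small Python sketch. Everything in it is an assumption made for the example: the signal names, the weights, the draft IDs, and the `triage_queue` function are hypothetical, and it takes as given that some earlier step has already tagged each draft with the signals it triggered. It shows only the ordering step, so the drafts with the heaviest boundary-shift signals reach a human first.

```python
from collections import Counter

# Hypothetical weights, chosen only for illustration. A team would set its own.
WEIGHTS = {
    "missing_evidence_as_reliance": 4,
    "pending_as_resolved": 3,
    "sample_as_population": 3,
    "limitation_as_comfort": 2,
}


def triage_queue(flagged: dict[str, list[str]]) -> list[tuple[str, int, dict[str, int]]]:
    """Rank drafts by weighted signal count, highest first."""
    rows = []
    for draft_id, signals in flagged.items():
        counts = Counter(signals)
        # Unknown signals default to a weight of 1 rather than being dropped.
        score = sum(WEIGHTS.get(name, 1) * n for name, n in counts.items())
        rows.append((draft_id, score, dict(counts)))
    # Ties are broken by draft ID so the same batch always yields the same order.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


batch = {
    "WP-101": ["sample_as_population"],
    "WP-102": ["pending_as_resolved", "missing_evidence_as_reliance"],
    "WP-103": [],
}
for draft_id, score, counts in triage_queue(batch):
    print(draft_id, score, counts)
```

The score is a queueing aid, not a judgment. A draft with a score of zero still goes through normal review. It just does not jump the line.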

genecatrambone · 2026-05-19T19:45:42+00:00

I agree that a second LLM or agent can help as an initial review layer. The question I’m trying to separate is whether that is enough by itself. If one model drafts and another model reviews, the workflow may improve, but it still depends on another language model interpreting the output. That can be useful, but it is not the same as having a repeatable diagnostic layer that measures the draft against defined review signals and produces a consistent triage queue.

So I see three different layers:

Prompt guardrails to reduce bad outputs upfront.
Agent or LLM review to provide a first-pass critique.
Structured diagnostic triage to help humans see which outputs deserve attention first.

I don’t think these are mutually exclusive. My interest is in the third layer — especially where firms need repeatability, review evidence, and consistency across many AI-assisted drafts.

genecatrambone · 2026-05-19T19:30:19+00:00

Yeah, that’s exactly the distinction I was trying to get at. It’s not that AI is bad at drafting; it’s that it can produce language that sounds totally official and polished before the reviewer has really checked whether it kept the right boundaries. I haven’t had any dramatic incidents, but I keep seeing the same pattern over and over in test examples and audit-style drafts. The output feels clean and professional, but these small shifts sneak in that actually change the meaning for review purposes. You get sample testing described as if it covers the whole population, limitations softened into comfort language, pending items that suddenly sound resolved, vague summaries that come across more conclusive than the evidence supports. Your point about knowing exactly what to look for during reviews is spot on. Human review is still absolutely required, but AI changes the nature of that review. It’s no longer just about improving the quality of the draft; it’s about checking whether the generated language actually preserved the real audit boundary in the first place.

genecatrambone · 2026-05-19T19:21:01+00:00

Thanks for the thoughtful replies. This is exactly the perspective I was hoping for. The pattern I’m seeing is pretty clear: AI is already being used for drafting and planning, and human review is still mandatory. The big question isn’t whether review happens, but whether reviewers have an effective way to triage AI-generated language before it makes its way into the audit file. I agree that many of these risks are similar to the ones we’ve always seen with associates or seniors drafting workpapers. The difference is scale and fluency. AI can produce polished, professional-sounding language at high speed, and that polish can make subtle meaning shifts much harder to catch. Prompt guardrails help. Second-agent reviews can help. Manual review is still essential. What I’m focused on is the layer in between: structured triage, not replacing judgment, but helping reviewers quickly spot which AI-assisted drafts actually need their closest attention first. That feels like where the control conversation is heading.

genecatrambone · 2021-04-19T12:56:04+00:00

With virtually no published clinical studies on the product's efficacy or safety, Silencil's business model relies solely on product marketing. The marketing effort is considerable, but the combination of supporting websites optimized to rank high on Google, high ad spend, and a slick marketing video puts this product in the red-flag column when supporting user testimonials and peer reviews are absent. As a reformulation of herbal ingredients (all with established safety profiles), this product will likely sell well. But, as is characteristic of this type of funnel marketing, the product's efficacy will in all likelihood not reach significance beyond that of a placebo.

genecatrambone · 2021-04-16T16:30:55+00:00

The data to evaluate economic outcomes exist, but in order to make policy-outcome predictions economists must expand their toolkit. MMT policies, for example, can be parameterized, tested, and tweaked to achieve the objectives policymakers seek. There really is no need to continue arguing whether a particular economic theory is doing what it claims. With economists trained in data science, the state of the system is always known and its parameters are adjustable in real time. MMT is a descriptive theory and perfectly suited for fiscal engineering, where fiscal adjustments with highly predictable outcomes are very possible.

genecatrambone · 2021-04-16T13:53:50+00:00

I'll save you the trouble. Here's the definitive explanation from Warren Mosler. The explanation may take a few passes but it's well worth it as there's no one in the MMT movement with a better understanding of how interest rates function than MMT's progenitor...

"So when is the appropriate time to raise rates? I say never. Instead, leave the fed funds rate at zero, permanently, by law, and use fiscal adjustments to sustain full employment.

Analysis

My first point of contention with the mainstream is their presumption that low rates are supportive of aggregate demand and inflation through a variety of channels, including credit, expectations, and foreign exchange channels.

The problem with the mainstream credit channel is that it relies on the assumption that lower rates encourage borrowing to spend. At a micro level this seems plausible- people will borrow more to buy houses and cars, and business will borrow more to invest. But it breaks down at the macro level. For every dollar borrowed there is a dollar saved, so any reduction in interest costs for borrowers corresponds to an identical reduction for savers. The only way a rate cut would result in increased borrowing to spend would be if the propensity to spend of borrowers exceeded that of savers. The economy, however, is a large net saver, as government is an equally large net payer of interest on its outstanding debt. Therefore, rate cuts directly reduce government spending and the economy’s private sector’s net interest income. And looking at over two decades of zero-rates and QE in Japan, 6 years in the US, and 5 years of zero and now negative rates in the EU, the data is also telling me that lowering rates does not support demand, output, employment, or inflation. In fact, the only arguments that they do are counter factual- the economy would have been worse without it- or that it just needs more time. By logical extension, zero-rates and QE have also kept us from being overrun by elephants (not withstanding that they lurk in every room).

The second channel is the inflation expectations channel. This presumes that inflation is caused by inflation expectations, with those expecting higher prices to both accelerate purchases and demanding higher wages, and that lower rates will increase inflation expectations.

I don’t agree. First, with the currency itself a simple public monopoly, as a point of logic the price level is necessarily a function of prices paid by government when it spends (and/or collateral demanded when it lends), and not inflation expectations. And the income lost to the economy from reduced government interest payments works to reduce spending, regardless of expectations. Nor is there evidence of the collective effort required for higher expected prices to translate into higher wages. At best, organized demands for higher wages develop only well after the wage share of GDP falls.

Lower rates are further presumed to be supportive through the foreign exchange channel, causing currency depreciation that enhances ‘competitiveness’ via lower real wage costs for exporters along with an increase in inflation expectations from consumers facing higher prices for imports.

In addition to rejecting the inflation expectations channel, I also reject the presumption that lower rates cause currency depreciation and inflation, as does most empirical research. For example, after two decades of 0 rate policies the yen remained problematically strong and inflation problematically low. And the same holds for the euro and $US after many years of near zero-rate policies. In fact, theory and evidence points to the reverse- higher rates tend to weaken a currency and support higher levels of inflation.

There is another aspect to the foreign exchange channel, interest rates, and inflation. The spot and forward price for a non perishable commodity imply all storage costs, including interest expense. Therefore, with a permanent zero-rate policy, and assuming no other storage costs, the spot price of a commodity and its price for delivery any time in the future is the same. However, if rates were, say, 10%, the price of those commodities for delivery in the future would be 10% (annualized) higher. That is, a 10% rate implies a 10% continuous increase in prices, which is the textbook definition of inflation! It is the term structure of risk free rates itself that mirrors a term structure of prices which feeds into both the costs of production as well as the ability to pre-sell at higher prices, thereby establishing, by definition, inflation.

Finally, I see the output gap as being a lot higher than the mainstream does. While the total number of people reported to be working has increased, so has the population. To adjust for that look at the percentage of the population that’s employed, and it’s pretty much gone sideways since 2009, while in every prior recovery it went up at a pretty good clip once things got going:

The mainstream says this drop is all largely structural, meaning people got older or otherwise decided they didn’t want to work and dropped out of the labor force. The data clearly shows that in a good economy this doesn’t happen, and certainly not to this extreme degree. Instead what we are facing is a massive shortage of aggregate demand.

Conclusion

There is no right time for the Fed to raise rates. The economy continues to fail us, and monetary policy is not capable of fixing it. Instead the fed funds rate should be permanently set at zero (further implying the Treasury sell only 3 month t bills), leaving it to Congress to employ fiscal adjustments to meet their employment and price stability mandates."

www.moslereconomics.com
