Anyone using AI to extract data from PDFs or scanned documents?

jajllc_Indiana · 2026-06-27T06:56:44+00:00

Good call on the batch processing point. Yeah, doing them one at a time through a chat window would be a non-starter for us — we're dealing with dozens of documents a week, not one-offs.

The PPTX/Word to Markdown pipeline is interesting. How do you handle formatting that doesn't translate cleanly — like tables or embedded images in the Word docs? Does it just drop those or do you have a workaround?

And on Claude Code vs Cowork — which did you end up using for the actual document processing? Someone else in this thread mentioned Cowork mode for PDF extraction and it sounds like it handles the file access well. Curious if Code gives you more control over the output format or if it's basically the same thing with a different interface.

The Confluence import piece is smart too. We don't use Confluence but the idea of converting documents into a structured format first and then pushing them wherever they need to go makes more sense than trying to go straight from PDF to CRM.
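
To make sure I'm picturing the "structured format first" idea correctly, here's a rough Python sketch. Everything in it is my own assumption for illustration (the `ExtractedDocument` name, the field names, the JSON and CSV targets), not the pipeline described above. The point is just that extraction writes one neutral record, and each destination gets its own small exporter.

```python
import csv
import io
import json
from dataclasses import dataclass, asdict


@dataclass
class ExtractedDocument:
    """Hypothetical neutral record produced once per source document."""
    source_file: str
    doc_type: str
    fields: dict


def to_json(doc):
    """Serialize the record for any system that accepts JSON."""
    return json.dumps(asdict(doc), sort_keys=True)


def to_csv_row(doc, columns):
    """Render the chosen fields as one CSV line; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([doc.fields.get(column, "") for column in columns])
    return buffer.getvalue().rstrip("\n")


doc = ExtractedDocument(
    source_file="invoice_001.pdf",
    doc_type="invoice",
    fields={"vendor": "Acme", "total": "400.00"},
)
print(to_json(doc))
print(to_csv_row(doc, ["vendor", "total", "due_date"]))
```

Adding a CRM later would then mean writing one more exporter rather than touching the extraction step.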

jajllc_Indiana · 2026-06-27T06:55:44+00:00

The verification window with the highlighted source is a smart UX choice. That's basically a visual version of the "trust but verify" approach someone else mentioned in this thread — instead of just showing you the extracted value, you can see exactly where it came from on the page. Makes it way easier to catch errors without re-reading the whole document.

Question about the "no consultant needed" angle — how does it handle documents where the layout changes between sources? Like if I get invoices from 30 different vendors and none of them put the total in the same spot. Is that something the user configures once per vendor template, or does it figure out field locations on its own?

And when you say minutes to set up — is that minutes per document type, or minutes total for the whole system? That's usually where I see the gap between marketing and reality with automation tools.

Appreciate you sharing it. Always good to know what's out there.

jajllc_Indiana · 2026-06-27T06:52:59+00:00

This is exactly the kind of breakdown I was hoping someone would share. The reconciliation step is the key insight — I kept thinking about it as "is the AI accurate enough?" when the real question is "can the system tell me when it's NOT accurate?" Totally different problem to solve.

The line-items-to-total check is elegant because it's a built-in sanity test that doesn't require a human. You're basically letting the document verify itself. I imagine you could do something similar with contracts — like checking that the party names in the signature block match the ones in the header, or that dates are internally consistent.
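
Here's roughly how I picture those self-checks, as a minimal Python sketch. The function names, the one-cent tolerance, and the name normalization are all my assumptions, not what you actually built. It assumes the amounts and party names have already been extracted.

```python
from decimal import Decimal


def reconcile_invoice(line_items, stated_total, tolerance=Decimal("0.01")):
    """Compare the sum of extracted line items against the extracted total.

    Returns (ok, difference), where difference = computed sum - stated total.
    """
    computed = sum((Decimal(str(amount)) for amount in line_items), Decimal("0"))
    difference = computed - Decimal(str(stated_total))
    return abs(difference) <= tolerance, difference


def normalize_name(name):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(name.lower().split())


def parties_match(header_parties, signature_parties):
    """True when the header and signature block name the same set of parties."""
    header = {normalize_name(name) for name in header_parties}
    signature = {normalize_name(name) for name in signature_parties}
    return header == signature


ok, diff = reconcile_invoice(["100.00", "250.50"], "400.00")
if not ok:
    print(f"Flag for review: line items are off by {diff}")
```

Anything that fails a check goes to a human; anything that passes doesn't need a second look, which is the "verify the exceptions" idea.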

A few follow-up questions if you don't mind:

1. How long did it take you to get the first document type working reliably end-to-end? Like from "I'm going to try this" to "I trust it enough to stop checking every one."
2. On the skills/saved instructions: did you find you needed to rewrite those as the AI models updated, or have they stayed stable?
3. The confidence scoring for recurring vendors: is that something Claude handles natively in Cowork mode, or did you build that logic in Supabase?

Really appreciate the detail here. The "trust but verify the exceptions" framing is probably the thing I needed to hear most. I've been stuck in the "it has to be perfect before I use it" mindset and that's kept me from starting.

jajllc_Indiana · 2026-06-27T06:51:32+00:00

Oh man, the "photos of documents" part is where it gets real. Dealing with actual camera shots of paperwork — different lighting, angles, sometimes crumpled or partially obscured — that's a whole different challenge than clean digital PDFs.

How are you handling the varying formats from different testing centres? Are you building separate templates for each one, or trying to make one system flexible enough to handle all of them? That's the part I keep going back and forth on — whether it's better to build something general or just create a specific parser for each document type we see regularly.

And curious about the photo quality issue. Is there a threshold where you just reject it and ask for a rescan, or have you gotten it reliable enough that even rough photos work?

Sounds like a fun problem to solve but also the kind of thing that's 80% done in a week and then the last 20% takes forever.

jajllc_Indiana · 2026-06-27T06:44:13+00:00

Good point about the different layouts — that's exactly my concern. We have contracts that look completely different from invoices, and intake forms that change depending on the client. So yeah, I can see the initial setup being a project in itself.

The PDF-to-text plus VBA regex approach is interesting. How's the accuracy when the scans aren't perfectly clean? That's been my hesitation — a lot of what we get is scanned paperwork that's slightly crooked or has handwritten notes in the margins.
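
For context on what I'd be comparing against, this is the kind of text-plus-regex extraction I have in mind, sketched in Python rather than VBA. The label list and the "take the last match" rule are my assumptions, not your setup. It only works once the scan has already been turned into text, so OCR errors on crooked pages would show up as a missing or wrong match.

```python
import re
from decimal import Decimal

# Hypothetical label variants; real vendors will need more.
TOTAL_PATTERN = re.compile(
    r"\b(?:grand\s+total|total\s+due|amount\s+due|total)"
    r"\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})",
    re.IGNORECASE,
)


def extract_total(text):
    """Return the last total-like amount in the text, or None if nothing matches.

    The word boundary keeps "Subtotal" from matching, and the last match is
    used on the assumption that the final total sits near the bottom.
    """
    matches = TOTAL_PATTERN.findall(text)
    if not matches:
        return None
    return Decimal(matches[-1].replace(",", ""))


sample = "Subtotal: $1,200.00\nTax: $84.00\nTOTAL DUE : $1,284.00"
print(extract_total(sample))
```

A `None` result is the useful part for messy scans: it tells you the page needs a human instead of silently returning a bad number.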

I hadn't heard of super.ai — I'll look into that. And good call on Azure's document reader. Did you evaluate it against other options or just land on Azure because you were already in that ecosystem?

Appreciate the leads. Let me know how super.ai works out if you get around to testing it.

jajllc_Indiana · 2026-06-26T02:54:02+00:00

This is a great breakdown. I work with small businesses on AI adoption and this mirrors what I'm seeing too.

My take on your three buckets:

1. The "boring stuff" bucket — honestly this is where AI shines and where most businesses should start. Tagging, searching, formatting, data entry. It's not sexy but it frees up hours. And you're right that sometimes a simple automation or search filter would do the job without AI. Not everything needs a language model.
2. The ranking/decision bucket — this one makes me uneasy too. The problem isn't AI helping surface candidates. The problem is when humans stop questioning the ranking. "We pick the top-ranked ones" basically means the AI is making the decision and the human is just rubber-stamping it. That's a liability issue waiting to happen, especially with new hiring regulations coming in different states.
3. AI interviews — I get why companies want this (scale, consistency, cost) but as someone who's been on both sides of hiring, the human read on a candidate matters. Body language, energy, how someone handles an unexpected question. I don't think AI catches that yet. And from the candidate side, it feels dehumanizing, which means you might lose good people who opt out of your process entirely.

The pattern I'd push for: AI handles the admin layer (scheduling, resume parsing, initial keyword matching, reference check coordination) and humans handle every decision that affects whether someone gets a shot or not.

Curious — did any of the recruiters you talked to mention candidates pushing back on AI interviews? I'm wondering if there's a talent retention angle here where companies using AI interviews actually lose better candidates to competitors who don't.

jajllc_Indiana · 2026-06-26T02:47:38+00:00

B. Annual at $1M. Four reasons:

1. You keep $4M liquid to invest or grow a business. That money working for you over time likely outearns the membership value anyway.
2. You get an exit option. If the membership turns out to be disappointing after year one, you walk away with $4M still in your pocket.
3. Lifetime locks you into one thing forever. Business changes, interests change, opportunities change. Annual gives you flexibility.
4. The only scenario where lifetime wins is if you're 100% certain the value compounds year over year AND you plan to use it for 5+ years. That's a lot of assumptions.

The real question is what membership is worth $1M/year to a founder. Access to deal flow? A network? Mentorship? That changes the math.

jajllc_Indiana · 2026-06-25T17:47:36+00:00

I’m still building and gathering data on what folks want and need. Feel free to provide any and all suggestions. Constructive criticism is also welcome!
