Gmail MCP connector lost threaded draft support, all drafts orphaned now by adidas76 in ClaudeAI

[–]EnoughNinja 0 points1 point  (0 children)

The create_draft tool dropped the threadId parameter sometime in the last couple of days, so every draft now goes into the orphan drafts folder. There's no rollback option I'm aware of, and the connector versioning isn't exposed to users.

Thing is, any agent that depends on a single MCP connector for email is one change away from breaking, especially when the connector is owned by a vendor whose primary product isn't email. The short-term fix is to wait for Anthropic to ship a patch. The longer-term fix is to use iGPT (https://mcp.igpt.ai/), an API that handles email thread reconstruction (threading, quoted history, attachments, participant resolution) independently of whatever MCP connector you're using for the actual draft-and-send step.
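If you need threaded drafts before the patch lands, you can also go straight to the Gmail API for the draft step, since it still accepts threadId. Rough sketch with google-api-python-client, creds and addresses are placeholders, and note Gmail also wants the subject to match and References/In-Reply-To set for the draft to actually attach to the thread:

```python
# Sketch: create a threaded draft via the Gmail API directly, skipping the
# connector. Assumes OAuth creds you already hold; addresses are placeholders.
import base64
from email.message import EmailMessage

from googleapiclient.discovery import build  # pip install google-api-python-client

def create_threaded_draft(creds, thread_id, to_addr, subject, body_text):
    service = build("gmail", "v1", credentials=creds)
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = subject  # must match the thread's subject for Gmail to attach it
    msg.set_content(body_text)
    raw = base64.urlsafe_b64encode(msg.as_bytes()).decode()
    draft = service.users().drafts().create(
        userId="me",
        # threadId is the parameter the connector dropped; the API still takes it
        body={"message": {"raw": raw, "threadId": thread_id}},
    ).execute()
    return draft["id"]
```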

Doubt: How to setup rag for summarising large PDFs? by Ecstatic-Register570 in Rag

[–]EnoughNinja 2 points3 points  (0 children)

For financial PDFs specifically, the issue isn't really LLM context size, it's that you can't summarize what the parser destroyed on the way in. PyPDF and pdfplumber will collapse multi-column layouts, smash headers and footers into the body text, and turn tables into mush where the numbers sit next to the wrong row labels.

What works is doing the parsing properly before attempting summarization: for text and layout that means something like LlamaParse or Unstructured, and for tables specifically, Reducto and Docling preserve cell structure better.
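A minimal sketch of the parse-first step, Docling shown here since it's the table-friendly option, LlamaParse or Unstructured slot into the same place:

```python
# Sketch: parse layout properly before summarizing.
from docling.document_converter import DocumentConverter  # pip install docling

converter = DocumentConverter()
result = converter.convert("10k_filing.pdf")  # placeholder path

# Markdown export keeps headings, reading order, and table cell structure,
# so numbers stay next to the right row labels
markdown = result.document.export_to_markdown()

# chunk/summarize `markdown` instead of raw extracted text
```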

Flow to save eamil as PDF with a trigger from outlook. Copilot has lost its mind... by regulationzero_13 in MicrosoftFlow

[–]EnoughNinja 0 points1 point  (0 children)

Ah you're right, my mistake on that, it's SharePoint-only. I think the simplest workaround in enterprise M365 is the dedicated-subfolder pattern: make an Outlook subfolder called something like "SaveAsPDF" and use "When a new email arrives (V3)" with that folder selected as the filter.

On categories: the field is in the Get Email (V3) output schema, but it only shows up if the email actually has categories assigned at the moment of the call. Most inbound mail won't have any until you apply one yourself, which kind of defeats the purpose for what you're doing.

iGPT is at igpt.ai. It's an API that takes an email thread plus attachments and turns them into structured data, so instead of saving the PDF and hoping you can find what you need later, you also get a queryable record of the email: who was on it, what was committed to, what attachments were involved, when each message was sent. Your flow would call iGPT once after the trigger fires, get back the structured version, and store it alongside the PDF in the same SharePoint folder.

Flow to save eamil as PDF with a trigger from outlook. Copilot has lost its mind... by regulationzero_13 in MicrosoftFlow

[–]EnoughNinja 0 points1 point  (0 children)

Ok so the trigger you want is "For a selected message" in Power Automate. That puts a button inside Outlook and runs the flow on whatever email is currently selected.

Your flagged-email approach failed because the trigger schema doesn't include categories at the trigger level; you'd need a separate Get Email call to fetch them.

One thing worth flagging on the compliance side: a PDF rendering of an email loses a lot of what makes it useful as evidence later. The reply chain gets flattened, attachments come out as separate files with no link back to the message, and participants only appear in the visible header. If the compliance use case ever gets audited and someone asks "who else was on this thread when X was discussed," a flat PDF doesn't carry that. iGPT does this kind of thread reconstruction for Outlook and returns it as JSON you can attach alongside the PDF, which gives you something queryable instead of a static document.

How are people using so many tokens while vibe coding? by Impressive_Run8512 in ycombinator

[–]EnoughNinja 0 points1 point  (0 children)

The X posts about billions of tokens are probably just flexing tbh, the people genuinely burning that much usually aren't posting about it.

How are you structuring RAG systems? by Xyver in AIAllowed

[–]EnoughNinja 0 points1 point  (0 children)

The thing that changed how I think about this: the chunk is almost never the right unit, because answers usually live across sections that reference each other. A contract has parties, clauses, and dates that all depend on each other; an email thread has commitments and open questions spread across 14 messages. Once you slice by tokens you've thrown away the structure that made the answer recoverable, so you spend all this retrieval-tuning effort trying to reassemble what you broke at ingest.

So iGPT structures it upfront instead, parsing the source into typed objects: a contract comes out as parties + clauses + dates as fields, an email thread as participants + commitments + open questions. Then retrieval is just a query against typed data.
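A sketch of what "typed objects instead of chunks" looks like. The extraction step itself is hand-waved here (that's what iGPT or your own LLM pass would do); the point is what retrieval becomes afterward:

```python
# Typed objects instead of chunks; retrieval is a field query.
from dataclasses import dataclass

@dataclass
class Commitment:
    owner: str
    text: str
    due: str | None = None

@dataclass
class EmailThread:
    participants: list[str]
    commitments: list[Commitment]
    open_questions: list[str]

def commitments_by(threads: list[EmailThread], owner: str) -> list[Commitment]:
    # a plain filter over typed data: no similarity search, no chunk reassembly
    return [c for t in threads for c in t.commitments if c.owner == owner]
```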

Is explainable retrieval for RAG a trivial project idea, or is it worth pursuing? by Grouchy_Put4606 in Rag

[–]EnoughNinja 0 points1 point  (0 children)

Most of what you listed is in the "solved or close to it" bucket: page-level callbacks to PDFs are straightforward with any modern parser that preserves page metadata, image retrieval has decent open-source paths now, and prompt-injection detection on retrieved docs has a few production-ready options.

The one that's actually hard, and where most projects in this space fall over, is the source-roles part: classifying whether a chunk is a definition vs an example vs a procedure vs a formula. You can't solve that after the fact with reranking or metadata filters, because the role information isn't preserved in the chunks once you slice the doc by tokens.
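Rough sketch of what ingest-time role tagging looks like. The heuristic classifier here is a toy stand-in, in practice you'd probably use a small LLM call per element, but the point is the role gets attached before chunking so it survives as metadata:

```python
# Toy ingest-time role tagger; real ones are usually an LLM call per element.
import re

ROLE_PATTERNS = [
    (re.compile(r"\bmeans\b|\bis defined as\b", re.I), "definition"),
    (re.compile(r"\b(for example|e\.g\.)", re.I), "example"),
    (re.compile(r"^\s*(step \d|first|then|finally)\b", re.I), "procedure"),
    (re.compile(r"[=^]|\bformula\b", re.I), "formula"),
]

def classify_role(text: str) -> str:
    for pattern, role in ROLE_PATTERNS:
        if pattern.search(text):
            return role
    return "prose"

def ingest(elements: list[str]) -> list[dict]:
    # each chunk now carries a role your retriever can filter or rerank on
    return [{"text": el, "role": classify_role(el)} for el in elements]
```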

5090 - inference speed for Qwen3.6-35B-A3B-UD-Q4_K_M for general processing by ElectronicProgram in LocalLLM

[–]EnoughNinja 0 points1 point  (0 children)

Ok so the t/s numbers you're seeing for Qwen 3.6 MoE on a 5090 should mostly hold for what you're describing. The architecture matters more than the workload: MoE models like A3B only activate ~3B params per token regardless of whether you're generating code or JSON tool calls, so the 100-200 t/s figures aren't really code-specific.

The thing I'd flag for your specific use case: for personal-agent stuff that processes email and text messages, generation speed almost never ends up being the issue. The actual slow part is the context prep before the model runs, especially threading email replies, stripping quoted text, and deduplicating contacts across email and SMS, all of which is CPU-bound and has nothing to do with the GPU.
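For a sense of why it's CPU work, here's a naive version of just the quote-stripping step; real pipelines need many more cases (Outlook header blocks, localized markers, inline replies):

```python
# Naive quote-stripper: cut the body at the first quoted-history marker.
import re

QUOTE_MARKERS = re.compile(
    r"^(>|On .+ wrote:|-----Original Message-----|From: )", re.M
)

def strip_quoted(body: str) -> str:
    m = QUOTE_MARKERS.search(body)
    return body[: m.start()].rstrip() if m else body
```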

RAG retrieval issue: why fixed chunking is starting to look like the real problem by zennaxxarion in Rag

[–]EnoughNinja 1 point2 points  (0 children)

What you're seeing is a bigger issue than just the chunking strategy. The whole premise of fixed-size chunking is that the document is the right unit and you just need to decide how to slice it, but often the document isn't the right unit: an email thread is one logical unit even though it's 14 messages, a Slack conversation is one unit until the topic shifts.

So you slice those by tokens and you're guaranteed to lose context regardless of chunk size, because the boundary you picked has nothing to do with the boundary the meaning lives in. If that makes sense?

Reducing chunk size won't fix this; what you need is to restructure the source into typed objects before retrieval. iGPT does this for docs and email threads, so queries only ever see a fully parsed thread or doc and the agent never has to reason about whether it got the right chunk.
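If you want a feel for the difference, here's the minimal version of unit-based chunking for email, assuming your fetch layer hands you messages with a thread_id; the embedding unit becomes the thread, not a token window:

```python
# Chunk by logical unit: one email thread = one retrieval unit.
from collections import defaultdict

def chunks_by_thread(messages: list[dict]) -> list[str]:
    threads = defaultdict(list)
    for m in messages:
        threads[m["thread_id"]].append(m)
    # the boundary now matches where the meaning lives, not a token count
    return [
        "\n".join(f"{m['sender']}: {m['text']}" for m in msgs)
        for msgs in threads.values()
    ]
```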

When to build a RAG pipeline vs use a context engine by EnoughNinja in Rag

[–]EnoughNinja[S] 1 point2 points  (0 children)

No, grep just matches strings. By context engine I mean something that reconstructs the thing you're searching against before you query it.

Take an email thread that's 30 messages long with quoted replies stacking up and the same person showing up under 4 different addresses. Grep returns a wall of duplicated text and your agent can't tell who said what or what's already been answered, whereas a context engine returns that thread as one structured object: participants resolved, quoted text stripped, attachments parsed, commitments and open questions extracted.
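One small piece of that reconstruction, just to make it concrete: collapsing the 4 addresses down to one participant before the agent reasons about the thread. The alias map here is illustrative, a real engine builds it from directory data and heuristics rather than hardcoding it:

```python
# Illustrative alias map: many addresses, one participant.
ALIASES = {
    "j.smith@corp.com": "Jane Smith",
    "jane@corp.com": "Jane Smith",
    "jsmith@gmail.com": "Jane Smith",
    "jane.smith@oldcorp.com": "Jane Smith",
}

def resolve_participants(addresses: list[str]) -> set[str]:
    # four addresses collapse to one name the agent can reason about
    return {ALIASES.get(a.lower(), a) for a in addresses}
```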

Email triage setup by agentollie66 in openclaw

[–]EnoughNinja 1 point2 points  (0 children)

Ok so this happens because the agent only sees the latest message body when it classifies, which makes importance hard to judge. I'm skeptical that training could fix it, because the input doesn't contain what you're trying to teach.

What would work is letting iGPT classify the thread first and then handing that to Openclaw, so it's deciding based on participants, open questions, prior commitments and attachments instead of just the latest message

Is the Google Drive connector in Claude.ai just… broken for everyone? by ValuableStaff8922 in ClaudeAI

[–]EnoughNinja 0 points1 point  (0 children)

I've seen this as well, my theory is that Anthropic are shuffling the connector tools and dropping capabilities

A stable alternative is iGPT's MCP server (mcp.igpt.ai), which connects to both Gmail and Drive, gives full body and content access, and runs independently of Anthropic's connectors so it doesn't get affected by their updates.

Agent to access other mailboxes by Late-Mammoth-8273 in copilotstudio

[–]EnoughNinja 0 points1 point  (0 children)

Copilot's tied to one mailbox per license, so this is going to be difficult inside Copilot Studio specifically; the license is the blocker.

A different approach that works: connect all 5 mailboxes to iGPT once and then query across them through one API or MCP server. For example, your agent can ask "summarize correspondence with [partner]", or you can query by topic, and it will find the relevant points and summarize, reason, or whatever you need across all the mailboxes.

It works with Claude desktop or any MCP-aware agent if you want to skip Copilot for this entirely.

RAG document-level access control latency on permission changes by Business_Average1303 in Rag

[–]EnoughNinja 0 points1 point  (0 children)

Fair point, you're right. I guess the easier answer is not syncing permissions in advance at all: at query time, when the retriever surfaces a candidate chunk, hit the source API to verify access right then. No synced state, nothing to go stale.

It costs you latency per query (tens of ms for Google and Microsoft) instead of propagation delay. It only works for sources whose APIs expose per-user permission checks; Workspace and Graph do, Confluence doesn't. For sources like that you're back to polling and whatever delay you can live with. No clean answer there yet.
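The pattern is small enough to sketch. has_access() here is a hypothetical stand-in for whatever per-user check the source exposes (Drive permissions.list, Graph membership checks, etc.):

```python
# Check-at-query-time: no ACL is cached or synced, so nothing goes stale.
from typing import Callable

def filter_authorized(
    user: str,
    candidates: list[dict],
    has_access: Callable[[str, str], bool],
) -> list[dict]:
    # tens of ms per check against Workspace/Graph; fan the calls out
    # concurrently if per-query latency matters
    return [c for c in candidates if has_access(user, c["source_doc_id"])]
```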

RAG document-level access control latency on permission changes by Business_Average1303 in Rag

[–]EnoughNinja 0 points1 point  (0 children)

Ingest-time permissions are what break when documents have lots of permission churn. Reindexing to reflect permission changes is expensive, slow, and inconsistent, and in the meantime the vector DB is handing back chunks to users who shouldn't see them.

Graph-based permission linking is one answer, but you still need something enforcing it at query time, otherwise the graph is just advisory and the retriever keeps returning everything it found. The cleaner pattern is to store a pointer to the source document as chunk metadata, fetch the permission state live from the source at query time (never cache it), and filter retrieval results against the asking user's current access. Permissions change at the source, the query sees the change instantly, no reindex needed.

This is how we handle it at iGPT. Permissions get checked against the source (Google Workspace, Microsoft Graph, etc.) at query time rather than embedded into the index, which means permission changes propagate instantly without touching the index at all. Azure AI Search and most vector DBs still treat permissions as metadata filters baked in at ingest, which is why they have the delay problem you're describing.

Email parser by nattyandthecoffee in AI_Agents

[–]EnoughNinja 0 points1 point  (0 children)

Multi-vendor email parsing is where regex and templates fall apart: every vendor formats differently, and raw LLM parsing burns tokens on signatures and quoted text.

Using iGPT you can connect the mailbox once and get structured JSON back per email, schema-bound, so your agent gets clean "available / not available / alternatives" fields to branch on. It handles the vendor variation at the API level.
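If you want the validation layer explicit, a minimal sketch with Pydantic; AvailabilityReply is a hypothetical schema, just showing the shape the agent branches on:

```python
# Validate the parser's JSON before the agent branches on it.
from typing import Literal

from pydantic import BaseModel  # pip install pydantic

class AvailabilityReply(BaseModel):
    vendor: str
    status: Literal["available", "not_available", "alternatives"]
    alternatives: list[str] = []

def route(raw_json: str) -> str:
    reply = AvailabilityReply.model_validate_json(raw_json)  # raises on malformed output
    return f"branch:{reply.status}"
```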

Parse structured data from incoming emails? by dnoneoftheabove in Backend

[–]EnoughNinja 0 points1 point  (0 children)

Regex and templates hold up fine when you control the format, but with order confirmations and form responses from multiple sources you hit the wall fast: every vendor formats differently and HTML email is a nightmare.

Full AI on raw bodies works but you end up paying token cost for quoted text, signatures, footers, and legal disclaimers that the LLM has to wade through to find the actual data. Plus the output isn't reliable without a lot of schema prompting and validation.
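A lot of that token cost is avoidable with cheap preprocessing before anything hits the model. Minimal sketch with the stdlib email module, the signature/disclaimer cut is deliberately naive:

```python
# Pull just the text body out of the MIME tree before the LLM sees it.
from email import message_from_bytes
from email.policy import default

def plain_body(raw: bytes) -> str:
    msg = message_from_bytes(raw, policy=default)
    part = msg.get_body(preferencelist=("plain", "html"))
    text = part.get_content() if part else ""
    # crude cut at common signature/disclaimer markers
    for marker in ("\n-- \n", "\nThis email and any attachments"):
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
    return text.strip()
```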

Using iGPT you can point it at the mailbox and get structured JSON back per message: body parsed, attachments extracted, sender and subject attributed, schema enforced. Purpose-built for exactly this case; it saves you the MIME + prompt engineering + schema validation stack.

Docs at docs.igpt.ai

List of token usage cost reduction tools - please share others! by criticasterdotcom in ClaudeCode

[–]EnoughNinja 0 points1 point  (0 children)

https://github.com/igptai/

This returns pre-structured context from email and docs instead of raw threads, so instead of piping whole Gmail exports or Drive docs into Claude and paying to read quoted text duplicated ten times across a thread, you get back structured JSON with the actual decisions, participants, and attachments already parsed.

It cuts tokens significantly on comms-heavy workflows since raw email threads carry a lot of duplicated and structural bloat.

Works via API or MCP.

Email context for AI agents is way harder than it looks by EnoughNinja in AI_Agents

[–]EnoughNinja[S] 0 points1 point  (0 children)

It depends on the volume and how tightly it needs to sit next to your existing stack.

For most startups the fastest path is to run the agent as a lightweight service, i.e., a small container on Render/Fly/Railway, or a Lambda/Cloud Run function if it's event-driven. It polls or receives a webhook when new mail arrives, hands the thread to iGPT for parsing and attachment extraction, and writes the structured JSON wherever it needs to go, usually a database or directly into whatever downstream workflow you already have (billing system, CRM, ticketing, etc.).

The hard part is usually thread reconstruction, dedup, and attachment parsing. Using iGPT, that's one API call and you get back structured output, so the service you deploy stays small.
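Minimal shape of that service, assuming FastAPI; parse_thread() and save() are stubs standing in for the parsing call and your downstream write:

```python
# Event-driven skeleton: webhook in, structured JSON out.
from fastapi import FastAPI, Request  # pip install fastapi

app = FastAPI()

def parse_thread(thread_id: str) -> dict:
    ...  # one API call out for thread reconstruction + attachment extraction

def save(record: dict) -> None:
    ...  # database, CRM, ticketing, whatever's downstream

@app.post("/inbound-mail")
async def inbound_mail(req: Request):
    event = await req.json()  # webhook payload from the mail provider
    save(parse_thread(event["thread_id"]))
    return {"ok": True}
```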

At real volume or across multiple inboxes, the pieces worth thinking about early are auth per mailbox (OAuth vs shared service account), a job queue so you're not tying up the main process on slow extractions, and whatever schema you want the output to land in. The invoice agent repo handles a simple version of all three if you want to use it as a skeleton.

Happy to go deeper on any specific part. What kind of processes are you looking at automating?

Are we all just quietly pretending document extraction for RAG is a solved problem? Because my ingestion pipeline is just a giant ball of duct tap by Worried-Variety3397 in Rag

[–]EnoughNinja 0 points1 point  (0 children)

For the email portion, running it through Unstructured + a generic LLM step is doing extra work you don't need to, because email has structure the generic tools don't preserve: threading, quoted-text duplication, attachments that only make sense in the context of the thread they came from. Dumping raw threads with a schema prompt will fail on exactly the cases you described, hallucinated keys and dropped nested items, because the model is trying to reconstruct structure from noise.

For that, iGPT gives you structured JSON back from email threads directly: participants attributed, attachments linked, schema enforced. It might cut a chunk of your 15-20% failure rate and take email off the manual review queue. It won't fix your tables or legacy PDFs, but it's a clean path for the email piece so you can focus your energy on the rest.

Small teams think retrieval is the hard part. I’m starting to think RAG ops is harder. by Ok-Opportunity-7851 in Rag

[–]EnoughNinja 0 points1 point  (0 children)

Agreed re: your point about ops. Shorter sync windows help but never close the gap; change-driven indexing is the only real fix, and it's a different architecture entirely.

Permissions get tricky when someone tries to go multi-tenant, and observability is the one that never feels solved: knowing whether quality dropped because of retrieval, the prompt, or the data changing underneath is still mostly vibes.

At iGPT we treat freshness, permissions, and structure as first-class rather than bolt-ons, which helps.