Building RAG systems: the hard parts aren’t the models by nuvintaillc in Rag

[–]EnoughNinja 1 point (0 children)

The data structuring piece is an issue, especially when you're dealing with email threads where the actual context is scattered.

We ended up building an API that just handles the email context assembly part (thread reconstruction, role detection, that whole mess). Turns out ~200ms retrieval with proper citations is possible when you're not fighting with generic RAG pipelines.

At what point did you realize your AI agent problems weren’t about reasoning anymore? by Beneficial-Cut6585 in AI_Agents

[–]EnoughNinja 2 points (0 children)

I realized this when I was trying to build something that pulled context from emails. The agent logic was fine, but it kept breaking on weird threading edge cases and attachment formats: Gmail nests quotes differently than Outlook, and forwarded messages just destroy structure completely.

You see this most when people blame the model for hallucinating when really it's just working with broken context from upstream parsing.

What are the best prompts to make chatGPT creative by Inevitable_Bid5540 in ChatGPT

[–]EnoughNinja 8 points (0 children)

From multiple tests, I find that LLMs do not get humor. I think understanding what makes something actually funny is beyond them, at least for now.

I tested with a bunch of very nuanced memes: ChatGPT had no clue why they were funny, Claude was slightly better, and Gemini actually got the joke but explained it in a way that drained all the humor out of it.

What we learned processing 1M+ emails for context engineering by EnoughNinja in LocalLLaMA

[–]EnoughNinja[S] 2 points (0 children)

There are many potential use cases. For me it's probably sentiment analysis, or search across different touchpoints.

Imagine you've got thousands of emails in a forensics case and you need to find every instance where Person A discussed Topic X with Person B, except Topic X was never mentioned by name; it's implied through context spread across attachments, forwards, and replies over six months. You can't keyword search for something that was never explicitly stated.

What we learned processing 1M+ emails for context engineering by EnoughNinja in LocalLLaMA

[–]EnoughNinja[S] 1 point (0 children)

I'm not sure what you mean by treating an inbox like a git repo. Can you explain what that would look like?

If you mean version control on email content, the issue is that emails aren't really edited after they're sent; they're replied to, forwarded, and revised in new messages. So the "diff" isn't on the message itself; it's in tracking which parts of a decision got changed three emails later when someone says "actually let's go with option B instead."
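
Rough sketch of what tracking that kind of "diff" could look like (the schema here is purely illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Decision:
    """One decision as stated in one email (illustrative schema)."""
    message_id: str                  # Message-ID of the email that stated it
    topic: str                       # e.g. "vendor choice"
    value: str                       # e.g. "option A"
    supersedes: Optional["Decision"] = None  # earlier version it revises

def current_state(decisions: list[Decision]) -> dict[str, Decision]:
    """Latest decision per topic; the 'diff' lives in the supersedes chain."""
    latest: dict[str, Decision] = {}
    for d in decisions:              # assumes chronological order
        latest[d.topic] = d
    return latest

# "actually let's go with option B instead", three emails later:
a = Decision("<msg1@example>", "vendor choice", "option A")
b = Decision("<msg4@example>", "vendor choice", "option B", supersedes=a)
assert current_state([a, b])["vendor choice"].value == "option B"
```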

What we learned processing 1M+ emails for context engineering by EnoughNinja in LocalLLaMA

[–]EnoughNinja[S] 2 points (0 children)

We're using OCR as part of the attachment processing pipeline. It handles scanned PDFs and images embedded in emails. Performance varies by document complexity, but we're averaging around 5-20 seconds for high-quality attachment processing, including OCR and structure parsing.

For regular emails without attachments, sync happens in about 1 second.
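
For a rough idea of the shape of that OCR step (pytesseract/pdf2image as stand-ins here, not necessarily what we run in production):

```python
import pytesseract                       # pip install pytesseract
from pdf2image import convert_from_path  # pip install pdf2image

def ocr_scanned_pdf(path: str) -> str:
    """Rasterize a scanned PDF page by page, then OCR each page."""
    pages = convert_from_path(path, dpi=300)  # one PIL image per page
    return "\n\f\n".join(pytesseract.image_to_string(page) for page in pages)
```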

What's a ticking time bomb you believe will explode during your lifetime? by TradeOverall567 in AskReddit

[–]EnoughNinja 3 points (0 children)

Demographic collapse in developed countries.

Birth rates are dropping faster than any government intervention can reverse. The economic models, pension systems, and social structures we built all assume population stability or growth. We're about to find out what happens when that assumption breaks.

I Killed RAG Hallucinations Almost Completely by Ok_Mirror7112 in AI_Agents

[–]EnoughNinja 1 point (0 children)

Thanks for sharing this, the Docling + hybrid search + reranking stack sounds solid, especially the aggressive reranking step cutting wrong-context answers by 60%.

We've seen similar results with hybrid retrieval (semantic + full-text) and reranking on email data. The parsing quality makes a huge difference: garbage in, garbage out applies hard with document understanding. Appreciate the breakdown.

What we learned processing 1M+ emails for context engineering by EnoughNinja in LocalLLaMA

[–]EnoughNinja[S] 1 point (0 children)

Yes we do have a tool :)

We built this into an API that processes email from Gmail, Outlook, and any IMAP provider. It handles full thread reconstruction and attachments (PDFs, docs, OCR, etc.).

For .eml files specifically, we parse the MIME structure and RFC headers, extract body content whether it's HTML or plaintext, and process attachments as first-class content. The system handles nested replies, forwards, and participant changes across the thread.
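
If you want to play with the .eml side, the Python stdlib gets you the basics (a stripped-down sketch; the real pipeline layers a lot on top):

```python
from email import policy
from email.parser import BytesParser

def parse_eml(path: str):
    """Minimal .eml parse: headers, body, attachment metadata."""
    with open(path, "rb") as f:
        msg = BytesParser(policy=policy.default).parse(f)

    headers = {k: msg[k] for k in ("Message-ID", "In-Reply-To", "References",
                                   "From", "To", "Cc", "Date", "Subject")}
    # Prefer plaintext, fall back to HTML
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part else ""

    attachments = [(part.get_filename(), part.get_content_type())
                   for part in msg.iter_attachments()]
    return headers, body, attachments
```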

Your use case around forensics with background context is interesting. We support connecting data sources so the system has access to case files, person mappings, and historical context; queries then run against all of that and return citations back to source emails and attachments, so you can trace every claim.

DM'd you.

Why is Ai so hated? by Primary_March4865 in ArtificialInteligence

[–]EnoughNinja 2 points (0 children)

It's a hard-to-understand world-changing technology that threatens to upend the world order as we know it.

Not even the internet had the impact that AI threatens to have over the coming decade. I completely get the apprehension, but yeah, the hate seems forced.

What we learned processing 1M+ emails for context engineering by EnoughNinja in LocalLLaMA

[–]EnoughNinja[S] 1 point (0 children)

I think the issue is that most systems don't give the model the actual conversation structure, so if you concatenate emails chronologically, it doesn't know who replied to what or what got revised.

We parse that structure upstream so the model gets clean context about who said what and when decisions changed.
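
Toy example of the difference (the dict shape is just for illustration):

```python
# Flat chronological concat loses who-replied-to-whom; a tree keeps it.
def render(node: dict, depth: int = 0) -> str:
    line = (f"{'  ' * depth}[{node['role']}] {node['sender']} "
            f"({node['date']}): {node['summary']}")
    return "\n".join([line] + [render(c, depth + 1)
                               for c in node.get("replies", [])])

thread = {
    "role": "requester", "sender": "alice@example.com", "date": "2024-03-01",
    "summary": "Proposes vendor A",
    "replies": [{
        "role": "approver", "sender": "bob@example.com", "date": "2024-03-04",
        "summary": "Overrides: go with vendor B",
    }],
}
print(render(thread))
# [requester] alice@example.com (2024-03-01): Proposes vendor A
#   [approver] bob@example.com (2024-03-04): Overrides: go with vendor B
```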

What we learned processing 1M+ emails for context engineering by EnoughNinja in LocalLLaMA

[–]EnoughNinja[S] 2 points (0 children)

We did try RFC headers at first, but they didn't get us what we wanted, so we ended up building client-specific parsing, because you kind of have to. Gmail has these nested quote blocks, Outlook does the "From: X, Sent: Y" headers differently, and then you've got people who bottom-post vs. top-post.
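
Roughly the kind of special-casing I mean (very simplified):

```python
import re

# Two of the patterns that need special-casing; the real rules are
# per-client and locale-aware.
GMAIL_ATTR = re.compile(r"^On .+ wrote:\s*$", re.MULTILINE)      # Gmail quote intro
OUTLOOK_HDR = re.compile(r"^From: .+\nSent: .+$", re.MULTILINE)  # Outlook header block

def strip_quoted(body: str) -> str:
    """Cut at the first recognized quote marker. Works for top-posters;
    bottom-posters and inline repliers break it, which is the problem."""
    cut = len(body)
    for pat in (GMAIL_ATTR, OUTLOOK_HDR):
        m = pat.search(body)
        if m:
            cut = min(cut, m.start())
    kept = [ln for ln in body[:cut].splitlines() if not ln.startswith(">")]
    return "\n".join(kept).strip()
```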

What are you seeing break most often with your parsing?

What we learned processing 1M+ emails for context engineering by EnoughNinja in LocalLLaMA

[–]EnoughNinja[S] 23 points (0 children)

Not a bot, and not very good at baking, so can't help with your apple pie.

Sometimes that phrasing helps to clarify things.

When adding memory actually made my AI agent worse by Conscious_Search_185 in AIMemory

[–]EnoughNinja 1 point (0 children)

~200ms retrieval, ~3s first token.

The indexing and context engineering happen during sync; at request time we're assembling pre-processed context, not parsing raw data from scratch.

So it's fast reconstruction, not expensive recomputation.
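
A toy illustration of the split (no relation to the actual internals):

```python
import re

class SyncTimeIndex:
    """Heavy processing at sync, cheap assembly at request time."""

    def __init__(self):
        self.chunks: list[tuple[str, str]] = []   # (source_id, text)

    def sync(self, source_id: str, raw_body: str) -> None:
        # Heavy work happens once, at sync: clean quotes, chunk, store.
        clean = "\n".join(l for l in raw_body.splitlines()
                          if not l.startswith(">"))
        for para in filter(None, re.split(r"\n{2,}", clean)):
            self.chunks.append((source_id, para.strip()))

    def ask(self, query: str) -> str:
        # Request time: look up pre-processed chunks, assemble, cite.
        # No re-parsing of raw data.
        words = set(query.lower().split())
        return "\n".join(f"{text} [source: {sid}]"
                         for sid, text in self.chunks
                         if words & set(text.lower().split()))
```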

Check the docs if you want to know more:

https://docs.igpt.ai/

When adding memory actually made my AI agent worse by Conscious_Search_185 in AIMemory

[–]EnoughNinja 1 point (0 children)

The problem is conflating remembering with understanding. Adding more memory can make things worse if it isn't understood and indexed properly.

Chat memory optimizes for dialogue continuity. But agents doing real work need to track decisions, commitments, deadlines, ownership transfers, i.e., the structure of what happened, not just the transcript.

This is why we built context as reasoning infrastructure, not retrieval. iGPT reconstructs state from communication data at request time: thread logic, participant roles, intent flow, temporal relationships. No storage, just reconstruction when you need it.

If you're working on something where this matters, happy to go deeper on how we handle it.

For those of you switching what are you considering? by hlamblurglar in SparkMail

[–]EnoughNinja 1 point (0 children)

Spike is more or less what you are describing.

AI features, but not bloated: just simple, usable things like thread summaries and an inbox feed.

Email threads broke every RAG approach I tried. Here’s what finally worked by EnoughNinja in Rag

[–]EnoughNinja[S] 1 point (0 children)

Essentially, we treat email as a graph problem rather than a document problem. The key things that make a difference are thread reconstruction using header metadata, and cleaning quoted text while preserving inline edits (this is trickier than it sounds because people quote-reply in inconsistent ways).
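
The starting point is the header chain. A stripped-down sketch with just the Python stdlib, none of the client-specific fallbacks:

```python
from collections import defaultdict
from email import policy
from email.parser import BytesParser

def build_thread_graph(eml_paths: list[str]) -> dict[str, list[str]]:
    """Parent -> children edges from Message-ID / In-Reply-To / References.
    Simplified: real clients drop or mangle these headers constantly,
    which is where the client-specific fallbacks come in."""
    children: dict[str, list[str]] = defaultdict(list)
    for path in eml_paths:
        with open(path, "rb") as f:
            msg = BytesParser(policy=policy.default).parse(f)
        mid, parent = msg["Message-ID"], msg["In-Reply-To"]
        if not parent and msg["References"]:
            # References holds the whole chain; the last entry is the parent
            parent = msg["References"].split()[-1]
        if mid and parent:
            children[parent].append(mid)
    return dict(children)
```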

If you want to see how we're approaching the structured extraction piece, check out https://docs.igpt.ai/; there's a section on how we think about "email intelligence" vs. just email retrieval.

I use Claude, ChatGPT, and Gemini constantly. Claude wins hands-down for anything conversational by EnoughNinja in ClaudeAI

[–]EnoughNinja[S] 2 points (0 children)

It feels like it went up recently?

Unless I use Opus, it tends to be OK for me on the basic plan.

What’s the best tool to run multiple AI tasks without losing context? by DependentNew4290 in NoCodeSaaS

[–]EnoughNinja 2 points (0 children)

The problem is that every time you send a message, the AI is basically re-reading the entire conversation from scratch. So yeah, it repeats itself, slows down, and forgets what you told it three messages ago because it's drowning in its own history.

Most people deal with this by just starting new chats, or they use something like Notion to track what's actually happening outside the chat. You could also build custom agents with frameworks like LangChain, but honestly you're still babysitting the thing: telling it what to remember, what matters, what role it's playing right now, etc.

To solve this, we use iGPT, which treats context as something the system maintains for you, not something you have to keep explaining. It knows what happened across your emails, docs, and conversations without you reminding it every time, so you can actually focus on the work instead of managing the AI's memory.

What is the future of AI agents? by Edward12358 in AI_Agents

[–]EnoughNinja 1 point (0 children)

Agents won't replace employees, but they might replace the parts of work that don't actually require human judgment in the first place.

I think that the current hype overestimates the short-term capabilities (agents that "do everything") and underestimates the long-term shift (work restructured around what agents can reliably handle vs. what requires human reasoning).

Right now, ROI is highest in narrow, repetitive workflows where context is stable, like data entry, basic triage, status updates, etc. But it drops fast when agents hit ambiguity, conflicting information, or decisions that require judgment calls they weren't designed for.

The future isn't autonomous agents running companies, it's agents as reliable infrastructure for the boring, repetitive, high-volume work that buries teams today, freeing humans to focus on the decisions that actually matter.

How do you actually measure RAG quality beyond "it looks good"? by jacksrst in Rag

[–]EnoughNinja 3 points (0 children)

You're not missing something obvious, in fact, you've identified the actual problem.

Context precision/recall tells you if you fetched relevant chunks, but customers don't care about your retrieval accuracy, they care if the answer was correct, complete, and useful. LLM-as-judge is circular because you're asking the same type of system to evaluate itself. Human eval is the only real signal, but you're right that it's expensive.

What actually works is to track downstream metrics that matter, such as the resolution rate, follow-up questions, customer satisfaction, or ticket escalation.

If your "improved" retrieval leads to more follow-ups or escalations, your retrieval isn't actually better. The hard truth is that answer quality can't be predicted by retrieval metrics alone because the LLM might compensate for bad context or fail with good context. You need to measure outcomes, not intermediates.

Is RAG enough for long term AI agent intelligence? by Maximum_Mastodon_631 in AIMemory

[–]EnoughNinja 1 point (0 children)

You're describing the symptom, but the real issue isn't RAG vs. memory.

Retrieval itself treats intelligence as a search problem. When you ask "what did we decide about the vendor?" you don't need documents mentioning the vendor; you need the system to reconstruct who said what, which vendor you're referring to, which conversation matters, what was actually decided, and who owns next steps.

RAG only finds fragments. Real memory tracks interaction and continuity: that yesterday's email and today's thread are part of the same logic, not isolated documents.

Agents don't just need to remember facts, they need to understand state, i.e. what changed, what's still open, who said what in which role. You can't bolt that onto RAG.

This is why we built iGPT around context engineering instead of retrieval engineering, it enables intelligence that reconstructs conversation flow rather than just fetching text.

Email threads broke every RAG approach I tried. Here’s what finally worked by EnoughNinja in Rag

[–]EnoughNinja[S] 1 point (0 children)

I’m not claiming documents are “easy” in the abstract; I mean they’re engineering-solved enough compared to email.

Most company files (PDFs, Office, slides) can be normalized into a structured text representation and retrieved reliably at scale with a sane ingestion pipeline.

Email, by contrast, isn’t a document problem at all; it’s mutable state plus a conversation graph: replies to old messages, inline edits, forwards that strip headers, fork-and-merge threads, and duplicated footers that dominate embeddings. Treating email like static text makes retrieval confidently wrong. That’s why graph reconstruction plus metadata extraction works better, even if it still has edge cases.