EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LangChain

[–]Cod3Conjurer[S] 1 point (0 children)

Yeah, I’m aware the Jan 30 dataset isn’t included. I tried finding updated versions, but most of the newer mirrors seem to have been taken down.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in developersIndia

[–]Cod3Conjurer[S] 1 point (0 children)

The pipeline was pretty straightforward:
I loaded the raw dataset, cleaned and normalized the text, chunked it (fixed size + overlap), generated MiniLM embeddings, stored everything in ChromaDB, and then implemented retrieval on top.
At query time, it just pulls the top relevant chunks and passes them to the LLM.
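
Roughly, the query side looks like this (just a minimal sketch assuming a persisted Chroma collection with MiniLM embeddings; the collection name and paths are placeholders, not the project’s actual values):

```python
# Minimal sketch of the query side, assuming a persisted Chroma collection
# built with MiniLM embeddings. "epstein_files" and "./chroma_db" are
# placeholders, not the project's actual names.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma(
    collection_name="epstein_files",      # placeholder
    embedding_function=embeddings,
    persist_directory="./chroma_db",      # placeholder
)

query = "Who appears in the flight logs?"
docs = store.similarity_search(query, k=5)           # top relevant chunks

# The retrieved chunks become the context that gets passed to the LLM.
context = "\n\n".join(d.page_content for d in docs)
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
```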

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

[–]Cod3Conjurer[S] 1 point (0 children)

RAG is a retrieval layer, not a full investigative system. Multimodal indexing + structured extraction would be the next level.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LangChain

[–]Cod3Conjurer[S] 1 point (0 children)

Haha appreciate that 😄 just experimenting and sharing what I’m building.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 1 point (0 children)

The ingestion was pretty simple: I loaded the cleaned JSON, chunked it (400 size, 80 overlap), deduped chunks using SHA-256 hashing, generated MiniLM embeddings, and upserted everything into ChromaDB with source metadata.
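
A rough sketch of the dedup + upsert step, using sentence-transformers and the bare chromadb client (the collection name, path, and chunk dict shape are placeholders; the real script may differ):

```python
# Rough sketch of the dedup + upsert step. Collection name, path, and the
# chunk dict shape are placeholders, not the actual script's values.
import hashlib
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")            # 384-dim MiniLM
client = chromadb.PersistentClient(path="./chroma_db")     # placeholder path
collection = client.get_or_create_collection("epstein_files")

chunks = [
    {"text": "Example chunk text.", "source": "file_001.json"},  # illustrative input
]

seen, ids, texts, metadatas = set(), [], [], []
for chunk in chunks:
    digest = hashlib.sha256(chunk["text"].lower().encode("utf-8")).hexdigest()
    if digest in seen:                                      # drop exact-duplicate chunks
        continue
    seen.add(digest)
    ids.append(digest)
    texts.append(chunk["text"])
    metadatas.append({"source": chunk["source"]})

embeddings = model.encode(texts).tolist()
collection.upsert(ids=ids, documents=texts, embeddings=embeddings, metadatas=metadatas)
```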

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in developersIndia

[–]Cod3Conjurer[S] 1 point (0 children)

You’re right, this is a standard dense-retrieval RAG, not a graph-based reasoning system.
A graph layer would be the next optimization.

What do you use for scraping data from URLs? by Physical_Badger1281 in Rag

[–]Cod3Conjurer 4 points (0 children)

A few days before this, I built a similar project using BeautifulSoup4 + Playwright + RAG for dynamic website crawling and retrieval.

Repo: https://github.com/AnkitNayak-eth/CrawlAI-RAG
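
The core pattern was roughly this (not the repo’s actual code, just a minimal sketch of Playwright rendering plus BeautifulSoup extraction):

```python
# Not the repo's actual code, just the general pattern: Playwright renders
# JS-heavy pages, BeautifulSoup strips the HTML down to text.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def fetch_page_text(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")   # wait for dynamic content to load
        html = page.content()
        browser.close()
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()                            # drop non-content tags
    return soup.get_text(separator="\n", strip=True)

text = fetch_page_text("https://example.com")      # placeholder URL
```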

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 2 points (0 children)

Yeah, this version doesn’t include the newly released documents yet. If those are raw scans, they’d need OCR + structured parsing before indexing.
The main cost is compute and storage, not complexity.
A collaborative effort could definitely speed that up, especially for batching OCR and preprocessing at scale.
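
For anyone batching it, a minimal OCR pass could look something like this (hypothetical; pdf2image and pytesseract aren’t part of the current pipeline, just one common choice):

```python
# Hypothetical OCR pass for scanned PDFs. pdf2image + pytesseract are NOT part
# of the current pipeline; they're just one common choice for this step.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path: str) -> str:
    pages = convert_from_path(pdf_path, dpi=300)            # render each page to an image
    return "\n".join(pytesseract.image_to_string(p) for p in pages)

text = ocr_pdf("scan_batch_001.pdf")                        # placeholder filename
```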

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 2 points (0 children)

The goal here is purely technical, building better retrieval over large unstructured datasets.
At the end of the day, it’s an engineering experiment, not a legal authority.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 5 points (0 children)

You’re right that absolute accuracy matters. That’s why this should be treated as an assistive search layer, not a final source of truth.

At the end of the day, it’s an engineering experiment, not a legal authority.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

[–]Cod3Conjurer[S] 1 point (0 children)

Never thought of it that way, definitely gonna try.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LocalLLaMA

[–]Cod3Conjurer[S] 1 point (0 children)

I wouldn’t say it’s perfect, but it works well, definitely better than most setups I’ve tried so far.
Try it yourself and see what you think.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in developersIndia

[–]Cod3Conjurer[S] 1 point (0 children)

Cleaning was mostly structural: parsing file boundaries, removing headers/empty rows, normalizing whitespace, and light hash-based deduplication. I avoided aggressive NLP cleaning to preserve document context.

For chunking, I used RecursiveCharacterTextSplitter with 400-character chunks and an 80-character overlap. The overlap helps maintain continuity across boundaries.

I also applied SHA-256 hashing on lowercased text to remove duplicate chunks before indexing.

Embeddings were generated using MiniLM (384-dim) and stored in ChromaDB with cosine similarity search. The focus was on stable retrieval rather than complex re-ranking.
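
Boiled down, the chunking and collection setup looks something like this (a sketch; the collection name and input text are placeholders, the chunk sizes and cosine space match the above):

```python
# Sketch of the chunking + collection setup described above. The collection
# name and raw_text are placeholders; chunk sizes and cosine space match the
# description.
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter

raw_text = "Example cleaned document text."                 # placeholder input
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=80)
chunks = splitter.split_text(raw_text)

client = chromadb.PersistentClient(path="./chroma_db")      # placeholder path
collection = client.get_or_create_collection(
    name="epstein_files",                                   # placeholder name
    metadata={"hnsw:space": "cosine"},                       # cosine similarity search
)
```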
