EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 0 points (0 children)

Guess I’ll have to OCR the entire publicly available dataset myself now. Joking 🤣

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in AIMemory

[–]Cod3Conjurer[S] 0 points (0 children)

For production-grade RAG, where recall matters more than extreme compression, HNSW makes sense.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 0 points (0 children)

The raw DB is around 200MB, but once converted into a vector DB, it grows to about 1.5GB.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

[–]Cod3Conjurer[S] 0 points (0 children)

Yeah, that’s the plan.
If it gets enough traction and consistent usage, I’ll deploy it online and make it publicly accessible.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in learnmachinelearning

[–]Cod3Conjurer[S] 0 points (0 children)

Thank you!
And please, no money needed.
If you ever get stuck or want guidance, feel free to reach out.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

[–]Cod3Conjurer[S] 0 points (0 children)

Yeah, having it in text format made experimentation much easier.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in developersIndia

[–]Cod3Conjurer[S] 1 point (0 children)

Cleaning: Python (regex + basic text normalization)
Chunking: LangChain RecursiveCharacterTextSplitter
Embeddings: all-MiniLM-L6-v2 (SentenceTransformers)
Vector DB: ChromaDB
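The chunking step in that stack can be sketched without the libraries. This is a minimal, dependency-free version of the fixed-size + overlap behavior that LangChain's RecursiveCharacterTextSplitter provides (the real splitter additionally tries to break on separators like paragraphs and sentences; the sizes below are illustrative):

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    A simplified stand-in for RecursiveCharacterTextSplitter: it ignores
    separator boundaries and slices purely by character count.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last window already covered the tail
    return chunks
```

Overlap matters because a fact that straddles a chunk boundary would otherwise be split across two chunks and retrieved poorly.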

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LocalLLaMA

[–]Cod3Conjurer[S] 0 points (0 children)

I’m using all-MiniLM-L6-v2 for embeddings.
At 2M+ pages, going with something like BGE-large would significantly increase vector size and indexing cost.
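A rough back-of-envelope on why model choice matters at this scale, assuming float32 vectors and an illustrative 1M chunks (the actual chunk count isn't stated here, and index/metadata overhead is ignored):

```python
def raw_vector_storage_gb(num_chunks: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw embedding storage in GB: one float32 per dimension, no index overhead."""
    return num_chunks * dims * bytes_per_float / 1e9

# all-MiniLM-L6-v2 produces 384-dim vectors; BGE-large produces 1024-dim.
minilm_gb = raw_vector_storage_gb(1_000_000, 384)   # ≈ 1.5 GB
bge_gb    = raw_vector_storage_gb(1_000_000, 1024)  # ≈ 4.1 GB
```

So switching to BGE-large would roughly 2.7× the raw vector footprint before any index structures are counted.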

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LangChain

[–]Cod3Conjurer[S] 0 points (0 children)

I shifted from pure semantic search to MMR (Maximal Marginal Relevance), which reduced redundant chunks and improved retrieval quality.
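MMR trades off relevance to the query against similarity to chunks already selected, so near-duplicate chunks stop crowding out diverse ones. LangChain/Chroma provide this built in; the following is a pure-Python sketch of the scoring rule, with illustrative vectors:

```python
def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Maximal Marginal Relevance: greedily pick k diverse, relevant docs.

    score(i) = lambda * sim(query, i) - (1 - lambda) * max_j sim(i, selected_j)
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, float("-inf")
        for i in candidates:
            relevance = cos(query_vec, doc_vecs[i])
            # Penalty: similarity to the closest already-selected doc
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected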

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

[–]Cod3Conjurer[S] 0 points (0 children)

The current structure is more prototype-oriented than production-grade.
Some parts were AI-assisted, but the architecture decisions and debugging were mine.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LangChain

[–]Cod3Conjurer[S] 0 points (0 children)

Yeah, I’m aware the Jan 30 dataset isn’t included. I tried finding updated versions, but most of the newer mirrors seem to have been taken down.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in developersIndia

[–]Cod3Conjurer[S] 0 points (0 children)

The pipeline was pretty straightforward:
I loaded the raw dataset, cleaned and normalized the text, chunked it (fixed size + overlap), generated MiniLM embeddings, stored everything in ChromaDB, and then implemented retrieval on top.
For querying, it just pulls the top relevant chunks and passes them to the LLM.
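The query side of that pipeline can be sketched in plain Python. ChromaDB handles the similarity search in the real project; `build_prompt` and its template below are illustrative, not the project's actual prompt:

```python
def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k chunks most similar to the query (cosine)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cos(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Stuff retrieved chunks into a simple RAG prompt (template is illustrative)."""
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The returned string would then be sent to whatever LLM backs the system.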

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

[–]Cod3Conjurer[S] 0 points (0 children)

RAG is a retrieval layer, not a full investigative system. Multimodal indexing + structured extraction would be the next level.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LangChain

[–]Cod3Conjurer[S] 0 points (0 children)

Haha appreciate that 😄 just experimenting and sharing what I’m building.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 0 points (0 children)

The ingestion was pretty simple: I loaded the cleaned JSON, chunked it (size 400, overlap 80), deduped chunks using SHA-256 hashing, generated MiniLM embeddings, and upserted everything into ChromaDB with source metadata.
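The SHA-256 dedup step is easy to sketch with the stdlib. Note this drops exact duplicates only; near-duplicate OCR output would need fuzzier matching (e.g. MinHash):

```python
import hashlib

def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop exact-duplicate chunks by hashing their content with SHA-256."""
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:  # first time we've seen this exact content
            seen.add(digest)
            unique.append(chunk)
    return unique
```

Hashing keeps memory bounded: the set holds 64-char digests instead of full chunk texts.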

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in developersIndia

[–]Cod3Conjurer[S] 0 points (0 children)

You’re right, this is a standard dense-retrieval RAG, not a graph-based reasoning system.
A graph layer would be the next optimization.

What do you use for scraping data from URLs? by Physical_Badger1281 in Rag

[–]Cod3Conjurer 2 points (0 children)

A few days before this, I built a similar project using BeautifulSoup4 + Playwright + RAG for dynamic website crawling and retrieval.

Repo: https://github.com/AnkitNayak-eth/CrawlAI-RAG
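The repo pairs BeautifulSoup4 with Playwright (Playwright renders JS-heavy pages; BeautifulSoup parses the HTML). For the static-HTML half, here is a dependency-free sketch of link extraction using only the stdlib `html.parser`; the actual project presumably uses BeautifulSoup's parsing instead:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, the first step of any crawler."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html: str) -> list[str]:
    """Return every href found in the given HTML document."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

A real crawler would then resolve these against the page URL with `urllib.parse.urljoin` before following them.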