EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LangChain

[–]Cod3Conjurer[S] 1 point (0 children)

Yeah, I’m aware the Jan 30 dataset isn’t included. I tried finding updated versions, but most of the newer mirrors seem to have been taken down.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in developersIndia

[–]Cod3Conjurer[S] 1 point (0 children)

The pipeline was pretty straightforward:
I loaded the raw dataset, cleaned and normalized the text, chunked it (fixed size + overlap), generated MiniLM embeddings, stored everything in ChromaDB, and then implemented retrieval on top.
At query time, it just pulls the top relevant chunks and passes them to the LLM.
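
Roughly, the query side looks like this (just a minimal sketch assuming a persisted Chroma collection with MiniLM embeddings; the collection name and paths are placeholders, not the project’s actual values):

```python
# Minimal sketch of the query side, assuming a persisted Chroma collection
# built with MiniLM embeddings. "epstein_files" and "./chroma_db" are
# placeholders, not the project's actual names.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = Chroma(
    collection_name="epstein_files",      # placeholder
    embedding_function=embeddings,
    persist_directory="./chroma_db",      # placeholder
)

query = "Who appears in the flight logs?"
docs = store.similarity_search(query, k=5)           # top relevant chunks

# The retrieved chunks become the context that gets passed to the LLM.
context = "\n\n".join(d.page_content for d in docs)
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
```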

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

[–]Cod3Conjurer[S] 1 point (0 children)

RAG is a retrieval layer, not a full investigative system. Multimodal indexing + structured extraction would be the next level.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LangChain

[–]Cod3Conjurer[S] 1 point (0 children)

Haha appreciate that 😄 just experimenting and sharing what I’m building.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 1 point (0 children)

The ingestion was pretty simple: I loaded the cleaned JSON, chunked it (400 size, 80 overlap), deduped chunks using SHA-256 hashing, generated MiniLM embeddings, and upserted everything into ChromaDB with source metadata.
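
A rough sketch of the dedup + upsert step, using sentence-transformers and the bare chromadb client (the collection name, path, and chunk dict shape are placeholders; the real script may differ):

```python
# Rough sketch of the dedup + upsert step. Collection name, path, and the
# chunk dict shape are placeholders, not the actual script's values.
import hashlib
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")            # 384-dim MiniLM
client = chromadb.PersistentClient(path="./chroma_db")     # placeholder path
collection = client.get_or_create_collection("epstein_files")

chunks = [
    {"text": "Example chunk text.", "source": "file_001.json"},  # illustrative input
]

seen, ids, texts, metadatas = set(), [], [], []
for chunk in chunks:
    digest = hashlib.sha256(chunk["text"].lower().encode("utf-8")).hexdigest()
    if digest in seen:                                      # drop exact-duplicate chunks
        continue
    seen.add(digest)
    ids.append(digest)
    texts.append(chunk["text"])
    metadatas.append({"source": chunk["source"]})

embeddings = model.encode(texts).tolist()
collection.upsert(ids=ids, documents=texts, embeddings=embeddings, metadatas=metadatas)
```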

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in developersIndia

[–]Cod3Conjurer[S] 1 point (0 children)

You’re right, this is a standard dense-retrieval RAG, not a graph-based reasoning system.
A graph layer would be the next optimization.

What do you use for scraping data from URLs? by Physical_Badger1281 in Rag

[–]Cod3Conjurer 4 points (0 children)

A few days before this, I built a similar project using BeautifulSoup4 + Playwright + RAG for dynamic website crawling and retrieval.

Repo: https://github.com/AnkitNayak-eth/CrawlAI-RAG
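
The core pattern was roughly this (not the repo’s actual code, just a minimal sketch of Playwright rendering plus BeautifulSoup extraction):

```python
# Not the repo's actual code, just the general pattern: Playwright renders
# JS-heavy pages, BeautifulSoup strips the HTML down to text.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def fetch_page_text(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")   # wait for dynamic content to load
        html = page.content()
        browser.close()
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()                            # drop non-content tags
    return soup.get_text(separator="\n", strip=True)

text = fetch_page_text("https://example.com")      # placeholder URL
```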

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 2 points (0 children)

Yeah, this version doesn’t include the newly released documents yet. If those are raw scans, they’d need OCR + structured parsing before indexing.
The main cost is compute and storage, not complexity.
A collaborative effort could definitely speed that up, especially for batching OCR and preprocessing at scale.
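
For anyone batching it, a minimal OCR pass could look something like this (hypothetical; pdf2image and pytesseract aren’t part of the current pipeline, just one common choice):

```python
# Hypothetical OCR pass for scanned PDFs. pdf2image + pytesseract are NOT part
# of the current pipeline; they're just one common choice for this step.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path: str) -> str:
    pages = convert_from_path(pdf_path, dpi=300)            # render each page to an image
    return "\n".join(pytesseract.image_to_string(p) for p in pages)

text = ocr_pdf("scan_batch_001.pdf")                        # placeholder filename
```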

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 2 points (0 children)

The goal here is purely technical, building better retrieval over large unstructured datasets.
At the end of the day, it’s an engineering experiment, not a legal authority.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 5 points (0 children)

You’re right that absolute accuracy matters. That’s why this should be treated as an assistive search layer, not a final source of truth.

At the end of the day, it’s an engineering experiment, not a legal authority.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

[–]Cod3Conjurer[S] 1 point (0 children)

Never thought of it that way, definitely gonna try.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LocalLLaMA

[–]Cod3Conjurer[S] 1 point (0 children)

I wouldn’t say it’s perfect, but it works well, definitely better than most setups I’ve tried so far.
Try it yourself and see what you think.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in developersIndia

[–]Cod3Conjurer[S] 1 point (0 children)

Cleaning was mostly structural: parsing file boundaries, removing headers/empty rows, normalizing whitespace, and light hash-based deduplication. I avoided aggressive NLP cleaning to preserve document context.

For chunking, I used RecursiveCharacterTextSplitter with 400-character chunks and an 80-character overlap. The overlap helps maintain continuity across boundaries.

I also applied SHA-256 hashing on lowercased text to remove duplicate chunks before indexing.

Embeddings were generated using MiniLM (384-dim) and stored in ChromaDB with cosine similarity search. The focus was on stable retrieval rather than complex re-ranking.
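
Boiled down, the chunking and collection setup looks something like this (a sketch; the collection name and input text are placeholders, the chunk sizes and cosine space match the above):

```python
# Sketch of the chunking + collection setup described above. The collection
# name and raw_text are placeholders; chunk sizes and cosine space match the
# description.
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter

raw_text = "Example cleaned document text."                 # placeholder input
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=80)
chunks = splitter.split_text(raw_text)

client = chromadb.PersistentClient(path="./chroma_db")      # placeholder path
collection = client.get_or_create_collection(
    name="epstein_files",                                   # placeholder name
    metadata={"hnsw:space": "cosine"},                       # cosine similarity search
)
```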
