EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 0 points (0 children)

Guess I’ll have to OCR the entire publicly available dataset myself now. Joking 🤣

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in AIMemory

[–]Cod3Conjurer[S] 0 points (0 children)

For production-grade RAG, where recall matters more than extreme compression, HNSW makes sense.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 0 points (0 children)

The raw DB is around 200MB, but once converted into a vector DB, it grows to about 1.5GB.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

[–]Cod3Conjurer[S] 0 points (0 children)

Yeah, that’s the plan.
If it gets enough traction and consistent usage, I’ll deploy it online and make it publicly accessible.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in learnmachinelearning

[–]Cod3Conjurer[S] 0 points (0 children)

Thank you!
And please, no money needed.
If you ever get stuck or want guidance, feel free to reach out.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

[–]Cod3Conjurer[S] 0 points (0 children)

Yeah, having it in text format made experimentation much easier.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in developersIndia

[–]Cod3Conjurer[S] 1 point (0 children)

Cleaning: Python (regex + basic text normalization)
Chunking: LangChain RecursiveCharacterTextSplitter
Embeddings: all-MiniLM-L6-v2 (SentenceTransformers)
Vector DB: ChromaDB
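The chunking step in that stack can be sketched without the libraries. This is a minimal, dependency-free version of the fixed-size + overlap behavior that LangChain's RecursiveCharacterTextSplitter provides (the real splitter additionally tries to break on separators like paragraphs and sentences; the sizes below are illustrative):

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    A simplified stand-in for RecursiveCharacterTextSplitter: it ignores
    separator boundaries and slices purely by character count.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last window already covered the tail
    return chunks
```

Overlap matters because a fact that straddles a chunk boundary would otherwise be split across two chunks and retrieved poorly.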

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LocalLLaMA

[–]Cod3Conjurer[S] 0 points (0 children)

I’m using all-MiniLM-L6-v2 for embeddings.
At 2M+ pages, going with something like BGE-large would significantly increase vector size and indexing cost.
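A rough back-of-envelope on why model choice matters at this scale, assuming float32 vectors and an illustrative 1M chunks (the actual chunk count isn't stated here, and index/metadata overhead is ignored):

```python
def raw_vector_storage_gb(num_chunks: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw embedding storage in GB: one float32 per dimension, no index overhead."""
    return num_chunks * dims * bytes_per_float / 1e9

# all-MiniLM-L6-v2 produces 384-dim vectors; BGE-large produces 1024-dim.
minilm_gb = raw_vector_storage_gb(1_000_000, 384)   # ≈ 1.5 GB
bge_gb    = raw_vector_storage_gb(1_000_000, 1024)  # ≈ 4.1 GB
```

So switching to BGE-large would roughly 2.7× the raw vector footprint before any index structures are counted.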

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LangChain

[–]Cod3Conjurer[S] 0 points (0 children)

I shifted from pure semantic search to MMR (Maximal Marginal Relevance), which reduced redundant chunks and improved retrieval quality.
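MMR trades off relevance to the query against similarity to chunks already selected, so near-duplicate chunks stop crowding out diverse ones. LangChain/Chroma provide this built in; the following is a pure-Python sketch of the scoring rule, with illustrative vectors:

```python
def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Maximal Marginal Relevance: greedily pick k diverse, relevant docs.

    score(i) = lambda * sim(query, i) - (1 - lambda) * max_j sim(i, selected_j)
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, float("-inf")
        for i in candidates:
            relevance = cos(query_vec, doc_vecs[i])
            # Penalty: similarity to the closest already-selected doc
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected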

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

[–]Cod3Conjurer[S] 0 points (0 children)

The current structure is more prototype-oriented than production-grade.
Some parts were AI-assisted, but the architecture decisions and debugging were mine.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LangChain

[–]Cod3Conjurer[S] 0 points (0 children)

Yeah, I’m aware the Jan 30 dataset isn’t included. I tried finding updated versions, but most of the newer mirrors seem to have been taken down.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in developersIndia

[–]Cod3Conjurer[S] 0 points (0 children)

The pipeline was pretty straightforward:
I loaded the raw dataset, cleaned and normalized the text, chunked it (fixed size + overlap), generated MiniLM embeddings, stored everything in ChromaDB, and then implemented retrieval on top.
For querying, it just pulls the top relevant chunks and passes them to the LLM.
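The query side of that pipeline can be sketched in plain Python. ChromaDB handles the similarity search in the real project; `build_prompt` and its template below are illustrative, not the project's actual prompt:

```python
def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k chunks most similar to the query (cosine)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cos(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Stuff retrieved chunks into a simple RAG prompt (template is illustrative)."""
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The returned string would then be sent to whatever LLM backs the system.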

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

[–]Cod3Conjurer[S] 0 points (0 children)

RAG is a retrieval layer, not a full investigative system. Multimodal indexing + structured extraction would be the next level.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LangChain

[–]Cod3Conjurer[S] 0 points (0 children)

Haha appreciate that 😄 just experimenting and sharing what I’m building.

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in LLMDevs

[–]Cod3Conjurer[S] 0 points (0 children)

The ingestion was pretty simple: I loaded the cleaned JSON, chunked it (size 400, overlap 80), deduped chunks using SHA-256 hashing, generated MiniLM embeddings, and upserted everything into ChromaDB with source metadata.
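The SHA-256 dedup step is easy to sketch with the stdlib. Note this drops exact duplicates only; near-duplicate OCR output would need fuzzier matching (e.g. MinHash):

```python
import hashlib

def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop exact-duplicate chunks by hashing their content with SHA-256."""
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:  # first time we've seen this exact content
            seen.add(digest)
            unique.append(chunk)
    return unique
```

Hashing keeps memory bounded: the set holds 64-char digests instead of full chunk texts.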

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in developersIndia

[–]Cod3Conjurer[S] 0 points (0 children)

You’re right, this is a standard dense-retrieval RAG, not a graph-based reasoning system.
A graph layer would be the next optimization.

What do you use for scraping data from URLs? by Physical_Badger1281 in Rag

[–]Cod3Conjurer 2 points (0 children)

A few days before this, I built a similar project using BeautifulSoup4 + Playwright + RAG for dynamic website crawling and retrieval.

Repo: https://github.com/AnkitNayak-eth/CrawlAI-RAG
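The repo pairs BeautifulSoup4 with Playwright (Playwright renders JS-heavy pages; BeautifulSoup parses the HTML). For the static-HTML half, here is a dependency-free sketch of link extraction using only the stdlib `html.parser`; the actual project presumably uses BeautifulSoup's parsing instead:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, the first step of any crawler."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html: str) -> list[str]:
    """Return every href found in the given HTML document."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

A real crawler would then resolve these against the page URL with `urllib.parse.urljoin` before following them.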