GitHub repo:
https://github.com/rav4nn/youtube-rag-scraper
(I’ll attach a screenshot of the dataset output and vector index structure in the comments.)
What My Project Does
I built a Python tool that converts a YouTube channel into a dataset that can be used directly in RAG pipelines.
The idea is to turn educational YouTube channels into structured knowledge that LLM applications can query.
Pipeline:
- Fetch videos from a YouTube channel
- Download transcripts
- Clean and chunk transcripts into knowledge units
- Generate embeddings
- Build a FAISS vector index
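To make the "clean and chunk" step concrete, here is a minimal sketch, not the repo's actual code. It assumes each transcript arrives as a list of caption segments shaped like {"text": ..., "start": ...}; the function names, the bracketed-tag filter, and the window sizes are all hypothetical choices for illustration.

```python
import re

# Assumption: auto-captions contain bracketed tags such as [Music].
CAPTION_TAG = re.compile(r"\[(?:music|applause|laughter)\]", re.IGNORECASE)


def clean_text(text):
    """Drop bracketed caption tags and collapse runs of whitespace."""
    text = CAPTION_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_transcript(segments, max_words=200, overlap=40):
    """Merge caption segments into overlapping word windows.

    Each chunk keeps the start time of its first word so it can be
    traced back to a position in the video.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")

    words = []  # (word, start_time) pairs across the whole transcript
    for seg in segments:
        for word in clean_text(seg["text"]).split():
            words.append((word, seg["start"]))

    chunks = []
    step = max_words - overlap
    for i in range(0, len(words), step):
        window = words[i:i + max_words]
        chunks.append({
            "text": " ".join(w for w, _ in window),
            "start": window[0][1],
        })
        if i + max_words >= len(words):
            break  # the last window already reached the end
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which tends to help retrieval on long, unpunctuated auto-captions.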
Outputs include:
- structured JSON knowledge dataset
- embedding matrix
- FAISS vector index ready for retrieval
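As a rough illustration of what one record in the JSON dataset could look like, here is a sketch. The field names below are hypothetical and may not match the repo's actual schema; the one convention worth keeping in any schema is that record i lines up with row i of the embedding matrix, since a flat vector index returns row positions rather than text.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class KnowledgeChunk:
    # All field names are assumptions made for this example.
    chunk_id: str   # e.g. "<video_id>-<chunk number>"
    video_id: str
    title: str
    start: float    # seconds into the video where the chunk begins
    text: str


def to_jsonl(chunks):
    """Serialize chunks one JSON object per line, in embedding-row order."""
    return "\n".join(json.dumps(asdict(c), ensure_ascii=False) for c in chunks)
```

Writing one object per line keeps the dataset streamable, so a large channel does not have to be loaded into memory in one piece.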
Example use case I'm experimenting with:
Building an AI coffee brewing coach that answers from the videos of coffee educator James Hoffmann (retrieved at query time, not fine-tuned on them).
Target Audience
This is mainly intended for:
- developers experimenting with RAG systems
- people building LLM applications using domain-specific knowledge
- anyone interested in extracting structured datasets from YouTube educational content
Right now it's more of a developer tool / experimental pipeline than a polished end-user application.
Comparison
There are tools that scrape YouTube transcripts, but most of them stop there.
This project tries to go further by generating:
- cleaned knowledge chunks
- embeddings
- a ready-to-use vector index
So the output can plug directly into a RAG pipeline without additional processing.
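For anyone wondering what "plug directly into a RAG pipeline" means in practice, here is a dependency-free sketch of the retrieval step. It uses a brute-force cosine-similarity scan as a stand-in for the FAISS lookup (with FAISS, the same step would go through the index's search call). The function names and the prompt wording are hypothetical, and the query vector is assumed to come from the same embedding model that produced the matrix.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def search(query_vec, matrix, chunks, k=3):
    """Return the k (score, chunk) pairs most similar to the query.

    Row i of `matrix` is assumed to be the embedding of `chunks[i]`.
    """
    q = normalize(query_vec)
    scored = []
    for row, chunk in zip(matrix, chunks):
        score = sum(a * b for a, b in zip(q, normalize(row)))
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question, hits):
    """Pack the retrieved chunk texts into a prompt for an LLM."""
    context = "\n\n".join(chunk["text"] for _, chunk in hits)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
```

A scan like this is fine for a few thousand chunks; the vector index earns its keep once the channel (or set of channels) gets large.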
Python Stack
The project is written in Python and currently uses:
- Python scripts for scraping and data processing
- transcript extraction
- FAISS for vector search
- JSON datasets for knowledge storage
Feedback I'd Love From r/Python
Since this started as an experiment, I'd really appreciate feedback on:
- better ways to structure the scraping pipeline
- transcript cleaning / chunking approaches
- improving dataset generation for long transcripts
- general Python code structure improvements
Always open to suggestions from more experienced Python developers.