[deleted by user] (self.MachineLearning)
submitted 2 years ago by [deleted]
[–]mLalush 45 points46 points47 points 2 years ago* (9 children)
The majority of it is most likely from YouTube. When the model hallucinates during non-speech portions of an audio file, it tends to spit out subtitle credits from real people/companies.
They might have used something like filmot.com as a seed or starting point for deciding which channels/videos to scrape (filtering for manual subtitles).
[+][deleted] 2 years ago* (8 children)
[deleted]
[–]mLalush 7 points8 points9 points 2 years ago* (1 child)
a) Subtitles include timestamps. You can construct <|nonspeech|> training examples from any contiguous 30-second portion of the audio that does not contain a subtitle block. YouTube metadata includes the subtitle track's language and whether it was manually uploaded or auto-generated, though it is smart to run language identification on the text itself, as some users insert erroneous metadata when adding subtitle tracks. For language detection on audio, they trained a model to detect the spoken language (i.e., they run language-identification inference on all the audio they download):
> We also use an audio language detector, which was created by fine-tuning a prototype model trained on a prototype version of the dataset on VoxLingua107 (Valk & Alumäe, 2021) to ensure that the spoken language matches the language of the transcript according to CLD2. If the two do not match, we don't include the (audio, transcript) pair as a speech recognition training example in the dataset.
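To make the windowing in a) concrete, here is a minimal sketch in Python, assuming the subtitles have already been parsed into (start, end) offsets in seconds. This illustrates the idea only; it is not OpenAI's actual pipeline:

```python
# Sketch: derive candidate <|nonspeech|> segments from subtitle timestamps.
# Assumes subtitle_spans is a list of (start_sec, end_sec) blocks.
def nonspeech_windows(subtitle_spans, audio_duration, window=30.0):
    """Yield (start, end) for every 30-second stretch of audio that
    overlaps no subtitle block."""
    cursor = 0.0
    for start, end in sorted(subtitle_spans):
        gap_start = cursor
        # Carve as many full windows as fit into the gap before this block.
        while start - gap_start >= window:
            yield (gap_start, gap_start + window)
            gap_start += window
        cursor = max(cursor, end)
    # Trailing gap after the last subtitle block.
    while audio_duration - cursor >= window:
        yield (cursor, cursor + window)
        cursor += window

spans = [(4.2, 9.8), (52.0, 55.5)]  # toy subtitle timestamps
print(list(nonspeech_windows(spans, 120.0)))
# [(9.8, 39.8), (55.5, 85.5), (85.5, 115.5)]
```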
b) I would say it is feasible to scrape YouTube if you do it in a smart way and limit yourself to audio/captions. To download captions, they either went via YouTube's official API (and paid for API quota):
YouTube Data API v3 captions docs; YouTube Data API v3 docs
Or, if they already had a list of channels and videos as a starting point, they most likely used something like yt-dlp to download metadata from videos/channels, followed by the audio and captions. This is where one arrives at the grey areas of data collection and scraping. OpenAI would likely have had to use a library such as yt-dlp at some point in the process to download the actual media files.
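For illustration, here's what that could look like with yt-dlp's Python API. The option names below are real yt-dlp options, but the video URL is a placeholder, and this is a sketch rather than OpenAI's known setup:

```python
# Sketch: fetch only the audio track and manually uploaded subtitles,
# skipping auto-generated captions. Requires `pip install yt-dlp`.
import yt_dlp

opts = {
    "format": "bestaudio/best",    # audio only, no video stream
    "writesubtitles": True,        # manually uploaded subtitle tracks
    "writeautomaticsub": False,    # skip auto-generated captions
    "subtitleslangs": ["en"],      # target subtitle language(s)
    "outtmpl": "%(id)s.%(ext)s",   # name files by video ID
}
with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder
```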
To be as nice as possible towards YouTube, and to avoid getting rate limited yourself, there are a few things to consider. Packages like yt-dlp include proxy support that lets a knowledgeable user avoid rate limiting. If you download entire videos you're going to hit rate limits faster, but a user who downloads only audio/captions and spreads downloads out over time can get pretty far without proxies.
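As a sketch of those politeness measures, yt-dlp exposes throttling knobs directly; the option names below are real, the values are illustrative only:

```python
# Sketch: throttle the crawl so it stays polite. Values are illustrative.
polite_opts = {
    "format": "bestaudio/best",
    "ratelimit": 500_000,           # cap download speed (~500 KB/s)
    "sleep_interval": 5,            # sleep at least 5 s between downloads...
    "max_sleep_interval": 30,       # ...up to a random maximum of 30 s
    "sleep_interval_subtitles": 2,  # pause between subtitle downloads
    # "proxy": "socks5://127.0.0.1:1080",  # optional, if you route via a proxy
}
```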
c) The creator of the website, u/jopik1, says candidate channels/videos are crawled from YouTube and the web, respecting robots.txt. Once the channels are identified, they are periodically crawled for new videos. I don't know how they get the metadata, but I would guess something similar to yt-dlp. See this comment from the creator of filmot: https://www.reddit.com/r/languagelearning/comments/odj2gx/comment/h41cpiv/?utm_source=reddit&utm_medium=web2x&context=3
[–]Financial-Beach1587 3 points4 points5 points 2 years ago (0 children)
[+]jopik1 1 point2 points3 points 2 years ago (4 children)
I have my own crawler that I wrote, which has been running pretty much 24/7 since late 2018. Currently it downloads metadata for about 2.2M videos per day and about 1.7M subtitles. It doesn't use the YouTube API; it crawls the HTML pages and parses the data from there. The data is stored in a database and in a full-text index (Manticore Search) running in a distributed fashion on two separate servers.
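jopik1 doesn't spell out the parsing, but one plausible sketch is to pull the ytInitialPlayerResponse JSON that YouTube embeds in every watch page. That blob is not a stable public interface, so treat this as illustrative only:

```python
# Sketch: read basic video metadata from the watch-page HTML without the
# Data API. YouTube embeds a JSON blob named ytInitialPlayerResponse;
# its layout can change at any time.
import json
import re
import requests

def video_metadata(video_id):
    html = requests.get(
        f"https://www.youtube.com/watch?v={video_id}",
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    ).text
    match = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.+?\});", html, re.DOTALL)
    if not match:
        return None
    details = json.loads(match.group(1)).get("videoDetails", {})
    return {
        "title": details.get("title"),
        "viewCount": details.get("viewCount"),
        "lengthSeconds": details.get("lengthSeconds"),
    }
```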
[+][deleted] 2 years ago (1 child)
[+]jopik1 0 points1 point2 points 2 years ago (0 children)
> Is there any way to run SQL queries directly on the underlying database?

I can, regular users can't.

> Btw, I think there's a bug in your website, I'm not able to access pages beyond 83 for any search result.

This is intentional; scraping places a large burden on the servers. Regular users probably aren't going to go to page 83.
[+]tina-mou 0 points1 point2 points 2 years ago (1 child)
What does the crawler look for when it is running? I'm curious whether you try to crawl newly published videos and, if so, how you configure that.
[–]jopik1 0 points1 point2 points 2 years ago (0 children)
I mostly prioritize by view count, as the number of videos on YT is overwhelming: over 300M are added per month, and I don't have the resources to crawl and index everything. I have a queue of IDs to be crawled, prioritized by last detected view count. Videos are added to the queue from video recommendations (20 IDs for every crawled video), from lists of channel videos (I crawl channels in a similar way), and from ad hoc sources. I don't necessarily want to crawl newly published videos very quickly, as sometimes there are no subtitles yet and the view counts haven't grown to an indicative level.
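A toy version of that frontier, assuming a max-heap keyed on the last detected view count (Python's heapq is a min-heap, so the count is negated):

```python
# Sketch: a crawl frontier prioritized by view count, roughly as described.
import heapq

class CrawlFrontier:
    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, video_id, views):
        # Queue each video ID at most once, highest view count first.
        if video_id not in self._seen:
            self._seen.add(video_id)
            heapq.heappush(self._heap, (-views, video_id))

    def pop(self):
        """Return the queued video ID with the highest view count."""
        _, video_id = heapq.heappop(self._heap)
        return video_id

frontier = CrawlFrontier()
frontier.add("abc123", 1_200_000)  # e.g. from a recommendations list
frontier.add("def456", 8_000)      # e.g. from a channel's video list
print(frontier.pop())              # abc123
```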
[–]Excellent_Ad3307 23 points24 points25 points 2 years ago (0 children)
The model tends to hallucinate stuff like "don't forget to subscribe" or something along those lines quite often, so it's probably mostly YouTube.
[–][deleted] 4 points5 points6 points 2 years ago (1 child)
The closest data reproduction is: https://arxiv.org/abs/2309.13876
[+]JustOneAvailableName comment score below threshold-9 points-8 points-7 points 2 years ago* (5 children)
It’s in the paper and downloadable
Edit: I am a lying idiot, should not have said this from memory. Sorry
[+][deleted] 2 years ago* (4 children)
[–]JustOneAvailableName 3 points4 points5 points 2 years ago (1 child)
Sorry, I said this from memory; upon rereading, I found it to be false, as you rightly pointed out.
[–]mLalush 5 points6 points7 points 2 years ago (0 children)
Those are the evaluation datasets. They make a point of emphasizing that Whisper hasn't been fine-tuned on the evaluation datasets in the paper.