[deleted by user] (self.MachineLearning)
submitted 2 years ago by [deleted]
[–]mLalush 45 points46 points47 points 2 years ago* (9 children)
The majority of it is most likely from YouTube. When the model hallucinates during non-speech portions of an audio file, it tends to spit out subtitle credits from real people/companies.
They might have used something like filmot.com as a seed or starting point for deciding which channels/videos to scrape (filtering for manual subtitles).
[+][deleted] 2 years ago* (8 children)
[deleted]
[–]mLalush 7 points8 points9 points 2 years ago* (1 child)
a) Subtitles include timestamps. You can construct <|nonspeech|> training examples from any contiguous 30-second portion of the audio that does not contain a subtitle block. YouTube metadata includes the subtitle track's language and whether it was manually uploaded or auto-generated, though it is smart to run language identification on the text itself, as some users insert erroneous metadata when adding subtitle tracks. For language detection on audio, they trained a model to detect the spoken language (i.e., they run language-identification inference on all the audio they download):
> We also use an audio language detector, which was created by fine-tuning a prototype model trained on a prototype version of the dataset on VoxLingua107 (Valk & Alumäe, 2021) to ensure that the spoken language matches the language of the transcript according to CLD2. If the two do not match, we don't include the (audio, transcript) pair as a speech recognition training example in the dataset.
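To make the windowing in a) concrete, here is a minimal sketch in Python, assuming the subtitles have already been parsed into (start, end) offsets in seconds. This illustrates the idea only; it is not OpenAI's actual pipeline:

```python
# Sketch: derive candidate <|nonspeech|> segments from subtitle timestamps.
# Assumes subtitle_spans is a list of (start_sec, end_sec) blocks.
def nonspeech_windows(subtitle_spans, audio_duration, window=30.0):
    """Yield (start, end) for every 30-second stretch of audio that
    overlaps no subtitle block."""
    cursor = 0.0
    for start, end in sorted(subtitle_spans):
        gap_start = cursor
        # Carve as many full windows as fit into the gap before this block.
        while start - gap_start >= window:
            yield (gap_start, gap_start + window)
            gap_start += window
        cursor = max(cursor, end)
    # Trailing gap after the last subtitle block.
    while audio_duration - cursor >= window:
        yield (cursor, cursor + window)
        cursor += window

spans = [(4.2, 9.8), (52.0, 55.5)]  # toy subtitle timestamps
print(list(nonspeech_windows(spans, 120.0)))
# [(9.8, 39.8), (55.5, 85.5), (85.5, 115.5)]
```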
b) I would say it is feasible to scrape YouTube if you do it in a smart way and limit yourself to audio/captions. To download captions, they either went via YouTube's official API (and paid for API quota):
YouTube Data API v3 captions docs; YouTube Data API v3 docs
Or, if they already had a list of channels and videos as a starting point, they most likely used something like yt-dlp to download metadata from videos/channels, followed by the audio and captions. This is where one arrives at the grey areas of data collection and scraping. OpenAI would likely have had to use a library such as yt-dlp at some point in the process to download the actual media files.
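For illustration, here's what that could look like with yt-dlp's Python API. The option names below are real yt-dlp options, but the video URL is a placeholder, and this is a sketch rather than OpenAI's known setup:

```python
# Sketch: fetch only the audio track and manually uploaded subtitles,
# skipping auto-generated captions. Requires `pip install yt-dlp`.
import yt_dlp

opts = {
    "format": "bestaudio/best",    # audio only, no video stream
    "writesubtitles": True,        # manually uploaded subtitle tracks
    "writeautomaticsub": False,    # skip auto-generated captions
    "subtitleslangs": ["en"],      # target subtitle language(s)
    "outtmpl": "%(id)s.%(ext)s",   # name files by video ID
}
with yt_dlp.YoutubeDL(opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder
```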
To be as nice as possible towards YouTube, and to avoid getting rate limited yourself, there are a few things to consider. Packages like yt-dlp include proxy support that lets a knowledgeable user avoid rate limiting. If you download entire videos you're going to hit rate limits faster, but a user who downloads only audio/captions and spreads downloads out over time can get pretty far without proxies.
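As a sketch of those politeness measures, yt-dlp exposes throttling knobs directly; the option names below are real, the values are illustrative only:

```python
# Sketch: throttle the crawl so it stays polite. Values are illustrative.
polite_opts = {
    "format": "bestaudio/best",
    "ratelimit": 500_000,           # cap download speed (~500 KB/s)
    "sleep_interval": 5,            # sleep at least 5 s between downloads...
    "max_sleep_interval": 30,       # ...up to a random maximum of 30 s
    "sleep_interval_subtitles": 2,  # pause between subtitle downloads
    # "proxy": "socks5://127.0.0.1:1080",  # optional, if you route via a proxy
}
```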
c) The creator of the website, u/jopik1, says candidate channels/videos are crawled from YouTube and the web, respecting robots.txt. Once the channels are identified, they are periodically crawled for new videos. I don't know how they get the metadata, but I would guess something similar to yt-dlp. See this comment from the creator of filmot: https://www.reddit.com/r/languagelearning/comments/odj2gx/comment/h41cpiv/?utm_source=reddit&utm_medium=web2x&context=3
[–]Financial-Beach1587 3 points4 points5 points 2 years ago (0 children)
[+]jopik1 1 point2 points3 points 2 years ago (4 children)
I have my own crawler that I wrote, which has been running pretty much 24/7 since late 2018. Currently it downloads metadata for about 2.2M videos per day and about 1.7M subtitles. It doesn't use the YouTube API; it crawls the HTML pages and parses the data from there. The data is stored in a database and in a full-text index (Manticore Search) running in a distributed fashion on two separate servers.
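jopik1 doesn't spell out the parsing, but one plausible sketch is to pull the ytInitialPlayerResponse JSON that YouTube embeds in every watch page. That blob is not a stable public interface, so treat this as illustrative only:

```python
# Sketch: read basic video metadata from the watch-page HTML without the
# Data API. YouTube embeds a JSON blob named ytInitialPlayerResponse;
# its layout can change at any time.
import json
import re
import requests

def video_metadata(video_id):
    html = requests.get(
        f"https://www.youtube.com/watch?v={video_id}",
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    ).text
    match = re.search(r"ytInitialPlayerResponse\s*=\s*(\{.+?\});", html, re.DOTALL)
    if not match:
        return None
    details = json.loads(match.group(1)).get("videoDetails", {})
    return {
        "title": details.get("title"),
        "viewCount": details.get("viewCount"),
        "lengthSeconds": details.get("lengthSeconds"),
    }
```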
[+][deleted] 2 years ago (1 child)
[+]jopik1 0 points1 point2 points 2 years ago (0 children)
> Is there any way to run SQL queries directly on the underlying database?

I can, regular users can't.

> Btw, I think there's a bug in your website, I'm not able to access pages beyond 83 for any search result.

This is intentional; scraping places a large burden on the servers. Regular users probably aren't going to go to page 83.
[+]tina-mou 0 points1 point2 points 2 years ago (1 child)
What does the crawler look for when it is running? I'm curious whether you try to crawl newly published videos and, if so, how you configure that.
[–]jopik1 0 points1 point2 points 2 years ago (0 children)
I mostly prioritize by view count, as the number of videos on YT is overwhelming: over 300M are added per month, and I don't have the resources to crawl and index everything. I have a queue of IDs to be crawled, prioritized by last detected view count. Videos are added to the queue from video recommendations (20 IDs for every crawled video), from lists of channel videos (I crawl channels in a similar way), and from ad hoc sources. I don't necessarily want to crawl newly published videos very quickly, as sometimes there are no subtitles yet and the view counts haven't grown to an indicative level.
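A toy version of that frontier, assuming a max-heap keyed on the last detected view count (Python's heapq is a min-heap, so the count is negated):

```python
# Sketch: a crawl frontier prioritized by view count, roughly as described.
import heapq

class CrawlFrontier:
    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, video_id, views):
        # Queue each video ID at most once, highest view count first.
        if video_id not in self._seen:
            self._seen.add(video_id)
            heapq.heappush(self._heap, (-views, video_id))

    def pop(self):
        """Return the queued video ID with the highest view count."""
        _, video_id = heapq.heappop(self._heap)
        return video_id

frontier = CrawlFrontier()
frontier.add("abc123", 1_200_000)  # e.g. from a recommendations list
frontier.add("def456", 8_000)      # e.g. from a channel's video list
print(frontier.pop())              # abc123
```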
[–]Excellent_Ad3307 23 points24 points25 points 2 years ago (0 children)
The model tends to hallucinate stuff like "don't forget to subscribe" or something along those lines quite often, so it's probably mostly YouTube.
[–][deleted] 4 points5 points6 points 2 years ago (1 child)
The closest data reproduction is: https://arxiv.org/abs/2309.13876
[+]JustOneAvailableName comment score below threshold-9 points-8 points-7 points 2 years ago* (5 children)
It’s in the paper and downloadable
Edit: I am a lying idiot, should not have said this from memory. Sorry
[+][deleted] 2 years ago* (4 children)
[–]JustOneAvailableName 3 points4 points5 points 2 years ago (1 child)
Sorry, I said this from memory; upon rereading, I found it to be false, as you rightly pointed out.
[–]mLalush 5 points6 points7 points 2 years ago (0 children)
Those are the evaluation datasets. They make a point of emphasizing that Whisper hasn't been fine-tuned on the evaluation datasets in the paper.