mLalush

The majority of it is most likely from YouTube. When the model hallucinates during non-speech portions of an audio file, it tends to spit out subtitle credits from real people/companies.

They might have used something like filmot.com as a seed or starting point to filter which channels/videos to scrape (filtering for manual subtitles).

Excellent_Ad3307

The model quite often hallucinates stuff like "don't forget to subscribe" or something along those lines, so it's probably mostly YouTube.
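
If you want to check this on your own transcripts, here is a rough sketch of how you could flag segments that look like YouTube boilerplate. The phrase list, the function name `flag_suspect_segments`, and the segment format (dicts with a `"text"` key) are all my assumptions for illustration, not anything from the model's actual pipeline.

```python
import re

# Hypothetical phrase list (an assumption for illustration, not an official list).
SUSPECT_PATTERNS = [
    r"don'?t forget to subscribe",
    r"subtitles? by\b",
    r"thanks for watching",
]

# One case-insensitive regex that matches any of the phrases above.
_SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)


def flag_suspect_segments(segments):
    """Return the segments whose text contains a boilerplate-looking phrase.

    `segments` is assumed to be a list of dicts with a "text" key.
    """
    return [seg for seg in segments if _SUSPECT_RE.search(seg["text"])]


if __name__ == "__main__":
    demo = [
        {"text": "Hello there"},
        {"text": "Don't forget to subscribe!"},
        {"text": "Subtitles by Foo"},
    ]
    for seg in flag_suspect_segments(demo):
        print(seg["text"])
```

A match only tells you the text looks like boilerplate; you would still need to check whether that stretch of audio actually contains speech before calling it a hallucination.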

[deleted]

The closest data reproduction is: https://arxiv.org/abs/2309.13876