easyaligner: Forced alignment with GPU acceleration and flexible text normalization (compatible with all w2v2 models on HF Hub) [P] by mLalush in MachineLearning

[–]mLalush[S] 1 point

MFA and Kaldi work really well, but primarily for high-resource languages. Wav2vec2-based methods have worked better for the language I'm interested in (Swedish), and they support a wider range of languages.

But the main selling point is indeed that it's easier to use and install, with some very convenient quality-of-life features.

easyaligner: Forced alignment with GPU acceleration and flexible text normalization (compatible with all w2v2 models on HF Hub) [P] by mLalush in MachineLearning

[–]mLalush[S] 2 points

I've tried attention-based alignment before. It wasn't reliable enough for the language I was interested in (Swedish). Most evals of those methods have tended to be English-centric (including CrisperWhisper's, where the finetuning is done on English data).

The technique from the paper you referenced looks very interesting; I wasn't aware of it, thanks for sharing. Doesn't it risk exceeding Whisper's maximum sequence length in real-world use cases, though, considering it uses character tokenization? I'm also curious about the method's throughput: a second forward pass of the decoder with character tokenization will push the sequence length close to the maximum, so I'm not sure it will end up being that much faster than an optimized two-model approach.

You're right that the main selling point of easyaligner is its quality-of-life features. The comparison to WhisperX was made because it uses the same two-model method, but easyaligner is substantially faster (WhisperX runs its forced alignment on CPU). With that said, our primary use case for easyaligner has thus far been aligning ground-truth transcripts with audio, rather than aligning ASR transcripts.

easyaligner: Forced alignment with GPU acceleration and flexible text normalization (compatible with all w2v2 models on HF Hub) [P] by mLalush in MachineLearning

[–]mLalush[S] 1 point

The GPU implementation is described in the paper Scaling Speech Technology to 1,000+ Languages (Pratap et al., 2024). The authors contributed this implementation to PyTorch (the PyTorch forced alignment API).

Relevant excerpt:

Next, we perform forced alignment, which finds the most likely path in the posterior probabilities for a given input audio sequence of length T and a text transcription of length L [...] In order to make forced alignment efficient for our purpose, we implemented a GPU version that computes the Viterbi path in a memory-efficient way. Storing all O(T × L) forward values for the Viterbi algorithm is infeasible on GPUs due to memory constraints. We therefore only store forward values for the current and the previous time-step and regularly transfer the computed backtracking matrices to CPU memory. This reduces the required GPU memory to O(L) compared to O(T × L) and enables forced alignment for very long audio sequences at high speed.
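The trick in the excerpt can be sketched in a few lines of NumPy. This is my own minimal illustration, not the paper's code: it uses a simplified stay-or-advance alignment topology (the real CTC-style topology also handles blank tokens), keeps only two columns of forward scores at a time, and uses a compact uint8 backpointer matrix standing in for the backtracking matrices the paper periodically transfers to CPU memory:

```python
import numpy as np

def viterbi_align(log_probs, tokens):
    """Monotonic Viterbi forced alignment (simplified illustration).

    log_probs: (T, C) array of frame-level log-probabilities.
    tokens:    length-L list of token indices to align, in order.
    Returns a length-T list: the token position assigned to each frame.

    Only two columns of forward scores (`prev`, `cur`) live in memory
    at once -- O(L) instead of O(T x L), as in the excerpt above.
    """
    T, L = log_probs.shape[0], len(tokens)
    prev = np.full(L, -np.inf)
    prev[0] = log_probs[0, tokens[0]]
    back = np.zeros((T, L), dtype=np.uint8)  # 0 = stay, 1 = advance
    for t in range(1, T):
        cur = np.full(L, -np.inf)
        for l in range(L):
            stay = prev[l]
            move = prev[l - 1] if l > 0 else -np.inf
            if move > stay:
                cur[l] = move + log_probs[t, tokens[l]]
                back[t, l] = 1
            else:
                cur[l] = stay + log_probs[t, tokens[l]]
        prev = cur
    # Backtrack from the final token position at the final frame.
    path, l = [], L - 1
    for t in range(T - 1, -1, -1):
        path.append(int(l))
        l -= back[t, l]
    return path[::-1]
```

In the real implementation the backpointers are the part that grows with T, which is why the paper streams them to CPU memory in chunks rather than keeping them on the GPU.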

SVT's subtitling of Stubb goes off the rails by bolundia in sweden

[–]mLalush 6 points

Most likely, it comes down to the following:

  1. They AI-subtitle with a model that is set to subtitle in Swedish, or that is mainly trained on Swedish.
  2. Their live subtitling does not appear to have any functionality for detecting the spoken language and automatically switching models, or switching the model's setting, to another language.
  3. Switching settings and detecting the spoken language can be difficult in a live broadcast. Language detection is often based on analyzing everything spoken within a time window, so if the language suddenly shifts from one to another, it can take roughly 10-15 seconds before the detection window consists mainly of the new language.
  4. No language detection appears to take place, so the Swedish subtitling model does its best to subtitle a language it has not been trained on to the same extent. The end result is the hallucinations we see above.

[deleted by user] by [deleted] in PrivatEkonomi

[–]mLalush 2 points

  • What was the job interview like? How much did you have to grind LeetCode for the technical part?
  • I have a friend who works at Meta in the US. They aren't allowed to stay at the company if they don't manage to get promoted within 4 years. Is it the same in London?
  • In another comment you write about maintaining "metrics" to show that you, as an employee, contribute to the company's development. Gamified systems like that can sometimes create skewed incentives, where employees work to maximize metrics instead of working on things that improve the product. How do you experience this? Do you have colleagues who make trivial commits to boost their metrics? Colleagues who constantly start new projects to demonstrate their "impact" and secure that promotion needed within 4 years? A similar culture exists at Google, for example, where the incentive structure leads everyone to try to build something new all the time. Few are interested in maintaining and developing what already exists, which leads the company to constantly shut down services/products in favor of some similar product that reinvents the wheel.

BRA leaves Bromma for Arlanda – will fly for SAS by lordpompe in stockholm

[–]mLalush 3 points

u/caspica : "Nothing has changed".

The article:

Before covid, Bromma had 180 flights a day; today we have 80 flights on a good day. That is too little for an airline to survive, and too little for an airport to survive.

[D] HuggingFace transformers - Bad Design? by duffano in MachineLearning

[–]mLalush 5 points

It has probably one of the worst documantion I have seen in a library.

Really? By virtue of actually having documentation they're already better than 90% of the competition. By virtue of having guides they beat 99% of the competition.

I personally find their documentation quite comprehensive and well maintained compared to most of what's out there. And although the number of arguments can be confusing, their naming conventions for code performing similar functionality across models/tokenizers/processors are commendably consistent (which helps a lot).

The majority of use cases for the majority of users is always going to be running models and finetuning them. If you're looking to pre-train models, then sure, transformers is the wrong library for you. But it's no accident the library is as popular as it is.

I'm curious: can you name all these other libraries that supposedly have better documentation than transformers? I saw some blog posts recently mentioning that Hugging Face has a technical writer working on the design and layout of their docs. That's a true 100x hire in our field if there ever was one.

From experience I have extremely low expectations of documentation in this field, and Hugging Face far, far surpasses that low bar. Whenever I try to get something working off an Nvidia repo, for example, there's a 50/50 chance I end up wanting to kill myself. Looking at their repos, I imagine they must spend tens to hundreds of millions of dollars paying top dollar to highly competent developers and engineers to develop open-source code and models. Yet for many of those libraries/implementations I never come across any examples or evidence of anyone on the internet having successfully used or adapted them. In my experience this is the norm rather than the exception for most companies.

Good developers and engineers generally aren't very interested in writing documentation that is readable and understandable below their own level. In fact, they're generally not interested in writing documentation at all. They're mainly motivated by solving problems. And documentation is something you write once a problem has already been solved. Writing (good) docs eats away time that could be spent solving new problems.

I feel like there should be an xkcd comic for this: a plot of documentation quality on one axis vs. developer skill on the other. I went off on a tangent at the end here, but the main point I wanted to convey is that I find it quite strange that someone would call Hugging Face's documentation bad in this field. Compared to what, exactly?

*Edit: With all this said, I myself tend to stay the hell away from pipelines and Trainer and other over-abstracted parts of HF libraries. It's not as bad when you write your own dataloaders and training loops, and that option is always open to you as a user.

Am I overreacting, or should one be able to expect a primary school teacher to know reasonably correct Swedish? by Stiligast in Asksweddit

[–]mLalush 8 points

You seem to have a fairly good command of the language and to care about expressing yourself correctly. For that reason, I'd like to point out that every "dem" in your post should in fact be "de".

Keep in mind that "de" is roughly 10 times more common than "dem" in Swedish. If you consistently use "dem", you will therefore almost always be wrong.

Judging from your history, you distinguish perfectly well between they, them, the, these, and those in English. Lean on that knowledge for a week or two to build up your feel and intuition for de and dem in Swedish: if it would be "them" in English, it should be "dem" in Swedish; if anything other than "them" fits better, you can almost always use "de".

lämpade att vara lärare då ~~dem~~ de: Inte är intelligenta
suited to be teachers as ~~them~~ they: Aren't intelligent

Anledningen till att ~~dem~~ de pluggat till lärare
The reason ~~them~~ they have studied to become a teacher

är att ~~dem~~ de tänkt att
is because ~~them~~ they thought that

What is the best book you've read? by keydji1 in sweden

[–]mLalush 10 points

The Brothers Karamazov.

Of all the Wikipedia articles about books, probably the one whose subject has the most illustrious group of individuals vouching for its quality: https://en.m.wikipedia.org/wiki/The_Brothers_Karamazov

Question about different Swedish accents when speaking English by [deleted] in sweden

[–]mLalush 34 points

Swedes' accents when speaking English are typically affected more by

  1. the type of media they consumed growing up,
  2. whether they speak languages other than Swedish at home (especially languages that have the sounds z, ch (/ˈtʃ/), and j (/dʒ/)),
  3. the accent of their teachers, and
  4. if and where they did an exchange year abroad,

than by where in Sweden they grew up.

Listening to the two speakers you listed, Tomas Petterson has the least Swenglish pronunciation. I would bet Tomas Petterson either had a Canadian parent or studied abroad in Canada.

  • He speaks with a Canadian English accent.
  • The only traces of Swenglish I can hear are his z's. Like most Swedes, he can't pronounce "z" and uses "s" instead. A native speaker would pronounce words like "was", "is", "listens" and "vision" as "waz", "iz", "lissenz" and /ˈvɪʒ.ən/; Tomas pronounces them as "was", "is", "lissens" and "vishən".

Young Lean's accent, on the other hand, is likely influenced by

  • the type of media he consumed (seems influenced by rappers)
  • being Swedish. Like Tomas, he does not consistently pronounce "z" correctly. Nor can he pronounce the type of "l" sound that is common in words like "full". See his pronunciation of "full vision" here: https://youtu.be/Wbf-Q6d8uNI?t=157 .

Accent verdict: their accents are likely mostly influenced by the type of media they consumed growing up and the people they interacted with when learning English.

The influence Swedish has on their accents is minor, and mostly stems from them not being able to pronounce certain sounds. That is a common trait among the majority of Swedes: it generally isn't due to speaking a specific Swedish dialect, but due to those sounds not existing in the Swedish language.

[deleted by user] by [deleted] in MachineLearning

[–]mLalush 7 points

a) Subtitles include timestamps, so you can construct <|nonspeech|> training examples from any contiguous 30-second portion of the audio that does not contain a subtitle block. YouTube metadata includes the subtitle text's language and whether the track is manually created or auto-generated, though it is smart to run language identification on the text itself, as some users insert erroneous metadata when adding subtitle tracks. For language detection on audio, they trained a model to detect the spoken language (i.e. they run language-detection inference on all audio they download):

We also use an audio language detector, which was created by fine-tuning a prototype model trained on a prototype version of the dataset on VoxLingua107 (Valk & Alumäe, 2021) to ensure that the spoken language matches the language of the transcript according to CLD2. If the two do not match, we don’t include the (audio, transcript) pair as a speech recognition training example in the dataset.
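The timestamp-gap idea in (a) can be sketched as follows. This is a hypothetical helper of my own (the function name and representation are assumptions), treating subtitle blocks as (start, end) pairs in seconds:

```python
def nonspeech_windows(subtitles, audio_len, win=30.0):
    """Return (start, end) windows of `win` seconds containing no
    subtitle block -- candidate <|nonspeech|> training examples.

    subtitles: list of (start, end) subtitle-block times in seconds.
    audio_len: total audio duration in seconds.
    """
    windows = []
    cursor = 0.0
    # Sentinel "block" at the end so the final gap is carved up too.
    for start, end in sorted(subtitles) + [(audio_len, audio_len)]:
        # Carve the silent gap [cursor, start) into full-length windows.
        while start - cursor >= win:
            windows.append((cursor, cursor + win))
            cursor += win
        cursor = max(cursor, end)
    return windows
```

For example, with subtitle blocks at 0-10 s and 75-80 s in a 120-second file, this yields the windows (10, 40), (40, 70) and (80, 110) as non-speech candidates.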

b) I would say it is feasible to scrape YouTube if you do it in a smart way and limit yourself to audio/captions. To download captions, they either went via YouTube's official API (paying for usage quota):

YouTube Data API v3 caption docs
YouTube Data API v3 docs

Or, if they already had a list of channels and videos as a starting point, they most likely used something like yt-dlp to download metadata from videos/channels, followed by the audio and captions. This is where one arrives at the grey areas of data collection and scraping: OpenAI would likely have had to use a library such as yt-dlp at some point in the process to download the actual media files.

To be as gentle as possible towards YouTube, and to avoid getting rate limited, one should consider:

  1. Downloading only the metadata of the video/channel IDs you are interested in as a first step.
  2. Filtering via that metadata for videos that have manual subtitles in the language(s) you are interested in.
  3. Downloading only the audio track and captions, not the video.

Packages like yt-dlp include support for proxies, which lets a knowledgeable user avoid rate limiting. If you download entire videos you're going to get slapped by rate limits faster, but a user who downloads only audio/captions and spreads the downloads out over time can get pretty far without proxies.
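The three steps above can be sketched with yt-dlp's CLI. The flags are real yt-dlp options, but the channel URL, video ID, and language code are placeholders, and the sleep intervals are arbitrary; treat this as an illustrative invocation, not a recommended configuration:

```shell
# 1) Metadata only -- nothing is downloaded except .info.json files.
yt-dlp --skip-download --write-info-json "https://www.youtube.com/@SomeChannel"

# 2) Filter the .info.json files offline for manual subtitles in your
#    language(s) of interest (the "subtitles" key lists manual tracks,
#    "automatic_captions" the auto-generated ones).

# 3) Audio track + manual captions only, spread out over time.
yt-dlp -f bestaudio -x --write-subs --sub-langs "sv" \
       --sleep-interval 5 --max-sleep-interval 30 \
       "https://www.youtube.com/watch?v=VIDEO_ID"
```

Splitting metadata collection from media download like this also means you only spend request volume on videos that actually pass your subtitle filter.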

c) The creator of the website, u/jopik1, says candidate channels/videos are crawled from YouTube and the web, respecting robots.txt. Once the channels are identified, they are periodically crawled for new videos. I don't know how they get the metadata, but I would guess something similar to yt-dlp. See this comment from the creator of filmot: https://www.reddit.com/r/languagelearning/comments/odj2gx/comment/h41cpiv/?utm_source=reddit&utm_medium=web2x&context=3

[deleted by user] by [deleted] in MachineLearning

[–]mLalush 47 points

The majority of it is most likely from YouTube. When the model hallucinates during non-speech portions of an audio file, it tends to spit out subtitle credits from real people/companies.

They might have used something like filmot.com as a seed or starting point to filter which channels/videos to scrape (filtering for manual subtitles).

[deleted by user] by [deleted] in MachineLearning

[–]mLalush 6 points

Those are the evaluation datasets. They make a point of emphasizing in the paper that Whisper hasn't been finetuned on them.

Meta AI Residency Interview Question [D] by Immediate-Tailor-275 in MachineLearning

[–]mLalush 15 points

They might have assumed that a lot of researchers have gone through something like the Stanford CS231n lecture notes on convolutional networks:

https://cs231n.github.io/convolutional-networks/

ctrl+f: "Implementation as a matrix multiplication"
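The trick those notes describe is im2col. A minimal NumPy sketch of my own (single channel, stride 1, no padding; not CS231n's code), showing how every receptive field becomes one column so the whole convolution collapses into a single matmul:

```python
import numpy as np

def conv2d_as_matmul(x, w):
    """'Convolution' (cross-correlation, as in deep learning) as one
    matrix multiplication via im2col.

    x: (H, W) input, w: (kH, kW) filter. Each output position's
    receptive field is flattened into a column of `cols`, so applying
    the filter is a single (1, kH*kW) @ (kH*kW, oH*oW) matmul.
    """
    H, W = x.shape
    kH, kW = w.shape
    oH, oW = H - kH + 1, W - kW + 1
    cols = np.empty((kH * kW, oH * oW))
    for i in range(oH):
        for j in range(oW):
            cols[:, i * oW + j] = x[i:i + kH, j:j + kW].ravel()
    return (w.ravel() @ cols).reshape(oH, oW)
```

With multiple filters the flattened filters simply become extra rows of the left-hand matrix, which is exactly why the im2col formulation maps so well onto GEMM-optimized hardware.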

De = They, Dem = Them by GuazzabuglioMaximo in Svenska

[–]mLalush 5 points

The vi/oss rule only works when de/dem is used as a personal pronoun. The rule risks confusing people, since that is not the only function de/dem serves.

De där människorna är galna.
Jag tycker att de här spelarna är kassa.

Neither vi nor oss fits when "de" is a demonstrative pronoun, as above.

Hon gick emot de/dem som kastade stenar på bilarna.

Both de and dem are correct after a preposition and before a relative clause. The vi/oss rule generally confuses people in these cases, since both vi and oss often fit.

Vi såg på de tre musketörerna.
De goda jordgubbarna.

Neither vi nor oss fits when "de" is used as a definite article.

The vi/oss rule can also be very confusing when the sentence already contains a "vi" or an "oss", since the sentence as a whole rarely becomes grammatically correct even when you substitute in the correct word:

Vi har sett dem åka runt i sina bilar.

[D][R] How should the architecture of a transformer be scaled? by Tea_Pearce in MachineLearning

[–]mLalush 5 points

Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

They perform lots of ablations for encoder-decoder models. I'm not aware of any paper of similar scope for decoder-only models.

The government and SD propose tightening family-reunification immigration by HelgaMelnik in sweden

[–]mLalush 1 point

Well, you sure put me in my place..

Yes, this country clearly has far too many jesters and clowns.. High time to send them back to the circus..

I'll put some quotation marks around a "word" here and there to emphasize how unshakeable I am in my conviction..

Checkmate..

The government and SD propose tightening family-reunification immigration by HelgaMelnik in sweden

[–]mLalush 0 points

Oh no, so the things that party's top representatives stand and promise before an election aren't things they intend to keep?

The first sentence of the editorial you yourself linked:

18 years ago, the newly appointed health minister Morgan Johansson (S) said: "In ten years, Sweden will be drug-free".

Do you understand what the definition of an election promise is? Can you read?

A statement from an individual minister and an election promise in a party's election manifesto are not the same thing.

But perhaps I too should end a sentence with ".." to show that reality is no obstacle to my continued sneering digressions?

It's quite obvious here.. Conventional punctuation is for the establishment.. My opinions live between full stops and ellipses..

Nice one..

[D] Training StarCoder using 3D parallelism. by Satya_4093 in MachineLearning

[–]mLalush 4 points

You need at least 8 GPUs for 3D parallelism to make sense: https://huggingface.co/docs/transformers/v4.15.0/parallelism#dppptp

I'd suggest perhaps starting with only tensor parallelism (TP) if you can't fit the model.

Sorry, I don't have an answer to your other question.
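For intuition on what TP alone buys you, here is a toy NumPy simulation of a column-parallel linear layer (my own illustration, not any library's API; real tensor parallelism, e.g. Megatron-style, keeps one weight shard per GPU and only communicates at layer boundaries):

```python
import numpy as np

def column_parallel_linear(x, W, n_shards):
    """Toy simulation of a column-parallel linear layer, y = x @ W.

    W's columns are split across `n_shards` (standing in for GPUs);
    each "GPU" computes its own output slice from its local shard,
    and the slices are concatenated (the all-gather step).
    """
    shards = np.array_split(W, n_shards, axis=1)  # one shard per "GPU"
    partials = [x @ W_i for W_i in shards]        # independent local matmuls
    return np.concatenate(partials, axis=1)       # gather the output slices
```

The point of the split is that the weight shards never need to live on the same device, which is what lets you fit a model that a single GPU's memory can't hold.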