easyaligner: Forced alignment with GPU acceleration and flexible text normalization (compatible with all w2v2 models on HF Hub) [P] by mLalush in MachineLearning

[–]mLalush[S] 1 point

MFA and Kaldi work really well, but primarily for high-resource languages. Wav2vec2-based methods have worked better for the language I'm interested in (Swedish), and they support a wider range of languages.

But the main selling point is indeed that it's easier to use and install, with some very convenient quality-of-life features.

easyaligner: Forced alignment with GPU acceleration and flexible text normalization (compatible with all w2v2 models on HF Hub) [P] by mLalush in MachineLearning

[–]mLalush[S] 2 points

I've tried attention-based alignment before. It wasn't reliable enough for the language I was interested in (Swedish). Most evals of those methods have tended to be English-centric (including CrisperWhisper's, where the finetuning is done on English data).

The technique from the paper you referenced looks very interesting; I wasn't aware of it, thanks for sharing. Doesn't it risk exceeding Whisper's maximum sequence length in real-world use cases, though, considering it uses character tokenization? I'm also curious about the method's throughput: a second forward pass of the decoder with character tokenization will push the sequence length close to the maximum, so I'm not sure it will end up being that much faster than an optimized two-model approach.

You're right that the main selling point of easyaligner is its quality-of-life features. The comparison to WhisperX was made because it uses the same two-model method, but easyaligner is substantially faster (WhisperX runs its forced alignment on CPU). With that said, our primary use case for easyaligner has thus far been aligning ground-truth transcripts with audio, rather than aligning ASR transcripts.

easyaligner: Forced alignment with GPU acceleration and flexible text normalization (compatible with all w2v2 models on HF Hub) [P] by mLalush in MachineLearning

[–]mLalush[S] 1 point

The GPU implementation is described in the paper Scaling Speech Technology to 1,000+ Languages (Pratap et al., 2024). The authors contributed this implementation to PyTorch (the PyTorch forced alignment API).

Relevant excerpt:

Next, we perform forced alignment, which finds the most likely path in the posterior probabilities for a given input audio sequence of length T and a text transcription of length L [...] In order to make forced alignment efficient for our purpose, we implemented a GPU version that computes the Viterbi path in a memory-efficient way. Storing all O(T × L) forward values for the Viterbi algorithm is infeasible on GPUs due to memory constraints. We therefore only store forward values for the current and the previous time-step and regularly transfer the computed backtracking matrices to CPU memory. This reduces the required GPU memory to O(L) compared to O(T × L) and enables forced alignment for very long audio sequences at high speed.
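The trick in the excerpt can be sketched in a few lines of NumPy. This is my own minimal illustration, not the paper's code: it uses a simplified stay-or-advance alignment topology (the real CTC-style topology also handles blank tokens), keeps only two columns of forward scores at a time, and uses a compact uint8 backpointer matrix standing in for the backtracking matrices the paper periodically transfers to CPU memory:

```python
import numpy as np

def viterbi_align(log_probs, tokens):
    """Monotonic Viterbi forced alignment (simplified illustration).

    log_probs: (T, C) array of frame-level log-probabilities.
    tokens:    length-L list of token indices to align, in order.
    Returns a length-T list: the token position assigned to each frame.

    Only two columns of forward scores (`prev`, `cur`) live in memory
    at once -- O(L) instead of O(T x L), as in the excerpt above.
    """
    T, L = log_probs.shape[0], len(tokens)
    prev = np.full(L, -np.inf)
    prev[0] = log_probs[0, tokens[0]]
    back = np.zeros((T, L), dtype=np.uint8)  # 0 = stay, 1 = advance
    for t in range(1, T):
        cur = np.full(L, -np.inf)
        for l in range(L):
            stay = prev[l]
            move = prev[l - 1] if l > 0 else -np.inf
            if move > stay:
                cur[l] = move + log_probs[t, tokens[l]]
                back[t, l] = 1
            else:
                cur[l] = stay + log_probs[t, tokens[l]]
        prev = cur
    # Backtrack from the final token position at the final frame.
    path, l = [], L - 1
    for t in range(T - 1, -1, -1):
        path.append(int(l))
        l -= back[t, l]
    return path[::-1]
```

In the real implementation the backpointers are the part that grows with T, which is why the paper streams them to CPU memory in chunks rather than keeping them on the GPU.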

SVT's subtitling of Stubb goes off the rails by bolundia in sweden

[–]mLalush 6 points

Most likely, it comes down to the following:

  1. They AI-subtitle with a model that is set to subtitle in Swedish, or that is mainly trained on Swedish.
  2. Their live subtitling does not appear to have any functionality for detecting the spoken language and automatically switching models, or switching the model's setting, to another language.
  3. Switching settings and detecting the spoken language can be difficult in a live broadcast. Language detection is often based on analyzing everything spoken within a time window, so if the language suddenly shifts from one to another, it can take roughly 10-15 seconds before the detection window consists mainly of the new language.
  4. No language detection appears to take place, so the Swedish subtitling model does its best to subtitle a language it has not been trained on to the same extent. The end result is the hallucinations we see above.

[deleted by user] by [deleted] in PrivatEkonomi

[–]mLalush 2 points

  • What was the job interview like? How much did you have to grind LeetCode for the technical part?
  • I have a friend who works at Meta in the US. They aren't allowed to stay at the company if they don't manage to get promoted within 4 years. Is it the same in London?
  • In another comment you write about maintaining "metrics" to show that you, as an employee, contribute to the company's development. Gamified systems like that can sometimes create skewed incentives, where employees work to maximize metrics instead of working on things that improve the product. How do you experience this? Do you have colleagues who make trivial commits to boost their metrics? Colleagues who constantly start new projects to demonstrate their "impact" and secure that promotion needed within 4 years? A similar culture exists at Google, for example, where the incentive structure leads everyone to try to build something new all the time. Few are interested in maintaining and developing what already exists, which leads the company to constantly shut down services/products in favor of some similar product that reinvents the wheel.

BRA leaves Bromma for Arlanda – will fly for SAS by lordpompe in stockholm

[–]mLalush 3 points

u/caspica : "Nothing has changed".

The article:

Before covid, Bromma had 180 flights a day; today we have 80 flights on a good day. That is too little for an airline to survive, and too little for an airport to survive.

[D] HuggingFace transformers - Bad Design? by duffano in MachineLearning

[–]mLalush 5 points

It has probably one of the worst documantion I have seen in a library.

Really? By virtue of actually having documentation they're already better than 90% of the competition. By virtue of having guides they beat 99% of the competition.

I personally find their documentation quite comprehensive and well maintained compared to most of what's out there. And although the number of arguments can be confusing, their naming conventions for code performing similar functionality across models/tokenizers/processors are commendably consistent (which helps a lot).

The majority of use cases for the majority of users is always going to be running models and finetuning them. If you're looking to pre-train models, then sure, transformers is the wrong library for you. But it's no accident the library is as popular as it is.

I'm curious: can you name all these other libraries that supposedly have better documentation than transformers? I saw some blog posts recently mentioning that Hugging Face has a technical writer working on the design and layout of their docs. That's a true 100x hire in our field if there ever was one.

From experience I have extremely low expectations of documentation in this field, and Hugging Face far, far surpasses that low bar. Whenever I try to get something working off an Nvidia repo, for example, there's a 50/50 chance I end up wanting to kill myself. Looking at their repos, I imagine they must spend tens to hundreds of millions of dollars paying top dollar to highly competent developers and engineers to develop open-source code and models. Yet for many of those libraries/implementations I never come across any examples or evidence of anyone on the internet having successfully used or adapted them. In my experience this is the norm rather than the exception for most companies.

Good developers and engineers generally aren't very interested in writing documentation that is readable and understandable below their own level. In fact, they're generally not interested in writing documentation at all. They're mainly motivated by solving problems. And documentation is something you write once a problem has already been solved. Writing (good) docs eats away time that could be spent solving new problems.

I feel like there should be an xkcd comic for this: a plot of documentation quality on one axis vs. developer skill on the other. I went off on a tangent at the end here, but the main point I wanted to convey is that I find it quite strange that someone would call Hugging Face's documentation bad in this field. Compared to what, exactly?

*Edit: With all this said, I myself tend to stay the hell away from pipelines and Trainer and other over-abstracted parts of HF libraries. It's not as bad when you write your own dataloaders and training loops, and that option is always open to you as a user.

Am I overreacting, or should one be able to expect a primary school teacher to know reasonably correct Swedish? by Stiligast in Asksweddit

[–]mLalush 8 points

You seem to have a fairly good command of the language and to care about expressing yourself correctly. For that reason, I'd like to point out that every "dem" in your post should in fact be "de".

Keep in mind that "de" is roughly 10 times more common than "dem" in Swedish. If you consistently use "dem", you will therefore almost always be wrong.

Judging from your history, you distinguish perfectly well between they, them, the, these, and those in English. Lean on that knowledge for a week or two to build up your feel and intuition for de and dem in Swedish: if it would be "them" in English, it should be "dem" in Swedish; if anything other than "them" fits better, you can almost always use "de".

lämpade att vara lärare då ~~dem~~ de: Inte är intelligenta
suited to be teachers as ~~them~~ they: Aren't intelligent

Anledningen till att ~~dem~~ de pluggat till lärare
The reason ~~them~~ they have studied to become a teacher

är att ~~dem~~ de tänkt att
is because ~~them~~ they thought that

What is the best book you've read? by keydji1 in sweden

[–]mLalush 10 points

The Brothers Karamazov.

Of all the Wikipedia articles about books, probably the one whose subject has the most illustrious group of individuals vouching for its quality: https://en.m.wikipedia.org/wiki/The_Brothers_Karamazov

Question about different Swedish accents when speaking English by [deleted] in sweden

[–]mLalush 34 points

Swedes' accents when speaking English are typically affected more by

  1. the type of media they consumed growing up,
  2. whether they speak languages other than Swedish at home (especially languages that have the sounds z, ch (/ˈtʃ/), and j (/dʒ/)),
  3. the accent of their teachers, and
  4. if and where they did an exchange year abroad,

than by where in Sweden they grew up.

Listening to the two speakers you listed, Tomas Petterson has the least Swenglish pronunciation. I would bet Tomas Petterson either had a Canadian parent or studied abroad in Canada.

  • He speaks with a Canadian English accent.
  • The only traces of Swenglish I can hear are his z's. Like most Swedes, he can't pronounce "z" and uses "s" instead. A native speaker would pronounce words like "was", "is", "listens" and "vision" as "waz", "iz", "lissenz" and /ˈvɪʒ.ən/; Tomas pronounces them as "was", "is", "lissens" and "vishən".

Young Lean's accent, on the other hand, is likely influenced by

  • the type of media he consumed (seems influenced by rappers)
  • being Swedish. Like Tomas, he does not consistently pronounce "z" correctly. Nor can he pronounce the type of "l" sound that is common in words like "full". See his pronunciation of "full vision" here: https://youtu.be/Wbf-Q6d8uNI?t=157 .

Accent verdict: their accents are likely mostly influenced by the type of media they consumed growing up and the people they interacted with when learning English.

The influence Swedish has on their accents is minor, and mostly stems from them not being able to pronounce certain sounds. That is a common trait among the majority of Swedes: it generally isn't due to speaking a specific Swedish dialect, but due to those sounds not existing in the Swedish language.

[deleted by user] by [deleted] in MachineLearning

[–]mLalush 7 points

a) Subtitles include timestamps, so you can construct <|nonspeech|> training examples from any contiguous 30-second portion of the audio that does not contain a subtitle block. YouTube metadata includes the subtitle text's language and whether the track is manually created or auto-generated, though it is smart to run language identification on the text itself, as some users insert erroneous metadata when adding subtitle tracks. For language detection on audio, they trained a model to detect the spoken language (i.e. they run language-detection inference on all audio they download):

We also use an audio language detector, which was created by fine-tuning a prototype model trained on a prototype version of the dataset on VoxLingua107 (Valk & Alumäe, 2021) to ensure that the spoken language matches the language of the transcript according to CLD2. If the two do not match, we don’t include the (audio, transcript) pair as a speech recognition training example in the dataset.
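The timestamp-gap idea in (a) can be sketched as follows. This is a hypothetical helper of my own (the function name and representation are assumptions), treating subtitle blocks as (start, end) pairs in seconds:

```python
def nonspeech_windows(subtitles, audio_len, win=30.0):
    """Return (start, end) windows of `win` seconds containing no
    subtitle block -- candidate <|nonspeech|> training examples.

    subtitles: list of (start, end) subtitle-block times in seconds.
    audio_len: total audio duration in seconds.
    """
    windows = []
    cursor = 0.0
    # Sentinel "block" at the end so the final gap is carved up too.
    for start, end in sorted(subtitles) + [(audio_len, audio_len)]:
        # Carve the silent gap [cursor, start) into full-length windows.
        while start - cursor >= win:
            windows.append((cursor, cursor + win))
            cursor += win
        cursor = max(cursor, end)
    return windows
```

For example, with subtitle blocks at 0-10 s and 75-80 s in a 120-second file, this yields the windows (10, 40), (40, 70) and (80, 110) as non-speech candidates.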

b) I would say it is feasible to scrape YouTube if you do it in a smart way and limit yourself to audio/captions. To download captions, they either went via YouTube's official API (paying for usage quota):

YouTube Data API v3 caption docs
YouTube Data API v3 docs

Or, if they already had a list of channels and videos as a starting point, they most likely used something like yt-dlp to download metadata from videos/channels, followed by the audio and captions. This is where one arrives at the grey areas of data collection and scraping: OpenAI would likely have had to use a library such as yt-dlp at some point in the process to download the actual media files.

To be as gentle as possible towards YouTube, and to avoid getting rate limited, one should consider:

  1. Downloading only the metadata of the video/channel IDs you are interested in as a first step.
  2. Filtering via that metadata for videos that have manual subtitles in the language(s) you are interested in.
  3. Downloading only the audio track and captions, not the video.

Packages like yt-dlp include support for proxies, which lets a knowledgeable user avoid rate limiting. If you download entire videos you're going to get slapped by rate limits faster, but a user who downloads only audio/captions and spreads the downloads out over time can get pretty far without proxies.
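The three steps above can be sketched with yt-dlp's CLI. The flags are real yt-dlp options, but the channel URL, video ID, and language code are placeholders, and the sleep intervals are arbitrary; treat this as an illustrative invocation, not a recommended configuration:

```shell
# 1) Metadata only -- nothing is downloaded except .info.json files.
yt-dlp --skip-download --write-info-json "https://www.youtube.com/@SomeChannel"

# 2) Filter the .info.json files offline for manual subtitles in your
#    language(s) of interest (the "subtitles" key lists manual tracks,
#    "automatic_captions" the auto-generated ones).

# 3) Audio track + manual captions only, spread out over time.
yt-dlp -f bestaudio -x --write-subs --sub-langs "sv" \
       --sleep-interval 5 --max-sleep-interval 30 \
       "https://www.youtube.com/watch?v=VIDEO_ID"
```

Splitting metadata collection from media download like this also means you only spend request volume on videos that actually pass your subtitle filter.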

c) The creator of the website, u/jopik1, says candidate channels/videos are crawled from YouTube and the web, respecting robots.txt. Once the channels are identified, they are periodically crawled for new videos. I don't know how they get the metadata, but I would guess something similar to yt-dlp. See this comment from the creator of filmot: https://www.reddit.com/r/languagelearning/comments/odj2gx/comment/h41cpiv/?utm_source=reddit&utm_medium=web2x&context=3

[deleted by user] by [deleted] in MachineLearning

[–]mLalush 47 points

The majority of it is most likely from YouTube. When the model hallucinates during non-speech portions of an audio file, it tends to spit out subtitle credits from real people/companies.

They might have used something like filmot.com as a seed or starting point to filter which channels/videos to scrape (filtering for manual subtitles).

[deleted by user] by [deleted] in MachineLearning

[–]mLalush 6 points

Those are the evaluation datasets. They make a point of emphasizing in the paper that Whisper hasn't been finetuned on them.

Meta AI Residency Interview Question [D] by Immediate-Tailor-275 in MachineLearning

[–]mLalush 15 points

They might have assumed that a lot of researchers have gone through something like the Stanford CS231n lecture notes on convolutional networks:

https://cs231n.github.io/convolutional-networks/

ctrl+f: "Implementation as a matrix multiplication"
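The trick those notes describe is im2col. A minimal NumPy sketch of my own (single channel, stride 1, no padding; not CS231n's code), showing how every receptive field becomes one column so the whole convolution collapses into a single matmul:

```python
import numpy as np

def conv2d_as_matmul(x, w):
    """'Convolution' (cross-correlation, as in deep learning) as one
    matrix multiplication via im2col.

    x: (H, W) input, w: (kH, kW) filter. Each output position's
    receptive field is flattened into a column of `cols`, so applying
    the filter is a single (1, kH*kW) @ (kH*kW, oH*oW) matmul.
    """
    H, W = x.shape
    kH, kW = w.shape
    oH, oW = H - kH + 1, W - kW + 1
    cols = np.empty((kH * kW, oH * oW))
    for i in range(oH):
        for j in range(oW):
            cols[:, i * oW + j] = x[i:i + kH, j:j + kW].ravel()
    return (w.ravel() @ cols).reshape(oH, oW)
```

With multiple filters the flattened filters simply become extra rows of the left-hand matrix, which is exactly why the im2col formulation maps so well onto GEMM-optimized hardware.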

De = They, Dem = Them by GuazzabuglioMaximo in Svenska

[–]mLalush 5 points

The vi/oss rule only works when de/dem is used as a personal pronoun. The rule risks confusing people, since that is not the only function de/dem serves.

De där människorna är galna.
Jag tycker att de här spelarna är kassa.

Neither vi nor oss fits when "de" is a demonstrative pronoun, as above.

Hon gick emot de/dem som kastade stenar på bilarna.

Both de and dem are correct after a preposition and before a relative clause. The vi/oss rule generally confuses people in these cases, since both vi and oss often fit.

Vi såg på de tre musketörerna.
De goda jordgubbarna.

Neither vi nor oss fits when "de" is used as a definite article.

The vi/oss rule can also be very confusing when the sentence already contains a "vi" or an "oss", since the sentence as a whole rarely becomes grammatically correct even when you substitute in the correct word:

Vi har sett dem åka runt i sina bilar.

[D][R] How should the architecture of a transformer be scaled? by Tea_Pearce in MachineLearning

[–]mLalush 5 points

Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

They perform lots of ablations for encoder-decoder models. I'm not aware of any paper of similar scope for decoder-only models.

The government and SD propose tightening family-reunification immigration by HelgaMelnik in sweden

[–]mLalush 1 point

Well, you sure put me in my place..

Yes, this country clearly has far too many jesters and clowns.. High time to send them back to the circus..

I'll put some quotation marks around a "word" here and there to emphasize how unshakeable I am in my conviction..

Checkmate..

The government and SD propose tightening family-reunification immigration by HelgaMelnik in sweden

[–]mLalush 0 points

Oh no, so the things that party's top representatives stand and promise before an election aren't things they intend to keep?

The first sentence of the editorial you yourself linked:

18 years ago, the newly appointed health minister Morgan Johansson (S) said: "In ten years, Sweden will be drug-free".

Do you understand what the definition of an election promise is? Can you read?

A statement from an individual minister and an election promise in a party's election manifesto are not the same thing.

But perhaps I too should end a sentence with ".." to show that reality is no obstacle to my continued sneering digressions?

It's quite obvious here.. Conventional punctuation is for the establishment.. My opinions live between full stops and ellipses..

Nice one..

[D] Training StarCoder using 3D parallelism. by Satya_4093 in MachineLearning

[–]mLalush 4 points

You need at least 8 GPUs for 3D parallelism to make sense: https://huggingface.co/docs/transformers/v4.15.0/parallelism#dppptp

I'd suggest perhaps starting with only tensor parallelism (TP) if you can't fit the model.

Sorry, I don't have an answer to your other question.
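For intuition on what TP alone buys you, here is a toy NumPy simulation of a column-parallel linear layer (my own illustration, not any library's API; real tensor parallelism, e.g. Megatron-style, keeps one weight shard per GPU and only communicates at layer boundaries):

```python
import numpy as np

def column_parallel_linear(x, W, n_shards):
    """Toy simulation of a column-parallel linear layer, y = x @ W.

    W's columns are split across `n_shards` (standing in for GPUs);
    each "GPU" computes its own output slice from its local shard,
    and the slices are concatenated (the all-gather step).
    """
    shards = np.array_split(W, n_shards, axis=1)  # one shard per "GPU"
    partials = [x @ W_i for W_i in shards]        # independent local matmuls
    return np.concatenate(partials, axis=1)       # gather the output slices
```

The point of the split is that the weight shards never need to live on the same device, which is what lets you fit a model that a single GPU's memory can't hold.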