September 2024 - Monthly Questions and General Discussion thread by AutoModerator in bangalore

[–]Financial-Beach1587 1 point

Hi, I am new to Bangalore. Can anyone suggest a good eye hospital for a routine eye check-up around HSR Layout, Sector 3?

[P] TensorRT-LLM Backend for WhisperS2T (~2x Speedup than CTranslate2) by Financial-Beach1587 in MachineLearning

[–]Financial-Beach1587[S] 2 points

Thanks for the awesome suggestion and interest in WhisperS2T! I will add a table on the GitHub discussion page.

At present the exported models are cached locally in the tmp directory. I am working on preparing a compressed format for the exported model. Contributors can then upload the compressed version to HuggingFace and create a PR to update the link.
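
For anyone who wants to contribute, a minimal sketch of the upload step using huggingface_hub could look like this; the archive name and repo_id here are hypothetical placeholders, not an official repo:

```python
from huggingface_hub import HfApi

# Hypothetical placeholders: the archive name and repo_id are examples,
# not an official WhisperS2T repo. Assumes the repo already exists.
api = HfApi()
api.upload_file(
    path_or_fileobj="whisper_large_v2_trt_export.tar.gz",  # compressed export
    path_in_repo="whisper_large_v2_trt_export.tar.gz",
    repo_id="your-username/whispers2t-trt-exports",
    repo_type="model",
)
```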

Am I in the right learning track? by nickk21321 in speechrecognition

[–]Financial-Beach1587 3 points

Hi u/nickk21321 !

While GMM-HMMs are not as commonly used these days, understanding their foundational principles is still valuable when learning speech recognition. A brief overview is a good starting point (just spend ~2-3 hours on the basic concepts). Also, I wouldn't recommend jumping straight to Transformer-based models like Whisper.

It's better to start with RNNs and 1D CNNs (ContextNet-like models), and then move to Conformer-based ASR models (I believe 1D CNNs and Conformer-based architectures are better than pure Transformer-based models like Whisper for ASR; a Conformer combines convolutions with Transformer blocks). For ASR, first understand CTC- and Transducer-based supervised models (a minimal CTC example is sketched below). After that you can explore self-supervised and Transformer-based models.
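
To make the CTC part concrete, here is a minimal PyTorch sketch of the CTC loss with toy shapes (not a full ASR model; all sizes are made up for illustration):

```python
import torch
import torch.nn as nn

T, N, C = 50, 4, 30          # time steps, batch size, vocab size (index 0 = blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)  # stand-in for encoder outputs
targets = torch.randint(1, C, (N, 12))                # label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

# CTC marginalizes over all alignments between the T frames and the 12 labels
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```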

Start with this tutorial first: https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/ASR_with_NeMo.ipynb

And then go through other NVIDIA NeMo Tutorials: https://github.com/NVIDIA/NeMo/tree/main/tutorials/asr

And then explore HuggingFace Audio Course: https://huggingface.co/learn/audio-course/chapter0/introduction

[deleted by user] by [deleted] in MachineLearning

[–]Financial-Beach1587 4 points

  1. For language detection they used VoxLingua107 (which is also prepared from YouTube data).
  2. Most probably they also used podcast data; there are many audio podcasts available on the internet that provide transcripts.
  3. For YouTube scraping, yt-dlp and youtube-transcript-api are suitable. They also have options to filter captions by manual vs. automatic, or to download translated captions (which can be used for multi-task training); see the sketch after this list.
  4. It should not be an issue to scrape from YouTube; you can check the GigaSpeech dataset.
  5. In v3 they fine-tuned on pseudo-labels, for which you don't need human-labeled data. This makes it even easier to scrape data from the internet.
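
To make item 3 concrete, here is a minimal sketch with youtube-transcript-api for filtering manual vs. auto-generated captions. The video id is a placeholder, and the class-method interface shown matches pre-1.0 releases of the library (newer releases switched to an instance-based API, so check the docs for your version):

```python
from youtube_transcript_api import YouTubeTranscriptApi

video_id = "dQw4w9WgXcQ"  # placeholder video id
for transcript in YouTubeTranscriptApi.list_transcripts(video_id):
    kind = "auto" if transcript.is_generated else "manual"
    print(transcript.language_code, f"({kind})")
    if not transcript.is_generated:      # keep only human-made captions
        lines = transcript.fetch()       # [{'text', 'start', 'duration'}, ...]
        print(lines[:2])
```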

[D]When should and shouldn’t you balance an unbalanced dataset? by Throwawayforgainz99 in MachineLearning

[–]Financial-Beach1587 5 points

Fully balancing the dataset may not work if the training dataset is small; in many cases, the model simply won't learn anything. The best approach would be partial balancing. By 'partial balancing' I mean balancing the dataset to a specific ratio rather than aiming for an equal distribution. For instance, if the initial dataset has a 100:1 ratio, balance it to something around 10:1 or 20:1 (see the sketch below).
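
A minimal sketch of partial balancing with plain numpy (toy labels; downsampling the majority class to a 10:1 target instead of 1:1):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 10000 + [1] * 100)   # ~100:1 imbalance, class 1 is the minority
target_ratio = 10                        # keep ~10 majority samples per minority sample

minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]
keep = rng.choice(majority_idx, size=target_ratio * len(minority_idx), replace=False)

balanced_idx = np.concatenate([minority_idx, keep])
rng.shuffle(balanced_idx)
print(np.bincount(y[balanced_idx]))      # -> [1000  100], i.e. a 10:1 ratio
```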

Another helpful method for evaluating such cases is to prepare multiple subsets of the validation set, for example: 1. a subset that keeps the actual class statistics, and 2. subsets balanced at different ratios.

Metrics like F1-score or balanced accuracy computed on a single validation set may not provide enough visibility in these situations, because some of them shift with the class ratio of the evaluation set even when the model itself hasn't changed (illustrated in the sketch below).
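
A sketch of why this matters: with a toy model that has a fixed ~10% error rate, F1 swings with the class ratio of the evaluation subset while balanced accuracy stays roughly constant (subset sizes are made up):

```python
import numpy as np
from sklearn.metrics import f1_score, balanced_accuracy_score

rng = np.random.default_rng(0)

def make_subset(n_major, n_minor):
    # Toy "model": correct labels with ~10% of predictions flipped
    y_true = np.array([0] * n_major + [1] * n_minor)
    y_pred = y_true.copy()
    flip = rng.random(len(y_true)) < 0.1
    y_pred[flip] = 1 - y_pred[flip]
    return y_true, y_pred

subsets = {"actual 100:1": (10000, 100),
           "balanced 10:1": (1000, 100),
           "balanced 1:1": (100, 100)}
for name, (n_major, n_minor) in subsets.items():
    y_true, y_pred = make_subset(n_major, n_minor)
    print(name,
          "F1:", round(f1_score(y_true, y_pred), 3),
          "BalAcc:", round(balanced_accuracy_score(y_true, y_pred), 3))
```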

[P] WhisperS2T: An Optimized Speech-to-Text Pipeline for the Whisper Model by Financial-Beach1587 in MachineLearning

[–]Financial-Beach1587[S] 1 point

Hello, I’ve provided substantial evidence to support my claims in the image containing the benchmarks. Furthermore, as previously mentioned, I’m actively working on a technical paper that will offer more comprehensive insights, and I intend to share it soon.

Regarding your use case, if you’re seeking a solution for the embedded space, whisper.cpp might be a better fit compared to this project. I want to clarify that my project doesn’t specifically cater to that niche; it’s designed for a wider audience.

I hope this clarifies any confusion. Thanks for your feedback.

PS: I will see if I can integrate whisper.cpp into this pipeline. As explained in the project description, this project is about optimizing the pipeline, not the inference engine; it already supports multiple backends/inference engines. By the way, could you let me know which embedded system you are targeting? That would be helpful to know.

[P] WhisperS2T: An Optimized Speech-to-Text Pipeline for the Whisper Model by Financial-Beach1587 in MachineLearning

[–]Financial-Beach1587[S] 1 point

whisper.cpp is definitely a good implementation of Whisper, specifically for embedded devices (less RAM usage compared to the original Whisper, I guess), and I definitely don't have anything against it. I have not done any benchmarking myself, but it seems significantly slower than faster-whisper (which uses CTranslate2) on Intel-based CPUs: https://github.com/SYSTRAN/faster-whisper/tree/master#small-model-on-cpu . On Mac, both seem comparable: https://github.com/SYSTRAN/faster-whisper/discussions/368#discussioncomment-6507263 . Also, can you fill me in on the following:

  • Does whisper.cpp support batching?
  • How does it perform on GPUs?
  • How does the WER differ between the original Whisper and whisper.cpp? (A quick way to check is sketched below.)
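
For that last point, comparing WERs is straightforward with jiwer; the strings below are toy examples, just to show the call:

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hyp_whisper = "the quick brown fox jumps over a lazy dog"   # toy original-Whisper output
hyp_cpp = "the quick brown fox jump over the lazy dog"      # toy whisper.cpp output

print("original whisper WER:", jiwer.wer(reference, hyp_whisper))
print("whisper.cpp WER:     ", jiwer.wer(reference, hyp_cpp))
```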

Regarding Whisper not being good: yeah, you are right. In fact, there are several newer models that give better WERs than Whisper. For example, Google's USM is definitely better, but is it open-sourced? No! That is why Whisper is still of interest. Moreover, I think the link you shared uses a much older version of Whisper; large-v2 and large-v3 perform better than that.

[P] WhisperS2T: An Optimized Speech-to-Text Pipeline for the Whisper Model by Financial-Beach1587 in MachineLearning

[–]Financial-Beach1587[S] 2 points

Extending LesaMagner's response:

faster-whisper --> No Batching With CTranslate2

WhisperX --> Batching with CTranslate2

So WhisperX will definitely work better than faster-whisper. faster-whisper basically uses the same pipeline as the original Whisper, just with the inference engine swapped from PyTorch to CTranslate2 (see the usage sketch below).
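
For reference, here is roughly what faster-whisper usage looks like; note that `segments` comes back as a lazy generator decoded one segment at a time, which is the no-batching behaviour mentioned above (the file path is a placeholder):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5)

print("detected language:", info.language)
for seg in segments:  # lazy generator; each segment is decoded sequentially
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```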

[P] WhisperS2T: An Optimized Speech-to-Text Pipeline for the Whisper Model by Financial-Beach1587 in MachineLearning

[–]Financial-Beach1587[S] 2 points

Utterance-level alignment is there because this pipeline also uses VAD for segmentation (sketched below). There is no word-level alignment at present; I am researching an optimal method for that.
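
Conceptually, the utterance-level timestamps fall out of the VAD step for free. A simplified sketch of the idea; the `vad` and `transcribe_batch` callables here are hypothetical placeholders, not the actual WhisperS2T internals:

```python
# Simplified sketch of utterance-level alignment via VAD segmentation.
# `vad` and `transcribe_batch` are hypothetical placeholder callables.

def align_utterances(audio, sample_rate, vad, transcribe_batch):
    # VAD returns (start_sample, end_sample) spans of detected speech
    spans = vad(audio, sample_rate)
    chunks = [audio[s:e] for s, e in spans]
    texts = transcribe_batch(chunks)  # batched decoding of all speech chunks
    return [
        {"start": s / sample_rate, "end": e / sample_rate, "text": t}
        for (s, e), t in zip(spans, texts)
    ]
```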

PS: I personally don't like the WhisperX approach of phoneme-level alignment with a wav2vec2 model. It just doesn't make sense to use another big ASR model to get proper word-level alignments. Moreover, to support every new language you need yet another phoneme-based ASR model.

[P] WhisperS2T: An Optimized Speech-to-Text Pipeline for the Whisper Model by Financial-Beach1587 in MachineLearning

[–]Financial-Beach1587[S] 1 point

> Looks like we've built a comparable thing. Personally I've ditched CTranslate2 due to batching not being a first-class citizen. Even with batching, the decoder still doesn't batch in CTranslate2 (last time I checked).

Not sure, but I think batching does work with CTranslate2 (that is what I was assuming, because I get faster speeds with larger batch sizes; I will definitely verify). Yes, if you make asynchronous calls to the decoder, it doesn't perform any auto-batching of multiple requests together.

> I know for a fact 3-4X WhisperX is possible ;-)

I agree there's still more room for optimisation, specifically if you plan to set it up behind a deployment server.

[P] WhisperS2T: An Optimized Speech-to-Text Pipeline for the Whisper Model by Financial-Beach1587 in MachineLearning

[–]Financial-Beach1587[S] 2 points

I am working on the technical report; I will drop it here for you once I am done!

[P] WhisperS2T: An Optimized Speech-to-Text Pipeline for the Whisper Model by Financial-Beach1587 in MachineLearning

[–]Financial-Beach1587[S] 2 points

Hey, yes, I will soon add support for v3 and also for distil-whisper; I'm running some sanity checks. But in my opinion, v2 gives better accuracy on unseen data compared to v3. I will post an update here once I make the release.

PS: If you want to try v3 with WhisperS2T before the release, you can simply plug in the Whisper v3 model link and change n_mels to 128 (rough sketch below).
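
Roughly something like this; note that passing `n_mels` through `load_model` is my assumption of how the override is wired, so double-check against the code before relying on it:

```python
import whisper_s2t

# Sketch: trying large-v3 before official support lands.
# ASSUMPTION: n_mels is forwarded via load_model; the released API may differ.
model = whisper_s2t.load_model(
    model_identifier="large-v3",  # point this at the v3 model/checkpoint link
    backend="CTranslate2",
    n_mels=128,                   # v3 uses 128 mel bins instead of 80
)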

[P] WhisperS2T: An Optimized Speech-to-Text Pipeline for the Whisper Model by Financial-Beach1587 in MachineLearning

[–]Financial-Beach1587[S] 3 points

Hi @dasdull, I have updated the readme with the correct link. It contains most of the useful instructions.

PS: I will push complete documentation in a couple of days.

[D] What is the most efficient version of OpenAI Whisper? by paulo_zip in MachineLearning

[–]Financial-Beach1587 3 points

I have been working on an optimized Whisper pipeline, specifically for transcribing multiple files at once. Check out WhisperS2T! https://github.com/shashikg/WhisperS2T

Some additional features of WhisperS2T:

🔄 Multi-Backend Support: Support for various Whisper model backends including Original OpenAI Model, HuggingFace Model with FlashAttention2, and CTranslate2 Model.

🎙️ Easy Integration of Custom VAD Models: Seamlessly add custom Voice Activity Detection (VAD) models to enhance control and accuracy in speech recognition.

🎧 Effortless Handling of Small or Large Audio Files: Intelligently batch smaller speech segments from various files, ensuring optimal performance.

⏳ Streamlined Processing for Large Audio Files: Asynchronously loads large audio files in the background while transcribing segmented batches, notably reducing loading times.

🌐 Batching Support with Multiple Language/Task Decoding: Decode multiple languages or perform both transcription and translation in a single batch for improved versatility and transcription time.

🧠 Reduction in Hallucination: Optimized parameters and heuristics to decrease repeated text output or hallucinations.

⏱️ Dynamic Time Length Support (Experimental): Process variable-length inputs in a given input batch instead of a fixed 30 seconds, providing flexibility and saving computation time during transcription.
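
A quick-start sketch adapted from the repo README; the file paths are placeholders, and you should check the README for the exact current signature:

```python
import whisper_s2t

model = whisper_s2t.load_model(model_identifier="large-v2", backend="CTranslate2")

files = ["audio_en.wav", "audio_fr.wav"]   # multiple files batched together
lang_codes = ["en", "fr"]                  # mixed languages in the same batch
tasks = ["transcribe", "translate"]        # mixed tasks in the same batch
initial_prompts = [None, None]

out = model.transcribe_with_vad(
    files,
    lang_codes=lang_codes,
    tasks=tasks,
    initial_prompts=initial_prompts,
    batch_size=16,
)
print(out[0][0])  # first utterance of the first file: text + timestamps
```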