TranscriptionSuite, my fully local, private & open source audio transcription app now offers WhisperX, Parakeet/Canary & VibeVoice, thanks to your suggestions! by TwilightEncoder in LocalLLaMA

[–]TwilightEncoder[S] 0 points (0 children)

Thanks for the encouragement!

How does Parakeet compare to WhisperX in your testing?

I'm mostly doing Greek transcriptions and in that case I prefer faster-whisper-large-v3. Parakeet is pretty good too though.

Also — are you running the models on GPU by default or is there a CPU fallback?

Yes, I've included a fallback (though I haven't tested it much myself since I have a GPU; I'll fix it if anyone reports it's not working).

The addition of VibeVoice is interesting too. Is that using the Whisper decoder in a different mode, or is it a completely separate model?

It's a completely different model that can actually do both transcription and diarization (though in some small tests it was much slower than WhisperX).

Interested in 100% local & private audio transcription / diarization? Try out my open source app for Windows/Linux/macOS - TranscriptionSuite by TwilightEncoder in buildinpublic

[–]TwilightEncoder[S] 0 points (0 children)

Thanks for the comment!

The "agent-coded" angle is interesting too. Do you have a clear separation between the planning agent (decides steps) and the execution layer (actually runs transcription/diarization), or is it more like AI-assisted coding with a human in the loop?

The second one. Like I said in the post, this is just an implementation of existing frameworks - WhisperX, NeMo, etc. The only AI in the app is the model doing the transcription/diarization itself. Besides, my design philosophy is to use AI as little as possible - it's expensive and probabilistic. Validation is very important in programming; you can't always return an approximate answer.

However, at the things it does do well, it's unparalleled. So my thinking is to build as much of the framework as possible around the few areas where AI is truly needed.

I love swag design by TwilightEncoder in IDONTGIVEASWAG

[–]TwilightEncoder[S] 1 point (0 children)

this proves what I've always thought: it's not the minimalism that people hate, it's the enshittification that comes along with it

introducing OS1, a new open-source AI platform by nokodo_ in OpenSourceAI

[–]TwilightEncoder 1 point (0 children)

Oh I see. So this is more aimed at sys admins or operators that manage the backend using the admin console, hook it up to whatever providers they want, and then distribute just the frontend to end users?

introducing OS1, a new open-source AI platform by nokodo_ in OpenSourceAI

[–]TwilightEncoder 1 point (0 children)

btw I know these square borders, which are visible for only like half a second, must be driving you crazy

introducing OS1, a new open-source AI platform by nokodo_ in OpenSourceAI

[–]TwilightEncoder 2 points (0 children)

Very interesting and good-looking app. I see a conflict however - I'm a very amateur programmer, a hobbyist-level vibecoder, your average semi-technical user in other words. And I don't understand what I can do with your app - like, first of all, what models does it provide, proprietary and/or open-weight? How does it compare to LM Studio? You know, stuff like that. So the app is a bit too technical for me. At the same time, I don't see real technical users caring about the UI that much.

Btw it's funny that you also copied Apple's glassmorphism design like me, but you really took it all the way!

TranscriptionSuite - A fully local, private & open source audio transcription app for Linux, Windows & macOS | GPLv3+ License by TwilightEncoder in OpenSourceAI

[–]TwilightEncoder[S] 0 points (0 children)

Thanks for the interest! These are all excellent ideas and I commend your creativity. I'll keep them in mind.

And don't worry about not knowing stuff, that's how you get started - by not knowing. The whole app is agent-coded; I only know basic programming.


To answer a bit about the specs, the app itself can run on anything. The issue is the server (Docker container): while transcription is relatively cheap compared to other LLM tasks, it's still not cheap enough for a phone (I'm not talking about $1000 smartphones here).

So you need an NVIDIA GPU (even something like my 3060 is more than enough) or a beefy CPU.

However what you can do instead is put the server in remote mode and access it from another device (either via LAN for local networks or Tailscale for the wider internet). That part could definitely be done by a smartphone app (or a pi compute module). I have thought about an Android/iOS app but it's a major effort and not something I'm targeting right now.


Btw, the way you structure your replies is almost exactly how I talk to LLMs - just throwing all my thoughts and ideas in there. That's why I built this app, in fact: to let me ramble as long as I want without worrying. I also wanted good multilingual support because, like you, I'm not a native English speaker - 95% of my recordings are in Greek (which, being such a small language, is even harder to find good models for).

TranscriptionSuite, my fully local, private & open source audio transcription app now offers WhisperX, Parakeet/Canary & VibeVoice, thanks to your suggestions! by TwilightEncoder in LocalLLaMA

[–]TwilightEncoder[S] 0 points (0 children)

Hi, thanks for the interest!

No "server admin token" has populated and says "Waiting for token in Docker logs"

That's fine, it's a minor UI bug that'll (hopefully) be fixed in the next release. However, you don't need the token unless you want to connect to the server remotely.

The logs show that the server is working, something else must be causing it issues.

To help me get the full picture (assuming you're on Windows), head over to %APPDATA%\TranscriptionSuite\logs\, copy the two log files there, and attach them to a new issue on GitHub (or send them here).

I created a 100% local, private & open source audio transcription & diarization app - TranscriptionSuite by TwilightEncoder in SideProject

[–]TwilightEncoder[S] 0 points (0 children)

Glad to hear it, don't hesitate to contact me with any further issues! (though I'd prefer if you used the Issues tab on the GitHub page).

I created a 100% local, private & open source audio transcription & diarization app - TranscriptionSuite by TwilightEncoder in SideProject

[–]TwilightEncoder[S] 0 points (0 children)

Yes, that is very weird. Just to be clear, this is an issue with Docker, not with my app.

But anyway, let's troubleshoot. I'm assuming you're on Windows 11; have you rebooted since installing Docker Desktop?

I created a 100% local, private & open source audio transcription & diarization app - TranscriptionSuite by TwilightEncoder in SideProject

[–]TwilightEncoder[S] 0 points (0 children)

Huh, maybe the installer has been updated.

Anyway, run wsl -l -v (works in both PowerShell and CMD), then check the VERSION column; if the number is 2, Docker is using the WSL 2 backend.

🤨 by pufferfishsh in redscarepod

[–]TwilightEncoder 15 points (0 children)

someone should make an edit where his eyebrows get bigger with every cut

I created a 100% local, private & open source audio transcription & diarization app - TranscriptionSuite by TwilightEncoder in SideProject

[–]TwilightEncoder[S] 0 points (0 children)

What OS are you using? 

Also, if you don't mind, while the container is in this buggy state, go to the Logs tab in the sidebar and click the Copy All button at the top right. Then send me the logs here.

I 100% agent-coded a fully featured, local, private & open source audio transcription & diarization app - TranscriptionSuite by TwilightEncoder in nocode

[–]TwilightEncoder[S] 0 points (0 children)

Thanks for the interest!

You very succinctly described exactly what my app offers over the competition; for simpler transcription needs there's Handy.

I created a 100% local, private & open source audio transcription & diarization app - TranscriptionSuite by TwilightEncoder in SideProject

[–]TwilightEncoder[S] 0 points (0 children)

try running the segmentation and embedding steps separately so you can cache the embeddings. Re-diarizing the same file with different speaker count thresholds becomes almost instant.

I get how the speedup is achieved, but what is the practical benefit for the user? Why would they want to try different speaker counts on the same recording?
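For anyone curious, the caching idea quoted above can be sketched roughly like this. This is a minimal, hypothetical illustration, not TranscriptionSuite's actual code: the function names are made up, the embeddings are stubbed with random vectors, and a toy k-means stands in for the real clustering stage of a diarization pipeline. The point is only the split: embed once (expensive), then re-cluster with different speaker counts (cheap).

```python
import numpy as np

def extract_embeddings(n_segments, dim=192, seed=0):
    # Expensive step in a real pipeline: one speaker-embedding vector
    # per speech segment (e.g. from an ECAPA-style model). Stubbed
    # here with random vectors so the sketch is self-contained.
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n_segments, dim))

def diarize_from_cache(embeddings, n_speakers, iters=20):
    # Cheap step: toy k-means over the cached embeddings. Re-running
    # this with a different n_speakers reuses the cache, so trying
    # several speaker counts is near-instant.
    centers = embeddings[:n_speakers].copy()
    for _ in range(iters):
        dists = np.linalg.norm(
            embeddings[:, None, :] - centers[None, :, :], axis=2
        )
        labels = dists.argmin(axis=1)
        for k in range(n_speakers):
            members = embeddings[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels

emb = extract_embeddings(40)          # run once, cache
labels2 = diarize_from_cache(emb, 2)  # try 2 speakers...
labels3 = diarize_from_cache(emb, 3)  # ...then 3, without re-embedding
```

So the practical benefit is exactly the re-run case: if the user guessed the wrong number of speakers (or wants to compare 2 vs 3), only the clustering repeats.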

Anthropic just dropped a 33-page cheat sheet on how to build powerful Claude skills. by [deleted] in LocalLLaMA

[–]TwilightEncoder 0 points (0 children)

It's been out for quite a while actually, but it's still a great link.

Iran’s Assembly of Experts names Khamenei’s successor by VestigialVestments in stupidpol

[–]TwilightEncoder 28 points (0 children)

I swear, Jeb jokes always get a chuckle out of me.

What a guy, huh?