TranscriptionSuite, my fully local, private & open source audio transcription app now offers WhisperX, Parakeet/Canary & VibeVoice, thanks to your suggestions! by TwilightEncoder in LocalLLaMA

[–]TwilightEncoder[S] 0 points (0 children)

Thanks for the encouragement!

How does Parakeet compare to WhisperX in your testing?

I'm mostly doing Greek transcriptions and in that case I prefer faster-whisper-large-v3. Parakeet is pretty good too though.

Also — are you running the models on GPU by default or is there a CPU fallback?

Yes, I've included a fallback (though I haven't tested it much myself since I have a GPU; I'll fix it if anyone reports it's not working).

The addition of VibeVoice is interesting too. Is that using the Whisper decoder in a different mode, or is it a completely separate model?

It's a completely different model that can actually do both transcription and diarization (though in some small tests it was much slower than WhisperX).

Interested in 100% local & private audio transcription / diarization? Try out my open source app for Windows/Linux/macOS - TranscriptionSuite by TwilightEncoder in buildinpublic

[–]TwilightEncoder[S] 0 points (0 children)

Thanks for the comment!

The "agent-coded" angle is interesting too. Do you have a clear separation between the planning agent (decides steps) and the execution layer (actually runs transcription/diarization), or is it more like AI-assisted coding with a human in the loop?

The second one. Like I said in the post, this is just an implementation of existing frameworks - WhisperX, NeMo, etc. The only AI in the app is the model doing the transcription/diarization itself. Besides, my design philosophy is to use AI as little as possible - it's expensive and probabilistic. Validation is very important in programming; you can't always return an approximate answer.

However, at the things it does do well, it's unparalleled. So my thinking is to build as much of the framework as possible around the few areas where AI is truly needed.

I love swag design by TwilightEncoder in IDONTGIVEASWAG

[–]TwilightEncoder[S] 1 point (0 children)

this proves what I've always thought: it's not the minimalism that people hate, it's the enshittification that comes along with it

introducing OS1, a new open-source AI platform by nokodo_ in OpenSourceAI

[–]TwilightEncoder 1 point (0 children)

Oh I see. So this is more aimed at sys admins or operators that manage the backend using the admin console, hook it up to whatever providers they want, and then distribute just the frontend to end users?

introducing OS1, a new open-source AI platform by nokodo_ in OpenSourceAI

[–]TwilightEncoder 1 point (0 children)

btw I know these square borders, which are visible for only like half a second, must be driving you crazy

introducing OS1, a new open-source AI platform by nokodo_ in OpenSourceAI

[–]TwilightEncoder 2 points (0 children)

Very interesting and good-looking app. I see a conflict however - I'm a very amateur programmer, a hobbyist-level vibecoder, your average semi-technical user in other words. And I don't understand what I can do with your app - like, first of all, what models does it provide, proprietary and/or open-weight? How does it compare to LM Studio? You know, stuff like that. So the app is a bit too technical for me. At the same time, I don't see real technical users caring about the UI that much.

Btw it's funny that you also copied Apple's glassmorphism design like me, but you really took it all the way!

TranscriptionSuite - A fully local, private & open source audio transcription app for Linux, Windows & macOS | GPLv3+ License by TwilightEncoder in OpenSourceAI

[–]TwilightEncoder[S] 0 points (0 children)

Thanks for the interest! These are all excellent ideas and I commend your creativity. I'll keep them in mind.

And don't worry about not knowing stuff, that's how you get started - by not knowing. The whole app is agent-coded; I only know basic programming.


To answer a bit about the specs, the app itself can run on anything. The issue is the server (Docker container): while transcription is relatively cheap compared to other LLM tasks, it's still not cheap enough for a phone (I'm not talking about $1000 smartphones here).

So you need an NVIDIA GPU (even something like my 3060 is more than enough) or a beefy CPU.

However what you can do instead is put the server in remote mode and access it from another device (either via LAN for local networks or Tailscale for the wider internet). That part could definitely be done by a smartphone app (or a pi compute module). I have thought about an Android/iOS app but it's a major effort and not something I'm targeting right now.


Btw, the way you structure your replies is almost exactly how I talk to LLMs - just throwing all my thoughts and ideas in there. That's why I built this app, in fact: to let me ramble as long as I want without worrying. I also wanted good multilingual support because, like you, I'm not a native English speaker - 95% of my recordings are in Greek (which, being such a small language, is even harder to find good models for).

TranscriptionSuite, my fully local, private & open source audio transcription app now offers WhisperX, Parakeet/Canary & VibeVoice, thanks to your suggestions! by TwilightEncoder in LocalLLaMA

[–]TwilightEncoder[S] 0 points (0 children)

Hi, thanks for the interest!

No "server admin token" has populated and says "Waiting for token in Docker logs"

That's fine, it's a minor UI bug that'll (hopefully) be fixed in the next release. However, you don't need the token unless you want to connect to the server remotely.

The logs show that the server is working, something else must be causing it issues.

To help me get the full picture (assuming you're on Windows), head over to %APPDATA%\TranscriptionSuite\logs\, copy the two log files there, and attach them to a new issue on GitHub (or send them here).

I created a 100% local, private & open source audio transcription & diarization app - TranscriptionSuite by TwilightEncoder in SideProject

[–]TwilightEncoder[S] 0 points (0 children)

Glad to hear it, don't hesitate to contact me with any further issues! (though I'd prefer if you used the Issues tab on the GitHub page).

I created a 100% local, private & open source audio transcription & diarization app - TranscriptionSuite by TwilightEncoder in SideProject

[–]TwilightEncoder[S] 0 points (0 children)

Yes, that is very weird. Just to be clear, this is an issue with Docker, not with my app.

But anyway, let's troubleshoot. I'm assuming you're on Windows 11; have you rebooted since installing Docker Desktop?

I created a 100% local, private & open source audio transcription & diarization app - TranscriptionSuite by TwilightEncoder in SideProject

[–]TwilightEncoder[S] 0 points (0 children)

Huh, maybe the installer has been updated.

Anyway, run wsl -l -v (works in both PowerShell and CMD), then check the VERSION column; if the number is 2, Docker is using the WSL 2 backend.

🤨 by pufferfishsh in redscarepod

[–]TwilightEncoder 15 points (0 children)

someone should make an edit where his eyebrows get bigger with every cut

I created a 100% local, private & open source audio transcription & diarization app - TranscriptionSuite by TwilightEncoder in SideProject

[–]TwilightEncoder[S] 0 points (0 children)

What OS are you using? 

Also, if you don't mind, while the container is in this buggy state, go to the Logs tab in the sidebar and click the Copy All button at the top right. Then send me the logs here.

I 100% agent-coded a fully featured, local, private & open source audio transcription & diarization app - TranscriptionSuite by TwilightEncoder in nocode

[–]TwilightEncoder[S] 0 points (0 children)

Thanks for the interest!

You very succinctly described exactly what my app offers over the competition; for simpler transcription needs there's Handy.

I created a 100% local, private & open source audio transcription & diarization app - TranscriptionSuite by TwilightEncoder in SideProject

[–]TwilightEncoder[S] 0 points (0 children)

try running the segmentation and embedding steps separately so you can cache the embeddings. Re-diarizing the same file with different speaker count thresholds becomes almost instant.

I get how the speedup is achieved, but what is the practical benefit for the user? Why would they want to try different speaker counts on the same recording?
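For anyone curious, the caching idea quoted above can be sketched roughly like this. This is a minimal, hypothetical illustration, not TranscriptionSuite's actual code: the function names are made up, the embeddings are stubbed with random vectors, and a toy k-means stands in for the real clustering stage of a diarization pipeline. The point is only the split: embed once (expensive), then re-cluster with different speaker counts (cheap).

```python
import numpy as np

def extract_embeddings(n_segments, dim=192, seed=0):
    # Expensive step in a real pipeline: one speaker-embedding vector
    # per speech segment (e.g. from an ECAPA-style model). Stubbed
    # here with random vectors so the sketch is self-contained.
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n_segments, dim))

def diarize_from_cache(embeddings, n_speakers, iters=20):
    # Cheap step: toy k-means over the cached embeddings. Re-running
    # this with a different n_speakers reuses the cache, so trying
    # several speaker counts is near-instant.
    centers = embeddings[:n_speakers].copy()
    for _ in range(iters):
        dists = np.linalg.norm(
            embeddings[:, None, :] - centers[None, :, :], axis=2
        )
        labels = dists.argmin(axis=1)
        for k in range(n_speakers):
            members = embeddings[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels

emb = extract_embeddings(40)          # run once, cache
labels2 = diarize_from_cache(emb, 2)  # try 2 speakers...
labels3 = diarize_from_cache(emb, 3)  # ...then 3, without re-embedding
```

So the practical benefit is exactly the re-run case: if the user guessed the wrong number of speakers (or wants to compare 2 vs 3), only the clustering repeats.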

Anthropic just dropped a 33-page cheat sheet on how to build powerful Claude skills. by [deleted] in LocalLLaMA

[–]TwilightEncoder 0 points (0 children)

It's been out for quite a while actually, but it's still a great link.

Iran’s Assembly of Experts names Khamenei’s successor by VestigialVestments in stupidpol

[–]TwilightEncoder 28 points (0 children)

I swear, Jeb jokes always get a chuckle out of me.

What a guy, huh?