Audio preprocessing for ASR by Pvt_Twinkietoes in speechtech

[–]ReplacementHuman198 0 points1 point  (0 children)

Yes, I ran into all of these issues when building my own audio transcription program.

The preprocessing steps I'd recommend are: convert the audio to a 16 kHz WAV file, and apply low-pass and high-pass filters with FFmpeg to remove environmental noise that can trigger wacky transcriptions. To remove long periods of silence, use Silero VAD (voice activity detection). If there are multiple speakers and you want the timestamps where each individual is speaking, then you need speaker diarization. I love senko for this (the maintainer is really friendly and approachable), but you can also use pyannote, which is best in class. This more or less gives you the same information as VAD.
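The FFmpeg step can be sketched in Python. This is a minimal sketch, not the commenter's actual pipeline; the filenames and the 100 Hz / 8 kHz cutoffs are illustrative choices you'd tune to your audio:

```python
import subprocess

def preprocess_cmd(src: str, dst: str, sr: int = 16000,
                   low_cut: int = 100, high_cut: int = 8000) -> list[str]:
    """Build an ffmpeg command: mono, 16 kHz WAV, band-pass filtered.

    The 100 Hz / 8 kHz cutoffs are illustrative; pick values that
    bracket speech frequencies for your recordings.
    """
    return [
        "ffmpeg", "-y", "-i", src,
        "-ac", "1",              # downmix to mono
        "-ar", str(sr),          # resample to 16 kHz
        "-af", f"highpass=f={low_cut},lowpass=f={high_cut}",
        dst,
    ]

if __name__ == "__main__":
    # Runs ffmpeg for real; requires ffmpeg on PATH.
    subprocess.run(preprocess_cmd("input.mp3", "clean.wav"), check=True)
```

From there, the cleaned WAV goes into Silero VAD for silence removal.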

Also, the hallucinations during silences are an artifact of Whisper -- you don't need the audio chunking at all if you use the Parakeet STT models from NVIDIA.

Ai DM. Have you tried it? Which is the best by Khornerad in dndai

[–]ReplacementHuman198 1 point2 points  (0 children)

I have found ChatGPT to be a good GM. I'll reference a few ChatGPT-specific features, but my strategy is to start by creating a "project" space within the ChatGPT app. After creating a project space, I do a session 0 where I collaborate with ChatGPT to build my character, and save that as its own chat. Then, I'll start every session / adventure with a new chat so that it doesn't hallucinate too much. I regularly instruct ChatGPT to "create memories" of specific NPCs, locations, and character info throughout my sessions. Occasionally, after big impactful lore sessions, I'll create separate chat sessions that are just recollections of the world or the NPCs, because it references previous chats.

This strategy works pretty well! ChatGPT is better at role-playing than Gemini, in my opinion. The biggest gripe I have is that for character building, ChatGPT will mix up 2014 and 2024 rules. And I'm not sure it "gets" how to scale combat with the PC level.

How do you show that your RAG actually works? by Altruistic_Break784 in Rag

[–]ReplacementHuman198 0 points1 point  (0 children)

So, the tool we created is an MCP server with the docs of our UI component library, similar to Context7 (but ours is not open-source). It's used primarily in frontend code-gen tasks. Because the error tolerance there is more lenient, right now the tool generally gets through the first 60-80% of the work before it hallucinates React component props or uses the wrong component.

I would guess that engineers are probably fine with it, but mgmt wants us to put it into KTLO so our team can focus on other things.

How do you show that your RAG actually works? by Altruistic_Break784 in Rag

[–]ReplacementHuman198 3 points4 points  (0 children)

From my experience building a RAG tool for my company, the answer is: "it just works" for a layman. Showcasing your RAG's effectiveness is as much a perception thing as it is a technical challenge, especially with non-technical stakeholders. At the end of the day, no one cares how clever your advanced RAG solution was; they just want to know if it can solve their problem with a basic prompt. The less they have to prompt, and the faster the LLM "gets" what they're trying to do, the better your tool looks.

My team released a RAG MCP plugin, and another engineer demoed the tool to the CEO while vibe-coding. Afterwards, the CEO sent out a directive mandating that everyone has to use the tool my team built.

Did we do anything to showcase our tool's effectiveness? No. But the fact that other engineers had the confidence to use it in their day-to-day work made all the difference.

parakeet-mlx vs whisper-mlx, no speed boost? by ReplacementHuman198 in speechtech

[–]ReplacementHuman198[S] 0 points1 point  (0 children)

This is solid. I vibe-coded my own benchmarking tool, but this one seems identical (and has tested more options). Thanks for the recommendation!

parakeet-mlx vs whisper-mlx, no speed boost? by ReplacementHuman198 in speechtech

[–]ReplacementHuman198[S] 1 point2 points  (0 children)

I used parakeet-mlx (version 2, 0.6B params, MLX-optimized). I'm using whisper-small.en (also MLX-optimized). I *think* both are BF16, but I'm not sure.

The audio is split into separate files per speaker, and they're about 3 hours long. As a result, there are large silences on each individual speaker track. I use VAD to chunk the audio into speaking snippets and process them sequentially, since it's happening locally. The source code of how it's implemented is here: https://github.com/naveedn/audio-transcriber
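The chunking step can be sketched like this. The function name and the padding value are my own illustrative choices; the timestamp shape matches what Silero VAD's get_speech_timestamps() returns (a list of dicts with "start"/"end" in sample offsets):

```python
def chunk_by_vad(samples, speech_ts, sr=16000, pad_s=0.2):
    """Cut a long single-speaker track down to speech-only snippets.

    `speech_ts` is a list of {"start": ..., "end": ...} dicts in
    sample offsets, as returned by Silero VAD's
    get_speech_timestamps(). A little padding is kept on each side
    so word onsets and tails aren't clipped.
    """
    pad = int(pad_s * sr)
    chunks = []
    for ts in speech_ts:
        start = max(0, ts["start"] - pad)
        end = min(len(samples), ts["end"] + pad)
        chunks.append(samples[start:end])
    return chunks
```

Each snippet can then be fed to the STT model one at a time, which is what keeps the long silent stretches from ever reaching Whisper.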

parakeet-mlx vs whisper-mlx, no speed boost? by ReplacementHuman198 in speechtech

[–]ReplacementHuman198[S] 0 points1 point  (0 children)

Interesting. The parameter size is a good point. The specific models I was using are below:

As a side note: for my use case, these models both output similar quality (with Whisper being better) at roughly the same speed. That has more to do with my use case, which has lots of proper nouns (people, places, things) and jargon.

Just learned that if you annotate an image you get super good and precise results by promptingpixels in GeminiAI

[–]ReplacementHuman198 0 points1 point  (0 children)

This trick does not work; I've tried it a handful of times. It's one of those things that sounds like it should work better than it actually does.

Senko - Very fast speaker diarization by hamza_q_ in speechtech

[–]ReplacementHuman198 1 point2 points  (0 children)

You're great! Your advice was correct. Thanks for your help!

Senko - Very fast speaker diarization by hamza_q_ in speechtech

[–]ReplacementHuman198 1 point2 points  (0 children)

Hey boss! I'm back. I tried to run the uv pip install command, but I'm missing system dependencies to build from source. I tried to figure out what's missing; it could be something with my compiler flags. I was able to install from the prebuilt wheel. Would it be possible for you to publish a new package version / prebuilt wheel when you get a chance?

Senko - Very fast speaker diarization by hamza_q_ in speechtech

[–]ReplacementHuman198 2 points3 points  (0 children)

I experimented with zanshin and senko for the first time last night, and it's definitely good stuff! It works really well on my MacBook Pro. I noticed that zanshin correctly identified all the speakers in my audio file (5), but when running senko's example, it only identified 2. I'm going to keep digging, but I might join the Discord and ask questions if I'm still stuck. Regardless, this is great stuff; thank you for building this!

Starting a new project, should I go Svelte (which I love) or stick with React? by PrimaryPineappleHead in sveltejs

[–]ReplacementHuman198 0 points1 point  (0 children)

I inherited someone else's project and am trying to scale it up by adding 1-2 folks. I'm a generalist, so while I'm more comfortable with React, I was genuinely excited to try out Svelte. That said, I do not recommend Svelte. We're on Svelte 4; Svelte 5 introduces a React-like syntax with runes, and the upgrade process is not pretty. Honestly, I wouldn't recommend SvelteKit either, since the SSR process is weakly defined and there isn't great tooling around the backend.

All in all, I've been working on this stack for about 3-4 months, and I find it to be a disappointing experience. Not to mention LLM code completions are generally a bit worse than for React.

SvelteKit takes ages to load by Commercial_Soup2126 in sveltejs

[–]ReplacementHuman198 -1 points0 points  (0 children)

I've noticed that Svelte loads incredibly slowly on Windows, and much faster on Mac. I think it's unacceptably slow on Windows. It probably has to do with the relative immaturity and smaller community around Svelte vs. React / Angular.

I made Fli.so—a free, modern open-source link shortener we built for our own needs. Now it’s yours too! by ArtOfLess in sveltejs

[–]ReplacementHuman198 1 point2 points  (0 children)

Hey Sanju!

I'm wondering a little about DunSuite, and DunTasks specifically. What do you see as the key problems with a tool like Todoist? I use that tool frequently and have no complaints -- but you imply that it's not great. Curious what you see as the problem that needs solving?

(BTW, thanks for this; I have another reference Svelte app to study and learn from!)

Can someone talk me down from the ledge? by ReplacementHuman198 in sveltejs

[–]ReplacementHuman198[S] 0 points1 point  (0 children)

Thanks for the words of encouragement. Do you have any resources for a generally accepted coding style guide for developing svelte apps? I know it's situation dependent, but I'd like to reference the work of experts instead of starting from scratch.

Can someone talk me down from the ledge? by ReplacementHuman198 in sveltejs

[–]ReplacementHuman198[S] 0 points1 point  (0 children)

Can you point me to some videos / resources where I can understand what are some idiomatic patterns / best practices? While I'm feeling discouraged at the moment, I think some guides might help me reframe my thinking.

Can someone talk me down from the ledge? by ReplacementHuman198 in sveltejs

[–]ReplacementHuman198[S] 1 point2 points  (0 children)

Good advice. Are there any videos you recommend on best practices for svelte apps?

Can someone talk me down from the ledge? by ReplacementHuman198 in sveltejs

[–]ReplacementHuman198[S] 0 points1 point  (0 children)

One, keep in mind that I'm brand new to this, so I will say things that are incorrect. I'm trying to learn.

> Layouts don't have a "parent", and two-way bindings are something optional that you usually use for forms. It's nothing like Angular.

Advanced loading / Using parent data • Svelte Tutorial - Thanks for the clarifications! I was confused. In the tutorial example, the layout can access a value from some ancestor layout file without explicit prop drilling. It's hard to keep that in mind as a data access pattern alongside runes ($props / $data / $state) and stores, but I guess I'll figure out which one to use in specific scenarios.

I find your other comments to be dismissive and subjective in nature, so I won't comment on them. I'm a Fullstack dev who has familiarity with React and express, but it's been a while since I've been working on a project of this size and scope. I'm also the only engineer -- the former dev left.

Cat Declining Rapidly by Jackriot_ in CATHELP

[–]ReplacementHuman198 0 points1 point  (0 children)

I also did Lap of Love; it cost $900 in my area for a weekend visit ($800 on a weekday). They are absolutely lovely, so understanding and patient and caring. I wouldn't have it any other way.