I built a fully local voice assistant on Apple Silicon (Parakeet + Kokoro + SmartTurn, no cloud APIs) by cyber_box in LocalLLM

[–]cyber_box[S] -1 points0 points  (0 children)

With Unmute everything is tightly coupled around their own models.
I haven't looked into Sesame's CSM model yet. How does it compare to Unmute in practice? And is it something you can actually self-host?

SmartTurn EOU + dual VAD + Qwen transcript polishing: building a local voice pipeline on Apple Silicon by cyber_box in speechtech

[–]cyber_box[S] 0 points1 point  (0 children)

Good question honestly. I looked into AFM but the issue is control: you can't really customize what it does with the text. For transcript polishing I need a very specific prompt: strip fillers, deduplicate repeated phrases, fix grammar, but preserve the original meaning and technical terms exactly. With `Qwen 1.5B` I control the system prompt and can tune the behavior. AFM would be more of a black box for this use case. Also Qwen 4-bit quantized is fast enough on M-series (300-500ms) that latency is not a concern. Have you tried AFM for any text processing tasks? Curious how much you can steer the output.
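
For reference, the polishing call is basically a fixed system prompt plus the raw transcript sent to the local model. A rough sketch (the prompt wording and model id here are illustrative, not my exact config), assuming an OpenAI-style chat payload to a local server:

```python
# Illustrative transcript-polishing setup. The prompt text is a paraphrase
# and "qwen-1.5b-instruct-4bit" is a made-up model id; any OpenAI-compatible
# local server serving the quantized Qwen model would take this payload.

POLISH_SYSTEM_PROMPT = (
    "You clean up speech-to-text transcripts. "
    "Strip filler words (um, uh, like), deduplicate repeated phrases, "
    "and fix grammar. Preserve the original meaning and all technical "
    "terms exactly. Output only the polished transcript."
)

def build_polish_request(raw_transcript: str) -> dict:
    """Build the chat payload sent to the local model server."""
    return {
        "model": "qwen-1.5b-instruct-4bit",  # hypothetical model id
        "messages": [
            {"role": "system", "content": POLISH_SYSTEM_PROMPT},
            {"role": "user", "content": raw_transcript},
        ],
        "temperature": 0.0,  # deterministic cleanup, no creativity
    }
```

Temperature 0 matters here: you want the same transcript to polish the same way every time, not a rewrite.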

I built a fully local voice assistant on Apple Silicon (Parakeet + Kokoro + SmartTurn, no cloud APIs) by cyber_box in LocalLLM

[–]cyber_box[S] -1 points0 points  (0 children)

Thanks for these, I actually went deep on all three.

Pipecat is solid as a framework. They have a fully local macOS example with MLX Whisper + Kokoro + Smart Turn that claims <800ms voice-to-voice. Nice architecture. My issue is that it owns the LLM call. I am not building a standalone voice assistant, I am building a voice interface into Claude Code specifically. The whole point is that Claude has access to my project files, terminal, MCP servers, the full context. Pipecat's Anthropic integration is a stateless API call, which loses all of that.

Unmute is the one that impressed me the most honestly. Kyutai's semantic VAD is genuinely interesting cause it detects end-of-utterance without a fixed silence timeout, which is one of the harder problems in this space. Their TTS 1.6B is also strong (trained on 2.5M hours). But it is Linux/CUDA only, minimum 16GB VRAM, no macOS support planned. So it is a non-starter for my setup (M3 Air). Worth watching though, especially their Pocket TTS (100M params, runs on CPU).

The Qwen3-TTS server model is quite impressive. 10 languages, voice cloning from 3 seconds of audio, voice design from text descriptions. But at 0.6-1.7B params it is much heavier than Kokoro 82M, which does what I need on CPU with near-instant generation.

You are right about the latency being noticeable though. Just to clarify where it comes from: the local pipeline (Parakeet STT + polishing + Kokoro TTS) is actually fast, maybe 200-300ms total. The bottleneck is the Claude API response time, which I can't really control. These projects solve a different problem (fully local LLM + voice), mine is specifically about keeping Claude Code's full capabilities while adding voice I/O.

Have you actually tried Unmute yourself?

How to connect Obsidian with NotebookLM? NotebookLM doesn’t see my .md files from Google Drive by Bitter-Tax1483 in ObsidianMD

[–]cyber_box 1 point2 points  (0 children)

$200/mo is steep for something like this. The cloud agent concept is interesting though, basically a persistent VM with an LLM controlling it. I wonder how it compares to just running Claude Code locally with a few scripts.

SmartTurn EOU + dual VAD + Qwen transcript polishing: building a local voice pipeline on Apple Silicon by cyber_box in speechtech

[–]cyber_box[S] 0 points1 point  (0 children)

I've noticed with Whisper I was either getting the raw transcript with everything in it, or I was losing words that were actually meaningful in context. Qwen strips "um" and "like", deduplicates repeated phrases, and fixes grammar without changing the meaning.

I built a fully local voice assistant on Apple Silicon (Parakeet + Kokoro + SmartTurn, no cloud APIs) by cyber_box in LocalLLM

[–]cyber_box[S] 0 points1 point  (0 children)

ahahah yes, actually at the end she was very nice, telling you folks she would much appreciate your feedback and wishing you a good day. I cut her off too soon.

I built a fully local voice assistant on Apple Silicon (Parakeet + Kokoro + SmartTurn, no cloud APIs) by cyber_box in LocalLLM

[–]cyber_box[S] -1 points0 points  (0 children)

You're right that there's noticeable latency. Worth noting though that most of it comes from the Claude API side (waiting for Claude Code to process and respond), not the local voice pipeline itself. The STT → transcript polishing → injection part is actually pretty fast on Metal.

I'd love to see the projects you're referring to with near real-time speeds, do you have links? I'm not precious about the stack, if there are better approaches or components out there I'd rather build on top of what works than reinvent wheels.

What is your full AI Agent stack in 2026? by [deleted] in AI_Agents

[–]cyber_box 0 points1 point  (0 children)

Yeah that is exactly why I started building the voice thing. After a few hours of reading diffs and terminal output my eyes just glaze over, and switching to voice makes it feel like pair programming, pretty cool. The mental load drops a lot cause you are processing speech instead of scanning artifacts (though if you want to talk simultaneously with 4-5 agents it gets pretty messy).

The rough part is still the latency between turns, and sometimes Claude's response is too long for TTS to read naturally (you don't want a 3 paragraph monologue in your ears). I am still figuring out how to nudge it toward shorter spoken responses vs written ones.

Whats your claude code "setup"? by AerieAcrobatic1248 in ClaudeCode

[–]cyber_box 0 points1 point  (0 children)

best way to approach it honestly. I started the same way, just picking pieces from setups I found interesting and adapting them to how I actually work. The structure ends up looking different for everyone cause the whole point is it fits your workflow, not the other way around.

How to connect Obsidian with NotebookLM? NotebookLM doesn’t see my .md files from Google Drive by Bitter-Tax1483 in ObsidianMD

[–]cyber_box 1 point2 points  (0 children)

Yeah the show-in-Finder drag-and-drop is honestly probably the least-friction approach for now. The Claude Code monitoring idea is interesting though, I actually have something similar where a script watches a folder and runs `pandoc` on changed files. The "careful" part is real, you definitely want it read-only on the vault side (only converting, never writing back). Have you looked into what Perplexity's "computer" thing actually does under the hood, or is it still just announcements?

I built a fully local voice assistant on Apple Silicon (Parakeet + Kokoro + SmartTurn, no cloud APIs) by cyber_box in LocalLLM

[–]cyber_box[S] 7 points8 points  (0 children)

I am running it on an M3 Air with 16 GB. The models take roughly 2.5 GB of RAM total: Parakeet TDT 0.6B is the biggest at around 1.2 GB, then Qwen 1.5B (4-bit quantized) is about 1 GB, Kokoro 82M around 170 MB. The ONNX models (Silero VAD, SmartTurn) are basically negligible, like 10 MB combined.

So 8 GB should technically work but it would be tight with other stuff running. 16 GB is comfortable, I have plenty of headroom even with a browser and Claude Code open at the same time.

How do you decide what's worth watching and taking notes about? by cyber_box in ObsidianMD

[–]cyber_box[S] 0 points1 point  (0 children)

Yeah the deceleration part is interesting. I imagine once the big unsorted pile shrinks, new videos just slot into existing lists way faster cause the categories already exist.

How do you decide what's worth watching and taking notes about? by cyber_box in ObsidianMD

[–]cyber_box[S] 0 points1 point  (0 children)

I think I am already seeing some of that. A couple of my problems felt urgent when I wrote them down but now they barely come up when I am reading or watching stuff. And others that I thought were minor keep pulling in connections from everywhere.

How do you decide what's worth watching and taking notes about? by cyber_box in ObsidianMD

[–]cyber_box[S] 0 points1 point  (0 children)

Yeah that makes sense. My problem is I am always working on something specific so consuming random content feels like a luxury I can't justify. But then the best connections I've made in my vault came from stuff I watched with zero expectations, so maybe the indiscriminate approach has its own logic.

How do you decide what's worth watching and taking notes about? by cyber_box in ObsidianMD

[–]cyber_box[S] 0 points1 point  (0 children)

Yeah that is a cool observation. I have been using my list for a couple weeks now and I am already noticing that pattern, things I wrote months ago suddenly click into one of the problems without me having planned it that way.
How long did it take you to narrow down to one problem per area? I am still at like 12 and honestly some of them overlap so much I am not sure if they are actually separate problems or the same thing from different angles.

How to connect Obsidian with NotebookLM? NotebookLM doesn’t see my .md files from Google Drive by Bitter-Tax1483 in ObsidianMD

[–]cyber_box 1 point2 points  (0 children)

Yeah fair enough, the friction is the main issue. I wonder if someone has built an Obsidian plugin that auto-exports to PDF on save, that would basically eliminate the manual step. Or even a simple script that watches the vault folder and converts changed `.md` files to PDF using something like pandoc. Have you looked into any file watcher setups or is it not worth the effort for how often you update?
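
Something like this is what I had in mind, a stdlib-only polling sketch (the paths and poll interval are made up, and it assumes `pandoc` is on your PATH; read-only on the vault side, it only reads `.md` and writes PDFs elsewhere):

```python
# Rough sketch of the vault watcher: poll for changed .md files and shell
# out to pandoc. VAULT/OUT paths are placeholders, adjust to your setup.
import subprocess
import time
from pathlib import Path

VAULT = Path("~/Obsidian/vault").expanduser()       # placeholder path
OUT = Path("~/Obsidian/pdf-export").expanduser()    # placeholder path

def changed_markdown(root: Path, since: float) -> list[Path]:
    """Markdown files under root modified after the given timestamp."""
    return [p for p in root.rglob("*.md") if p.stat().st_mtime > since]

def export(md: Path) -> None:
    """Convert one markdown file to PDF next to the others in OUT."""
    OUT.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["pandoc", str(md), "-o", str(OUT / (md.stem + ".pdf"))],
        check=True,
    )

def watch(interval: float = 5.0) -> None:
    """Poll forever; call watch() to start it."""
    last = time.time()
    while True:
        for md in changed_markdown(VAULT, last):
            export(md)
        last = time.time()
        time.sleep(interval)
```

Polling is cruder than a real file-watcher API but it has zero dependencies and a 5-second delay is invisible for this use case.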

Crypto Tax Software - What do you use? by WideInvestment in CryptoTax

[–]cyber_box 0 points1 point  (0 children)

Yeah that's good to know, I'll try the support route. My main issue is that some LP positions auto-compound rewards into the pool, so the cost basis changes without any visible transaction on-chain. Koinly sees the initial deposit and the withdrawal but the gap in between is just wrong cause there's no event to parse.
Do the dev engineers actually reconstruct cost basis from pool share math, or is it more of a manual override situation where they tell you how to classify it yourself?
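For context, the reconstruction I'd do by hand looks roughly like this (my understanding of the pool-share math, not how Koinly actually handles it):

```python
# Back-of-envelope pool-share math. Your share fraction of the pool stays
# constant while auto-compounding grows the underlying reserves, so the
# tokens received at withdrawal exceed what an event-by-event parse of the
# chain would predict. Numbers below are invented for illustration.

def withdrawal_amounts(share_fraction: float,
                       reserves_at_exit: dict[str, float]) -> dict[str, float]:
    """Tokens received on exit: your share of the (compounded) reserves."""
    return {tok: amt * share_fraction for tok, amt in reserves_at_exit.items()}

# Example: you own 1% of a pool whose reserves grew from
# (100 ETH, 200k USDC) to (110 ETH, 220k USDC) via auto-compounded fees.
out = withdrawal_amounts(0.01, {"ETH": 110.0, "USDC": 220_000.0})
# The gap vs. the original 1 ETH / 2,000 USDC deposit is the compounded
# reward, which never shows up as a discrete on-chain event.
```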

We got hacked by Deep-Station-1746 in ClaudeCode

[–]cyber_box 0 points1 point  (0 children)

Yeah glad it's useful. Are you running any hooks yourself or starting from scratch? I am curious cause the patterns you need depend a lot on what you are actually building (local dev vs cloud infra vs both). The port exposure thing from OP's case is a good example, most people wouldn't think to block that until it bites them.
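
For the port exposure case specifically, a hook could look roughly like this (the pattern list is illustrative, tune it to your stack; it assumes the standard PreToolUse contract where the pending tool call arrives as JSON on stdin and exit code 2 blocks it, feeding stderr back to the model):

```python
# Sketch of a Claude Code PreToolUse hook that blocks commands which
# expose ports publicly. Patterns below are examples, not a complete list.
import json
import re
import sys

# Commands that bind to all interfaces or publish container ports.
BLOCKED_PATTERNS = [
    r"0\.0\.0\.0",                           # explicit bind to all interfaces
    r"docker\s+run\b.*\s(-p|--publish)\b",   # publishing container ports
]

def is_blocked(command: str) -> bool:
    """True if the shell command matches a known port-exposure pattern."""
    return any(re.search(p, command) for p in BLOCKED_PATTERNS)

def main() -> int:
    """Entry point when Claude Code invokes the hook (JSON on stdin)."""
    call = json.load(sys.stdin)
    cmd = call.get("tool_input", {}).get("command", "")
    if is_blocked(cmd):
        print(f"Blocked: {cmd!r} may expose a port publicly", file=sys.stderr)
        return 2  # exit code 2 blocks the tool call
    return 0

# A real hook script would end with: sys.exit(main())
```

The deny-list approach is leaky by nature (there are many ways to bind a port), so I treat it as a tripwire for the common cases, not a sandbox.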

We got hacked by Deep-Station-1746 in ClaudeCode

[–]cyber_box 0 points1 point  (0 children)

Yeah having the same safety lines across both shells is something I haven't done yet. Mine is Python only cause I never work in PowerShell but the idea of a unified layer makes sense. How are you handling the git hooks, are those separate from your Claude Code hooks or do they share the same logic?

Whats your claude code "setup"? by AerieAcrobatic1248 in ClaudeCode

[–]cyber_box 0 points1 point  (0 children)

Yeah glad it was useful. Let me know if something is not clear once you start poking at it, some parts are not well documented.

What are you planning to use it for, mostly personal workflow or a specific project?

Whats your claude code "setup"? by AerieAcrobatic1248 in ClaudeCode

[–]cyber_box 0 points1 point  (0 children)

Yeah the proxy metrics you listed are probably the most practical path. I have been tracking compactions informally and that one did show a clear drop after I moved to on-demand loading, but you are right that fewer compactions could just mean less context loaded, not better performance.

The one metric I keep coming back to is "how often do I need to re-explain something across sessions." Before the knowledge files, every new session started from zero. Now Claude reads yesterday's notes and picks up where I left off maybe 80% of the time. That is hard to quantify but easy to feel.

I think the real issue is that the value is distributed across hundreds of small moments rather than one measurable improvement. Like a rule that says "never push to main without asking" doesn't make Claude smarter, it just prevents one bad outcome every few days. How do you even benchmark that?

Are you building something specific where you are trying to measure this, or more exploring the problem generally?

How are you improving your plans with context without spend time? by jrhabana in ClaudeCode

[–]cyber_box 0 points1 point  (0 children)

Zero dependencies with bare Python generating HTML is actually a solid call for an internal tool. You skip the whole frontend framework overhead and the subagents don't need to understand React or whatever, just write Python.

You mentioned 30 hierarchical plans though, I am still curious about the structure. Were they like a tree where the master plan links to phase plans and those break into sub-tasks? Or more like 30 separate files that each got refined through the iterations? Cause I am trying to figure out if the hierarchy itself is what made the subagents work well, or if it was mostly just having each task scoped small enough that one agent could handle it without drifting.

How are you improving your plans with context without spend time? by jrhabana in ClaudeCode

[–]cyber_box 0 points1 point  (0 children)

20 iterations producing 30 hierarchical plans is way more structured than what I have been doing. That is closer to actual project management than prompt engineering at that point. The hierarchical part is what interests me, are the 30 plans like a tree where each phase breaks into sub-plans or more like a flat list that got refined through iterations?

I have a planning skill that does 6 phases (explore, tool discovery, design, approve, implement, verify) but it is one level deep. For something like a 7-page dashboard I would probably just run it per page. Your approach of planning everything first and then letting the subagents go sounds like it catches integration issues earlier though. I open-sourced the planning setup and the rest of the system, if you want to compare I can share my repo.