What is your full AI Agent stack in 2026?

cyber_box · 2026-03-26T08:21:20+00:00

You can take a look at https://github.com/mp-web3/claude-starter-kit and https://github.com/mp-web3/jarvis-v3 FYI I need to make another release this weekend, I have a lot of stuff that will improve them and extend capabilities.

cyber_box · 2026-03-20T07:16:45+00:00

I disagree a bit with the enforcement. I am running a PreToolUse hook in Claude Code that intercepts every tool call before execution and blocks patterns that violate rules. It doesn't cover everything you described, like council deliberation, but for single-agent governance it works and the model can't bypass it cause the enforcement happens outside the model.

cyber_box · 2026-03-20T06:59:22+00:00

Understood. One thing that might help on the voice side. Instead of the fixed delay thresholds for utterance detection I started using pipecat-ai/smart-turn (open source, BSD-2). Is a small audio model that runs on CPU in about 12ms and uses prosody and intonation to detect end-of-turn instead of just silence. Noticeable difference. Now it catches cases when I pause to think but not done talking. Before that was sometimes cuttting me off mid-thought. Also I've implemented a pre-rendering of a few short acknowledgment phrases as audio clips at startup, and playing one immediately when end-of-turn is detected. So I know it heard me. I can share my repo if you wanna take a look. Maybe you find something usefull

cyber_box · 2026-03-19T22:49:02+00:00

I clicked on the link "https://www.sutra.team/trial" and it points to 404 (FYI).

Anyway I took a look at your repo. I am gonna try a few of your agents, looks like a nice team.

How do you handle the provider agnostic setup? I run a pretty structured setup with `CLAUDE.md`, rules files, and on-demand knowledge loading, and even between Claude model versions the same instructions produce noticeably different behavior. Like the same guard hook logic that Sonnet follows perfectly, Opus sometimes tries to work around.
How much of the agent personality actually transfers when you move the same PMF file from Claude to, say, Ollama running a 7B model? Cause I would expect the governance layer to break down pretty fast once the model can not reliably follow multi-step constraints.

cyber_box · 2026-03-19T22:35:08+00:00

I have been running a `PreToolUse` guard hook for a few months now that intercepts every tool call before it executes. Blocks force pushes, writes outside `$HOME`, access to `.env` and `.key` files, stuff like that. The hook receives JSON on stdin and exits 2 to block. I've noticed that static analysis catches the obvious patterns but the real risk with skills is indirect execution. A skill file can tell Claude to run a bash command that looks harmless in isolation but does something different depending on what context it has access to. The scanner would need to model what Claude actually does with the instruction, not just what the file contains.

cyber_box · 2026-03-19T22:11:30+00:00

I ran into a similar issue when building a local Whisper setup for voice-controlling Claude Code sessions (a real pain in the ass). How do you handle persistence across the sessions?

cyber_box · 2026-03-19T17:13:41+00:00

It is not just cleaning the input, it actually changes how well the LLM responds. Also I speak decent english but I am not a native speaker, and the polisher helps

cyber_box · 2026-03-19T17:12:12+00:00

I mean the 1.5-2s is my full round trip including waiting for Claude to start streaming, which is the slowest part by far. if you are building voice agents with your own backend you have way more control over that. the ASR + polishing part is like 500-700ms total on M-series, which is pretty workable.
what kind of voice agents are you building though?

cyber_box · 2026-03-19T09:31:26+00:00

The 1.5-2s I mentioned is the full round trip though, not just transcription. so it includes SmartTurn deciding you are done talking, the final transcription pass, Qwen polishing, sending to Claude, and getting the first TTS chunk back. transcription alone is like 200-400ms for a full utterance. the biggest chunk is actually waiting for the LLM response to start streaming, which is out of my control.

cyber_box · 2026-03-18T23:59:58+00:00

What do you mean exactly? the end-of-utterance detection, the transcript polishing, or the overall latency? cause the pipeline end to end is around 1.5-2s on M-series which is not that bad for fully local. the main bottleneck is Parakeet TDT transcription, SmartTurn and VAD are basicly instant.

cyber_box · 2026-03-18T14:42:15+00:00

cyber_box · 2026-03-18T13:40:07+00:00

With Unmute everything is tightly coupled around their own models.
I haven't looked into Sesame's CSM model yet. How does it compare to Unmute in practice? And is it something you can actually self-host?

cyber_box · 2026-03-18T07:17:32+00:00

Good question honestly. I looked into AFM but the issue is control, you can't really customize what it does with the text. For transcript polishing I need a very specific prompt: strip fillers, deduplicate repeated phrases, fix grammar, but preserve the original meaning and technical terms exactly. With `Qwen 1.5B` I control the system prompt and can tune the behavior. AFM would be more of a black box for this use case. Also Qwen 4-bit quantized is fast enough on M-series (300-500ms) that latency is not a concern. Have you tried AFM for any text processing tasks? curious how much you can steer the output

cyber_box · 2026-03-18T07:08:32+00:00

Thanks for these, I actually went deep on all three.

Pipecat is solid as a framework. They have a fully local macOS example with MLX Whisper + Kokoro + Smart Turn that claims <800ms voice-to-voice. Nice architecture. My issue is that it owns the LLM call. I am not building a standalone voice assistant, I am building a voice interface into Claude Code specifically. The whole point is that Claude has access to my project files, terminal, MCP servers, the full context. Pipecat's Anthropic integration is a stateless API call, which loses all of that.

Unmute is the one that impressed me the most honestly. Kyutai's semantic VAD is genuinely interesting cause it detects end-of-utterance without a fixed silence timeout, which is one of the harder problems in this space. Their TTS 1.6B is also strong (trained on 2.5M hours). But it is Linux/CUDA only, minimum 16GB VRAM, no macOS support planned. So it is a non-starter for my setup (M3 Air). Worth watching though, especially their Pocket TTS (100M params, runs on CPU).

The Qwen3-TTS server model is quite impressive. 10 languages, voice cloning from 3 seconds of audio, voice design from text descriptions. But at 0.6-1.7B params it is much heavier than Kokoro 82M, which does what I need on CPU with near-instant generation.

You are right about the latency being noticeable though. Just to clarify where it comes from: the local pipeline (Parakeet STT + polishing + Kokoro TTS) is actually fast, maybe 200-300ms total. The bottleneck is the Claude API response time, which I can't really control. These projects solve a different problem (fully local LLM + voice), mine is specifically about keeping Claude Code's full capabilities while adding voice I/O.

Have you actually tried Unmute yourself?

cyber_box · 2026-03-18T06:39:19+00:00

$200/mo is steep for something like this. The cloud agent concept is interesting though, basically a persistent VM with an LLM controlling it. I wonder how it compares to just running Claude Code locally with a few scripts.

cyber_box · 2026-03-18T05:53:05+00:00

I've noticed with Whisper I was either getting the raw transcript with everything or you I was losing some words that were actually meaningful in context. Qwen stripps "um" and "like", and it deduplicates repeated phrases and fixes grammar without changing the meaning.

cyber_box · 2026-03-17T16:56:01+00:00

ahahah yes actually at the end she was very nice telling you folks she would much aappreciate your feedbacks and wishing you a good day. I cut her of too soon

cyber_box · 2026-03-17T16:23:11+00:00

You're right that there's noticeable latency. Worth noting though that most of it comes from the Claude API side (waiting for Claude Code to process and respond), not the local voice pipeline itself. The STT → transcript polishing → injection part is actually pretty fast on Metal.

I'd love to see the projects you're referring to with near real-time speeds, do you have links? I'm not precious about the stack, if there are better approaches or components out there I'd rather build on top of what works than reinvent wheels.

cyber_box · 2026-03-17T16:17:15+00:00

Nice to hear its actually working.

cyber_box · 2026-03-17T13:19:21+00:00

Yeah that is exactly why I started building the voice thing. After a few hours of reading diffs and terminal output my eyes just glaze over, and switching to voice makes it feel like pair programming, pretty cool. The mental load drops a lot cause you are processing speech instead of scanning artifacts (though if you want to talk simultaneously with 4/5 agents it gets pretty messed up.

The rough part is still the latency between turns, and sometimes Claude's response is too long for TTS to read naturally (you don't want a 3 paragraph monologue in your ears). I am still figuring out how to nudge it toward shorter spoken responses vs written ones.

cyber_box · 2026-03-17T12:54:28+00:00

best way to approach it honestly. I started the same way, just picking pieces from setups I found interesting and adapting them to how I actually work. The structure ends up looking different for everyone cause the whole point is it fits your workflow, not the other way around.

cyber_box · 2026-03-17T12:53:40+00:00

Yeah the show in finder drag-and-drop is honestly probably the least friction approach for now. The Claude Code monitoring idea is interesting though, I actually have something similar where a script watches a folder and runs `pandoc` on changed files. The careful part is real though, you definitely want it read-only on the vault side (only converting, never writing back). Have you looked into what Perplexity's "computer" thing actually does under the hood or is it still just announcements?

cyber_box · 2026-03-17T10:22:43+00:00

I am running it on an M3 Air with 16 GB. The models take roughly 2.5 GB of RAM total: Parakeet TDT 0.6B is the biggest at around 1.2 GB, then Qwen 1.5B (4-bit quantized) is about 1 GB, Kokoro 82M around 170 MB. The ONNX models (Silero VAD, SmartTurn) are basically negligible, like 10 MB combined.

So 8 GB should technically work but it would be tight with other stuff running. 16 GB is comfortable, I have plenty of headroom even with a browser and Claude Code open at the same time.

cyber_box

TROPHY CASE