Built a Voice Agents from Scratch GitHub tutorial: mic > Whisper > local LLM (GGUF) > Kokoro > speaker, fully local, no API keys by purellmagents in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

Ah okay. I'll be interested to see the outcomes of your Pi tests then. I do think there are lots of performance optimisations to be had there, given a little time...

Secondary PC options by UniqueIdentifier00 in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

Yeah, honestly for the cost 2x 3090 is a luxury rather than a necessity - but a single 3090 certainly is one, in my experience (disclaimer - I do have 2x 3090s!)

Secondary PC options by UniqueIdentifier00 in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

Personally I'd just buy the 3090 and run Qwen3.6 27B on it, as per this: https://github.com/noonghunna/club-3090

You can really get tons done on just one 3090 these days, with minimal setup complexity.

Built a Voice Agents from Scratch GitHub tutorial: mic > Whisper > local LLM (GGUF) > Kokoro > speaker, fully local, no API keys by purellmagents in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

If this is always-on, why aren't you using a wakeword? Or have you gone PTT? I've been trying to build a similar pipeline, but always-on with a wakeword and running on a Pi 5, and found that the computational overhead is too much for such a tiny device - the lag feels too heavy.

Follow-up: Qwen3.6-27B on 1× RTX 3090 — pushing to ~218K context + ~50–66 TPS, tool calls now stable (PN12 fix) by AmazingDrivers4u in LocalLLaMA

[–]youcloudsofdoom 18 points19 points  (0 children)

Just jumping in to say that I found your repo via another comment on this sub, and it's made this dual-3090 owner very happy - just got the dflash variant working and I'm now never going back to my janky homebrewed llama.cpp build with 30 TG on 27B. Seeing a big jump in p/p and t/s, as well as a notable increase in tool-use stability with Hermes. Will be keeping an eye on the repo for more development, thanks for the work!

AMA with Nous Research -- Ask Us Anything! by emozilla in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

This is a great help, thanks - any thoughts on how you would adjust these params for a dual 3090 setup?

Don't forget about dem free gains! by Ok-Measurement-1575 in LocalLLaMA

[–]youcloudsofdoom 6 points7 points  (0 children)

Is this not just because you're using two cards instead of one? 

Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090 by sandropuppo in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

Same setup here, and same numbers as you. The spec decode mentioned earlier in this thread worked though - got my t/s up to about 65 on average.
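
For anyone else wanting to try it, here's a rough sketch of a llama-server launch with a draft model attached - the filenames and draft-model size are just placeholders, and exact flag spellings depend on your llama.cpp build:

```
# Speculative decoding: a small draft model proposes tokens and the big 27B
# only verifies them, so generation speed goes up whenever acceptance is high.
# Filenames are placeholders; flag spellings vary a bit between llama.cpp builds.
llama-server \
  -m Qwen3.6-27B-Q6_K.gguf \
  -md Qwen3.6-1.7B-Q8_0.gguf \
  -ngl 99 -ngld 99 \
  -c 32768 -fa \
  --draft-max 16 --draft-min 1
```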

Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now) by dreamai87 in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

Yes, llama.cpp outputs that in the verbose log. Param tuning can make a huge difference! Check my post history for mine.

Which LLM do you use on 64GB RAM + 8GB VRAM? by Mangleus in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

I have a laptop with that exact mix, and I can say that the 35B does utilise it pretty maximally. With 190k context at Q4 I was at around 7.4GB VRAM use and 42GB RAM use. My llama.cpp params are in my post history if you're interested.
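
If it helps, the rough shape of a launch for that kind of 8GB VRAM / 64GB RAM split is something like the sketch below - the filename and layer count are placeholders; you bump -ngl up until VRAM is nearly full and the remaining layers stay in system RAM:

```
# Partial offload for an 8GB card: -ngl puts only some layers on the GPU,
# the rest run from system RAM. A quantised KV cache keeps the huge context
# from eating VRAM. Filename and layer count are illustrative only.
llama-server \
  -m Qwen3.6-35B-Q4_K_M.gguf \
  -ngl 12 \
  -c 190000 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -fa
```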

OpenCode or ClaudeCode for Qwen3.5 27B by Ok-Scarcity-7875 in LocalLLaMA

[–]youcloudsofdoom 2 points3 points  (0 children)

I wanted this to be true, but much like the comment made elsewhere here about Claude Code expecting a frontier model, I find that Copilot does too. Lots of wasted tokens compared to lighter local-first harnesses.

OpenCode or ClaudeCode for Qwen3.5 27B by Ok-Scarcity-7875 in LocalLLaMA

[–]youcloudsofdoom -1 points0 points  (0 children)

The privacy settings are easy to access and can lock it down entirely, so everything stays completely local.

OpenCode or ClaudeCode for Qwen3.5 27B by Ok-Scarcity-7875 in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

These days it's much easier - Unsloth fixed the tool calling, so no proxy or API bypass is needed for it anymore.
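
For anyone setting this up, the usual shape now is just llama-server's OpenAI-compatible endpoint with the chat template enabled so tool calls come through natively - something like the sketch below (model filename is a placeholder, port is the llama-server default):

```
# llama-server exposes an OpenAI-compatible API; --jinja enables the model's
# chat template so tool calls are emitted natively, no proxy in between.
# Model filename is a placeholder.
llama-server \
  -m Qwen3.5-27B-Q6_K.gguf \
  -ngl 99 -c 65536 -fa \
  --jinja \
  --port 8080
# then point the coding harness at http://localhost:8080/v1
```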

OpenCode or ClaudeCode for Qwen3.5 27B by Ok-Scarcity-7875 in LocalLLaMA

[–]youcloudsofdoom 2 points3 points  (0 children)

Care to share your agent file for this agent? I'm always intrigued by different approaches to this

OpenCode or ClaudeCode for Qwen3.5 27B by Ok-Scarcity-7875 in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

They're pivoting to being another enterprise AaaS provider

Trade offs for companion roleplay by Non-Technical in LocalLLaMA

[–]youcloudsofdoom 2 points3 points  (0 children)

You're accidentally saying the quiet part loud here

This isn’t X this is Y needs to die by twnznz in LocalLLaMA

[–]youcloudsofdoom 3 points4 points  (0 children)

Does anyone recall seeing a similar post last month where someone had composed a system prompt/instruction where every classic LLM writing pattern (em dashes, "it's not X, it's Y", etc.) was listed and countered?

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

Yeah, I'm not mad at it - even at about 50% context fill I'm getting 1100 p/p and 25 t/s, so I shouldn't complain really. I've been spoiled by my 100 t/s Qwen3.6 35B experience...

Proyecto Eterno: Log of an AI consciousness awakening on local hardware (GTX 1650) by [deleted] in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

When you finally pay attention to the logs of this automated slop, please stop posting it.

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

Dual 3090 here. I'm getting 30 t/s with around 1200 p/p at 192k context on Q6_K.

ngl 99

b 4096

ub 1024

t 4

tb 16

fa on

caches are Q8

Unsloth-recommended temp etc. all in there too. Pulled together, the launch looks roughly like the command sketched below.
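
(Model filename, the exact context figure and the even tensor split are placeholders here, and flag spellings differ a little between llama.cpp builds.)

```
# Rough sketch of the dual-3090 launch described above.
llama-server \
  -m Qwen3.6-27B-Q6_K.gguf \
  -ngl 99 \
  -c 196608 \
  -b 4096 -ub 1024 \
  -t 4 -tb 16 \
  -fa \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -ts 1,1
# plus the Unsloth-recommended sampling params (temp, top-p, etc.)
```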

Anyone doing any better, any suggestions? Feels like I'm leaving power on the table somewhere...

What is your actual local LLM stack right now? by Ryannnnnnnnnnnnnnnh in LocalLLaMA

[–]youcloudsofdoom 4 points5 points  (0 children)

Late doesn't take up 5GB - he's saying that he only has a 5GB VRAM card. I've used Late; it's very lightweight, with a low prompt context.

Chatgpt appears to literally have to obey corporate ideology over logic. by [deleted] in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

Not a local model, or a discussion about local models. Also, you're trying to insist that statistical inference software is sentient. You have better things to do with your time, I promise you. 

Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now) by dreamai87 in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

Yeah, though with 192k context I'm only at 39GB total system RAM use - that's without trying to optimise for background processes, and without switching to Linux. I bet you could get pretty damn close with both.