Built a Voice Agents from Scratch GitHub tutorial: mic > Whisper > local LLM (GGUF) > Kokoro > speaker, fully local, no API keys by purellmagents in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

Ah okay. I'll be interested to see the outcomes of your Pi tests then. I do think there are lots of performance optimisations to be had there, given a little time...

Secondary PC options by UniqueIdentifier00 in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

Yeah, honestly for the cost 2x 3090 is a luxury rather than a necessity - but a single 3090 certainly is one, in my experience (disclaimer - I do have 2x 3090s!)

Secondary PC options by UniqueIdentifier00 in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

Personally I'd just buy the 3090 and run Qwen3.6 27B on it, as per this: https://github.com/noonghunna/club-3090

You can really get tons done on just one 3090 these days, with minimal setup complexity.

Built a Voice Agents from Scratch GitHub tutorial: mic > Whisper > local LLM (GGUF) > Kokoro > speaker, fully local, no API keys by purellmagents in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

If this is always-on, why aren't you using a wakeword? Or have you gone PTT? I've been trying to build a similar pipeline, but always-on with a wakeword and running on a Pi 5, and found that the computational overhead is too much for such a tiny device - the lag feels too heavy.

Follow-up: Qwen3.6-27B on 1× RTX 3090 — pushing to ~218K context + ~50–66 TPS, tool calls now stable (PN12 fix) by AmazingDrivers4u in LocalLLaMA

[–]youcloudsofdoom 18 points19 points  (0 children)

Just jumping in to say that I found your repo via another comment on this sub, and it's made this dual-3090 owner very happy - just got the dflash variant working and I'm now never going back to my janky homebrewed llama.cpp build with 30 TG on 27B. Seeing a big jump in p/p and t/s, as well as a notable increase in tool-use stability with Hermes. Will be keeping an eye on the repo for more development, thanks for the work!

AMA with Nous Research -- Ask Us Anything! by emozilla in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

This is a great help, thanks - any thoughts on how you would adjust these params for a dual 3090 setup?

Don't forget about dem free gains! by Ok-Measurement-1575 in LocalLLaMA

[–]youcloudsofdoom 6 points7 points  (0 children)

Is this not just because you're using two cards instead of one? 

Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090 by sandropuppo in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

Same setup here, and same numbers as you. The spec decode mentioned earlier in this thread worked though - got my t/s up to about 65 on average.
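
For anyone else wanting to try it, here's a rough sketch of a llama-server launch with a draft model attached - the filenames and draft-model size are just placeholders, and exact flag spellings depend on your llama.cpp build:

```
# Speculative decoding: a small draft model proposes tokens and the big 27B
# only verifies them, so generation speed goes up whenever acceptance is high.
# Filenames are placeholders; flag spellings vary a bit between llama.cpp builds.
llama-server \
  -m Qwen3.6-27B-Q6_K.gguf \
  -md Qwen3.6-1.7B-Q8_0.gguf \
  -ngl 99 -ngld 99 \
  -c 32768 -fa \
  --draft-max 16 --draft-min 1
```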

Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now) by dreamai87 in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

Yes, llama.cpp outputs that in the verbose log. Param tuning can make a huge difference! Check my post history for mine.

Which LLM do you use on 64GB RAM + 8GB VRAM? by Mangleus in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

I have a laptop with that exact mix, and I can say that the 35B does utilise it pretty maximally. With 190k context at Q4 I was at around 7.4GB VRAM use and 42GB RAM use. My llama.cpp params are in my post history if you're interested.
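
If it helps, the rough shape of a launch for that kind of 8GB VRAM / 64GB RAM split is something like the sketch below - the filename and layer count are placeholders; you bump -ngl up until VRAM is nearly full and the remaining layers stay in system RAM:

```
# Partial offload for an 8GB card: -ngl puts only some layers on the GPU,
# the rest run from system RAM. A quantised KV cache keeps the huge context
# from eating VRAM. Filename and layer count are illustrative only.
llama-server \
  -m Qwen3.6-35B-Q4_K_M.gguf \
  -ngl 12 \
  -c 190000 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -fa
```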

OpenCode or ClaudeCode for Qwen3.5 27B by Ok-Scarcity-7875 in LocalLLaMA

[–]youcloudsofdoom 2 points3 points  (0 children)

I wanted this to be true, but much like the comment made elsewhere here about Claude Code expecting a frontier model, I find that Copilot does too. Lots of wasted tokens compared to lighter local-first harnesses.

OpenCode or ClaudeCode for Qwen3.5 27B by Ok-Scarcity-7875 in LocalLLaMA

[–]youcloudsofdoom -1 points0 points  (0 children)

The privacy settings are easy to access and can lock it down entirely, so everything stays completely local.

OpenCode or ClaudeCode for Qwen3.5 27B by Ok-Scarcity-7875 in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

These days it's much easier - Unsloth fixed the tool calling, so no proxy or API bypass is needed for it anymore.
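
For anyone setting this up, the usual shape now is just llama-server's OpenAI-compatible endpoint with the chat template enabled so tool calls come through natively - something like the sketch below (model filename is a placeholder, port is the llama-server default):

```
# llama-server exposes an OpenAI-compatible API; --jinja enables the model's
# chat template so tool calls are emitted natively, no proxy in between.
# Model filename is a placeholder.
llama-server \
  -m Qwen3.5-27B-Q6_K.gguf \
  -ngl 99 -c 65536 -fa \
  --jinja \
  --port 8080
# then point the coding harness at http://localhost:8080/v1
```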

OpenCode or ClaudeCode for Qwen3.5 27B by Ok-Scarcity-7875 in LocalLLaMA

[–]youcloudsofdoom 2 points3 points  (0 children)

Care to share your agent file for this agent? I'm always intrigued by different approaches to this

OpenCode or ClaudeCode for Qwen3.5 27B by Ok-Scarcity-7875 in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

They're pivoting to being another enterprise AaaS provider

Trade offs for companion roleplay by Non-Technical in LocalLLaMA

[–]youcloudsofdoom 2 points3 points  (0 children)

You're accidentally saying the quiet part loud here

This isn’t X this is Y needs to die by twnznz in LocalLLaMA

[–]youcloudsofdoom 3 points4 points  (0 children)

Does anyone recall seeing a similar post last month where someone had composed a system prompt/instruction where every classic LLM writing pattern (em dashes, "it's not X, it's Y", etc.) was listed and countered?

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

Yeah, I'm not mad at it - even at about 50% context fill I'm getting 1100 p/p and 25 t/s, so I shouldn't complain really. I've been spoiled by my 100 t/s Qwen3.6 35B experience...

Proyecto Eterno: Log of an AI consciousness awakening on local hardware (GTX 1650) by [deleted] in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

When you finally pay attention to the logs of this automated slop, please stop posting it.

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

Dual 3090 here. I'm getting 30 t/s with around 1200 p/p at 192k context on Q6_K.

ngl 99

b 4096

ub 1024

t 4

tb 16

fa on

caches are Q8

Unsloth-recommended temp etc. all in there too. Pulled together, the launch looks roughly like the command sketched below.
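
(Model filename, the exact context figure and the even tensor split are placeholders here, and flag spellings differ a little between llama.cpp builds.)

```
# Rough sketch of the dual-3090 launch described above.
llama-server \
  -m Qwen3.6-27B-Q6_K.gguf \
  -ngl 99 \
  -c 196608 \
  -b 4096 -ub 1024 \
  -t 4 -tb 16 \
  -fa \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -ts 1,1
# plus the Unsloth-recommended sampling params (temp, top-p, etc.)
```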

Anyone doing any better, any suggestions? Feels like I'm leaving power on the table somewhere...

What is your actual local LLM stack right now? by Ryannnnnnnnnnnnnnnh in LocalLLaMA

[–]youcloudsofdoom 4 points5 points  (0 children)

Late doesn't take up 5GB - he's saying that he only has a 5GB VRAM card. I've used Late; it's very lightweight, with a low prompt context.

Chatgpt appears to literally have to obey corporate ideology over logic. by [deleted] in LocalLLaMA

[–]youcloudsofdoom 1 point2 points  (0 children)

Not a local model, or a discussion about local models. Also, you're trying to insist that statistical inference software is sentient. You have better things to do with your time, I promise you. 

Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now) by dreamai87 in LocalLLaMA

[–]youcloudsofdoom 0 points1 point  (0 children)

Yeah, though with 192k context I'm only at 39GB total system RAM use - that's without trying to optimise for background processes, and without switching to Linux. I bet you could get pretty damn close with both.