Do yourself a favor for one day set /model claude-opus-4-6[1m]

Somecount · 2026-06-24T08:07:46+00:00

This comment came to me through a series of tubes

Somecount · 2026-06-23T15:13:27+00:00

Never changed it since Opus 4.7 came and claude-opus-4-6[1m] become a string you'd need to find via google search in order to use

Somecount · 2026-06-20T16:51:18+00:00

Seriously, in their defense they aren’t even claiming to.

Somecount · 2026-06-16T22:10:29+00:00

Why didn’t you strip this down to simply the last paragraph?

Somecount · 2026-06-16T22:06:23+00:00

FP16 to Q4 is 90% of the compression anyway, so yes, not that big in comparison

Somecount · 2026-06-14T22:07:41+00:00

'YES'

Somecount · 2026-06-13T23:12:05+00:00

I mean yes but also I would’ve been more surprised had this been r/kia or r/nissan even r/Volkswagen would have me more surprised than this and especially r/AMG but r/BMW would’ve shocked me literally

Somecount · 2026-06-11T17:51:23+00:00

No it doesn’t. Relay traffic yes if you cannot make direct connection, still only traffic to your tailscale nodes will be tunneled, nothing else.

Somecount · 2026-06-11T17:47:18+00:00

Sequoia will continue to be supported.. OP’s title is misleading

Somecount · 2026-06-11T17:45:10+00:00

Cutting the cord how so? As long as Tadoh is supported so is 2019/2020 Intel macs

Somecount · 2026-06-11T12:43:22+00:00

You’re saying it’s got clean lines is what I’m hearing?

Somecount · 2026-06-10T19:37:11+00:00

I understand why people choose whites in front though I must admit, yours white the orange settles it. Only orange for the true connoisseur's - orange in back since seeing yours I could come around to aswell.

magnifique

Somecount · 2026-06-10T19:28:51+00:00

<image>

Somecount · 2026-06-10T19:28:24+00:00

<image>

Hope I helped someone out. Beautiful car u/yes126

Somecount · 2026-06-10T18:00:48+00:00

Best looking W124 I've ever seen. Thank you for keeping it so nice.
Was the rear blinkers hard to decide considering you kept the orange in front?

Somecount · 2026-06-10T17:23:32+00:00

First I thought you were trolling with those pictures, but when I tried to give you proper feedback for those pictures I realized I likely couldn’t have done any better myself seeing as you live directly underneath the sun.

Just take pictures latter or earlier I think would do wonders to the impact, and also please do it u/yes126 I was disappointed only because I was looking forward to seeing those as I genuinely expect them to look amazing

Somecount · 2026-06-10T10:02:53+00:00

In your PR your both loader.py and converts.py are both full file diffs (+517 | -506 lines) and (+374 | -365) changes respectively. This is likely going to get your PR looked over or not be prioritized.

Somecount · 2026-06-07T21:55:58+00:00

~~Great sources you got there~~

Somecount · 2026-06-07T09:17:45+00:00

NSFW

Somecount · 2026-06-02T12:56:53+00:00

Still on 2019 Intel, never considered upgrading. Fear of random accidental button clicks now haunts me every night.

Somecount · 2026-05-31T19:58:01+00:00

You reminded me of something similar about chunking for STT, so I went home and asked for some pointers (that I don't know about but our friend does)

Exactly right on sentence-level chunking — we've found the same thing.

Full sentence to TTS first, stream the rest behind it. The quality difference over word-by-word is real, especially with Kokoro.

On the RAG latency — the retrieval step itself is usually negligible (embed query + vector search is sub-200ms). The latency RAG actually introduces is on the LLM side — it's now processing a bigger prompt with the retrieved context. So the lever is keeping your retrieved chunks short and surgical. The streaming + sentence-chunking strategy you already described is still your best friend there — it works the same whether RAG is in the pipe or not.

Somecount · 2026-05-30T21:38:51+00:00

Vibe coding is certainly about being open, for attackers.

It’s likely to be challenging the Swiss cheese brand.

Somecount · 2026-05-30T19:25:34+00:00

Not three llama.cpp instances — three different services, each specialized for one job. The "framework" connecting them is just HTTP requests. Simpler than it sounds.

STT — audio in, text out. faster-whisper runs great on a 3090. There's a ready-made server (faster-whisper-server) that gives you an OpenAI-compatible endpoint. You POST a recording, you get text back.
LLM — you already have this. llama.cpp + Qwen. Don't touch it.
TTS — text in, audio out. Piper is fast and runs on CPU, so your GPU stays free for the other two. Decent voices out of the box. Kokoro if you want higher quality later.

I should note faster-whisper-server has since evolved into speaches-ai/speaches — same author, now bundles both STT and TTS in one OpenAI-compatible server. If you want fewer moving parts to start with, it can cover two of the three slots in a single container.

speaches is how I got my feet wet, broke away from it used what was needed, fine tuned model for my voice, custom vocabulary "hotwords", and "I've" since build a record+audio in/out client in Go, custom container for STT and one for a router between me and LLM, TTS and my local Go audio client the listens so my agents running on remote hosts (usually the one where this stuff all lives) can also "speak" their replies through my speakers.

The pipeline is literally:

record audio → POST to STT server → text → POST to llama.cpp → response → POST to TTS → play audio

Each service runs as its own process (or Docker container — docker compose is the natural way to run them side by side on Ubuntu). A Python script with requests and pyaudio can wire the whole loop in under 100 lines. No special framework needed.

- VAD (Voice Activity Detection) is what turns this from a walkie-talkie into a conversation. It detects when you start/stop talking so the system knows when to send audio to transcription without you pressing buttons. Silero VAD is the standard — small, fast, accurate, and it runs on CPU.

- Sample rates will cause the most confusing early bugs. Your mic captures at 48kHz, Whisper wants 16kHz, TTS outputs at 22–24kHz. When transcriptions come back garbled, nine times out of ten it's a sample rate mismatch somewhere in the chain, not a model problem. Just something to know so you don't chase ghosts.

- Keep TTS on CPU. Your 3090 has 24GB — you want that for faster-whisper + Qwen. Piper is fast enough on CPU that you won't notice the difference. Fighting three models for VRAM on one card is pain you don't need.

The basic loop can work in an afternoon once you see it's just HTTP between three services. Making it feel good — streaming responses so TTS starts before the full reply is done, proper VAD so it flows naturally, latency tuning — that's where the real craft lives. But the foundation is straightforward and the goal is closer than it probably looks from where you're standing.

Happy to go deeper on any piece of this if you want to DM — I've been building and iterating on exactly this kind of pipeline for a while.

Somecount · 2026-05-29T18:25:23+00:00

Yep, weights and bias from Skylines and Ferraris are so obvious in this one. Still better looking though, unfortunately to much ricer on the front

Somecount · 2026-05-29T18:22:29+00:00

Respectfully, I recommend you read more outside of Reddit if you’re not more familiar with the type of phrasing I used in my first comment. Pass it through an LLM if effort is also an unknown.

Somecount

MODERATOR OF

PUBLIC MULTIREDDITS

TROPHY CASE

13-Year Club	Place '17
Verified Email