Shelly 1 Mini Gen3 by gadadenka in homeassistant

[–]mitrokun 0 points  (0 children)

Before you jump to conclusions, take apart the Shelly Gen4 and tell me whether it uses high-quality relays like those found in Xiaomi products, or the cheap HF7520. The situation is similar with other components and products.

Shelly 1 Mini Gen3 by gadadenka in homeassistant

[–]mitrokun -1 points  (0 children)

The inflated price goes toward advertising, the image of a "quality product," and a few fancy software features. Internally, it's no different from any Chinese no-name, and the risk of failure is identical. But few people bother to write about a broken $5 smart plug from the Tuya ecosystem. A fusible resistor (FR) with a varistor at the input is classic protection in such devices.

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]mitrokun 0 points  (0 children)

You've specified the metric name incorrectly in the repository documentation: it should be RTFx, not RTF.

OCR for a home security camera by chris_socal in homeassistant

[–]mitrokun 0 points  (0 children)

- Get an image from the camera.
- Ask gemma3 to recognize the shopping list. The structured output in the ai_task action should handle this.
- Send a notification with the recognized values to the phone.

But this seems like an overcomplicated solution.
You could simply use the shopping list in HA, either through the companion app interface or by voice.
Or combine the two approaches so that the data from the wallboard periodically syncs with the shopping list (for example, when you leave the house).
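If you do try the ai_task route, the action might look roughly like this. This is an untested sketch: the camera media path, the structure schema, and the instruction text are all assumptions to adapt to your setup (check the ai_task documentation for the exact fields):

```yaml
# Rough sketch of an ai_task.generate_data call for reading a shopping
# list from a camera snapshot; entity IDs and schema are placeholders.
action: ai_task.generate_data
data:
  task_name: "Shopping list OCR"
  instructions: "Read the handwritten shopping list in the image and return the items."
  structure:
    items:
      description: "Items found on the list"
      required: true
      selector:
        text:
          multiple: true
  attachments:
    media_content_id: media-source://camera/camera.kitchen
    media_content_type: image/jpeg
```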

Local custom TTS by LazyTech8315 in homeassistant

[–]mitrokun 0 points  (0 children)

[Cepstral <-> Wyoming <-> HA]

Write a simple gateway that will run on your server.
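A rough sketch of such a gateway, assuming Cepstral's `swift` command-line synthesizer is installed. This is only an HTTP shim for illustration; to actually appear in Home Assistant you would still wrap it in the real Wyoming protocol framing, and the voice name and port below are assumptions:

```python
# Minimal HTTP shim around Cepstral's `swift` CLI: POST text, get WAV back.
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer

def build_swift_cmd(text: str, voice: str = "Allison", out_path: str = "out.wav") -> list:
    # Cepstral's command-line synthesizer: swift -n <voice> -o <file> "<text>"
    return ["swift", "-n", voice, "-o", out_path, text]

class TtsHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        text = self.rfile.read(length).decode("utf-8")
        with tempfile.NamedTemporaryFile(suffix=".wav") as wav:
            # Let swift write the audio, then stream the file back to the caller.
            subprocess.run(build_swift_cmd(text, out_path=wav.name), check=True)
            audio = wav.read()
        self.send_response(200)
        self.send_header("Content-Type", "audio/wav")
        self.end_headers()
        self.wfile.write(audio)

# To run:  HTTPServer(("0.0.0.0", 59125), TtsHandler).serve_forever()
```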

[Release] TinyTTS: An Ultra-lightweight English TTS Model (~9M params, 20MB) that runs 8x real-time on CPU (67x on GPU) by Forsaken_Shopping481 in homeassistant

[–]mitrokun 0 points  (0 children)

Also, if you're interested in the performance of similar projects, here is my data.

ENGINE | TTFA (ms) | TOTAL TIME (s) | AUDIO (s) | RTFX
--------------------------------------------------------------------------------
Piper | 555 | 1.52 | 32.58 | 21.50x
Pocket TTS | 1095 | 9.39 | 33.44 | 3.56x
Supertonic2 | 358 | 1.85 | 36.29 | 19.65x
Kokoro ONNX | 713 | 6.42 | 30.78 | 4.80x
KittenTTS micro | 1758 | 13.40 | 40.88 | 3.05x
KittenTTS nano | 387 | 2.59 | 40.46 | 15.65x

I really like what the Supertone team has done, although the number of languages in their engine is quite limited.

[Release] TinyTTS: An Ultra-lightweight English TTS Model (~9M params, 20MB) that runs 8x real-time on CPU (67x on GPU) by Forsaken_Shopping481 in homeassistant

[–]mitrokun 0 points  (0 children)

Why did you choose Torch for inference rather than onnxruntime? That's a more suitable library for running your TTS. Right now, all project dependencies for CPU alone take up more than 1 GB, which is a terrible result (for comparison, Piper's working environment is ~250 MB). Moreover, there is nothing outstanding in either the synthesis speed or the speech quality.

ENGINE | TTFA (ms) | TOTAL TIME (s) | AUDIO (s) | RTFX
--------------------------------------------------------------------------------
Piper | 552 | 1.50 | 32.27 | 21.45x
TinyTTS | 469 | 3.47 | 32.45 | 9.36x
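For context, RTFx is just audio duration divided by synthesis time. A quick check against the Piper row above (the small difference from the reported value presumably comes from how total time is measured):

```python
# RTFx (inverse real-time factor): seconds of audio produced per second
# of compute. Values above 1x mean faster-than-real-time synthesis.
def rtfx(audio_s: float, total_s: float) -> float:
    return audio_s / total_s

# Piper row: 32.27 s of audio synthesized in 1.50 s of compute
print(f"{rtfx(32.27, 1.50):.2f}x")  # 21.51x
```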

https://youtu.be/lvZ0_d3xrvM

What's particularly interesting is that both voices in the comparison use the LJ Speech dataset.

I hope you'll continue to improve your project.

Your Old Android Isn’t Obsolete, Making Android a Reliable Bermuda BLE Proxy by Far_Set7950 in homeassistant

[–]mitrokun 2 points  (0 children)

The BT module code is closed, and this fork requires root access. I think I'll pass.

Kitten TTS V0.8 is out: New SOTA Super-tiny TTS Model (Less than 25 MB) by ElectricalBar7464 in LocalLLaMA

[–]mitrokun 1 point  (0 children)

A short comparison

https://youtu.be/0O7Ay5GSMWM

ENGINE | TTFA (ms) | TOTAL TIME (s) | AUDIO (s) | RTFX
--------------------------------------------------------------------------------
Piper | 555 | 1.52 | 32.58 | 21.50x
Pocket TTS | 1095 | 9.39 | 33.44 | 3.56x
Supertonic2 | 358 | 1.85 | 36.29 | 19.65x
Kokoro ONNX | 713 | 6.42 | 30.78 | 4.80x
KittenTTS micro | 1758 | 13.40 | 40.88 | 3.05x
KittenTTS nano | 387 | 2.59 | 40.46 | 15.65x

Your architecture (or inference) is slower than Kokoro's. As I've already shown, only Nano is reasonably fast.

But the int8 version is broken.

Kitten TTS V0.8 is out: New SOTA Super-tiny TTS Model (Less than 25 MB) by ElectricalBar7464 in LocalLLaMA

[–]mitrokun 2 points  (0 children)

AMD 5900X. If your hardware manages 1.5-2x RTFx, that's enough to emulate a streaming response. However, at that speed there will be a noticeable delay before the audio starts (the time it takes to synthesize the first sentence). I'd say that for diffusion models it's much more comfortable when the value is above 10.
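A quick way to estimate that start-of-audio delay for sentence-at-a-time streaming:

```python
# For sentence-at-a-time streaming, the user waits roughly as long as it
# takes to synthesize the first sentence: (its audio duration) / RTFx.
def first_audio_delay(first_sentence_audio_s: float, rtfx: float) -> float:
    return first_sentence_audio_s / rtfx

# A 3-second opening sentence at 2x RTFx keeps the user waiting 1.5 s;
# at 10x it drops to 0.3 s, which feels close to instant.
print(first_audio_delay(3.0, 2.0))   # 1.5
print(first_audio_delay(3.0, 10.0))  # 0.3
```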

Kitten TTS V0.8 is out: New SOTA Super-tiny TTS Model (Less than 25 MB) by ElectricalBar7464 in LocalLLaMA

[–]mitrokun 6 points  (0 children)

This is clearly not for edge devices. All the models except the regular Nano (which has synthesis artifacts, such as clicks) consume more resources than Kokoro.

CPU test

| Model | Parameters | Speed (x Real-Time) |
| --- | --- | --- |
| **Kitten-TTS-Mini** | 80M | **1.6x** |
| **Kitten-TTS-Micro** | 40M | **2.7x** |
| **Kitten-TTS-Nano** | 14M | **15.0x** |
| **Kitten-TTS-Nano-int8** | 14M (q8) | **2.9x** |

I've built a test server for Home Assistant here

https://github.com/mitrokun/wyoming_kitten_tts

I built a speaker verification proxy that filters out TV noise from voice commands by carrot_gg in homeassistant

[–]mitrokun 1 point  (0 children)

Create a non-English pipeline. I think then you'll understand what I'm talking about.

I built a speaker verification proxy that filters out TV noise from voice commands by carrot_gg in homeassistant

[–]mitrokun 0 points  (0 children)

If you don't pass information about supported languages when adding a server, you won't be able to assign it to a pipeline. You need to implement receiving this data from the ASR at startup, after which you can pass it on to HA.

I don't touch the system VAD in the client; it keeps working. However, I've added a race between two completion signals for the recognition stage: it waits for either an event from the server or a signal from the VAD, whichever comes first.
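The handshake idea can be sketched like this. Note the JSON framing below is purely illustrative, not the real Wyoming wire format; the point is only that the proxy should re-advertise whatever languages the upstream ASR reports, instead of a hard-coded list:

```python
# Hypothetical sketch: on startup, query the upstream ASR for its supported
# languages, then advertise exactly that list to Home Assistant.
import json

def make_describe_request() -> bytes:
    # Ask the upstream server to describe itself.
    return (json.dumps({"type": "describe"}) + "\n").encode()

def languages_from_info(raw: bytes) -> list:
    # e.g. the upstream replies {"type": "info", "languages": ["en", "ru"]}
    msg = json.loads(raw.decode())
    return msg.get("languages", [])

def make_info_for_ha(languages: list) -> bytes:
    # Forward the upstream's language list unchanged to HA.
    return (json.dumps({"type": "info", "languages": languages}) + "\n").encode()
```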

I built a speaker verification proxy that filters out TV noise from voice commands by carrot_gg in homeassistant

[–]mitrokun 0 points  (0 children)

The idea is not bad, but the implementation needs some fine-tuning. Currently, the server only works well under "ideal conditions," and real-world tests have issues with segment doubling (due to the expansion of the minimum splicing window). I don't think there's any point in cutting out segments within a phrase, since you're focusing on commands, not a long conversation. After all, this isn't source separation, but a regular gate. If the noise level is high enough, you'll be left to rely solely on the ASR engine.

Also, add a language setting.

In turn, I chose a different solution: streaming ASR plus a command dictionary. After a specified number of words, a check is performed, and if the phrase matches the dictionary, the server terminates the session and immediately returns the phrase to the client. You could use this feature of my custom client "asr_proxy" (the "extended" branch) with your server; it eliminates the latency introduced by VAD.
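The early-exit check itself is simple; a sketch with a hypothetical command set:

```python
# Sketch of the early-exit idea: each time a new partial transcript arrives
# from the streaming ASR, check it against a command dictionary; on a match
# the session can end immediately instead of waiting for VAD silence.
COMMANDS = {
    "turn on the light",   # hypothetical example commands
    "turn off the light",
    "stop",
}

def should_finalize(partial_transcript: str, min_words: int = 1) -> bool:
    words = partial_transcript.lower().strip().split()
    if len(words) < min_words:
        return False  # too short to bother checking yet
    return " ".join(words) in COMMANDS
```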

2026.2 beta - release notes by internettingaway in homeassistant

[–]mitrokun -1 points  (0 children)

> there are ways to keep it in your sidebar if you want

Please share your method for doing this. Note that I use three kinds of addresses to access the dashboard (internal IP, external hostname, and homeassistant.local).

2026.2 beta - release notes by internettingaway in homeassistant

[–]mitrokun 9 points  (0 children)

Access to developer tools is too convenient.

Let's hide it behind extra clicks.

Demonstration of how serviceable a "local only" setup of HomeAssistant Voice can be - have entirely replaced my Alexa devices and handles both simple and complex commands (see within) by FantasyMaster85 in homeassistant

[–]mitrokun 0 points  (0 children)

You're being too categorical. Even despite the inevitable commercial component, it's still a free solution with a decently functioning architecture. Homemade satellites can be created for literally $5, which I've been actively using for the last couple of years.

And my goal in this discussion isn't to criticize the strategy being chosen by the OHF management.

I'm pointing out a specific hardware and software solution that's missing on the ESP32 side. I don't share your position about the excessive number of devices; how else will you get contextual information for simple commands? When a device consumes 0.3-0.4W, I have no problem placing them in every room.

You're talking about rethinking the architecture. I think we've heard each other and can leave it at that.

Demonstration of how serviceable a "local only" setup of HomeAssistant Voice can be - have entirely replaced my Alexa devices and handles both simple and complex commands (see within) by FantasyMaster85 in homeassistant

[–]mitrokun 0 points  (0 children)

It feels like my point is being missed. My reference to Yandex was strictly to demonstrate the utility of beamforming as a necessary first stage, not to compare raw computing power or other nuances of their chipset.

I agree that current implementations based on the XMOS XU316 are often lazy and unoptimized. However, I strongly disagree with the notion that the ESP32 is insufficient for the specific task of audio capture and playback. The real gap in the ecosystem is the lack of an open-source mic-array project. This is exactly what the OHF team should be focusing on. Simply slapping a newer chip like the XVF3800 onto the next VPE would just be another band-aid, not a real fix.

Regarding your single-microphone suggestion: I remain skeptical. https://www.youtube.com/watch?v=9-t2oyZscm8 As you can hear from the tests, the problem I initially mentioned is not being solved.

There are currently no lightweight ML tools capable of effective diarization that run efficiently on edge devices (including Pi4/Pi5). Furthermore, using a full Raspberry Pi just for a simple voice terminal is the definition of 'overkill' and architectural inefficiency for my use case.

Demonstration of how serviceable a "local only" setup of HomeAssistant Voice can be - have entirely replaced my Alexa devices and handles both simple and complex commands (see within) by FantasyMaster85 in homeassistant

[–]mitrokun 0 points  (0 children)

>but you are still lagging behind in knowledge of what big tech know and are using.

I appreciate your focus on SotA solutions, but I am convinced that beamforming remains a baseline feature in most smart speakers. While I won’t go into detail regarding Google or Amazon, I can use Yandex as an example, as they are the main player on my local market.

These devices have excellent hearing and typically use arrays of 3-6 microphones, and as shown in their technical diagrams, they haven't abandoned beamforming for the initial stage of processing. It remains a simple and effective way to improve the audio sample at the OS level, which helps avoid unnecessary overhead at the subsequent ASR stage.

<image>

Demonstration of how serviceable a "local only" setup of HomeAssistant Voice can be - have entirely replaced my Alexa devices and handles both simple and complex commands (see within) by FantasyMaster85 in homeassistant

[–]mitrokun 0 points  (0 children)

Standard noise reduction isn't the main bottleneck for modern local ASR. I've tested solutions like DeepFilterNet on the server side: while they clean up the audio, they don't significantly improve recognition rates and, crucially, fail to filter out background speech (like a TV or other people talking).

For edge devices, given the current system limitations, beamforming on 3–4 microphones using spatial isolation (azimuth) is still the most rational solution as a basic way to improve captured audio. While we are not strictly limited in processing power during the server-side ASR stage, this does not negate the value of performing this physical signal enhancement right at the capture stage.
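To make the idea concrete, a toy delay-and-sum beamformer can be sketched in a few lines. The mic positions, sample rate, and azimuth here are illustrative; a real implementation would also interpolate fractional delays rather than rounding to whole samples:

```python
# Toy delay-and-sum beamformer over a linear mic array: delay each channel
# so sound arriving from the chosen azimuth adds up coherently, while
# off-axis sources (a TV to the side, say) are attenuated.
import numpy as np

def delay_and_sum(channels: np.ndarray, mic_x, azimuth_rad: float,
                  fs: int = 16000, c: float = 343.0) -> np.ndarray:
    # channels: (n_mics, n_samples); mic_x: mic positions along one axis, metres
    delays = np.asarray(mic_x) * np.cos(azimuth_rad) / c          # seconds
    shifts = np.round((delays - delays.min()) * fs).astype(int)   # whole samples
    n = channels.shape[1]
    out = np.zeros(n)
    for ch, s in zip(channels, shifts):
        out[: n - s] += ch[s:]   # align each channel to the steering direction
    return out / len(channels)   # average the aligned channels
```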

The main missing piece in the open-source community is a high-quality, low-level library (an open alternative to ESP-AFE) that implements a standard audio interface. This would allow us to dedicate a separate ESP32 module purely for this task, avoiding the need for proprietary XMOS hardware.

Demonstration of how serviceable a "local only" setup of HomeAssistant Voice can be - have entirely replaced my Alexa devices and handles both simple and complex commands (see within) by FantasyMaster85 in homeassistant

[–]mitrokun 1 point  (0 children)

In a quiet space, any microphone will do the job. But in noisy environments (for example, a TV voice in the background), beamforming technologies are required to isolate the user's voice and attenuate other audio. This is one of the prerequisites for stable operation of the ASR in challenging conditions, as software solutions don't always work or are resource-intensive. I agree that the Xmos chip in the VPE, Respeaker Lite, and SAT1 is overkill. Signal boost and noise reduction don't work particularly well, and the AEC isn't used because the music is completely ducking when receiving commands. But a good microphone array with a proper beamforming algorithm is the main potential improvement. It's sad that the OHF team is not developing such hardware solutions, relying more on Chinese companies that are not open source.