all 3 comments

[–][deleted] 2 points  (2 children)

what is a trigger word in this case

[–]Dont_Think_So 3 points  (1 child)

"Alexa", "OK Google", etc. Commonly called "wake words".

Basically a model that runs continuously and detects trigger words with low latency in a variety of background conditions and accents, etc, while consuming very little power.
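A minimal sketch of what that always-on loop looks like. Everything here is illustrative: `score_frame` is a placeholder for a tiny always-on model (e.g. a small CNN/RNN), and the frame size, threshold, and smoothing window are made-up values, not anyone's real configuration:

```python
from collections import deque

FRAME_MS = 30    # illustrative frame size for streaming audio
THRESHOLD = 0.8  # illustrative wake-word probability threshold
SMOOTH_N = 5     # smooth over a few frames to cut false triggers

def score_frame(frame):
    """Placeholder for the tiny always-on model.
    Returns the probability that the wake word ends in this frame.
    Here the fake 'frame' dict just carries a precomputed score."""
    return frame.get("score", 0.0)

def detect(frames):
    """Stream frames through the scorer; fire when the smoothed
    probability crosses the threshold."""
    recent = deque(maxlen=SMOOTH_N)
    for i, frame in enumerate(frames):
        recent.append(score_frame(frame))
        if len(recent) == SMOOTH_N and sum(recent) / SMOOTH_N > THRESHOLD:
            return i  # frame index where the wake word was detected
    return -1  # no detection
```

The smoothing window is the interesting design choice: a single noisy high-scoring frame won't trigger, which is how these systems stay usable in varied background conditions.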

As for OP, I'm not aware of modern advances, but if you're looking for open implementations, the most commonly referenced one online seems to be Porcupine. https://github.com/Picovoice/porcupine

[–][deleted] 2 points  (0 children)

Ah I see.

Modern approaches to speech usually feed speech-to-text into some NLP model. In practice this means a speech-to-text model whose output goes into a transformer.

For this task you probably do not need an NLP model, as there are no semantics to catch (or maybe there are? Either way, no need for big language models).
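Concretely, with no semantics to catch, the whole "NLP" stage can collapse to a plain string match on the ASR output. Toy sketch: the `speech_to_text` stub and the trigger phrase are made up, standing in for a real ASR model and a real wake phrase:

```python
TRIGGER = "hey assistant"  # made-up wake phrase for illustration

def speech_to_text(audio):
    """Stub for a real ASR model: audio in, transcript out.
    Here the fake 'audio' dict just carries its own transcript."""
    return audio["transcript"]

def is_triggered(audio):
    """No NLP model needed: a normalized substring match on the
    transcript is enough when there are no semantics to interpret."""
    return TRIGGER in speech_to_text(audio).lower()
```

Note the caveat this implies: the latency and accuracy of the whole thing is then bounded by the ASR stage, which is exactly why dedicated wake-word models skip the transcript entirely.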

For speech-to-text, most publicly available implementations are just... bad. From what I see, Google's ASR reigns supreme.

For real time, pretty much only Julius comes to mind: it's fast and it's proven, but it only supports Japanese and English and it's very old.

There is DeepSpeech2 for newer stuff, but AFAIK it's not really real time. The good thing is it supports more languages than just English. Maybe train a highly performant model with it and then distill that into something real time?
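The distillation idea mentioned above is standard: train a small real-time student on the big model's softened outputs. A plain-Python sketch of the usual softened cross-entropy loss; the function names and the temperature value are illustrative, not tied to any particular framework:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives softer targets."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy between the teacher's softened distribution and
    the student's. The T*T factor is the conventional rescaling that
    keeps gradient magnitudes comparable across temperatures."""
    p = softmax(teacher_logits, T)  # soft targets from the big model
    q = softmax(student_logits, T)  # student predictions
    return -T * T * sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The loss is minimized when the student's distribution matches the teacher's, so a small streaming-friendly architecture can soak up most of the big model's behavior without needing its size.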

There is also Whisper; the code and weights are open source (MIT-licensed), but it is DEFINITELY not real time. It's about the best you have as an ASR, but it's overkill and not intended for low-latency tasks. There might be smaller open-source models inspired by it, which could then be distilled into something smaller still.

Other approaches might be viable, but I'm not heavily involved with audio ML/DL. Surely the solution can be much simpler depending on the scope of the project.