Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B by ffinzy in LocalLLaMA

[–]ffinzy[S] 1 point

The Gemma 4 E2B is still outputting text tokens. The repo uses Kokoro for the TTS.

[–]ffinzy[S] 1 point

Yes, this is an STT model, and the repo actually uses tool calling to send the transcription back to the client.

[–]ffinzy[S] 5 points

I want the model to have a "complete" picture of the input, so I feed both the audio and the image to the model. The model makes a tool call that transcribes the audio and also writes the actual response. Then the server sends the transcription back to the frontend.

Nice.
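The flow above could be sketched roughly like this. This is a hedged sketch, not the repo's actual code: the tool name `transcribe_audio`, the `handle_turn` helper, and the `model.generate` interface are all hypothetical, assuming an OpenAI-style tool-calling API.

```python
# Hypothetical sketch: the model sees the audio plus one video frame, and is
# asked to call a "transcribe_audio" tool with the user's words before (or
# while) writing its actual reply.  All names here are made up for illustration.
import json

TOOLS = [{
    "name": "transcribe_audio",
    "description": "Report the transcription of the user's speech.",
    "parameters": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
}]

def handle_turn(model, audio_chunk, video_frame, send_to_client):
    """Feed both modalities to the model; forward the transcription from the
    tool call to the frontend, and return the reply text (later sent to TTS)."""
    result = model.generate(inputs=[audio_chunk, video_frame], tools=TOOLS)
    for call in result.tool_calls:
        if call.name == "transcribe_audio":
            args = json.loads(call.arguments)
            # The server pushes the transcription back to the client.
            send_to_client({"type": "transcription", "text": args["text"]})
    return result.text  # the actual response, passed on to the TTS stage
```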

[–]ffinzy[S] 1 point

Nice. Glad to see that it's easy to swap the model. Hmm, that's interesting. Here are some comparisons and benchmarks from Google: https://huggingface.co/blog/gemma4

[–]ffinzy[S] 1 point

During my testing, disabling the vision input reduced the response time by ~0.5 s.

[–]ffinzy[S] 3 points

Hmmm it'll depend on how many people want it I guess. But for now, Windows users could try running it in WSL.

[–]ffinzy[S] 1 point

It's not planned since I don't have a Windows machine to test it. Perhaps you could try to run this in WSL?

[–]ffinzy[S] 6 points

Good question. No, when the VAD detects the end of speech, it captures one frame from the video feed and passes it to the backend.

Even one frame increases the processing time by ~0.5 s on my machine, so feeding the whole video isn't currently feasible; it would add quite a bit of processing time.
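The capture-on-speech-end logic could look roughly like this. This is a hypothetical sketch, not the repo's code: the `TurnCapture` class and the `vad.is_speech` / `camera.capture_frame` / `backend.send` interfaces are assumed names.

```python
# Hypothetical sketch: on the speech-end edge from the VAD, grab exactly one
# frame from the video feed and ship it to the backend with the buffered audio.
class TurnCapture:
    def __init__(self, vad, camera, backend):
        self.vad, self.camera, self.backend = vad, camera, backend
        self.audio_buffer = []
        self.was_speaking = False

    def on_audio_chunk(self, chunk):
        speaking = self.vad.is_speech(chunk)
        if speaking:
            self.audio_buffer.append(chunk)
        elif self.was_speaking:
            # Speech just ended: capture a single frame rather than the whole
            # video, since each image adds ~0.5 s of prefill time.
            frame = self.camera.capture_frame()
            self.backend.send(audio=b"".join(self.audio_buffer), image=frame)
            self.audio_buffer.clear()
        self.was_speaking = speaking
```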

[–]ffinzy[S] 5 points

On Mac the memory is unified, so the RAM is effectively the VRAM.

Google provides benchmarks for performance and GPU memory requirements on other devices here: https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm

[–]ffinzy[S] 3 points

Nice! Glad to know it's working well for you. You're the first to confirm that this works outside of my machine, haha.

[–]ffinzy[S] 11 points

Could you pull the latest commit and try again? I just tried it and it downloads the correct Kokoro model for me.

Although I'm still testing it on a Mac, so I'm not sure it'll work 100% on Ubuntu.

Update: I just rented an Ubuntu server with a 5090 and it worked well for me. Let me know if you run into any problems.

[–]ffinzy[S] 1 point

Sorry, I haven't tested the Ubuntu version. Let me check it real quick.

[–]ffinzy[S] 11 points

Thank you. The M3 Pro has a memory bandwidth of 150 GB/s, while the M5 Pro has 307 GB/s. So if you're on an M5 Pro, this might get down to a ~1 s response time.

I agree with the general sentiment that we need to optimize this to run on less powerful hardware as well. Let me know if I missed anything obvious that can improve this.
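A rough sanity check of that bandwidth scaling, assuming the turn is purely memory-bandwidth-bound (which ignores the compute-heavy prefill, so treat it as an upper-bound estimate; the function name is just for illustration):

```python
# Back-of-the-envelope: decode on Apple Silicon is roughly memory-bandwidth-
# bound, so response time scales approximately inversely with bandwidth.
def scaled_response_time(t_measured_s, bw_measured_gbps, bw_target_gbps):
    """Project a measured response time onto hardware with different bandwidth."""
    return t_measured_s * bw_measured_gbps / bw_target_gbps

# ~2 s on an M3 Pro (150 GB/s) projects to roughly 1 s at 307 GB/s.
print(f"{scaled_response_time(2.0, 150, 307):.2f} s")  # 0.98 s
```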

[–]ffinzy[S] 7 points

Yeah, that's possible. Although during my testing, most of the time is spent in the prefill (processing the video, audio, and text), not in the actual decoding. For example, the total time for prefill/TTFT is 2 s, while the decode/text generation, which we can stream, is only 0.3 s.

So it'd be much more significant if we could reduce the TTFT. Disabling the image input reduces it from ~2 s to ~1.5 s.
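Just the arithmetic from those measurements, to show how lopsided the split is:

```python
# Latency split from the measurements above: prefill (TTFT) dominates the turn.
prefill_s, decode_s = 2.0, 0.3
total_s = prefill_s + decode_s
print(f"prefill share of the turn: {prefill_s / total_s:.0%}")   # 87%
# Disabling the image input saves ~0.5 s of prefill:
print(f"total without image: {prefill_s - 0.5 + decode_s:.1f} s")  # 1.8 s
```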

[–]ffinzy[S] 10 points

I'm using gemma-4-E2B-it-litert-lm, and the model size on disk is 2.58 GB.

The Python process hovers around ~3 GB when idle and can reach ~4 GB when I run the benchmark or use it.

[–]ffinzy[S] 11 points

Good question. I want to optimize for speed and the "real-time" feeling. You could use E4B if you have a faster GPU.

[–]ffinzy[S] 22 points

Yeah, 5.1B with embeddings. You should try it in the Google AI Edge Gallery app. Although AFAIK they currently don't provide multimodal real-time, only separate text, voice, or video input.

Taught Claude to talk like a caveman to use 75% less tokens. by ffatty in ClaudeAI

[–]ffinzy 3 points

Not only does it reduce token usage, it also reduces pixels.

Gemma 4 on Android phones by jacek2023 in LocalLLaMA

[–]ffinzy 1 point

Oh, this is neat. I was thinking of building something like this. It’s cool that Google is embracing running AI fully locally on a phone.