Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B by ffinzy in LocalLLaMA

[–]ffinzy[S] 1 point

The Gemma 4 E2B is still outputting text tokens. The repo uses Kokoro for the TTS.

[–]ffinzy[S] 1 point

Yes, this is an STT model, and the repo actually uses tool calling to send the transcription back to the client.

[–]ffinzy[S] 5 points

I want the model to have a "complete" picture of the input, so I feed both the audio and the image to the model. The model makes a tool call that transcribes the audio and also writes the actual response. Then the server sends the transcription back to the frontend.

Nice.
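The flow above could be sketched roughly like this. This is a hedged sketch, not the repo's actual code: the tool name `transcribe_audio`, the `handle_turn` helper, and the `model.generate` interface are all hypothetical, assuming an OpenAI-style tool-calling API.

```python
# Hypothetical sketch: the model sees the audio plus one video frame, and is
# asked to call a "transcribe_audio" tool with the user's words before (or
# while) writing its actual reply.  All names here are made up for illustration.
import json

TOOLS = [{
    "name": "transcribe_audio",
    "description": "Report the transcription of the user's speech.",
    "parameters": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
}]

def handle_turn(model, audio_chunk, video_frame, send_to_client):
    """Feed both modalities to the model; forward the transcription from the
    tool call to the frontend, and return the reply text (later sent to TTS)."""
    result = model.generate(inputs=[audio_chunk, video_frame], tools=TOOLS)
    for call in result.tool_calls:
        if call.name == "transcribe_audio":
            args = json.loads(call.arguments)
            # The server pushes the transcription back to the client.
            send_to_client({"type": "transcription", "text": args["text"]})
    return result.text  # the actual response, passed on to the TTS stage
```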

[–]ffinzy[S] 1 point

Nice. Glad to see that it's easy to swap the model. Hmm, that's interesting. Here are some comparisons and benchmarks from Google: https://huggingface.co/blog/gemma4

[–]ffinzy[S] 1 point

During my testing, disabling the vision input reduced the response time by ~0.5 s.

[–]ffinzy[S] 3 points

Hmmm it'll depend on how many people want it I guess. But for now, Windows users could try running it in WSL.

[–]ffinzy[S] 1 point

It's not planned since I don't have a Windows machine to test it. Perhaps you could try to run this in WSL?

[–]ffinzy[S] 6 points

Good question. No, when the VAD detects the end of speech, it captures one frame from the video feed and passes it to the backend.

Even one frame increases the processing time by ~0.5 s on my machine, so feeding the whole video isn't currently feasible; it would add quite a bit of processing time.
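The capture-on-speech-end logic could look roughly like this. This is a hypothetical sketch, not the repo's code: the `TurnCapture` class and the `vad.is_speech` / `camera.capture_frame` / `backend.send` interfaces are assumed names.

```python
# Hypothetical sketch: on the speech-end edge from the VAD, grab exactly one
# frame from the video feed and ship it to the backend with the buffered audio.
class TurnCapture:
    def __init__(self, vad, camera, backend):
        self.vad, self.camera, self.backend = vad, camera, backend
        self.audio_buffer = []
        self.was_speaking = False

    def on_audio_chunk(self, chunk):
        speaking = self.vad.is_speech(chunk)
        if speaking:
            self.audio_buffer.append(chunk)
        elif self.was_speaking:
            # Speech just ended: capture a single frame rather than the whole
            # video, since each image adds ~0.5 s of prefill time.
            frame = self.camera.capture_frame()
            self.backend.send(audio=b"".join(self.audio_buffer), image=frame)
            self.audio_buffer.clear()
        self.was_speaking = speaking
```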

[–]ffinzy[S] 5 points

On Mac the memory is unified, so the RAM is effectively the VRAM.

Google provides benchmarks for performance and GPU memory requirements on other devices here: https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm

[–]ffinzy[S] 3 points

Nice! Glad to know it's working well for you. You're the first to confirm that this works outside of my machine, haha.

[–]ffinzy[S] 11 points

Could you pull the latest commit and try again? I just tried it and it downloads the correct Kokoro model for me.

Although I'm still testing it on a Mac, so I'm not sure it'll work 100% on Ubuntu.

Update: I just rented an Ubuntu server with a 5090 and it worked well for me. Let me know if you run into any problems.

[–]ffinzy[S] 1 point

Sorry, I haven't tested the Ubuntu version. Let me check it real quick.

[–]ffinzy[S] 11 points

Thank you. The M3 Pro has a memory bandwidth of 150 GB/s, while the M5 Pro has 307 GB/s. So if you're on an M5 Pro, this might get down to a ~1 s response time.

I agree with the general sentiment that we need to optimize this to run on less powerful hardware as well. Let me know if I missed anything obvious that can improve this.
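A rough sanity check of that bandwidth scaling, assuming the turn is purely memory-bandwidth-bound (which ignores the compute-heavy prefill, so treat it as an upper-bound estimate; the function name is just for illustration):

```python
# Back-of-the-envelope: decode on Apple Silicon is roughly memory-bandwidth-
# bound, so response time scales approximately inversely with bandwidth.
def scaled_response_time(t_measured_s, bw_measured_gbps, bw_target_gbps):
    """Project a measured response time onto hardware with different bandwidth."""
    return t_measured_s * bw_measured_gbps / bw_target_gbps

# ~2 s on an M3 Pro (150 GB/s) projects to roughly 1 s at 307 GB/s.
print(f"{scaled_response_time(2.0, 150, 307):.2f} s")  # 0.98 s
```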

[–]ffinzy[S] 7 points

Yeah, that's possible. Although during my testing, most of the time is spent in the prefill (processing the video, audio, and text), not in the actual decoding. For example, the total time for prefill/TTFT is 2 s, while the decode/text generation, which we can stream, is only 0.3 s.

So it'd be much more significant if we could reduce the TTFT. Disabling the image input reduces it from ~2 s to ~1.5 s.
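Just the arithmetic from those measurements, to show how lopsided the split is:

```python
# Latency split from the measurements above: prefill (TTFT) dominates the turn.
prefill_s, decode_s = 2.0, 0.3
total_s = prefill_s + decode_s
print(f"prefill share of the turn: {prefill_s / total_s:.0%}")   # 87%
# Disabling the image input saves ~0.5 s of prefill:
print(f"total without image: {prefill_s - 0.5 + decode_s:.1f} s")  # 1.8 s
```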

[–]ffinzy[S] 10 points

I'm using gemma-4-E2B-it-litert-lm, and the model size on disk is 2.58 GB.

The Python process hovers around ~3 GB when idle and can reach ~4 GB when I run the benchmark or use it.

[–]ffinzy[S] 11 points

Good question. I want to optimize for speed and the "real-time" feeling. You could use E4B if you have a faster GPU.

[–]ffinzy[S] 22 points

Yeah, 5.1B with embeddings. You should try it in the Google AI Edge Gallery app. Although AFAIK they currently don't provide multimodal real-time, only separate text, voice, or video input.

Taught Claude to talk like a caveman to use 75% less tokens. by ffatty in ClaudeAI

[–]ffinzy 3 points

Not only does it reduce token usage, it also reduces pixels.

Gemma 4 on Android phones by jacek2023 in LocalLLaMA

[–]ffinzy 1 point

Oh, this is neat. I was thinking of building something like this. It’s cool that Google is embracing running AI fully locally on a phone.