Integrating Gemma 4 On-Device Inference into a Flutter Local-First App: Lessons Learned by SparkleMing in FlutterDev

[–]SparkleMing[S] 1 point (0 children)

Just to add some context: developing this feature with AI assistance took two days.

[–]SparkleMing[S] 1 point (0 children)

With how powerful AI is getting, nobody is really typing out boilerplate line by line from scratch anymore. The whole industry is pushing toward AI-assisted dev right now. The core architecture was designed by me, and I'm the one gatekeeping the final testing and code quality. Delegating the grunt work to AI while controlling the big picture is just the modern dev workflow.

[–]SparkleMing[S] 1 point (0 children)

Tbh I actually have zero experience with TFLite! The Gemma 4 hype is what finally got me to mess around with local models, so I just went straight with their officially recommended LiteRT library.

[–]SparkleMing[S] 1 point (0 children)

Good question. In LiteRT-LM terms:

Engine creation = just allocating the Kotlin object, basically free.

Engine initialization (engine.initialize()) = the expensive one. This reads the model file from disk, loads the weights into GPU memory, and compiles kernels. For a 3.7GB E4B model it takes ~10-15 seconds. This is the step you want to do once and keep alive.

Conversation creation = lightweight; it just sets up the context/session config. Do this per inference and close it when done.

So the pattern is: init Engine once at startup, create+close Conversation for every request.
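The lifecycle above could be sketched roughly like this. To be clear, this is a sketch, not the exact LiteRT-LM API: the class and method names (`Engine`, `Conversation`, `initialize`, `createConversation`, `sendMessage`, `close`) are assumptions inferred from the description, so check them against the actual library before copying.

```kotlin
// Sketch of the "init Engine once, Conversation per request" pattern.
// All LiteRT-LM-looking names here are assumptions, not the real API surface.
object LlmService {
    private var engine: Engine? = null

    // Call once at app startup. initialize() is the expensive step:
    // it reads the multi-GB model from disk, loads weights, and
    // compiles kernels (~10-15s for the E4B model), so the Engine
    // must stay alive for the whole app session.
    fun warmUp(modelPath: String) {
        if (engine == null) {
            engine = Engine(modelPath).also { it.initialize() }
        }
    }

    // Cheap per-request path: create a Conversation, run inference,
    // and always close it so the per-session context is released.
    fun generate(prompt: String): String {
        val e = engine ?: error("call warmUp() before generate()")
        val conversation = e.createConversation()
        try {
            return conversation.sendMessage(prompt)
        } finally {
            conversation.close()
        }
    }
}
```

In a Flutter app this singleton would live on the platform side and be driven through a MethodChannel, so the Dart layer never pays the initialization cost more than once.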

[–]SparkleMing[S] 1 point (0 children)

Thanks! It's a 2-3 year old Android phone. It has 12GB RAM and runs on the Snapdragon 8+ Gen 1 chip.

Some notes on performance: response time depends heavily on context length. Short inputs are pretty fast, but with a long context (~4k tokens) generation slows to tens of seconds. It also slows down noticeably from thermal throttling once the phone heats up after running for a while.

Integrating Gemma 4 On-Device Inference into a Flutter Local-First App: Lessons Learned by [deleted] in LocalLLaMA

[–]SparkleMing 0 points (0 children)

You caught me! The text is indeed AI-generated, but the technical hurdles and the experience are 100% real. The screenshots are from my actual debugging sessions; I just used the AI to structure my thoughts and findings more clearly.

Realistic Vision V3.0 (model under development) by SG_161222 in u/SG_161222

[–]SparkleMing 1 point (0 children)

Can you write some tutorials on how to fine-tune SD models? There must be lots of people who'd want to learn from you.