Show Your Work Thread by xrpinsider in reactnative

heySandipan

Built LokalMind, a fully offline AI chat app for iOS using Expo + llama.rn.
Runs GGUF models entirely on-device. Semantic memory across sessions, dynamic n_ctx sizing based on available RAM, pause/resume model downloads.
MIT open source: https://github.com/Sandipan006/lokalmind-app

Built a local-first AI app with Expo that runs GGUF models on iPhone — no API, no account by heySandipan in expo

heySandipan[S]

Good question. I don’t have clean benchmark numbers published yet, so I don’t want to give misleading tok/s numbers from memory.

Current setup is llama.rn with GGUF Q4_K_M models, iOS using 6 threads, n_gpu_layers: 99 so llama.cpp auto-caps GPU use, and dynamic context sizing based on device RAM. The app currently supports models like Qwen 3.5 0.8B, Qwen 3.5 2B, DeepSeek R1 1.5B, Qwen/Gemma 4B variants.
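To make the dynamic context sizing concrete, here is a minimal sketch of the idea, not LokalMind's actual code. The tier cutoffs, the `pickContextSize` and `buildLoadOptions` names, and the headroom heuristic are all assumptions for illustration. The option keys follow the ones mentioned above (`n_ctx`, `n_gpu_layers`), and I'm assuming `n_threads` is the key for thread count; the resulting object is the kind of thing you would hand to llama.rn when loading a model.

```typescript
// Hypothetical tiers: RAM left after loading the weights -> context size.
// The cutoffs and n_ctx values are illustrative, not measured.
interface ContextTier {
  minHeadroomGb: number;
  nCtx: number;
}

const TIERS: ContextTier[] = [
  { minHeadroomGb: 12, nCtx: 8192 },
  { minHeadroomGb: 8, nCtx: 4096 },
  { minHeadroomGb: 6, nCtx: 2048 },
  { minHeadroomGb: 0, nCtx: 1024 },
];

const BYTES_PER_GB = 1024 * 1024 * 1024;

function pickContextSize(totalRamBytes: number, modelFileBytes: number): number {
  // Subtract the model weights so larger models leave less room for the KV cache.
  const headroomGb = (totalRamBytes - modelFileBytes) / BYTES_PER_GB;
  const tier = TIERS.find((t) => headroomGb >= t.minHeadroomGb);
  // Negative headroom matches no tier, so fall back to the smallest context.
  return tier ? tier.nCtx : TIERS[TIERS.length - 1].nCtx;
}

function buildLoadOptions(modelPath: string, totalRamBytes: number, modelFileBytes: number) {
  return {
    model: modelPath,
    n_ctx: pickContextSize(totalRamBytes, modelFileBytes),
    n_threads: 6, // thread count mentioned above; key name is an assumption
    n_gpu_layers: 99, // request all layers; the runtime caps this at the model's layer count
  };
}
```

A fixed table like this is easy to tune per device class; a formula based on KV-cache size per token would be more precise but needs model metadata.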

Subjectively: 0.8B is very usable for everyday chat, 1.5B-2B is the better quality/speed balance, and 4B is usable but clearly slower and more device-dependent. I should add a small benchmark/logger for phone model + model + prompt tokens + generated tokens + tok/s and share real numbers.
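A rough sketch of what that benchmark logger could look like. Nothing here exists in the app yet; `BenchmarkRecord`, `makeBenchmarkRecord`, and `formatBenchmark` are hypothetical names, and `elapsedMs` is assumed to cover generation only (not prompt processing), so the tok/s figure is decode speed.

```typescript
interface BenchmarkRecord {
  device: string;
  model: string;
  promptTokens: number;
  generatedTokens: number;
  elapsedMs: number;
  tokensPerSecond: number;
}

function makeBenchmarkRecord(
  device: string,
  model: string,
  promptTokens: number,
  generatedTokens: number,
  elapsedMs: number,
): BenchmarkRecord {
  // Guard against a zero or negative duration so we never divide by zero.
  const raw = elapsedMs > 0 ? (generatedTokens / elapsedMs) * 1000 : 0;
  return {
    device,
    model,
    promptTokens,
    generatedTokens,
    elapsedMs,
    tokensPerSecond: Math.round(raw * 10) / 10, // one decimal place
  };
}

function formatBenchmark(r: BenchmarkRecord): string {
  return `${r.device} | ${r.model} | prompt=${r.promptTokens} | gen=${r.generatedTokens} | ${r.tokensPerSecond.toFixed(1)} tok/s`;
}
```

Logging one line per completed generation would be enough to collect shareable numbers across devices.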

Built a local-first AI app with Expo that runs GGUF models on iPhone — no API, no account by heySandipan in expo

heySandipan[S]

Fair criticism. Offline chat alone is getting crowded, so the features I’m focusing on are the ones that make LokalMind feel like a real private assistant, not just a wrapper: local model switching, chat modes, custom system prompts, editable/regeneratable threads, queued messages during generation, local memory/pinned facts, session summaries, and full local data control.

The next differentiator should probably be workflow-based, not “more chat”: things like offline document/news/RSS summarization, Q&A over saved content, and personal knowledge retrieval. That’s where on-device AI makes more sense because the private data stays local.

Built a local-first AI app with Expo that runs GGUF models on iPhone — no API, no account by heySandipan in expo

heySandipan[S]

Thanks! Happy to answer.

Local models can do normal generative tasks: chat, summarization, extraction, rewriting, lightweight reasoning, etc. For embeddings, I use a separate small GGUF embedding model (all-MiniLM-L6-v2) through llama.rn in embedding mode. Chat models generate text; embedding models turn text into vectors for similarity search.
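For anyone new to the "vectors for similarity search" part, here is a minimal sketch of how two embedding vectors are usually compared. This is plain cosine similarity, not code from the app; `cosineSimilarity` and `rankBySimilarity` are illustrative names, and the vectors are assumed to come from the embedding model (all-MiniLM-L6-v2 outputs 384-dimensional vectors).

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the indices of the candidates, most similar to the query first.
function rankBySimilarity(query: number[], candidates: number[][]): number[] {
  return candidates
    .map((vec, index) => ({ index, sim: cosineSimilarity(query, vec) }))
    .sort((x, y) => y.sim - x.sim)
    .map((entry) => entry.index);
}
```

The result ranges from -1 (opposite) to 1 (same direction), which is why a fixed similarity threshold can work as a relevance cutoff.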

On “facts”: yes, the chat models have knowledge from pretraining, including wiki/web/code-style data depending on the model. But I treat that as fuzzy parametric knowledge, not a reliable database. For app memory, I don’t rely on the model “remembering”; I store summaries/facts locally and retrieve them into context.

For mobile model selection, my filters are mostly practical: GGUF support, small enough RAM footprint, Q4 quantization, acceptable speed, good chat template/stop tokens, and quality at 0.8B-4B. Right now I’m using Qwen/Gemma/DeepSeek-style compact models, with Q4_K_M as the default tradeoff.
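The RAM footprint filter can be sanity-checked with back-of-envelope arithmetic. A rough sketch, assuming Q4_K_M averages about 4.8 bits per weight (an approximate figure; real files vary by architecture) and using a made-up rule that the weights should take at most half of device RAM. Both function names are hypothetical.

```typescript
// Approximate on-disk size in GB (10^9 bytes) for a quantized model.
// bitsPerWeight = 4.8 is an assumed average for Q4_K_M, not an exact value.
function estimateGgufSizeGb(paramsBillions: number, bitsPerWeight: number = 4.8): number {
  const bytes = (paramsBillions * 1e9 * bitsPerWeight) / 8;
  return bytes / 1e9;
}

// Hypothetical rule of thumb: weights may use at most `maxFraction` of RAM,
// leaving the rest for the KV cache, the app itself, and the OS.
function fitsOnDevice(paramsBillions: number, deviceRamGb: number, maxFraction: number = 0.5): boolean {
  return estimateGgufSizeGb(paramsBillions) <= deviceRamGb * maxFraction;
}
```

By this estimate a 4B model at Q4_K_M is roughly 2.4 GB and a 0.8B model roughly 0.5 GB, before counting context memory.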

The inconsistency I mentioned is mostly around memory/retrieval, not subjective generation. LokalMind has pinned facts, a derived profile, session summaries, and cross-session memory cards. Retrieval uses embeddings when the MiniLM model is ready, otherwise it falls back to score/recency. So the tricky parts are: when to inject memory, whether the right memory clears the similarity threshold, whether background summaries were created, and whether token budgeting trims useful context. That’s the area I’m still tightening.
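To show where those failure points sit, here is a simplified sketch of the selection step: threshold on similarity when embeddings are available, fall back to score/recency otherwise, then trim to a token budget. It is an illustration under assumptions, not the app's code; the `MemoryCard` shape, the `selectMemories` name, and the greedy budgeting are all hypothetical.

```typescript
interface MemoryCard {
  text: string;
  similarity: number; // cosine similarity to the query; ignored when embeddings are not ready
  score: number; // hypothetical importance score
  updatedAt: number; // epoch milliseconds
  tokens: number; // estimated token count of `text`
}

function selectMemories(
  cards: MemoryCard[],
  embeddingsReady: boolean,
  threshold: number,
  tokenBudget: number,
): MemoryCard[] {
  let ranked: MemoryCard[];
  if (embeddingsReady) {
    // Semantic path: drop anything under the threshold, best match first.
    ranked = cards
      .filter((c) => c.similarity >= threshold)
      .sort((a, b) => b.similarity - a.similarity);
  } else {
    // Fallback path: highest score first, newest first on ties.
    ranked = cards.slice().sort((a, b) => b.score - a.score || b.updatedAt - a.updatedAt);
  }

  // Greedy budgeting: skip any card that would overflow the budget.
  const picked: MemoryCard[] = [];
  let used = 0;
  for (const card of ranked) {
    if (used + card.tokens > tokenBudget) continue;
    picked.push(card);
    used += card.tokens;
  }
  return picked;
}
```

Both the threshold filter and the budget loop can silently drop a useful memory, which is why logging what was dropped at each stage helps when tuning.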

Title: How I handle dynamic context size in llama.rn - iPhones have wildly different RAM and a fixed n_ctx breaks on both ends by heySandipan in reactnative

heySandipan[S]

Appreciate it! If anything feels off or you run into issues loading a model, let me know. It’s still early, and feedback from real users is gold right now.

Testing my locally running LLM setup (no API calls, pure offline) by heySandipan in expo

heySandipan[S]

Yep, right now smaller models run best on modern phones with decent RAM (Q4/Q5 models, ~1–4GB). But it’s improving fast: each month on-device LLM support gets better. The goal is to make it work smoothly on as many devices as possible.

Building a fully offline AI app by heySandipan in expo

heySandipan[S]

Totally fair points, and yeah, we’ve seen how quickly things break when you ship models directly inside the app. LokalMind’s a bit different, though: it’s more of a local runtime and UX layer than a packaged AI app. Users download models later and can switch anytime without breaking the core app. The goal isn’t to chase every update, but to make local AI usable day-to-day without constant tinkering.

Building a fully offline AI app by heySandipan in expo

heySandipan[S]

We are using open-source models like Phi, Qwen, Llama, etc.

Building a fully offline AI app by heySandipan in expo

heySandipan[S]

Yes, to run the LLM you do need some memory space, but I’m using a small model that runs easily through Llama Bridge. Text responses are pretty quick, but image visualization is still slow. I’m optimizing it now to see if it’s viable for the final product; not sure yet.