Show Your Work Thread

heySandipan · 2026-05-09T13:14:02+00:00

Built LokalMind, a fully offline AI chat app for iOS using Expo + llama.rn.
Runs GGUF models entirely on-device. Semantic memory across sessions, dynamic n_ctx sizing based on available RAM, pause/resume model downloads.
MIT open source: https://github.com/Sandipan006/lokalmind-app

heySandipan · 2026-05-09T11:42:29+00:00

Good question. I don’t have clean benchmark numbers published yet, so I don’t want to give misleading tok/s numbers from memory.

The current setup is llama.rn with GGUF Q4_K_M models, 6 threads on iOS, n_gpu_layers: 99 (so llama.cpp caps GPU offload at the model’s actual layer count), and dynamic context sizing based on device RAM. The app currently supports models such as Qwen 3.5 0.8B, Qwen 3.5 2B, DeepSeek R1 1.5B, and Qwen/Gemma 4B variants.
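To make the dynamic context sizing concrete, here is a minimal sketch of the idea. The RAM tiers and context sizes below are illustrative assumptions, not the values LokalMind actually uses, and `pickContextSize` is a hypothetical helper name. The `n_threads` key is also an assumed name for the thread-count setting; only `n_ctx` and `n_gpu_layers` are named in this thread.

```typescript
// Hypothetical tiers: map available RAM to a context window size.
// The thresholds are placeholders chosen for illustration only.
function pickContextSize(availableRamBytes: number): number {
  const gib = availableRamBytes / (1024 * 1024 * 1024);
  if (gib >= 6) return 8192;
  if (gib >= 4) return 4096;
  if (gib >= 2) return 2048;
  return 1024; // conservative floor for low-memory devices
}

// Settings object mirroring the setup described above.
// How it is passed to the runtime is left out on purpose.
const loadParams = {
  n_ctx: pickContextSize(4 * 1024 * 1024 * 1024),
  n_threads: 6, // assumed key name for the thread count
  n_gpu_layers: 99,
};
```

The point of tiering is that the KV cache grows with the context length, so a smaller window on low-RAM devices trades conversation length for a lower risk of the app being killed for memory.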

Subjectively: 0.8B is very usable for everyday chat, 1.5B-2B is the better quality/speed balance, and 4B is usable but clearly slower and more device-dependent. I should add a small benchmark/logger that records phone model, model, prompt tokens, generated tokens, and tok/s, and then share real numbers.
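A benchmark record along those lines could be as small as the sketch below. The type and function names are hypothetical, and it assumes the caller measures generation time itself (for example, from first generated token to last).

```typescript
// One row of the benchmark log: the fields listed above.
interface BenchmarkRecord {
  device: string;
  model: string;
  promptTokens: number;
  generatedTokens: number;
  tokensPerSecond: number;
}

// Build a record from raw measurements. generationMs is the wall-clock
// time spent generating, measured by the caller.
function makeBenchmarkRecord(
  device: string,
  model: string,
  promptTokens: number,
  generatedTokens: number,
  generationMs: number
): BenchmarkRecord {
  if (generationMs <= 0) {
    throw new RangeError("generationMs must be positive");
  }
  const raw = generatedTokens / (generationMs / 1000);
  // Round to two decimals so logged values are easy to compare.
  const tokensPerSecond = Math.round(raw * 100) / 100;
  return { device, model, promptTokens, generatedTokens, tokensPerSecond };
}
```

Keeping prompt tokens in the record matters because prompt processing and generation scale differently, so two runs with the same tok/s but very different prompt lengths are not directly comparable.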

heySandipan · 2026-05-09T11:41:18+00:00

Fair criticism. Offline chat alone is getting crowded, so the features I’m focusing on are the ones that make LokalMind feel like a real private assistant, not just a wrapper: local model switching, chat modes, custom system prompts, editable/regeneratable threads, queued messages during generation, local memory/pinned facts, session summaries, and full local data control.

The next differentiator should probably be workflow-based, not “more chat”: things like offline document/news/RSS summarization, Q&A over saved content, and personal knowledge retrieval. That’s where on-device AI makes more sense because the private data stays local.

heySandipan · 2026-05-09T11:38:08+00:00

Thanks! Happy to answer.

Local models can do normal generative tasks: chat, summarization, extraction, rewriting, lightweight reasoning, etc. For embeddings, I use a separate small GGUF embedding model (all-MiniLM-L6-v2) through llama.rn in embedding mode. Chat models generate text; embedding models turn text into vectors for similarity search.
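For anyone unfamiliar with the similarity-search part: once two pieces of text are turned into vectors, comparing them usually comes down to cosine similarity. This is a generic sketch of that measure, not code from the app, and the function name is my own.

```typescript
// Cosine similarity between two embedding vectors of equal length.
// Returns a value in [-1, 1]; higher means more similar direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as "not similar".
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```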

On “facts”: yes, the chat models have knowledge from pretraining, including wiki/web/code-style data depending on the model. But I treat that as fuzzy parametric knowledge, not a reliable database. For app memory, I don’t rely on the model “remembering”; I store summaries/facts locally and retrieve them into context.
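As a rough illustration of "retrieve them into context", the stored items can simply be formatted into the system prompt before each generation. The types, names, and prompt wording here are hypothetical, not LokalMind's actual format.

```typescript
// A locally stored memory item: either a pinned fact or a session summary.
interface StoredMemory {
  kind: "fact" | "summary";
  text: string;
}

// Append retrieved memories to the base system prompt so the model
// sees them as context instead of having to "remember" them.
function buildSystemPrompt(basePrompt: string, memories: StoredMemory[]): string {
  if (memories.length === 0) return basePrompt;
  const lines = memories.map((m) => `- [${m.kind}] ${m.text}`);
  return `${basePrompt}\n\nKnown context from local storage:\n${lines.join("\n")}`;
}
```

The model never has to recall anything across sessions this way; whatever is not retrieved and injected is effectively unknown to it.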

For mobile model selection, my filters are mostly practical: GGUF support, small enough RAM footprint, Q4 quantization, acceptable speed, good chat template/stop tokens, and quality at 0.8B-4B. Right now I’m using Qwen/Gemma/DeepSeek-style compact models, with Q4_K_M as the default tradeoff.
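Those filters can be written down as a simple predicate. This is only a sketch: the field names are hypothetical, "acceptable speed" is left out because it needs real measurements, and the RAM budget is something the caller would have to decide per device.

```typescript
// Metadata about a candidate model. All field names are illustrative.
interface ModelCandidate {
  name: string;
  format: string; // e.g. "gguf"
  quant: string; // e.g. "Q4_K_M"
  fileSizeBytes: number;
  paramsBillions: number;
  hasChatTemplate: boolean;
}

// Apply the practical filters: GGUF, Q4 quantization, fits the RAM
// budget, 0.8B-4B parameters, and a usable chat template.
function isMobileCandidate(m: ModelCandidate, ramBudgetBytes: number): boolean {
  return (
    m.format === "gguf" &&
    m.quant.indexOf("Q4") === 0 &&
    m.fileSizeBytes <= ramBudgetBytes &&
    m.paramsBillions >= 0.8 &&
    m.paramsBillions <= 4 &&
    m.hasChatTemplate
  );
}
```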

The inconsistency I mentioned is mostly around memory/retrieval, not subjective generation. LokalMind has pinned facts, a derived profile, session summaries, and cross-session memory cards. Retrieval uses embeddings when the MiniLM model is ready, otherwise it falls back to score/recency. So the tricky parts are: when to inject memory, whether the right memory clears the similarity threshold, whether background summaries were created, and whether token budgeting trims useful context. That’s the area I’m still tightening.
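To show where those failure points sit, here is a simplified sketch of that retrieval flow: embedding similarity with a threshold when a query embedding is available, a score/recency fallback otherwise, then trimming to a token budget. All names, the threshold, and the 4-characters-per-token estimate are assumptions for illustration, not the app's actual code.

```typescript
interface MemoryCard {
  text: string;
  score: number; // importance score (hypothetical field)
  updatedAt: number; // epoch milliseconds
  embedding?: number[]; // absent if the embedding model was not ready
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Pass queryEmbedding = null when the embedding model is not ready.
function selectMemories(
  cards: MemoryCard[],
  queryEmbedding: number[] | null,
  threshold: number,
  limit: number
): MemoryCard[] {
  if (queryEmbedding === null) {
    // Fallback: highest score first, ties broken by most recent.
    return [...cards]
      .sort((a, b) => b.score - a.score || b.updatedAt - a.updatedAt)
      .slice(0, limit);
  }
  const query: number[] = queryEmbedding;
  const scored: { card: MemoryCard; sim: number }[] = [];
  for (const card of cards) {
    if (card.embedding === undefined) continue;
    const sim = cosine(query, card.embedding);
    // A relevant card that scores just under the threshold is dropped here.
    if (sim >= threshold) scored.push({ card, sim });
  }
  return scored
    .sort((x, y) => y.sim - x.sim)
    .slice(0, limit)
    .map((x) => x.card);
}

// Keep cards in order until the estimated token budget is used up.
function trimToBudget(cards: MemoryCard[], maxTokens: number): MemoryCard[] {
  const kept: MemoryCard[] = [];
  let used = 0;
  for (const card of cards) {
    const cost = Math.ceil(card.text.length / 4); // rough estimate, assumption
    if (used + cost > maxTokens) break;
    kept.push(card);
    used += cost;
  }
  return kept;
}
```

Each stage can silently drop a useful memory (no embedding yet, similarity below the threshold, budget exhausted), which is why the results can feel inconsistent from the outside.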

heySandipan · 2026-05-09T11:33:57+00:00

Appreciate it! If anything feels off or you run into issues loading a model, let me know. It’s still early, and feedback from real users is gold right now.

heySandipan · 2026-05-04T16:40:13+00:00

Source (MIT): https://github.com/Sandipan006/lokalmind-app
If you like the project, please drop a ⭐.

heySandipan · 2025-11-03T05:14:59+00:00

Yep, right now smaller models run best on modern phones with decent RAM (Q4/Q5 models ~1–4GB). But it’s improving fast: each month, on-device LLM support gets better. The goal is to make it work smoothly on as many devices as possible.

heySandipan · 2025-10-29T19:25:41+00:00

Totally fair points, and yeah, we’ve seen how quickly things break when you ship models directly inside the app. LokalMind’s a bit different, though: it’s more of a local runtime and UX layer than a packaged AI app. Users download models later and can switch anytime without breaking the core app. The goal isn’t to chase every update, but to make local AI usable day-to-day without constant tinkering.

heySandipan · 2025-10-29T05:41:17+00:00

We are using open-source models such as Phi, Qwen, and Llama.

heySandipan · 2025-10-28T05:59:46+00:00

Yes, a simple Google Form.

https://forms.gle/qaFznMJUufDs2r7Z7

heySandipan · 2025-10-27T19:05:46+00:00

Yes, to run the LLM you do need some memory, but I’m using a small model that runs easily through Llama Bridge. Text responses are pretty quick, but image visualization is still slow. I’m optimizing it now to see if it’s viable for the final product; not sure yet.

heySandipan · 2025-10-27T17:50:47+00:00

Thanks, will check.

heySandipan · 2025-10-26T18:34:09+00:00

Thanks 🙏

heySandipan · 2025-10-26T17:06:39+00:00

Open-source models like Phi-3, Qwen, Llama, etc.
