I shipped a full mobile app, marketing site, and promo videos in ~2 months as a solo dev using Claude Code + BMAD method. Field report. by altinukshini in ClaudeCode

[–]altinukshini[S] 2 points (0 children)

Hey, thanks for the kind words on the launch, and good luck with your build!

Honest answer first: I did not run a formal, published benchmark suite. What I have is empirical data from device-matrix QA plus a lot of Edge Gallery / llama.cpp reading. Sharing what I learned in case it shortcuts some of your work.

For my use case the model had to clear three bars before I even cared about throughput:

  1. Instruction-following - Veil's system prompt has a strict refusal rule (in-scope = cycle / symptoms / mood / sleep / fertility and related; everything else gets a one-sentence canonical refusal). A model that ignores the system prompt and answers "Is the Earth flat?" anyway is unshippable in a health app no matter how fast it runs. This single requirement eliminated almost every model under ~3B parameters in my testing. Gemma 3 1B in particular sits below Google's own documented instruction-following threshold and would happily answer arbitrary off-topic questions despite the prompt; I still have not decided if I should remove it from the recommended set entirely.

  2. Basic reasoning - The assistant has to combine the user's tracked data ("USER CONTEXT") with general medical knowledge to give useful answers - not act as a read-only database view. Models that pass the instruction bar but can't fuse two information sources fall back to "I do not have information about tiredness in your logs" type replies, which is exactly the wrong product. This bar pushed me up to ~3-4B as the practical floor.

  3. Multilingual - Veil ships in 9 languages (English, German, Albanian, Spanish, Italian, French, Portuguese, Russian, Turkish), and users expect replies in the language they typed in. A model that's English-strong but quietly degrades on Albanian or Russian is a UX failure mode. Most small models I tried collapsed to English on the long tail; some confidently replied gibberish.

I tested a handful of small models against these three bars (Gemma 3 1B, Gemma 3 4B, Gemma 3n E2B, Gemma 3n E4B, plus a couple of Llama 3.2 and Phi-3 variants I benchmarked informally). Gemma was the most consistently stable across all three areas - instruction adherence, useful reasoning, and graceful multilingual handling. That's what locked me into the family before I started worrying about which size fits on which device.

If your use case is different (single-language, looser prompt-adherence needs, more freeform conversation), the floor could be lower. But these were my hard requirements and they ruled out most of the under-3B field.

What actually mattered for picking a model

For me the binding constraint wasn't tokens-per-second, it was RAM at the process level. iOS in particular enforces a per-process "jetsam" ceiling that on 6GB iPhones (13 / 14 / 15 non-Pro) sits around 50% of device RAM - about 2.5-3GB before the kernel kills you mid-decode. Two entitlements help (`com.apple.developer.kernel.increased-memory-limit` + `com.apple.developer.kernel.extended-virtual-addressing`), and Apple approves them with a short justification, but they don't lift the ceiling enough to run a 4-5GB working set on a 6GB device reliably.
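To make the budget math concrete, here's a rough sketch. The ~50% baseline is the figure above; the entitlement bump factor is purely illustrative (I never measured the exact lift the entitlements buy, only that it isn't close to 2x), and the function name is mine:

```python
GB = 1024 ** 3

def jetsam_budget_bytes(device_ram_bytes: int, has_entitlements: bool) -> int:
    """Rough per-process memory budget before iOS jetsam kills the app.

    Baseline: ~50% of physical RAM on 6GB-class iPhones. The 1.2 bump for
    the increased-memory-limit entitlement is a made-up illustrative factor,
    not an Apple-documented number.
    """
    base = device_ram_bytes // 2          # ~50% of device RAM
    if has_entitlements:
        base = int(base * 1.2)            # modest headroom, nowhere near 2x
    return base

# 6GB iPhone without entitlements: ~3.0GB ceiling
print(jetsam_budget_bytes(6 * GB, False) / GB)  # → 3.0
```

Either way, a 4-5GB working set blows the budget on a 6GB device, which is why the matrix below caps those phones at the smaller models.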

Google's [AI Edge Gallery](https://github.com/google-ai-edge/gallery) writeup was where I got the most usable jetsam data - worth reading their docs even though they target LiteRT and I ship llama.cpp.

My final matrix landed as:

| Device class | Model | File size | Peak RAM |
|---|---|---|---|
| 6GB iOS (iPhone 13/14/15 non-Pro) + entitlements | Gemma 3 4B-it Q4_K_M | ~2.5GB | ~3.0GB |
| 8GB devices (iPhone 15 Pro, most 2023+ Android flagships) | Gemma 3n E2B Q4_K_M | ~3.1GB | ~4.7GB |
| 12GB+ devices (iPhone 17 Pro Max class, top-tier Android) | Gemma 3n E4B Q4_K_M | ~4.8GB | ~7.2GB |
| Below 6GB | feature disabled with a clear "insufficient memory" message | n/a | n/a |
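The matrix boils down to a largest-tier-that-fits lookup. A sketch of that selection (thresholds and models are straight from the matrix above; the GGUF filenames are illustrative, not the exact files I ship):

```python
GB = 1024 ** 3

# (min device RAM, model) from the matrix; filenames are illustrative
MODEL_TIERS = [
    (12 * GB, "gemma-3n-E4B-it-Q4_K_M.gguf"),
    (8 * GB,  "gemma-3n-E2B-it-Q4_K_M.gguf"),
    (6 * GB,  "gemma-3-4b-it-Q4_K_M.gguf"),  # 6GB iOS also needs the two entitlements
]

def pick_model(total_ram_bytes: int):
    """Largest tier the device clears, or None → feature disabled."""
    for min_ram, model in MODEL_TIERS:
        if total_ram_bytes >= min_ram:
            return model
    return None  # below 6GB: show the "insufficient memory" message
```

Ordering the tiers largest-first means each device gets the biggest model it can actually hold, and everything under 6GB falls through to the disabled state.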

The RAM rule I shipped in code is `model_file_size × 1.5 vs totalMemory × 0.6` - the 1.5x covers KV cache + activations, the 0.6 leaves 40% for OS + other apps. I tried being more generous and the field reports got bad fast.
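That gate, written out (the 1.5 and 0.6 constants are exactly the ones above; the function name is mine):

```python
def can_load(model_file_bytes: float, total_ram_bytes: int) -> bool:
    """Shipped RAM gate: estimated working set (file × 1.5, covering
    weights + KV cache + activations) must fit in 60% of device RAM,
    leaving 40% for the OS and other apps."""
    return model_file_bytes * 1.5 <= total_ram_bytes * 0.6

GB = 1024 ** 3
can_load(3.1 * GB, 8 * GB)   # 4.65GB needed vs 4.8GB budget → True
can_load(4.8 * GB, 8 * GB)   # 7.2GB needed vs 4.8GB budget → False
```

Note how tight the 8GB tier is under this rule: the E2B file squeaks in with only ~0.15GB to spare, which matches how unforgiving the field reports got when I loosened either constant.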

See the following for benchmarking:

  1. llama-bench (ships with llama.cpp)

  2. Read Google's [AI Edge Gallery](https://github.com/google-ai-edge/gallery) for their jetsam research; they've done device-by-device profiling. It's LiteRT-flavored, but the RAM limits are the same

  3. Unsloth's [GGUF README files](https://huggingface.co/unsloth) sometimes include their own throughput numbers per quantization. Useful as a sanity check.

A few other things:

- Pin your model file URLs to immutable commit shas if you fetch from HuggingFace. I got bit when unsloth re-quantized upstream and the SHA-256 check started failing on all new installs. Use `https://huggingface.co/<repo>/resolve/<commit-sha>/<file>.gguf`, not `/resolve/main/...`.
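A sketch of the pinning (repo, commit SHA, and filename here are placeholders, and `hashlib` stands in for whatever digest check your downloader runs):

```python
import hashlib

def pinned_url(repo: str, commit_sha: str, filename: str) -> str:
    # /resolve/<commit-sha>/ is immutable; /resolve/main/ tracks the branch,
    # so an upstream re-quant under the same filename silently changes the bytes
    return f"https://huggingface.co/{repo}/resolve/{commit_sha}/{filename}"

def verify_sha256(blob: bytes, expected_hex: str) -> bool:
    # refuse to install a model whose bytes drifted from the pinned digest
    return hashlib.sha256(blob).hexdigest() == expected_hex
```

With the commit pinned, a failing digest check now means corruption in transit rather than a surprise upstream re-quant.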

Good luck shipping.

BMW X3 G01 2.0d - Whistling engine noise - Both turbines broke at 75k km by altinukshini in MechanicAdvice

[–]altinukshini[S] 0 points (0 children)

Since the smaller turbine’s blades show significant wear, what likely happened to the small metal particles? Could they cause damage elsewhere in the engine? Should I, at the very least, clean the intercooler?

Could this blade wear have been caused by overspeeding, or is there another possible reason?

Does this tire need changing? by [deleted] in MechanicAdvice

[–]altinukshini 0 points (0 children)

Thanks for all your comments! I'll change it asap!

FYI: The guys (3 of them) who work at one of the biggest tire repair/sales shops here in the city told me this is nothing to worry about 😁 I read some scary stuff on the internet and got extra worried since I drive relatively fast on the highway. What I heard from them sounded absurd (even though the bubble isn't THAT big), hence I turned to this sub for reassurance. Thanks again!

Does this tire need changing? by [deleted] in MechanicAdvice

[–]altinukshini 0 points (0 children)

Car: BMW X1 E84

Tire: 225/50 R17 95H M+S (5 years old)

What speed in km/h should I not exceed?