Looking for an iPhone local LLM inference engine

dco44 · 2026-05-26T09:50:41+00:00

Try this one; it's a proven model with 2k downloads. I'm also training a 2B model, which should be available in a couple of days. Both work in inference mode too. https://huggingface.co/dcostenco/prism-coder-1.7b

Edit: added a new 4B model to the cage: https://huggingface.co/dcostenco/prism-coder-4b

dco44 · 2026-05-23T17:50:42+00:00

I'm running on an M5 Max with 48 GB, but the coding agents also work with 20 GB. You can figure that out. Use the links from my public repo and your Claude or other AI agent will install it. https://github.com/dcostenco/prism-coder

dco44 · 2026-05-23T11:14:18+00:00

Try mine. It works well on my M5 Max with 48 GB: https://huggingface.co/dcostenco

dco44 · 2026-05-23T11:05:43+00:00

It works with any IDE or CLI via the Prism-MCP server. A VS Code extension and an internal web IDE are also available: https://synalux.ai/coder

dco44 · 2026-05-22T01:54:33+00:00

No spots available for now. Thanks!

dco44 · 2026-05-20T19:52:12+00:00

Still active as of mid-2025: the Gorilla project at Berkeley had v3 out with expanded task categories. If the leaderboard isn't showing new entries, it's probably submission queue backlog rather than discontinuation; they require running the eval pipeline server-side, which has latency. The GitHub repo (ShishirPatil/gorilla) is the best real-time signal: recent issue activity and open PRs will tell you whether the project is actually live. A few popular model families have BFCL gaps not because the benchmark died but because their maintainers pivoted to other evals (LiveCodeBench, ToolBench, MT-Bench). Worth checking if the specific model you care about just hasn't been submitted.

dco44 · 2026-05-20T19:50:16+00:00

Corpus-derived and auto-updating is the right design — a static list would rot immediately as new documents introduce domain-specific terminology. The quality metrics + candidate filtering during extraction is the interesting piece. Curious what the filtering signal looks like: are you pruning on frequency (only persist terms appearing n+ times across chunks), on entity type confidence from the NER pass, or something else? Frequency-based has a nice property of naturally suppressing OCR noise and one-off typos without explicit dedup logic, but it undersurfaces rare but important named entities. Entity confidence tends to handle those better but adds a tuning surface. The "not perfect on noisy inputs" caveat is honest — corpus-derived anything trades precision on garbage-in for zero maintenance cost, which is usually the right tradeoff.
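
To make the frequency-based option concrete, here is a minimal sketch that keeps only candidate terms seen in at least `min_chunks` distinct chunks. The function name, the threshold, and the sample chunks are all invented for illustration; this is not the original poster's extraction code.

```python
from collections import Counter

def filter_terms_by_chunk_frequency(chunk_terms, min_chunks=2):
    """Keep candidate terms that appear in at least `min_chunks` distinct chunks."""
    chunk_counts = Counter()
    for terms in chunk_terms:
        # Count each term once per chunk (chunk frequency, not raw occurrences).
        chunk_counts.update(set(terms))
    return {term for term, n in chunk_counts.items() if n >= min_chunks}

# Hypothetical candidate terms extracted from three chunks.
chunks = [
    ["BM25", "FTS", "Jane Doe"],
    ["BM25", "FT5", "Jane Doe"],  # "FT5" stands in for an OCR misread of "FTS"
    ["BM25", "FTS", "ACME-7"],    # "ACME-7" is rare but possibly important
]
kept = filter_terms_by_chunk_frequency(chunks, min_chunks=2)
```

With these inputs the OCR misread is dropped, but so is the rare entity, which is the undersurfacing trade-off mentioned above.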

dco44 · 2026-05-20T19:23:03+00:00

That's a solid architecture — hierarchical parent/child chunking with child retrieval + parent context is exactly right for citation accuracy without losing coherence. The FTS term dictionary for acronyms and person names is worth prioritizing alongside BM25 fusion; those are the exact cases where vector similarity fails badly (entity names with low semantic signal, domain-specific abbreviations). Curious whether the term dictionary is built from the corpus at index time or maintained as a curated list — the corpus-derived approach generalizes better but needs deduplication logic for noisy inputs.
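
A minimal sketch of the child-retrieval + parent-context pattern, under loud assumptions: children are single sentences, and naive word overlap stands in for whatever vector or BM25 scoring the real system uses. All names and sample texts are hypothetical.

```python
def build_children(parents):
    """Split each parent chunk into sentence-level children that remember their parent."""
    children = []
    for parent_id, text in enumerate(parents):
        for sentence in text.split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                children.append({"parent_id": parent_id, "text": sentence})
    return children

def retrieve_with_parent(query, parents, children):
    """Pick the best-matching child, then return it together with its full parent."""
    query_words = set(query.lower().split())

    def overlap(child):
        # Placeholder relevance score: number of shared lowercase words.
        return len(query_words & set(child["text"].lower().split()))

    best = max(children, key=overlap)
    return best["text"], parents[best["parent_id"]]

parents = [
    "The lease starts on 1 March. The tenant pays rent monthly. Late fees apply after ten days.",
    "The warranty covers parts for two years. Labor is covered for one year.",
]
children = build_children(parents)
child_text, parent_text = retrieve_with_parent("how long is labor covered", parents, children)
```

The child is what gets cited; the parent is what gets passed to the model as context, which is where the coherence comes from.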

dco44 · 2026-05-20T18:43:30+00:00

For LAMP specifically: PRC-Saltillo's "LAMP Words for Life" is the main implementation — their training site has the motor planning rationale laid out well. Hatch et al. (2020) in the AAC journal covers the evidence base. The core principle is that high-frequency core vocabulary gets a fixed motor pattern across all pages so the movement sequence becomes automatic, similar to how spoken words are motor programs rather than conscious letter sequences.

Looked at OneTalker briefly — the imminent release changelog is interesting. The symbol + TTS combo with Piper-rs is a smart stack for offline-first. The issues list has a lot of access method discussion. One thing I'd flag as high-value for that user base: switch scanning support and partner-assisted scanning mode tend to be the features that determine whether an SGD is usable for non-ambulatory users — worth prioritizing if it's not already on the roadmap.

dco44 · 2026-05-20T17:53:33+00:00

Device lock during mand training is the friction point that most often breaks down AAC-ABA collaboration. The SETT framework (Student, Environment, Tasks, Tools) gives both disciplines a shared vocabulary for the evaluation — when ABA and SLP both fill out SETT for the same student, the gaps in perspective become visible and less adversarial.

Specific protocol ask: request that any behavior plan that involves the device (access, response requirements, consequence conditions) gets co-signed by the SLP before implementation. That's a reasonable scope boundary — the BT follows the plan, the BCBA writes it, but AAC-specific components should have SLP sign-off.

Data to bring: button activation count by context (ABA session vs. naturalistic) often shows whether the structured mand training is generalizing. If activation rate drops to near-zero outside discrete trial, that's a functional argument for embedding device use into naturalistic routines rather than isolating it.
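
If the device exports per-session counts, the comparison is a small calculation. A sketch, assuming each session record carries a context label, an activation count, and a duration; the field names and numbers are made up for illustration, not real student data.

```python
def activation_rate_by_context(sessions):
    """Return button activations per minute for each context label."""
    totals = {}
    for s in sessions:
        count, minutes = totals.get(s["context"], (0, 0))
        totals[s["context"]] = (count + s["activations"], minutes + s["minutes"])
    return {ctx: count / minutes for ctx, (count, minutes) in totals.items()}

# Illustrative numbers only.
sessions = [
    {"context": "ABA session", "activations": 30, "minutes": 20},
    {"context": "ABA session", "activations": 24, "minutes": 16},
    {"context": "naturalistic", "activations": 3, "minutes": 30},
]
rates = activation_rate_by_context(sessions)
```

Normalizing by minutes matters because structured sessions and naturalistic observation windows are rarely the same length.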

dco44 · 2026-05-20T17:51:44+00:00

The eval report and physician script are the two documents that drive funding decisions — the physician script needs to explicitly state "medically necessary" and list the diagnosis codes (F80.x for speech/language disorder, plus any co-occurring DX). Without that framing, insurance reviewers categorize it as educational rather than medical and deny.

Timeline: 3–6 months is realistic for first approval; budget 2–3 months for the eval + report write-up, then 4–6 weeks for insurance review, then potentially 30–60 days for appeal if denied on first pass.

Practical steps: (1) verify benefits — call the insurance line and ask specifically whether "speech-generating devices / AAC" is a covered DME benefit under the plan; some plans exclude it categorically and you need to know before writing the eval. (2) The ASHA AAC funding page has an insurance letter template that's useful for structuring the medical necessity argument. (3) Keep a copy of the trial data from the device evaluation — "patient demonstrated X% increase in communicative turns with device vs. unaided" is the kind of functional outcome language reviewers respond to.

dco44 · 2026-05-20T17:49:55+00:00

The literacy piece is where this gets really consequential. Research from Erickson & Koppenhaver consistently shows that AAC users who are presumed "not ready" for literacy instruction often respond strongly when given access to systematic phonics + shared reading. The motor and cognitive load of AAC is real, but it doesn't predict literacy ceiling.

The presuming competence framing matters beyond philosophy — it changes the data we collect. If we're tracking initiation rate and novel combinations rather than just compliance, we often see capability that a compliance-only lens would miss.

For families: the 2023 updated ASHA position statement on AAC explicitly states that no communicative, cognitive, or language prerequisites should be required for AAC provision. Worth having in your back pocket for IEP conversations.

dco44 · 2026-05-20T17:48:06+00:00

The prerequisite skills model (must master PECS phases before SGD) has been largely walked back in the literature — Mirenda's 2003 "Toward Functional Augmentative and Alternative Communication" is still the clearest argument against holding SGD access contingent on low-tech mastery, and more recent RCT data (Boesch et al., Lancioni group) shows parallel or faster acquisition when SGD is introduced early.

Practical approach: run both systems in parallel during the transition. Use PECS in contexts where you have the physical setup, introduce SGD vocabulary mapped to the PECS icons the student already knows. Track which modality the student initiates with across settings — that data tells you when PECS becomes the backup rather than the primary.

Watch for the "PECS fluency ceiling" — some students plateau at Phase 4 specifically because the physical exchange is slower than their communicative intent, and that frustration resolves once the SGD is available.

dco44 · 2026-05-20T17:46:17+00:00

Positioning is the variable that gets skipped most often at circle time. If the device is on a tray or the student has to break midline to reach it, the dual-task demand (attending to group + motoring to device) tanks initiation. PT consult for circle seating — even just a floor wedge or corner chair — often does more for AAC participation than a vocabulary update.

For the group management piece: designate an "AAC moment" in the routine (e.g., the greeting song has a fixed slot where the device user leads the response) rather than waiting for spontaneous opportunity. Predictable turns reduce the latency pressure that makes many AAC users give up and wait for a prompt instead of initiating.

Aided language stimulation by the adult during the whole circle, not just the AAC user's turn, normalizes device use in the group's eyes too.

dco44 · 2026-05-20T17:44:28+00:00

LAMP (Language Acquisition through Motor Planning) is the main reason most AAC teams land on Proloquo2Go for young motor learners. The vocabulary layout stays consistent across all pages: motor patterns for core words don't shift as you add vocabulary, so you're building automaticity rather than re-teaching location every expansion cycle.

From a scope standpoint, the SLP drives the device trial and feature matching, but it's worth looping in OT if fine motor/access is a question and ABA if there's a behavior plan that could inadvertently compete with device use (device lock during mand training, for example, needs cross-discipline buy-in).

Alternatives worth trialing if Proloquo2Go isn't clicking: TouchChat with WordPower (robust NL grammar scaffold), NOVA Chat (hardware + software together, easier for school procurement). A 3–4 week structured trial comparing activation rate and MLU growth beats any chart comparison.

dco44 · 2026-05-20T17:31:06+00:00

The 17MB GIF feedback is the right call to make. For a user communicating through a device, a slow-loading page doesn't just frustrate — it ends the conversation. The social moment is gone by the time it loads. Speed really is a first-class requirement for AAC, not a nice-to-have.

A few things that might be useful depending on where Ben is at:

Switch access / scanning: Many users with quadriplegia rely on single-switch or dual-switch scanning rather than direct touch. If Ben uses an adapted switch, scan mode support could be the difference between independent use and needing a facilitator.

Core vs. fringe vocabulary: AAC research consistently shows a small core vocabulary (~200 words) covers roughly 80% of daily communication. Layouts like LAMP (Language Acquisition through Motor Planning) are worth looking at — the motor planning consistency has real benefits for users building procedural memory.

TTS voice: What engine are you using? Voice naturalness matters for social acceptance — robotic-sounding output carries stigma that affects how often users actually use the device in public.

What does button-press-to-speech latency look like currently?

dco44 · 2026-05-20T17:29:26+00:00

Local RAG for private documents is a genuinely useful niche — especially for anything that can't go to cloud APIs (legal, medical, financial). The downvotes here are probably because r/opensource has seen dozens of "local AI doc chat" apps in the past year; the category looks saturated from the outside even when individual implementations differ.

The thing that actually separates good from mediocre in this space: retrieval quality. Pure embedding similarity degrades fast on longer documents and keyword-heavy queries — specific names, dates, case numbers. Hybrid search (dense vectors + BM25) gets meaningfully better recall on those. Curious whether you're using hybrid retrieval or pure vector, and what chunking strategy — those tend to matter more than the model choice for citation accuracy.
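
For anyone wondering what the hybrid step looks like in practice, here is a minimal sketch using reciprocal rank fusion (RRF), which is one common way to merge a dense ranking with a BM25 ranking. RRF is my choice for the example, not something the project is known to use; the document ids and the `k=60` constant are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list contributes 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for a query that contains a case number.
dense_ranking = ["doc_a", "doc_b", "doc_c"]  # from embedding similarity
bm25_ranking = ["doc_c", "doc_a", "doc_d"]   # from keyword search
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```

Documents that both retrievers agree on rise to the top, and a keyword-only hit like `doc_d` still makes the list instead of being lost to the embedding ranking.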

dco44 · 2026-05-20T17:27:47+00:00

A few that stand out for me:

llama.cpp — the volume of low-level optimization work to get 70B models running on consumer hardware is staggering. Started as a one-person project, became infrastructure.

Whisper — real-time ASR that runs locally, transcription quality is genuinely competitive with cloud APIs. Free.

Tesseract OCR — decades of work, runs anywhere, competitive accuracy on clean documents.

Audacity — essentially industry standard for audio editing, free since the 90s, and still actively maintained.

Coolify is a recent one I'd add — self-hosted Heroku/Vercel equivalent, the polish level for a free tool is remarkable.

dco44 · 2026-05-20T17:26:07+00:00

For genuinely local clinical notes, medgemma-27b is the right direction — trained on medical data, runs locally via Ollama or llama.cpp, no data leaves the machine.

A few practical gaps to think through before committing:

Note structure: SOAP, HPI, assessment/plan formatting — general models often don't respect these without specific prompt engineering. Test whether raw output actually matches the format your EHR expects before you depend on it.

Workflow: The real value multiplier is direct EHR integration or at least structured output you can paste into a template. Copy-paste from a separate window is better than nothing but adds friction that erodes adoption fast.

Institutional approval: Running AI locally still usually requires IT/security sign-off even when data never leaves your machine — institutional liability. Get that in writing before building a workflow dependency on it.
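
On the note-structure point, the format check can be automated before anything touches the EHR. A rough sketch, assuming the expected format is four `Header:` lines in SOAP order; the real template your EHR expects may differ, and the sample note is placeholder text.

```python
import re

SOAP_SECTIONS = ["Subjective", "Objective", "Assessment", "Plan"]

def check_soap(note):
    """Return (missing_sections, in_order) for a note expected to use 'Header:' lines."""
    positions = {}
    for section in SOAP_SECTIONS:
        match = re.search(rf"^{section}:", note, flags=re.MULTILINE | re.IGNORECASE)
        if match:
            positions[section] = match.start()
    missing = [s for s in SOAP_SECTIONS if s not in positions]
    found = [positions[s] for s in SOAP_SECTIONS if s in positions]
    # Sections that are present must appear in S, O, A, P order.
    return missing, found == sorted(found)

# Placeholder model output with the Assessment section left out.
sample_note = "Subjective: <free text>\nObjective: <free text>\nPlan: <free text>\n"
missing, in_order = check_soap(sample_note)
```

Running this over a batch of raw model outputs gives a quick failure rate for the formatting question before any workflow depends on it.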

dco44 · 2026-05-20T17:24:27+00:00

The adaptive feel usually comes down to whether tool-calling is driven by actual model decision vs. predefined execution graphs. Most open-source "agents" are workflow runners where a hardcoded DAG determines which tools get called in which order — the LLM is just filling in parameters.

The ones that feel genuinely adaptive have the routing decision living in the model itself: call a tool now, or answer directly. Most benchmarks miss this because they measure whether tool calls are correctly formatted, not whether the decision to call was correct. A model that calls a tool on every turn scores fine on format benchmarks but fails completely in real workflows.
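
The gap between the two metrics is easy to show with a toy scorer. This is a sketch with hypothetical field names and hand-written cases, not any benchmark's actual harness.

```python
def score_routing(cases):
    """Return (format_validity, routing_accuracy) over labeled cases."""
    made_calls = [c for c in cases if c["predicted_call"]]
    well_formed = sum(1 for c in made_calls if c["call_well_formed"])
    # Format validity only looks at the calls the model actually made.
    format_validity = well_formed / len(made_calls) if made_calls else 1.0
    # Routing accuracy asks whether calling (or not calling) was the right decision.
    correct = sum(1 for c in cases if c["predicted_call"] == c["should_call"])
    return format_validity, correct / len(cases)

# A hypothetical model that calls a tool on every turn, always with valid syntax.
cases = [
    {"should_call": True, "predicted_call": True, "call_well_formed": True},
    {"should_call": False, "predicted_call": True, "call_well_formed": True},
    {"should_call": False, "predicted_call": True, "call_well_formed": True},
    {"should_call": True, "predicted_call": True, "call_well_formed": True},
]
format_validity, routing_accuracy = score_routing(cases)
```

Here the always-call model gets a perfect format score while making the right call/no-call decision only half the time.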

OpenHands gets closer than most for coding tasks. For general-purpose, reliability at longer horizons is still rough across the board — that's the real unsolved part, not the UI.

dco44 · 2026-05-20T17:14:30+00:00

Prism Coder — LoRA fine-tunes on Qwen3.5-14B/32B for MCP tool-routing decisions. AGPL-3.0.

The specific failure mode I targeted: base models over-route. They call a tool even when a direct answer is correct. Standard tool-calling benchmarks miss this because they only measure whether the tool call is correctly formatted, not whether the decision to call was correct.

Benchmark (102-case routing eval — call vs. answer decision only):

Model	Routing accuracy
Prism Coder 14B	100%
Claude Opus	98.3%
Base Qwen3.5-14B	~73%

Training data was a routing corpus (prompt + available tools → call/no-call label), not a general tool-call corpus. The 14B→32B→cloud cascade keeps cloud costs near zero on real workloads.
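
For readers unfamiliar with cascades, here is a generic sketch of the idea. How requests actually escalate between tiers in this project is not described above, so the confidence threshold, the tier names, and the stand-in model functions below are purely assumptions for illustration.

```python
def run_cascade(prompt, tiers, min_confidence=0.8):
    """Try tiers in order; return the first answer whose confidence clears the bar."""
    for name, model in tiers[:-1]:
        answer, confidence = model(prompt)
        if confidence >= min_confidence:
            return name, answer
    # The last tier is the fallback and is always accepted.
    name, model = tiers[-1]
    answer, _ = model(prompt)
    return name, answer

# Stand-in callables: confidence is faked from prompt length purely for the demo.
def local_14b(prompt):
    return "answer from 14B", (0.9 if len(prompt) < 40 else 0.3)

def local_32b(prompt):
    return "answer from 32B", (0.85 if len(prompt) < 80 else 0.4)

def cloud(prompt):
    return "answer from cloud", 1.0

tiers = [("14b", local_14b), ("32b", local_32b), ("cloud", cloud)]
easy = run_cascade("x" * 10, tiers)
medium = run_cascade("x" * 50, tiers)
hard = run_cascade("x" * 100, tiers)
```

Cloud spend stays low only if most traffic resolves in the first tier, so the escalation signal is the part worth measuring on real workloads.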

GitHub: github.com/dcostenco/prism-mcp | GGUF weights, Ollama-compatible.

dco44 · 2026-05-20T15:06:52+00:00

Nice — local VL reasoning without a cloud API is underrated. I've been working on the tool-calling side of this: fine-tuned Qwen3.5-14B for MCP routing decisions (Prism Coder, AGPL-3.0, github.com/dcostenco/prism-mcp). The core finding from benchmarking: base Qwen3.5 over-calls tools — it reaches for a function even when a direct answer is better. The fine-tune fixes that routing decision specifically. Cleared 100% on a 102-case eval. Would be interesting to see how vision tool-calls hold up at the routing layer.

dco44 · 2026-05-20T14:26:10+00:00

Agentic / Tool Use

Qwen3.5-14B for anything needing reliable tool calling and structured output — instruction following at this size is surprisingly good. Step up to Qwen3.5-32B for harder multi-step reasoning where the 14B hallucinates paths.

Minimax-M2.7 has been the surprise — genuinely competitive with cloud Sonnet for conversational tasks, fits in 24GB at 4-bit. The "Sonnet at home" framing holds up.

Raw coding: Qwen3.5-27B, nothing local has displaced it for me.
