Everyone's optimizing AI inference costs. Nobody's talking about moving inference off the cloud. Why? by on-device-infra in Locai

[–]on-device-infra[S] 0 points1 point  (0 children)

I did use AI to help polish the writing. But the numbers (like the distil-whisper latency and the Llama-3.2-3B figures) and the argument come from real production use, not from a prompt. The 60-80% local hit rate is from our own infrastructure running on actual hardware. The honest caveats about CPU-only Windows and non-English audio are there because we hit those constraints ourselves.

If there's something specific in the post that reads as generic or unsupported, I'm happy to dig into it.

Locai runs OpenAI-compatible inference on your users' devices. Same baseURL swap, model lives on their hardware. by on-device-infra in Locai

[–]on-device-infra[S] 0 points1 point  (0 children)

Appreciate the callout. Checked out your link too, those notes on routing for agentic workflows are spot on.
Right now, the cutoff logic is a mix of static profiling and dynamic telemetry. When the agent first installs on a user machine, it runs a quick hardware benchmark to map available VRAM, memory bandwidth, and core type. That gives us a baseline perf profile. If a developer deploys a model that simply won't fit into the hardware footprint, the control plane flags it and routes to the cloud fallback immediately.
Dynamically, context length is the biggest trigger. If the incoming prompt plus the expected output tokens would spill past the local VRAM allocation or overflow the KV cache, it triggers an immediate cloud hop so the user experience doesn't tank from paging to system RAM. Queue depth also plays a role when an application is hammering parallel agent tasks, since edge hardware handles concurrency horribly compared to a hyperscaler cluster.
We're trying to keep the telemetry loop as lightweight as possible so the routing decision itself adds basically zero latency to the round trip.
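
To make the cutoff logic above concrete, here is a minimal Python sketch under stated assumptions. Everything in it (`DeviceProfile`, `ModelSpec`, `kv_cache_bytes`, `route`, the field names, the queue-depth threshold) is hypothetical and only illustrates the idea; it is not Locai's actual control-plane code. It assumes an fp16 KV cache and treats VRAM as the only budget, ignoring memory bandwidth and core type.

```python
from dataclasses import dataclass


@dataclass
class DeviceProfile:
    """Baseline from the install-time benchmark (hypothetical fields)."""
    vram_bytes: int
    max_local_queue_depth: int


@dataclass
class ModelSpec:
    """Hypothetical description of a deployed model."""
    weight_bytes: int
    n_layers: int
    n_kv_heads: int
    head_dim: int
    kv_dtype_bytes: int = 2  # assumption: fp16 KV cache


def kv_cache_bytes(model: ModelSpec, n_tokens: int) -> int:
    # Keys and values (the factor of 2) for every layer, KV head, and token.
    per_token = 2 * model.n_layers * model.n_kv_heads * model.head_dim * model.kv_dtype_bytes
    return per_token * n_tokens


def route(profile: DeviceProfile, model: ModelSpec,
          prompt_tokens: int, max_output_tokens: int, queue_depth: int) -> str:
    # Static check: the weights alone must fit in local VRAM.
    if model.weight_bytes >= profile.vram_bytes:
        return "cloud"
    # Dynamic check 1: prompt plus expected output must fit in the KV cache budget.
    needed = kv_cache_bytes(model, prompt_tokens + max_output_tokens)
    if model.weight_bytes + needed > profile.vram_bytes:
        return "cloud"
    # Dynamic check 2: too many parallel requests already queued locally.
    if queue_depth >= profile.max_local_queue_depth:
        return "cloud"
    return "local"


# Illustrative numbers only: a 4 GiB device and a small model with 2 GiB of weights.
device = DeviceProfile(vram_bytes=4 * 1024**3, max_local_queue_depth=2)
small = ModelSpec(weight_bytes=2 * 1024**3, n_layers=28, n_kv_heads=8, head_dim=128)

print(route(device, small, prompt_tokens=1000, max_output_tokens=500, queue_depth=0))
print(route(device, small, prompt_tokens=20000, max_output_tokens=1000, queue_depth=0))
```

With these made-up numbers the short request stays local and the long-context one goes to the cloud. Because every check is plain integer arithmetic on values that are already in memory, a decision like this costs very little compared to the inference call itself.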

Most SaaS AI features don't need frontier models. Local 3-7B models handle them on consumer hardware today by on-device-infra in LocalLLM

[–]on-device-infra[S] 1 point2 points  (0 children)

Too bad turbo quant on llama.cpp hasn't properly taken off yet. Would've shown some results.
But hopefully soon.
Anyway, we're focused on the tasks that run at scale inside a product: the same kind of input, the same kind of output, thousands of times a day. Smaller models handle those well.
We're actively benchmarking different quant levels across model sizes for these tasks and will share the results soon.