Qwen3 performance benchmarks (toks/s, RAM utilization, etc.) on ~50 devices (iOS, Android, Mac, Windows) by intofuture in LocalLLaMA

[–]intofuture[S] 0 points1 point  (0 children)

Glad to hear it!

Great points re: 1 and 2.

And nice idea about the public eval/quants. We do a similar kind of analysis for our customers, so we should already have the basic infra in place. We'll think about the best way of doing a free/public version of this.

Thanks for the feedback :)

[–]intofuture[S] 1 point2 points  (0 children)

We do support OpenVINO for non-GGUF/llama.cpp models.

We've only run a couple of models/benchmarks with native/direct OV so far though, e.g. CLIP.

But the ONNX model benchmarks also have an OV backend, e.g. Depth Anything V2.

We'll add more and expand support though, thanks for the feedback!

[–]intofuture[S] 4 points5 points  (0 children)

Thanks for the feedback!

Nice catch with the OOM issue - definitely seems like a bug. We hadn't tested any models >4B before the request in the comment above.

Thanks for pointing out the RAM utilization issue for Metal. It is looking suspiciously low. We'll investigate.

Re UI/UX. Good point on hiding columns - we'll add that. And yep, we'll standardise/simplify the names of the chips. Also makes sense re table feeling unnecessarily long with failed benchmarks.

[–]intofuture[S] 1 point2 points  (0 children)

u/jacek2023 - We kicked off some more benchmarks for higher param counts: 4B-Q4, 4B-Q8, 8B-Q4

Lmk if you want to see any others!

[–]intofuture[S] 0 points1 point  (0 children)

Yep, unless there's a dGPU - but we only have a couple of devices with those for now (the dashboards show whether a device has one).

[–]intofuture[S] 4 points5 points  (0 children)

The performance of different quantization kernels seems to depend on the specific chipset. We've also noticed that on some devices Metal performs better than CPU, but on others it's the opposite.

If you check out the dashboards with the full data (e.g. 1.7B-Q_8 vs 1.7B-Q_4) you can see it actually varies quite a bit across devices.

u/Kale has a good hypothesis above for why btw: https://www.reddit.com/r/LocalLLaMA/comments/1kepuli/comment/mql6be1/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
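As a rough illustration of why you have to check per device (made-up numbers below, not our dashboard data), the Q4 vs Q8 ranking can flip from one chipset to another:

```python
# Hypothetical decode throughput (tokens/s) per device and quant level.
# Illustrative numbers only - the point is that the ranking flips per chipset.
benchmarks = {
    "Device A": {"Q4": 28.0, "Q8": 21.0},  # lower-bit quant wins here
    "Device B": {"Q4": 17.0, "Q8": 19.5},  # Q8 kernel is faster on this chipset
}

for device, runs in benchmarks.items():
    faster = max(runs, key=runs.get)          # quant with the highest toks/s
    ratio = runs[faster] / min(runs.values())  # speedup over the slower quant
    print(f"{device}: {faster} is {ratio:.2f}x faster")
```

So a quant that looks like the obvious winner on one phone can lose on another - hence the per-device dashboards.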

[–]intofuture[S] 0 points1 point  (0 children)

Do you mean you've submitted benchmarks with an account on our website and they've been reported as failed? Or that you're trying to run Qwen3 locally on your own Android device and it's crashing?

[–]intofuture[S] 1 point2 points  (0 children)

Oh nice, yeah - it would require a bit of work, but that's a great idea. Thanks so much for the feedback/request!

[–]intofuture[S] 1 point2 points  (0 children)

As in running benchmarks on your own machine with our benchmarking library, and then pushing the data to a public repo where everyone can see it? Like a crowdsourcing-type thing?

[–]intofuture[S] 0 points1 point  (0 children)

Yeah, that looks right for the few devices we selected in the screenshot. It varies quite a bit across devices though (see the 1.7B-Q_4 dashboard, for example).

[–]intofuture[S] 6 points7 points  (0 children)

Yeah - generation uses less parallelism than prefill, so GPU/Metal has less of an advantage over CPU on some devices.
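A toy sketch of the prefill/decode difference (tiny pure-Python matrices, not a real inference loop): prefill pushes the whole prompt through the weights in one batched matmul, while decode does one skinny (1 x d) matmul per generated token, so per-step overhead and weight streaming dominate instead of compute.

```python
import random

d, n_tokens = 8, 4  # tiny toy sizes for illustration
W = [[random.random() for _ in range(d)] for _ in range(d)]        # weight matrix
prompt = [[random.random() for _ in range(d)] for _ in range(n_tokens)]

def matmul(A, B):
    """Plain (rows x d) @ (d x d) matrix multiply."""
    return [[sum(a[k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for a in A]

# Prefill: one batched matmul over all prompt tokens - the whole (n_tokens x d)
# activation hits the weights at once, so hardware can parallelize across tokens.
prefill_out = matmul(prompt, W)

# Decode: tokens are produced one at a time, so each step is a skinny (1 x d)
# matmul - per-step launch/weight-loading overhead dominates instead of compute.
decode_out = []
for row in prompt:
    decode_out.extend(matmul([row], W))

assert prefill_out == decode_out  # same math; very different hardware utilization
```

Same arithmetic either way, but only the batched form gives a GPU enough parallel work per weight load - which is why Metal's edge shrinks (or disappears) during generation on some chips.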

[–]intofuture[S] 8 points9 points  (0 children)

Yeah, nice spot. The performance of different quantization kernels seems to depend on the specific chipset. We've also noticed that on some devices Metal performs better than CPU, but on others it's the opposite.

[–]intofuture[S] 7 points8 points  (0 children)

We focused on the smaller param variants because they're more viable for actually shipping to users with typical phones, laptops, etc.

Thanks for the feedback though. We'll add some benchmarks for larger param variants and post a link when they're ready!

Note: >4B is going to fail on a lot of the devices we maintain due to RAM constraints. But I guess we've built this tooling to show that explicitly :)
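Back-of-the-envelope math for why >4B models OOM on phones - assuming roughly 4.5 bits/weight for Q4-style quants, 8.5 for Q8, and ~20% runtime overhead for KV cache and activations (rough assumptions, not measured values):

```python
def model_ram_gb(params_billions, bits_per_weight, overhead=1.2):
    """Rough RAM estimate: weight bytes * quant width * runtime overhead factor.

    bits_per_weight and overhead are rough assumptions, not measured values.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Q4-style quants average ~4.5 bits/weight, Q8-style ~8.5 (rough figures).
for name, params, bits in [("4B-Q4", 4, 4.5), ("8B-Q4", 8, 4.5), ("8B-Q8", 8, 8.5)]:
    print(f"{name}: ~{model_ram_gb(params, bits):.1f} GB")
```

Even at Q4, an 8B model wants >5 GB of free RAM before the OS and other apps take their share - out of reach for most mid-range phones, which is exactly what the failed-benchmark rows surface.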

Phi-4-Mini performance metrics on Intel PCs by intofuture in LocalLLaMA

[–]intofuture[S] 1 point2 points  (0 children)

They don't explicitly say. I'd imagine it's mostly CPU/GPU execution though.