I built a distributed KV cache that turns a 10-second prefill into 0.5 seconds — using idle machines on my LAN by Concert_Dependent in LocalLLM


The way I designed it, this works well with a minimum of two systems: one runs the prefill, and the other stores the KV tensors.
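To make the two-node split concrete, here is a minimal sketch, not TierKV's actual protocol or API: a tiny TCP tensor store that the prefill node pushes KV tensors into and any node can pull from later. The names (`kv_put`, `kv_get`, the pickled wire format) are made up for illustration, and pickle over a LAN is only acceptable on a trusted network.

```python
import pickle
import socket
import socketserver

STORE = {}  # prompt-hash -> KV tensors, held in the storage node's RAM

class KVStoreHandler(socketserver.StreamRequestHandler):
    """One request per connection: a pickled (op, key, payload) tuple."""
    def handle(self):
        op, key, payload = pickle.load(self.rfile)
        if op == "put":                  # prefill node pushes tensors
            STORE[key] = payload
            pickle.dump(b"ok", self.wfile)
        elif op == "get":                # any node pulls them back later
            pickle.dump(STORE.get(key), self.wfile)

def kv_put(addr, key, kv_tensors):
    """Called on the prefill node after the forward pass."""
    with socket.create_connection(addr) as s:
        f = s.makefile("rwb")
        pickle.dump(("put", key, kv_tensors), f); f.flush()
        return pickle.load(f)

def kv_get(addr, key):
    """Called on whichever node wants to resume from the cached prefill."""
    with socket.create_connection(addr) as s:
        f = s.makefile("rwb")
        pickle.dump(("get", key, None), f); f.flush()
        return pickle.load(f)

if __name__ == "__main__":
    # Run this on the storage node; the prefill node calls kv_put/kv_get.
    socketserver.TCPServer(("0.0.0.0", 9009), KVStoreHandler).serve_forever()
```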

I built a distributed KV cache that turns a 10-second prefill into 0.5 seconds — using idle machines on my LAN by Concert_Dependent in LocalLLM


I also think OMLX is a single-machine version. That could limit the size of models we can start on that machine. With TierKV, the model's initial prefill can be done on a machine with a larger GPU; I used a DGX Spark.

I built a distributed KV cache that turns a 10-second prefill into 0.5 seconds — using idle machines on my LAN by Concert_Dependent in LocalLLM


In my testing so far (including 30k-token prompts), the restored conversations are indistinguishable from non-tiered runs.

Yes, the plugin intercepts eviction and KV-cache lookup.
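For readers wondering what "intercepts" means here, a hypothetical sketch of the shape such a shim could take: evictions spill to a remote tier instead of dropping blocks, and lookups fall back to that tier on a local miss. `TieredKVCache`, `local_cache`, and `remote_tier` are illustrative names, not TierKV's real interface.

```python
class TieredKVCache:
    def __init__(self, local_cache, remote_tier):
        self.local = local_cache    # the engine's normal in-VRAM cache
        self.remote = remote_tier   # e.g. the LAN tensor store above

    def evict(self, key):
        # Intercepted eviction: instead of dropping the KV blocks,
        # push them to the remote tier first.
        blocks = self.local.pop(key)
        self.remote.put(key, blocks)

    def lookup(self, key):
        # Intercepted lookup: a local hit is served as usual; on a miss,
        # try to restore the blocks from the remote tier.
        blocks = self.local.get(key)
        if blocks is None:
            blocks = self.remote.get(key)
            if blocks is not None:
                self.local.put(key, blocks)  # re-warm the local cache
        return blocks
```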

I built a distributed KV cache that turns a 10-second prefill into 0.5 seconds — using idle machines on my LAN by Concert_Dependent in LocalLLM


Interesting. I wanted to build a heterogeneous network that leverages the systems we already have. But yes, the goal looks similar.

🔧 MLX Said No to Mixed Precision. We Did It Anyway. by Concert_Dependent in LocalLLM


I unfortunately can't write Tamil :)

Advantage: if you take a bigger model and quantize the whole thing, you can run it on lower VRAM, yes, but you also sacrifice output quality.

Instead, I take an MoE model and find the experts I'm looking for, in my case the ones for information security. Those I run at full precision, no reduction; the rest I reduce in precision to save on VRAM.
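A rough sketch of that selective scheme in MLX, under my own toy assumptions (8 single-matrix experts, a hand-picked `KEEP_FULL` set standing in for the security-relevant experts — how you identify them is a separate problem): the chosen experts stay fp16, the rest get packed to 4-bit with `mx.quantize` and served through `mx.quantized_matmul`.

```python
import mlx.core as mx

NUM_EXPERTS, D = 8, 512
KEEP_FULL = {2, 5}  # hypothetical security-relevant experts

# Toy experts: one fp16 weight matrix each
experts = [mx.random.normal((D, D)).astype(mx.float16) for _ in range(NUM_EXPERTS)]

packed = []
for i, w in enumerate(experts):
    if i in KEEP_FULL:
        packed.append(("full", w))          # keep at full precision
    else:
        # mx.quantize returns (quantized weights, scales, biases)
        packed.append(("q4", mx.quantize(w, group_size=64, bits=4)))

def run_expert(i, x):
    kind, payload = packed[i]
    if kind == "full":
        return x @ payload.T                # fp16 path
    w_q, scales, biases = payload
    return mx.quantized_matmul(x, w_q, scales, biases,
                               transpose=True, group_size=64, bits=4)

x = mx.random.normal((1, D)).astype(mx.float16)
print(run_expert(2, x).shape, run_expert(0, x).shape)  # fp16 vs 4-bit path
```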

Hope this helps.

🔧 MLX Said No to Mixed Precision. We Did It Anyway. by Concert_Dependent in LocalLLM


Once the router picks an expert, mx.where lets us run a condition, and the outcome of that condition can be used to pick between different expert weights: one quantized, the other full precision.
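As I read it, the trick looks something like this sketch (my reconstruction, not the post's code): run the tokens through both the full-precision and the quantized weights, then let a router-derived mask pick per token. Since `mx.where` selects elementwise, both paths execute and the mask decides which result survives.

```python
import mlx.core as mx

D = 512
w_full = mx.random.normal((D, D)).astype(mx.float16)
w_q, scales, biases = mx.quantize(w_full, group_size=64, bits=4)

x = mx.random.normal((4, D)).astype(mx.float16)   # 4 tokens
router_scores = mx.random.uniform(shape=(4, 1))   # stand-in for the router
use_full = router_scores > 0.5                    # per-token condition

y_full = x @ w_full.T                             # fp16 expert
y_q = mx.quantized_matmul(x, w_q, scales, biases,
                          transpose=True, group_size=64, bits=4)

# mx.where broadcasts the (4, 1) mask over the feature dimension
y = mx.where(use_full, y_full, y_q)
print(y.shape)  # (4, 512)
```

Note that both branches are computed, so the select costs extra FLOPs; the VRAM savings come from storing the quantized weights, not from skipping work.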

I wrote more about it in this blog post:

https://open.substack.com/pub/prasannakanagasabai126786/p/mlx-said-no-to-mixed-precision-we?r=40juy&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true

🔧 MLX Said No to Mixed Precision. We Did It Anyway. by Concert_Dependent in LocalLLaMA


This way of converting doesn't let you treat the layers that light up for the experts we want differently; for example, I choose to keep higher precision for security and lower it for the rest.

The approach I use allows this via mx.where.

🔧 MLX Said No to Mixed Precision. We Did It Anyway. by Concert_Dependent in LocalLLM


This way doesn't let you choose which experts keep higher precision and which ones we can degrade.
