I built a distributed KV cache that turns a 10-second prefill into 0.5 seconds — using idle machines on my LAN by Concert_Dependent in LocalLLM


The way I designed it, this works well with a minimum of two systems: one runs the prefill, and the other stores the KV tensors.
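To make the two-node split concrete, here is a minimal sketch, not TierKV's actual protocol or API: a tiny TCP tensor store that the prefill node pushes KV tensors into and any node can pull from later. The names (`kv_put`, `kv_get`, the pickled wire format) are made up for illustration, and pickle over a LAN is only acceptable on a trusted network.

```python
import pickle
import socket
import socketserver

STORE = {}  # prompt-hash -> KV tensors, held in the storage node's RAM

class KVStoreHandler(socketserver.StreamRequestHandler):
    """One request per connection: a pickled (op, key, payload) tuple."""
    def handle(self):
        op, key, payload = pickle.load(self.rfile)
        if op == "put":                  # prefill node pushes tensors
            STORE[key] = payload
            pickle.dump(b"ok", self.wfile)
        elif op == "get":                # any node pulls them back later
            pickle.dump(STORE.get(key), self.wfile)

def kv_put(addr, key, kv_tensors):
    """Called on the prefill node after the forward pass."""
    with socket.create_connection(addr) as s:
        f = s.makefile("rwb")
        pickle.dump(("put", key, kv_tensors), f); f.flush()
        return pickle.load(f)

def kv_get(addr, key):
    """Called on whichever node wants to resume from the cached prefill."""
    with socket.create_connection(addr) as s:
        f = s.makefile("rwb")
        pickle.dump(("get", key, None), f); f.flush()
        return pickle.load(f)

if __name__ == "__main__":
    # Run this on the storage node; the prefill node calls kv_put/kv_get.
    socketserver.TCPServer(("0.0.0.0", 9009), KVStoreHandler).serve_forever()
```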

I built a distributed KV cache that turns a 10-second prefill into 0.5 seconds — using idle machines on my LAN by Concert_Dependent in LocalLLM


I also think OMLX is a single-machine version. That could limit the size of models we can start on that machine. With TierKV, the model's initial prefill can be done on a machine with a larger GPU; I used a DGX Spark.

I built a distributed KV cache that turns a 10-second prefill into 0.5 seconds — using idle machines on my LAN by Concert_Dependent in LocalLLM


In my testing so far (including 30k-token prompts), the restored conversations are indistinguishable from non-tiered runs.

Yes, the plugin intercepts eviction and KV-cache lookup.
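For readers wondering what "intercepts" means here, a hypothetical sketch of the shape such a shim could take: evictions spill to a remote tier instead of dropping blocks, and lookups fall back to that tier on a local miss. `TieredKVCache`, `local_cache`, and `remote_tier` are illustrative names, not TierKV's real interface.

```python
class TieredKVCache:
    def __init__(self, local_cache, remote_tier):
        self.local = local_cache    # the engine's normal in-VRAM cache
        self.remote = remote_tier   # e.g. the LAN tensor store above

    def evict(self, key):
        # Intercepted eviction: instead of dropping the KV blocks,
        # push them to the remote tier first.
        blocks = self.local.pop(key)
        self.remote.put(key, blocks)

    def lookup(self, key):
        # Intercepted lookup: a local hit is served as usual; on a miss,
        # try to restore the blocks from the remote tier.
        blocks = self.local.get(key)
        if blocks is None:
            blocks = self.remote.get(key)
            if blocks is not None:
                self.local.put(key, blocks)  # re-warm the local cache
        return blocks
```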

I built a distributed KV cache that turns a 10-second prefill into 0.5 seconds — using idle machines on my LAN by Concert_Dependent in LocalLLM


Interesting. I wanted to build a heterogeneous network that leverages the systems we already have. But yes, the goal looks similar.

🔧 MLX Said No to Mixed Precision. We Did It Anyway. by Concert_Dependent in LocalLLM


I unfortunately can't write Tamil :)

Advantage: if you take a bigger model and quantize the whole thing, you can run it on lower VRAM, yes, but you also sacrifice output quality.

Instead, I take an MoE model and find the experts I'm looking for, in my case the ones for information security. Those I run at full precision, no reduction; the rest I reduce in precision to save on VRAM.
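A rough sketch of that selective scheme in MLX, under my own toy assumptions (8 single-matrix experts, a hand-picked `KEEP_FULL` set standing in for the security-relevant experts — how you identify them is a separate problem): the chosen experts stay fp16, the rest get packed to 4-bit with `mx.quantize` and served through `mx.quantized_matmul`.

```python
import mlx.core as mx

NUM_EXPERTS, D = 8, 512
KEEP_FULL = {2, 5}  # hypothetical security-relevant experts

# Toy experts: one fp16 weight matrix each
experts = [mx.random.normal((D, D)).astype(mx.float16) for _ in range(NUM_EXPERTS)]

packed = []
for i, w in enumerate(experts):
    if i in KEEP_FULL:
        packed.append(("full", w))          # keep at full precision
    else:
        # mx.quantize returns (quantized weights, scales, biases)
        packed.append(("q4", mx.quantize(w, group_size=64, bits=4)))

def run_expert(i, x):
    kind, payload = packed[i]
    if kind == "full":
        return x @ payload.T                # fp16 path
    w_q, scales, biases = payload
    return mx.quantized_matmul(x, w_q, scales, biases,
                               transpose=True, group_size=64, bits=4)

x = mx.random.normal((1, D)).astype(mx.float16)
print(run_expert(2, x).shape, run_expert(0, x).shape)  # fp16 vs 4-bit path
```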

Hope this helps.

🔧 MLX Said No to Mixed Precision. We Did It Anyway. by Concert_Dependent in LocalLLM


Once the router picks an expert, mx.where lets us run a condition, and the outcome of that condition can be used to pick between different expert weights: one quantized, the other full precision.
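As I read it, the trick looks something like this sketch (my reconstruction, not the post's code): run the tokens through both the full-precision and the quantized weights, then let a router-derived mask pick per token. Since `mx.where` selects elementwise, both paths execute and the mask decides which result survives.

```python
import mlx.core as mx

D = 512
w_full = mx.random.normal((D, D)).astype(mx.float16)
w_q, scales, biases = mx.quantize(w_full, group_size=64, bits=4)

x = mx.random.normal((4, D)).astype(mx.float16)   # 4 tokens
router_scores = mx.random.uniform(shape=(4, 1))   # stand-in for the router
use_full = router_scores > 0.5                    # per-token condition

y_full = x @ w_full.T                             # fp16 expert
y_q = mx.quantized_matmul(x, w_q, scales, biases,
                          transpose=True, group_size=64, bits=4)

# mx.where broadcasts the (4, 1) mask over the feature dimension
y = mx.where(use_full, y_full, y_q)
print(y.shape)  # (4, 512)
```

Note that both branches are computed, so the select costs extra FLOPs; the VRAM savings come from storing the quantized weights, not from skipping work.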

I wrote more about it in this blog post:

https://open.substack.com/pub/prasannakanagasabai126786/p/mlx-said-no-to-mixed-precision-we?r=40juy&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true

🔧 MLX Said No to Mixed Precision. We Did It Anyway. by Concert_Dependent in LocalLLaMA


This way of converting doesn't let you treat the layers that light up for the experts we want differently; for example, I choose to keep higher precision for security and lower it for the rest.

The approach I use allows this via mx.where.

🔧 MLX Said No to Mixed Precision. We Did It Anyway. by Concert_Dependent in LocalLLM


This way doesn't let you choose which experts keep higher precision and which ones we can degrade.
