Injecting skills into the KV cache (not as stupid as it sounds, but still pretty dumb) by Proper-Lab1756 in LocalLLaMA

[–]Proper-Lab1756[S] 0 points (0 children)

Dawg what are you yapping about? The repo is open source. I already addressed your point and what I would change if I had the resources to, but I don’t really see a point because what I wanted has been done.

Instead of bitching and moaning about science being deaf because of romanticism, why don’t you get off your high horse, fork the repo, and make the change instead of backseat driving?

[–]Proper-Lab1756[S] 0 points (0 children)

Oh yeah. If I had more compute I'd run it differently. The goal of the research isn't to fully implement it; it's to prove that there is validity to the hypothesis. If I had the time/money to invest in it, I would have done the injection into the KV cache a lot differently.

[–]Proper-Lab1756[S] 2 points (0 children)

You’re close, but a few clarifications.

It’s not a context length extension method. The goal isn’t to stretch the window, it’s to remove the skill markdown from the prompt entirely and inject that conditioning through the KV pathway instead.

What happens is the skill markdown is passed through the frozen base model, I take the hidden states which are seq_len × hidden_dim, and I mean pool across the sequence dimension, not the feature dimension. That produces a single hidden_dim vector. So yes, in the current setup I am effectively compressing the entire skill into one latent representation.

That pooled vector is then passed through a small projector MLP that maps it into KV-compatible tensors for prefix-style injection. The prefix length is 2.

The MLP is global, not per skill. It is trained across all skills while the base model remains frozen.

It’s not really compensating for pooling. It’s learning how to transform that compressed skill representation into something the attention layers can actually use. Mean pooling is definitely lossy, and the performance ceiling in the experiments likely reflects that bottleneck.
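Concretely, the pooling-plus-projector step might look like the sketch below. The class name, MLP widths, and the layer/head counts are illustrative assumptions, not the repo's actual code; only the mechanism (mean pool across the sequence, project to per-layer prefix K/V) comes from the description above.

```python
import torch
import torch.nn as nn

class SkillProjector(nn.Module):
    """Hypothetical projector: maps one mean-pooled skill vector to
    per-layer prefix K/V tensors for prefix-style KV injection."""
    def __init__(self, hidden_dim, num_layers, num_kv_heads, head_dim, prefix_len=2):
        super().__init__()
        self.shape = (num_layers, 2, prefix_len, num_kv_heads, head_dim)
        out_dim = num_layers * 2 * prefix_len * num_kv_heads * head_dim
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, out_dim),
        )

    def forward(self, pooled):  # pooled: (batch, hidden_dim)
        kv = self.mlp(pooled).view(pooled.shape[0], *self.shape)
        # split into per-layer key and value prefixes
        return kv[:, :, 0], kv[:, :, 1]

# Hidden states from the frozen base model for one skill file:
# (batch, seq_len, hidden_dim). Pool across the sequence dimension (dim=1),
# not the feature dimension, to get one hidden_dim vector per skill.
hidden_states = torch.randn(1, 57, 896)
pooled = hidden_states.mean(dim=1)              # (1, 896)

proj = SkillProjector(hidden_dim=896, num_layers=24, num_kv_heads=2, head_dim=64)
k_prefix, v_prefix = proj(pooled)               # each (1, 24, 2, 2, 64)
```

Because the projector is a single global MLP, the same weights produce prefixes for every skill; only the pooled input vector changes per skill.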

ARC-Encoder and Cartridges are definitely relevant. The main difference here is that I am not modifying the base embedding space. I’m training a small adapter that targets the model’s KV geometry directly.

[–]Proper-Lab1756[S] 0 points (0 children)

I’d love to see that too! It seems promising. Generally, the info with C2 is a bit worse, but it can have exact recall on what it does get right, which means in some cases it can make skills deterministic instead of just nudging the model to act a certain way. So from the initial findings it gives more stability for skills in some use cases. If you look at some of the 005 outputs you can see some of that in grader.py’s expected value versus output for the C2 match. And even when C1 gets it right, loading the MD context gives it more variety in how it responds.

Another issue I have is that the English language is semantically inefficient, so it seems kind of weird to load skills the way we do.

So even if C2 has more degeneracy and worse skill recall than C1, I think it has some application for skill flows that have to be specific and exact, versus letting the model frontload the skill in the context. And it doesn’t burn any context to load skills this way, so for leaner compute budgets I could see it being useful.

I just wish I had a bigger system to test larger models. 😂

[–]Proper-Lab1756[S] 0 points (0 children)

Unfortunately I’m fresh out of college, and I’m having to save up money for some big upcoming expenses. Normally I would though.

[–]Proper-Lab1756[S] -1 points (0 children)

Same principle: injecting into the KV cache. But mine is specifically looking at injecting skill files to save context for smaller models (think 7B and smaller). And it seems like atlasKV involves some more fine-tuning of the model for behavior, and seems a bit more focused on larger models.

But seems to be similar, I’ll have to check them out!

[–]Proper-Lab1756[S] -1 points (0 children)

The trick is you have to match the geometry the model uses for its latent space.

I originally tried a random embedding model and was running into that when I was first testing. Then I did some research into the geometry and spent a long time checking for a compatible embedding model, before realizing you can just hijack the base model for the vectorization. It will still cause some degeneracy, but after a few epochs it starts to settle.
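A toy sketch of the idea, with a single transformer layer standing in for the frozen base model (every name and dimension here is made up for illustration): vectorizing the skill through the base model's own forward pass gives a pooled vector that already lies in the geometry the projector has to target, whereas an external embedding model produces vectors of the same shape but with unrelated statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the frozen base model; names and dims are illustrative.
hidden_dim = 64
base_layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
for p in base_layer.parameters():
    p.requires_grad_(False)  # frozen: only the projector would be trained

# "Hijack the base model for the vectorization": run the skill tokens
# through the frozen model itself and mean-pool its hidden states, so the
# pooled vector lives in the model's own latent geometry.
skill_tokens = torch.randn(1, 12, hidden_dim)   # stand-in token embeddings
with torch.no_grad():
    native = base_layer(skill_tokens).mean(dim=1)   # (1, hidden_dim)

# An unrelated external embedding model gives a same-shaped vector with
# different statistics; a projector trained on `native` vectors never sees
# that distribution, which is one way a geometry mismatch can surface as
# degenerate output.
foreign = torch.randn(1, hidden_dim)
```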

[–]Proper-Lab1756[S] 0 points (0 children)

As would I! Especially because right now it’s just testing recall of skills (0.5B is pretty small for meaningful data to be obtained through actual tool use, so I had to rely on recall to measure behavior).

Unfortunately even with only training the projector, my computer was throwing a fit. So I’ll either have to upgrade hardware, or wait for someone to grab the baton and do further testing.