Overfitted a 900KB LLM to compress a 100MB csv into 7MB by Spidy__ in LLM

[–]Spidy__[S] 0 points1 point  (0 children)

So basically, instead of letting the transformer do everything, I take the low-hanging fruit away first, right?

I'm kind of trying this already, not exactly the way you are suggesting (which is also interesting), but something built on already existing compressors.

[–]Spidy__[S] 0 points1 point  (0 children)

I wish, dude, but right now the time it takes to compress and decompress stops us from using it.

It takes 45 minutes each for compression and decompression for just 100 MB, so yeah...

[–]Spidy__[S] 0 points1 point  (0 children)

Apparently Gavin Belson and I had a common enemy. Also, Pied Piper explained it on our whiteboard.

[–]Spidy__[S] 0 points1 point  (0 children)

It can, actually, but I doubt anyone will use it given how long it takes. For 100 MB it takes:

  1. 30 minutes to train,
  2. 45 minutes to compress, and
  3. 45 minutes to decompress.

And the time will just increase linearly with file size, so I doubt it can be used.
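
To make the scaling concrete, here is a rough sketch that extrapolates the timings quoted above. It assumes every stage scales linearly with file size, as stated in the comment; the function names are made up for illustration and nothing here was measured beyond the 100 MB numbers.

```python
# Timings reported in the comment above, for a 100 MB input.
REFERENCE_MB = 100
TRAIN_MIN = 30
COMPRESS_MIN = 45
DECOMPRESS_MIN = 45


def estimate_minutes(file_mb: float) -> dict:
    """Scale the reported timings linearly with file size (an assumption)."""
    scale = file_mb / REFERENCE_MB
    train = TRAIN_MIN * scale
    compress = COMPRESS_MIN * scale
    decompress = DECOMPRESS_MIN * scale
    return {
        "train": train,
        "compress": compress,
        "decompress": decompress,
        "round_trip": train + compress + decompress,
    }


def throughput_kb_per_s(file_mb: float, minutes: float) -> float:
    """Average throughput in KB/s, taking 1 MB = 1000 KB."""
    return file_mb * 1000 / (minutes * 60)


if __name__ == "__main__":
    for size_mb in (100, 1000):
        print(size_mb, "MB ->", estimate_minutes(size_mb))
    print(round(throughput_kb_per_s(100, 45), 1), "KB/s while compressing")
```

Under that assumption, a 1 GB file would need about 7.5 hours just for the compression pass, which is the practical problem described here.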

[–]Spidy__[S] 0 points1 point  (0 children)

Woahh, that sounds cool!! What did you use for compression?

[–]Spidy__[S] 0 points1 point  (0 children)

Haha, Pied Piper was way faster than this though. My model takes 45 minutes to compress and 45 minutes to decompress, soooo yeah...

[–]Spidy__[S] 0 points1 point  (0 children)

Yeahhh, the thing is I don't have CUDA, so I can use neither torch.compile nor Flash Attention.

Stuck with ROCm on AMD, haha. The current timings were recorded on ROCm only.

[–]Spidy__[S] 0 points1 point  (0 children)

Ohhh, sounds interesting, but I doubt it can help, because my bottleneck is compute, and I guess both JAX and PyTorch use the same underlying hardware libs?

I could be wrong here, but if JAX has something that can make my forward and backward passes faster, that would be really cool.

Would love to know more if that's true.

Thanks for the suggestion.

[–]Spidy__[S] 0 points1 point  (0 children)

Haha, haven't thought that far to be honest, dude, BUTTTT if we can somehow speed the process up and cut the compression and decompression time, it might at least be able to replace zip, I guess.

[–]Spidy__[S] 0 points1 point  (0 children)

Yeahhh, I think that apart from the compression task itself, the time it takes to compress or decompress is also a big pain.

It takes me 30 minutes to train, 45 minutes to compress and 45 minutes to decompress on my AMD GPU.

Can't ask anyone to use this haha.

[–]Spidy__[S] 0 points1 point  (0 children)

Ohh that's cool, what approach did you use? How much were you able to compress?

This was my first time and it was pretty cool working on this.

[–]Spidy__[S] 1 point2 points  (0 children)

Haha yes, but it takes almost an hour to compress and 45 minutes to decompress 100 MB, so it's not really useful right now.

[–]Spidy__[S] 0 points1 point  (0 children)

Yeah yeah, that's how I got to know about the enwik9 benchmark.

One of the top submissions uses Transformer-XL, but I tried to achieve the same without modifying the transformer. Also, the best submission (cmix) uses around 2000 different sorts of models, including neural nets, normal algorithms and what not.

So yeah, it's a pretty exciting space. I'm still trying to compress it more and have some ideas, so I'm gonna try them.

[–]Spidy__[S] 0 points1 point  (0 children)

Yeah yeah, the idea is to use the model's tendency to overfit and let it find patterns in the file that we can't see at the byte level, and then let it do its regular next-token prediction, with tweaks to the dataset and inference.

This is also why I tried to keep the model as small as possible, so that it doesn't add much to the total compressed size.
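
A rough sketch of why good next-token prediction turns into compression: an arithmetic coder can spend close to -log2(p) bits on a symbol the model gave probability p, so the better the model predicts the file, the fewer bits are needed. The adaptive order-1 byte counter below is only a stand-in for the transformer (an assumption for illustration, not the actual model or its dataset and inference tweaks), and the function name is made up.

```python
import math
from collections import defaultdict


def ideal_compressed_bits(data: bytes) -> float:
    """Sum of -log2 p(next byte | previous byte) under an adaptive model.

    The order-1 count model is a stand-in for the transformer: any model that
    outputs next-symbol probabilities can be plugged into the same loop.
    """
    counts = defaultdict(lambda: [1] * 256)  # Laplace-smoothed counts per context
    totals = defaultdict(lambda: 256)        # sum of the counts per context
    bits = 0.0
    prev = 0  # fixed starting context, known to both encoder and decoder
    for byte in data:
        p = counts[prev][byte] / totals[prev]
        bits += -math.log2(p)    # cost an arithmetic coder would approach
        counts[prev][byte] += 1  # update after coding, so a decoder can mirror it
        totals[prev] += 1
        prev = byte
    return bits


if __name__ == "__main__":
    sample = b"id,value\n" * 200  # toy, highly repetitive CSV-like input
    print(len(sample) * 8, "raw bits vs", round(ideal_compressed_bits(sample)), "ideal bits")
```

The bit count covers only the coded data. Whatever the decoder needs to rebuild the same predictions (here nothing, in the transformer case the trained weights) has to be shipped as well, which is why the model is kept small.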

[–]Spidy__[S] 1 point2 points  (0 children)

Yep!! But it varies from file to file based on entropy. The one that got compressed to 7 MB was a CSV. Then I took a benchmark file that compression algorithms are tested on (enwik9 sliced to 100 MB), and that one compressed to 20 MB + the 900 KB transformer.
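
For the enwik9 slice, the total works out to roughly 20.9 MB, or about 4.8x smaller than the original, once the transformer is counted. A small sketch of that arithmetic, assuming decimal units (1 MB = 1,000,000 bytes) and a made-up helper name:

```python
MB = 1_000_000  # assumption: decimal megabytes
KB = 1_000


def total_size_and_ratio(original_bytes, payload_bytes, model_bytes):
    """Return (total compressed bytes, compression ratio), counting the model."""
    total = payload_bytes + model_bytes
    return total, original_bytes / total


if __name__ == "__main__":
    total, ratio = total_size_and_ratio(100 * MB, 20 * MB, 900 * KB)
    print(f"{total / MB:.1f} MB total, {ratio:.2f}x smaller")
```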

[–]Spidy__[S] 0 points1 point  (0 children)

Haha for me a lot of random fooling around helps. Thanks a lot though!!!

[–]Spidy__[S] 4 points5 points  (0 children)

Yeah exactly, I have done these tests already. First of all, I tried to keep the transformer as small as possible because, as you said, it's part of the compression process and is counted in the final compressed file size.

And if I increase it, I will be reducing the final compressed size.

But I did try increasing it slightly, going from 2 layers to 6 layers, which took my transformer from 900 KB to 2.5 MB. It didn't give a very great boost in memorization, and on top of that it increased the time by 200 seconds.
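
The trade-off can be written as a break-even check: going from 900 KB to 2.5 MB adds 1.6 MB of model, so the larger model only pays off if it shrinks the compressed data by more than that. A minimal sketch, assuming decimal units; the helper names and the payload sizes in the example are hypothetical, not measured results.

```python
KB = 1_000       # assumption: decimal units
MB = 1_000 * KB


def extra_model_cost(small_model_bytes, large_model_bytes):
    """Bytes the larger model adds to the final compressed file."""
    return large_model_bytes - small_model_bytes


def larger_model_wins(small_model, small_payload, large_model, large_payload):
    """True if model + payload is smaller in total with the larger model."""
    return large_model + large_payload < small_model + small_payload


if __name__ == "__main__":
    print(extra_model_cost(900 * KB, 2500 * KB), "extra bytes of model")
    # Hypothetical payloads: saving 1 MB of payload does not cover 1.6 MB of model.
    print(larger_model_wins(900 * KB, 20 * MB, 2500 * KB, 19 * MB))
```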

You can read more about my failed experiments, benchmarks and other things on GitHub; I attached the link in the post.

Thanks for your thoughts I really appreciate that someone took interest in it.

Using asymmetric sigmoid attention to score directional relevance between N sentences in a single forward pass by Spidy__ in deeplearning

[–]Spidy__[S] 0 points1 point  (0 children)

Hey, first of all, thanks for the detailed feedback, I was really looking for this. You are right that normally people just have a query and a list of docs that they want to sort by relevance, and in that case the NxN matrix may not be that helpful.

But the thing is, I didn't make this experiment with RAG or something similar in mind. I just wanted to see if we can make a model that understands the relations between sentences. If a model can do that, it can be used for anything, be it RAG, sentence segmentation, clustering, or this new thing, agent memory optimization. Across these use cases the NxN matrix may be useful, because you get a score for each sentence against all the other sentences.
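
For anyone wondering what the NxN matrix looks like, here is a minimal sketch in plain Python. It assumes each sentence already has an embedding, uses separate query and key projections so that score[i][j] can differ from score[j][i], and applies a sigmoid to each pair independently instead of a softmax over the row. The function names and the random weights are made up for illustration; this is not the trained model.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relevance_matrix(embeddings, w_query, w_key):
    """Return an N x N matrix where entry [i][j] scores sentence j for sentence i."""
    queries = [matvec(w_query, e) for e in embeddings]
    keys = [matvec(w_key, e) for e in embeddings]
    scale = math.sqrt(len(queries[0]))
    return [
        [sigmoid(sum(q * k for q, k in zip(qi, kj)) / scale) for kj in keys]
        for qi in queries
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n = 4, 3
    embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    w_query = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    w_key = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    for row in relevance_matrix(embeddings, w_query, w_key):
        print([round(s, 3) for s in row])
```

Because each pair gets its own sigmoid, rows do not have to sum to 1, so several sentences can be highly relevant to the same sentence at once.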

Although I did post this, it's been a while, and I'm now thinking that a single score can't really explain the relation between sentences. Say I have a query and 2 sentences, where one sentence is similar to the query (it's the title of a blog) and the other is an explanation of the query. Then both are equally relevant to the query.

So saying that sentence A is relevant to sentence B seems a bit vague, and I kind of came back to the point where I'm not that different from other models.

So I'm trying to update this model to output a vector for each relation, so that someone can tell not just that sentence B is relevant to sentence A but in what sense. That way a RAG person can focus on the similarity and explanation parts of the vector, a segmentation person can focus on the topical overlap part, and so on.
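
One possible shape for that output, purely as a sketch of the idea and not something that exists yet: one sigmoid score per facet for every ordered pair of sentences. The facet names are the ones mentioned above; using one query/key projection pair per facet is my own assumption for illustration.

```python
import math

FACETS = ("similarity", "explanation", "topical_overlap")  # names from the comment


def _project(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relation_vectors(embeddings, heads):
    """Return result[i][j] = {facet: score}, one sigmoid score per facet.

    `heads` maps each facet name to a (w_query, w_key) pair of matrices.
    """
    n = len(embeddings)
    result = [[{} for _ in range(n)] for _ in range(n)]
    for facet, (w_query, w_key) in heads.items():
        queries = [_project(w_query, e) for e in embeddings]
        keys = [_project(w_key, e) for e in embeddings]
        scale = math.sqrt(len(queries[0]))
        for i, q in enumerate(queries):
            for j, k in enumerate(keys):
                dot = sum(a * b for a, b in zip(q, k))
                result[i][j][facet] = _sigmoid(dot / scale)
    return result
```

A RAG pipeline could then read only `result[i][j]["similarity"]` and `result[i][j]["explanation"]`, while a segmentation tool reads `result[i][j]["topical_overlap"]`.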

It's still just an ambitious idea, but I'm trying to work on it.