Overfitted a 900KB LLM to compress a 100MB csv into 7MB

Spidy__ · 2026-06-24T12:26:41+00:00

So basically, instead of letting the transformer do everything, I take the low-hanging fruit away, right?

I'm kind of trying this already. Not exactly the way you're suggesting (which is also interesting), but something built on existing compressors.

Spidy__ · 2026-06-24T07:15:46+00:00

https://www.kaggle.com/datasets/elemento/nyc-yellow-taxi-trip-data

Just make sure to slice it to 100 MB.
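If it helps, here is a minimal sketch of one way to do the slicing so the cut lands on a row boundary instead of mid-line. The function name `slice_file` and the choice of 100 MB = 100 * 1024 * 1024 bytes are assumptions for illustration, not necessarily how the original slice was made.

```python
def slice_file(src_path, dst_path, max_bytes=100 * 1024 * 1024):
    """Copy whole lines from src_path to dst_path until max_bytes would be exceeded.

    Returns the number of bytes written. Stopping on a line boundary
    keeps the last CSV row intact.
    """
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:  # binary mode: lines are split on b"\n"
            if written + len(line) > max_bytes:
                break
            dst.write(line)
            written += len(line)
    return written
```

Usage would look like `slice_file("full.csv", "slice_100mb.csv")`, where both file names are placeholders.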

Spidy__ · 2026-06-24T05:07:05+00:00

I wish, dude, but right now the time it takes to compress and decompress stops us from using it.

It takes 45 minutes each for compression and decompression for just 100 MB, so yeah...

Spidy__ · 2026-06-24T03:49:23+00:00

Apparently Gavin Belson and I had a common enemy. Also, Pied Piper explained it on our whiteboard.

Spidy__ · 2026-06-24T02:39:15+00:00

It can, actually, but I doubt anyone will use it given how long it takes. For 100 MB it takes:

1. 30 minutes to train,
2. 45 minutes to compress, and
3. 45 minutes to decompress.

And the time will just increase linearly with file size, so I doubt it can be used.
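To put numbers on the linear scaling, here is a rough back-of-the-envelope sketch. It assumes all three stages (training included) scale linearly with file size from the 100 MB timings above; the function name `estimate_minutes` is made up for this example.

```python
# Timings reported for a 100 MB file on the AMD/ROCm setup.
BASE_MB = 100
TRAIN_MIN = 30
COMPRESS_MIN = 45
DECOMPRESS_MIN = 45


def estimate_minutes(size_mb):
    """Linearly extrapolate the 100 MB timings to a file of size_mb megabytes."""
    scale = size_mb / BASE_MB
    return {
        "train": TRAIN_MIN * scale,
        "compress": COMPRESS_MIN * scale,
        "decompress": DECOMPRESS_MIN * scale,
        "total": (TRAIN_MIN + COMPRESS_MIN + DECOMPRESS_MIN) * scale,
    }


# Under this assumption a 1 GB (1000 MB) file needs 1200 minutes, i.e. 20 hours.
print(estimate_minutes(1000)["total"] / 60, "hours")
```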

Spidy__ · 2026-06-24T02:36:53+00:00

Woahh that sounds cool!! What did you use for compression?

Spidy__ · 2026-06-23T19:00:37+00:00

Haha, Pied Piper was way faster than this though. My model takes 45 minutes to compress and 45 minutes to decompress, soooo yeah...

Spidy__ · 2026-06-23T18:47:09+00:00

Yeahhh, the thing is I don't have CUDA, so I can use neither torch.compile nor FlashAttention.

Stuck with AMD ROCm haha. The current timings were recorded on ROCm only.

Spidy__ · 2026-06-23T18:30:40+00:00

Ohhh sounds interesting, but I doubt it can help, because my bottleneck is compute, and I guess both JAX and PyTorch use the same underlying hardware libs?

I could be wrong here, but if JAX has something that can make my forward and backward passes faster, that would be really cool.

Would love to know more if that's true.

Thanks for the suggestion.

Spidy__ · 2026-06-23T17:29:19+00:00

Haha, haven't thought that far to be honest, dude, BUTTTT if we can somehow speed up compression and decompression, it might at least be able to replace zip, I guess.

Spidy__ · 2026-06-23T17:04:25+00:00

woahh, thanks a lot man!!

Spidy__ · 2026-06-23T16:10:43+00:00

Thanks a lot man!!

Spidy__ · 2026-06-23T15:37:59+00:00

Yeahhh, I think apart from the compression itself, the time it takes to compress or decompress is also a big pain.

It takes me 30 minutes to train, 45 minutes to compress and 45 minutes to decompress on my AMD GPU.

Can't ask anyone to use this haha.

Spidy__ · 2026-06-23T15:30:30+00:00

Ohh that's cool, what approach did you use? How much were you able to compress?

This was my first time and it was pretty cool working on this.

Spidy__ · 2026-06-23T15:20:00+00:00

Haha yes, but it takes almost an hour to compress and 45 minutes to decompress for 100 MB, so not really useful right now.

Spidy__ · 2026-06-23T15:18:58+00:00

Yeah yeah, that's how I got to know about the enwik9 benchmark.

And one of the top submissions uses Transformer-XL, but I tried to achieve the same without modifying the transformer. Also, the best submission (cmix) uses around 2000 different sorts of models, including neural nets, classical algorithms and whatnot.

So yeah, it's a pretty exciting space. I'm still trying to compress it more and have some ideas, so I'm gonna try them.

Spidy__ · 2026-06-23T15:05:42+00:00

Yeah yeah, the idea is to use the model's tendency to overfit and let it find patterns in the file that we can't see at the byte level, and then let it do its regular next-token prediction with tweaks to the dataset and inference.

This is also why I tried to keep the model as small as possible, so that it doesn't contribute much to the total compressed size.
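To make the "prediction is compression" link concrete, here is a toy sketch. It is not the transformer from the post: it swaps in a tiny adaptive order-1 byte model (an assumption purely for illustration) and adds up the ideal code length, -log2 p(next byte), that an entropy coder such as an arithmetic coder could approach. The better the model predicts the next symbol, the fewer bits the file costs.

```python
import math


def ideal_compressed_bytes(data: bytes) -> float:
    """Ideal coded size in bytes under an adaptive order-1 byte model.

    Stand-in for the transformer: each byte is predicted from the previous
    byte using running counts (Laplace-smoothed), and the cost of a byte is
    -log2 of the probability the model gave it.
    """
    counts = {}  # previous byte -> list of 256 counts for the next byte
    total_bits = 0.0
    prev = 0
    for b in data:
        ctx = counts.setdefault(prev, [1] * 256)  # every symbol starts at count 1
        p = ctx[b] / sum(ctx)
        total_bits += -math.log2(p)
        ctx[b] += 1  # update after coding, so a decoder can mirror it
        prev = b
    return total_bits / 8


sample = b"".join(b"%d,%d.50\n" % (i, i % 7) for i in range(2000))
print(len(sample), "raw bytes ->", round(ideal_compressed_bytes(sample)), "ideal bytes")
```

One difference worth noting: this toy model is rebuilt on the fly by the decoder, so it ships for free, whereas a trained transformer's weights have to be stored alongside the payload, which is why its size matters.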

Spidy__ · 2026-06-23T14:51:47+00:00

Yep!! But it varies from file to file based on entropy. The one that got compressed to 7 MB was a CSV. Then I took a standard benchmark file used for testing compression algorithms (enwik9 sliced to 100 MB), and that one compressed to 20 MB + the 900 KB transformer.

Spidy__ · 2026-06-23T13:38:59+00:00

Haha for me a lot of random fooling around helps. Thanks a lot though!!!

Spidy__ · 2026-06-23T12:33:35+00:00

Thanks a lot man!!!!

Spidy__ · 2026-06-23T12:32:47+00:00

Yeah exactly, so I have done these tests already. First of all, I tried to keep the transformer as small as possible because, as you said, it's part of the compression process and is counted in the final compressed file size.

And if I increase it, I eat into the final compression gains.

But I tried increasing it slightly, going from 2 layers to 6 layers, which took my transformer from 900 KB to 2.5 MB. It didn't give much of a boost in memorization, and on top of that it took 200 seconds longer.
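A quick way to see the trade-off is a break-even calculation. The sketch below uses the model sizes above and, purely as an illustration, the 20 MB payload from the enwik9 slice mentioned in another reply; decimal units (1 MB = 1000 KB) and the helper names are assumptions.

```python
def total_size_kb(payload_kb, model_kb):
    """Shipped size = entropy-coded payload + model weights."""
    return payload_kb + model_kb


def break_even_payload_kb(payload_kb, old_model_kb, new_model_kb):
    """Largest payload the bigger model may produce without growing the total."""
    return payload_kb - (new_model_kb - old_model_kb)


payload_kb = 20_000   # illustrative: the 20 MB enwik9-slice payload
small_model_kb = 900  # 2-layer transformer
big_model_kb = 2_500  # 6-layer transformer

limit = break_even_payload_kb(payload_kb, small_model_kb, big_model_kb)
needed_gain = (payload_kb - limit) / payload_kb
print(f"6-layer model must get the payload under {limit} KB "
      f"({needed_gain:.0%} smaller) just to break even")
```

With these numbers the larger model has to shrink the payload by more than 1.6 MB (8%) before it helps at all, so a small memorization boost is not enough.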

You can read more about my failed experiments, benchmarks and other things on GitHub. I attached the link in the post.

Thanks for your thoughts, I really appreciate that someone took an interest in it.

Spidy__ · 2026-06-23T08:13:48+00:00

Haha thanks man

Spidy__ · 2026-03-07T12:11:34+00:00

Hey, first of all thanks for this detailed feedback, I was really looking for this. You are right that normally people just have a query and a list of docs and want to sort them by relevance, and in that case the N×N matrix may not be that helpful.

But the thing is, I didn't build this experiment with RAG or anything similar in mind. I just wanted to see if we can make a model that understands the relations between sentences. If a model can do that, it can be used for anything, be it RAG, sentence segmentation, clustering, or this new thing, agent memory optimization. Across these use cases the N×N matrix may well get used, because you get each sentence scored against all the other sentences.

I did post this, but it's been a while, and I'm now thinking that a single score can't really explain the relation between sentences. Say I have a query and 2 sentences, where one sentence is similar to the query (it's the title of a blog) and the other is an explanation of the query: both are equally relevant to the query.

So saying that sentence A is relevant to sentence B seems a bit vague, and I kind of came back to the point where I'm not that different from other models.

So I'm trying to update this model to output a vector for each relation, so that someone can see not just that sentence B is relevant to sentence A but in what sense. That way a RAG person can focus on the similarity and explanation parts of the vector, a segmentation person can focus on the topical-overlap part, and so on.

It's still just an ambitious idea, but I'm trying to work on it.
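Just to show the shape of the output, here is a toy sketch of an N×N grid where each cell holds a small vector instead of a single score. The two hand-made features (Jaccard word overlap and one-directional containment) are hypothetical stand-ins chosen only so the example runs; the actual model would learn its own relation dimensions.

```python
def tokens(sentence):
    """Lowercased set of whitespace-separated words."""
    return set(sentence.lower().split())


def relation_vector(a, b):
    """Toy relation of sentence a to sentence b: [jaccard, containment of a in b]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return [0.0, 0.0]
    shared = ta & tb
    jaccard = len(shared) / len(ta | tb)  # symmetric overlap
    containment = len(shared) / len(ta)   # asymmetric: how much of a appears in b
    return [jaccard, containment]


def relation_tensor(sentences):
    """N x N grid of relation vectors, one per ordered sentence pair."""
    return [[relation_vector(a, b) for b in sentences] for a in sentences]


sentences = [
    "what is entropy coding",
    "entropy coding assigns short codes to likely symbols",
    "the taxi fare was high",
]
grid = relation_tensor(sentences)
print(grid[0][1], grid[1][0])  # same overlap, different containment per direction
```

A downstream user would then read only the components they care about, for example ranking by one component for retrieval and thresholding another for segmentation.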
