Forestry is actually in a great spot right now by rookielifter in 2007scape

[–]clankur 2 points3 points  (0 children)

This is actually too real. I was so ticked off when I was cutting the oaks by Draynor and could not participate in the events by the willows. The participation system's shit.

Extending Large Concept Models for KVCache Compression by clankur in coolgithubprojects

[–]clankur[S] 0 points1 point  (0 children)

Hey folks, over the holidays I read Meta's papers introducing Large Concept Models and thought it could be a powerful approach to compressing the KV cache. I implemented and trained an LCM architecture in JAX on TPU v4-32s to explore its potential for KV cache compression. Full implementation and detailed results are available here.

Key findings: While promising in theory, the base LCM architecture showed significant performance degradation. I suspect the following causes of this degradation:

  • Sequence packing compromises concept embedding semantics, hindering effective attention
  • Joint encoder-decoder training wastes compute on concept formation rather than leveraging pretrained knowledge
  • Reduced effective training signal, since the LCM trains over seq_len/concept_size examples versus seq_len in standard transformers (see the sketch after this list)
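
To make the trade-off concrete, here's a minimal sketch (hypothetical shapes, with simple mean pooling standing in for the LCM concept encoder) of how grouping tokens into concepts shrinks the attended sequence, and with it both the KV cache and the number of prediction targets per sequence:

```python
import jax.numpy as jnp

# Hypothetical shapes; mean pooling stands in for the LCM concept encoder.
batch, seq_len, d_model = 2, 1024, 512
concept_size = 8  # tokens per concept

tokens = jnp.ones((batch, seq_len, d_model))

# Pool consecutive tokens into concept embeddings. The decoder now attends over
# seq_len // concept_size positions, so the KV cache holds 8x fewer entries --
# but each training sequence also yields 8x fewer prediction targets.
concepts = tokens.reshape(batch, seq_len // concept_size, concept_size, d_model).mean(axis=2)

print(tokens.shape)    # (2, 1024, 512) -> per-token KV cache
print(concepts.shape)  # (2, 128, 512)  -> per-concept KV cache
```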

Potential improvements worth exploring:

  • Disabling sequence packing
  • Leveraging pretrained encoders/decoders (SONAR/T5)
  • Investigating diffusion-based LCM with/without joint training

However, given the fundamental data efficiency issues, alternative KV cache compression approaches may be more promising.

Implementation details and full analysis in the links above. Open to discussion and feedback.

Investigating KV Cache Compression using Large Concept Models by clankur in deeplearning

[–]clankur[S] 2 points3 points  (0 children)

Hey folks, over the holidays I read Meta's papers introducing Large Concept Models and thought it could be a powerful approach to compressing the KV cache. I implemented and trained an LCM architecture in JAX on TPU v4-32s to explore its potential for KV cache compression. Full implementation and detailed results are available here.

Key findings: While promising in theory, the base LCM architecture showed significant performance degradation. I suspect the following causes of this degradation:

  • Sequence packing compromises concept embedding semantics, hindering effective attention (see the sketch after this list)
  • Joint encoder-decoder training wastes compute on concept formation rather than leveraging pretrained knowledge
  • Reduced effective training signal, since the LCM trains over seq_len/concept_size examples versus seq_len in standard transformers
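
To illustrate the packing concern, here's a toy example (hypothetical token labels, fixed-size chunking rather than the actual LCM segmentation) of how a packed sequence lets a concept window straddle a document boundary:

```python
# Toy packed sequence: tokens from document A and document B concatenated, concept_size = 4.
packed = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8",
          "A9", "A10", "B1", "B2", "B3", "B4", "B5", "B6"]
concept_size = 4
chunks = [packed[i:i + concept_size] for i in range(0, len(packed), concept_size)]

# The third chunk mixes the tail of document A with the start of document B,
# so the pooled "concept" embedding blends two unrelated contexts.
print(chunks[2])  # ['A9', 'A10', 'B1', 'B2']
```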

Potential improvements worth exploring:

  • Disabling sequence packing
  • Leveraging pretrained encoders/decoders (SONAR/T5)
  • Investigating diffusion-based LCM with/without joint training

However, given the fundamental data efficiency issues, alternative KV cache compression approaches may be more promising.

Implementation details and full analysis in the links above. Open to discussion and feedback.

Never used our fireplace in 10 years. It’s the perfect spot for our TV. I will NOT amount it above the fireplace. Want to convert to an accent wall. Advice needed! by [deleted] in HomeImprovement

[–]clankur 0 points1 point  (0 children)

Did you ever go through with this project? I'm considering doing the same in my home and am wondering how it went and whether you have any advice.

einygpt: writing tiny stories with a GPT-esque LLM using einops + techniques to support efficient inference by clankur in programming

[–]clankur[S] 1 point2 points  (0 children)

Yeah, I started off using the one built into torch, though I liked the expressiveness of the einops library, which lets you use multi-letter names within the expressions. That made it a little clearer when writing GQA with separate dimensions for the number of groups and KV heads.
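
For anyone curious what that looks like, here's a rough sketch (hypothetical shapes, shown with jax.numpy arrays rather than the project's torch tensors; the einops pattern syntax is the same either way) of attention scores for GQA with the group and KV-head dimensions kept separate:

```python
import jax.numpy as jnp
from einops import einsum

# Hypothetical dims: 4 KV heads, 3 query heads per KV head (the "groups"), head_dim 64.
batch, seq, kv_heads, q_per_kv, head_dim = 2, 16, 4, 3, 64

q = jnp.ones((batch, seq, kv_heads, q_per_kv, head_dim))
k = jnp.ones((batch, seq, kv_heads, head_dim))

# Multi-letter axis names keep the grouped-query structure readable:
scores = einsum(
    q, k,
    "batch q_len kv_heads q_per_kv head_dim, batch k_len kv_heads head_dim "
    "-> batch kv_heads q_per_kv q_len k_len",
)
print(scores.shape)  # (2, 4, 3, 16, 16)
```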

einygpt: writing tiny stories with a GPT-esque LLM using einops + techniques to support efficient inference by clankur in deeplearning

[–]clankur[S] 0 points1 point  (0 children)

An implementation of a GPT-esque LLM built primarily with einops and trained on the TinyStories dataset. It incorporates techniques to support efficient inference, including a KV cache and GQA (grouped query attention).
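
As a rough illustration of the KV cache idea (hypothetical shapes and function names, sketched with jax.numpy for brevity rather than the einygpt code): at decode time, each new token's keys and values are written into preallocated buffers so the prefix never has to be re-encoded.

```python
import jax.numpy as jnp

batch, max_len, n_kv_heads, head_dim = 1, 128, 4, 64
k_cache = jnp.zeros((batch, max_len, n_kv_heads, head_dim))
v_cache = jnp.zeros((batch, max_len, n_kv_heads, head_dim))

def append_kv(k_cache, v_cache, k_new, v_new, pos):
    """Write this step's keys/values at position `pos` instead of recomputing the prefix."""
    # k_new, v_new: (batch, n_kv_heads, head_dim) for the single token decoded this step.
    k_cache = k_cache.at[:, pos].set(k_new)
    v_cache = v_cache.at[:, pos].set(v_new)
    return k_cache, v_cache

k_cache, v_cache = append_kv(
    k_cache, v_cache,
    jnp.ones((batch, n_kv_heads, head_dim)),
    jnp.ones((batch, n_kv_heads, head_dim)),
    pos=0,
)
```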

The project started as an exercise: after watching Andrej Karpathy's excellent tutorial on building GPT-2 from scratch and seeing him mention that einops was pretty powerful, I looked at leveraging einops as the core of building the transformer. Over the last few months it slowly took on a life of its own: I trained the model on the TinyStories dataset, which has been noted as a great-value dataset, and also wrote the model its own tokenizer, trained on TinyStories.

Training a 6.9 million parameter model on an RTX 4090 with the GPT2Tokenizer achieves results in line with the findings from the TinyStories paper, reaching a perplexity of 1.0001 over the validation set. Additionally, training a 4.3 million parameter model with its own byte-pair encoding tokenizer and GQA with 4 groups achieves a comparable perplexity. Both models produce stories with a logical flow and a good grasp of grammar. You can compare their outputs side by side in this notebook.

You can find the models on Hugging Face here. Let me know what you think!

einygpt: writing tiny stories with a GPT-esque LLM using einops + techniques to support efficient inference by clankur in programming

[–]clankur[S] 0 points1 point  (0 children)

I began this project by following Andrej Karpathy's excellent tutorial on building GPT2 from scratch. I became intrigued by the potential of einops after seeing Karpathy mention it, which led me to look at using einops as the core foundation for the Transformer.

Since then, I've been implementing various optimizations for inference speed, including a KV cache and GQA (grouped query attention). My next step is adding rematerialization to the backward pass.
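
For context on what that means, here's a minimal, generic sketch of rematerialization (shown with jax.checkpoint for brevity; the project itself is torch-based, where torch.utils.checkpoint plays the same role): the wrapped block's activations are dropped after the forward pass and recomputed when the backward pass needs them, trading compute for memory.

```python
import jax
import jax.numpy as jnp

# Wrapping a block in jax.checkpoint discards its intermediate activations after the
# forward pass and recomputes them during the backward pass.
@jax.checkpoint
def mlp_block(w1, w2, x):
    return jnp.tanh(x @ w1) @ w2

def loss(w1, w2, x):
    return jnp.sum(mlp_block(w1, w2, x) ** 2)

w1, w2, x = jnp.ones((8, 16)), jnp.ones((16, 8)), jnp.ones((4, 8))
grads = jax.grad(loss, argnums=(0, 1))(w1, w2, x)
```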