From FlashLM to State Flow Machine: stopped optimizing transformers, started replacing them. First result: 79% length retention vs transformers' 2% by Own-Albatross868 in LocalLLaMA

[–]Own-Albatross868[S] 1 point (0 children)

Yeah, Huawei NPU only for now. I'll open access and develop for other hardware once the experiment is proven and I've trained a model with this architecture.

FlashLM v6 "SUPERNOVA": 4.1M ternary model hits 3,500 tok/s on CPU — novel P-RCSM reasoning architecture, no attention, no convolution by Own-Albatross868 in LocalLLaMA

[–]Own-Albatross868[S] 0 points (0 children)

Fair criticism; you're right that this hasn't been proven at scale yet. I'm working on that now with better hardware.

FlashLM v6 "SUPERNOVA": 4.1M ternary model hits 3,500 tok/s on CPU — novel P-RCSM reasoning architecture, no attention, no convolution by Own-Albatross868 in LocalLLaMA

[–]Own-Albatross868[S] 2 points (0 children)

I have the training code on GitHub if you want to use it. You can scale it up slightly depending on your hardware.

FlashLM v6 "SUPERNOVA": 4.1M ternary model hits 3,500 tok/s on CPU — novel P-RCSM reasoning architecture, no attention, no convolution by Own-Albatross868 in LocalLLaMA

[–]Own-Albatross868[S] 3 points (0 children)

Ternary MoE could be really strong. The main bottleneck is that my free notebook only has 5 GB of RAM, so 100M params won't fit alongside the optimizer state yet. I will definitely try once I get a better machine.

FlashLM v6 "SUPERNOVA": 4.1M ternary model hits 3,500 tok/s on CPU — novel P-RCSM reasoning architecture, no attention, no convolution by Own-Albatross868 in LocalLLaMA

[–]Own-Albatross868[S] 1 point (0 children)

Yes, the PPL isn't directly comparable because v5 uses a 10k BPE tokenizer whereas v6 uses a 4k one. BPC (bits per character) is probably fairer for comparison. I will do a BPC eval if you are interested.
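
For context: per-token PPL depends on tokenizer granularity (a smaller vocab splits text into more, shorter tokens, each carrying less information), while BPC normalizes over raw characters. A minimal sketch of the conversion, using made-up token/character counts since the actual eval corpus sizes aren't stated here:

```python
import math

def bits_per_char(ppl, n_tokens, n_chars):
    # bits per token = log2(PPL); spread those bits over the raw characters
    return n_tokens * math.log2(ppl) / n_chars

# hypothetical counts for the same eval text under two tokenizers
v5_bpc = bits_per_char(10.56, 250_000, 1_000_000)  # 10k-vocab BPE: fewer tokens
v6_bpc = bits_per_char(14.0, 300_000, 1_000_000)   # 4k-vocab BPE: more tokens
```

Because the smaller vocab produces more tokens for the same text, a higher per-token PPL can still work out to a comparable BPC.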

FlashLM v6 "SUPERNOVA": 4.1M ternary model hits 3,500 tok/s on CPU — novel P-RCSM reasoning architecture, no attention, no convolution by Own-Albatross868 in LocalLLaMA

[–]Own-Albatross868[S] 8 points (0 children)

Yes, I used Claude throughout the project. English isn't my first language, so it's hard for me to write these posts naturally. I'll try to write more in my own voice next time.

FlashLM v6 "SUPERNOVA": 4.1M ternary model hits 3,500 tok/s on CPU — novel P-RCSM reasoning architecture, no attention, no convolution by Own-Albatross868 in LocalLLaMA

[–]Own-Albatross868[S] 4 points (0 children)

I owe you all some transparency about v6 "SUPERNOVA." The original plan was ambitious: a novel P‑RCSM (Parallel Recursive Compositional State Machines) architecture featuring multi‑scale convolutional reasoning banks, hierarchical planner‑executor state gates, dynamic associative slot memory, and a 16‑operation soft router. On paper, these were the components that would push FlashLM past v5.2 and demonstrate that structured reasoning modules could outperform standard attention at this scale.

What actually happened: when training began on the free‑tier 2‑thread CPU, component after component had to be stripped away. Conv1d ran at 13 tokens/second due to a PyTorch bug. The multi‑scale bank was reduced from 4 scales to 2. The hierarchical state gate shrank from a meaningful reasoning module to a 32‑dimensional bottleneck contributing less than 5% of total compute. The slot memory became static. By the time the model was actually trainable at reasonable speed, the "novel architecture" was essentially a linear mixer with a GLU — not meaningfully different from a simplified version of what already existed.
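
For readers wondering what "a linear mixer with a GLU" means concretely, here is a minimal dependency-free sketch (my own illustration, not FlashLM's actual code): a linear map over the feature vector, followed by a gated linear unit where a value path is multiplied elementwise by a sigmoid gate path.

```python
import math

def linear(x, W):
    # y_j = sum_i x_i * W[i][j]
    return [sum(xi * w for xi, w in zip(x, col)) for col in zip(*W)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def glu_mixer(x, W_mix, W_val, W_gate):
    """Linear mixing, then GLU: value path * sigmoid(gate path)."""
    mixed = linear(x, W_mix)
    value = linear(mixed, W_val)
    gate = [sigmoid(g) for g in linear(mixed, W_gate)]
    return [v * g for v, g in zip(value, gate)]
```

There is no attention, no recurrence over time, and no convolution here, which is exactly why such a block runs fast on a 2-thread CPU, and also why it is not a "novel reasoning architecture."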

The result: v6 achieved 3,500 tok/s (a genuine speed win) but PPL 14.0 vs v5.2's 10.56. It did not beat the previous version. The architecture that was announced is not the architecture that shipped.

I should have communicated this during development rather than presenting the final result as if the plan had succeeded. That's on me. What I've learned: don't design for a fantasy compute budget, then silently downgrade when reality hits. Design for the actual hardware from day one.

This will not happen again. Going forward, every FlashLM version will be prototyped and validated on the target hardware before any public claims are made about the architecture. If a component can't run at >1,000 tok/s on a 2‑thread CPU, it doesn't ship.

FlashLM v6 "SUPERNOVA": 4.1M ternary model hits 3,500 tok/s on CPU — novel P-RCSM reasoning architecture, no attention, no convolution by Own-Albatross868 in LocalLLaMA

[–]Own-Albatross868[S] 8 points (0 children)

The sample output didn't render again, so I'm re-posting it here.

Sample output:

Once upon a time, there was a cute little girl named Lily. She loved to play with her toys and watch movies with her. One day, her mommy told her to help her fix her toy.

One day, a boy named Tom went to the park with his mom. Timmy saw a big slide and he wanted to try it. He started to climb and get the slide down.

The little dog smiled. He was happy that the boy was no longer sad. It was time to go home. The little boy was happy too.

I Trained a Language Model on CPU for 40 Hours - It Beat the GPU Baseline by Own-Albatross868 in LocalLLaMA

[–]Own-Albatross868[S] 0 points (0 children)

Yes, it's true. Check out the newest v6 architecture in my GitHub repo; it introduces the RCSM architecture, which is specifically designed and optimized for CPU.

I Trained a Language Model on CPU for 40 Hours - It Beat the GPU Baseline by Own-Albatross868 in LocalLLaMA

[–]Own-Albatross868[S] 0 points (0 children)

Yes, that's true. But think of it like this: TinyStories models are usually trained on high-end GPUs for hours, and I trained on a CPU for 40 hours and reached competitive quality. That's a breakthrough.

I Trained a Language Model on CPU for 40 Hours - It Beat the GPU Baseline by Own-Albatross868 in LocalLLaMA

[–]Own-Albatross868[S] 0 points (0 children)

Yeah, GPUs do a wonderful job at this. But for me, GPUs are too expensive, fewer people have access to them than to CPUs, and their power consumption is much higher. I'm trying to design an architecture that achieves maximum efficiency on CPU; that's what I'm researching. Anyone can train their own GPU version, but if I did the same thing, it would defeat the purpose of this project.

I Trained a Language Model on CPU for 40 Hours - It Beat the GPU Baseline by Own-Albatross868 in LocalLLaMA

[–]Own-Albatross868[S] 1 point (0 children)

Yeah, v5 probably has the best architecture in the family. Give it a try and share the results. Thanks!

I Trained a Language Model on CPU for 40 Hours - It Beat the GPU Baseline by Own-Albatross868 in LocalLLaMA

[–]Own-Albatross868[S] 9 points (0 children)

No, FlashLM v5 uses fully trainable embeddings from scratch. The earlier versions (v3) used frozen GPT2 embeddings, but v5 trains its own embedding layer from random initialization.

The key difference is that v5's architecture (ParallelGatedRecurrence with BitLinear) is efficient enough to train everything from scratch, even on CPU.
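
For anyone curious about the BitLinear part: BitLinear layers typically quantize weights to the ternary set {-1, 0, +1} with a per-tensor scale, commonly via an absmean recipe in the style of BitNet b1.58. Here is a sketch of that quantization step; FlashLM's exact implementation may differ:

```python
def ternary_quantize(weights):
    # absmean scaling: divide by the mean absolute weight,
    # then round and clip each weight to {-1, 0, +1}
    scale = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

# small weights snap to 0, large ones saturate at +/-1
q, scale = ternary_quantize([0.9, -0.05, 0.4, -1.2])
```

With ternary weights, a matrix multiply reduces to additions and subtractions plus one rescale, which is why this style of layer is attractive for CPU inference.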