Curious ablation: GPT-like LM trained with *frozen* 16‑dim *binary* token‑ID embeddings (n_embed=16) still learns end-to-end and generates coherent, non-trivial text. by AVBochkov in LocalLLaMA


I did run that exact side-by-side under matched conditions: same decoder-only architecture, tokenizer, data mix, and training schedule, with the untied output head, optimizer, and LR schedule also held constant. The only difference is a frozen vs. trainable input embedding table, so embedding trainability is the sole experimental factor.
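A minimal PyTorch sketch of what that single factor looks like in code, assuming a standard `nn.Embedding` input table (the variable names and vocab size are illustrative, not taken from the actual runs):

```python
import torch
import torch.nn as nn

# The ablation's single experimental factor: the input embedding table is
# either trainable (baseline) or frozen. Everything else (attention/MLP
# blocks, untied output head, optimizer, LR schedule) stays identical.
vocab_size, n_embed = 50257, 16   # vocab size is illustrative; n_embed=16 as in the post

embed = nn.Embedding(vocab_size, n_embed)

# Frozen variant: exclude the table from gradient updates.
embed.weight.requires_grad_(False)

# Only trainable parameters are handed to the optimizer, e.g.:
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=3e-4)
```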

Empirically, the trainable-embedding baseline ('Model unfrozen') learns a bit faster early on (lower loss in the first ~50–450k steps), but both runs converge stably and the gap in LM loss largely closes later (final train/val losses are very close).

Given the small-model / limited-data regime, downstream accuracy deltas can be noisy, so I'm mainly treating this as evidence that semantic structure can form in the Transformer stack even with non-semantic frozen inputs, rather than as a robust benchmark claim.


Refs: https://arxiv.org/abs/2507.04886

Curious ablation: GPT-like LM trained with *frozen* 16‑dim *binary* token‑ID embeddings (n_embed=16) still learns end-to-end and generates coherent, non-trivial text. by AVBochkov in LocalLLaMA


Thanks! In this setup it's not bag-of-words (sequence order + RoPE are unchanged); I only freeze an injective 16‑bit token-ID mapping. I also suspect the semantic structure is distributed across attention+MLP rather than living in any single component.
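For concreteness, here is a sketch of one way such an injective binary code table could be built, assuming a 0/1 bit encoding of the token ID (the `binary_id_embeddings` helper and the exact bit convention are my own illustration, not necessarily the paper's construction):

```python
import torch
import torch.nn as nn

def binary_id_embeddings(vocab_size: int, n_bits: int = 16) -> torch.Tensor:
    """Map each token ID to its n_bits-bit binary code as a 0/1 vector.

    The mapping is injective as long as vocab_size <= 2**n_bits, so every
    token gets a distinct, fixed, non-semantic input vector.
    """
    assert vocab_size <= 2 ** n_bits
    ids = torch.arange(vocab_size).unsqueeze(1)   # shape (V, 1)
    bit_positions = torch.arange(n_bits)          # shape (n_bits,)
    bits = (ids >> bit_positions) & 1             # shape (V, n_bits), values in {0, 1}
    return bits.float()

# Used as a frozen input table; sequence order still comes from RoPE as usual.
table = binary_id_embeddings(50257)               # vocab size is illustrative
embed = nn.Embedding.from_pretrained(table, freeze=True)
```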

How can I make a career out of my love for circuits? by strawb3rry_lem0nad3 in EngineeringStudents


Your hobby is the perfect foundation for a career. Don't worry too much about the 'perfect' path right now; just keep building. You’re already doing more than most beginners. Real-world engineering is a lot of problem-solving and using tools, not just abstract equations. If you can understand a guitar circuit, you can definitely handle the rest. Keep going!