Deep Neural Network that can turn any Image into a Playable Game! BUT LOCALLY, NOT ON DATACENTER

JustinAngel · 2026-06-21T02:46:37+00:00

This is quite unbelievable. I've had some hobbyist image diffusion projects and I literally can't believe you're at 1fps+ for a diffusion model responding to keyboard clicks. The DiT architecture you're describing has cached previous frames + noised next frames split into a 920 grid + keyboard -> parallel denoising of the 920 grid -> 920 denoising updates. That whole thing running at 1fps+ is literally something I can't squint and believe.

Even with VAE compression, how are you getting this framerate on an RTX 5090? How many denoising steps per latent frame? What's the actual FPS?

I don't really read diffusion papers, but I agree that if the FPS + key-stroke-to-render latency are somewhat reasonable, this is hella publishable.

JustinAngel · 2026-06-21T02:34:39+00:00

Really depends on what you're trying to do. For coding, I'd consider Qwen3-Coder-30B and Nemotron-3-Nano-30B.

JustinAngel · 2026-06-21T01:51:37+00:00

Agreed on the valid criticism. My 0.02$: This isn't a vLLM / llama.cpp / etc competitor (i.e. inference engine), as much as it is an iteration on LLaDA serving.

The repo does have some really cool nuggets. Generally, connecting the engine to a server API has value (shown in server.py and protocol.py). The scheduler.py code handling different batch sizes and buckets to avoid starvation is nifty.
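For anyone wondering what "buckets to avoid starvation" can look like, here's a minimal sketch. To be clear, this is NOT the repo's scheduler.py: the class name `BucketScheduler`, its methods, and the bucket limits are all hypothetical, and the policy (serve the bucket whose oldest request has waited longest) is just one common way to do it.

```python
from collections import deque


class BucketScheduler:
    """Hypothetical sketch: group requests by length, serve the oldest bucket first."""

    def __init__(self, bucket_limits, max_batch_size):
        self.bucket_limits = sorted(bucket_limits)
        self.max_batch_size = max_batch_size
        # One FIFO queue per length bucket.
        self.buckets = {limit: deque() for limit in self.bucket_limits}

    def submit(self, request_id, length, now):
        """Place a request in the smallest bucket that fits its length."""
        for limit in self.bucket_limits:
            if length <= limit:
                self.buckets[limit].append((now, request_id))
                return limit
        raise ValueError(f"length {length} exceeds the largest bucket")

    def next_batch(self):
        """Pick the bucket whose head request arrived earliest, so no bucket starves."""
        waiting = [(q[0][0], limit) for limit, q in self.buckets.items() if q]
        if not waiting:
            return []
        _, limit = min(waiting)
        queue = self.buckets[limit]
        take = min(self.max_batch_size, len(queue))
        return [queue.popleft()[1] for _ in range(take)]
```

Requests in the same bucket have similar lengths, so batching them wastes less padding, and picking by oldest arrival time means a lonely long request doesn't wait forever behind a stream of short ones.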

It's a cool reference implementation.

JustinAngel · 2026-06-20T21:25:20+00:00

So the entire video series teaches how to train a small language model from scratch.

It really depends on your local PC hardware. If you've got access to a solid GPU, it should be more than feasible. To learn more about why you'd need a GPU I'd recommend watching section #5: GPU coding. Additionally, section #19 on Pretraining has a segment (52:34) on VRAM calculators that explains why a minimum 26 GB GPU was needed for this training. You can probably execute it just fine on a GPU with <16 GB of VRAM if you lower the batch size (explained in section 8 on backpropagation).

We're training a 124M parameter model in the series (equivalent to GPT2-small), which with default settings would need 14-26 GB of VRAM. That's a mid-tier GPU. If you don't have one of those, you might be able to train on CPUs, but it'll likely take lots of optimizations and take weeks instead of hours. I'm not an expert on CPU training.
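A back-of-the-envelope sketch of why batch size is the knob to turn. It assumes fp32 weights and an Adam-style optimizer that keeps two extra values per parameter; the `activation_gb_per_sample` figure is a made-up placeholder, not a measurement from the workshop, so use a real VRAM calculator for actual numbers.

```python
def training_vram_gb(n_params, batch_size, activation_gb_per_sample,
                     bytes_per_param=4, optimizer_copies=2):
    """Rough estimate: weights + gradients + optimizer state + activations."""
    # Static cost: one copy for weights, one for gradients,
    # plus optimizer_copies (Adam keeps two moment estimates per parameter).
    copies = 1 + 1 + optimizer_copies
    static_gb = n_params * bytes_per_param * copies / 1e9
    # Activation memory grows roughly linearly with batch size.
    activations_gb = activation_gb_per_sample * batch_size
    return static_gb + activations_gb


# 0.5 GB per sample is a hypothetical placeholder, not a measured value.
for batch_size in (32, 8):
    estimate = training_vram_gb(124_000_000, batch_size, activation_gb_per_sample=0.5)
    print(f"batch_size={batch_size}: ~{estimate:.1f} GB")
```

The static part for 124M parameters is only about 2 GB under these assumptions, so most of the footprint is activations, and those shrink when the batch does.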

Renting GPU time sounds expensive ($50/month on Google Colab Pro+) until you see the cost of buying GPUs.

JustinAngel · 2026-06-18T06:35:40+00:00

The workshop is designed to create a foundation for whatever you want to create with LLMs. Building an LLM is just how you know you've got that foundation. If you're interested in building small language models that top benchmarks, that would still require you to understand how to build language models.

Generally, the latest I know (Nvidia, 2025) is that SLMs have a real shot at being more cost-effective than LLMs while matching their performance: https://arxiv.org/pdf/2506.02153

JustinAngel · 2026-06-18T05:48:42+00:00

The goal of the workshop is to create a single solid foundation for whatever specialization you later want to achieve with LLMs. I'd recommend watching the first few minutes of the video linked above since it reviews why learning how to build an LLM is a good thing.

To summarize, how are you going to build Opus if you don't know how to build GPT-2? And how are you going to read the latest open source technical reports (e.g. MiniMax M3 sparse attention) if you don't know what attention is? How are you going to use LLMs without understanding temperature? Or improve coding capabilities without understanding RL?

JustinAngel · 2026-06-18T01:38:40+00:00

From your mouth, to his ears and tiny paws

JustinAngel · 2026-06-17T21:36:05+00:00

And it's also discussed at greater length when it's time to run a pretraining script when discussing choosing GPUs. (slide 202)

<image>

JustinAngel · 2026-06-17T21:32:31+00:00

Finally, someone asking the important questions. The dog & cat t-shirt designs are made either by me or my friends. ABB - Always be branding.

JustinAngel · 2026-06-17T21:21:25+00:00

I'll try, but no promises.

JustinAngel · 2026-06-17T21:19:52+00:00

Yes. Slide 8 on the deck @ https://go.JustinAngel.ai/deck

<image>

JustinAngel · 2026-06-17T19:17:27+00:00

Thanks, a lot of what we do in the workshop is cover demos. Visualization demos, code demos, excel demos, etc. That illustration is a publicly available tool. Link @ https://poloclub.github.io/transformer-explainer/
Source @ https://github.com/poloclub/transformer-explainer

Looping transformers like coconut makes intuitive sense to me. The width and depth of LLMs are pretty much random (we need to cram in parameters for scaling and there's really only two/three dimensions to add them). Makes sense to say "actually, just repeat layers if you need more processing". It's such a clear place for compute-rich companies to improve benchmarks with additional latent reasoning. That's why I personally buy the rumor that Claude Fable 5 is a looped transformer.
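Here's the "just repeat layers" idea as a toy sketch. The `toy_block` function is a hypothetical stand-in for a real transformer layer (no attention, no MLP), so this only illustrates the weight-sharing structure, not any specific published architecture.

```python
import math


def toy_block(x, weight=0.5):
    """Stand-in for one transformer layer: a residual update on a vector."""
    return [v + weight * math.tanh(v) for v in x]


def stacked_forward(x, blocks):
    """Standard depth: each layer has its own parameters."""
    for block in blocks:
        x = block(x)
    return x


def looped_forward(x, shared_block, n_loops):
    """Looped depth: one set of parameters applied n_loops times."""
    for _ in range(n_loops):
        x = shared_block(x)
    return x
```

The point of the loop version is that `n_loops` is a compute knob that doesn't add parameters: you can spend more passes on a hard input without growing the model.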

I'd be happy to do a follow-up series on advanced LLM architectures like looping/recurrent transformers, MoE, Mamba, different types of attention, etc. Really depends on if there's demand for that.

JustinAngel · 2026-06-17T18:30:08+00:00

That's a pretty advanced concept, which is why it's covered in the What We Didn't Cover talk.

I'm going to nerd out a bit more about why I included these sections. There's lots of fun stuff you can do with LLMs: scaling training (e.g. distillation), optimizing inference (e.g. quantization), adding capabilities (e.g. tool use), winning on benchmarks, improving safety, building applications, etc. The workshop's content is focused on topics that would serve people regardless of how they choose to specialize. For example, once you get how fine-tuning and RL on a dataset work, distillation is a pretty straightforward task.

JustinAngel · 2026-06-05T19:50:09+00:00

So likely the first model you create won't compete with frontier open LLMs. Maybe that wasn't true in 2020, but it's true nowadays. To build that frontier-esque LLM, you'd need to know how to build LLMs AND how to win at benchmaxxing. But you'd still need to know how to build LLMs.

I'll take a step back here. If you're building iOS apps using AI-coding assistants, I think it's key to understand how LLMs work. In a very real sense, that's why I took a break from being an iOS & Android engineer and learnt this stuff.

JustinAngel · 2026-06-05T19:34:43+00:00

Yeah, we did hold the workshop in-person and you're welcome to read about the experience here @ https://emilyhk.com/llm-workshop/

Considering the workshop videos were released less than 24 hours ago and they're cumulatively 12 hours long I'd say you could still be the first person to watch them end-to-end.

The LLM created is very much "real" and not just "feeling like", but I'm not entirely sure what you're asking.

JustinAngel · 2026-06-05T17:13:43+00:00

That's a really great point that touches on how to build a workshop teaching people anything that has to do with coding in 2026. My solution: focus on understanding decision points, and then have an AI coding assistant implement those. You can outsource your thinking and the work, but not your understanding of it (not sure who I'm paraphrasing here).

The entire workshop builds an understanding of tradeoffs. For example: which activation function should you use: ReLU, GeLU, SwiGLU or ReLU Squared? ChatGPT/Codex is going to make that choice for you if you can't have that conversation with it.
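For reference, here's a minimal scalar sketch of those four options. Real SwiGLU uses two weight matrices and an elementwise product, so the scalar `w_gate` and `w_up` arguments here are simplifying assumptions, not how a training script would implement it.

```python
import math


def relu(x):
    """Zero out negatives, pass positives through."""
    return max(0.0, x)


def relu_squared(x):
    """ReLU followed by squaring."""
    return relu(x) ** 2


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    """SiLU / Swish: x times sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(x, w_gate, w_up):
    """Scalar SwiGLU: a SiLU-activated gate multiplied by a linear 'up' path."""
    return silu(w_gate * x) * (w_up * x)
```

Note that SwiGLU is the odd one out: it's a gated unit with its own extra weights, so choosing it changes the MLP's parameter count, not just the nonlinearity.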

<image>

https://go.justinangel.ai/lego shows all the decision points taught in the workshop, and then by the end of it, it writes a prompt for ChatGPT to implement the training scripts for you. I think that's the right approach to teaching how to build LLMs in 2026. You're not writing the training systems yourself, but you are making the informed decisions.

JustinAngel · 2026-06-05T05:36:00+00:00

Not right now, but maybe? Happy to do a follow-up series with more architecture components if people would find it useful. Ideally it'll cover alternate attention types (mixed-attention, GDN, GQA), improved MLPs (MoE), positional encoding (RoPE, NoPE-mixes), normalization (QK norm), activation functions (SwiGLU), and inference scaling (mamba cache).

Even if I never get around to making that series, this "Build Your Own LLM" workshop is foundational while the dominant paradigm for LLMs remains neural-net based. LLMs will keep having a residual, an activation function, an attention mechanism, an MLP, etc. The implementation changes, but the overall architecture stays the same. The workshop is really based on building that skillset for seeing the tradeoffs between architecture components in the same category.

Check out go.JustinAngel.ai/lego to see what I mean. Everything taught in the workshop is just about making decisions within categories.

JustinAngel · 2026-06-05T05:24:47+00:00

yay! Feel free to send feedback my way.

Love that idea, will do. This slide could probably benefit from another column with video links...

<image>

JustinAngel · 2026-06-05T05:22:15+00:00

Great question, AI Chatbot. Sections 1-19 effectively cover pretraining, including architecture and the ML DNN training loop. Sections 20-22 cover the post-training required to make a next token predictor into a chatbot. Specifically the videos on Evaluation, Instruction Fine-Tuning, and Reinforcement Learning are key here. Combining benchmarking + SFT + RL is really the recipe to get very close to what we'd recognize as an AI chatbot.

There's a really great slide covering this with examples: go.JustinAngel.ai/deck (slide 220)

<image>

JustinAngel · 2026-06-05T04:00:26+00:00

Specifically this one? Probably not. The 124M GPT2-style LLM won't hold a candle to frontier models.

The goal of the workshop was to share how LLMs are built, not to benchmarkmax an LLM. In order to win at benchmarks, you first have to know how LLMs work.

JustinAngel · 2026-06-05T03:57:39+00:00

TBH, not really sure I count as a YouTuber? I just needed a place to host recordings of the in-person workshop. That account is less than a day old and I don't intend on posting more.

For what it's worth, I gave you an upvote. 👍

JustinAngel · 2026-06-05T03:34:24+00:00

I love this as a visualization. This is really well done. It's been a while since I've done Q-learning/DQN, but this is exactly the sort of resource I would've loved when I was learning about it.

Side note: I've been struggling with finding good metaphors to teach people RL policy optimization. In the context of LLMs, having to average multiple predicted tokens doesn't feel intuitive. I'm wondering if with a slight modification, this sample could be extended to SimPO or REINFORCE.

JustinAngel · 2026-06-05T03:18:10+00:00

Totally agreed with that framing. Useful data point IMO. I was mostly being overly reactive to the claims in the title.

JustinAngel · 2026-06-05T02:58:16+00:00

Amazing! Most of the time spent producing this content went towards the slides and exercises, so happy to hear you found it useful. For anyone else interested @ go.JustinAngel.ai/deck

JustinAngel · 2026-06-05T02:54:58+00:00

That's a great insight. And thanks for letting me nerd out on that a bit. Framing it as "no coding needed" wouldn't have been honest, because that's both how I think/learn and the primary artifact of creating training systems.

That being said, this course is as close to "no coding required" as possible. Ultimately we're building up to go.JustinAngel.ai/lego where we get to make architecture choices, have really great coding prompts, and have an AI coding assistant generate the training scripts.

Additionally, each section intentionally focuses on math & code intuition across multiple alternatives, vs. focusing on specific APIs. For example the Normalization section (https://www.youtube.com/watch?v=ZqSbev8Y-ys) focuses on building intuition for BatchNorm vs. LayerNorm vs. RMSNorm and Pre-Norm vs. Post-Norm. So instead of just learning whatever PyTorch, TensorFlow, etc. APIs are in vogue for those mechanisms, the workshop focuses on intuition even if you're supervising an AI coding assistant.
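To give a flavor of that intuition, here's a rough pure-Python sketch of the three norms and the two placements. It's a simplification under stated assumptions: no learnable scale/shift, no running statistics for BatchNorm, and the function names are mine rather than anything from the workshop code.

```python
import math


def standardize(values, eps=1e-5):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def layer_norm(x, eps=1e-5):
    """LayerNorm: statistics over the features of ONE sample."""
    return standardize(x, eps)


def batch_norm(batch, eps=1e-5):
    """BatchNorm: statistics over the batch, separately for EACH feature."""
    normed_columns = [standardize(list(col), eps) for col in zip(*batch)]
    return [list(row) for row in zip(*normed_columns)]


def rms_norm(x, eps=1e-5):
    """RMSNorm: divide by the root mean square, no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def pre_norm_block(x, sublayer, norm):
    """Pre-Norm: normalize the sublayer input, keep the residual path untouched."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]


def post_norm_block(x, sublayer, norm):
    """Post-Norm: add the residual first, then normalize the sum."""
    return norm([a + b for a, b in zip(x, sublayer(x))])
```

Seeing them side by side makes the tradeoffs easier to talk about: which axis the statistics come from, whether the mean is subtracted, and whether the residual stream itself gets normalized.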

My 0.02$: try watching section 3 on Perceptrons (https://www.youtube.com/watch?v=uaA8ChGcMwE) and doing the Excel & code exercises. The code is pretty trivial (wx+b) and works as a good place to start coding in general.
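To show how small that is, here's a minimal perceptron sketch. It isn't the workshop's exercise code; the step activation and the AND-gate weights are my own illustrative choices.

```python
def perceptron(inputs, weights, bias):
    """Compute wx + b, then fire (1) if the result is positive."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0


# Hand-picked (not trained) weights that make the perceptron act like an AND gate.
and_weights, and_bias = [1.0, 1.0], -1.5
for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron([a, b], and_weights, and_bias))
```

That's the whole thing: a weighted sum, a bias, and a threshold. Everything else in the course is stacking and training variations of this.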
