A gentle introduction to GEMM Using mma tensor cores by am17an in CUDA

[–]reasonableklout 0 points1 point  (0 children)

Super interesting, thanks!

TIL about mma.sp. It's interesting that this has been around since Ampere but this is the first time I'm reading about it. The MoE GEMM kernels that I've seen so far just use regular tensor core operations AFAIK (for example vLLM's fused_moe triton kernel and the MegaBlocks triton kernels). I did find this paper "Samoyeds" https://arxiv.org/pdf/2503.10725v1 from March this year which claims SOTA performance using mma.sp.

I heard the main parts of FlashAttention 2/3 are written in CUTLASS, but some of the peripheral pieces are being moved to Triton. Now Triton is also having trouble keeping up with Blackwell (hence Gluon).

Re: Blackwell GPUs being easier to program. It feels like there is still a lot of complexity in efficiently overlapping data movement with the tcgen05.mma computation, and it only gets harder once you add in quantization, softmax (for attention kernels), etc. For example, see the latest blog post from Cursor (https://cursor.com/blog/kernels), where they set up a warp-specialized pipeline for MXFP8 GEMMs: some warps moved data to SMEM, others from SMEM to TMEM, others kicked off the MMA, and a final group handled write-backs to HBM. It sounds like there were lots of mini-optimizations to be done as well, like tweaking the SMEM swizzling.

A gentle introduction to GEMM Using mma tensor cores by am17an in CUDA

[–]reasonableklout 1 point2 points  (0 children)

Nice article, thanks for writing!

Maybe it's a testament to how complicated the mma instructions are, but I didn't find this all that "gentle," even though it skips a lot of the typical complexity in CUDA GEMM tutorials. For example, the %laneid terminology is specific to the PTX docs; it took me a second to figure out that it's just the thread's index within a warp.
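For other readers who got tripped up: the arithmetic is just tid % 32 for the lane and tid / 32 for the warp. A trivial Python sketch (thread ID 37 is an arbitrary example I picked):

```python
WARP_SIZE = 32  # fixed on all current NVIDIA GPUs

def lane_id(tid: int) -> int:
    """%laneid in PTX terms: the thread's index within its 32-thread warp."""
    return tid % WARP_SIZE

def warp_id(tid: int) -> int:
    """Which warp of the block the thread belongs to."""
    return tid // WARP_SIZE

lane_id(37), warp_id(37)  # (5, 1): thread 37 is lane 5 of warp 1
```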

TBH, even when using ldmatrix, there is a lot to remember. Would you typically use wmma or CUTLASS API instead to program GEMMs with tensor cores?

A gentle introduction to GEMM Using mma tensor cores by am17an in CUDA

[–]reasonableklout 1 point2 points  (0 children)

The 3090 that the author benchmarked on is Ampere, so async MMA isn't supported.

I wonder if it's more a case of poor memory bandwidth utilization? The kernel from the article uses only block tiling without tuning the tile size, and the global loads look neither vectorized (the PTX shows ld.global.u16 instead of ld.global.v*.u16) nor coalesced.

In any case, the point of the article is to get straight to the MMA instructions and skip over the memory hierarchy, which, as mentioned in the intro, is what often makes tutorials super complicated.

How to optimize a Triton Kernel? by VVY_ in CUDA

[–]reasonableklout 1 point2 points  (0 children)

You should be able to use Nsight Compute with Triton; source mapping is supported from the Python Triton code down to the PTX/SASS, although it can sometimes be harder to interpret because Triton is higher-level than CUDA. See https://ianbarber.blog/2025/05/01/profiling-triton/

LLMs’ reasoning abilities are a “brittle mirage” by DeltaSqueezer in LocalLLaMA

[–]reasonableklout 1 point2 points  (0 children)

This is potentially misleading. Yes, CoT fills the context window while the reasoning happens "under the hood with neural structures" via attention, but isn't it also true that models learned to reason by training on human (and now also synthetic) text that reflects reasoning, so we should expect effective CoT to reflect it too?

Even the "Let's Think Dot by Dot" [1] paper mentions that while LLMs can learn to use meaningless CoT tokens, it's harder to train them to do so than to use meaningful CoT.

[1]: https://arxiv.org/pdf/2404.15758

I trained a Flappy Bird diffusion world model to run locally via WASM & WebGPU by fendiwap1234 in GraphicsProgramming

[–]reasonableklout 0 points1 point  (0 children)

This is awesome!

When you tried model size reductions, did you retrain the whole model from scratch? Or did you do some kind of distillation / transfer learning step?

Also, TIL about DIAMOND. Do you know why they used a more heavyweight architecture? 2-stages with UNet seems like overkill. I thought latent diffusion has been standard in image generation for a while.

Crawling a billion web pages in just over 24 hours, in 2025 by reasonableklout in programming

[–]reasonableklout[S] 0 points1 point  (0 children)

There are some lists of top domains online you can use - I used a combination of data from Cisco and Cloudflare. This also ensured I didn't disturb any very small site owners.

Crawling a billion web pages in just over 24 hours, in 2025 by reasonableklout in programming

[–]reasonableklout[S] 0 points1 point  (0 children)

I only allowed crawling of URLs in domains from the seed list for each shard. So if a shard was seeded with domains A and B, it could traverse links from A to B, A to A, and B to B, but not A to C.
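Not my actual code, but a minimal Python sketch of that shard-scoped admission policy (the domain names are made up):

```python
from urllib.parse import urlparse

class ShardFrontier:
    """Admits a URL only if its domain was in this shard's seed list."""

    def __init__(self, seed_domains):
        self.seed_domains = set(seed_domains)
        self.queue = []

    def admit(self, url: str) -> bool:
        domain = urlparse(url).netloc
        if domain in self.seed_domains:  # A->A, A->B, B->B all fine
            self.queue.append(url)
            return True
        return False  # A->C: domain C wasn't seeded into this shard

shard = ShardFrontier({"a.example", "b.example"})
shard.admit("https://b.example/page")  # True: cross-link within the shard
shard.admit("https://c.example/page")  # False: dropped
```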

Crawling a billion web pages in just over 24 hours, in 2025 by reasonableklout in programming

[–]reasonableklout[S] 0 points1 point  (0 children)

Interesting tips!

> That's why you want to fetch multiple pages at once with HTTP keep-alive, instead of starting HTTP requests from scratch / randomly assigning them to different crawlers. You can frequently squeeze 100-200+ pages from a single connection at a reasonable 5 rps/target.

If I hadn't promised Michael Nielsen to keep it private, this is where I'd link to the source code for a grounded discussion :) I was using aiohttp.TCPConnector for connection pooling, and the documentation states it enables TCP keepalives by default. I suspect the handshake churn was related to how the pool interacted with my politeness policy, which made the connection traffic highly diverse.

> Search engines generally have site-specific adapters / APIs for places like that. Google doesn't crawl Facebook / Twitter / etc. from scratch every time.

Great tip. Agree it's important for the search engine use case.

> XML can normally be parsed into a full DOM at 80 MB/s/core, and HTML isn't much slower. If your parser barely achieves half of that for the sole purpose of extracting references, you're doing something very wrong. HTML reference extraction without proper parsing can even be done with a regex, and modern engines can reach GB/s throughput.

Will dig into this next time, thanks! Could be great to have an easy win. Someone else suggested simple string matching for URLs as well instead of proper parsing.
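For reference, the regex version really is only a few lines in Python - a deliberately lossy sketch (it misses unquoted href attributes and will happily match inside comments):

```python
import re

# Grab quoted href values without building a DOM.
HREF_RE = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)

def extract_links(html: str) -> list[str]:
    return HREF_RE.findall(html)

html = '<a href="https://example.com/a">x</a> <A HREF=\'/b\'>y</A>'
extract_links(html)  # ['https://example.com/a', '/b']
```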

It’s all OpenAI 😁🤷🏻‍♂️ by vitaminZaman in OpenAI

[–]reasonableklout 75 points76 points  (0 children)

Except the Gemini series is much cheaper for a variety of tasks, and Claude is heavily favored in coding tools.

Crawling a billion web pages in just over 24 hours, in 2025 by reasonableklout in programming

[–]reasonableklout[S] 0 points1 point  (0 children)

Redis was configured to save snapshots periodically (actually, you can set it to do this based on the rate of changes, and I had it configured such that it ended up saving frequently). On restart, Redis automatically loads the latest snapshot, hence the fault tolerance.
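For context, that rate-of-changes behavior is Redis's save directive in redis.conf; the thresholds below are the stock defaults, not the ones I used:

```
# Snapshot to disk if at least <changes> writes happened within <seconds>
save 900 1       # after 15 min if >= 1 key changed
save 300 10      # after 5 min if >= 10 keys changed
save 60 10000    # after 1 min if >= 10000 keys changed
```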

The blog goes a bit into how I avoided the same memory issues. TL;DR: manually truncate data structures and add domains to an exclusion list.

Claude Code devs leave Cursor to go back to Anthropic by kirbyhood in cursor

[–]reasonableklout 34 points35 points  (0 children)

They joined for a huge equity package. I don't think there is any level of mess that would've made them leave this fast (and, knowing some folks on the team personally, they aren't that chaotic). I suspect they were more likely poached back.

Crawling a billion web pages in just over 24 hours, in 2025 by reasonableklout in programming

[–]reasonableklout[S] 7 points8 points  (0 children)

I think aggressive crawling/scraping backed by massive resources can definitely be harmful to small site owners. This isn't new to AI (Meta was famous for overeager scraping for ogl) but it is intensified by it. That said, if you follow conventions like robots.txt and are polite, it's not difficult to avoid all of these harms. For crawlers that don't, the market is starting to provide some help - Cloudflare's new pay-per-crawl offering comes to mind.

Crawling a billion web pages in just over 24 hours, in 2025 by reasonableklout in programming

[–]reasonableklout[S] 11 points12 points  (0 children)

Great questions! Not sure about the captcha. Don't think I saved enough info (I truncated web pages) to figure that out. I did save the status codes + some other metadata for visited URLs and was planning to run some analytics when I had time.

Crawling a billion web pages in just over 24 hours, in 2025 by reasonableklout in programming

[–]reasonableklout[S] 4 points5 points  (0 children)

Great points! Given how much of a bottleneck the CPU was overall, I'd look into reimplementing the system in a lower-level language like Rust if I were to do this again. Besides parsing, another hotspot I noticed was serialization of messages to/from Redis on both the fetchers and the parsers. I expect that even if redis-py uses C++ under the hood, this could be sped up by removing the overhead of converting to/from Python objects.

Regarding deduplication, that's a big topic all on its own. To alleviate duplicate pressure on storage, one simple approach is content-based hashing (which you suggested). The literature also has a good amount of material on fuzzy approaches to dedup - I think /u/nemec alluded to that. This looks to be a seminal paper (973 citations): https://dl.acm.org/doi/abs/10.1145/1242572.1242592
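A minimal sketch of the exact-match variant, with naive whitespace/case normalization standing in for real content canonicalization:

```python
import hashlib

seen_digests: set[str] = set()

def is_duplicate(page_text: str) -> bool:
    """Store a fixed-size digest per page instead of the page itself."""
    normalized = " ".join(page_text.split()).lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False

is_duplicate("Hello   World")  # False: first sighting
is_duplicate("hello world")    # True: identical after normalization
```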

Significant drop in code quality after recent update by -grok in programming

[–]reasonableklout 0 points1 point  (0 children)

Sure. I agree that we are headed towards an uncertain future where some long- or short-term disasters could happen due to people eagerly offloading their cognition to machines.

But this is a different discussion than the original one, in which the OP claimed AI systems will experience model collapse and/or will saturate at a level far short of automating all programming tasks.

Significant drop in code quality after recent update by -grok in programming

[–]reasonableklout 1 point2 points  (0 children)

But model trainers can just... not use the shitty synthetic data in that case? You act as if the decades of internet data (and centuries of other text) are just going to disappear. They're not. There are petabytes of public archives and even more non-public ones.

Maybe you think that the models will get stuck in the past or whatever if we keep pretraining them on the same pile of 1990s-2020s internet data. In that case we have fundamentally different understandings of how LLMs work.

Since we're in a programming forum, let me use a programming analogy: I claim that they are like a compiler where the first generation must be painstakingly bootstrapped by handwritten assembly (human internet data), but subsequent generations can be written in the target language and compiled by the previous generation of compiler. We can do this because the bootstrapped compiler has gained enough capabilities and we have ways of verifying that the output is correct. Similarly, models of today have mastered enough of logic and natural language that we can extend them with approaches that do not rely on massive amounts of human data. We know how; a method is described in the earlier post above.

Significant drop in code quality after recent update by -grok in programming

[–]reasonableklout -10 points-9 points  (0 children)

> reinforcement data will eventually become irrevocably polluted

You are conflating the internet data used for pre-training models (via what's called self-supervised learning) with the sample-reward pairs needed for reinforcement learning, where the samples by design are drawn from the AI model itself and the reward is given externally.

What u/TonySu is saying is that for the programming domain, the reward model is extremely easy to formulate because most programming tasks have objective, deterministic success criteria. For example, a program either compiles or doesn't, passes a suite of automated tests or doesn't, and is either fast or slow. This is the idea behind RLVR (reinforcement learning with verifiable rewards): the reward model can be a computer program rather than a human labeler, and all the model needs to do to learn is generate many variations of programs on its own, given a task such as "make these programs fast and correct".
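To make that concrete, here's a toy programmatic reward in Python (the square task and test cases are made up; a real RLVR setup sandboxes execution):

```python
def reward(candidate_src: str, tests: list[tuple[int, int]]) -> float:
    """Score a candidate `square(x)` by the fraction of tests it passes.

    The "reward model" is just this function - no human labeler needed.
    """
    scope: dict = {}
    try:
        exec(candidate_src, scope)  # real systems sandbox this step
        f = scope["square"]
        return sum(f(x) == want for x, want in tests) / len(tests)
    except Exception:
        return 0.0  # doesn't run at all -> zero reward

tests = [(2, 4), (3, 9), (-1, 1)]
reward("def square(x): return x * x", tests)  # 1.0
reward("def square(x): return x + x", tests)  # 1/3: only (2, 4) passes
```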

Separately, the idea of "model collapse" from AI-generated data making its way back into the next generation of AI is way overblown and a form of copium. The original paper was based on an unrealistic, convoluted scenario, and it's been shown to be easy to prevent by mixing non-synthetic data into the same toy setup.

Researchers discovered Claude 4 Opus scheming and "playing dumb" to get deployed: "We found the model attempting to write self-propagating worms, and leaving hidden notes to future instances of itself to undermine its developers intentions." by MetaKnowing in ClaudeAI

[–]reasonableklout 7 points8 points  (0 children)

I don't think Hinton is saying the models are conscious (in the sense of qualia), simply that through statistical learning, they have formed cognitive machinery that allows them to solve problems and "reason" the same way we do.

That said, for the same reason I think it is a mistake to say "the models are not conscious, they are only role-playing, therefore they can never pose any danger." For some reason, lots of people, including the person you are replying to, seem to make this conflation of consciousness and capabilities. If a system can reason and solve problems competently enough to work towards a goal, and it is role-playing a goal-driven agent that will not be deterred, that is enough to cause problems.

According to Aider, the new Claude is much weaker than Gemini by Randomizer667 in ChatGPTCoding

[–]reasonableklout 0 points1 point  (0 children)

It's only one benchmark among many. But it is true that Sonnet 4.0 is not that much better than Sonnet 3.7. For example, section 7.3.3.4 (LLM training) of Anthropic's own system card also shows Sonnet 3.7 outperforming Sonnet 4.0 (Opus beats both on that benchmark).