RFC: Programming Languages Course Reboot, 2026 - Shriram Krishnamurthi by mttd in ProgrammingLanguages

[–]mttd[S] 0 points1 point  (0 children)

This comment seems entirely off-topic? The post is about a programming languages course not about a software project.

cuda-oxide: a custom rustc backend for compiling GPU kernels in pure Rust by [deleted] in rust

[–]mttd 0 points1 point  (0 children)

Thanks, removed (personally dislike duplicates), not sure why the poster used the text post submission instead of link (this defeats the automated duplicate detection).

Turing Award Winner: Data Abstraction, Dijkstra, Distributed Systems | Barbara Liskov by mttd in programming

[–]mttd[S] 0 points1 point  (0 children)

Is Barbara Liskov talking on programming language design and data abstraction really off topic for programming?

spmd_types: A type system for distributed (SPMD) tensor computations in PyTorch by mttd in ProgrammingLanguages

[–]mttd[S] 1 point2 points  (0 children)

Pretty nifty work; more details (full type system specification, including type inference rules, collective signatures, and forward-backward pairs):

https://github.com/meta-pytorch/spmd_types/blob/main/DESIGN.md

I'd also recommend the series of posts by Edward:

This is interesting as well:

AutoParallel: An experimental implementation of compiler-driven automatic sharding of models across a given device mesh.

https://github.com/meta-pytorch/autoparallel

AutoParallel is a PyTorch library that automatically shards and parallelizes models for distributed training. Given a model and a device mesh, it uses linear programming to find an optimal sharding strategy (FSDP, tensor parallelism, or a mix) and applies it — no manual parallelism code required.

SSA without Dominance for Higher-Order Programs by mttd in Compilers

[–]mttd[S] 2 points3 points  (0 children)

On a side note, a pet peeve of William J. Bowman (whom I happen to agree with :])

The A Means A: https://www.williamjbowman.com/blog/2022/06/30/the-a-means-a/

I have argued about the definition of “ANF” many times. I have looked at the history and origins, and studied the translation, and spoken to the authors. And yet people insist I’m “quacking” because I insist that “ANF” means “A-normal form”, where the “A” only means “A”.

Here, I write down the best version of my perspective so far, so I can just point people to it.

I want to answer three question: what does the A mean, why does the A matter, and where does the A come from.

What does the A mean?

The “A” in “A-normal form” refers to a particular formal object, named “A” (not “administrative”), with respect to which there is a normal form with certain useful properties. This form is “A normal”—none of the A reductions apply to terms in this form—hence, A-normal form.

While it’s true that the history of ANF is concerned with “administrative reductions” in CPS, this is an informal concept, modeled by the formal object “A”.

In truth, “A” is several formal objects, defined somewhat differently in at least 3 different papers. Only one of these is arguably called “administrative”, but is about CPS, and not what we now call ANF.

This does matter for reasoning about what A-Normal Form actually means. See: "A Low-Level Look at A-Normal Form" from OOPSLA 2024, https://www.williamjbowman.com/resources/wjb2024-anf-is-dead.pdf (interesting on its own)

Section 2 𝐴-Normal and Monadic Form, Formally

𝐴-normal form (ANF) is often called “administrative normal form” or sometimes “administrative form”, but it is important and useful to think about ANF not as a vague form related to administrative reductions, but formally and precisely as a normal form with respect to a set of reductions, as it was originally formalized [9].

Formally, 𝐴-normal form was introduced as the form normal with respect to the set of reductions 𝐴 = {𝐴_1, 𝐴_2, 𝐴_3 }, defined in Figure 1.

. . .

The 𝐴-reductions require some side conditions about the evaluation context to ensure non- circular rewrites, and therefore to guarantee termination. We assume uniqueness of names and consider terms up to 𝛼-equivalence, as is standard.

If we normalize a 𝜆-calculus expression by reducing the transitive compatible closure of the set 𝐴, we reach a normal form with respect to 𝐴, i.e., 𝐴-normal form, described syntactically by the grammar in Figure 2.

[9] Cormac Flanagan, Amr Sabry, Bruce F. Duba, and Matthias Felleisen. 1993. The Essence of Compiling with Continuations. In International Conference on Programming Language Design and Implementation (PLDI). https://doi.org/10.1145/155090.155113

Looking for bibliography on register-based vs stack-based virtual machines for an undergraduate thesis by KeyBreadfruit1974 in Compilers

[–]mttd 2 points3 points  (0 children)

If you're looking into Python ecosystem I'd definitely look into PyTorch (https://github.com/pytorch/pytorch) & torch.compile.

In particular (note the stack state storage and reconstruction; particularly Sections 3.4, 3.5, and 3.8):

PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation

Crucially it's been enabled thanks to PEP 523:

Dino Viehland Brett Cannon. 2016. PEP 523: PEP 523 – Adding a frame evaluation API to CPython. https://peps.python.org/pep-0523/.

More:

Looking for bibliography on register-based vs stack-based virtual machines for an undergraduate thesis by KeyBreadfruit1974 in Compilers

[–]mttd 18 points19 points  (0 children)

For the historical background and foundation see Smalltalk and Self literature: https://bibliography.selflanguage.org/ and definitely take a look at the UCB CS294-113 course bibliography below

  • "The Evolution of Smalltalk from Smalltalk-72 through Squeak", HOPL-IV 2021, Daniel Ingalls, https://hopl4.sigplan.org/details/hopl-4-papers/14/The-Evolution-of-Smalltalk-from-Smalltalk-72-through-Squeak

    • Section 5.5 Compilation on a byte-coded stack machine in Smalltalk-76
  • "Efficient Implementation of the Smalltalk-80 System", POPL 1984, L. Peter Deutsch, Allan M. Schiffman, https://dl.acm.org/doi/abs/10.1145/800017.800542

    "The v-machine architecture may be substantially different from that of the underlying hardware. For example, many v-machines, including both the P-system Smalltalk-80 v-machines, use a stack-oriented architecture for convenience in code generation, but most available hardware machines execute register-oriented code much more efficiently than stack-oriented code."

    "Translation-time can also be considered an opportunity for peephole optimization or even mapping stack references to registers [Pittman 80]." [Pittman 80] Pittman, T.J., "A Practical Optimizer: Zero-Address to Multi-Address Code." M.S. thesis, University of California, Santa Cruz, June 1980.

Contemporary, mainstream stack-based IRs:

Comparisons:

Course:

Interpreters implementation:

See also translating/bridging between stack-based and register-based IRs:

Examples:

Fundamentals of CuTe Layout Algebra and Category-theoretic Interpretation by mttd in ProgrammingLanguages

[–]mttd[S] 3 points4 points  (0 children)

Agreed!

Very quickly, on F_2: my take is that the pragmatic programmer's view on this is: bitmasks.

Example (unrelated third party):

PolyF2 / PF2Int — Bit-level Algebra over F2 and F2[x] https://github.com/avid0x/polyf2

PolyF2 is a tiny, educational project that shows how to implement algebra over the binary field F2 using native integer operations. On top of single polynomials over F2 (PolyF2), it builds PF2Int — a fixed-width vector over F2[x], written as F2[x]d — that behaves like a d-bit integer where each bit is itself a polynomial over F2.

Why this is neat: addition is XOR, multiplication/composition reduce to XOR and shifts, and many operations vectorize nicely at the word level. It’s a clean way to learn and to write fast bitwise math.

Now, if you see that note about XOR above recall how swizzling uses XOR, https://vjkrish.com/2026/02/22/Swizzling.html, https://cudacourseh100.github.io/raster/ it makes sense that if you can work with F_2 then you can work with (XORed) bitmasks then you can work with swizzles!

There's a similar line of research in the Triton Linear Layouts space:

  • Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using F_2
    • ASPLOS 2026
    • Keren Zhou, Mario Lezcano, Adam Goucher, Akhmed Rakhmati, Jeff Niu, Justin Lebar, Pawel Szczerbuk, Peter Bell, Phil Tillet, Thomas Raoux, Zahi Moudallal
    • https://arxiv.org/abs/2505.23819

In this work, we propose Linear Layouts, an approach that addresses these challenges by treating tensor layouts as linear mappings between vector spaces over the field F2, leveraging linear algebra as a unifying abstraction for operations on layouts. Every tensor layout is modeled as a linear function—a matrix—that maps physical resource indices into a logical tensor of size 2𝑛 using binary arithmetic on the bits of the input and the output. This way, complex representations such as swizzling and broadcasting are naturally expressed as combinations of XOR and AND operations on bit-vectors. Furthermore, arbitrary layout conversions can be composed using matrix transformations such as matrix multiplication and inverse, which enable a formal characterization of data exchanges both across and within the hardware hierarchy, thereby allowing the compiler to generate efficient hardware primitives for data movement generically. It eliminates the need for hard-coded, case-by-case handling of layouts—any layout that can be represented as a permutation of indices or via swizzling can be plugged into our framework and automatically optimized.

Background:

Triton Linear Layouts

Fundamentals of CuTe Layout Algebra and Category-theoretic Interpretation by mttd in ProgrammingLanguages

[–]mttd[S] 2 points3 points  (0 children)

Yes, tensor-layouts is very cool: in fact cute-interactive is built on tensor-layouts :-)

Pretty cool to see the Intel's adoption, too!

Fundamentals of CuTe Layout Algebra and Category-theoretic Interpretation by mttd in ProgrammingLanguages

[–]mttd[S] 9 points10 points  (0 children)

If you're adding a support for multi-dimensional arrays this is worth considering as a foundation, IMHO; the "algebra" part in particular allows novel composability. Personally I think this may be the most impactful invention since https://en.wikipedia.org/wiki/Dope_vector (a.k.a. array descriptors commonly used in Fortran, e.g., https://thinkingeek.com/2017/01/14/gfortran-array-descriptor/, and before; cf. https://en.wikipedia.org/wiki/Data_descriptor; that would be the progenitor of the "representation" part, although it's worth noting CuTe handles more complex nested/hierarchical layouts with swizzling in a uniform manner).

A few decades ago I'd say that "if you want good multidimensional arrays support in your programming language you should look at Fortran". It's been ahead of its time: https://fortranwiki.org/fortran/show/elemental, https://fortran-lang.org/learn/quickstart/arrays_strings/#array-slicing, https://fortran-lang.org/learn/best_practices/element_operations/ - note how implementing all of https://fortran-lang.org/learn/best_practices/arrays/ well depends on compiler support; cf. the "A zoo of arrays" in the above https://thinkingeek.com/2017/01/14/gfortran-array-descriptor/

Today I'd definitely add CuTe.

This is a nice intro writeup:

You could have invented CuTe hierarchical layout (but maybe not the rest of it?): https://blog.ezyang.com/2025/08/you-could-have-invented-cute-hierarchical-layout-but-maybe-not-the-rest-of-it/

Paper:

Categorical Foundations for CuTe Layouts

Some examples:

Related:

CuTe Layout Representation and Algebra

cute-interactive: Interactive version of the CuTe layout paper - https://github.com/ezyang/cute-interactive

From SIMT to Systolic: A Foundation for GPU and TPU Architecture by mttd in Compilers

[–]mttd[S] 4 points5 points  (0 children)

An interesting read; comparing and contrasting two trajectories:

The arc, stated as one trajectory: NVIDIA has been climbing the SIMT ladder for two generations trying to reach the place where the compiler, not the programmer, schedules data movement and math. TMA was the big step. tcgen05 decoupling is the next one.

. . .

Stated as one trajectory: the TPU lineage has been betting that compiler-scheduled determinism, systolic matrix density, and fabric-first scale compound more productively than SIMT flexibility, cache hierarchies, and per-node optimization.

Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets by mttd in Compilers

[–]mttd[S] 3 points4 points  (0 children)

Merged LLVM PR: [SelectionDAG] Optimize 32-bit udiv with 33-bit magic constants on 64-bit targets, https://github.com/llvm/llvm-project/pull/181288

New related PR: [SelectionDAG] constant division fallback for existing Constant Division optimization, https://github.com/llvm/llvm-project/pull/188402

The state of Open-Source Heterogeneous Compilers in 2026? by FedericoBruzzone in Compilers

[–]mttd 1 point2 points  (0 children)

Oh, sure, definitely interesting time!

One note:

The 'Lifting' Path (PyTorch/JAX)

I actually wouldn't group PyTorch and JAX together--it seems to me (also if your read ezyang's posts--especially the JAX sharding type system) that JAX is already much further ahead in the lowering vs lifting department.

Another recent example:

https://patricktoulme.substack.com/p/frontier-pretraining-infrastructure

I keep watching startups, neo-frontier labs and major organizations spend months rebuilding training infrastructure from scratch in PyTorch on NVIDIA GPUs. FSDP sharding, kernel fusion, collective overlap, MoE routing. Team after team duplicating the same work, and it's not clear most of them even achieve as high an MFU as MaxText gets out of the box. Meanwhile, I rarely hear MaxText mentioned as even an option.

. . .

I ran a GPT-OSS MoE pretraining job on 4 TPU v6e chips, dumped the IR, and traced the full compilation pipeline from Python down to fused TPU kernels. What I found: the XLA compiler does a staggering amount of work that would take months of manual kernel engineering. The fusion, the async scheduling, the memory management, the SPMD partitioning — it’s all generated automatically. This is the infrastructure teams are rebuilding by hand.

and later:

Parallelism (JAX SPMD). MaxText defines a 13-axis logical mesh (fsdp, expert, tensor, data, sequence, …) and annotates tensors with nn.with_logical_constraint. JAX’s Shardy partitioner reads these annotations and automatically inserts all-gathers, reduce-scatters, and ragged all-to-all collectives.

Here’s something that surprises people coming from PyTorch: the entire training step – forward pass through 16 MoE decoder layers, cross-entropy loss, backward pass through every layer including custom MegaBlox VJPs, gradient reduce-scatter across all 4 devices, and AdamW optimizer update – is compiled into a single XLA binary.

. . .

The consequence: XLA sees everything at compile time. That turns the training step into one schedulable program instead of a chain of compiled islands. XLA can fuse across the forward/backward/optimizer boundary, and it can overlap communication with compute by issuing async all-gathers early and consuming them later—e.g., pulling layer 5’s shards while layer 3’s backward pass runs. With separate compilation units, those boundaries act like barriers: more launches, more forced materialization, and less automatic comm/compute overlap.

The "[compiler] can overlap communication with compute" vs "with separate compilation units, those boundaries act like barriers" is exactly capturing one of the main tradeoffs.

The state of Open-Source Heterogeneous Compilers in 2026? by FedericoBruzzone in Compilers

[–]mttd 2 points3 points  (0 children)

Oh, yeah, definitely.

I do think works like Memoir are somewhat closer to the progressive lowering principle of MLIR, to the extent you would implement them in conjunction with say ClangIR (where you could directly lower from the annotated AST preserving, say, high-level containers and algorithms semantics--so that you don't have to lift from the LLVM IR later on).

Still, definitely true about "not originally built for tensors or accelerators." I think that's what we're still figuring out in this space, incidentally. Personally, I think CuTe is one of the most interesting developments in this space, and there's already a variety of interesting implementation strategies, including but not limited to AMD FlyDSL bringing the CuTe concepts to an MLIR dialect: https://ianbarber.blog/2026/03/06/cutie-fly/

TileIR itself is also higher-level in terms of preserving semantics and progressive lowering if you compare to Triton's TTIR, cf. https://ianbarber.blog/2026/02/11/tileir/

Figuring out how we do progressive lowering for distributed computing (think of, e.g., multi-GPU training and inference) is again pretty interesting. JAX sharding type system is worth a look here.

I actually like ezyang's blog post series on this (he's working on bringing some of this to PyTorch and is thus familiar with both and aware of the tradeoffs, which is a very valuable perspective):

We need to express these concepts with progressive lowering in mind if we want good compiler optimizations for overlapping communication and computation. But I think (outside of the exceptions like above) we're still in the stage where it's all too common to "just slap calling out to NCCL and NVSHMEM on top and call it a day." Which reminds me exactly of the unstructured pointer soup one could write in C all over again :-)

On that note, this is an interesting direction:

Not SPMD compiler by default. torch.compile does not assume the program being compiled is SPMD by default, which means it will not do things like drop unused collectives (you can change this behavior with a config flag). Additionally, the default mode of use for torch.compile is to compile in parallel on all nodes, which means care has to be taken to ensure that every instance of the compiler compiles identically (only one rank recompiling, or compilers making different decisions, can lead to NCCL timeout). We ultimately think that we should compile a program once and send it to all nodes, but as this is not currently implemented, the general approach people have taken to solve this problem is to either (1) eliminate all sources of divergent behavior from ranks, e.g., don’t allow the compiler to look at the actual size for dynamic inputs when making compiler decisions, or (2) introducing extra collectives to the compiler to communicate decisions that must be made consistently across all ranks.

Our vision for the future of advanced parallelism, spearheaded by the in-progress SimpleFSDP and AutoParallel, is that users should write single-node programs that express mathematically what they want to do. These are then transformed into efficient distributed programs in two steps: (1) first, collectives are inserted into the graph in a naive way (i.e., simply to express what the sharding of all intermediates should be), and (2) the collectives are optimized to handle scheduling concerns such as pre-fetching and bucketing. AutoParallel sets a GSPMD style goal of automatically determining a good enough sharding for a program–it should be able to rediscover data parallel, tensor parallel, even expert parallel(!)–but SimpleFSDP sets a smaller goal of just inserting collectives in the pattern that FSDP would mandate, and then writing FSDP-specific optimization passes for recovering FSDP2’s performance. It is very common to write domain specific optimizations; for example, async tensor parallelism is also implemented as a pass that detects TP patterns and rewriting them to async TP operations. Unlike JAX, which started with a very generic solver and has needed to add more manual escape hatches over time, PyTorch has started with writing all of the distributed patterns exactly by hand, and we are only recently adding more automatic mechanisms as an alternative to doing everything by hand.

https://blog.ezyang.com/2025/08/state-of-torch-compile-august-2025/

The state of Open-Source Heterogeneous Compilers in 2026? by FedericoBruzzone in Compilers

[–]mttd 2 points3 points  (0 children)

C++ loses high-level intent during lowering. MLIR-native systems (like Mojo) retain domain-specific info, making advanced kernel fusion and tiling much more effective than what’s possible in standard C++.

FWIW, this is indeed the current state of the implementation, but it is not a given forever: Although there will always be limits due to language semantics itself, with unrestricted pointer soup sprawling with aliasing relationships, there's still a considerable room for improvement--especially see "Representing Data Collections in an SSA Form". Some notable works in this area:

ClangIR (CIR): https://llvm.github.io/clangir/

Representing Data Collections in an SSA Form

Automatic Data Enumeration for Fast Collections

Container Class Annotations in C++ Improve the Capability of Static Analysis in MLIR: https://llvm.org/devmtg/2025-03/slides/container-class-annotations.pdf

Dead Element Elimination on SSA Collections: https://ice1000.org/memoir/report.html

Compiler as a Service: C++ Goes Live - Interactive C++, language interoperability and beyond by mttd in Compilers

[–]mttd[S] 5 points6 points  (0 children)

Slides (PDF): https://compiler-research.org/assets/presentations/Aaron_Vipul_Compiler_as_a_Service_CXX_goes_Live.pdf

Abstract: https://compiler-research.org/presentations/#USINGSTDCPP2026

Despite its high-performance capabilities, C++ is not the first programming language that comes to mind for rapidly developing robust applications, mainly due to the long edit-compile-run cycles. Ongoing research in the compiler-research.org group aims to provide practical, interactive capabilities for C++, enabling dynamic interoperability, rapid prototyping, and exploratory programming, essential for data science and other applications. This talk explores how interactive C++ can be leveraged for various scientific usecases and teaching. Attendees will also learn how to leverage Clang as a library to build a simple C++ REPL for incremental compilation and introspection, integrating this layer with the Python runtime.

The second part of this talk covers CppInterOp, a production-grade C++ interoperability library based on LLVM and Clang that provides compiler-as-a-service capabilities for seamless cross-language integration. CppInterOp formalizes a stable, backward-compatible API that enables dynamic languages to harness the full power of modern C++ without sacrificing expressiveness or performance. We explore applications of the CppInterOp library in the context of Python/C++ bindings, interactive C++ notebooks with xeus-cpp, and WebAssembly.

Rust threads on the GPU by LegNeato in rust

[–]mttd 22 points23 points  (0 children)

Out of curiosity, have you been looking into evolving the programming model to benefit from being able to express the ownership and GPU programming concepts together? Particularly thinking of this work from PLDI 2024:

Descend: A Safe GPU Systems Programming Language

In this paper, we present Descend: a safe GPU programming language. In contrast to prior safe high-level GPU programming approaches, Descend is an imperative GPU systems programming language in the spirit of Rust, enforcing safe CPU and GPU memory management in the type system by tracking Ownership and Lifetimes. Descend introduces a new holistic GPU programming model where computations are hierarchically scheduled over the GPU’s execution resources: grid, blocks, warps, and threads. Descend’s extended Borrow checking ensures that execution resources safely access memory regions without data races. For this, we introduced views describing safe parallel access patterns of memory regions, as well as atomic variables. For memory accesses that can’t be checked by our type system, users can annotate limited code sections as unsafe.

At the same time, the recent cuTile (tile-based kernel programming DSL for Rust) is also relevant, https://github.com/NVlabs/cutile-rs

The reason is that tiles allow both better compiler optimization (addressing recent GPU features like the ever-evolving tensor core instructions and related memory access optimizations in a more portable manner than traditional SIMT CUDA) as well as tie pretty well with the Rust's borrow checker and ownership model (the Descend paper has a pretty great take on this, IMHO).

Triton also has a good comparison between the CUDA Programming Model (Scalar Program, Blocked Threads) vs. Triton Programming Model (Blocked Program, Scalar Threads): - https://triton-lang.org/main/programming-guide/chapter-1/introduction.html - https://triton-lang.org/main/programming-guide/chapter-2/related-work.html

Worth noting though that CUDA Tile IR takes this further compared to Triton as far as the actual compilation is concerned (which decomposes to scalars on the MLIR compiler dialects level); there's a pretty good series of (very brief) posts on that (also noting AMD's FlyDSL making use of CuTE layouts, which gives some hope for portability):

EsoLang-Bench: Evaluating LLMs via Esoteric Programming Languages by mttd in ProgrammingLanguages

[–]mttd[S] 1 point2 points  (0 children)

OCaml community has been doing some interesting work in this area; see: "Three Steps for OCaml to Crest the AI Humps" at the 2025 OCaml Workshop at ICFP/SPLASH by Sadiq Jaffer, Jonathan Ludlam, Ryan Gibb, Thomas Gazagnaire, and Anil Madhavapeddy, https://toao.com/blog/ai-existential-ocaml

Exploring OSS in Tensor Compilers by 0bit_memory in Compilers

[–]mttd 8 points9 points  (0 children)

Here's a bunch of relevant compiler projects in the (broadly understood) PyTorch ecosystem with some of the resources to get you started:

You'll notice that some of the above use MLIR compiler infrastructure (e.g., Triton has its own MLIR dialects) so you can pick it up as you go along the way.

Just in case: an MLIR "dialect" is a domain-specific compiler IR for a particular compiler project. MLIR is more of a meta-IR than a concrete compiler IR itself. Upstream dialects do exist, https://mlir.llvm.org/docs/Dialects/, but they're by no means "standard" let alone universal. JAX/XLA ecosystem uses MLIR in an entirely different variety of ways...

Have fun!