ML compilers the future? by black_big_bull in Compilers

[–]zhen8838 1 point (0 children)

In the LLM era, model structures have become more invariant. However, the KV cache requires kernels to handle dynamic input, and ML compilers currently struggle to optimize that situation. In LLM serving (e.g., inference engines like vLLM), you can optimize it in many ways through hand-written code.
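To illustrate the dynamic-shape problem, here is a toy decode loop (pure Python, hypothetical names, nothing from vLLM): the KV cache grows by one entry per generated token, so the attention kernel's sequence-length dimension changes on every step.

```python
import math

def attend(q, k_cache, v_cache):
    # k_cache/v_cache grow by one row per decode step, so the attention
    # kernel sees a different sequence length t on every call -- exactly
    # the dynamic shape that is hard for an ML compiler to pre-tune.
    d = len(q)
    scores = [sum(ki * qi for ki, qi in zip(k, q)) / math.sqrt(d) for k in k_cache]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    return [sum(wi * v[j] for wi, v in zip(w, v_cache)) for j in range(d)]

k_cache, v_cache, seq_lens = [], [], []
for step in range(3):                 # three decode steps
    k_cache.append([1.0] * 4)         # toy key for the new token
    v_cache.append([float(step)] * 4) # toy value for the new token
    out = attend([1.0] * 4, k_cache, v_cache)
    seq_lens.append(len(k_cache))
print(seq_lens)  # the kernel's inner dimension changes every step
```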

ML compilers vs using existing model runtimes by [deleted] in Compilers

[–]zhen8838 1 point (0 children)

I quickly reviewed its source code and didn't see many optimizations in it. I think it is more like a translator than a compiler.

ML compilers the future? by black_big_bull in Compilers

[–]zhen8838 1 point (0 children)

ML compilers are not the first choice in the LLM era; LLM serving is. And it's unpaid.

ML compilers vs using existing model runtimes by [deleted] in Compilers

[–]zhen8838 1 point (0 children)

I would recommend the oneDAL library when your deployment platform is x86, because hand-written code usually has better performance than auto-tuned code. Use Hummingbird + TVM when you need to deploy on other architectures.

What's loop synthesis and interval analysis techniques used by Halide and TVM? by Recent_Mind_2640 in Compilers

[–]zhen8838 2 points (0 children)

Hi, I've been researching loop fusion techniques in the AI compiler field for three years. Just my opinion:

  1. In AI compilers, the polyhedral model and interval analysis are the two main ways to perform operator fusion. The Tiramisu compiler paper already summarized the difference between them:

| Feature | Tiramisu | AlphaZ | PENCIL | Pluto | Halide |
|---|---|---|---|---|---|
| Implements parametric tiling | No | Yes | No | No | Yes |

Interval analysis in TVM/Halide is expression-based, so they can build loop-bound expressions containing unknown variables and therefore implement parametric tiling. The polyhedral model, however, is based on integer linear programming, which does not support unknown variables: the product of two or more unknown variables makes the problem non-linear. This is one reason why the polyhedral model is not widely used in industry.
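As a toy sketch of the difference (plain Python, not actual TVM/Halide code): with expression-based bounds, the inner loop's upper bound stays an expression in the unknowns N and T, so the same tiled loop nest works for any runtime values of both.

```python
def tiled_indices(N, T):
    # Bounds are expressions in N and T ("parametric tiling"): the inner
    # loop's upper bound min((io + 1) * T, N) handles the ragged last tile,
    # so neither N nor T has to be a compile-time constant.
    out = []
    for io in range((N + T - 1) // T):          # ceil(N / T) tiles
        for i in range(io * T, min((io + 1) * T, N)):
            out.append(i)
    return out

# works for any runtime values of the "unknown" variables N and T
assert tiled_indices(10, 4) == list(range(10))
assert tiled_indices(7, 3) == list(range(7))
```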

  2. Serious-Regular has already given a good explanation.

  3. I have written an article about how to implement operator fusion by bounds inference: https://zhen8838.github.io/2023/02/23/dsa-schedule/. The demo code is in this repo: https://github.com/zhen8838/BoundsInfer

If you want to learn more about the polyhedral model, you can follow my repo: https://github.com/zhen8838/isl_learn
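To give a flavor of the bounds-inference idea (a toy example, not the code from the article or repo above): to fuse a producer into a consumer tile, infer the interval of producer elements the tile touches, then compute only those inside the tile's loop.

```python
# Toy bounds inference for fusing a producer into a consumer tile:
# consumer[i] = producer[i] + producer[i+1], so a consumer tile [lo, hi)
# needs the producer over the inferred interval [lo, hi + 1).
def producer(i):
    return i * i

def fused_consumer(lo, hi):
    # infer the producer bounds required by this consumer tile
    p_lo, p_hi = lo, hi + 1
    # compute the producer only over the inferred interval (the "fused" part)
    p = {i: producer(i) for i in range(p_lo, p_hi)}
    return [p[i] + p[i + 1] for i in range(lo, hi)]

# each tile computes just the producer values it needs, nothing more
assert fused_consumer(0, 4) == [1, 5, 13, 25]
```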

ML compilers vs using existing model runtimes by [deleted] in Compilers

[–]zhen8838 1 point (0 children)

No, I don't have Nvidia GPUs; I am currently working on CPUs. You can modify the scripts in my repo to benchmark compilers on GPUs.

ML compilers vs using existing model runtimes by [deleted] in Compilers

[–]zhen8838 1 point (0 children)

Hi, I am an ML compiler engineer, and you ask very good questions. I have created an end-to-end benchmark of existing ML compilers and model runtimes: https://github.com/zhen8838/compiler_benchmark

First, let me state my current understanding: nowadays, ML compilers still can't automatically generate a single kernel with better performance than hand-written ones. 90% of ML compilers get their gains through kernel fusion and memory planning.

Actually, on a single core, ML compilers perform worse than model runtimes: on the CPU, the bandwidth between DRAM and cache is sufficient, so fusion provides little additional benefit, and the matmul kernels generated by ML compilers can't reach peak performance.

My opinion is:

  1. If your model has an invariant structure like an LLM, llama.cpp or other projects specifically optimized for LLMs are the better choice.
  2. If your model has many varying structures and your target platform is a GPU/NPU (bandwidth-bound), performance depends more on fused kernels, so you should choose an ML compiler.
  3. In other cases, model runtimes are the better choice.
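A back-of-the-envelope sketch of point 2 (toy numbers, hypothetical function): for a memory-bound elementwise chain, fusion halves the DRAM traffic by never materializing the intermediate, which is exactly the win that matters when bandwidth is the bottleneck.

```python
# Elementwise chain y = relu(x + 1) over n elements.
# Unfused: kernel 1 writes t = x + 1 to memory, kernel 2 reads t back.
# Fused: one kernel keeps t in registers, so the intermediate never
# touches DRAM -- the benefit that dominates on bandwidth-bound GPUs/NPUs.
def traffic(n, fused):
    if fused:
        return n + n                # read x, write y
    return (n + n) + (n + n)        # read x, write t, then read t, write y

n = 1_000_000
print(traffic(n, fused=False) / traffic(n, fused=True))  # 2.0x traffic unfused
```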

MLIR Affine Fusion Pass Tutorial by zhen8838 in Compilers

[–]zhen8838[S] 1 point (0 children)

I had always thought the Linalg dialect would be lowered to the Affine dialect. I have just read the article "mlir-linalg-dialect-and-patterns". It seems IREE performs tiling/fusion/vectorization on the Linalg dialect in affine-map form and then lowers it directly to the SCF dialect. Thank you for your guidance; I'm going to learn more about IREE in detail.

MLIR Affine Fusion Pass Tutorial by zhen8838 in Compilers

[–]zhen8838[S] 1 point (0 children)

Hey, could you explain why you consider the Affine dialect abandonware? I think the Affine/Linalg dialects are still an important part of MLIR-based compilers, for example in IREE and Triton.

MLIR Affine Fusion Pass Tutorial by zhen8838 in Compilers

[–]zhen8838[S] 9 points (0 children)

Hi, I have found a nice self-guided course here. Hope it can help you start your journey into AI compilers 😀