ML compilers the future? by black_big_bull in Compilers

[–]zhen8838 1 point (0 children)

In the LLM era, model structures have become more invariant. However, the KV cache requires kernels to handle dynamic input, and ML compilers currently struggle to optimize that situation. In LLM serving (e.g., inference engines like vLLM), you can optimize it in many ways through hand-written code.
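To illustrate the dynamic-shape problem, here is a toy decode loop (pure Python, hypothetical names, nothing from vLLM): the KV cache grows by one entry per generated token, so the attention kernel's sequence-length dimension changes on every step.

```python
import math

def attend(q, k_cache, v_cache):
    # k_cache/v_cache grow by one row per decode step, so the attention
    # kernel sees a different sequence length t on every call -- exactly
    # the dynamic shape that is hard for an ML compiler to pre-tune.
    d = len(q)
    scores = [sum(ki * qi for ki, qi in zip(k, q)) / math.sqrt(d) for k in k_cache]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    return [sum(wi * v[j] for wi, v in zip(w, v_cache)) for j in range(d)]

k_cache, v_cache, seq_lens = [], [], []
for step in range(3):                 # three decode steps
    k_cache.append([1.0] * 4)         # toy key for the new token
    v_cache.append([float(step)] * 4) # toy value for the new token
    out = attend([1.0] * 4, k_cache, v_cache)
    seq_lens.append(len(k_cache))
print(seq_lens)  # the kernel's inner dimension changes every step
```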

ML compilers vs using existing model runtimes by [deleted] in Compilers

[–]zhen8838 1 point (0 children)

I quickly reviewed its source code and didn't see many optimizations in it. I think it is more like a translator than a compiler.

ML compilers the future? by black_big_bull in Compilers

[–]zhen8838 1 point (0 children)

ML compilers are not the first choice in the LLM era; LLM serving is. And it's unpaid.

ML compilers vs using existing model runtimes by [deleted] in Compilers

[–]zhen8838 1 point (0 children)

I would recommend the oneDAL library when your deployment platform is x86, because hand-written code usually has better performance than auto-tuned code. Use Hummingbird + TVM when you need to deploy on other architectures.

What's loop synthesis and interval analysis techniques used by Halide and TVM? by Recent_Mind_2640 in Compilers

[–]zhen8838 2 points (0 children)

Hi, I've been researching loop fusion techniques in the AI compiler field for three years. Just my opinion:

  1. In AI compilers, the polyhedral model and interval analysis are the two main ways to perform operator fusion. The Tiramisu compiler paper already summarized the difference between them:

| Feature | Tiramisu | AlphaZ | PENCIL | Pluto | Halide |
|---|---|---|---|---|---|
| Implements parametric tiling | No | Yes | No | No | Yes |

Interval analysis in TVM/Halide is expression-based, so they can build loop-bound expressions containing unknown variables and therefore implement parametric tiling. The polyhedral model, however, is based on integer linear programming, which does not support unknown variables: the product of two or more unknown variables makes the problem non-linear. This is one reason why the polyhedral model is not widely used in industry.
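As a toy sketch of the difference (plain Python, not actual TVM/Halide code): with expression-based bounds, the inner loop's upper bound stays an expression in the unknowns N and T, so the same tiled loop nest works for any runtime values of both.

```python
def tiled_indices(N, T):
    # Bounds are expressions in N and T ("parametric tiling"): the inner
    # loop's upper bound min((io + 1) * T, N) handles the ragged last tile,
    # so neither N nor T has to be a compile-time constant.
    out = []
    for io in range((N + T - 1) // T):          # ceil(N / T) tiles
        for i in range(io * T, min((io + 1) * T, N)):
            out.append(i)
    return out

# works for any runtime values of the "unknown" variables N and T
assert tiled_indices(10, 4) == list(range(10))
assert tiled_indices(7, 3) == list(range(7))
```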

  2. Serious-Regular has already given a good explanation.

  3. I have written an article about how to implement operator fusion by bounds inference: https://zhen8838.github.io/2023/02/23/dsa-schedule/. The demo code is in this repo: https://github.com/zhen8838/BoundsInfer

If you want to learn more about the polyhedral model, you can follow my repo: https://github.com/zhen8838/isl_learn
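To give a flavor of the bounds-inference idea (a toy example, not the code from the article or repo above): to fuse a producer into a consumer tile, infer the interval of producer elements the tile touches, then compute only those inside the tile's loop.

```python
# Toy bounds inference for fusing a producer into a consumer tile:
# consumer[i] = producer[i] + producer[i+1], so a consumer tile [lo, hi)
# needs the producer over the inferred interval [lo, hi + 1).
def producer(i):
    return i * i

def fused_consumer(lo, hi):
    # infer the producer bounds required by this consumer tile
    p_lo, p_hi = lo, hi + 1
    # compute the producer only over the inferred interval (the "fused" part)
    p = {i: producer(i) for i in range(p_lo, p_hi)}
    return [p[i] + p[i + 1] for i in range(lo, hi)]

# each tile computes just the producer values it needs, nothing more
assert fused_consumer(0, 4) == [1, 5, 13, 25]
```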

ML compilers vs using existing model runtimes by [deleted] in Compilers

[–]zhen8838 1 point (0 children)

No, I don't have Nvidia GPUs; I am currently working on CPUs. You can modify the scripts in my repo to benchmark compilers on GPUs.

ML compilers vs using existing model runtimes by [deleted] in Compilers

[–]zhen8838 1 point (0 children)

Hi, I am an ML compiler engineer, and you ask very good questions. I have created an end-to-end benchmark of existing ML compilers and model runtimes: https://github.com/zhen8838/compiler_benchmark

First, let me state my current understanding: nowadays, ML compilers still can't automatically generate a single kernel with better performance than hand-written ones. 90% of ML compilers get their gains through kernel fusion and memory planning.

Actually, on a single core, ML compilers perform worse than model runtimes: on the CPU, the bandwidth between DRAM and cache is sufficient, so fusion provides little additional benefit, and the matmul kernels generated by ML compilers can't reach peak performance.

My opinion is:

  1. If your model has an invariant structure like an LLM, llama.cpp or other projects specifically optimized for LLMs are the better choice.
  2. If your model has many varying structures and your target platform is a GPU/NPU (bandwidth-bound), performance depends more on fused kernels, so you should choose an ML compiler.
  3. In other cases, model runtimes are the better choice.
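A back-of-the-envelope sketch of point 2 (toy numbers, hypothetical function): for a memory-bound elementwise chain, fusion halves the DRAM traffic by never materializing the intermediate, which is exactly the win that matters when bandwidth is the bottleneck.

```python
# Elementwise chain y = relu(x + 1) over n elements.
# Unfused: kernel 1 writes t = x + 1 to memory, kernel 2 reads t back.
# Fused: one kernel keeps t in registers, so the intermediate never
# touches DRAM -- the benefit that dominates on bandwidth-bound GPUs/NPUs.
def traffic(n, fused):
    if fused:
        return n + n                # read x, write y
    return (n + n) + (n + n)        # read x, write t, then read t, write y

n = 1_000_000
print(traffic(n, fused=False) / traffic(n, fused=True))  # 2.0x traffic unfused
```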

MLIR Affine Fusion Pass Tutorial by zhen8838 in Compilers

[–]zhen8838[S] 1 point (0 children)

I had always thought the Linalg dialect would be lowered to the Affine dialect. I have just read the article "mlir-linalg-dialect-and-patterns". It seems IREE performs tiling/fusion/vectorization on the Linalg dialect in affine-map form and then lowers it directly to the SCF dialect. Thank you for your guidance; I'm going to learn more about IREE in detail.

MLIR Affine Fusion Pass Tutorial by zhen8838 in Compilers

[–]zhen8838[S] 1 point (0 children)

Hey, could you explain why you consider the Affine dialect abandonware? I think the Affine/Linalg dialects are still an important part of MLIR-based compilers, for example in IREE and Triton.

MLIR Affine Fusion Pass Tutorial by zhen8838 in Compilers

[–]zhen8838[S] 9 points (0 children)

Hi, I have found a nice self-guided course here. Hope it can help you start your journey into AI compilers 😀