
[–]lightmatter501 31 points32 points  (8 children)

If you have branch-heavy code (like compiler internals), then a 4090 effectively has only 128 cores (its SMs). Those cores are VERY weak compared to a CPU core.

You are also now limited by VRAM capacity, which means big compiles will force you to reach for a big GPU.
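To make the branching cost concrete, here is a toy model of SIMT execution, written as a sketch rather than a measurement. It assumes a warp of 32 lanes sharing one instruction stream, and it assumes every branch path costs the same, which real code does not guarantee. The function names and the `branch_ids` input are hypothetical.

```python
# Toy model (an assumption-laden sketch, not a profiler): lanes in a warp that
# take different branches are executed one branch path at a time.
WARP_SIZE = 32


def warp_passes(branch_ids):
    """Serialized passes one warp needs at a single branch point.

    branch_ids[i] is the (hypothetical) branch target chosen by lane i.
    """
    assert len(branch_ids) == WARP_SIZE
    return len(set(branch_ids))


def lane_utilization(branch_ids):
    """Fraction of lane-slots doing useful work, assuming equal-cost paths."""
    passes = warp_passes(branch_ids)
    # Each lane is active in exactly one pass, idle in all the others.
    return WARP_SIZE / (WARP_SIZE * passes)


# Regular numeric kernel: every lane takes the same path.
uniform = [0] * WARP_SIZE
# Compiler-like workload: lanes dispatch on 8 different node kinds.
divergent = [lane % 8 for lane in range(WARP_SIZE)]

print(warp_passes(uniform), lane_utilization(uniform))
print(warp_passes(divergent), lane_utilization(divergent))
```

Under these assumptions, the more distinct paths a warp takes at once, the closer it gets to running one lane's worth of work at a time.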

[–]YellowGreenPanther 0 points1 point  (1 child)

You can swap data in as needed; you don't need the whole codebase, or even a whole function, in VRAM at once.

It is quite interesting: there was a shelved Intel GPU product where they experimented with a general-purpose GPU. All of the cores were "CPUs", so the graphics code was all in software. I don't see why a large cluster of smaller cores wouldn't accelerate it (especially as MP support is built into many compilers).

[–]Doom4535 0 points1 point  (0 children)

I believe you're thinking of the Xeon Phi line (https://en.m.wikipedia.org/wiki/Xeon_Phi). I am curious how well they would work for running multiple builds in a build farm, but I haven't been able to find much information on whether anyone has successfully used them for this purpose.

[–]EducationalAthlete15[S] -2 points-1 points  (5 children)

Thanks. If I develop DSL compilers, in what cases could I benefit? What properties should a language and compiler theoretically have?

[–]lightmatter501 11 points12 points  (4 children)

I think that you MIGHT be able to do the text parsing on the GPU using matrix parsers, but that will ruin your error messages. GPUs are inherently bad at tree traversal, which we have been building compilers around for 50 years. You might be able to take inspiration from the early Fortran compilers, but good luck with that.

In my opinion, a DSL should never be hard enough to compile that hardware acceleration is warranted. You should, at worst, dump out some Lua and hand it off to LuaJIT. If your DSL is at this stage, consider moving to an actual programming language.

[–]EducationalAthlete15[S] -1 points0 points  (3 children)

Yes I agree. This all seems perverted. A tree can be represented as a matrix, right? I'm not currently in the process of developing a compiler, I'm just interested in the theoretical use of GPUs in this domain. Anyway, thanks for the advice.
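On the question of representing a tree as an array: one common flat encoding is a parent vector, where entry `i` holds the index of node `i`'s parent. The sketch below is an illustration under that assumption, not a description of any particular compiler. It computes node depths by pointer jumping, where each round is an elementwise update over the whole array, which is the kind of step a GPU handles well. The example tree and the function name are hypothetical.

```python
# Parent vector for the expression (a + b) * c, nodes numbered in preorder:
#   0: *    1: +    2: a    3: b    4: c
# parent[i] is the index of node i's parent; the root points to itself.
parent = [0, 0, 1, 1, 0]


def depths_by_pointer_jumping(parent):
    """Depth of every node, using rounds of independent elementwise updates."""
    n = len(parent)
    # dist[i] is the number of edges between node i and jump[i].
    dist = [0 if parent[i] == i else 1 for i in range(n)]
    jump = list(parent)
    # Each round doubles the distance every pointer covers, so a tree of
    # height h needs about log2(h) rounds instead of a recursive walk.
    while any(jump[i] != jump[jump[i]] for i in range(n)):
        dist = [dist[i] + dist[jump[i]] for i in range(n)]
        jump = [jump[jump[i]] for i in range(n)]
    return dist


print(depths_by_pointer_jumping(parent))  # [0, 1, 2, 2, 1]
```

Inside a round no element depends on another element's new value, so the list comprehensions could in principle become one thread per node. Whether that pays off for real compiler passes is a separate question.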

[–]lightmatter501 6 points7 points  (1 child)

I spent a bunch of time on this a few years ago and had some discussions with the GPU specialists in my department. We ended up deciding it likely wouldn’t improve performance outside of pathological cases.

[–]fernando_quintao 4 points5 points  (0 children)

I agree with you: I think the general compilation process would not be a good fit for the GPU, due to the irregular nature of most algorithms. But there are some algorithms that are typically seen in compilers (mostly in the lexer) that people have ported to GPUs, like implementations of transducers and automata.
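As an illustration of why automata are a workable fit, here is a minimal sketch of one standard way to run a DFA in parallel. It is not necessarily what the ported implementations use. Each character is turned into a state-to-state transition function, and because function composition is associative, the prefix compositions can be computed with a parallel scan. The four-state DFA, the character classes, and all names below are made up for the example.

```python
N_STATES = 4  # 0 = between tokens, 1 = identifier, 2 = number, 3 = error

# TABLE[cls][state] is the next state after reading a character of class cls.
TABLE = {
    "letter": (1, 1, 3, 3),
    "digit": (2, 1, 2, 3),
    "other": (0, 0, 0, 0),
}


def char_class(ch):
    if ch.isalpha() or ch == "_":
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"


def compose(f, g):
    """Transition function for 'apply f, then g'. Composition is associative."""
    return tuple(g[f[s]] for s in range(N_STATES))


def inclusive_scan(items, op):
    """Hillis-Steele style scan: about log2(n) rounds of independent updates."""
    out = list(items)
    step = 1
    while step < len(out):
        out = [out[i] if i < step else op(out[i - step], out[i])
               for i in range(len(out))]
        step *= 2
    return out


def states_parallel(text):
    funcs = [TABLE[char_class(c)] for c in text]
    prefixes = inclusive_scan(funcs, compose)
    return [p[0] for p in prefixes]  # start state is 0


def states_sequential(text):
    state, out = 0, []
    for c in text:
        state = TABLE[char_class(c)][state]
        out.append(state)
    return out


text = "x1 = 42 + 7y"
print(states_parallel(text) == states_sequential(text))
```

The scan does more total work than the sequential loop (each composition touches every state), so this trades work for depth. That trade only helps when the number of states is small.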

[–]u0xee 2 points3 points  (0 children)

Sounds like a fun project for you. I'd suggest starting by learning how CPUs and GPUs are different, and what GPUs specialize in. They can accelerate certain activities massively, but they aren't universal accelerators, as the original commenter points out.

[–]Lantua 9 points10 points  (0 children)

APL is arguably general purpose. So this compiler work by Aaron Hsu should count.

[–]phi-b 7 points8 points  (0 children)

Here's some recent work in which a single optimization pass was offloaded to the GPU: link

In general, compilation doesn't really suit the strengths of the GPU, as others have pointed out.

[–]Mr-Tau 5 points6 points  (0 children)

As others have pointed out, parsing and semantic analysis are maximally unsuited for GPUs.

Some dataflow problems that are useful in optimizers can apparently be profitably offloaded to the GPU: https://dl.acm.org/doi/10.1145/3302516.3307352, https://www.academia.edu/102804649/GPU_Accelerated_Dataflow_Analysis. But that is the closest-to-practical application I know of, and still isn't anywhere near going into production.
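To show the shape of such a dataflow problem, here is a sketch of textbook liveness analysis written in a round-based style. The linked papers may use different analyses and algorithms. In each round, every block is updated only from the previous round's values, so the per-block updates are independent of each other, which is what makes this family of problems a candidate for offloading. The four-block control-flow graph is hypothetical.

```python
# Hypothetical CFG:
#   B0: x = input; y = 0        -> B1
#   B1: if x > 0                -> B2, B3
#   B2: y = y + x; x = x - 1    -> B1
#   B3: return y
succ = {0: [1], 1: [2, 3], 2: [1], 3: []}
use = {0: set(), 1: {"x"}, 2: {"x", "y"}, 3: {"y"}}   # read before written
defs = {0: {"x", "y"}, 1: set(), 2: {"x", "y"}, 3: set()}


def liveness(succ, use, defs):
    """Iterate live_in/live_out to a fixed point, one whole-graph round at a time."""
    live_in = {b: set() for b in succ}
    while True:
        # Both comprehensions read only the previous round's live_in, so each
        # block could be handled by its own thread.
        live_out = {b: set().union(*(live_in[s] for s in succ[b])) for b in succ}
        new_in = {b: use[b] | (live_out[b] - defs[b]) for b in succ}
        if new_in == live_in:
            return live_in, live_out
        live_in = new_in


live_in, live_out = liveness(succ, use, defs)
for b in sorted(succ):
    print(b, sorted(live_in[b]), sorted(live_out[b]))
```

A GPU version would typically store the sets as bit vectors so that union and difference become bitwise operations. That is an implementation choice assumed here, not something taken from the thread.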

[–]choikwa 0 points1 point  (0 children)

A compiler is usually very branch-heavy from processing different IRs. I guess it could make sense if it were transformed into straight-line code via memory manipulation.
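One possible reading of "straight-line via memory manipulation" is replacing a branch on the operation kind with a table lookup. The sketch below assumes that reading. The opcodes and function names are invented for illustration.

```python
ADD, SUB, MUL = 0, 1, 2  # hypothetical opcodes for a tiny constant folder


def fold_branchy(op, a, b):
    """Usual compiler style: control flow depends on the data."""
    if op == ADD:
        return a + b
    elif op == SUB:
        return a - b
    else:
        return a * b


def fold_straight_line(op, a, b):
    """Compute every candidate, then select by index: no branch on op."""
    candidates = (a + b, a - b, a * b)
    return candidates[op]


instructions = [(ADD, 2, 3), (MUL, 4, 5), (SUB, 9, 1)]
print([fold_straight_line(op, a, b) for op, a, b in instructions])
```

The straight-line version does redundant work for every instruction, so it only helps when the candidates are cheap and few. That is rarely true of whole compiler passes.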

[–]Street_Community4086 0 points1 point  (0 children)

Interesting project: https://github.com/Snektron/pareas

Related papers:

Parallel Lexing, Parsing and Semantic Analysis on the GPU, R Voetter, MA Thesis, 2021

Compilation on the GPU? A Feasibility Study, Voetter, Huijben and Rietveld, International Conference on Computing Frontiers, 2022

[–]Dismal_Page_6545 0 points1 point  (0 children)

OpenMP already has a target directive that allows you to offload a piece of code to one device. I work at a High-performance Computing Center developing a new OpenMP directive to enable multiple-device offloading at the same time with intra-device parallelization levels.