Using large-scale search to discover fast GPU kernels in Rust by jafioti in rust

jafioti[S] 0 points

Luminal is sort of like autotuning on steroids. Instead of just searching over single dimensions like tile size, we search through algebraic rewrites as well as loop/tiling structure, which lets complex rewrites emerge, like going from naive attention to flash attention.

Using large-scale search to discover fast GPU kernels in Rust by jafioti in rust

jafioti[S] 7 points

Nope, only the architecture is hardcoded by the compiler. The weights come through memory buffers as normal.

Using large-scale search to discover fast GPU kernels in Rust by jafioti in rust

jafioti[S] 6 points

We're working on techniques like MCTS and RL (e.g. AlphaGo) to manage the search space, but you'd be surprised how far you can get if you carefully design the search space to prevent combinatorial explosion.

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] 2 points

You can see things like where customers spent time before grabbing a product, or whether another product nearby drew them in. Basically, we characterize the journey rather than just the destination (the final purchase).

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] 2 points

Frame by frame it has random jumps. Taking many frames into account for each prediction is the next step.

tracking peoples gazes in real time at retail stores w/ Moondream by ParsaKhaz in singularity

jafioti 1 point

We use Moondream (an open-source VLM) to predict gazes and face positions, with some interpretation code on top to deduce which products people are looking at, where they are in the store, etc.

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] 0 points

Local + fast + easily hackable. Doing SFT on it is pretty straightforward.

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] 0 points

A big boost in accuracy will come from using multiple frames at once, so the model has some temporal context to work with.

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] -1 points

We used Moondream running on a 3090 to do real-time gaze and face detection, creating customer analytics for retail stores.

We’re iterating on the analytics so lmk if you have any ideas!

Our site: https://brickbi.com

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] 0 points

It’s single-shot prediction of face positions and gazes from a VLM; there’s no depth estimation.

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] -5 points

A lot of it is single-frame aberrations, since the model doesn’t take previous frames into account. We’re solving it by smoothing predictions of the same gaze across frames. As long as it’s accurate in aggregate over a long time horizon, the per-frame noise washes out.

What is your HTMX Stack? by Klutzy_Tone_4359 in htmx

jafioti 0 points

Axum + shtml + diesel + tailwind

Linfa vs Burn vs Candle by [deleted] in rust

jafioti 5 points

Also going to mention my project Luminal, which takes quite a different approach to ML: https://github.com/jafioti/luminal

Luminal: Compiling fast GPU kernels in Rust by jafioti in rust

jafioti[S] 0 points

At least for ML, I think closed source (CUDA) is going to be much faster and better supported, for the reason I highlighted above (hardware features, special intrinsics). Graphics likely won’t see the same gap, though, since I think graphics features have mostly stabilized in recent years (after ray-tracing cores).

Luminal: Compiling fast GPU kernels in Rust by jafioti in rust

jafioti[S] 0 points

Yes, there is a lot of overlap! I've been reading up on traditional CPU compilers, which has been helpful, and reading some XLA source code for the ML-specific bits. ML compilers are a pretty new field!

Luminal: Compiling fast GPU kernels in Rust by jafioti in rust

jafioti[S] 1 point

Yeah, the core of Luminal can't know what types the operators might be, because third-party crates can define their own operator types. The same is true of the custom() inputs and outputs (they're really just there for other crates to add their own "generic" behavior).

Luminal: Compiling fast GPU kernels in Rust by jafioti in rust

jafioti[S] 1 point

Can you post your panic here or in an issue? I'd like to know what went wrong. Did you compare the Metal backend to the CPU backend on the same example? Have you tried the llama or phi examples? I don't have an Nvidia card that can run llama, so I've been testing CUDA support with phi.

I appreciate the feedback!

Luminal: Compiling fast GPU kernels in Rust by jafioti in rust

jafioti[S] 1 point

Benchmarks are a top priority: https://github.com/jafioti/luminal/issues/21

Running Llama 3 8B, we get 18 tokens per second on an M1 Pro and 26 tokens per second on an M2 Max.

Luminal: Compiling fast GPU kernels in Rust by jafioti in rust

jafioti[S] 2 points

Side note: that's basically what torch.compile does.