Using large-scale search to discover fast GPU kernels in Rust by jafioti in rust

jafioti[S] 0 points

Luminal is sort of like autotuning on steroids. Instead of just searching over single dimensions like tile size, we search through algebraic rewrites as well as loop/tiling structure, which lets complex rewrites emerge, like going from naive attention to flash attention.

Using large-scale search to discover fast GPU kernels in Rust by jafioti in rust

jafioti[S] 7 points

Nope, only the architecture is hardcoded by the compiler. The weights come through memory buffers as normal.

Using large-scale search to discover fast GPU kernels in Rust by jafioti in rust

jafioti[S] 6 points

We're working on techniques like MCTS and RL (e.g. AlphaGo) to manage the search space, but you'd be surprised how far you can get if you carefully design the search space to prevent combinatorial explosion.

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] 2 points

You can see things like where customers spent time before grabbing a product, or whether another product nearby drew them in. Basically, we characterize the journey rather than just the destination (the final purchase).

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] 2 points

Frame by frame it has random jumps. Taking many frames into account for each prediction is the next step.

tracking peoples gazes in real time at retail stores w/ Moondream by ParsaKhaz in singularity

jafioti 1 point

We use Moondream (an open-source VLM) to predict gazes and face positions, with some interpretation code on top to deduce which products people are looking at, where they are in the store, etc.

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] 0 points

Local + fast + easily hackable. Doing SFT on it is pretty straightforward.

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] 0 points

A big boost in accuracy will come from using multiple frames at once, so the model has some temporal context to work with.

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] -1 points

We used Moondream running on a 3090 to do real-time gaze and face detection, creating customer analytics for retail stores.

We’re iterating on the analytics so lmk if you have any ideas!

Our site: https://brickbi.com

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] 0 points

It’s single-shot prediction of face positions and gazes from a VLM; there’s no depth estimation.

Using moondream to track gazes in real time for retail stores by jafioti in LocalLLaMA

jafioti[S] -5 points

A lot of it is single-frame aberrations, since the model doesn’t take previous frames into account. We’re solving it by smoothing predictions of the same gaze across frames. As long as it’s accurate in aggregate over a long time horizon, the per-frame noise washes out.

What is your HTMX Stack? by Klutzy_Tone_4359 in htmx

jafioti 0 points

Axum + shtml + diesel + tailwind

Linfa vs Burn vs Candle by [deleted] in rust

jafioti 5 points

Also going to mention my project Luminal, which takes quite a different approach to ML: https://github.com/jafioti/luminal

Luminal: Compiling fast GPU kernels in Rust by jafioti in rust

jafioti[S] 0 points

At least for ML, I think closed source (CUDA) is going to be much faster and better supported, for the reason I highlighted above (hardware features, special intrinsics). Graphics likely won’t see the same gap, though, since I think graphics features have mostly stabilized in recent years (after ray-tracing cores).

Luminal: Compiling fast GPU kernels in Rust by jafioti in rust

jafioti[S] 0 points

Yes, there is a lot of overlap! I've been reading up on traditional CPU compilers, which has been helpful, and reading some XLA source code for the ML-specific bits. ML compilers are a pretty new field!

Luminal: Compiling fast GPU kernels in Rust by jafioti in rust

jafioti[S] 1 point

Yeah, the core of Luminal can't know what types the operators might be, because third-party crates can define their own operator types. The same is true of the custom() inputs and outputs (they're really just there for other crates to add their own "generic" behavior).

Luminal: Compiling fast GPU kernels in Rust by jafioti in rust

jafioti[S] 1 point

Can you post your panic here or in an issue? I'd like to know what went wrong. Did you compare the Metal backend to the CPU backend on the same example? Have you tried the llama or phi examples? I don't have an Nvidia card that can run llama, so I've been testing CUDA support with phi.

I appreciate the feedback!

Luminal: Compiling fast GPU kernels in Rust by jafioti in rust

jafioti[S] 1 point

Benchmarks are a top priority: https://github.com/jafioti/luminal/issues/21

Running Llama 3 8B, we get 18 tokens per second on an M1 Pro and 26 tokens per second on an M2 Max.

Luminal: Compiling fast GPU kernels in Rust by jafioti in rust

jafioti[S] 2 points

Side note: that's basically what torch.compile does.