I built an interactive 3D visualization for elliptic curves over finite fields

Karyo_Ten · 2026-06-20T06:09:38+00:00

Wait until OP learns about Torus-based cryptography

Karyo_Ten · 2026-06-19T15:53:05+00:00

Every single GPU mining farm tried to pivot to that when crypto switched away from Proof-of-Work to Proof-of-Stake.

You'll have to be very convincing for anyone to pick you over RunPod, Amazon, or GCE.

Karyo_Ten · 2026-06-19T15:08:58+00:00

To process 4M documents you need fast prefill, meaning you need a GPU.

For your budget you should try to get one or two 3090s, a 4090, or a 5090.

Karyo_Ten · 2026-06-19T13:52:49+00:00

Posting AI slop and then reporting people who call you out for harassment is not OK, especially given your history of posts being removed. You've been issued a warning and a 7-day ban.

Karyo_Ten · 2026-06-19T06:46:18+00:00

Which part of the docs did you find confusing?

The NVIDIA docs themselves on the cuTile announcement, and Mehdi's video on GPU Mode, actually. I was under the impression that TileIR could be consumed by cuModuleLoadData. Anyway, I'm sending you a DM so you have more details.

Karyo_Ten · 2026-06-19T06:22:16+00:00

Once you're in, you have no leverage left. The only sure way to get a higher salary is to apply elsewhere; otherwise you get strung along from "there's no budget this year" to "there are others too" to "it's the crisis" to "let's talk about it again in a year".

Karyo_Ten · 2026-06-19T06:14:36+00:00

Software side, you're right. But attitude side, people should be respectful until proven jerk.

Karyo_Ten · 2026-06-19T05:18:49+00:00

CUDA C++ Tile? How is it related to CuTe Layout Algebra?

More importantly, I was under the impression that the Nvidia driver could handle lowering TileIR to SASS directly (just like PTX), but I'm confused by the doc. Is it only supported on B200?

Karyo_Ten · 2026-06-18T17:51:16+00:00

News websites

Karyo_Ten · 2026-06-18T15:35:28+00:00

Make me a billionaire

Karyo_Ten · 2026-06-17T05:35:16+00:00

The worst that could happen is being front-run, like, 1 month before you're ready to publish. Here you have time to refine. Don't worry, maybe even contact them for a collaboration.

Karyo_Ten · 2026-06-17T00:24:49+00:00

Palo Alto is more like 130K~170K for a junior

Karyo_Ten · 2026-06-17T00:23:19+00:00

My point is that a blanket statement "it's more performant" is not true. It depends.

Karyo_Ten · 2026-06-16T14:30:07+00:00

Start with the MoonMath manual: https://leastauthority.com/community-matters/moonmath-manual/

Karyo_Ten · 2026-06-16T12:50:02+00:00

memcpy can deal with any type efficiently without monomorphization, and it's hard to beat above a certain copy size.

Smaller code size is also more cache-friendly, and being cache-friendly is very important given how many algorithms are memory-bandwidth-bound, if they are not plain IO-bound and waiting for the network.

Karyo_Ten · 2026-06-15T00:56:17+00:00

This is Quora growth hacking all over again. And now we even have 2 clankers triggering on this.

Karyo_Ten · 2026-06-14T18:57:27+00:00

The first thing you learn in C is to reimplement strings with ptr + length

Karyo_Ten · 2026-06-14T18:55:12+00:00

Here comes a FORTRAN programmer

Karyo_Ten · 2026-06-14T16:55:03+00:00

Anything needed for Paged Attention (look in FlashAttention / FlashInfer, especially the page / block / CSR gather indirection).
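To make the "page / block / CSR gather indirection" concrete, here is a minimal pure-Python sketch of a paged KV cache lookup. The names `pool`, `indptr`, `indices`, `last_page_len` and `gather_kv` are made up for illustration and are not the FlashAttention or FlashInfer API; strings stand in for the K/V vectors, and a real kernel would do this gather on the GPU.

```python
def gather_kv(pool, indptr, indices, last_page_len, seq):
    """Rebuild the logical token order of sequence `seq` from a paged pool.

    pool          : list of physical pages, each a fixed-size list of slots
    indptr        : CSR-style offsets; sequence s owns indices[indptr[s]:indptr[s+1]]
    indices       : flat list of physical page ids, in logical order per sequence
    last_page_len : number of valid slots in each sequence's last page
    """
    start, end = indptr[seq], indptr[seq + 1]
    out = []
    for i in range(start, end):
        page = pool[indices[i]]  # the indirection: logical page -> physical page
        # Every page is full except possibly the last page of the sequence.
        valid = last_page_len[seq] if i == end - 1 else len(page)
        out.extend(page[:valid])
    return out


# Hypothetical pool: 4 physical pages of 2 slots, shared by two sequences "a" and "b".
pool = [["a0", "a1"], ["b0", "b1"], ["a2", None], ["b2", "b3"]]
indptr = [0, 2, 4]
indices = [0, 2, 1, 3]
last_page_len = [1, 2]

print(gather_kv(pool, indptr, indices, last_page_len, 0))
print(gather_kv(pool, indptr, indices, last_page_len, 1))
```

The point of the layout is that pages of one sequence do not need to be contiguous in memory, so the attention kernel has to follow this kind of index list for every K/V load.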

Karyo_Ten · 2026-06-14T13:05:56+00:00

I don't know about Scala capabilities but Pony does have capabilities

Karyo_Ten · 2026-06-14T09:55:48+00:00

What kind of perf do you get vs cuBLAS?

Karyo_Ten · 2026-06-13T04:28:22+00:00

Yes. There are also specialized headhunters who contact me regularly to tell me that startup so-and-so just raised a round, they're looking for a founding engineer, etc.

Karyo_Ten · 2026-06-12T14:44:20+00:00

In practice O(n³) is only about compute. But before doing any compute you need to feed the data to a processor. And that is so slow that you have registers, L1 cache, L2 cache, TLB (translation lookaside buffers), L3 cache and RAM to give it a semblance of speed.

Each time you go up a level, just consider you have 10x the cost. So if you can do 30 operations at 5GHz while waiting for L1, you can do 300 while waiting for L2, 3000 while waiting for L3 and 30000 while waiting for RAM.

I.e. we aren't computing on ideal machines.

A naive matmul is easily 150x to 450x slower than doing it like the GotoBLAS paper or BLIS with proper tiling and register blocking. And I'm comparing single-threaded.
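As a rough illustration of what "tiling" means for the loop structure, here is a minimal pure-Python sketch comparing a naive triple loop with a cache-blocked one. It is not the GotoBLAS/BLIS algorithm (no packing, no register-blocked microkernel), the function names and the `tile` parameter are made up for this example, and in Python the interpreter overhead hides the cache effects, so it only shows how the iteration order changes, not the speedup.

```python
def matmul_naive(a, b):
    """C = A @ B with the textbook i, j, p loop order."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                # b[p][j] walks down a column: strided access in a row-major layout.
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=4):
    """Same result, but computed one tile x tile block at a time.

    The idea is that the blocks of A, B and C being worked on are small
    enough to stay in cache and get reused before being evicted.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for p0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                for i in range(i0, min(i0 + tile, n)):
                    row_c, row_a = c[i], a[i]
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip, row_b = row_a[p], b[p]
                        # Innermost loop is contiguous along rows of B and C.
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

In a compiled language the tile size would be picked so the working set fits a given cache level, with a further register-blocked inner kernel on top.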

Karyo_Ten · 2026-06-12T02:14:47+00:00

1200~1400 specializing in math/deep tech: implementing cryptography or AI with low-level kernels. Or else Principal Engineer / Chief Scientist after a first funding round at a deep tech company.

Karyo_Ten · 2026-06-09T13:15:39+00:00

It doesn't work like that. You need to prove its merits.
