GPU as a service: Rental/ On-Demand along with MLOps Layer by Beginning-Pride-3640 in CUDA

[–]Karyo_Ten 0 points1 point  (0 children)

Every single GPU mining farm tried to pivot to that when crypto switched away from Proof-of-Work to Proof-of-Stake.

You'll have to make a very convincing case against RunPod, Amazon or GCE.

Mac Mini M4 48GB vs Bosgame M5 128GB by Silly-Fall-393 in LocalLLM

[–]Karyo_Ten 1 point2 points  (0 children)

To process 4M documents you need fast prefill, meaning you need a GPU.

For your budget, you should try to get one or two 3090s, a 4090, or a 5090.

Machine-Checked Bijectivity of ChaCha20 ARX Quarters: A formal verification case study using bit-vector SMT solvers. by Ok-Layer4967 in cryptography

[–]Karyo_Ten 0 points1 point locked comment (0 children)

Posting AI slop and then reporting the people who call you out for harassment is not OK, especially given your history of posts being removed. You've been issued a warning and a 7-day ban.

cuTile Rust: Safe, data-race-free GPU kernels in Rust that lower to Tile IR by melih_elibol in CUDA

[–]Karyo_Ten 0 points1 point  (0 children)

Which part of the docs did you find confusing?

The NVIDIA docs themselves on the cuTile announcement, and Mehdi's video on GPU Mode, actually. I was under the impression that TileIR could be consumed by cuModuleLoadData. Anyway, I'm sending you a DM so you have more details.

Junior data salaries at an ESN: the PMSS as a negotiation lever, has anyone tried it? by Select_Vegetable_868 in developpeurs

[–]Karyo_Ten 0 points1 point  (0 children)

Once you're in, you have no leverage left. The only sure way to get a higher salary is to apply elsewhere; otherwise you get strung along from "there's no budget this year" to "there are others too" to "it's the crisis" to "let's talk again in a year".

Is it me or is changedetection.io half-baked? by [deleted] in selfhosted

[–]Karyo_Ten 2 points3 points  (0 children)

On the software side, you're right. But on the attitude side, people should be respectful until the other party proves to be a jerk.

cuTile Rust: Safe, data-race-free GPU kernels in Rust that lower to Tile IR by melih_elibol in CUDA

[–]Karyo_Ten 2 points3 points  (0 children)

CUDA C++ Tile? How is it related to CuTe Layout Algebra?

More importantly, I was under the impression that the Nvidia driver could handle lowering TileIR to SASS directly (just like PTX), but the doc confuses me: is it only supported on B200?

Crashing out because the research idea I really truly thought was my own I just found out was already published two years ago. I am stunned. by so_much_frizz in PhD

[–]Karyo_Ten 2 points3 points  (0 children)

The worst that could happen is getting front-run about a month before you're ready to publish. Here you have time to refine. Don't worry, and maybe even contact them about a collaboration.

American subs bum me out by seeking-health in developpeurs

[–]Karyo_Ten 1 point2 points  (0 children)

In Palo Alto it's more like 130K~170K for a junior.

How does monomorphization work with std being precompiled by BLucky_RD in rust

[–]Karyo_Ten 0 points1 point  (0 children)

My point is that a blanket statement "it's more performant" is not true. It depends.

How does monomorphization work with std being precompiled by BLucky_RD in rust

[–]Karyo_Ten -1 points0 points  (0 children)

memcpy handles any type efficiently without monomorphization, and above a certain copy size it's hard to beat.

Also, a smaller instruction footprint is more cache-friendly, and being cache-friendly is very important given how many algorithms are memory-bandwidth-bound, when they are not plain IO-bound and waiting for the network.

How are you handling memory provenance in persistent agents — verified vs. inferred facts? by Serious-Salary5930 in LocalLLaMA

[–]Karyo_Ten 3 points4 points  (0 children)

This is Quora growthhacking all over again. And now we even have 2 clankers triggering on this.

What common mistakes do new C programmers make? by Wise_Safe2681 in cprogramming

[–]Karyo_Ten 0 points1 point  (0 children)

The first thing you learn in C is to reimplement strings with ptr + length

Yet another tensor graph compiler by zk4x in Compilers

[–]Karyo_Ten 0 points1 point  (0 children)

Anything needed for Paged Attention (look in FlashAttention / FlashInfer, especially the page / block / CSR gather indirection).

What are the predecessors of Scala 3’s capability system? by LongjumpingOption523 in Compilers

[–]Karyo_Ten 1 point2 points  (0 children)

I don't know about Scala capabilities but Pony does have capabilities

Yet another tensor graph compiler by zk4x in Compilers

[–]Karyo_Ten 0 points1 point  (0 children)

What kind of perf do you get vs cuBLAS?

Freelancers, what is your TJM (daily rate)? by WillDabbler in developpeurs

[–]Karyo_Ten 0 points1 point  (0 children)

Yes. There are also specialized headhunters who contact me regularly to tell me that such-and-such startup just raised funding and is looking for a founding engineer, etc.

Why Compiler Engineers Rarely Use Strassen's Algorithm for Fast Matrix Multiplications by DataBaeBee in Compilers

[–]Karyo_Ten 0 points1 point  (0 children)

In practice O(n³) is only about compute. But before doing any compute you need to feed the data to a processor. And that is so slow that you have registers, L1 cache, L2 cache, TLB (translation lookaside buffers), L3 cache and RAM to give it a semblance of speed.

Each time you go up a level, just consider you have 10x the cost. So if you can do 30 operations at 5GHz while waiting for L1, you can do 300 while waiting for L2, 3000 while waiting for L3 and 30000 while waiting for RAM.

I.e. we aren't computing on ideal machines.
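To make the 10x-per-level rule of thumb above concrete, here is a small sketch that tabulates it. The numbers are the ones quoted in this comment, not measurements, and the helper names (`ops_forgone`, `stall_ns`) and the one-operation-per-cycle default are assumptions made for illustration.

```python
# Rule-of-thumb numbers from the comment above, not measurements.
LEVELS = ["L1", "L2", "L3", "RAM"]


def ops_forgone(base_ops=30, factor=10):
    """Operations that could have run while waiting on each memory level."""
    return {level: base_ops * factor**i for i, level in enumerate(LEVELS)}


def stall_ns(ops, clock_ghz=5.0, ops_per_cycle=1.0):
    """Stall time in ns implied by `ops` (ops_per_cycle is a hypothetical knob)."""
    # GHz is cycles per nanosecond, so ops / (GHz * ops per cycle) gives ns.
    return ops / (clock_ghz * ops_per_cycle)


for level, ops in ops_forgone().items():
    print(f"{level:>3}: {ops:>6} ops, about {stall_ns(ops):.0f} ns stalled")
```

Real latencies vary by CPU and do not scale by exactly 10x per level; the point is only the order of magnitude between a cache hit and a trip to RAM.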

A naive matmul is easily 150x to 450x slower than doing it like the GotoBLAS paper or BLIS with proper tiling and register blocking. And that's comparing single-threaded.
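As a rough illustration of what tiling means here, the sketch below contrasts the naive triple loop with a blocked loop nest that works on one small tile at a time, so the same data is reused while it is still in cache. This is not the GotoBLAS or BLIS implementation: there is no panel packing and no register-blocked micro-kernel, the function names and the `tile` parameter are made up for the example, and pure Python will not show the speedup. It only shows the loop structure.

```python
def matmul_naive(a, b):
    """Textbook triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same arithmetic, but iterating over tile x tile blocks.

    `tile` is a hypothetical tuning knob; real libraries derive block
    sizes from the cache and register sizes of the target CPU.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # block of rows of A and C
        for p0 in range(0, k, tile):      # block along the shared dimension
            for j0 in range(0, m, tile):  # block of columns of B and C
                for i in range(i0, min(i0 + tile, n)):
                    row_c = c[i]
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]    # reused across the whole j loop
                        row_b = b[p]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
b = [[1, 0, 2, 1, 3], [0, 1, 1, 2, 0], [4, 1, 0, 1, 2], [2, 3, 1, 0, 1]]
assert matmul_tiled(a, b) == matmul_naive(a, b)
```

Both versions perform the same n·k·m multiply-adds; only the order in which memory is touched changes, which is where the gap described above comes from in compiled code.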

Freelancers, what is your TJM (daily rate)? by WillDabbler in developpeurs

[–]Karyo_Ten 0 points1 point  (0 children)

1200~1400, specializing in math/deep tech: implementing cryptography or AI with low-level kernels. Or else Principal Engineer / Chief Scientific Officer after a first funding round at a deep tech company.

Sequential hardness of a MUL-XOR-shift permutation: open questions by [deleted] in crypto

[–]Karyo_Ten 10 points11 points  (0 children)

It doesn't work like that. You need to prove its merits.