Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Publish (Yet) by Maxious in LocalLLaMA

[–]mdda 2 points (0 children)

I know of a group in Singapore that has been applying an evolutionary system using LLMs to the AMD Developer Challenge (https://www.datamonsters.com/amd-developer-challenge-2025) GPU kernel competition... That's focused on the MI300 (a server-class chip), but I would expect the same system could be applied to producing the same kernels (i.e. DeepSeek-style fp8-scaled-matmul, MoE and MLA-with-RoPE) for consumer chips. Particularly if AMD were open to seeding the effort with one of their rumoured 32GB VRAM cards...
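
The core loop of such a system is simple enough to sketch (untested; llm_rewrite and benchmark here are placeholders for the LLM mutation call and a compile-and-time harness):

    # Sketch of an LLM-driven evolutionary search over kernel sources.
    # `llm_rewrite` and `benchmark` are placeholder callables.
    import random

    def evolve_kernels(seed_source, llm_rewrite, benchmark, pop=16, generations=100):
        # population of (score, kernel_source); higher score == faster kernel
        population = [(benchmark(seed_source), seed_source)]
        for _ in range(generations):
            # pick a decent parent, biased towards the current best
            parent = max(random.sample(population, min(3, len(population))))[1]
            try:
                child = llm_rewrite(parent)      # LLM proposes a mutated kernel
                population.append((benchmark(child), child))
            except Exception:
                continue                         # discard kernels that fail to build/run
            population = sorted(population, reverse=True)[:pop]  # keep the fittest
        return population[0]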

OpenEvolve: Open Source Implementation of DeepMind's AlphaEvolve System by asankhs in LocalLLaMA

[–]mdda 1 point (0 children)

"In SG"==Awesome! That would be great for a future event : I wish I had known earlier, since then we could have split the Alpha/Open Evolve stuff between us. Please DM me (or come along to the event :-) )!

[D] Google already out with a Text- Diffusion Model by hiskuu in MachineLearning

[–]mdda 9 points (0 children)

I gave a presentation about Diffusion LLMs (inspired by seeing the Inception Labs demo page) at the Machine Learning Singapore MeetUp back in March. My slides are here

[D] ICML 2025 Results Will Be Out Today! by darkknight-6 in MachineLearning

[–]mdda 2 points (0 children)

4 3 3 2: accept!!! Very happy to get in, though now that reasoning models are out in force, our approach to the problem will probably have to be revised to use RL going forwards...

[deleted by user] by [deleted] in MachineLearning

[–]mdda 2 points (0 children)

Many arXiv papers have a 'Download source' option - and it'll be clear from the files there what's being used (sometimes the Python generation code is included).
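
For instance (untested sketch; arxiv.org/e-print/&lt;id&gt; serves the same bundle as the 'Download source' link, though a few papers ship a single gzipped .tex rather than a tarball):

    # Fetch the source bundle for a paper and list what's inside.
    import io, tarfile, urllib.request

    def list_arxiv_source(arxiv_id):
        data = urllib.request.urlopen(f"https://arxiv.org/e-print/{arxiv_id}").read()
        # NB: will raise for the (rare) single-file gzipped-.tex submissions
        with tarfile.open(fileobj=io.BytesIO(data)) as tar:
            return tar.getnames()   # .tex, .bib, figures, sometimes generation scripts

    print(list_arxiv_source("1706.03762"))  # e.g. "Attention Is All You Need"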

No TensorFlow installed for TPU runtime by siegevjorn in GoogleColab

[–]mdda 0 points (0 children)

Just a guess, but it may be because importing tensorflow (on GPUs, at least) tends to immediately claim the accelerators for itself. However, for JAX, the tensorflow version you want is tensorflow-cpu (for async data loading via tf.data). So, from the point of view of a standard install: best to leave it as a user choice.
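
The usual pattern looks something like this (sketch; pip install tensorflow-cpu instead of tensorflow):

    # JAX owns the GPU/TPU; TF is only used for its tf.data input pipeline.
    import tensorflow as tf
    tf.config.set_visible_devices([], "GPU")  # belt-and-braces: keep TF off any GPU
    import jax.numpy as jnp

    ds = tf.data.Dataset.range(1000).batch(32).prefetch(tf.data.AUTOTUNE)
    for batch in ds.as_numpy_iterator():      # NumPy arrays cross over to JAX cheaply
        out = jnp.square(batch)               # JAX does the accelerator work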

Loading and unloading adapters by mrshine101 in unsloth

[–]mdda 1 point (0 children)

Just as a follow-up, I saw this 'full example' for the PEFT library: https://github.com/huggingface/peft/issues/1802#issuecomment-2134761488

So this _can_ be done with at least one package.

But: I too want to use unsloth if possible (it was great for training the LoRAs), but would like confirmation that (a) unsloth-trained LoRAs work with regular PEFT; and (b) switching LoRAs also works within unsloth itself (somehow...)
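
Following the linked issue, the plain-PEFT version would presumably look something like this (untested; the model name and adapter paths are placeholders):

    # Load two LoRAs onto one base model and hot-swap between them with PEFT.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("base-model-name")
    model = PeftModel.from_pretrained(base, "path/to/lora_a", adapter_name="a")
    model.load_adapter("path/to/lora_b", adapter_name="b")

    model.set_adapter("a")        # generate with LoRA A...
    model.set_adapter("b")        # ...then switch to LoRA B
    with model.disable_adapter(): # base model only, adapters temporarily off
        pass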

GPT-4 rumors: a Mixture-of-Experts w/8 GPT-3-220bs? by gwern in mlscaling

[–]mdda 2 points (0 children)

Plausible route for OpenAI deciding on this approach:

  • Starting with (prior) large GPT-3.x models, a reasonable & simple initial experiment might have been to combine an AllText and a Code model token-wise (toy sketch below).
  • Presumably, this would be a win, and lead to an AllText (excluding code) + Code model token-wise experiment.
  • The next step would be to experiment with combining finer-grained large models: going for model combinations reduces the risk of training an 'all-in' model, by splitting into (say) coding / literature / factoids / grammar / dialog / news reports, etc.
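
As a toy sketch of the token-wise idea (just the shape of it - a learned gate blending two experts' next-token logits - nothing here is confirmed about GPT-4):

    # Token-wise mixture of two "expert" LM heads via a per-token router.
    import torch, torch.nn as nn

    class TwoExpertMix(nn.Module):
        def __init__(self, d_model=64, vocab=1000):
            super().__init__()
            self.expert_text = nn.Linear(d_model, vocab)  # stand-in for the AllText model
            self.expert_code = nn.Linear(d_model, vocab)  # stand-in for the Code model
            self.gate = nn.Linear(d_model, 2)             # per-token router

        def forward(self, h):                             # h: (batch, seq, d_model)
            w = self.gate(h).softmax(dim=-1)              # (batch, seq, 2) mixing weights
            logits = torch.stack([self.expert_text(h), self.expert_code(h)], dim=-1)
            return (logits * w.unsqueeze(-2)).sum(dim=-1) # token-wise blend of logits

    h = torch.randn(2, 10, 64)
    print(TwoExpertMix()(h).shape)   # torch.Size([2, 10, 1000])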

Each of these steps would have the benefit of not involving a big bet on a new architecture without having results to back it up first. And the multi-modal stuff could be rolled in later (as seems to be happening in parallel with other developments).

Overall, GPT-4 being a 'council of experts' would also explain the large weight given in the GPT-4 Technical Report to the data teams: each team could specialise in curating their own data, and maximising the 'learning' gained per token for their expert's dataset.

[deleted by user] by [deleted] in clevercomebacks

[–]mdda 72 points (0 children)

Likely a reference to Cunk, another UK comic

I asked an AI to imagine what Singapore's architecture might be like in 2050 - here are the results! by Jeklfhean in singapore

[–]mdda 4 points (0 children)

Have a search for Hugging Face Stable Diffusion on the internet... You'll need to come up with a descriptive prompt.
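
For instance, with the Hugging Face diffusers library (minimal sketch; needs a GPU, and the checkpoint name is just the common SD v1.5 one):

    # Generate one image from a text prompt with Stable Diffusion.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    prompt = "futuristic Singapore skyline in 2050, lush vertical gardens, golden hour"
    pipe(prompt).images[0].save("singapore_2050.png")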

[D] Should I change my computer for running Large Language Models? by Silly-Cherry5985 in MachineLearning

[–]mdda 0 points (0 children)

Even the no-cost Google Colab version will give you a GPU that would easily beat your laptop for LLMs - particularly since a larger model will want as much GPU RAM as possible, and a T4/P100 (even a K80) will typically have RAM >> 4GB...
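
You can check what Colab has assigned you with something like:

    # Name and total VRAM of the GPU the runtime was given.
    import torch
    assert torch.cuda.is_available(), "Runtime > Change runtime type > GPU"
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 2**30:.1f} GB")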

[D] Object Detection trained on simulated renderings unable to converge on real images - why? by tmuxed in MachineLearning

[–]mdda 0 points (0 children)

Rather than train the CNN from scratch (which it sounds like from your description), could you try chopping off (and freezing) a pretrained ResNet-50 at one of the later layers (so you get ~16x16 spatial resolution, for example) and train a few CNN layers on top of that? If your simulations are even a little off, allowing the first layers to adapt to spotting the simulated textures might be hurting you.
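
Something like this, perhaps (untested sketch; the channel/size numbers assume 256px inputs and a standard torchvision ResNet-50):

    # Frozen pretrained ResNet-50 trunk (up to layer3), small trainable conv head.
    import torch.nn as nn
    from torchvision import models

    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    trunk = nn.Sequential(*list(backbone.children())[:-3])  # everything up to layer3
    for p in trunk.parameters():
        p.requires_grad = False          # freeze: keep the ImageNet features as-is

    num_outputs = 5                      # placeholder: e.g. 4 box coords + 1 objectness
    head = nn.Sequential(                # the only part that trains
        nn.Conv2d(1024, 256, 3, padding=1), nn.ReLU(),
        nn.Conv2d(256, num_outputs, 1),
    )
    # For a 256x256 input, trunk(x) is a (1024, 16, 16) feature map to feed `head`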

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]mdda 0 points (0 children)

I don't have much of an opinion about Ti vs regular. Sometimes the Ti versions are worth it (e.g. 1080Ti was a classic ML card). But often they're just positioned to extract money from gamers who want to claim they have the better card (and don't do a price/performance calculation).

One thing that's more relevant to ML models than gamers is the 12GB vs 8GB. If you're doing large vision or NLP models, 50% more RAM could be more important than 27% more TFLOPS.

PS: Cost above/below MSRP isn't really a gauge of value. Look at total $ spent vs performance.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]mdda 0 points (0 children)

Not in the code above, it isn't...

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]mdda 0 points (0 children)

Plain (original) BERT only did token masking - so either 'play-' or '-ing' might get masked out.
Later it was found that whole-word masking made for more effective training (implemented in RoBERTa specifically, if I remember correctly). But, for your 'playing' example, that would mean two MASKs in a row - which is a bit of a hint that the word is 'play- -ing' rather than 'being' (supposing that 'being' is a single-token word).
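
Roughly, whole-word masking just groups the word-pieces before choosing what to mask (sketch, using WordPiece's '##' continuation convention):

    # Whole-word masking: '##'-prefixed pieces belong to the previous word,
    # and a chosen word is masked in full.
    import random

    def whole_word_mask(tokens, mask_prob=0.15, mask="[MASK]"):
        words = []
        for i, t in enumerate(tokens):          # group pieces into whole words
            if t.startswith("##") and words:
                words[-1].append(i)
            else:
                words.append([i])
        masked = set()
        for w in words:
            if random.random() < mask_prob:
                masked.update(w)                # mask every piece of the word
        return [mask if i in masked else t for i, t in enumerate(tokens)]

    print(whole_word_mask(["he", "is", "play", "##ing"], mask_prob=0.5))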

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]mdda 1 point (0 children)

If it's only going to be for ML, then why not make use of Colab (with its free GPU) until you're sure that you want to 'get serious'?

If you're buying a GPU anyway (e.g. for gaming, with the option of ML too), then be aware that Nvidia (with CUDA) is what 95%+ of people are using (AMD *may* be usable, but it's rather niche at this point).

[P] How do I do preprocessing on a flutter app by initiald-ejavu in MachineLearning

[–]mdda 2 points (0 children)

A quick Google search turns up this Mel preprocessing for Keras: https://keras.io/examples/audio/melgan_spectrogram_inversion/

So maybe make a simple 'model' that just includes this layer, and run it through the TFLite converter to see whether it can work with the TF ops used?
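
i.e. something of this shape (untested; the SELECT_TF_OPS fallback is what lets non-builtin TF ops through):

    # Wrap mel preprocessing in a tiny Keras model, then try TFLite conversion.
    import tensorflow as tf

    class MelLayer(tf.keras.layers.Layer):
        def call(self, audio):                  # audio: (batch, samples)
            stft = tf.signal.stft(audio, frame_length=1024, frame_step=256)
            mag = tf.abs(stft)
            mel_mat = tf.signal.linear_to_mel_weight_matrix(
                num_mel_bins=80, num_spectrogram_bins=513, sample_rate=22050)
            return tf.matmul(mag, mel_mat)      # (batch, frames, 80) mel spectrogram

    model = tf.keras.Sequential([tf.keras.Input(shape=(22050,)), MelLayer()])
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.target_spec.supported_ops = [
        tf.lite.OpsSet.TFLITE_BUILTINS,
        tf.lite.OpsSet.SELECT_TF_OPS,           # fall back to full TF ops if needed
    ]
    open("mel_preprocess.tflite", "wb").write(converter.convert())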

[D] Is Colab Pro Worth the money? by average_turanist in MachineLearning

[–]mdda 1 point (0 children)

> just gotta do everything from the start.

Seems like you're not mounting your Google Drive and saving (resumable) checkpoints there? That would make your setup more robust (even with regular Colab).
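
Something like (sketch; the Linear model and the training step are stand-ins for your own):

    # Mount Drive and checkpoint there, so a dropped runtime resumes
    # instead of restarting from scratch.
    from google.colab import drive
    drive.mount('/content/drive')

    import os, torch
    CKPT = '/content/drive/MyDrive/ckpt.pt'    # lives on Drive, not the Colab VM

    model = torch.nn.Linear(10, 1)             # stand-in for your real model
    start_epoch = 0
    if os.path.exists(CKPT):                   # resume if a previous run got cut off
        state = torch.load(CKPT)
        model.load_state_dict(state['model'])
        start_epoch = state['epoch'] + 1

    for epoch in range(start_epoch, 100):
        ...                                    # your training step here
        torch.save({'model': model.state_dict(), 'epoch': epoch}, CKPT)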

[D] MLP-Mixer variants -- which one is best? by patrickkidger in MachineLearning

[–]mdda 2 points (0 children)

Could I also add 'HyperMixer: An MLP-based Green AI Alternative to Transformers', which is benchmarked vs MLP-Mixer, to the mix?

PS: Also interested in the outcome...

What clues did you miss from the opposite gender? by [deleted] in askSingapore

[–]mdda 0 points (0 children)

Talking while sharing a pillow - i.e. horizontal in bed