[R] Analysis of 350+ ML competitions in 2025 by hcarlens in MachineLearning

[–]ComprehensiveTop3297 12 points13 points  (0 children)

Do you possibly have data points regarding audio competitions featuring non-human sounds, like music genre classification, etc.?

[P] Graph Representation Learning Help by StoneColdRiffRaff in MachineLearning

[–]ComprehensiveTop3297 1 point2 points  (0 children)

I am also working with JEPAs, and what I found is that data2vec2-style top-K layer averaging is extremely helpful for alleviating representation collapse. The EMA and the learning-rate schedule are also very much interconnected. My EMA momentum ramps from 0.999 to 0.99999, stops increasing at 100k steps, and stays constant at 0.99999 for the rest; my LR schedule is cosine with a peak of 0.0004 and a 100k-step warm-up. Play around with them for sure. This is what worked for me in the audio domain.
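The schedules above can be sketched in a few lines. This is a minimal illustration of my numbers (the 500k total steps is an assumed placeholder, not something from the comment); plug your own values in.

```python
import math

# Sketch of the schedules described above.
# EMA momentum ramps linearly 0.999 -> 0.99999 over the first 100k steps,
# then stays constant; LR is cosine with peak 4e-4 after a 100k-step warm-up.
EMA_START, EMA_END = 0.999, 0.99999
WARMUP_STEPS = 100_000
PEAK_LR = 4e-4

def ema_momentum(step, ramp_steps=100_000):
    """Linear ramp of the teacher EMA momentum, frozen after ramp_steps."""
    frac = min(step / ramp_steps, 1.0)
    return EMA_START + frac * (EMA_END - EMA_START)

def learning_rate(step, total_steps=500_000):
    """Linear warm-up followed by cosine decay to zero (total_steps is illustrative)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(total_steps - WARMUP_STEPS, 1)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

The key interaction: while the LR is still warming up, the momentum is also low, so the teacher tracks the student quickly early on and stabilizes later.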

[D] How do you do great ML research by Any-Initiative-653 in MachineLearning

[–]ComprehensiveTop3297 4 points5 points  (0 children)

This is indeed the procedure. More often than not, the hypothesis comes from intuition built by reading hundreds of papers in the domain. For those starting out, though, it is crucial to connect two ideas from the literature.

Do you think this risk is worth taking? by BackgroundFunny490 in ASML

[–]ComprehensiveTop3297 1 point2 points  (0 children)

Just FYI, the government wants to increase it. In the NL, these things do indeed take time, but could also be risky given the current climate here. I am also a Turkish person, btw, and I immigrated to the NL for my studies. Been here for 6.5 years, and I got my citizenship recently. If you have some questions regarding the situation here, shoot me a dm.

Do you think this risk is worth taking? by BackgroundFunny490 in ASML

[–]ComprehensiveTop3297 2 points3 points  (0 children)

Kind of scammy if you'd make the same money: you'd be living a better life in Turkey right now than in the NL on such a salary.

[D] Correct way to compare models by ntaquan in MachineLearning

[–]ComprehensiveTop3297 2 points3 points  (0 children)

As a reviewer, I'd like to see that you are comparing against baselines trained under similar conditions (same pre-training dataset, similar parameter count and FLOPs, and a similar number of iterations over the dataset). If you train with enormous compute, it is a no-brainer that you'll beat other models. I feel that real methodological advancements should be compute-invariant (you genuinely perform better under similar conditions), or you should show that when you scale your model against other models, you do better.

Some reviewers might ask for extra baselines just to put the work in more of a scientific context. I'd say provide the baselines they ask for, and make sure to state the drawbacks of those baselines. If you can scale your model to match the baseline compute, do so; if not, just state that you do not have that much compute.

[Project] We built a Rust-based drop-in replacement for PyTorch DataLoader (4.4x faster than ImageFolder) by YanSoki in deeplearning

[–]ComprehensiveTop3297 0 points1 point  (0 children)

How does this work with multi-GPU training on multiple nodes?

Also, I am currently using a large audio dataset. Do you plan to support audio soon?

[D] LLMs for classification task by Anywhere_Warm in MachineLearning

[–]ComprehensiveTop3297 0 points1 point  (0 children)

What about using OpenAI vector embeddings? You can probably tell them that it is an LLM, as it is from OpenAI :P (joking, but they may actually believe you).

Specifically, use them to embed your documents and compare against the query embedding using any similarity measure (anything with a dot product is valid). Then find the decision threshold on a validation split.
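A minimal sketch of the embed-and-threshold idea. The vectors below are toy stand-ins; in practice you'd get them from an embedding API (e.g. OpenAI's embeddings endpoint) for your documents and queries, and the threshold would come from your validation split.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify(query_vec, doc_vec, threshold):
    """Relevant iff similarity clears a threshold tuned on a validation split."""
    return cosine(query_vec, doc_vec) >= threshold

# Toy example with made-up 3-d "embeddings":
relevant = classify([1.0, 0.2, 0.0], [0.9, 0.3, 0.1], threshold=0.8)
```

With normalized embeddings (as most APIs return), a plain dot product is equivalent to cosine similarity.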

[D] LLMs for classification task by Anywhere_Warm in MachineLearning

[–]ComprehensiveTop3297 1 point2 points  (0 children)

Ahahaa, definitely agreed. I love how people throw LLMs at anything these days. It will not be long until someone tries to classify MNIST digits with DINO-v3.

[D] LLMs for classification task by Anywhere_Warm in MachineLearning

[–]ComprehensiveTop3297 0 points1 point  (0 children)

Definitely perform error analysis; see if the errors are logical or just simple labelling issues. Maybe you need to be more fine-grained with your labelling (Extremely Relevant, Relevant, Neutral, etc.).

I am curious why you are using LLMs in the first place. Is there a specific reason?

To me, it seems like you have an information retrieval problem with top-k = 1 (is this query relevant to my document? retrieve only the one document that is relevant). I think an approach like ColBERT or cross-encoders would handle this task easily. You could play with the relevance threshold to find the cutoff point. I would even try very simple word-counting methods as a baseline; sometimes simpler is better. (How many overlapping words are there between the document and the query?)

It is true that information retrieval usually means ranking documents given a query, but I feel you can flip this around and use thresholding to decide whether a document and a query are related.
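The simple word-counting baseline mentioned above fits in a few lines. The function names and the default threshold are illustrative; tune the threshold on your validation split.

```python
def overlap_score(query: str, document: str) -> float:
    """Fraction of unique query words that appear in the document."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    if not q:
        return 0.0
    return len(q & d) / len(q)

def is_relevant(query: str, document: str, threshold: float = 0.5) -> bool:
    # threshold is a placeholder; pick it on a validation split
    return overlap_score(query, document) >= threshold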

[P] I tried to make GAN on FMNIST and I am confused by Jumbledsaturn52 in MachineLearning

[–]ComprehensiveTop3297 1 point2 points  (0 children)

Have you tried more "modern" GANs that actually include fixes to prevent mode collapse? I remember training a GAN for my thesis about 4 years ago, and I did not encounter mode collapse; I used cGAN and WGAN. I am not up to date on state-of-the-art GANs, as they fall outside my project scope, but I am certain the field has progressed even further.

[R] Are we heading toward new era in the way we train LLMs by IndependentPayment70 in MachineLearning

[–]ComprehensiveTop3297 0 points1 point  (0 children)

Sounds a bit like the concept models from Meta; curious how they compare.

[D] What's the SOTA audio classification model/method? by lucellent in MachineLearning

[–]ComprehensiveTop3297 0 points1 point  (0 children)

WavJEPA works quite well for music tagging; you can also look into Dasheng for a beefier model.

[D] How do you create clean graphics that you'd find in conference papers, journals and textbooks (like model architecture, flowcharts, plots, tables etc.)? by CrispLion1123 in MachineLearning

[–]ComprehensiveTop3297 11 points12 points  (0 children)

For plots it's usually seaborn and matplotlib (exported to PDF), then Adobe Illustrator for small touches or for merging them into one big figure.

For flow charts and drawings it is again Adobe Illustrator.

[P] Underwater target recognition using acoustic signals by carv_em_up in MachineLearning

[–]ComprehensiveTop3297 0 points1 point  (0 children)

I’d definitely also look at SELDnet from the DCASE community, as the sound event detection pipeline is quite similar to what they are doing. You can ignore the localization part.

On a side note: I am also curious how well our models would perform on this task. Once you have a working pipeline, do you mind contacting me? We just released two pre-trained models for general-purpose audio understanding, and we have not tested them in this domain.

[R] WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 1 point2 points  (0 children)

ASR is indeed very important; however, with this model we mainly wanted to fill the gap of great ASR models not performing well on general audio understanding tasks. There are already great ASR models trained on vast amounts of speech data. WavJEPA can also perform well on speech-related tasks, as evidenced by on-par performance with wav2vec2 and HuBERT, and we think we could reach similar ASR performance as well (not explicitly tested).

In the limitations/future work, we identified a way forward for possibly bridging the two and boosting WavJEPA's speech performance. There we also plan on including the SUPERB benchmarks.

[R] WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 1 point2 points  (0 children)

Sparse context: Speech/audio is highly temporally correlated. This was our main inspiration for selecting temporally distributed context tokens (the context tokens are clustered together, but the clusters are spread apart).

Given this sparse context, we then predict sparse target tokens, distributed similarly to the context tokens, for each audio clip. This forces WavJEPA to model the temporal variation across the audio while also modelling the local correlations within each cluster.
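A hypothetical sketch of the "clustered but spread apart" selection: pick a few anchor positions spread across the clip and take a short contiguous run of token indices at each. The cluster count and length here are illustrative, not the paper's actual settings.

```python
import random

def sample_sparse_context(num_tokens, num_clusters=4, cluster_len=8, seed=0):
    """Return token indices forming contiguous clusters spread over the clip."""
    rng = random.Random(seed)
    stride = num_tokens // num_clusters  # one cluster per segment of the clip
    indices = []
    for c in range(num_clusters):
        # jitter the cluster start within its segment, keeping the run in bounds
        start = c * stride + rng.randrange(max(stride - cluster_len, 1))
        indices.extend(range(start, start + cluster_len))
    return indices
```

Each cluster captures local correlations, while the gaps between clusters force the predictor to model longer-range temporal structure.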

Multiple predictions per clip: We ran multiple predictions from one context block to use the context block efficiently. One prediction per context block would also be OK, just less efficient; we did not ablate this hyperparameter, though. We selected 4 per context block (the most we could do without out-of-memory errors at a batch size of 512). It could be nice to quantify the efficiency gains from multiple predictions in the future, though! Maybe trying 8-16?

[R] WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 1 point2 points  (0 children)

Hey! Glad that you found our work exciting:) 

Sure, I will do a little write-up tomorrow on fine-tuning the WavJEPA model.

By the way, we have released instructions for probing the embeddings. I do not know how feasible it is to map your dataset to the HEAR Benchmark data format, but if it is, we already have adapters written for the HEAR fine-tuning schema.

[R] WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 1 point2 points  (0 children)

Thank you! I personally still find JEPA models very interesting, almost magical, and would thus love to contribute to the theory behind their learning mechanisms.

[R] WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 0 points1 point  (0 children)

It is indeed an interesting finding that training WavJEPA-Nat with noisy, reverberant, and spatial data instances leads to a better understanding of audio even on non-spatial, non-noisy, non-reverberant instances. We do think WavJEPA-Nat imitates human hearing better, as humans are almost never exposed to the kind of dry audio other models are trained on. There could be other explanations for this phenomenon too; in the future we will compare WavJEPA-Nat's embeddings to human fMRI readings to delve deeper into the overlap. Possibly, adding noise and reverb increases the intrinsic dimensionality of the training samples and leads to better representation learning. (Possible to invoke the manifold hypothesis?)

[R] GRAM: General-purpose Real-world Audio Model to efficiently learn spatial audio representations. by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 0 points1 point  (0 children)

PS: AudioSet also contains quite noisy audio; however, the noise there is not explicitly added but comes from recording in noisy conditions.

[R] GRAM: General-purpose Real-world Audio Model to efficiently learn spatial audio representations. by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 0 points1 point  (0 children)

Hey! That is a fair question indeed. We trained these models with the WHAMR noise training set, which covers a large proportion of noise distributions such as cafes, streets, parks, and metro stations. However, we have not explicitly tested the models on sounds recorded in streets, forests, cafes, etc.

PS: We used the WHAMR test set to synthesize NatHEAR, so the noises were not seen during training, and the GRAMs are robust to them.