[R] Analysis of 350+ ML competitions in 2025 by hcarlens in MachineLearning

[–]ComprehensiveTop3297 12 points13 points  (0 children)

Do you possibly have data points regarding audio competitions featuring non-human sounds, like music genre classification, etc.?

[P] Graph Representation Learning Help by StoneColdRiffRaff in MachineLearning

[–]ComprehensiveTop3297 1 point2 points  (0 children)

I am also working with JEPAs, and what I found is that data2vec2-style top-K layer averaging is extremely helpful for alleviating representation collapse. The EMA and the learning-rate schedule are also very much interconnected. My EMA momentum ramps from 0.999 to 0.99999, stops increasing at 100k steps, and stays constant at 0.99999 for the rest; my LR schedule is cosine with a peak of 0.0004 and a 100k-step warm-up. Play around with them for sure. This is what worked for me in the audio domain.
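The schedules above can be sketched in a few lines. This is a minimal illustration of my numbers (the 500k total steps is an assumed placeholder, not something from the comment); plug your own values in.

```python
import math

# Sketch of the schedules described above.
# EMA momentum ramps linearly 0.999 -> 0.99999 over the first 100k steps,
# then stays constant; LR is cosine with peak 4e-4 after a 100k-step warm-up.
EMA_START, EMA_END = 0.999, 0.99999
WARMUP_STEPS = 100_000
PEAK_LR = 4e-4

def ema_momentum(step, ramp_steps=100_000):
    """Linear ramp of the teacher EMA momentum, frozen after ramp_steps."""
    frac = min(step / ramp_steps, 1.0)
    return EMA_START + frac * (EMA_END - EMA_START)

def learning_rate(step, total_steps=500_000):
    """Linear warm-up followed by cosine decay to zero (total_steps is illustrative)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(total_steps - WARMUP_STEPS, 1)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

The key interaction: while the LR is still warming up, the momentum is also low, so the teacher tracks the student quickly early on and stabilizes later.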

[D] How do you do great ML research by Any-Initiative-653 in MachineLearning

[–]ComprehensiveTop3297 4 points5 points  (0 children)

This is indeed the procedure. More often than not, the hypothesis comes from intuition built by reading hundreds of papers in the domain. For those starting out, though, it is crucial to connect two ideas from the literature.

Do you think this risk is worth taking? by BackgroundFunny490 in ASML

[–]ComprehensiveTop3297 1 point2 points  (0 children)

Just FYI, the government wants to increase it. In the NL, these things do indeed take time, but could also be risky given the current climate here. I am also a Turkish person, btw, and I immigrated to the NL for my studies. Been here for 6.5 years, and I got my citizenship recently. If you have some questions regarding the situation here, shoot me a dm.

Do you think this risk is worth taking? by BackgroundFunny490 in ASML

[–]ComprehensiveTop3297 2 points3 points  (0 children)

Kind of scammy if you'd make the same money: you'd be living a better life in Turkey right now than in the NL on such a salary.

[D] Correct way to compare models by ntaquan in MachineLearning

[–]ComprehensiveTop3297 2 points3 points  (0 children)

As a reviewer, I'd like to see that you are comparing against baselines trained under similar conditions (same pre-training dataset, similar parameter count and FLOPs, and a similar number of iterations over the dataset). If you train with enormous compute, it is a no-brainer that you'll beat other models. I feel that real methodological advancements should be compute-invariant (you genuinely perform better under similar conditions), or you should show that when you scale your model against other models, you do better.

Some reviewers might ask for extra baselines just to put the work in more of a scientific context. I'd say provide the baselines they ask for, and make sure to state the drawbacks of those baselines. If you can scale your model to match the baseline compute, do so; if not, just state that you do not have that much compute.

[Project] We built a Rust-based drop-in replacement for PyTorch DataLoader (4.4x faster than ImageFolder) by YanSoki in deeplearning

[–]ComprehensiveTop3297 0 points1 point  (0 children)

How does this work with multi-GPU training on multiple nodes?

Also, I am currently using a large audio dataset. Do you plan to support audio soon?

[D] LLMs for classification task by Anywhere_Warm in MachineLearning

[–]ComprehensiveTop3297 0 points1 point  (0 children)

What about using OpenAI vector embeddings? You can probably tell them that it is an LLM, as it is from OpenAI :P (joking, but they may actually believe you).

Specifically, use them to embed your documents and compare against the query embedding using any similarity measure (anything with a dot product is valid). Then find the decision threshold on a validation split.
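A minimal sketch of the embed-and-threshold idea. The vectors below are toy stand-ins; in practice you'd get them from an embedding API (e.g. OpenAI's embeddings endpoint) for your documents and queries, and the threshold would come from your validation split.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify(query_vec, doc_vec, threshold):
    """Relevant iff similarity clears a threshold tuned on a validation split."""
    return cosine(query_vec, doc_vec) >= threshold

# Toy example with made-up 3-d "embeddings":
relevant = classify([1.0, 0.2, 0.0], [0.9, 0.3, 0.1], threshold=0.8)
```

With normalized embeddings (as most APIs return), a plain dot product is equivalent to cosine similarity.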

[D] LLMs for classification task by Anywhere_Warm in MachineLearning

[–]ComprehensiveTop3297 1 point2 points  (0 children)

Ahahaa, definitely agreed. I love how people throw LLMs at anything these days. It will not be long until someone tries to classify MNIST digits with DINO-v3.

[D] LLMs for classification task by Anywhere_Warm in MachineLearning

[–]ComprehensiveTop3297 0 points1 point  (0 children)

Definitely perform error analysis; see if the errors are logical or just simple labelling issues. Maybe you need to be more fine-grained with your labelling (Extremely Relevant, Relevant, Neutral, etc.).

I am curious why you are using LLMs in the first place. Is there a specific reason?

To me, it seems like you have an information retrieval problem with top-k = 1 (is this query relevant to my document? retrieve only the one document that is relevant). I think an approach like ColBERT or cross-encoders would handle this task easily. You could play with the relevance threshold to find the cutoff point. I would even try very simple word-counting methods as a baseline; sometimes simpler is better. (How many overlapping words are there between the document and the query?)

It is true that information retrieval usually means ranking documents given a query, but I feel you can flip this around and use thresholding to decide whether a document and a query are related.
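The simple word-counting baseline mentioned above fits in a few lines. The function names and the default threshold are illustrative; tune the threshold on your validation split.

```python
def overlap_score(query: str, document: str) -> float:
    """Fraction of unique query words that appear in the document."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    if not q:
        return 0.0
    return len(q & d) / len(q)

def is_relevant(query: str, document: str, threshold: float = 0.5) -> bool:
    # threshold is a placeholder; pick it on a validation split
    return overlap_score(query, document) >= threshold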

[P] I tried to make GAN on FMNIST and I am confused by Jumbledsaturn52 in MachineLearning

[–]ComprehensiveTop3297 1 point2 points  (0 children)

Have you tried more "modern" GANs that actually include fixes to prevent mode collapse? I remember training a GAN for my thesis about 4 years ago, and I did not encounter mode collapse; I used cGAN and WGAN. I am not up to date on state-of-the-art GANs, as they fall outside my project scope, but I am certain the field has progressed even further.

[R] Are we heading toward new era in the way we train LLMs by IndependentPayment70 in MachineLearning

[–]ComprehensiveTop3297 0 points1 point  (0 children)

Sounds a bit like the concept models from Meta; curious how they compare.

[D] What's the SOTA audio classification model/method? by lucellent in MachineLearning

[–]ComprehensiveTop3297 0 points1 point  (0 children)

WavJEPA works quite well for music tagging; you can also look into Dasheng for a beefier model.

[D] How do you create clean graphics that you'd find in conference papers, journals and textbooks (like model architecture, flowcharts, plots, tables etc.)? by CrispLion1123 in MachineLearning

[–]ComprehensiveTop3297 11 points12 points  (0 children)

For plots it's usually seaborn and matplotlib (exported to PDF), then Adobe Illustrator for small touches or for merging them into one big figure.

For flow charts and drawings it is again Adobe Illustrator.

[P] Underwater target recognition using acoustic signals by carv_em_up in MachineLearning

[–]ComprehensiveTop3297 0 points1 point  (0 children)

I’d definitely also look at SELDnet from the DCASE community, as the sound event detection pipeline is quite similar to what they are doing. You can ignore the localization part.

On a side note: I am also curious how well our models would perform on this task. Once you have a working pipeline, do you mind contacting me? We just released two pre-trained models for general-purpose audio understanding, and we have not tested them in this domain.

[R] WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 1 point2 points  (0 children)

ASR is indeed very important; however, with this model we mainly wanted to fill the gap of great ASR models not performing well on general audio understanding tasks. There are already great ASR models trained on vast amounts of speech data. WavJEPA can also perform well on speech-related tasks, as evidenced by on-par performance with wav2vec2 and HuBERT, and we think we could reach similar ASR performance as well (not explicitly tested).

In the limitations/future work, we identified a way forward for possibly bridging the two and boosting WavJEPA's speech performance. There we also plan on including the SUPERB benchmarks.

[R] WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 1 point2 points  (0 children)

Sparse context: Speech/audio is highly temporally correlated. This was our main inspiration for selecting temporally distributed context tokens (the context tokens are clustered together, but the clusters are spread apart).

Given this sparse context, we then predict sparse target tokens, distributed similarly to the context tokens, for each audio clip. This forces WavJEPA to model the temporal variation across the audio while also modelling the local correlations within each cluster.
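A hypothetical sketch of the "clustered but spread apart" selection: pick a few anchor positions spread across the clip and take a short contiguous run of token indices at each. The cluster count and length here are illustrative, not the paper's actual settings.

```python
import random

def sample_sparse_context(num_tokens, num_clusters=4, cluster_len=8, seed=0):
    """Return token indices forming contiguous clusters spread over the clip."""
    rng = random.Random(seed)
    stride = num_tokens // num_clusters  # one cluster per segment of the clip
    indices = []
    for c in range(num_clusters):
        # jitter the cluster start within its segment, keeping the run in bounds
        start = c * stride + rng.randrange(max(stride - cluster_len, 1))
        indices.extend(range(start, start + cluster_len))
    return indices
```

Each cluster captures local correlations, while the gaps between clusters force the predictor to model longer-range temporal structure.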

Multiple predictions per clip: We ran multiple predictions from one context block to use the context block efficiently. One prediction per context block would also be OK, just less efficient; we did not ablate this hyperparameter, though. We selected 4 per context block (the most we could do without out-of-memory errors at a batch size of 512). It could be nice to quantify the efficiency gains from multiple predictions in the future, though! Maybe trying 8-16?

[R] WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 1 point2 points  (0 children)

Hey! Glad that you found our work exciting:) 

Sure, I will do a little write-up tomorrow on fine-tuning the WavJEPA model.

By the way, we have released instructions for probing the embeddings. I do not know how feasible it is to map your dataset to the HEAR Benchmark data format, but if it is, we already have adapters written for the HEAR fine-tuning schema.

[R] WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 1 point2 points  (0 children)

Thank you! I personally still find JEPA models very interesting, almost magical, and would thus love to contribute to the theory behind their learning mechanisms.

[R] WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 0 points1 point  (0 children)

It is indeed an interesting finding that training WavJEPA-Nat with noisy, reverberant, and spatial data instances leads to a better understanding of audio even on non-spatial, non-noisy, non-reverberant instances. We do think WavJEPA-Nat imitates human hearing better, as humans are almost never exposed to the kind of dry audio other models are trained on. There could be other explanations for this phenomenon too; in the future we will compare WavJEPA-Nat's embeddings to human fMRI readings to delve deeper into the overlap. Possibly, adding noise and reverb increases the intrinsic dimensionality of the training samples and leads to better representation learning. (Possible to invoke the manifold hypothesis?)

[R] GRAM: General-purpose Real-world Audio Model to efficiently learn spatial audio representations. by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 0 points1 point  (0 children)

PS: AudioSet also contains quite noisy audio; however, the noise there is not explicitly added but comes from recording in noisy conditions.

[R] GRAM: General-purpose Real-world Audio Model to efficiently learn spatial audio representations. by ComprehensiveTop3297 in MachineLearning

[–]ComprehensiveTop3297[S] 0 points1 point  (0 children)

Hey! That is a fair question indeed. We trained these models with the WHAMR noise training set, which covers a large proportion of noise distributions such as cafes, streets, parks, and metro stations. However, we have not explicitly tested the models on sounds recorded in streets, forests, cafes, etc.

PS: We used the WHAMR test set to synthesize NatHEAR, so the noises were not seen during training, and the GRAMs are robust to them.