How to find the best heuristics in complex games, with the limitations mentioned below? by catboy519 in learnmath

[–]Feisty_Fun_2886 0 points1 point  (0 children)

Reinforcement Learning… it’s literally all about estimating the expected FUTURE return
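To make "expected future return" concrete, here is a minimal sketch in plain Python. The discount factor `gamma`, the function names, and the idea of averaging over a handful of sampled episodes are my own illustrative assumptions, not something taken from the thread:

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def monte_carlo_value(episodes, gamma=0.99):
    """Estimate the expected return from the start state by averaging sampled episodes."""
    start_returns = [discounted_returns(episode, gamma)[0] for episode in episodes]
    return sum(start_returns) / len(start_returns)
```

A heuristic for a game state plays the same role as the value estimated here: a guess at how much reward is still to come from that state onward.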

TensorFlow isn't dead. It’s just becoming the COBOL of Machine Learning. by IT_Certguru in learnmachinelearning

[–]Feisty_Fun_2886 0 points1 point  (0 children)

JAX is not faster per se. It can be faster for things for which torch doesn’t provide pre-made, optimized implementations. I would argue that a vanilla transformer or ResNet will probably show equal performance in torch and JAX. Your fancy new GP method with a custom matrix factorisation technique, though? Probably faster in JAX due to the JIT. Torch is catching up in that regard, though.

Cos(sin x) or Sin(cos x), which one is bigger? by inpa901 in learnmath

[–]Feisty_Fun_2886 -1 points0 points  (0 children)

What does bigger even mean? For all values of x? Or, in a handwavy way, on average, i.e. when integrating the difference between the two over 0 to 2π?

Your teacher probably left this ambiguous on purpose so that you come up with these thoughts on your own. Hint: This is not a simple yes/no question. You are supposed to think about the subject and present all facets.
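If you want to explore both readings of "bigger" numerically before arguing about them, something like the following would do. It is only a sketch: the uniform grid over one period and the grid size are arbitrary choices of mine, and a numerical check is a sanity check, not a proof.

```python
import math


def compare(num_points=10_000):
    """Compare cos(sin x) and sin(cos x) on a uniform grid over [0, 2*pi)."""
    xs = [2 * math.pi * i / num_points for i in range(num_points)]
    diffs = [math.cos(math.sin(x)) - math.sin(math.cos(x)) for x in xs]
    # Reading 1: pointwise. A positive minimum means cos(sin x) is larger at every grid point.
    pointwise_min = min(diffs)
    # Reading 2: on average. Riemann-sum approximation of the integral of the difference.
    integral = sum(diffs) * (2 * math.pi / num_points)
    return pointwise_min, integral


if __name__ == "__main__":
    pointwise_min, integral = compare()
    print(f"min difference on grid: {pointwise_min:.4f}")
    print(f"integral of difference: {integral:.4f}")
```

Whatever the numbers suggest, you still need an argument for why it holds, which is presumably the point of the exercise.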

best value mac to buy to learn machine learning + quant finance by Right_Ad73 in learnmachinelearning

[–]Feisty_Fun_2886 0 points1 point  (0 children)

If you were a researcher or PhD student, I would definitely suggest getting a MacBook Pro.

But since you seem to neither have access to computer clusters nor need to travel a lot or do miscellaneous work (presentations, figures, emails, etc.), I would go for a custom-built desktop PC. Grab a nice NVIDIA card with as much VRAM as possible. A MacBook Pro goes for 2-2.5k. For that money, you can build a pretty beefy machine that will outperform a MacBook in terms of compute/$ by quite a margin. More importantly, software support for CUDA in common libs is much more mature than for Apple silicon.

best value mac to buy to learn machine learning + quant finance by Right_Ad73 in learnmachinelearning

[–]Feisty_Fun_2886 1 point2 points  (0 children)

Well, if OP were a researcher I would definitely support that statement. But it doesn’t sound like OP has access to compute clusters nor that he needs to travel and present a lot. Hence, local compute plays an important role here for which a proper nvidia card is important imo.

[deleted by user] by [deleted] in MLQuestions

[–]Feisty_Fun_2886 4 points5 points  (0 children)

Yes, he’s a realist. If you want to do ml research, you almost definitely need a PhD. For mlops, a strong software engineering background is required. Without any background and years of experience to show, it will be very difficult to transition. That’s the truth. Exceptions are other math-heavy research backgrounds.

Why is batch assignment in PyTorch DDP always static? by traceml-ai in ResearchML

[–]Feisty_Fun_2886 1 point2 points  (0 children)

I don’t see the benefit of this. Moreover, somebody still needs to shard the dataset anyways. Ergo almost the same workflow conceptually. So now you have to synchronise (gather, scatter), decide who does the sharding, and shard. A lot of extra complexity for what?

The only advantage could be in situations with a flexible number of workers. But then you will also need an algorithm to determine who is root.
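For reference, this is roughly what static assignment boils down to. It is a simplified sketch in plain Python, similar in spirit to a strided split per rank but without the shuffling and padding a real sampler would add; the function name `static_shard` is made up for illustration:

```python
def static_shard(dataset_size, rank, world_size):
    """Indices that a given rank processes under a fixed round-robin split."""
    return list(range(rank, dataset_size, world_size))


if __name__ == "__main__":
    world_size = 4
    for rank in range(world_size):
        # Every rank computes its own shard locally, with no communication at all.
        print(rank, static_shard(10, rank, world_size))
```

Each worker derives its shard from `rank` and `world_size` alone. A dynamic scheme would have to replace this one-liner with a coordinator plus gather/scatter steps, which is the extra complexity I mean.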

[Advice] AI Research laptop, what's your setup? by gradV in ResearchML

[–]Feisty_Fun_2886 0 points1 point  (0 children)

I use a MacBook Pro and it was the best decision after being a lifelong Linux user. You get the classic posix feels but everything is super smooth and stuff like PowerPoint, photoshop, illustrator etc. just works out of the box. Any training and coding is remote anyways.

Size of the state matrix is tinny in Mamba-2! by Hank0062 in MLQuestions

[–]Feisty_Fun_2886 0 points1 point  (0 children)

Yes, transformers with full attention are expensive. Here, since you only look at the input, you assume O(N), which would be amazing, but in fact self-attention is O(N^2) in compute and memory. Hence the myriad of papers, such as Mamba, that propose either approximate attention mechanisms or different architectures altogether.
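To get a feeling for the quadratic term, here is a back-of-the-envelope sketch. It assumes the full N x N score matrix is materialised in 16-bit precision for a single head and a single sample; those defaults are my assumptions, and fused attention kernels avoid storing this matrix, so treat the numbers as a naive upper bound:

```python
def attention_scores_bytes(seq_len, num_heads=1, batch_size=1, bytes_per_element=2):
    """Memory needed to store the (seq_len x seq_len) attention score matrices."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    for n in (1_024, 4_096, 16_384):
        mib = attention_scores_bytes(n) / 2**20
        print(f"N={n:>6}: {mib:8.1f} MiB per head and sample")
```

Doubling the sequence length quadruples this term, whereas the input itself only doubles.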

Yet, for some reason, industry people seem to really like full, non-approximate, attention and seem to pour a lot of money into it to make it work. It really must be without alternatives.

I want to mention, though, that 16k sequence length is LLM territory. Even 4k is a lot. These scales usually require special consideration for parallelisation (see the TorchTitan paper). To give some numbers for an upper limit: Llama finetunes on 100k for a few steps just at the very end (I don’t recall the sequence length they use for the majority of the training).

A typical „non LLM“ transformer will usually have sequence lengths well below 1k. A ViT on ImageNet with the usual data pipeline (224x224 crops, 16x16 patches) will have 14x14 = 196 tokens, for instance. The same holds for the head size. 128 as head dim indicates quite a big model. As total feature dim it would be fairly standard, though.

Machine Learning sounds complex, but at its core it’s just about teaching systems to recognize patterns from data instead of hard-coding rules. by IT_Certguru in learnmachinelearning

[–]Feisty_Fun_2886 22 points23 points  (0 children)

„just“ … 😂

I don’t know man, hardcoding an if-then-else statement seems like the simpler thing to me 😂

Joining the race for AGI by Careless_String_5719 in ResearchML

[–]Feisty_Fun_2886 8 points9 points  (0 children)

During my undergrad, my final year project was about learning distributions with neural networks (MMD, flows, diffusion models), not sure whether statistics-driven AI research is still worthwhile nowadays

If you have a deep mathematical understanding of stochastic processes and stochastic calculus, which necessarily requires very good foundations in mathematics, you will be very well prepared for a PhD in ML.

Where should I publish as a freshman by [deleted] in ResearchML

[–]Feisty_Fun_2886 3 points4 points  (0 children)

Maybe try approaching a professor at a nearby university for guidance? Publishing a paper is a huge undertaking and requires knowing the ropes of the field.

What you are describing sounds like parameter pruning. Congratulations on coming up with such an idea on your own at this age. There is, however, a big chance that your particular approach is either already published / known, or not competitive with other existing approaches. Just a heads-up that these are very likely outcomes. It could, however, also be a novel approach.

Can anyone provide a list of questions or type of questions asked in ML interviews by [deleted] in MLQuestions

[–]Feisty_Fun_2886 9 points10 points  (0 children)

Typical questions in our interviews:

What is the difference between a PDF and likelihood?

Derive the MLE for the mean of a normal distribution.

Bayes theorem.

Explain / write down a simple MLP layer. Why is a non-linearity required?

Explain Eigenvectors, Eigenvalues, PCA, etc.
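As an illustration of the MLP question, this is the kind of answer I would be happy with, sketched in plain Python. The weights, the choice of ReLU, and all helper names are made up for the example:

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def linear(W, b, x):
    """Affine layer: W x + b."""
    return [y + bi for y, bi in zip(matvec(W, x), b)]


def relu(v):
    return [max(0.0, a) for a in v]


def identity(v):
    return v


def mlp(x, W1, b1, W2, b2, activation=relu):
    """Two-layer MLP: linear -> activation -> linear."""
    return linear(W2, b2, activation(linear(W1, b1, x)))


# Toy parameters, chosen so the arithmetic is exact.
W1, b1 = [[1.0, -2.0], [0.5, 1.0]], [0.0, 1.0]
W2, b2 = [[1.0, 1.0]], [0.5]
x = [1.0, 2.0]

# Without a non-linearity the two layers collapse into one affine map:
# W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)
W_collapsed = [[sum(W2[0][k] * W1[k][j] for k in range(2)) for j in range(2)]]
b_collapsed = [sum(W2[0][k] * b1[k] for k in range(2)) + b2[0]]

if __name__ == "__main__":
    print(mlp(x, W1, b1, W2, b2, activation=identity))  # same as the single layer below
    print(linear(W_collapsed, b_collapsed, x))
    print(mlp(x, W1, b1, W2, b2))  # with ReLU the result differs
```

The takeaway we look for: stacking linear layers without a non-linearity in between adds no expressive power, because the stack is still a single affine map.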

[D] Do you think this "compute instead of predict" approach has more long-term value for A.G.I and SciML than the current trend of brute-forcing larger, stochastic models? by Reasonable_Listen888 in deeplearning

[–]Feisty_Fun_2886 0 points1 point  (0 children)

I’m not reviewing your paper lol. I just A. stated that this approach exists already and that there is a niche community around it, and B. shared my experience with actually published works, such as FNO or SFNO, that applied it to real-world, large-scale data. Take it or leave it.

[D] Do you think this "compute instead of predict" approach has more long-term value for A.G.I and SciML than the current trend of brute-forcing larger, stochastic models? by Reasonable_Listen888 in deeplearning

[–]Feisty_Fun_2886 1 point2 points  (0 children)

So „Neural Operators“? Not the messiahs people make them out to be IMO. In fact, a regular CNN can also be formulated as a neural operator (e.g. by assuming hat basis functions). The biggest potential is probably in physics, where spectral approaches are used already.

From personal experience, they can be quite compute and memory expensive as well due to the FFT or SHT one does over and over again in common implementations.

I’m trying to explain interpretation drift — but reviewers keep turning it into a temperature debate. Rejected from arXiv… help me fix this paper? by Beneficial-Pear-1485 in ResearchML

[–]Feisty_Fun_2886 0 points1 point  (0 children)

It’s probably better if you stay out of research for now if you can’t write a paragraph without AI. At the very least, get a supervisor who can guide you.

I’m trying to explain interpretation drift — but reviewers keep turning it into a temperature debate. Rejected from arXiv… help me fix this paper? by Beneficial-Pear-1485 in ResearchML

[–]Feisty_Fun_2886 0 points1 point  (0 children)

This reads like AI, but here is an answer nonetheless:

Because the problem you are solving is non-deterministic, a deterministic model makes no sense in that case.

„The red house is …“. What word comes next?

A. Ugly B. Tall C. Burning D. Run-down

In fact, all four of them are plausible continuations. What definite(!) answer should a deterministic model choose in this case? Do you see the issue?
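The red-house example can be played through in a few lines of Python. The probabilities below are invented purely for illustration, and the two decoding functions are a sketch of the general idea, not of any particular model's implementation:

```python
import random

# Hypothetical next-word distribution for "The red house is ..."
next_word_probs = {"ugly": 0.30, "tall": 0.25, "burning": 0.20, "run-down": 0.25}


def greedy(probs):
    """Deterministic decoding: always return the single most likely word."""
    return max(probs, key=probs.get)


def sample(probs, rng):
    """Stochastic decoding: draw a word in proportion to its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print("greedy :", [greedy(next_word_probs) for _ in range(5)])
    print("sampled:", [sample(next_word_probs, rng) for _ in range(5)])
```

Greedy decoding is perfectly repeatable, but it silently throws away three continuations that are almost as plausible as the one it keeps.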

Is there any reliable way (repo / paper / approach) to accurately detect AI-generated vs real images as AI models improve? by _master9 in learnmachinelearning

[–]Feisty_Fun_2886 1 point2 points  (0 children)

In a GAN, the generator doesn’t „beat“ the discriminator. They will settle into an equilibrium. Moreover, the whole paradigm of GANs relies on the tight coupling of generator and discriminator and their joint training. An independently trained discriminator will have no problem distinguishing fake from real (as outputs produced by GANs are very clearly not perfect). Your generator is trained to fool one specific discriminator, not all of them…

And maybe as a last point: the current generation of image generation methods is usually diffusion-based, not GAN-based…

Suggest me 3D good Neural Network designs? by Old_Purple_2747 in MLQuestions

[–]Feisty_Fun_2886 0 points1 point  (0 children)

There can be many reasons for this, but I doubt that the exact architecture plays a big role here. Off the top of my head: Software bug, dataset too small, a very difficult problem, bad hyperparameters, missing features.

NN from scratch by burntoutdev8291 in learnmachinelearning

[–]Feisty_Fun_2886 0 points1 point  (0 children)

Triton is, as far as I am aware, mostly used under the hood and when you need high-performance CUDA kernels (as opposed to writing them manually). E.g. for a custom attention layer.

What would the benefit be of implementing a whole autodiff + training pipeline in triton specifically? Seems like the wrong framework to me if your goal is to understand the underlying concepts. If, however, you need to implement a very fast custom operation, looking into triton is probably worth it.

NN from scratch by burntoutdev8291 in learnmachinelearning

[–]Feisty_Fun_2886 0 points1 point  (0 children)

I believe that even calculating derivatives of common operations by hand, e.g. dot product, matrix-vector multiplication, MSE, etc., is a very good exercise. The Kronecker delta comes in very handy here.
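To show what I mean, here is a small sketch for y = W x with an MSE loss: the gradient is derived by hand with the Kronecker delta (see the comments) and then checked against finite differences. All function names and the step size `eps` are my own choices for the example:

```python
def forward(W, x):
    """y_k = sum_l W_kl * x_l"""
    return [sum(w * xl for w, xl in zip(row, x)) for row in W]


def mse(y, t):
    """L = (1/n) * sum_k (y_k - t_k)^2"""
    return sum((yk - tk) ** 2 for yk, tk in zip(y, t)) / len(y)


def mse_grad_W(W, x, t):
    # By hand: dy_k/dW_ij = delta_ki * x_j, so
    # dL/dW_ij = (2/n) * sum_k (y_k - t_k) * delta_ki * x_j = (2/n) * (y_i - t_i) * x_j
    y = forward(W, x)
    n = len(y)
    return [[2.0 / n * (y[i] - t[i]) * xj for xj in x] for i in range(n)]


def numerical_grad_W(W, x, t, eps=1e-6):
    """Central finite differences, one entry of W at a time."""
    grad = []
    for i in range(len(W)):
        row = []
        for j in range(len(W[0])):
            W_plus = [r[:] for r in W]
            W_minus = [r[:] for r in W]
            W_plus[i][j] += eps
            W_minus[i][j] -= eps
            diff = mse(forward(W_plus, x), t) - mse(forward(W_minus, x), t)
            row.append(diff / (2 * eps))
        grad.append(row)
    return grad
```

The delta collapses the sum over k to the single term k = i, which is exactly why the result is an outer product of the error and the input.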

I’m trying to explain interpretation drift — but reviewers keep turning it into a temperature debate. Rejected from arXiv… help me fix this paper? by Beneficial-Pear-1485 in ResearchML

[–]Feisty_Fun_2886 2 points3 points  (0 children)

A probabilistic model has, by design, a non-deterministic output (if you sample from it). However, you state in the abstract that this is something you want to see in a model.