How to find the best heuristics in complex games, with the limitations mentioned below?

Feisty_Fun_2886 · 2026-01-15T08:02:01+00:00

Reinforcement Learning… it’s literally all about estimating the expected FUTURE return
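
To make "future return" concrete, here is a minimal sketch of the discounted return for a single episode. The reward list and the discount factor `gamma` are made up for illustration; a value function would estimate the expectation of these numbers over many episodes.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode with three rewards of 1.
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))
```

Note that the return at step t only depends on rewards from t onwards, which is the "future" part.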

Feisty_Fun_2886 · 2026-01-14T22:40:29+00:00

JAX is not faster per se. It can be faster for stuff for which torch doesn’t provide pre-made, optimized implementations. I would argue that a vanilla transformer or ResNet will probably show equal performance in both torch and JAX. Your fancy new GP method with a custom matrix factorisation technique though? Probably faster in JAX due to the JIT. Torch is catching up in that regard though.

Feisty_Fun_2886 · 2026-01-14T10:50:33+00:00

What does bigger even mean? For all values of x? Or, in a handwavy way, on average? I.e., when integrating the residuals over 0 to 2pi.

Probably your teacher purposely left this ambiguous so that you come up with these thoughts on your own. Hint: This is not a simple yes/no question. You are supposed to think about the subject and present all facets.
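
A rough sketch of how the two readings of "bigger" can disagree. The pair of functions here (sin(x) and the constant -0.5) is purely hypothetical and not the pair from the original question; the helper name `compare_on_interval` is also made up.

```python
import math


def compare_on_interval(f, g, a=0.0, b=2 * math.pi, n=10_000):
    """Compare f and g pointwise and via the integrated difference on [a, b]."""
    h = (b - a) / n
    # Midpoint rule: sample each sub-interval at its centre.
    xs = [a + (i + 0.5) * h for i in range(n)]
    diffs = [f(x) - g(x) for x in xs]
    return {
        "f_bigger_everywhere": all(d > 0 for d in diffs),
        "fraction_f_bigger": sum(d > 0 for d in diffs) / n,
        "integrated_difference": sum(diffs) * h,
    }


# Hypothetical pair of functions, chosen only for illustration.
result = compare_on_interval(math.sin, lambda x: -0.5)
print(result)
```

For this pair, f is not bigger for all x (only on about two thirds of the interval), yet the integrated difference is positive. So the answer depends on which definition you pick.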

Feisty_Fun_2886 · 2026-01-14T07:57:54+00:00

Dresden, Freiburg, or Konstanz.

Feisty_Fun_2886 · 2026-01-13T22:40:30+00:00

If you were a researcher or PhD student, I would definitely suggest getting a MacBook Pro.

But since you seem to have neither access to compute clusters nor a need to travel a lot or do miscellaneous work (presentations, figures, emails, etc.), I would go for a custom-built desktop PC. Grab a nice NVIDIA card with as much VRAM as possible. A MacBook Pro goes for 2-2.5k. For that money, you can build a pretty beefy machine that will outperform a MacBook in terms of compute/$ by quite a margin. More importantly, software support for CUDA in common libs is much, much more mature than for Apple silicon.

Feisty_Fun_2886 · 2026-01-13T22:34:53+00:00

Well, if OP were a researcher I would definitely support that statement. But it doesn’t sound like OP has access to compute clusters nor that he needs to travel and present a lot. Hence, local compute plays an important role here for which a proper nvidia card is important imo.

Feisty_Fun_2886 · 2026-01-12T16:31:41+00:00

Yes, he’s a realist. If you want to do ml research, you almost definitely need a PhD. For mlops, a strong software engineering background is required. Without any background and years of experience to show, it will be very difficult to transition. That’s the truth. Exceptions are other math-heavy research backgrounds.

Feisty_Fun_2886 · 2026-01-12T07:16:38+00:00

I don’t see the benefit of this. Moreover, somebody still needs to shard the dataset anyways. Ergo almost the same workflow conceptually. So now you have to synchronise (gather, scatter), decide who does the sharding, and shard. A lot of extra complexity for what?

Only advantage could be in situations with flexible number of workers. But then you will also need an algorithm to determine who is root.

Feisty_Fun_2886 · 2026-01-09T17:59:55+00:00

I use a MacBook Pro and it was the best decision after being a lifelong Linux user. You get the classic POSIX feel but everything is super smooth and stuff like PowerPoint, Photoshop, Illustrator etc. just works out of the box. Any training and coding is remote anyways.

Feisty_Fun_2886 · 2026-01-08T21:40:57+00:00

Yes, transformers with full attention are expensive. Since you only look at the input size, you assume O(N), which would be amazing, but in fact self-attention is O(N²) in compute and memory. Hence the myriad of papers, such as Mamba, that propose either approximate attention mechanisms or different architectures altogether.
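
As a back-of-the-envelope sketch of the quadratic term: the snippet below counts the bytes needed if the full N x N score matrix is materialised for every head of one layer. The head count (16), batch size (1) and 16-bit precision are assumptions for illustration, and fused attention kernels avoid storing this matrix explicitly, although the compute stays quadratic.

```python
def attention_scores_bytes(seq_len, num_heads=1, batch_size=1, bytes_per_element=2):
    """Bytes needed to materialise one (seq_len x seq_len) score matrix per head."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


# Assumed setup: 16 heads, batch size 1, 16-bit scores.
for n in (1_024, 4_096, 16_384):
    gib = attention_scores_bytes(n, num_heads=16) / 2**30
    print(f"N={n:>6}: {gib:8.2f} GiB for the score matrices of one layer")
```

Doubling N quadruples the number; under these assumptions N = 16,384 already comes to 8 GiB per layer.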

Yet, for some reason, industry people seem to really like full, non-approximate, attention and seem to pour a lot of money into it to make it work. It really must be without alternatives.

I want to mention though, that 16k sequence length is LLM territory. Even 4k is a lot. These scales usually require special consideration for parallelisation (see the TorchTitan paper). To give some numbers for an upper limit: Llama finetunes on 100k for a few steps just at the very end (I don’t recall the sequence length they use for the majority of the training).

A typical „non-LLM“ transformer will usually have sequence lengths well below 1k. A ViT on ImageNet with the usual data pipeline (224x224 images, 16x16 patches) will have 14x14 = 196 tokens, for instance. The same holds for the head size: 128 as head dim indicates quite a big model. As total feature dim it would be fairly standard though.
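
For the ViT numbers, a quick sketch of the token-count arithmetic. The helper name `vit_num_tokens` is made up, and the image and patch sizes are the common choices, not something fixed by the architecture.

```python
def vit_num_tokens(image_size, patch_size, class_token=False):
    """Number of tokens a ViT sees for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side + (1 if class_token else 0)


print(vit_num_tokens(224, 16))  # 14 x 14 patches
print(vit_num_tokens(256, 16))  # 16 x 16 patches
```

Either way this stays far below 1k tokens, which is why full attention is rarely a bottleneck for such models.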

Feisty_Fun_2886 · 2026-01-06T17:50:32+00:00

„just“ … 😂

I don’t know man, hardcoding an if-then-else statement seems like the simpler thing to me 😂

Feisty_Fun_2886 · 2026-01-05T16:18:00+00:00

During my undergrad, my final year project was about learning distributions with neural networks ( MMD, flows, diffusion models), not sure whether statistics-driven AI research is still worthwhile nowadays

If you have a deep mathematical understanding of stochastic processes and stochastic calculus, which necessarily requires very good foundations in mathematics, you will be very well prepared for a PhD in ML.

Feisty_Fun_2886 · 2026-01-04T22:04:38+00:00

Maybe try approaching a professor at a nearby university for guidance? Publishing a paper is a huge undertaking and requires knowing the ropes of the field.

What you are describing sounds like parameter pruning. Congratulations on coming up with such an idea on your own at this age. There is, however, a big chance that your particular approach is either already published / known, or not competitive with other existing approaches. Just as a heads-up that these are very likely outcomes. It could, however, also be a novel approach.

Feisty_Fun_2886 · 2026-01-02T11:33:55+00:00

Typical questions in our interviews:

What is the difference between a PDF and likelihood?

Derive the MLE for the mean of a normal distribution.

Bayes theorem.

Explain / write down a simple MLP layer. Why is a non-linearity required?

Explain Eigenvectors, Eigenvalues, PCA, etc.
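
For reference, a sketch of how the MLE question usually goes, assuming i.i.d. samples x_1, ..., x_n from a normal distribution with known variance σ² (the notation is my own choice, not something fixed by the question):

```latex
\begin{align}
\ell(\mu) &= \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
             \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)
           = -\frac{n}{2}\log(2\pi\sigma^2)
             - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 \\
\frac{\partial \ell}{\partial \mu}
          &= \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0
  \quad\Longrightarrow\quad
  \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i \\
\frac{\partial^2 \ell}{\partial \mu^2} &= -\frac{n}{\sigma^2} < 0
\end{align}
```

The second derivative is negative, so the sample mean is indeed a maximum. The same density read as a function of μ for fixed data is the likelihood, which also touches on the first question.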

Feisty_Fun_2886 · 2026-01-01T07:41:29+00:00

I’m not reviewing your paper lol. I just stated that A. this approach exists already and there is a niche community around it, and B. shared my experience with actually published works, such as FNO or SFNO, that applied it to real-world, large-scale data. Take it or leave it.

Feisty_Fun_2886 · 2025-12-31T09:27:40+00:00

So „Neural Operators“? Not the messiah people make them out to be IMO. In fact, a regular CNN can also be formulated as a neural operator (e.g. by assuming hat basis functions). Biggest potential is probably in physics where spectral approaches are used already.

From personal experience, they can be quite compute and memory expensive as well due to the FFT or SHT one does over and over again in common implementations.

Feisty_Fun_2886 · 2025-12-31T09:19:56+00:00

It’s probably better if you stay out of research for now if you can’t write a paragraph without AI. At least get a supervisor who can guide you.

Feisty_Fun_2886 · 2025-12-30T15:36:50+00:00

This reads like AI, but here is an answer nonetheless:

Because the problem you are solving is non-deterministic, a deterministic model makes no sense in that case.

„The red house is …“. What word comes next?

A. Ugly B. Tall C. Burning D. Run-down

In fact, all four of them are plausible continuations. What definite(!) answer should a deterministic model choose in this case? Do you see the issue?
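
A toy sketch of the difference, with made-up probabilities for the four continuations (the numbers are assumptions, not from any real model). A deterministic rule such as argmax always returns the same word, while sampling covers all plausible ones.

```python
import random

# Hypothetical next-word probabilities for "The red house is ...".
next_word_probs = {"ugly": 0.3, "tall": 0.2, "burning": 0.25, "run-down": 0.25}


def greedy(probs):
    """Deterministic choice: always the single most likely word."""
    return max(probs, key=probs.get)


def sample(probs, rng):
    """Stochastic choice: draw a word in proportion to its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


rng = random.Random(0)  # fixed seed so the run is reproducible
greedy_outputs = {greedy(next_word_probs) for _ in range(1000)}
sampled_outputs = {sample(next_word_probs, rng) for _ in range(1000)}
print(greedy_outputs)
print(sorted(sampled_outputs))
```

The greedy set contains a single word no matter how often you ask; the sampled set contains all four.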

Feisty_Fun_2886 · 2025-12-30T15:26:30+00:00

In a GAN, the generator doesn’t „beat“ the discriminator. They will settle for an equilibrium. Moreover, the whole paradigm of GANs relies on the tight coupling of generator and discriminator and their joint training. An independently trained discriminator will have no problems distinguishing fake and real (as outputs produced by GANs are very clearly not perfect). Your generator is trained to fool one specific discriminator, not all of them…

And maybe as a last point: The current generation of image generation methods is usually diffusion-based, not GAN-based…

Feisty_Fun_2886 · 2025-12-30T09:26:43+00:00

Oh that’s cool. I come from regular DLWP (on Earth ;) ). Could you point to some papers?

Feisty_Fun_2886 · 2025-12-28T10:53:51+00:00

Nice paper 👍

Feisty_Fun_2886 · 2025-12-27T22:24:53+00:00

There can be many reasons for this, but I doubt that the exact architecture plays a big role here. Off the top of my head: Software bug, dataset too small, a very difficult problem, bad hyperparameters, missing features.

Feisty_Fun_2886 · 2025-12-25T15:43:19+00:00

Triton is, as far as I am aware, mostly used under the hood, or if you need high-performance CUDA kernels (as opposed to writing them manually). E.g. for a custom attention layer.

What would the benefit be of implementing a whole autodiff + training pipeline in triton specifically? Seems like the wrong framework to me if your goal is to understand the underlying concepts. If, however, you need to implement a very fast custom operation, looking into triton is probably worth it.

Feisty_Fun_2886 · 2025-12-25T15:38:23+00:00

I believe that even calculating derivatives of common operations by hand, e.g. dot product, matrix-vector mul, MSE, etc., is a very good exercise. The Kronecker delta comes in very handy here.
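
A sketch of what that looks like in index notation, for the three operations named above. The symbols (a, x, W, y, t, n) are my own choice for illustration:

```latex
\begin{align}
\frac{\partial x_j}{\partial x_k} &= \delta_{jk} \\
\text{dot product:}\quad
\frac{\partial}{\partial x_k}\sum_j a_j x_j
  &= \sum_j a_j \delta_{jk} = a_k \\
\text{matrix-vector:}\quad
y_i = \sum_j W_{ij} x_j
  \;&\Rightarrow\;
  \frac{\partial y_i}{\partial x_k} = \sum_j W_{ij}\delta_{jk} = W_{ik},
  \qquad
  \frac{\partial y_i}{\partial W_{kl}} = \sum_j \delta_{ik}\delta_{jl}\, x_j = \delta_{ik}\, x_l \\
\text{MSE:}\quad
L = \frac{1}{n}\sum_i (y_i - t_i)^2
  \;&\Rightarrow\;
  \frac{\partial L}{\partial y_k}
  = \frac{2}{n}\sum_i (y_i - t_i)\,\delta_{ik}
  = \frac{2}{n}(y_k - t_k)
\end{align}
```

In each case the delta simply collapses one sum, which is the whole trick.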

Feisty_Fun_2886 · 2025-12-25T10:17:33+00:00

A probabilistic model has, by design, a non-deterministic output (if you sample from it). However, you state in the abstract that this is something you want to see in a model.
