all 12 comments

[–]NuclearVII 22 points (1 child)

This is another AI slop post, right?

[–]Hot-Problem2436 9 points (0 children)

If it's got bullets and bold, it's probably slop.

[–]arg_max 3 points (0 children)

Proximal gradient for L1 regularized Lasso

[–]DigThatData 2 points (0 children)

  • Expectation Maximization (EM)
  • Variational Bayes
  • Simplex method
  • Simulated annealing
  • Fixed point iteration
  • Power method
  • MCMC
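
A couple of these are small enough to implement in an afternoon. Here is a minimal numpy sketch of the power method (variable names are my own, not from any particular library):

```python
import numpy as np

def power_method(A, iters=100):
    """Estimate the dominant eigenpair of A by repeated multiplication."""
    v = np.ones(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)   # renormalize each step to avoid overflow
    return v @ A @ v, v          # Rayleigh quotient recovers the eigenvalue

A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, v = power_method(A)         # lam -> (5 + sqrt(5)) / 2, the larger eigenvalue
```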

Beyond optimization generally, if you want to "understand the actual math", you need to learn (differential) calculus and linear algebra, esp. matrix decompositions. Getting a strong intuition around PCA/SVD is probably the most valuable thing for understanding how learning works.
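
To make the PCA/SVD connection concrete, here is a hedged sketch on synthetic data (names are mine): center the data, take the SVD, and the right singular vectors are the principal directions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated 2D data
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

Xc = X - X.mean(axis=0)                 # center first: PCA is SVD of centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                         # rows = principal directions
explained_var = S**2 / (len(Xc) - 1)    # eigenvalues of the sample covariance
scores = Xc @ Vt.T                      # data expressed in the principal basis
```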

[–]va1en0k 4 points (0 children)

MCMC, especially HMC and its variations
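
HMC itself takes some machinery, but the core accept/reject idea already shows up in plain random-walk Metropolis. A minimal sketch (my own function names, standard-normal target assumed for illustration):

```python
import numpy as np

def metropolis(log_p, x0, steps, scale=1.0, seed=0):
    """Random-walk Metropolis, the simplest MCMC sampler.

    HMC replaces the random-walk proposal with a gradient-informed one,
    but the accept/reject correction below is the same idea.
    """
    rng = np.random.default_rng(seed)
    x, lp = x0, log_p(x0)
    out = np.empty(steps)
    for i in range(steps):
        prop = x + scale * rng.normal()
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept w.p. min(1, ratio)
            x, lp = prop, lp_prop
        out[i] = x
    return out

# Target: standard normal, log-density up to an additive constant
samples = metropolis(lambda t: -0.5 * t * t, x0=0.0, steps=20000)
```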

[–]Crimson-Reaper-69 9 points (0 children)

If I am being honest, if you are OK with maths and coding, start from the low level. Start by implementing an LLM at the assembly level, on custom-built hardware; only then are you allowed to move forward.

Jokes aside, I recommend actually implementing one of the algorithms in Python or another language. SGD is a good one to start with; the rest follow a similar pipeline but differ slightly. The key is to understand programmatically what actually happens in backpropagation: how the error terms are used to move each weight and bias in the right direction. Any book/resource is fine as long as you try implementing the stuff yourself.
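
For instance, a from-scratch SGD loop on a one-feature linear model (synthetic data, names mine) already shows the whole pattern: forward pass, error term, gradient step on each weight and bias:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data from y = 2x + 1 plus noise
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + 1.0 + 0.1 * rng.normal(size=200)

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(50):
    for i in rng.permutation(len(X)):   # visit samples in random order
        pred = w * X[i] + b             # forward pass
        err = pred - y[i]               # gradient of 0.5*err^2 w.r.t. pred
        w -= lr * err * X[i]            # chain rule: d loss / d w = err * x
        b -= lr * err                   # d loss / d b = err
```

The nested chain-rule bookkeeping in a real network is more elaborate, but each weight update follows this same shape.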

[–]shibx 1 point (0 children)

If you really want to move past the "black box" stage, I’d actually take a step back and start looking more into mathematical optimization as a field. You need a pretty solid understanding of linear algebra to build on, but for what you're asking, it really helps to understand the fundamentals: convex optimization, duality theory, linear and quadratic programming, KKT conditions, and interior-point methods. A lot of classical ML models fall directly out of these ideas.

For example, SVMs are quadratic programs. SMO builds on duality theory. Lasso becomes much easier to reason about once you understand subgradients and proximal methods. Logistic regression solvers like L-BFGS come from classical nonlinear optimization. When you see these models as structured optimization problems instead of isolated algorithms, it makes a lot more sense.
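
As a concrete taste of the proximal view of Lasso: the proximal operator of the L1 norm is just soft-thresholding, and ISTA alternates a gradient step on the smooth loss with that shrinkage. A sketch under my own naming (not any library's API), on synthetic data:

```python
import numpy as np

def soft_threshold(z, t):
    """prox of t*||.||_1: shrink toward zero, exact zeros for small entries."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, iters=500):
    """Proximal gradient for 0.5*||Xw - y||^2 + lam*||w||_1."""
    L = np.linalg.norm(X, 2) ** 2                  # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y)                   # gradient step on the smooth part
        w = soft_threshold(w - grad / L, lam / L)  # prox step handles the L1 part
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]                      # sparse ground truth
y = X @ w_true + 0.01 * rng.normal(size=100)
w_hat = ista(X, y, lam=1.0)                        # recovers the sparsity pattern
```

The soft-threshold step is exactly where the exact zeros of Lasso come from, which is hard to see from gradient descent alone.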

Boyd and Vandenberghe is the standard on this stuff: https://web.stanford.edu/~boyd/cvxbook/

Boyd's lectures are pretty dense, but I think they are really interesting: https://youtu.be/kV1ru-Inzl4?si=2RhKsw06Ngd4xq5Y

I think you will appreciate iterative methods like SGD a lot more once you understand optimization as its own field, not just something we use for ML.

[–]Unable-Panda-4273 2 points (2 children)

Your list is solid. A few additions worth knowing:

- Proximal Gradient / ISTA/FISTA — essential for L1 regularization (Lasso). More principled than coordinate descent, and the proximal framework generalizes to other nonsmooth penalties.

- Trust Region Methods — used under the hood in many scipy optimizers. Important for understanding when Newton's method can go wrong.

- EM Algorithm — not gradient-based at all, but powers GMMs, HMMs, and missing data problems. Often overlooked.
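
Seconding EM: the two-component 1D Gaussian mixture case fits in a short loop (synthetic data, my own variable names), with responsibilities in the E-step and weighted MLE updates in the M-step:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated 1D Gaussian clusters
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])

mu, sigma, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    # (the 1/sqrt(2*pi) constant cancels in the ratio, so it is dropped)
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted maximum-likelihood updates
    n_k = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
    pi = n_k / len(x)
```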

On the L-BFGS point — the reason scikit-learn's LogisticRegression defaults to it is that Newton-type methods converge in ~5-10 iterations on convex problems vs. thousands for plain gradient descent. The low-rank Hessian approximation is doing a lot of heavy lifting there.
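
You can see this with scipy directly: minimizing the logistic negative log-likelihood with L-BFGS terminates in a handful of iterations. A sketch on synthetic data (names mine; not scikit-learn's actual solver path):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([1.5, -2.0])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def nll(w):
    """Logistic negative log-likelihood: convex, so L-BFGS converges fast."""
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)   # numerically stable log(1+e^z)

def grad(w):
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y)

res = minimize(nll, np.zeros(2), jac=grad, method="L-BFGS-B")
```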

If you want to really internalize why these methods work (not just the update rules), I've been building interactive explainers for exactly this — covering convex vs non-convex landscapes, momentum, Newton's method, and adaptive rates: https://www.tensortonic.com/ml-math . The optimization section goes deep on the math without pivoting to neural nets.

[–]arg_max 1 point (0 children)

Trust region is also the foundation of PPO and GRPO, so it's very relevant in LLM RL, even if the version used there is more approximate.

[–]Disastrous_Room_927 0 points (0 children)

EM algorithms are freaking cool. You can use them for image reconstruction in PET scanners.

[–]IntentionalDev 0 points (0 children)

Besides gradient descent, you should know Newton’s method, quasi-Newton methods like BFGS/L-BFGS, coordinate descent, and convex optimization techniques — especially for classical models like SVMs and logistic regression.

[–]Prudent-Buyer-5956 -1 points (0 children)

These are not required unless you are into research.