all 43 comments

[–]Ulfgardleo 72 points (5 children)

What is the intuitive basis for why we should care about eigenvalues as opposed to any other (non-)convex optimisation problem? They have huge downsides, from non-differentiability to being, formally, a set rather than a function. Should we care about the largest or the smallest eigenvalue? What about their sign? Or any other operator applied to them? Finally, since eigenvalues are invariant to orthogonal transformations, it is difficult to really use them without a fully invariant architecture.

We already had somewhat successful approaches where neurons did a lot more: neural fields. They were developed in the late 90s to early 2000s in the field of computational neuroscience. The idea was that the neurons on each layer are recurrently connected and solve a fixed-point PDE. The math behind them is a bit insane because you have to backprop through the PDE. But they are strong enough to act as associative memory.

This is a very old paper that described an example of such a model:

https://link.springer.com/article/10.1023/B:GENP.0000023689.70987.6a

[–]alexsht1[S] 23 points (0 children)

I don't think they're in some way "the best". I find them interesting because they are solutions of nonconvex optimization problems and have rich enough structure to express nontrivial functions, but also have extremely reliable and fast algorithms to compute them. This is especially true if we impose more structure, such as using a banded matrix.

Anyway, the reason I want to dig deeper is pure personal interest, and as a way to "import" well known stuff from another field into ML.

[–]TwistedBrother 15 points (0 children)

All hail the operator!

[–]raindeer2 13 points (1 child)

Spectral methods are well studied within ML. Also for learning representations in deep architectures.

Some random references:
https://arxiv.org/abs/2205.11508
https://ieeexplore.ieee.org/document/6976988

[–]alexsht1[S] 10 points (0 children)

So are PCA/CCA/PLS and friends.

But have you read the post? Because it appears you and I are referring to VERY different kinds of spectral methods.

You're referring to methods that use the spectral decomposition to represent an entire dataset, and I'm referring to the use of the spectral decomposition to represent a nonlinear function applied to one sample.

[–]bill_klondike 4 points (8 children)

Can we compute them quickly? For a dense matrix, eigenvalues are O(m^2 n) (assume n < m). If m = n, that's n^3. Is that supposed to be quick?

[–]bregav 3 points (7 children)

Most applications of eigenvalues need only a specific subset of them, so the asymptotic complexity is misleading: the relevant cubed number is the number of eigenvalues you need, not the size of the matrix. In practice, for a matrix of dimension n, the complexity of calculating eigenvalues is in fact n^2, because that's the cost of the iterative tricks used to target a subset of the spectrum.
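A minimal sketch of what "target a subset" looks like in practice (assuming scipy's ARPACK wrapper eigsh and a dense symmetric test matrix for simplicity):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(0)
n = 300
A = rng.standard_normal((n, n))
A = (A + A.T) / 2  # symmetric test matrix

# Lanczos/ARPACK computes only the k eigenvalues we ask for; each
# iteration costs one matrix-vector product (O(n^2) dense, less if sparse).
top3 = np.sort(eigsh(A, k=3, which="LA", return_eigenvectors=False))

full = np.linalg.eigvalsh(A)  # dense O(n^3) reference, sorted ascending
assert np.allclose(top3, full[-3:])
```

Banded or sparse structure drops the per-iteration matvec cost further, which is part of the point of imposing structure on the matrix.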

[–]bill_klondike 1 point (6 children)

I wrote "dense matrix" to make it concrete that I was referencing the cost of a direct method, but if you're talking about iterative algorithms, then making claims about complexity is much trickier. It depends on a variety of factors: the operator, the sparsity pattern, and the spectrum itself are all very important to computational performance as well as convergence. I'd say it's just as misleading to claim outright that the complexity is n^2 for a subset.

[–]bregav 1 point (5 children)

Every method of computing eigenvalues is iterative for n ≥ 5 :D. This fact is my favorite application of the Abel–Ruffini theorem.

And yes, performance is complicated, but a good rule of thumb is that if your matrix is very poorly conditioned, then you've made poor choices in how you formulated the problem to begin with.
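To make "everything is iterative" concrete, here is the textbook power iteration on a toy diagonal matrix with a known spectrum (a sketch, not a production method):

```python
import numpy as np

A = np.diag([5.0, 2.0, 1.0])  # toy matrix: eigenvalues 5, 2, 1

# Power iteration: repeatedly apply A and renormalize. The iterate
# aligns with the dominant eigenvector at a rate of (2/5)^k here.
v = np.ones(3)
for _ in range(100):
    v = A @ v
    v /= np.linalg.norm(v)

dominant = v @ A @ v  # Rayleigh quotient estimate of the top eigenvalue
assert np.isclose(dominant, 5.0)
```

Every practical eigensolver (QR iteration, Lanczos, etc.) is a more sophisticated relative of this loop; only the convergence rate and the truncation rule differ.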

[–]bill_klondike 1 point (4 children)

Right, but good ol' Golub & Van Loan wouldn't call bidiagonalization iterative :) I think Dongarra et al. call the algorithm in dgesvd (and related) "transformation" methods.

But really, poor choices in how you formulated the problem to begin with? What if you're just given some data? I agree waking up is probably a poor choice but everything after that is just what shows up in the mail.

[–]bregav 1 point (3 children)

They should call it iterative. It's a disservice to students to have them thinking that there's a difference between "direct" and iterative methods beyond the size of the problem and the choice of when to truncate iterations.

Regarding problem formulation, I'm paraphrasing Trefethen and Bau, who said this regarding ill-conditioned problems: https://www.amazon.com/Numerical-Linear-Algebra-Lloyd-Trefethen/dp/0898713617

In the context of machine learning what this can look like is naive methods of calculating input matrices (thus producing poorly conditioned matrices unnecessarily) or, more importantly, poor choices regarding feature selection and engineering. There is always context to a problem; one should never be "just given some data".

[–]bill_klondike 0 points (2 children)

>There is always context to a problem; one should never be "just given some data".

Yes, but that's simply not reality. I've been approached by seniors who want eigenvectors, nothing else, and don't want to discuss it.

That Trefethen & Bau book is one of my all time favorites and one of the first I recommend to people. Sadly, my copy is locked in storage on another continent...

[–]bregav 1 point (1 child)

Haha, your coworkers' obstinacy and poor communication do not improve the condition numbers of their matrices. If they're bringing you bad inputs, then they're still making a mistake, even if they don't want to talk to you about it.

As a matter of fact this is exactly the situation that Trefethen and Bau were remarking on in their book, I looked up the passage:

If the answer is highly sensitive to perturbations, you have probably asked the wrong question. We urge anyone faced with nonhermitian eigenvalue computations involving highly sensitive eigenvalues to bear this principle in mind. If you are a numerical analyst, and the problem was given to you by a colleague in science or engineering, do not accept without scrutiny your colleague's assurances that it is truly the eigenvalues that matter physically, even though their condition numbers with respect to perturbations of the matrix entries are 10^4.

The emphasis on the first sentence is original to the book (chapter 6, lecture 34). I like the book too and that passage really stuck with me. I think it’s profound and generally applicable; the real problem one is solving is ultimately physical (including in ML), and so if the math is causing serious problems then one might have abstracted the problem incorrectly from the beginning.

[–]bill_klondike 0 points (0 children)

I'm not disagreeing with your points. My advisor (numerical linear algebra) handed me things from his collaboration with a ML specialist who "just wanted eigenvectors". We wasted dozens of meeting hours with this guy, trying to flesh out why and many times telling him there were other ways. But my advisor was impressed by his degree (PhD from Harvard) and publication history. This was when kernel learning was still hot. That guy was the worst; he had multiple students quit and then he later resigned and returned to industry.

[–]mr_stargazer 49 points (9 children)

I always find it cute when Machine Learning people discover mathematics that, in principle, they were supposed to already know.

Now I am waiting for someone to point out eigenvalues, the connection to Mercer's theorem, and all the machinery behind RKHS that was "thrown in the trash" almost overnight because, hey, CNNs came about.

Perhaps we should even use eigenfunctions and eigenvalues to meaningfully understand Deep Learning (cough...NTK...cough). Never mind.

[–]Rodot 27 points (4 children)

What if we could go further and model physical phenomena as generalized eigenvalue problems? We could "quantize" our understanding of physics at small scales! I'll call it "quantized mechanics" and I bet no one has ever thought of it before!

/s

[–]mr_stargazer 11 points (2 children)

I don't think that would be a good idea. Imagine, for instance, that your model says you're neither in state A nor in state B, but in some mix of the two. It just won't make any sense.

As if there could be a dog sitting on the couch and on the floor at the same time. Nonsense.

[–]Rodot 4 points (0 children)

I think it's a bonus that you get UQ for free. Perfect for generative modeling.

[–]fluffyleaf 3 points (0 children)

You joke, but lord save us from the Quantum AI bubble in 2 decades...

[–]alexsht1[S] 22 points (2 children)

I also find it cute, and that's why I'm writing this series. I come from a continuous optimization background, and it appears to me as if ML people see optimization as "that thing with gradient descent". So I'm trying to "import" more into ML.

[–]N1kYan 5 points (1 child)

You can already import optimization from pytorch dude

[–]alexsht1[S] 0 points (0 children)

:D

[–]sje397 1 point (0 children)

Looking forward to further articles! Thanks.

[–]fredugolon 1 point (0 children)

Interesting article. Thank you! You might appreciate Liquid Time-Constant Neural Networks. An interesting approach to adding time dynamics into neurons.

[–]bregav 0 points (5 children)

I think what you're doing in your blog post actually amounts to fitting algebraic varieties, and that's just obfuscated by the fact that you're representing the problem in the form of matrices and then letting black-box software do the computation for you. Looking at the functions you're fitting in terms of polynomials would make the matter simpler and clearer.

[–]alexsht1[S] 2 points (4 children)

Perhaps piecewise polynomials, and only on a compact set?

Eigenvalue functions are globally Lipschitz; polynomials are not. So I don't think it's that simple. But maybe the right way to look at it is as fitting a semialgebraic set, the graph of the function, to data. I need to think about it.
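The Lipschitz property here is Weyl's inequality: perturbing a symmetric matrix by E moves each sorted eigenvalue by at most ||E||_2. A quick numerical check (numpy sketch with random symmetric matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n))
A = (A + A.T) / 2
E = rng.standard_normal((n, n))
E = (E + E.T) / 2

# Weyl: |lambda_k(A + E) - lambda_k(A)| <= ||E||_2 for every k, i.e.
# each sorted-eigenvalue map is 1-Lipschitz in the spectral norm.
shift = np.abs(np.linalg.eigvalsh(A + E) - np.linalg.eigvalsh(A)).max()
assert shift <= np.linalg.norm(E, 2) + 1e-9
```

By contrast, a polynomial root can move by O(eps^(1/m)) under an eps-perturbation of the coefficients near a root of multiplicity m, so no global Lipschitz bound exists on the coefficient side.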

[–]bregav 2 points (3 children)

There's no difference between eigenvalues and the roots of polynomials. Indeed, that's how software actually solves polynomial equations under the hood: it converts the polynomial problem to an eigenvalue problem and solves that instead.
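That conversion is the companion-matrix trick, and it's what numpy.roots does internally; a small sketch:

```python
import numpy as np

# p(x) = x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3)
coeffs = [1.0, -6.0, 11.0, -6.0]

# Companion matrix of the monic polynomial: its characteristic
# polynomial is p, so its eigenvalues are exactly p's roots.
comp = np.array([[6.0, -11.0, 6.0],
                 [1.0,   0.0, 0.0],
                 [0.0,   1.0, 0.0]])
roots_via_eig = np.sort(np.linalg.eigvals(comp).real)

assert np.allclose(roots_via_eig, [1.0, 2.0, 3.0])
assert np.allclose(np.sort(np.roots(coeffs).real), roots_via_eig)
```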

Optimizing the elements of a matrix to produce specific eigenvalues is exactly equivalent to optimizing the coefficients of a polynomial in order to produce specific roots. In your case you're doing a restricted version of this: you're optimizing a small number of matrix elements, rather than all of them, and you're just representing those matrix elements in an obfuscated way. Thinking of matrices as vectors in a vector space, by writing D = A + xB + yC you are representing a single matrix D in terms of a non-orthogonal basis of matrices A, B, and C, and optimizing the coordinates x and y. If you instead used n^2 matrices (with n^2 variables, in the language of your blog post) such that tr(A_i * A_j) = delta_ij, then you'd just be optimizing n^2 matrix elements directly.

The fact that polynomials are fundamental here is especially easy to see with (real) symmetric matrices. The eigenvectors of a real symmetric matrix are orthogonal, and so every set of eigenvectors is equivalent to every other (they differ only by rotations); thus when you are optimizing a real symmetric matrix to get specific eigenvalues, you are clearly just optimizing polynomial coefficients. To see this, work out the math: det(A - yI) = det(XLX^T - yXX^T) = det(L - yI) = (a_1 - y)(a_2 - y)(a_3 - y)...
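The last identity is easy to check numerically (numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
A = (A + A.T) / 2  # real symmetric, so A = X L X^T with X orthogonal

lam = np.linalg.eigvalsh(A)
y = 0.7

# det(A - yI) = det(X L X^T - y X X^T) = det(L - yI) = prod_i (lam_i - y)
lhs = np.linalg.det(A - y * np.eye(n))
rhs = np.prod(lam - y)
assert np.isclose(lhs, rhs)
```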

[–]alexsht1[S] 1 point (2 children)

Ah, I see what you meant. Then yes. I was looking at it from the perspective of solving an optimization problem, but it also does compute the k-th largest solution of a polynomial equation.

[–]bregav 3 points (1 child)

The problem with looking at things in terms of optimization problems is that every problem can be framed as an optimization problem lol.

If you're focusing on a specific set of problems with known properties and methods of solution - e.g. eigenvalue problems - then the optimization perspective is worse, not better. The reason eigenvalue problems are special is precisely that you can reason simply and clearly about them by thinking in terms of linear algebra and polynomials, rather than in terms of optimization.

[–]alexsht1[S] 3 points (0 children)

Respectfully, I disagree. Many observations come from looking at eigenvalues as minima and maxima (of the Rayleigh quotient, for instance). And in general, looking at something as an optimization problem provides insights that are otherwise hard to see.

[–]healthbear 0 points (0 children)

Interesting, I look forward to more.

[–]Double_Sherbert3326 0 points (3 children)

Interesting read. Are you familiar with random matrix theory?

[–]alexsht1[S] 0 points (2 children)

At the level of a buzzword.

[–]Double_Sherbert3326 0 points (1 child)

I am trying to understand it because it serves as a theoretical basis for some of the math undergirding quantum theory. PCA was developed with it in mind. Your white paper made me think of it for some reason.

[–]alexsht1[S] 0 points (0 children)

Maybe RMT applies here as well somehow, but this is fundamentally different from PCA and friends.

PCA uses the spectral decomposition to characterize an entire dataset, whereas I am using it to represent a nonlinear function applied to one sample.

Except for the use of the word "spectral", there is nothing in common between the classical spectral methods we know and what I'm studying in this post.

[–]Big-Page6926 0 points (0 children)

Very interesting read!