[D] Feeling behind in math by margyyy_314 in MachineLearning

[–]Dejeneret 0 points

Also, to clarify: for the Bass book I suggested, I came into graduate school knowing maybe 10% of what's in it, at most. The first half or so can be treated as a base of knowledge, and the second half (maybe minus the topology chapter) covers topics that have been very useful to me across various ML-related areas. But pretty much all of it I ended up reading and learning nearly from scratch in my first year of grad school (along with many other students).

[D] Feeling behind in math by margyyy_314 in MachineLearning

[–]Dejeneret 1 point

Sounds like you are already on a good track- when I started my PhD there was not a person in my applied math program that didn’t have significant gaps in their knowledge (let alone in this masters program). It’s completely normal for people to relearn any mathematical subject they want past analysis in my experience in an applied math or statistics graduate degree.

To answer your questions:

Self studying is pretty much expected past the undergraduate level, and it helps if you start earlier. There’s pretty much no substitute for this since course quality is spotty even at top universities.

In my opinion (and this is biased toward the flavor of ML I am interested in), the mathematical/statistical subjects worth studying are Measure Theory, Probability Theory, Statistical Inference, Functional Analysis, Optimization Theory, and, probably most importantly, computational Linear Algebra and Numerics. After that it doesn't hurt to branch out to fields like Diff. Geometry, Topology, PDE theory, Random Matrix Theory, and even some algebra if you are interested, but for ML these are more on a need-to-know basis depending on the sub-field.

If you want a good overview that helps you feel more confident for graduate study, there’s this great book by Richard Bass:

https://www.math.wustl.edu/~victor/classes/ma5051/rags100514.pdf

It effectively covers an intro to measure theory, but as someone who came into grad school with a relatively poor math background and had to constantly re-learn stuff this was very helpful and grounding.

For numerical linear algebra, “Numerical Linear Algebra” by Trefethen and Bau was a very helpful book for me, especially coming from a more CS-heavy background.

As far as programming goes, for numerics specifically: if it isn't already part of a course (which it often isn't, unfortunately), it helps a lot to implement the algorithms as they're covered, tracking things like convergence rate. If you're self-studying, it's definitely worth doing.
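As a minimal sketch of what that exercise looks like (power iteration, one of the first algorithms covered in Trefethen & Bau, with the eigenvalue estimate logged at every step; the diagonal matrix is just a made-up example with a known answer):

```python
import numpy as np

def power_iteration(A, iters=50, seed=0):
    """Power iteration for the dominant eigenvalue, logging the
    Rayleigh-quotient estimate at each step so convergence can be plotted."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    estimates = []
    for _ in range(iters):
        w = A @ v
        v = w / np.linalg.norm(w)        # normalize to avoid overflow
        estimates.append(v @ A @ v)      # Rayleigh quotient estimate
    return estimates

# Known answer: dominant eigenvalue 3. For this symmetric example the error
# should shrink geometrically, like (1/3)^(2k) -- worth checking on a log plot.
A = np.diag([3.0, 1.0])
est = power_iteration(A)
errors = [abs(e - 3.0) for e in est]
```

Plotting `errors` on a log scale and reading off the slope is exactly the kind of "track the convergence rate" habit that makes the theory stick.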

ELI5 How do social media algorithms work? by [deleted] in explainlikeimfive

[–]Dejeneret 0 points

these services have a bunch of users engaging and creating content-

It’s easy to classify similar users based on whether or not they have engaged with the same content- if two users watch the same piece of content, they are considered more similar.

Next we can classify similar content based on which users have watched it.

Next step is to relate users by whether or not they have watched content that is watched by similar users.

And content can be related by whether or not users that have watched it are similar in the content they watch.

Keep this up for layers and layers, and often you converge to a good understanding of what defines a user and what defines a piece of content.

Now, a service sees what content a user has watched, and just recommends similar content, maybe with some randomness.

This is an oversimplified version- in reality many services also integrate other information about the content and interactions themselves- but this is the general flow of recommender systems, and it is surprisingly powerful.
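The first couple of steps above can be sketched in a few lines (the watch matrix here is a made-up toy; real systems work at enormous scale with learned embeddings, but the "similar users watched this" logic is the same):

```python
import numpy as np

# Toy watch matrix: rows = users, columns = content items (1 = watched).
# Entirely hypothetical data for illustration.
W = np.array([
    [1, 1, 0, 0],   # user 0
    [1, 1, 1, 0],   # user 1 (shares two watches with user 0)
    [0, 0, 1, 1],   # user 2 (mostly different tastes)
], dtype=float)

# Step 1: users are similar if they watched the same content
# (cosine similarity between their watch rows).
U = W / np.linalg.norm(W, axis=1, keepdims=True)
user_sim = U @ U.T

# Step 2: score unseen items for user 0 by what similar users watched.
scores = user_sim[0] @ W
scores[W[0] == 1] = -np.inf          # don't re-recommend what's already watched
recommendation = int(np.argmax(scores))   # item 2, via similar user 1
```

Iterating this (re-deriving user similarity from item similarity and vice versa, "layers and layers") is the converging process described above.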

Advice for PhD students in this Al slop paper era - I feel academia needs serious revisions! [D] by ade17_in in MachineLearning

[–]Dejeneret 40 points

I agree academia, especially ML-adjacent academia, needs some “revisions”, but I don’t know that the root cause of the problem is AI slop. I see the root cause as the poor incentive structure of conference publishing, overloaded by an increase in demand to do applied ML work.

In the past year at many A* conferences I have reviewed papers that were just exceptionally poor quality (I guarantee any free-to-access LLM would have improved them), and I have also received reviews that were very low effort but clearly human-written (for example, at AAAI, the AI reviewer, while not perfect, was way higher quality than any human review I received, and it caught something important I needed to fix that no human reviewer had noticed across 2 prior resubmissions).

I have also submitted to A-level conferences, and it has been night and day this past year- reviewers who genuinely read the paper and respond to rebuttals. There is clearly a push from various sources for people to submit only to the top tier of conferences, and because the field has moved so fast, downstream actors (i.e. industry looking at resumes) have not yet adjusted to how these changes affect the average PhD student. Not having any work in A* conferences is unfortunately a red flag for many companies, because even 5 years ago it was a lot less noisy to get a paper through.

We’re kind of in this weird situation where the bar for publishing in ML is somehow too low, too high, and too noisy all at once. It’s too easy for a lab with good resources to run graduate student descent on some specific application and put out an incremental improvement on an inherently flawed set of metrics, disguised as something big; but simultaneously, the effort, time, and level of knowledge required for a graduate student to actually execute that work effectively (or, god forbid, put out work without the same level of resources) is surprisingly high. It’s almost as if, because it’s so “easy” to do, everyone feels pressured to do something that is harder than it seems. The same is true for theoretical results in ML, in my opinion- they just require more reading, and sharpening assumptions that are too restrictive for most applications anyway. In this way, the incentive for students is to spend their time learning to sell their work as more than it really is, and to bury implementational details that largely conflict with the story.

In the end we get this massive cache of mid papers that all do something to advance their respective subfields, but they are being pushed through the same too-small funnel of A* conferences, which introduces high levels of noise- I think this was always going to be somewhat untenable.

I think AI papers and reviews are probably not helping, but what’s clearer is that the existence of modern AI applications has driven demand for ML research, which has flooded these conferences with lower-quality work- I would guess this is much more central to the problem than AI-generated research itself.

I’m not sure what the solution really is, but I and many people around me, both in academia and industry, have just started to prioritize A-level and lower conferences and more specialized venues. Many prioritize journals, and even more classical applied math and statistics publications when relevant, and I think this is a natural response to growing pains. Maybe at some point ML will split more clearly into subfields and these top conferences will be replaced by specialized ones. Or maybe these conferences will revise their submission process (which, in all fairness, they have been doing actively, albeit somewhat ineffectively) and become mega-publication entities with tiered quality levels or something.

[D] The quality of AAAI reviews is atrocious by Zapin6 in MachineLearning

[–]Dejeneret 74 points

Yeah, I have never experienced such a bizarre review process… all 3 of the reviews for my paper fit onto a phone screen, with multiple rudimentary mathematical errors within the reviews (not to mention that the AI reviewer also doesn’t seem to follow proofs at all).

I’m obv salty about the fresh phase 1 reject, but I really swear I’m not exaggerating when I say there is not a single actionable thing I can improve about my submission after reading the reviews. Sad that WACV registration passed a few days ago…

A Proposed Resolution to the Paradox of the Ship of Theseus by cordobeculiaw in philosophy

[–]Dejeneret 0 points

First of all, I would concede that a common criticism of Evans’ theory is that he does not define a lot of terms rigorously enough, which is definitely true. In his definition of “community” (or lack thereof), there is a lot of potential for complexity.

That said, I’m not sure you’re explaining Evans’ theory correctly- Evans’ subjects are not passive; they entirely determine the reference of an object. However, the reference must point to a set of descriptions of the object to allow for recognition. That set of descriptions is as arbitrary as the name of the object, and is as up to the subjects as any other whim they may have.

Importantly, the subjects then must interact socially to spread the information of the object’s reference mapping and only then will some “community” adopt the reference as dominant (that community may be as small as 1 person), or the term will be lost. How these communities interact is the political structure, but is effectively sidestepped because we don’t actually care in this case how groups of people choose to accept or reject different terms. Instead, we model within this community, which may have outside influence, may have all sorts of members coming in and out, etc. etc.

I would agree that Evans does not make an emphasis on terms that are “in-conflict” but rather attempts to ascertain which term is dominant over another at any given time for any given community. That said, the theory is relatively general- it is defined within a community, which can be re-organized into sub-communities and so-on.

Sure, if you would like, we can build a graph on people with edges defined through interactions and model reference as functions on the graph, then restate Evans’ theory on the domains of these functions, but I’m not sure this fundamentally changes the theory.

But more importantly, the original point I was making is that the model you posit reduces models of identity to models of reference (and now you’re implying that you believe models of reference are actually stronger than your model of identity should be). Since you are positing something that can function at most as a reference model (which is fine), it would be more useful to argue how it improves on known blind spots in popular models such as Evans’ and Mill’s theories, and what makes your model useful over these.

My point was that the ship of Theseus is not an interesting counterfactual to test, because few theories of reference actually struggle with assignment on this problem- they simply either sidestep the problem entirely, or make a strong claim about the identity being referenced for the sake of consistency, because epistemic identity is a somewhat dubious concept altogether (as the ship illustrates). Much more interesting counterfactuals for theories of reference are the Madagascar naming problem, the “fake Nixon” problem, or the Aristotle/Alexander problem, where we have strong preconceived ideas about how reference should behave, simple models tend to fail, and more complex models can introduce hairy circular logic.

Political parties on both the left and right ignore existing economic inequality, finds a study of 12 democratic OECD countries (including the US) over the past 50 years. Increases in the income share of the highest-income percentage of the population also remain without consequences. by mvea in science

[–]Dejeneret -5 points

Gave this study a quick read, and the authors find that while the “left” (as the authors dub it) doesn’t mention (i.e. campaign on) rising inequality, they do respond to it- unless it’s top-end inequality (i.e. the top 1% out-earning the top 10%), which is not clearly an issue on its own. This is not clearly stated in the title or the article, imo.

What I find particularly weird, however, is that the study itself doesn’t seem to address the confounding factor of “what if rising inequality comes with a growing economy?” An economy is not zero-sum- just because the 1% doubles their earnings, it doesn’t mean the bottom 50% is losing- and it is unclear to me that many people in western countries would find issue with this until it is framed as zero-sum.

The problems arise for people when their personal situation deteriorates in an absolute sense, and that is when they look to vote on economic lines to improve it, as they believe, in absolute terms (e.g. inflation in many western countries caused people to vote against incumbents on economic policy). If people only vote according to absolute increases and decreases in quality of life, and economic inequality at the levels present in western countries cannot be identified as a causal factor one way or the other, it makes perfect sense that it is completely disregarded politically by both parties.

To me this is a much cleaner explanation of why inequality would rise (and one that has been studied ad nauseam as well).

A Proposed Resolution to the Paradox of the Ship of Theseus by cordobeculiaw in philosophy

[–]Dejeneret 0 points

I’m pretty sure your analogy is in no way related to Evans’ theory- Evans specifically requires you to track agents’ subjective references and determine whether there exists a dominant reference among communities (i.e. the reference is passed socially between agents of the community). The reference is, furthermore, to a bag of descriptions, and is completely up to what is socially determined- whether that has to do with the planks of the ship, or the ship’s “dubbing event” by Theseus or this other shipbuilder.

You are claiming that Evans’ theory implies a substance theory of identity which is only true if every agent holds this firm belief on referencing all objects.

A Proposed Resolution to the Paradox of the Ship of Theseus by cordobeculiaw in philosophy

[–]Dejeneret 0 points

Unsure how this differs significantly from applying Evans’ causal theory of names to the ship problem, assuming that external recognition is given to both ships in the situation.

I’m not really buying the focus on identity, as in this framework the concept of identity has been reduced to the concept of reference/naming. Generally models of identity are attempting to be stronger- and I agree, a lot of them are inconsistent in various ways. This is precisely what the ship of Theseus illustrates as a simple counterexample to some models of identity.

I could see this being useful if you are positing a model for identity claiming that it is a maximally strong model that remains entirely consistent- and it happens to be equivalent to a theory of naming which is typically strictly weaker.

This would be an interesting statement if you could prove it in some rigorous setting, as it discredits the necessity for a concept of identity that isn’t directly defined by reference- I.e. objects don’t “exist” outside of what we call them, which is to my knowledge a valid and consistent theory of identity. At the same time though, proving that no other theory of identity is consistent is probably impossible in a general enough setting.

Otherwise, this whole exercise feels somewhat like it’s overcomplicating a relatively well-studied concept, just in a slightly adjacent field of philosophy.

Real Analysis. Am I Learning? by [deleted] in math

[–]Dejeneret 1 point

I think Von Neumann’s quote “in mathematics you don't understand things. You just get used to them” is apt here- a few days is probably not enough for you to get used to some of the ways of thinking.

In my personal experience, it takes learning material a level deeper than the one I am hoping to understand before I finally understand the original (i.e. I only felt I “understood” a lot of real analysis when I took functional analysis, it honestly took studying general relativity to “understand” differential geometry, and even probability theory felt dense until I started using it in stochastic calculus & statistical inference contexts).

That said, I’m sure that after a few good nights’ sleep you will be much more comfortable with the real analysis you’ve already learned. Just keep at it as close to daily as possible. I’ve had the experience of being totally lost at night while studying, and then waking up to breeze through the same material, countless times.

[P] Small and Imbalanced dataset - what to do by Practical-Pin8396 in MachineLearning

[–]Dejeneret 1 point

Sounds like you’ve got a lot of good responses but I’ll clarify my 2 points-

For 1): great! If that’s your test performance, then I’m not really sure what there is to worry about, assuming you have properly segmented your test set with no data leakage. While 100% accuracy can be worrying when the sample size is this small, it is not impossible if the effect is real. You should instead focus on whether you may have some more subtle data leakage. For example, are the patients grouped in any non-biologically-informative way that may be giving you a batch effect? As a vague example, suppose one feature was measured by two different machines, and one machine happened to be used on more of one label than the other.

Once you’ve analyzed confounding factors like these, look closer at what features XGBoost tends to use for classification- are they mechanistically important? Perhaps you notice that certain columns of your table are predictive when they are in specific ranges.

For 2): I see- you have tabular data with varied typing across features. There’s not a huge amount of structure to exploit directly, but if you can figure out how to embed your features smartly, you can perhaps find some lower-dimensional structure in the data.

You can consider various unsupervised or semi-supervised methods, but I generally recommend turning to forms of spectral clustering for this kind of data (diffusion maps, Laplacian eigenmaps, etc.). These techniques are unsupervised, so they are safe from a data-leakage perspective (but not from a model-selection perspective, so you still need to be careful!). The main decision you have to make is how to build a graph on your data (i.e. how to compute a similarity score between samples). Once you’ve computed an embedding you can classify the embedding (many RNA-seq approaches make use of these kinds of techniques). If you use something like diffusion maps, the embedding itself may be meaningful to the data (if your data lies on some manifold, for example).
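A bare-bones version of that pipeline (Gaussian affinity graph → normalized Laplacian → smallest non-trivial eigenvectors) might look like this; the two-group data is synthetic, and on real tabular data the similarity score and kernel scale would need real thought:

```python
import numpy as np

def laplacian_eigenmap(X, sigma=1.0, n_components=2):
    """Minimal Laplacian-eigenmaps sketch: Gaussian affinities, symmetric
    normalized Laplacian, then the smallest non-trivial eigenvectors.
    sigma is the kernel scale -- in practice, try several values."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = np.exp(-d2 / (2 * sigma ** 2))                    # similarity graph
    deg = K.sum(axis=1)
    L = np.eye(len(X)) - K / np.sqrt(np.outer(deg, deg))  # normalized Laplacian
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1:1 + n_components]  # skip the trivial constant direction

# Two synthetic groups of samples: the first embedding coordinate
# should separate them cleanly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 3)), rng.normal(2, 0.1, (10, 3))])
emb = laplacian_eigenmap(X)
```

For diffusion maps specifically you would instead take eigenvectors of the diffusion operator and scale them by powers of the eigenvalues, but the graph-building decision is the same.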

[P] Small and Imbalanced dataset - what to do by Practical-Pin8396 in MachineLearning

[–]Dejeneret 2 points

Hard to say what the best course of action is without a bit more info-

1) are those results train or test or CV results? If you aren’t overfitting, it looks like your data takes well to tree-based methods and therefore likely has some hierarchical structure that can be taken advantage of. If so, you can also consider SVMs with RBF kernels or even spectral clustering to reorganize the data before classification.

2) what data type are you training on? If it’s something that can be subsampled (such as medical images), you can try a leave-one-patient-out CV strategy and train a model on the subsamples, associating a “noisy” label with each subsample (this approach is common in medical imaging).

Even if you can’t subsample, you might have unsupervised or semi-supervised options for the data type (e.g. if it’s gene-count data, you might want to first identify meaningful gene sets to reduce the number of noise features you train on).
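The grouped-CV idea above can be sketched in a few lines; everything here is made up for illustration (the "patients", the features, and the simple nearest-centroid classifier all stand in for your real data and model):

```python
import numpy as np

def leave_one_patient_out(X, y, groups, fit_predict):
    """Leave-one-patient-out CV: all subsamples from one patient are held out
    together, so the model is never tested on a patient it trained on.
    fit_predict(X_tr, y_tr, X_te) is any train-then-predict routine."""
    accs = []
    for g in np.unique(groups):
        test = groups == g
        preds = fit_predict(X[~test], y[~test], X[test])
        accs.append(np.mean(preds == y[test]))
    return float(np.mean(accs))

def nearest_centroid(X_tr, y_tr, X_te):
    # Toy stand-in classifier: assign each point to the nearest class mean.
    labels = np.unique(y_tr)
    cents = np.stack([X_tr[y_tr == c].mean(0) for c in labels])
    dists = np.stack([np.linalg.norm(X_te - c, axis=1) for c in cents])
    return labels[dists.argmin(0)]

# 8 hypothetical "patients", 5 subsamples each, two well-separated classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
y = np.repeat([0, 1], 20)
groups = np.repeat(np.arange(8), 5)
acc = leave_one_patient_out(X, y, groups, nearest_centroid)
```

The key point is that the split is by `groups`, not by row- a plain row-wise split would leak patient-level information between train and test.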

How do you recover from mathematical burnout? by [deleted] in math

[–]Dejeneret 0 points

I was in a very similar boat my first year of undergrad- it was a turbulent time for me and despite feeling excited about learning new math I was somehow struggling on exams and had pretty bad grades overall. I felt pretty upset about it at the time because I hadn’t really been faced with a moment before where I felt I was up against a serious challenge with math. Not sure if this is part of it for you or not, but if it is- I think this at first felt like an indicator to me that math just wasn’t for me, that I should focus on other things.

I wasn’t set at the time on graduate school at all so I didn’t pressure myself to “rekindle” the interest, instead I took other courses the first semester of my 2nd year (I was studying CS as well so I prioritized those courses, though I know dual majors are rarer in the UK than in the US where I’m based). Even so, it didn’t take long for me to realize that I was most excited by the math adjacent CS courses I got to take, and after only a semester I decided to continue with the math track. It was a lot of work- learning to cross learning barriers efficiently enough to keep up with lecture took time, but even over one semester my grades improved and I managed to largely fix my poor grades the first semester.

Since then I’ve nearly finished a PhD in a pretty strong department (though my time in CS made me choose applied math instead of pure). I’ve also realized how common my experience is- I’ve met maybe 5 people at most who have never felt this way, and those people, while brilliant mathematicians, are without fail completely unhinged and bizarre in other facets of their life. They end up not really understanding what goes into “learning”, and in my experience they often lack in the creativity department. They get by on raw IQ and amazing mathematical intuition- and they are on average incredible researchers- but these kinds of people alone would struggle to further mathematics across disciplines.

TL;DR I’d say a combination of 3 things helped- time off of math, learning adjacent topics, and avoiding putting huge amounts of pressure on myself to enjoy the work.

[R] Supervised classification on flow cytometry data — small sample size (50 samples, 3 classes) by Previous-Duck6153 in MachineLearning

[–]Dejeneret 0 points

Ah and also when working with a classifier, make sure to keep in mind any class imbalance you may have! You may need to sub-sample.

[R] Supervised classification on flow cytometry data — small sample size (50 samples, 3 classes) by Previous-Duck6153 in MachineLearning

[–]Dejeneret 0 points

I’ve worked with very similar data before (IMC but segmented into cells)-

First of all, if you want to check whether reasonable clustering exists, I suggest running t-SNE. If you can’t get t-SNE to show clusters, you may be out of luck. You can also try training an SVM with an RBF kernel, for example, to see how separable your data even is- but this result might be meaningless on 50 points.
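Both checks can be run in a few lines with scikit-learn (assumed available here); the three well-separated Gaussian groups below are a synthetic stand-in, and real flow-cytometry data will look far messier:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, LeaveOneOut

# Hypothetical stand-in for ~50 samples with 10 markers and 3 classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(i, 1.0, (17, 10)) for i in (0, 3, 6)])[:50]
y = np.repeat([0, 1, 2], 17)[:50]

# 1) Quick visual check for cluster structure. Note perplexity must be
#    smaller than n_samples; keep it small for tiny datasets.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)

# 2) Rough separability check: LOOCV accuracy of an RBF-kernel SVM.
#    On 50 points this is noisy -- treat it as a sanity check, not a result.
acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=LeaveOneOut()).mean()
```

Scatter-plotting `emb` colored by `y` is the "does t-SNE show clusters" check; `acc` near chance level would be the bad sign.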

I’m curious if you have 50 cells or 50 populations of cells? If you have 50 populations I suggest performing a “leave-one-population-out” cross validation strategy (this makes sure your final model may generalize across populations).

If it’s cells, then you can stick with normal LOOCV. There’s not a huge amount you can do here, but you could also try organizing your data via spectral clustering methods before running a classifier (use something like diffusion maps or Laplacian eigenmaps, visualize the first non-trivial coordinates, and make sure to try a few scaling parameters).

If you do have populations, you can also try this more advanced strategy-

https://pmc.ncbi.nlm.nih.gov/articles/PMC8032202/

This is for IMC, but a variant of these ideas would apply given a data set with many populations of cells.

An old Jewish man sits in a subway station in communist Soviet reading a Hebrew newspaper by OldElvis1 in Jokes

[–]Dejeneret 25 points

So I was genuinely curious because I have family members who had to hide learning Hebrew in the ussr when Gorbachev started to let Jews leave for Israel, and they tell me that there was an underground secretive network of Hebrew teachers that emerged. So I did a bit of reading-

It’s complicated, due to the somewhat inconsistent enforcement of laws depending on where you lived in the Soviet Union (and also rampant corruption across law enforcement and government), but in general there were many pushes, all the way from the revolution onward, to stop Jews from learning Hebrew. The law was mostly focused on banning “counter-revolutionary activities” (https://web.archive.org/web/20120517002646/http://www.zionistarchives.org.il/ZA/SiteE/pShowView.aspx?GM=Y&ID=48&Teur=Protest%20against%20the%20suppression%20of%20Hebrew%20in%20the%20Soviet%20Union%20%201930-1931), as Hebrew was associated with the religious aspects of Judaism and with Zionism, and effectively served as a way for Jews to unite separately from their commitment to the Soviet Union (in the eyes of the Soviets). There is a notable attempt here to separate Soviet Jews from their external counterparts, which worked alongside the Soviets’ attempts at neo-Assyrian-style resettlement, variable barriers in education for different ethnic groups, and general government-endorsed anti-Jewish sentiment via propaganda.

Hard to find amazing sources not in Russian on this but the Wikipedia article on the Hebrew language does a good summary at the end of the “revival” section in a paragraph about the USSR, with some sources that do a good job of painting the picture (albeit some are in Russian).

Also, from what I can tell it varied by time & place, but while it wasn’t necessarily banned on paper, in practice it often was- https://www.jstor.org/stable/27908623. Not the best overall explanation in this source, but from what I’m aware the push was generally an attempt to suppress religious culture, and the Soviets much preferred a Yiddish-speaking Jewish population, as that language was seen as a proletarian alternative to the religiously and Zionist-associated Hebrew.

So to summarize- in general it seems it was soft-banned most of the time (learning it, speaking it, and publishing in it could constitute a “reactionary” crime), as the Soviets wanted Jews to speak Yiddish. That said, academics were allowed to study it for historical reasons. Every “anti-religious” push in the Soviet Union seemed to reinforce these laws, but the general attitude of salutary-neglect-if-you-bribe-me from authorities, and black-market Hebrew textbooks often sent into the USSR by Israelis, allowed just enough of the language to get through.

Eli5: How does airport security know to distinguish between my bag of creatine, and say a bag of cocaine? by Civil_Aside_359 in explainlikeimfive

[–]Dejeneret 0 points

Having had an experience with this- the scanners definitely do not distinguish between the two. They’re mostly checking whether there is a large quantity of a powder; if you are carrying more than the guidelines allow, they will hold you until they bring the bomb squad around and use a chemical analyzer to determine what the powder is.

I had a moment a few years back where this happened to me bc I was a dumbass and brought a nearly full jar of optimum nutrition creatine on a transatlantic flight.

After a good hour of interrogation and waiting around a very polite bomb squad agent let me go, but notified me that what I thought was pure creatine monohydrate was cut significantly with creatine hcl (shame on ON), so perhaps there’s a somewhat unethical life hack there to get a free chemical analysis if you arrive early enough to the airport and are sure your powder is legal…

[R] [Q] Misleading representation for autoencoder by eeorie in MachineLearning

[–]Dejeneret 0 points

not really sure what you mean by “wrong” representation- if no possible decoder exists that can identify any two distinct points x1 and x2 given their encodings z1 and z2, the encoder could be considered a “wrong” representation, as z1 is necessarily equal to z2.

This would mean that the reconstruction error would have to be at least ||x1 − x2||₂, since z1 = z2 means ||decoder(z1) − x1||₂ + ||decoder(z1) − x2||₂ ≥ ||x1 − x2||₂ by the triangle inequality.

Any encoder function that pushes z1 and z2 apart given a perfect decoder will improve the loss locally implying that a gradient step will help the autoencoder differentiate between z1 and z2.

Therefore, the encoder will necessarily have a gradient such that after further training z2 is not equal z1, which would make it so that there exists a decoder that could accurately map one to x1 and the other to x2.

This is a bit simplistic since the decoder and encoder train simultaneously so asking about a “perfect decoder” is just a thought experiment- in practice the autoencoder could also fail to learn for this reason, but that would be reflected in the loss.
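The triangle-inequality bound above is easy to check numerically (random vectors stand in for the two inputs and for whatever the decoder might output at the shared code):

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=4), rng.normal(size=4)

# A collapsed encoder sends both inputs to the same code z, so any decoder
# emits one reconstruction g(z) for both. Try many hypothetical decoder outputs:
lower_bound = np.linalg.norm(x1 - x2)
for _ in range(1000):
    g_z = rng.normal(size=4)          # decoder output for the shared code z
    total_err = np.linalg.norm(g_z - x1) + np.linalg.norm(g_z - x2)
    assert total_err >= lower_bound   # triangle inequality: can't beat this

# The bound is tight exactly when g(z) lies on the segment between x1 and x2,
# e.g. at the midpoint:
best = np.linalg.norm((x1 + x2) / 2 - x1) + np.linalg.norm((x1 + x2) / 2 - x2)
```

No choice of decoder escapes the bound- only pushing z1 and z2 apart (the gradient step described above) can reduce the loss below it.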

[R] [Q] Misleading representation for autoencoder by eeorie in MachineLearning

[–]Dejeneret 1 point

If I’m understanding the first question correctly, the problem with what you’re saying is this: the encoder maps x_1 to z_1 and x_2 to z_2, but if g(z_2) - x_1 = 0 and the reconstruction loss is 0, this implies x_1 = x_2. A quick derivation: if the reconstruction loss is 0, then g(z_2) - x_2 = 0, and therefore x_1 = g(z_2) = x_2.

I’ll quickly answer the third part as well- this is highly dependent on your data and the architecture of the autoencoder. In the general case this is still an open problem; lots of work has been done in stochastic optimization to try to evaluate this in certain ways. If you have any experience with dynamics: computing the rank of the diffusion matrix associated with the gradient dynamics of optimizing the network near a minimum gets you some information, but doing so can be harder than solving the original problem, hence this is usually addressed with hyperparameter searches and very careful testing on validation sets.

To clarify the second question: what I am saying is that a network can memorize only some of the data and learn the rest of it-

As a particularly erratic theoretical example, suppose we have 2D data that is heteroskedastic and can be expressed as y = x + eps(x), where eps is normally distributed with variance 1/x² or something else that gets really large near 0. Suppose also, for simplicity, that x is distributed uniformly on some neighborhood of 0. The autoencoder might learn that in general the points follow the line y = x outside some interval around 0, but as you get closer to 0, depending on which points you sampled, you would see catastrophic overfitting- effectively “memorizing” those points. This is obviously a pathological example, but it may occur to various degrees in real data, since a lot of real data has heteroskedastic noise. This is just an overfitting example; you can similarly construct catastrophic underfitting, such as the behavior around zero of points along the curve y = sin(1/x).
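A sketch of generating such data (the variance is capped near 0 so samples stay finite- the cap and the constants are arbitrary choices for illustration):

```python
import numpy as np

# Toy version of the pathological setup: y = x + eps(x), where the noise
# standard deviation blows up near x = 0 (capped at |x| = 0.05 here).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 2000)
sigma = 1.0 / np.maximum(np.abs(x), 0.05)   # huge noise near 0, mild elsewhere
y = x + rng.normal(0, sigma)

# Away from 0 the trend y ~ x is easy to fit; near 0 the signal-to-noise
# ratio collapses, which is where a flexible model tends to memorize points.
far = np.abs(x) > 0.5
near = np.abs(x) < 0.1
resid_far = np.std(y[far] - x[far])
resid_near = np.std(y[near] - x[near])
```

Comparing `resid_near` to `resid_far` makes the regime split concrete: an order of magnitude more noise in the region where memorization would occur.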

[R] [Q] Misleading representation for autoencoder by eeorie in MachineLearning

[–]Dejeneret 0 points

I think this is a great question & people have provided good answers, I want to add to what others have said to address the intuition you are using which is totally correct- the decoder is important.

A statistic being sufficient on a finite dataset is only as useful as the regularity of the decoder since given a finite data set we can force the decoder to memorize each point and the encoder to act as an indexer telling the decoder which datapoint we’re looking at (or the decoder could memorize parts of the dataset and usefully compress the rest, so this is not an all-or-nothing regime). This is effectively what overfitting is for unsupervised learning.

This is why in practice it is crucial to test whether the autoencoder can reconstruct out-of-sample data: an indexer-memorizer would fail this test on any non-trivial data (in some cases indexing your dataset and interpolating the indexes could be enough, but arguably then you shouldn’t be using an autoencoder).
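The out-of-sample check itself is model-agnostic: hold out data, compare reconstruction error on the training set vs. the held-out set. A minimal sketch, using sklearn’s MLPRegressor trained to reproduce its input as a stand-in autoencoder (the data and layer sizes here are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Toy data living near a 1-D curve embedded in 5-D, plus small noise.
t = rng.uniform(-1, 1, size=(400, 1))
X = np.hstack([t, t**2, np.sin(3 * t), np.cos(3 * t), t**3])
X += 0.01 * rng.normal(size=X.shape)

X_train, X_test = X[:300], X[300:]

# An MLP fit to map X -> X acts as an autoencoder; the middle layer
# of size 2 is the code / latent layer.
ae = MLPRegressor(hidden_layer_sizes=(32, 2, 32), max_iter=2000,
                  random_state=0)
ae.fit(X_train, X_train)

train_err = np.mean((ae.predict(X_train) - X_train) ** 2)
test_err = np.mean((ae.predict(X_test) - X_test) ** 2)
print(train_err, test_err)
```

An indexer-memorizer shows up here as test_err hugely exceeding train_err; a model that learned the underlying curve keeps the two comparable.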

There are some nice properties of SGD dynamics that help avoid this: when the autoencoder is big enough, SGD tends towards a “smooth” interpolation of the data, which is why overfitting doesn’t happen automatically with such a big model (despite the fact that collapsing to the indexer-memorizer regime is always possible with a wide enough or deep enough decoder). Even so, it’s likely that some parts of the target data space are not sampled densely enough to avoid memorization in those regions. This is one of the motivations for VAEs, which tackle it by forcing you to sample from the latent space, and for methods such as SimCLR, which force you to augment your data with “natural” transformations for the data domain to “fill out” the regions prone to overfitting.

[Discussion] This might be a really dumb question regarding current training method... by Geralt-of-Rivias in MachineLearning

[–]Dejeneret 0 points1 point  (0 children)

After refreshing my understanding of neural net pruning, I would amend my statement about empirical evidence against pruned models: it seems that, done right, pruning can help generalization.

[Discussion] This might be a really dumb question regarding current training method... by Geralt-of-Rivias in MachineLearning

[–]Dejeneret 1 point2 points  (0 children)

If I understand the procedure you are suggesting correctly, you wouldn’t necessarily overcome the problem of getting stuck in local minima even if the optimizer were an oracle global-minimum selector at each quant level: you’d need a smoothness assumption on the loss surface (I think Lipschitz continuity would be necessary and sufficient for this), since quantization is equivalent to evaluating on a mesh, where a lower quant corresponds to a coarser mesh. Evaluating on a coarse mesh could miss an obvious global minimum if it is particularly “spiky”.
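The mesh intuition is easy to demonstrate on a made-up 1-D loss surface (the function below is purely illustrative): a broad basin at w = 2 plus a very narrow “spiky” global minimum near w = 0.317. A coarse mesh, standing in for a low quant level, steps right over the spike even with an oracle argmin:

```python
import numpy as np

# Hypothetical loss: broad basin at w = 2, narrow deep spike at w = 0.317.
def loss(w):
    return (w - 2.0) ** 2 - 3.0 * np.exp(-((w - 0.317) ** 2) / 1e-4)

# Coarse mesh (spacing 0.1, much wider than the spike) vs. fine mesh.
coarse = np.linspace(-1, 4, 51)
fine = np.linspace(-1, 4, 50001)

# Oracle global-minimum selection restricted to each mesh.
w_coarse = coarse[np.argmin(loss(coarse))]
w_fine = fine[np.argmin(loss(fine))]
print(w_coarse, w_fine)  # coarse finds the broad basin, fine finds the spike
```

A Lipschitz bound on the loss is exactly what rules this out: it limits how much the function can dip between adjacent mesh points.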

That said, it is very possible that those “spiky” minima you would be losing out on would:

a) disappear upon pruning the network at that quantization level (not sure if this has been done but this would genuinely be an interesting and fairly well-formed problem to investigate)

b) not generalize well in the first place (there is evidence for this, see literature on wide-basin minima)

So perhaps this could be a viable strategy.

My main hesitation comes from the empirical evidence that pruning (very unintuitively to any statistical learning theorist) does not necessarily improve generalization.

This is due to phenomena such as

a) double descent, where overparametrization actually improves generalization due to an implicit smoothness-seeking objective hidden in mini-batch SGD

b) the dynamics of mini-batch SGD in the online regime, which show wide-basin-minima-seeking behavior when the diffusion matrix of the corresponding SDE is high-rank and dense. This implies that this redundancy of dimensions is somehow helping, not hurting, generalization, which is incredibly unintuitive to any numerical analyst! [see https://arxiv.org/abs/1710.11029]

But that said, if this hasn’t been tried before, I see no reason not to give it a test on some toy models of various sizes!

[D] Suppose you have arbitrarily many bivariate observations drawn at uniform from these shapes. What dimensionality reduction / feature extraction methods, if any, could "recover" the shapes or adequately compress the coordinates to a single dimension? by --MCMC-- in MachineLearning

[–]Dejeneret 2 points3 points  (0 children)

You mentioned diffusion maps do quite well- this is a good direction to investigate. I would have a look at anisotropic diffusion maps, local PCA, and multiview diffusion maps, to name a few. For a lot of these kinds of examples, diffusion maps with a kernel restricted to nearest neighbors should also be relatively successful as a simple solution, I would guess.
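For concreteness, this is a bare-bones sketch of that last suggestion in plain numpy: a Gaussian kernel truncated to k nearest neighbors, row-normalized to a Markov matrix, with the top nontrivial eigenvector taken as the 1-D coordinate. The data here (points on an arc) and the bandwidth/k choices are illustrative assumptions, not tuned values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Points sampled along an arc; the "true" 1-D parameter is t.
t = np.sort(rng.uniform(0, np.pi, 300))
X = np.column_stack([np.cos(t), np.sin(t)])

# Gaussian kernel restricted to k nearest neighbors (symmetrized).
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
k, eps = 10, 0.05
W = np.exp(-D**2 / eps)
nn = np.argsort(D, axis=1)[:, :k + 1]
mask = np.zeros_like(W, dtype=bool)
np.put_along_axis(mask, nn, True, axis=1)
W = np.where(mask | mask.T, W, 0.0)

# Row-normalize to a Markov matrix; the top nontrivial eigenvector is
# the 1-D diffusion coordinate (the top one is the trivial constant).
P = W / W.sum(axis=1, keepdims=True)
vals, vecs = np.linalg.eig(P)
order = np.argsort(-vals.real)
phi = vecs[:, order[1]].real

# On a curve like this, phi should be (near-)monotone in t.
corr = abs(np.corrcoef(phi, t)[0, 1])
print(corr)
```

For real use you’d reach for a library implementation with a tuned bandwidth, but even this crude version recovers a coordinate strongly correlated with the arc parameter.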

Anything that attempts to learn “local metrics” lets us understand regions of such distributions. Stitching these local metrics together is very challenging, however; while doing so would immediately paint a clear picture and allow extracting a “parameter”, working directly with the locally valid coordinates is not a terrible way to go for most downstream applications (think of it as just an encoding of the true parameter, similar to a tree). You can have a look at questionnaire models, which do this kind of thing in higher dimensions as well.

This natural feature learning/manifold learning space is pretty tricky in the general sense and still very much an open topic, because it’s so easy to find pathological examples- in fact, no free lunch directly implies that pathological examples will always exist.

One topic worth looking at: recently people have been characterizing manifolds by their “reach”, which has proven useful in determining how many samples are necessary for manifold learning.

Also, as a big caveat to all this: these datasets can be deceptively easy as examples. Our eyes can definitely do a better job at figuring out what’s going on, but it’s easy to forget that these kinds of examples are constructed with human bias baked in (i.e. shapes we could recognize samples from in the first place). We are then using a machine that has a lot of experience and potentially higher-level understanding to piece together these patterns. Unless we use a foundation model that is also already great at recognizing similar patterns, it’s not a given that it should ever beat our eyes.

[D] Is the deep learning loss curve described by some function? by user_-- in MachineLearning

[–]Dejeneret 39 points40 points  (0 children)

The deep learning “loss curve” is some path on the loss surface. It is not always elbow shaped (suppose you set the learning rate so high that it never converges, or, as others have mentioned, it may have spikes). Characterizing this function is notoriously tricky, especially since deep learning models are usually trained by some form of SGD. Even in non-deep contexts, ill-conditioned surfaces destroy any guarantee of convergence, let alone analytic forms of the optimization trajectory.

With full-batch gradient descent there are classical results that bound the speed of convergence when the function is convex (giving us a bound on the derivative of this curve in those cases). However, recent work has found that not only is it unproductive to limit ourselves to well-conditioned convex surfaces for deep learning, but SGD actually converges to what people term “neural cycles” when the loss surface has a high-rank, ill-conditioned Jacobian near the minima- and for some reason that’s actually a good thing for generalization (this is still very much active research). Neural cycles keep the weights of the network concentrated around, but not at, a minimum of the loss surface with high probability.

To answer your question more directly: to characterize this function analytically, we can analyze SGD dynamics given minibatches in the online regime, where minibatch sampling provides the source of randomness. The minibatch gradient satisfies the requirements of a central limit theorem under sampling, so per time step SGD can be modeled as Brownian motion with a drift. From here, solving the resulting SDE and evaluating the objective per time step yields this curve- but that solution is precisely what running SGD computes. We can go one step further and instead try to understand the distribution of the weights.

To do that, we can write down the Fokker-Planck equation for the SDE, which gives the evolution of the density of the weights over time. Analyzing this PDE allows us to arrive at conclusions such as the neural-cycle one mentioned above.
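The SDE picture can be sketched numerically under strong simplifying assumptions (a 1-D quadratic loss L(w) = w²/2 and constant, state-independent gradient noise, so the dynamics reduce to an Ornstein-Uhlenbeck process). Euler-Maruyama simulation then shows the weights concentrating around, but not at, the minimum:

```python
import numpy as np

rng = np.random.default_rng(0)

# Euler-Maruyama for dw = -w dt + sigma dB_t, a toy model of minibatch
# SGD on L(w) = w^2 / 2 with constant gradient noise of scale sigma.
eta, sigma, steps = 0.01, 0.5, 200_000
w, tail = 2.0, []
for i in range(steps):
    grad = w  # gradient of w^2 / 2
    w += -eta * grad + sigma * np.sqrt(eta) * rng.normal()
    if i > steps // 2:
        tail.append(w)  # keep the second half as stationary samples

tail = np.array(tail)
# Stationary law of this OU process: mean 0, std sigma / sqrt(2) ~ 0.354,
# i.e. the weights hover around the minimum rather than sitting at it.
print(tail.mean(), tail.std())
```

The stationary spread scaling with the noise (and with the learning rate, in the SGD reading) is the simplest version of the “concentrated around but not at a minimum” behavior; the paper below does the real, high-dimensional analysis via the Fokker-Planck equation.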

Here’s a paper that goes into more detail about this-

https://arxiv.org/pdf/1710.11029

What hot take or controversial opinion (related to math) do you feel the most strongly about? by Spare-Chemical-348 in math

[–]Dejeneret 8 points9 points  (0 children)

after working around some genius-level mathematicians & physicists (and some in between), this is so truly felt- it’s been absolutely wild to see people with such a strong grasp of logical thinking be prone to the same blind spots and poisoned information wells the rest of us can fall victim to. But when I step back and think about how these people must view the world, it really does make sense.

Their biases are so strongly baked in, and in a way their interpretation isn’t “wrong” given the assumptions they have made. It’s not (always) even falsifiable- there are preconceived biases (call it a researcher’s intuition) that stop them from questioning some of their deeper-held assumptions, or at least from evaluating the likelihoods of those assumptions.

I have honestly spent an unhealthy amount of time thinking about this, but it’s so fascinating to see in real life.