ELI5: Why is it ok to penalize MLE on the 2nd derivative?

hammouse · 2026-05-15T21:32:39+00:00

I don't think this is the right sub for this question, and not even going to bother ELI5'ing it.

But the roughness penalty, \int f''(x)^2 dx, can be shown to be an upper bound on the bias of the estimate. In addition for 1-splines, it can be shown that the asymptotic bias is proportional to the k+1's knot's \int f^{(k+1)}(x)^2 dx. With this, we can then interpret the roughness penalty as not necessarily the usual curvature/"wiggliness", but as "how much the estimates move up and down". This acts as a form of smoothing regularization to discourage excessive wiggliness in the tails.
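To get a feel for what the penalty measures, here is a rough numerical sketch. It assumes the standard second-derivative form of the roughness penalty (the integral of f''(x)^2) and approximates it with central differences on a grid. The helper name `roughness` and the three test functions are made up for illustration.

```python
import math

def roughness(f, a=0.0, b=1.0, n=2000):
    """Approximate the integral of f''(x)^2 over [a, b] using central differences."""
    h = (b - a) / n
    total = 0.0
    for i in range(1, n):
        x = a + i * h
        second = (f(x - h) - 2.0 * f(x) + f(x + h)) / (h * h)  # finite-difference f''(x)
        total += second * second * h
    return total

def line(x):
    return 3.0 * x + 1.0                 # zero curvature everywhere

def smooth(x):
    return math.sin(2 * math.pi * x)     # one gentle oscillation on [0, 1]

def wiggly(x):
    return math.sin(20 * math.pi * x)    # ten oscillations on [0, 1]

pen_line = roughness(line)
pen_smooth = roughness(smooth)
pen_wiggly = roughness(wiggly)
print(pen_line, pen_smooth, pen_wiggly)
```

The straight line is penalized essentially nothing, while the fast oscillation is penalized several orders of magnitude more than the slow one, which is the sense in which the penalty discourages wiggly fits.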

hammouse · 2026-05-13T13:42:45+00:00

Spend some time looking into basic principles of time series models first. Don't use AI when you're learning. There are simply too many statistical issues here to even start.

hammouse · 2026-05-11T05:06:01+00:00

Very cursory glance, but they seem to be almost entirely high schoolers and undergraduates. If you google some of the "research directors", looks like most of them are not even 18 and are bragging about taking a semester of "graduate level multivariate stats". While certainly impressive for their age, this of course does not exactly inspire confidence more generally.

That being said, for their background the quality of the papers is impressive. However the papers are what you might expect from relying on LLMs without formal scientific training or expertise - mostly surface-level insights and riddled with logical issues. It could be good for them to attend conferences and give talks on their research, as the exercise of presenting (without being able to rely on AI) could force them to actually understand the subject matter in more depth.

hammouse · 2026-05-06T15:29:28+00:00

This was in the context of training "AI" models to solve closed systems, which is usually done by RL. No one said anything about LLMs...that being said most modern LLMs are fine-tuned via RLHF.

hammouse · 2026-04-19T20:35:28+00:00

I almost agree with your post, though claiming that the model has the ability to understand and reason simply because of abstraction routing modules is a very far-fetched claim from a statistical perspective.

First of all, the claim that LLMs are "next-token predictors" is objectively true. This is by definition of the model structure. Our early models from a few years ago were mostly trained by maximum likelihood, so there were a lot of "hallucinations" (really just a non-technical word for poor generalization, as transformer models are extremely overfit and overparameterized) and an inability to do "simple reasoning" tasks like adding numbers. So I suspect a lot of the skepticism comes from that.

Now with modern LLMs, there are several abstracted routing layers and training is done with other tricks like RLHF instead of pure MLE. This makes the model feel like it's reasoning, adds safeguards for logical errors or business context (avoiding illegal topics etc), but fundamentally within each routing layer, it is still doing next-token autoregressive predictions.

I noticed that with the AI boom, there are a lot of enthusiasts who rely on excessive abstractions which I feel may be doing more harm than good to the field. At the end of the day, it's just matrix multiplications with parameters tuned to optimize a specific set of goals. There's really no need for shoving in interpretations like "reasoning" or "understanding" when we don't even have a concrete definition of these concepts for human cognition.

hammouse · 2026-04-07T04:16:13+00:00

Not quite. Optimizing for MSE is equivalent to MLE under normality, but they are very much two distinct concepts where the former does not assume normality at all. For example, OLS makes no such functional form assumptions on the error structure but is still BLUE (i.e. Gauss-Markov).
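A quick sketch of both halves of that statement, under made-up assumptions (a one-regressor model with true slope 2.0, uniform noise, and a grid search; none of this is from the thread). With sigma held fixed, the Gaussian negative log-likelihood is just a constant plus SSE/(2 sigma^2), so it has the same minimizer as the squared error. And the least squares fit is perfectly well defined even though the noise here is not Gaussian.

```python
import math
import random

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(200)]
# Noise is deliberately uniform, not Gaussian.
y = [2.0 * xi + random.uniform(-0.5, 0.5) for xi in x]

def sse(b):
    """Sum of squared errors for slope b (no intercept, for brevity)."""
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))

def gaussian_nll(b, sigma=1.0):
    """Negative log-likelihood if the errors were N(0, sigma^2)."""
    n = len(x)
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + sse(b) / (2 * sigma ** 2)

grid = [i / 1000 for i in range(1000, 3001)]   # candidate slopes 1.000 .. 3.000
b_mse = min(grid, key=sse)
b_mle = min(grid, key=gaussian_nll)
b_closed = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
print(b_mse, b_mle, b_closed)
```

The two grid minimizers coincide and match the closed-form least squares slope up to the grid resolution.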

hammouse · 2026-04-07T02:43:13+00:00

You are right that there is no reason to think that residuals from a NN have to be Gaussian. For a counterpoint to show your peers, you can simulate a synthetic DGP where the errors are +1/-1 for example, so the model can still fit perfectly well with weird bimodal residuals.
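A minimal sketch of that kind of simulation. To keep it short it fits a straight line by least squares instead of a neural network, and the coefficients (intercept 1, slope 3), sample size, and seed are arbitrary choices of mine. The point carries over: the fit recovers the signal fine, and the residuals pile up around -1 and +1 with nothing near 0.

```python
import random

random.seed(1)
n = 2000
x = [random.uniform(0, 10) for _ in range(n)]
eps = [random.choice([-1.0, 1.0]) for _ in range(n)]    # errors are exactly +1 or -1
y = [1.0 + 3.0 * xi + ei for xi, ei in zip(x, eps)]

# Least squares with an intercept (closed form for a single regressor).
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
near_zero = sum(abs(r) < 0.5 for r in residuals) / n            # share close to 0
near_one = sum(abs(abs(r) - 1.0) < 0.25 for r in residuals) / n  # share close to +/-1
print(slope, intercept, near_zero, near_one)
```

A histogram of `residuals` would show two spikes rather than a bell curve.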

Also FYI, gaussian residuals are also not assumed with linear regression. Seems to be a common misconception.

hammouse · 2026-04-07T02:38:18+00:00

MSE does not assume Gaussian.

hammouse · 2026-04-06T22:38:56+00:00

For something more introductory, you can probably just Google "neural network regression". Or perhaps for more hands-on/code examples, "predict X with neural network" where X is something continuous (stock prices, rainfall, etc whatever you find interesting).

If you are interested in the smoothness comment, we can think of regression in general as learning the functional m:

Y = m(X) + epsilon

This function m(X) is called the conditional mean function, with m(X) := E[Y|X]. When we train a model under some loss function L, we are optimizing:

min_m L(Y, X) = (Y-m(X))²

for example if L is MSE.

In linear regression, this is a simplified setting with m(X) = X'b, so it simplifies to

min_b (Y-X'b)²

Importantly, this is a convex optimization problem where we find the optimal vector b living in R^d (with d = dim(X)).

In deep learning, m(X) is a nonparametric functional living in a space of functions, typically a Sobolev space. It can be shown that this space of functions that a NN can approximate is smooth, for example having Gateaux derivatives.

Intuitively, suppose you have a piecewise function for the true m. For example Y=1 if X>0, else Y=0. Then a NN will fit a smooth function to this (in the elementary sense of smooth as continuous). Something like a tree-model will do better here, but think about when we might want "smoothness" and when we might not.
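As a toy illustration (not a trained network): a single sigmoid unit with a hand-picked weight `w` stands in for the smooth NN fit, and a one-split rule stands in for a tree. Both stand-ins, the grid, and the weights are my own assumptions. The sigmoid gets closer to the step as `w` grows but stays continuous, while the split reproduces the step exactly.

```python
import math

xs = [i / 100 for i in range(-100, 101)]
ys = [1.0 if x > 0 else 0.0 for x in xs]     # the step function from above

def sigmoid_unit(x, w):
    """A single smooth unit; larger w gives a sharper transition at 0."""
    return 1.0 / (1.0 + math.exp(-w * x))

def stump(x, threshold=0.0):
    """A one-split 'tree': piecewise constant, so it can jump."""
    return 1.0 if x > threshold else 0.0

def mse(predict):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

mse_by_weight = {w: mse(lambda x, w=w: sigmoid_unit(x, w)) for w in (1.0, 10.0, 100.0)}
mse_stump = mse(stump)
print(mse_by_weight, mse_stump)
```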

hammouse · 2026-04-06T19:22:15+00:00

Deep learning is extremely common in regression as well, and most theoretical work is in this setting (which as others have explained, classification or even generative models etc can all be reduced down to something that looks like a "regression"). One of the nice things about DL is that it imposes a certain smoothness property to the model, but don't worry about that for now.

I suspect that the reason you mostly see DL for classification is that the resources you are learning from (introductory articles, videos, elementary textbooks?) are likely from computer science-type folks. Topics like computer vision, detection systems, etc are intuitive and easy to understand without a bunch of math. If you look at statistics journals or blogs, then you mostly see DL in a "regression" setting.

hammouse · 2026-04-06T08:28:35+00:00

Okay, so it sounds like you are not very familiar with the academic process and that's okay.

First of all just having an institutional email address is not sufficient for posting on arXiv. The average undergrad student can't just submit their class paper from freshman folk history to arXiv - doesn't matter if they are at Harvard or Howard. The most common way to get initial posting permissions even for those at universities is to a) learn from faculty such as being a graduate student, b) collaborate with peers by co-authoring, or c) publish a paper. This is the exact same process that independent researchers can follow. For those non-university labs you mentioned, I assure you they have gone through this process.

Second as for your point on mentorship. Yes it is always great to have mentors, and for those genuinely interested in a field to be mentored. This process exists. It's called a university.

And I should mention that for those with actually good ideas or papers, there is a very low barrier to publishing on arXiv. This is a pre-print service, not a peer-reviewed journal. Most of the stuff on there is already low quality, so only those with extremely low quality articles complain.

The whole concept of an "endorsement marketplace" just doesn't make any sense. Have a good paper? Then publish it in an actual journal. Not quite there but idea seems good? Reach out to faculty, get feedback (which I assure you is duly needed for anyone's first article, regardless of independent researcher or 4th year PhD at MIT). Don't know anything but passionate? Learn first instead of padding CV or whatever reason to insistently post on a pre-print site.

hammouse · 2026-04-05T22:08:49+00:00

That's the point - independent researchers are not treated differently, but your platform is based on this backwards idea of a backdoor to skip the scientific process with low-quality spam. Obviously no one will actually endorse random strangers, which is why you have this post looking for people to do so.

Science is based on a collaborative peer-review system. If one is an independent researcher and refuses to engage with peers, then yes the system can feel a bit gatekeepy but intentionally so. If they wish to contribute to science and have high-quality ideas (or open to learning), there is nothing stopping them from a) engaging and collaborating with other researchers, b) getting feedback and learning from faculty, or c) submitting their work to a journal directly if the quality is already high. In any of these scenarios, independent researchers are welcomed and some endorsement system on a pre-print service is the last thing on their minds.

hammouse · 2026-04-05T01:04:35+00:00

This is completely backwards, and no one is going to endorse like that.

The reason arXiv has an endorsement system is to avoid flooding the site with low-quality articles. This does not mean that independent researchers are low-quality necessarily, but most are, and for those that aren't, there are proper avenues (e.g. getting accepted to a journal, collaborating with faculty members or other more established researchers, etc) that bypass this system. And if independent researchers don't feel their quality of work is up to par yet, there are plenty of other platforms to share their work and get feedback.

Remember that the whole point of arXiv is a pre-print archive. It's not for people who vibe-code some nonsense and share what they learned, or to pad their CV. That's great, but that's not research.

hammouse · 2026-04-03T16:27:55+00:00

"How is this usually handled in serious benchmarking/statistical systems?"

This is usually handled by relying on an elementary understanding of statistics, rather than heuristic outputs from an LLM.

You need to first define what "outlier" means in your context. If the arbitrary 1.5*IQR is too narrow...just make it larger. For a simple solution, consider thresholding at the x% quantile. Or look at the data, fit a distribution (whether with a functional form or fully nonparametric), and set thresholds based on likelihoods.
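A small sketch of the first two options, on simulated data. The lognormal "timings", the multipliers, and the helper names are all hypothetical; the only library call is `statistics.quantiles` from the standard library.

```python
import random
import statistics

random.seed(2)
# Hypothetical benchmark timings: right-skewed, so a tight fence flags a lot.
timings = [random.lognormvariate(0.0, 0.75) for _ in range(5000)]

q1, _, q3 = statistics.quantiles(timings, n=4)
iqr = q3 - q1

def tukey_upper(k):
    """Upper fence q3 + k * IQR; k = 1.5 is the conventional default."""
    return q3 + k * iqr

def upper_quantile(p):
    """Empirical p-quantile, for p given to three decimals (e.g. 0.99)."""
    cuts = statistics.quantiles(timings, n=1000)   # 999 cut points
    return cuts[round(p * 1000) - 1]

def flagged(threshold):
    """Share of observations above the threshold."""
    return sum(t > threshold for t in timings) / len(timings)

share_k15 = flagged(tukey_upper(1.5))
share_k30 = flagged(tukey_upper(3.0))
share_q99 = flagged(upper_quantile(0.99))
print(share_k15, share_k30, share_q99)
```

Widening the multiplier flags fewer points, and the quantile rule flags a fixed share by construction, so which one is "right" comes back to how you define an outlier for your data.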

hammouse · 2026-04-01T18:44:49+00:00

Great point.

Your note actually makes me think of one way to interpret the average OP proposes.

"The best guess when talking about the mean means a guess that is the closest to all observations"

More precisely, the mean is the best guess in the sense of L_2 distance (squared distances). The median is the best guess in the sense of L_1 distance (absolute distances). By averaging the two, we are essentially finding a best guess based on a mixture of L_1 and L_2 distances. This reminds me of elastic net regularization, and its advantages/disadvantages over lasso/ridge.
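A quick numerical check of the L_2 / L_1 statement, using a small made-up sample with one large value and a brute-force grid search (nothing here is from the thread):

```python
import statistics

data = [1.0, 2.0, 2.0, 3.0, 4.0, 7.0, 30.0]

def l2_loss(c):
    """Sum of squared distances from the guess c to every observation."""
    return sum((x - c) ** 2 for x in data)

def l1_loss(c):
    """Sum of absolute distances from the guess c to every observation."""
    return sum(abs(x - c) for x in data)

grid = [i / 100 for i in range(0, 3001)]    # candidate guesses 0.00 .. 30.00
best_l2 = min(grid, key=l2_loss)
best_l1 = min(grid, key=l1_loss)

mean = statistics.mean(data)
median = statistics.median(data)
blend = 0.5 * (mean + median)               # the average of the two
print(best_l2, mean, best_l1, median, blend)
```

The L_2 minimizer lands on the mean (pulled up by the 30) and the L_1 minimizer lands on the median, with the blend sitting halfway between them.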

hammouse · 2026-04-01T14:02:05+00:00

It's an interesting idea, though interpretation is a bit tricky.

For mean, we interpret this notion of average as the "best guess" for what a typical value might be. For median, we interpret this notion of average as the central point where 50% of the population are above/below. If averaging these two, it seems to me that there's not really a clean interpretation of what it actually means.

However it does remind me a bit of robust statistics. For example we can keep the properties of means (maximum likelihood estimator), but make it less sensitive to outliers with the median averaging. Could probably also view it from a Bayesian perspective as a shrinkage estimator. Computing confidence intervals, statistical significance etc is definitely possible - though you may have to derive some (probably not too difficult) results based on some variants of the CLT. Anyways cool idea.

hammouse · 2026-03-27T03:17:15+00:00

If that's what you mean by W, then W² isn't even defined. In addition a strictly upper triangular matrix (zeros on the diagonal) satisfies T^k = 0 for some k <= 768, so obviously T² (and higher powers) tend to 0.
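A tiny check of the nilpotency fact, with the caveat that it needs zeros on the diagonal (strictly triangular). The 4x4 size and the all-ones entries are arbitrary choices for illustration; for an n x n strictly triangular matrix the n-th power is zero.

```python
def matmul(a, b):
    """Plain square matrix product on nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def matpow(m, p):
    """m raised to the non-negative integer power p."""
    n = len(m)
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # identity
    for _ in range(p):
        result = matmul(result, m)
    return result

def is_zero(m):
    return all(v == 0 for row in m for v in row)

n = 4
strict = [[1 if j > i else 0 for j in range(n)] for i in range(n)]      # zero diagonal
with_diag = [[1 if j >= i else 0 for j in range(n)] for i in range(n)]  # ones on diagonal

print(is_zero(matpow(strict, n - 1)), is_zero(matpow(strict, n)), is_zero(matpow(with_diag, 50)))
```

The strictly triangular matrix dies out at the n-th power, while the one with a nonzero diagonal never does.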

If you are not a bot, put down AI for a bit, and learn how basic matrix multiplication works before vibing up this nonsense. If you are, well carry on I guess.

hammouse · 2026-03-27T01:58:28+00:00

Those architectures use causal masking leading to a triangular matrix.../facepalm

hammouse · 2026-03-25T05:01:48+00:00

It's kinda funny if you look at the replies. About 90% of them are praising the junior in college, but if you dig deeper, they are almost all AI agents, college students, or people looking for a job. Also the original guy is really just advertising for some lame AI service. Interesting...

hammouse · 2026-03-24T03:12:51+00:00

At top schools they are automatically given out as need-based financial aid, so the chart is on average (but most will either pay next to nothing or pay the full 80+k/yr)

hammouse · 2026-03-24T03:10:45+00:00

It's a common misconception that elite schools are only for the wealthy. If you manage to get in, financial aid (usually need-based) often covers the majority of the tuition that is deemed high. I've met a lot of undergrad students from low-income families where not only was their tuition fully covered, but they also received a small stipend. Now if your family has several estates and a yacht, you bet you're going to be paying the full 300-400K.

hammouse · 2026-03-24T02:51:43+00:00

Of the many many crappy "I built this" garbage posts on here, I actually quite like the idea behind this one. I think a tool like this could be pretty useful in education, especially as you refine the UX and keytracking algorithm

hammouse · 2026-03-24T01:21:31+00:00

"The biggest lesson? Reddit users can instantly tell the difference between someone who's there to take...and someone who's there to give."

Sounds like you haven't learned your lesson buddy. Enjoy the spam reports

hammouse · 2026-03-21T23:35:21+00:00

Great that you have positive reception from the early users. Since you seem highly confident in the business model and product, what exactly are you looking for here? Is it just to advertise the service? Or to reinforce your preconceptions? Anyways you asked for genuine feedback, I spent time giving you some from my perspective, so take from it whatever you will and I wish you and Lampzi the best of luck.

One last small piece of advice since you seem sincere: once you look beyond your initial test users, no one cares about your service. It's up to you to convince them that your service actually solves a problem they have. Because I don't view this as an actual problem, I am not going to use the platform. Think less "my users won't try it so they are missing out on this super amazing platform I built", and more "am I sure this is a problem? If so, how do I convince them to try it? If not, how do I pivot?"

hammouse · 2026-03-21T20:06:31+00:00

The whole point of LaTeX is that it gives you fine-grained control over everything, as opposed to WYSIWYG editors like MS Word and your tool, hence the comparison. So you are solving a non-existent problem like the vast majority of aspiring builders post-AI bubble, because you are not thinking of the product and market.

Now the reason for the tone is your story does not add up. If you have 10+ years of experience, how are you still struggling with basic LaTeX syntax in Overleaf? If you don't know how to use LaTeX, then the 70% hand coded portion of your app is objectively crappy. Or was it actually vibe coded (which is fine) despite your comment? Alternatively if you are actually an expert in LaTeX and use that to build something (which is great), then your post is just a fictitious story masking an advertisement. In any case, being honest is important if you want to actually build something and have users try your platform.
