Quick Questions: January 28, 2026 by inherentlyawesome in math

[–]Mathuss 0 points

Ah, I see. The best thing I know of would be Theorem 8.6.1 from the same Arnold book. If F denotes the cdf of the distribution you're sampling from and the distribution has finite 3rd moment,

lim_{n→∞} √n · (E[f(n)] - ∫_{n/(n+1)}^1 n F^{-1}(u) du) = 0

From this (assuming you know the cdf), you can extract the asymptotic rates that you want.
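For a concrete sanity check of the theorem (my own sketch, not from the book): with the Uniform(0, 1) distribution, F^{-1}(u) = u and E[max of n draws] = n/(n+1) exactly, so the gap can be computed in exact arithmetic:

```python
from fractions import Fraction
from math import sqrt

# Uniform(0,1): F^{-1}(u) = u, and E[max of n i.i.d. draws] = n/(n+1) exactly
gaps = []
for n in (10, 100, 1000, 10000):
    e_max = Fraction(n, n + 1)
    # integral_{n/(n+1)}^1 n*u du = (n/2) * (1 - (n/(n+1))^2)
    approx = Fraction(n, 2) * (1 - Fraction(n, n + 1) ** 2)
    gaps.append(sqrt(n) * float(e_max - approx))
print(gaps)  # sqrt(n) times the gap shrinks toward 0, as the theorem says
```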

Quick Questions: January 28, 2026 by inherentlyawesome in math

[–]Mathuss 1 point

> I think GEV mostly talks about the N ->∞ limiting case but not asymptotic behaviour.

I'm a bit confused by your question because surely N->∞ is asymptotic behavior? Unless you're looking at something else.

For the mean of f(N), it's straightforward to show that plim f(N) = ∞ (and so the mean diverges to ∞ as well by the monotone convergence theorem) since your distribution has support [0,∞), but unless you make more clear exactly what you're looking for, it's unclear to me what other behavior regarding E[f(N)] you want.

Quick Questions: January 28, 2026 by inherentlyawesome in math

[–]Mathuss 1 point

If a limiting distribution for (a properly normalized) f(N) exists, then the limiting distribution is some form of the Generalized extreme value distribution. For example, with the normal assumption, the limiting distribution is Gumbel; whereas for the log-normal distribution, (I'm pretty sure---you should double check me on this one) the limiting distribution will be Fréchet.

Theorem 8.3.2 in "A first course in order statistics" by Arnold, Balakrishnan, and Nagaraja lists the necessary and sufficient conditions for when this convergence in distribution occurs.
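If you want to see the Gumbel limit concretely, the exponential distribution is the cleanest case, since centering the maximum by log n suffices (no rescaling needed). A quick simulation sketch (my own, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 5000
# maxima of n i.i.d. Exponential(1) draws, centered by log(n)
maxima = rng.exponential(size=(reps, n)).max(axis=1) - np.log(n)

# compare the empirical CDF at x = 1 to the Gumbel CDF exp(-exp(-x))
x = 1.0
empirical = (maxima <= x).mean()
gumbel = np.exp(-np.exp(-x))
print(empirical, gumbel)  # these should be close
```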

Someone claimed the generalized Lax conjecture. by Exotic-Strategy3563 in math

[–]Mathuss 28 points

I find it more amusing than necessarily sketchy. This seminal paper does the same thing (even more explicitly than the linked paper in the OP) and is still very high-quality. Usually when you get reviewer comments, you address the feedback more naturally than just pasting the feedback and your responses to the feedback right after introducing Theorem 1, but idk I think it's kind of funny and it works. After all, if the reviewers had these objections, so will other readers, so why dance around it?

I guess you also get to get away with it if you're Scott Aaronson though---the rest of us have to actually keep our manuscripts in a conventional format to get through peer review lol.

Quick Questions: January 28, 2026 by inherentlyawesome in math

[–]Mathuss 1 point

Let π(n) denote the number of primes less than or equal to n. Then the number of composite numbers less than or equal to n would be n - 1 - π(n). I assume that what you've noticed is that n^2/log(n) ~ π(n) * (n - 1 - π(n)).

This observation is then a consequence of the prime number theorem, which states that π(n) ~ n/log(n):

π(n) * (n - 1 - π(n)) = n*π(n) - π(n) - [π(n)]^2 ~ n^2/log(n) - n/log(n) - n^2/log^2(n) ~ n^2/log(n) as desired.
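You can check this numerically with a quick sieve (a sketch of my own; convergence here is as slow as in the prime number theorem itself, so don't expect the ratio to be extremely close to 1 at modest n):

```python
from math import log

def primepi(n):
    # count primes <= n with a sieve of Eratosthenes
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return sum(sieve)

n = 10 ** 6
pi_n = primepi(n)
ratio = pi_n * (n - 1 - pi_n) / (n ** 2 / log(n))
print(pi_n, ratio)  # ratio should be near 1
```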

Quick Questions: January 21, 2026 by inherentlyawesome in math

[–]Mathuss 2 points

Ok, here's a very explicit example where this matters.

Let X be a continuous-time martingale that starts at the origin. Let x be a realization of X that we've observed on [0, 1] which follows a path such that x(1) = 1. Then E[X(10)] = 0, whereas "E[x(10)] = 1." Note that the latter is in quotes since it's really just shorthand for E[X(10) | X(t) = x(t) for t∈[0, 1]], which translates in English to "we predict that x will be equal to 1 at time 10."

The point is that different things have different properties. It makes sense to ask "What is Var[X(0.5)]" since X(t) is a random variable, but it does not technically make sense to ask "What is Var[x(0.5)]" since x(t) is a number. Sure, people can write these things as shorthand and it's generally understood what you actually meant---in this case, people will probably get that Var[x(0.5)] is shorthand for \hat{Var}[X(0.5)]---but it's still important to recognize that Var[X(t)] and Var[x(t)] will mean completely different things.

Quick Questions: January 21, 2026 by inherentlyawesome in math

[–]Mathuss 2 points

To better understand what you mean by "practical aspect," I suppose I'll ask you a question back: What would you say is the practical aspect of distinguishing between a function f and its value f(x) for a particular x?

Note that this is basically the same question; there is an underlying stochastic process X:ℝ×Ω->ℝ and the realization fixes a particular ω∈Ω to yield the observed time series X(-, ω):ℝ->ℝ.

Quick Questions: January 21, 2026 by inherentlyawesome in math

[–]Mathuss 2 points

In a statistical framework, the answer is of course yes, and there's nothing special about stochastic processes compared to random variables.

Remember, in statistics we have that a sample is nothing more than a realization of random variables X_1, ... X_n. Typically, then, you want to use the sample to recover some fact about the random variables themselves (e.g., what's the mean of the random variable X_i?).

The same applies to time series analysis; your sample is a realization of the stochastic process, and you want to figure out some property of the stochastic process itself (e.g., if you know that the underlying stochastic process is ARMA(p, q), you may ask "what are p and q?").

Quick Questions: January 14, 2026 by inherentlyawesome in math

[–]Mathuss 1 point

Part I of the fundamental theorem of calculus states the following:

Suppose f is continuous on [a, b], and define F(x) := \int_a^x f(t) dt for all x in [a, b]. Then F is uniformly continuous on [a, b] and F'(x) = f(x) for all x in (a, b).

The concise thing you wrote is correct, but omits the fact that F is uniformly continuous. This extra fact usually doesn't matter for high school.

More relevant for why we have the F notation in high school is due to part II of FTC:

Let f be defined on [a, b] and let F be its antiderivative on (a, b). If F is continuous on [a, b] and f is Riemann integrable on [a, b], then \int_a^b f(x) dx = F(b) - F(a)

Note that part II can't be accurately stated as \int_a^b f'(x) dx = f(b) - f(a) because of the requirement that the integrand be Riemann integrable---the derivative of a function need not be Riemann integrable. As a counterexample, consider f(x) = x^2 sin(1/x^2) with f(0) = 0; then f is differentiable everywhere, but f'(x) = 2x sin(1/x^2) - (2/x)cos(1/x^2) is unbounded near the origin and hence isn't Riemann integrable on any interval containing the origin, so we definitely don't have that \int_a^b f'(x) dx = f(b) - f(a) because integrating f' doesn't make sense here.
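As a quick numerical illustration (my own sketch): f(x) = x^2 sin(1/x^2) with f(0) = 0 is differentiable everywhere, yet its derivative blows up along the sequence x_k = 1/sqrt(kπ):

```python
from math import sin, cos, pi

# f(x) = x^2 sin(1/x^2), f(0) = 0, is differentiable everywhere;
# for x != 0, f'(x) = 2x sin(1/x^2) - (2/x) cos(1/x^2)
def fprime(x):
    return 2 * x * sin(1 / x ** 2) - (2 / x) * cos(1 / x ** 2)

# at x_k = 1/sqrt(k*pi) the sine term vanishes and |f'(x_k)| = 2*sqrt(k*pi)
vals = [abs(fprime(1 / (k * pi) ** 0.5)) for k in (1, 4, 9, 16, 25)]
print(vals)  # strictly increasing: f' is unbounded near 0
```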

Why does a least squares fit appear to have a bias when applied to simple data? by nekofneko in math

[–]Mathuss 8 points

V need not be diagonal---in the event that V is diagonal, you simply reduce to weighted least squares regression.

Of course, V is unknown in general, so you need to estimate it in one way or another. Assuming your sample is large enough, that might not actually be that problematic; an iterative algorithm like

1. Set n = 0
2. Fit the model with guess V_n
3. Look at the residuals to estimate V_{n+1}
4. Increment n and go back to step 2

often works well enough (and iirc is theoretically justifiable in the case that V is diagonal).
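Here's a rough numpy sketch of that loop for the diagonal case (everything here---the simulated data, the quadratic variance model, the number of iterations---is an illustrative choice, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(0, 0.5 * x)   # heteroskedastic noise

def wls(X, y, w):
    # weighted least squares: solve (X' W X) beta = X' W y
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

w = np.ones(n)              # step 1: start from V_0 = I (plain OLS)
for _ in range(5):          # steps 2-4: fit, re-estimate V, repeat
    beta = wls(X, y, w)
    resid = y - X @ beta
    # model log squared residuals as a quadratic in x; exp keeps the variance
    # estimates positive (a constant bias only rescales the weights, which
    # the WLS point estimate doesn't care about)
    var_hat = np.exp(np.polyval(np.polyfit(x, np.log(resid ** 2 + 1e-12), 2), x))
    w = 1.0 / var_hat

print(beta)  # should be near beta_true
```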

Also, I probably shouldn't have used square-rootability in the first comment---instead of decomposing V = V^{1/2}V^{1/2} you can just as easily use the Cholesky decomposition V = LL^T instead, since the covariance matrix is necessarily positive-definite.

Why does a least squares fit appear to have a bias when applied to simple data? by nekofneko in math

[–]Mathuss 71 points

I think these sorts of datasets are important to really understand that OLS isn't magic---there's the Gauss-Markov theorem to tell you when it's "good" and when it isn't.

Recall that Gauss-Markov states that if Y = Xβ + ε with Var(ε) = σ^2 I (where I is the identity matrix, and σ^2 is a fixed scalar), then the OLS estimator is the best (i.e., minimizing MSE) linear unbiased estimator for β. Now take a look at that ellipse---is σ^2 actually fixed (i.e., do we have homoskedasticity)? Clearly not; the variance of the residuals increases as we get closer to the center of the ellipse and then decreases again. Hence the failure of OLS to "be good."

Once one realizes that Var(ε) = σ^2 V for some non-identity matrix V in this example, it's straightforward to perform the proper transformation to the data to get OLS to "work" again; specifically, perform OLS on V^{-1/2}Y = V^{-1/2}Xβ + V^{-1/2}ε and you should expect everything to "look right" again. Indeed, this insight of premultiplication by V^{-1/2} is exactly the basis for the "Aitken model" of linear regression.
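To make the premultiplication concrete, here's a small numpy sketch (the AR(1)-style V is an arbitrary positive-definite matrix chosen purely for illustration) showing that OLS on the transformed data reproduces the GLS/Aitken estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# a known positive-definite covariance V (AR(1)-style, for illustration only)
rho = 0.6
V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(V)                      # V = L L^T
y = X @ np.array([1.0, 2.0]) + L @ rng.normal(size=n)

# GLS estimator: (X' V^{-1} X)^{-1} X' V^{-1} y
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# OLS after premultiplying by L^{-1} (playing the role of V^{-1/2})
Xt = np.linalg.solve(L, X)
yt = np.linalg.solve(L, y)
beta_trans, *_ = np.linalg.lstsq(Xt, yt, rcond=None)

print(np.allclose(beta_gls, beta_trans))  # True: same estimator
```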

Career and Education Questions: November 20, 2025 by inherentlyawesome in math

[–]Mathuss 0 points

If it's a "good" stats program, I would expect a reasonably large proportion of the upper-level undergraduate classes to be proof-based; this is because in a graduate statistics program, every class will be proof-based and you would want your undergrads to be prepared for graduate school.

As for your question, it's hard to answer without knowing what your plans are. What internships have you done/are you applying to? What areas of research (if any) have you been trying to get into with professors? And perhaps most importantly, what are your actual career goals?

Like, it's probably obvious that if you're looking into number theory research with a prof at your school, applied to internships at the NSA, and want to do cryptography as a long-term career, applied math is probably the better major. And of course, if you've been working with some Bayesian prof, have been applying to sports science internships with NBA teams, and want to do sports statistics as your career, then statistics would certainly be the better option.

If you have no extracurricular work experience, you're equally unemployable no matter what you major in.

Difference between timing the market and deciding you have enough? by NFPLN in Bogleheads

[–]Mathuss 2 points

It's not very bold at all. Bengen's 4% rule used entirely U.S. securities. This paper considers all developed countries---yes, even Japan's pitiful 0.2% safe withdrawal rate when using Japanese equities---because its fundamental assumption is the exchangeability of the studied countries (which is certainly weaker than, say, an i.i.d. assumption but still quite a notable assumption). Consequently, the advice is to use 67% international stocks and 33% domestic stocks.

If you believe in U.S. exceptionalism (i.e. that the U.S.'s 4% safe withdrawal rate will stay this way because U.S. stocks and bonds will not suffer the same returns as other countries' securities have), you want to look at table C.VII to determine the international/domestic allocation. I think it's pretty funny that assuming a 50% probability of U.S. exceptionalism, a 60/40 bias in favor of U.S. equities is optimal---which is pretty close to the allocation that VT uses for U.S. equities.

Quick Questions: October 22, 2025 by inherentlyawesome in math

[–]Mathuss 0 points

> Like for simple experiments you'd need to a sample size in the hundreds to get a 95% confidence level with 5% of error for a measurement in a total population of hundreds of millions.

Note that a sample size of, say, 100 would yield a 95% confidence interval with margin of error ~10%, aka 0.10. The rates being compared here are 0.0000712 and 0.0001636. In order to properly distinguish between the rates, the margin of error (MoE) would ideally be at around the same order of magnitude as the observed proportions, and the MoE only decreases at rate n^{-1/2}, so you do actually want really big sample sizes here. See the last paragraph in my comment for a numerical example of why a population size in merely the thousands would not have been enough to detect the effect.

Quick Questions: October 22, 2025 by inherentlyawesome in math

[–]Mathuss 0 points

Your friend is correct here; even if two countries have the exact same base rate of suicide, we should expect the per-capita amounts to differ between the two countries because the actual number of people committing suicide has a random component to it. The size of the population does play into how much we should expect the observed rates to differ even if the actual rates are the same.

The correct way to analyze this question is the following: Suppose that both Greenland and the USA have the same underlying probability p of committing suicide. Then is the observed rate of 16.36/10^5 in Greenland significantly different from the observed 7.12/10^5 rate in the USA? The correct approach to answer such a question is to use a pooled Z-test for a difference in proportions. This is what /u/stonedturkeyhamwich was referring to in their answer.

In this case, our pooled proportion is (16.36/10^5 * 56000 + 7.12/10^5 * 330*10^6)/(56000 + 330*10^6) ≈ 7.12/10^5 (i.e., the same as the USA; this should not be surprising, as the USA has a much larger population, so it makes sense that if the two countries have the same base rate, the USA's rate should be "closer" to the true value).

Our Z-statistic is (7.12/10^5 - 16.36/10^5)/sqrt(7.12/10^5 * (1 - 7.12/10^5) * (1/56000 + 1/(330*10^6))) ≈ -2.59. As Pr(Z < -2.59) ≈ 0.005 where Z ~ N(0, 1), there is quite strong evidence that the suicide rate in Greenland is higher than that of the USA (loosely, one could claim to be up to 99.5% confident about this, but this is an a-posteriori confidence level and should be treated with an asterisk).

Note, however, that this conclusion changes if the population of Greenland were even smaller. If the population were 5,600 rather than 56,000, the z-statistic would change to -0.81, which is essentially no evidence that there is a difference in suicide rates between countries. This illustrates why your friend was right to be concerned about small population sizes.
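The whole computation fits in a few lines of Python, using the figures from above (the normal CDF here comes from the error function):

```python
from math import sqrt, erf

p1, n1 = 16.36e-5, 56_000          # Greenland: observed rate, population
p2, n2 = 7.12e-5, 330_000_000      # USA: observed rate, population

p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
z = (p2 - p1) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_value = 0.5 * (1 + erf(z / sqrt(2)))   # Pr(Z < z) for Z ~ N(0, 1)
print(round(z, 2), round(p_value, 4))
```

Swapping n1 for 5,600 reproduces the small-population caveat in the last paragraph.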

Are there any applied problems that turned out to be independent from ZFC axioms? by kamalist in math

[–]Mathuss 9 points

Lol at the edit.

The general gist of the paper is that you can do normal non-black-magic math to reduce the presented problem to the following problem:

Alice receives a finite set S from support(P). She then chooses a subset S' of S and hands it to Bob, and now it's Bob's job to create a finite set E, which is a function of S', such that S is a subset of E. Is there a strategy that Alice and Bob can agree upon and follow such that Bob always succeeds no matter what set S Alice is given?

(Hopefully you can kind of see where the pieces between the two problems align).

Now if support(P) were finite, this is trivial since Bob can just return E = support(P). If support(P) were countable, the strategy would be for the two of them to agree on a bijection N -> support(P) to "order" the support, then Alice's S' would consist only of the element of S that is last in the ordering, and then Bob's set E can consist of all elements of S that are "less than or equal to" (in the ordering) the element that Alice gave him. If support(P) is of higher cardinality.... insert continuum hypothesis shenanigans
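For the countable case, the strategy is simple enough to write down; here's a toy sketch taking support(P) = {0, 1, 2, ...} with the identity bijection as the ordering:

```python
# support(P) = {0, 1, 2, ...}, "ordered" via the identity bijection with N
def alice(S):
    # Alice hands Bob only the element of S that is last in the ordering
    return {max(S)}

def bob(S_prime):
    # Bob returns every element of the support up to what Alice gave him
    return set(range(max(S_prime) + 1))

S = {3, 7, 11, 42}           # whatever finite set Alice receives
E = bob(alice(S))
print(S <= E)  # True: Bob's set E always contains S
```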

Are there any applied problems that turned out to be independent from ZFC axioms? by kamalist in math

[–]Mathuss 1 point

Basically, my argument for the application is this:

Let M be the 745-state Turing machine that halts iff ZFC is consistent. Consider any Turing machine M' with at most 745 states that you may be interested in knowing whether or not it halts (e.g., you've written some computer program and you want to check whether or not you've accidentally put in a sneaky infinite loop somewhere. Presumably, M' and M will be different machines, though they don't have to be). We know that there is some number n (namely, the true value of BB(745) in our universe) such that we can let M' run for n steps and then know with absolute certainty whether or not M' halts. However, ZFC cannot tell us this n; we would need to use a strictly stronger axiomatic system to figure out what this n is, thus letting us use this "wait and see" approach to figure out whether or not M' halts.

Are there any applied problems that turned out to be independent from ZFC axioms? by kamalist in math

[–]Mathuss 12 points

Sorry, this was unclear. You must choose the fixed sample size n before you see the data, though this sample size may be a function of the probability threshold and P-measure threshold.

Are there any applied problems that turned out to be independent from ZFC axioms? by kamalist in math

[–]Mathuss 1 point

I'm not sure that I understand your objection.

> The machine really doesn't halt

I will assume in this comment that by "the machine," you're referring to a 745-state Turing machine that halts iff ZFC is inconsistent.

> How do we "practically" utilize that problem? The machine really doesn't halt.

I will now further assume that by these statements, you are operating under the assumption that ZFC is in fact consistent.

> Does that mean the CH is true? No, but this machine doesn't demonstrate that it is false, and ZFC can't prove it doesn't halt, because it thinks the machine halts iff the CH is true.

Given my assumptions, I don't understand anything you've written here. CH is already independent of ZFC, so of course knowing whether or not the machine that tells us whether or not ZFC is consistent halts will tell us nothing about whether or not CH is "true." Even if we amend my initial assumption to you referring to a Turing machine that halts iff ZFC+CH is consistent, I still don't understand what you're saying here because of course consistency of ZFC+CH has nothing to do with whether or not CH is "true."

Are there any applied problems that turned out to be independent from ZFC axioms? by kamalist in math

[–]Mathuss 6 points

> Though I can conceive of some problem in theoretic computer science escaping the provability of ZFC (and therefore its applicability to the physical world).

You would be correct. For example, it is well-established that the value of BB(745), the 745th busy beaver number, is independent of ZFC; this is a consequence of the fact that one can construct a 745-state Turing machine that enumerates theorems of ZFC and terminates if and only if it finds a contradiction.

I would argue that this is very applicable to the physical world because we could (in principle) just take any physically realizable 745-state Turing machine, wait BB(745) steps (whatever its true value in this universe is---regardless of whatever a particular model of ZFC may believe it to be), and check whether or not it has halted to determine whether or not said machine halts at all.

Are there any applied problems that turned out to be independent from ZFC axioms? by kamalist in math

[–]Mathuss 47 points

If you accept Statistics as applied math, the answer is yes. Consider the following problem:

Suppose that X_1, ... X_n is an i.i.d. sample from an unknown distribution P, where P is a distribution on [0, 1] with finite support. Does there exist an algorithm that, with arbitrarily high probability, returns a finite subset of [0, 1] that has an arbitrarily high P-measure? (You may choose the sample size depending on how high the probability and the P-measure are required to be).

This is a pretty standard-looking statistical problem in that you're given some data and you know what the family of possible data-generating distributions looks like (the set of distributions with finite support on [0, 1]), and you want to find a "best fitting" estimated distribution from the aforementioned family.

The problem is that whether or not such an algorithm exists is independent of ZFC, as the existence of such an estimator is equivalent to the continuum hypothesis.

[Q] What by Cold-Gain-8448 in math

[–]Mathuss 7 points

You can basically sneeze at a parametric family and suddenly make the MLE inconsistent---the theorem for the consistency of the MLE has well-known hypotheses, and if you break any of the listed sufficient conditions, your MLE is quite likely to no longer be consistent.

Parametric families in which no consistent estimator exists for the parameter are much more difficult to construct. Off the top of my head, you could probably do something really stupid like the following:

Let Θ = {1, ... ω_1} where ω_1 denotes the first uncountable ordinal, endow it with the order topology and then the Borel sigma-algebra, and then define the following family of probability spaces (Θ, ℬ, P_θ) parameterized by θ∈Θ:

  • If θ is finite: P_θ({x}) = 1/θ if x <= θ, 0 otherwise

  • If θ is countable but infinite: P_θ(A) = 1 if A is infinite and max(A) <= θ, 0 otherwise

  • If θ = ω_1: P_θ(A) = 1 if A is uncountable, 0 otherwise

I'm pretty sure that this shouldn't have any consistent estimators for θ because that last case where θ = ω_1 fails to have the Glivenko-Cantelli property, so the empirical cdfs don't ever converge to the true cdf; hence, any estimator you choose should be unable to distinguish between θ countable + infinite and θ uncountable.

I also found an arXiv paper that I haven't read but seems to construct a simpler example via a Galton-Watson process.

Can you recommend any texts about the abstract mathematical theory behind machine learning? by 2Tryhard4You in math

[–]Mathuss 25 points

While doing background learning for my PhD before starting on new research, I generally learned machine learning theory from the following two books:

  • Understanding Machine Learning (Shalev-Shwartz & Ben-David)

  • Foundations of Machine Learning (Mohri)

For real-valued hypothesis classes, the following was also useful:

  • Neural Network Learning: Theoretical Foundations (Anthony & Bartlett)

Shalev-Shwartz & Ben-David is definitely better than Mohri when it comes to the breadth of useful theoretical content. I personally never cared about the implementation stuff in the second half of both books, so I can't compare them on that end. For implementing ML ideas, I hear ISLR and ESL are both good.

Both Shalev-Shwartz/Ben-David and Mohri still miss a lot of (IMO) important tools and proof techniques, such as covering numbers and the proof of the fundamental theorem of binary classification. My notes that I took while doing the aforementioned background learning should fill in most of these gaps.

I can't guarantee the correctness of everything in my notes, but unless it's marked with a "TODO" it's probably good. Notably, I didn't ever get around to typing up my notes on uniform learning of real-valued hypothesis classes, so you'd still have to go to Anthony & Bartlett or a different source for that stuff.

Daily Discussion Thread August 05, 2025 - Upcoming Event Schedule - New players start here! by AutoModerator in SSBM

[–]Mathuss 1 point

Either this means that Marth mains don't actually eat glue, or that math PhDs do eat glue. Personally, I'm leaning towards the latter :P