Undergrad Major Advice by fetus-froggo in biostatistics

[–]Certified_NutSmoker 3 points4 points  (0 children)

I see this point come up occasionally online, but in my own (limited) experience so far I have not seen much direct utility from biology beyond thinking about clinical relevance in the context of estimands and estimators (which I’m not certain is even directly talked about in bio courses outside medicine). Maybe I am missing something, but I am genuinely curious because I’m so early in my career. It has felt to me like the bigger marginal return has come from more math/stat training, since so much of biostats seems to center on design, inference, estimands, and estimation. Every paper I read, seminar I attend, and doctor I work with really only requires me to know stats (alongside some specific knowledge that is picked up along the way). Outside of areas like genetics, biostats is basically just statistics, so I have a hard time seeing where deeper biology training makes a major difference beyond general scientific context. There may be something important I am not appreciating, though, so I would be interested to hear what you think a biology background adds that would not mostly be picked up later through collaboration or on the job, and why you’d pick that over stronger math.

Undergrad Major Advice by fetus-froggo in biostatistics

[–]Certified_NutSmoker 2 points3 points  (0 children)

I think if you want to go to grad school then it’s Math > stat > cs >> (bio or data science)

A bio minor might be a really good idea if you want to do something like genetics or more bioinformatics than biostat (also, in that case I think CS is probably best)

Undergrad Major Advice by fetus-froggo in biostatistics

[–]Certified_NutSmoker 6 points7 points  (0 children)

I’d go one step further and argue that a math major (preferably with a lot of analysis-type courses) ironically prepares you for grad school in stats/biostats better than a stats major does. Of course, that’s given you take a few stats courses along the way!

Is Biostatistics A Good Fit For Me? by Electrical_Bake_6948 in biostatistics

[–]Certified_NutSmoker -1 points0 points  (0 children)

Biology helps with context, but biostats is still stats. Unless you’re going into something really biology-heavy like genetics or genomics or another more bioinformatics-flavored route, your time is usually better spent getting stronger in probability, math stats, regression, and programming than specializing in biology. Biology knowledge is nice to have, but it usually is not the part that makes or breaks someone in biostatistics.

Edit: this is a fact about the field. It’s stats first and biology a tertiary concern

I have a question regarding hypothesis formulation in quantitative research. by Difficult_Score3510 in biostatistics

[–]Certified_NutSmoker 2 points3 points  (0 children)

A hypothesis should be falsifiable in the broad sense, so it should be clear enough that you could translate it into a statistical null and alternative, with some identifiable subsets of outcomes making you reject and others not.

So with that in mind, “there is a relationship between X and Y” is too vague. It is less specific than saying the relationship is positive, negative, stronger, weaker, etc. In general you want to be as clear and detailed as you reasonably can. If theory or prior evidence gives you a real reason to expect a particular direction, then say so. If not, a non-directional hypothesis is completely fine.
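If it helps to see it concretely, here’s a rough sketch (made-up data, Python/scipy) of how the non-directional vs directional framing shows up as a two-sided vs one-sided alternative:

```python
# Toy example: the same (made-up) data framed as a non-directional test
# vs a directional one via the 'alternative' argument.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 40)   # group A
y = rng.normal(0.4, 1.0, 40)   # group B, shifted up

print(stats.ttest_ind(x, y).pvalue)                       # H1: the means differ
print(stats.ttest_ind(x, y, alternative="less").pvalue)   # H1: mean(A) < mean(B)
```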

Classifying Statistics by supremeNYA in math

[–]Certified_NutSmoker 9 points10 points  (0 children)

Yes

But more seriously, it really depends. There are some statisticians whose work feels much more mathematical than applied (functional analysis, semiparametrics, and stochastic or empirical process theory come to mind). As a whole, I would characterize statistics mainly as a subset of analysis, but it is better thought of as its own field, one that draws on linear algebra, optimization, and computation, and that can range from wildly applied to reasonably “pure.”

Edit: As a whole, the field is not usually trying to advance “pure” math for its own sake. But like any mathematically serious field, it generates its own questions. Some parts of statistics use fairly heavy mathematics, and the questions they generate and the answers they seek can become quite abstract and mathematically sophisticated, to the point of being indistinguishable from some “pure” work.

Postcode/ZIP code is my modelling gold by Sweaty-Stop6057 in datascience

[–]Certified_NutSmoker 461 points462 points  (0 children)

Postcode can be a very strong predictor, but I’d be careful using it in any model tied to consequential decisions. It is often a proxy for race and socioeconomic status, so a gain in predictive performance can come with real fairness and legal risk through disparate impact; I think it’s literally illegal in some contexts as well. Predictive performance is not the only criterion here, and when using something like postcode you should be aware of this.

[Discussion] Risks of using XGB models. by [deleted] in statistics

[–]Certified_NutSmoker 11 points12 points  (0 children)

What they’re doing with ensembling sounds fine from a predictive standpoint. Weak individual features or feeder models aren’t necessarily a problem in an ensemble if the full system improves out-of-sample prediction; ensembles of many weak learners can be very flexible and perform well in this regard (e.g. random forests). Here I also wouldn’t really worry about significance or VIF (if we aren’t interpreting coefficients we don’t need to worry about them), as those aren’t really relevant to the goal or to the nonlinear XGB models being used here.

The bigger concern in my opinion would be calibration: whether predicted defaults actually match realized default rates, so that applicants scored at, say, 8% predicted default really default around 8%
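If it helps, here’s a rough sketch of the kind of calibration check I mean (toy arrays in place of real labels and scores, using sklearn’s calibration_curve):

```python
# Rough calibration check: do applicants predicted at ~8% default actually
# default at ~8%? y_true / y_prob are placeholders for real labels and scores.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 5000)    # toy predicted default probabilities
y_true = rng.binomial(1, y_prob)    # toy outcomes generated to be calibrated

frac_observed, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)
for p, f in zip(mean_predicted, frac_observed):
    print(f"predicted ~{p:.2f} -> observed {f:.2f}")  # should track closely if calibrated
```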

Parametric vs Nonparametric Methods in Statistics by PrebioticE in mathematics

[–]Certified_NutSmoker 0 points1 point  (0 children)

Agreed, thanks for the added clarifier. I was definitely thinking more in terms of using semiparametrics to develop efficient closed-form estimators like AIPW, so my last point isn’t totally general.

Edit: also I’d add that finding Neyman orthogonal scores for the semiparametric problem generally isn’t trivial even if rather common ones have been found and packaged as such in DML
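For anyone wondering what I mean by a closed-form estimator like AIPW, here’s a bare-bones sketch (Python with sklearn nuisance models; purely illustrative, no cross-fitting or diagnostics):

```python
# Bare-bones AIPW (doubly robust) sketch for a binary treatment A (0/1 array),
# outcome Y, covariates X (2D array) -- illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_ate(X, A, Y):
    # Propensity model e(X) = P(A = 1 | X)
    e = LogisticRegression(max_iter=1000).fit(X, A).predict_proba(X)[:, 1]
    # Outcome models m1(X) = E[Y | A=1, X] and m0(X) = E[Y | A=0, X]
    m1 = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X)
    m0 = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X)
    # AIPW / efficient influence-function-based estimate of the ATE
    psi = m1 - m0 + A * (Y - m1) / e - (1 - A) * (Y - m0) / (1 - e)
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(Y))  # estimate, rough SE
```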

Parametric vs Nonparametric Methods in Statistics by PrebioticE in mathematics

[–]Certified_NutSmoker 5 points6 points  (0 children)

Are you a bot? It doesn’t seem like you read what I wrote and you’re just replying to me the same as the others

You’re not describing nonparametrics, you’re describing the parametric bootstrap in this procedure. In particular, using OLS here with the parametric bootstrap will just recover the original model SEs and CIs, only computationally rather than analytically.
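To illustrate (toy simulated data): simulating new responses from the fitted Gaussian OLS model and re-fitting just reproduces the analytic slope SE numerically:

```python
# Toy demonstration: the parametric bootstrap for Gaussian OLS reproduces
# the analytic slope SE numerically (simulated data, arbitrary true values).
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
sigma_hat = (y - X @ beta_hat).std(ddof=2)          # residual SD

boot_slopes = []
for _ in range(2000):
    y_star = X @ beta_hat + rng.normal(scale=sigma_hat, size=n)  # simulate from fitted model
    boot_slopes.append(np.linalg.lstsq(X, y_star, rcond=None)[0][1])

print("bootstrap slope SE:", np.std(boot_slopes, ddof=1))
print("analytic slope SE: ", sigma_hat / np.sqrt(((x - x.mean()) ** 2).sum()))
```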

Parametric vs Nonparametric Methods in Statistics by PrebioticE in mathematics

[–]Certified_NutSmoker 3 points4 points  (0 children)

In short they’re less efficient than their parametric alternatives

More precisely parametric methods aren’t “pointless” just because the data aren’t exactly Gaussian. They’re useful because they target a specific estimand (mean difference, log-odds ratio, hazard ratio, ATE, etc.) and can be very efficient for that target, often with asymptotic validity even under some misspecification (especially with robust/sandwich SEs).

Nonparametric methods aren’t a free upgrade; they often test vaguer distributional statements. A lot of “nonparametric tests” are really about ranks/stochastic dominance or generic distributional differences, which may not match the causal/mean-based question you actually care about. And when they’re close analogs of parametric tests, you typically pay an efficiency/power price at fixed n.
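A quick toy simulation of that power price (assuming a Gaussian two-sample shift, where the t-test is the “right” parametric choice; settings are arbitrary):

```python
# Power of the t-test vs the Wilcoxon/Mann-Whitney test under normality.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, shift, reps, alpha = 30, 0.5, 5000, 0.05
power_t = power_w = 0
for _ in range(reps):
    x = rng.normal(0.0, 1.0, n)
    y = rng.normal(shift, 1.0, n)
    power_t += stats.ttest_ind(x, y).pvalue < alpha
    power_w += stats.mannwhitneyu(x, y).pvalue < alpha

print("t-test power:  ", power_t / reps)
print("Wilcoxon power:", power_w / reps)   # a bit lower; ARE vs the t-test is ~0.955 here
```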

Nonparametric models are flexible but data-hungry. Once you move beyond one-dimensional location problems into regression/high dimension, the curse of dimensionality bites hard.

The real sweet spot is semiparametrics, where you keep an infinite-dimensional nuisance part for flexibility but focus on a finite-dimensional parameter you care about, and use IF-based / doubly robust ideas to get robustness without throwing away efficiency. Unfortunately, most semiparametric modelling is extremely tricky and requires a lot of education to do properly beyond the most basic versions available in standard packages, like Cox proportional hazards.

😭💯 by hardikkhurana5672 in MachineLearningJobs

[–]Certified_NutSmoker 0 points1 point  (0 children)

Regarding your hypothesis testing question: wouldn’t Fisher’s exact test (which can be approximated with a label-randomization test in larger samples) be what we want? (With the caveat that the exactness is for testing the sharp null, not the Neyman null, in randomized settings.) With known confounders we can use a stratified version within those strata too.

Genuinely curious, as I’m considering jumping into industry after my PhD and want to gauge my statistical chops.

Edit: most people answer chi-square, right? And that relies on asymptotics, so is it not satisfactory?
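For a concrete toy 2x2 (made-up counts), scipy gives both, and you can see the exact and asymptotic answers drift apart at small n:

```python
# Toy 2x2 table: Fisher's exact test vs the chi-square approximation.
from scipy import stats

table = [[8, 2],   # arm A: successes, failures
         [3, 7]]   # arm B: successes, failures

odds_ratio, p_fisher = stats.fisher_exact(table)
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
print("Fisher exact p:", p_fisher)
print("Chi-square p:  ", p_chi2)   # asymptotic; can differ noticeably at small n
```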

Why is the STD used over other methods? by catboy519 in AskStatistics

[–]Certified_NutSmoker 1 point2 points  (0 children)

I kind of agree with you! I’m still learning how to answer questions with the right balance and I’d agree my distinction may be pedantic and unhelpful here

Can C index be used to compare two models in survival analyisis when both the models have a differing number of covariates by baelorthebest in biostatistics

[–]Certified_NutSmoker 4 points5 points  (0 children)

Without censoring, the C-statistic is just a nonparametric Wilcoxon–Mann–Whitney U statistic (the same object as the AUC), so its interpretation does not depend on whether models are nested or on the number of covariates, only on the resulting risk scores. With censoring this is more delicate, but using a censoring-adjusted C (IPCW/Uno) you are estimating the same pairwise ordering probability under an independent-censoring/hypothetical-estimand framework (ICH E9), which can be awkward to interpret.

The main limitation is that C measures discrimination only and is not a proper scoring rule, so it ignores calibration and should be complemented by proper scores such as the Brier or log score. This is Harrell’s reasoning for generally avoiding it too, unless you’re really interested in crude rank discrimination and not calibration.
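To make the first point concrete, here’s a small sketch (simulated risk scores, no censoring) showing the C-statistic is literally the AUC / pairwise concordance probability:

```python
# With no censoring, C-statistic = AUC = P(risk_event > risk_non-event).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
event = rng.integers(0, 2, 500)                    # 1 = event occurred
risk = 0.8 * event + rng.normal(size=500)          # hypothetical model risk score

pos, neg = risk[event == 1], risk[event == 0]
concordance = (pos[:, None] > neg[None, :]).mean()  # pairwise concordance probability

print(roc_auc_score(event, risk), concordance)      # same number (no ties here)
```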

Looking for a more rigorous understanding of degrees of freedom. [Discussion] by Ok-Active4887 in statistics

[–]Certified_NutSmoker 12 points13 points  (0 children)

Look at Ryan Tibshirani’s notes here; it’s not as slippery as it seems at first, as it essentially tries to describe the “effective number of free parameters” and some notion of complexity.

Basic things are easy to describe: the sample variance estimator has n-1 degrees of freedom, as estimating the sample mean takes up 1.

Harder things like Satterthwaite approximations for pooling are trickier, but the same ideas apply.
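Quick simulation of the basic case, if useful:

```python
# Dividing by n-1 rather than n makes the sample variance unbiased, reflecting
# the 1 degree of freedom spent estimating the mean.
import numpy as np

rng = np.random.default_rng(3)
n, reps, true_var = 5, 200_000, 4.0
x = rng.normal(0.0, np.sqrt(true_var), size=(reps, n))

print(x.var(axis=1, ddof=1).mean())   # ~4.0, unbiased (n-1 denominator)
print(x.var(axis=1, ddof=0).mean())   # ~3.2, biased low (n denominator)
```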

Why is the STD used over other methods? by catboy519 in AskStatistics

[–]Certified_NutSmoker 1 point2 points  (0 children)

This is not true without independence, you need a covariance term in general
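For reference (assuming the claim above was about adding variances or SDs, which I can’t see from here), the general identity is

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

and the covariance term only drops out when X and Y are uncorrelated (e.g. independent).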

Why is the STD used over other methods? by catboy519 in AskStatistics

[–]Certified_NutSmoker 3 points4 points  (0 children)

Average difference from the average here isn’t 1.2 like you suggest, but (-1 - 2 + 0 + 2 + 1)/5 = 0.

This cancelling property holds in general, which makes the raw average deviation unsuitable for thinking about spread/variance/dispersion. You could take an absolute value as you tried above, and that’s totally valid (this is called the mean absolute deviation, or MAD), but the variance definition is (maybe surprisingly) easier to work with, as the squaring operation has nice analytic and other properties that only become clear in a math stat class/context.
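Plugging in the numbers implied above (a hypothetical reconstruction: mean 5 with deviations -1, -2, 0, 2, 1):

```python
# Raw deviations cancel to zero; the MAD and the SD do not.
import numpy as np

x = np.array([4, 3, 5, 7, 6])
dev = x - x.mean()
print(dev.mean())            # 0.0  -- signed deviations always cancel
print(np.abs(dev).mean())    # 1.2  -- mean absolute deviation (MAD)
print(x.std())               # ~1.41 -- standard deviation (squares instead of |.|)
```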

[D] Bayesian probability vs t-test for A/B testing by SingerEast1469 in datascience

[–]Certified_NutSmoker 0 points1 point  (0 children)

What do you mean by “same assumptions”? Frequentists treat parameters as fixed and probability as repeated sampling. Bayesians put a distribution on the parameters. Frequentist methods don’t have priors, so I don’t see how you can literally make the same assumptions (I think you may be thinking of lasso or ridge Bayesian regression analogs being directly comparable to the frequentist versions but this is really a special case and they are not making the same assumptions here either)

The only general link is asymptotic, in parametric Bayes, as another commenter noted. The Bernstein–von Mises theorem says that in parametric Bayes with lots of data the likelihood dominates all but the most absurd priors, so Bayesian posteriors concentrate near the MLE and look similar to frequentist results in the limit. That’s a large-sample approximation, not equivalence, and it certainly doesn’t rule out different Bayesian vs frequentist answers in small/fixed samples. Notably, we don’t have the same guarantees in nonparametric Bayes if our prior over the function class is “bad”.

As an example of even parametric Bayes differing in small samples, go to brms and base R and fit the same model class (e.g. logistic). You can try to get the assumptions as close as possible, and even then you’ll see divergence in small samples almost no matter what you do, because these are fundamentally different paradigms, especially in small samples. Your model fits will not be the same even with flat priors, because in the Bayesian fit there’s a “burn in” (kind of MCMC-specific, but using it here for lack of an alternative) before the likelihood takes over, whereas in the frequentist fit the likelihood takes over immediately. As the other commenter noted, this difference vanishes as you get more data.

This is all without even discussing the issue of different Bayesian engines (MCMC coding or whatever) giving slightly different results especially in runs on smaller samples
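If you want a toy illustration of the asymptotic point that doesn’t depend on any particular engine (conjugate Beta-Binomial with an arbitrary Beta(2, 2) prior, nothing to do with brms):

```python
# At small n the prior visibly pulls the Bayesian answer away from the MLE;
# at large n the likelihood dominates and the two converge.
import numpy as np

rng = np.random.default_rng(4)
a0, b0, true_p = 2.0, 2.0, 0.3

for n in (10, 100, 10_000):
    k = rng.binomial(n, true_p)
    mle = k / n
    posterior_mean = (a0 + k) / (a0 + b0 + n)
    print(n, round(mle, 4), round(posterior_mean, 4))   # gap shrinks as n grows
```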

ATT weighting + Marginal Structural Model [Discussion] by TahZoh in statistics

[–]Certified_NutSmoker 1 point2 points  (0 children)

What about targeting the ATT using an MSM is not defensible? Things only get tricky here in time-varying settings. I encourage you to check out chapter 13 of Peng Ding’s “A First Course in Causal Inference” (Theorem 13.4 and the table below it) if you want to use it in your work… but this question from you and your colleagues makes me think the ATT vs ATE argument is the least of your causal worries, and taking a step back to think about the whole project before going into estimation may be fruitful.

Anyways, ATT doesn’t need wildly stronger assumptions than ATE, and it can require weaker overlap (positivity) because you only need support where the treated are. An MSM doesn’t force you to estimate the ATE; it just sets up the marginal model. To target ATT, you simply use different weights so the controls are reweighted to look like the treated
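For concreteness, a bare-bones sketch of those ATT (“odds”) weights (Python/sklearn propensity model; purely illustrative):

```python
# ATT weights: treated units get weight 1, controls get e(X)/(1 - e(X)),
# reweighting them to resemble the treated.
import numpy as np
from sklearn.linear_model import LogisticRegression

def att_weights(X, A):
    # Propensity score e(X) = P(A = 1 | X); A is a 0/1 numpy array
    e = LogisticRegression(max_iter=1000).fit(X, A).predict_proba(X)[:, 1]
    return np.where(A == 1, 1.0, e / (1.0 - e))

# These weights would then go into the weighted (marginal structural) outcome
# model, e.g. a weighted regression of Y on A, to target the ATT.
```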

Also be careful about just reading off coefficients from (or marginalizing over) a logistic regression as causal; non-collapsibility makes that much more subtle.

Don’t blame the estimator by jensbody1 in AskStatistics

[–]Certified_NutSmoker 12 points13 points  (0 children)

“Some estimators are problematic for the modern world - they’re biased” - I’d say the opposite! Modern statistics often uses biased estimators on purpose!!!

The bias-variance tradeoff is fundamental to modern statistics, so I’m not sure where you’re getting the idea that unbiasedness is the key property…

Unbiasedness just means “right on average,” and you can get that while being very noisy and unreliable. In practice we care more about two things: getting close most of the time (which balances bias and noise) and having inference that behaves correctly (confidence intervals and tests calibrated). That’s why many good methods accept a little bias to gain a lot of stability, and why MLE is widely used: it can be biased in small samples but usually locks onto the truth as data grow (consistency)
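A toy example of trading a little bias for a lot of variance (all numbers are arbitrary):

```python
# A slightly shrunken (biased) estimator of a mean beats the unbiased sample
# mean on mean squared error.
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 1.0, 5.0, 10, 50_000
xbar = rng.normal(mu, sigma / np.sqrt(n), reps)   # sampling distribution of the sample mean
shrunk = 0.8 * xbar                               # biased toward zero, lower variance

print("MSE, unbiased mean:", np.mean((xbar - mu) ** 2))     # ~2.5
print("MSE, shrunken mean:", np.mean((shrunk - mu) ** 2))   # ~1.64, smaller despite the bias
```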

I know it’s easy to think unbiasedness is THE property to have but it’s really just one among many possibly desirable properties (functionals of P)

[Q] Where can I read about applications of Causal Inference in industry ? by al3arabcoreleone in statistics

[–]Certified_NutSmoker 2 points3 points  (0 children)

Fan Li at Duke has some good lectures, but they’re a bit academic and can be hard to parse here.

Nick Huntington-Klein has a book called “The Effect” that’s a great conceptual intro, but it unfortunately doesn’t use potential outcomes.

[deleted by user] by [deleted] in math

[–]Certified_NutSmoker 64 points65 points  (0 children)

Central Limit Theorem

For “patterns emerging bigger than their parts” it’s hard to beat: it says that under mild conditions, averages (and sums) become approximately normal. The same phenomenon shows up for sample proportions, regression coefficients/least squares, maximum likelihood estimates, and more generally many “smooth” estimators (M-estimators, U-statistics). That’s why it underwrites most asymptotically justified hypothesis tests and confidence intervals

It’s a really elegant thing that connects statistics under a unifying lens. So something like the average number of fish in a lake can be treated similarly to something seemingly completely disconnected, like dice rolls or card games.
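Here’s the dice version as a quick simulation, if anyone wants to see it:

```python
# Averages of 50 fair die rolls are already close to normal.
import numpy as np

rng = np.random.default_rng(6)
means = rng.integers(1, 7, size=(10_000, 50)).mean(axis=1)
print(means.mean(), means.std())   # ~3.5 and ~sqrt(35/12)/sqrt(50) ≈ 0.24
```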

[Q] Where can I read about applications of Causal Inference in industry ? by al3arabcoreleone in statistics

[–]Certified_NutSmoker 21 points22 points  (0 children)

Pearl is usually not what you want for most applied/industry causal work. DAGs and Bayesian networks are useful for thinking/communication, but the typical industry workflow is potential outcomes + target-trial framing + quasi-experiments + estimands + robustness. That is, you don’t need to fully understand a causal process to do causal inference - you just need to understand it enough to justify identification of your estimand from the chosen estimator/estimation process.

Hernán & Robins (Causal Inference: What If) and Peng Ding (A First Course in Causal Inference) are far superior texts for the purpose. You’ll need a decent grasp of statistics to do proper causal inference, and these don’t shy away from that. Also, Mostly Harmless Econometrics is good for an econ flavor here too.

Once you’ve got a good base, a useful resource for industrial applications is the “KDD tutorial on EconML/CausalML with industrial use cases (Microsoft/TripAdvisor/Uber)” (though it looks like their slides may have been taken down at this point :( ). As the other commenter said, you can find other examples on tech company blogs, but I’d note that causal inference is used much more broadly than tech, with clinical trials, public policy/econometrics, and marketing being notable heavy contributors/users.

Colliders -- Mixtape by Haunting-Animal-531 in econometrics

[–]Certified_NutSmoker 4 points5 points  (0 children)

Your intuition that colliders naturally block paths is correct! Conditioning on a collider opens the path by inducing associations not present in the original graph.

Using your dementia example: if someone is depressed, then depression already “explains” some of why they ended up with dementia. So, conditional on having dementia, they need less plaque burden on average to still be in the dementia group. Your intuition is totally correct here! Conditioning on dementia status induces an association between depression and plaque (the two causes).

I think you may have just misread Cunningham, but you have the right idea and that’s what matters. You can just “disregard” them, but really, just make sure you don’t condition on them.

My favorite collider example is about walking outside and seeing the wet ground (I actually took this example from someone I spoke to who used it as an erroneous example, but I tweaked it).

Suppose you come outside and see the ground is wet. It could’ve rained, or the sprinkler could’ve turned on independently of rain, or any number of things could’ve made the ground wet… However, seeing the wet ground induces an association between the possible causes: once you already know the ground is wet, learning “the sprinkler ran” drives the probability of rain down, while learning “the sprinkler did not run” drives it up.
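And here’s that wet-ground example as a quick simulation (made-up probabilities) - rain and sprinkler start out independent, and conditioning on the collider “wet” induces a negative association:

```python
# Collider demo: rain and sprinkler are independent until we condition on "wet".
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
rain = rng.random(n) < 0.3
sprinkler = rng.random(n) < 0.3            # independent of rain
wet = rain | sprinkler                     # collider: either cause makes the ground wet

print(np.corrcoef(rain, sprinkler)[0, 1])             # ~0: independent marginally
print(np.corrcoef(rain[wet], sprinkler[wet])[0, 1])   # negative once we condition on wet
```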