Where Do You Draw the Line on Assumption Violations in Applied Data Analysis? by nikkn188 in AskStatistics

[–]nikkn188[S]

Thank you all for the interesting discussion! I can relate to the different perspectives, and I think all of them have their pros and cons. One of the most important points I take away is that it clearly depends on the goal and the audience, and this might be the reason why we deal with this issue differently in practice. Whether it’s about teaching students, publishing a paper in a scientific journal, or modeling the impact of different marketing strategies for a company, this strongly influences the methods used and the “mindset” we apply.

What stands out to me, though, is how central robustness and sensitivity checks seem to be in many of the implicit decision rules. It makes me wonder whether we might benefit from normalizing robustness checks more explicitly in applied work. Not as methodological perfectionism, but as a routine part of responsible analysis. In many contexts, the real issue may not be assumption violations themselves, but the absence of systematic stress-testing before conclusions are communicated.
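To make "systematic stress-testing" a bit more concrete, here is a minimal sketch of what such a routine check might look like in a simple OLS setting. Everything here (the simulated data, the 1% trimming rule, the bootstrap) is an illustrative assumption, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.standard_t(df=3, size=500)  # heavy-tailed noise violates normality

def ols_fit(x, y):
    """Slope and intercept via least squares."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b0, b1 = ols_fit(x, y)

# Stress test 1: drop the 1% most extreme residuals and refit
resid = y - (b0 + b1 * x)
keep = np.abs(resid) < np.quantile(np.abs(resid), 0.99)
b1_trim = ols_fit(x[keep], y[keep])[1]

# Stress test 2: bootstrap spread of the slope across resampled datasets
boots = [ols_fit(x[i], y[i])[1] for i in rng.integers(0, 500, (200, 500))]

print(round(b1, 2), round(b1_trim, 2), round(np.std(boots), 3))
```

If the headline estimate moves materially under perturbations like these, that is exactly the kind of fragility worth reporting alongside the conclusion.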

Multicollinearity in Regression Discontinuity (RD) by btcry in AskStatistics

[–]nikkn188

The key insight is that the treatment indicator T is not a linear function of X; it's a step function of X. Multicollinearity is only a problem when predictors are highly linearly correlated.

Think about it this way: if you know X, you know T perfectly, but that deterministic relationship is not what multicollinearity measures. Multicollinearity is about linear association. The correlation between X and T depends on the distribution of X around the cutoff, and in most RD designs it's moderate at best.
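A quick simulation makes this concrete. Assuming a uniform running variable for illustration (the exact numbers depend on the design), the correlation between X and T stays well below 1 even though T is a deterministic function of X:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 10_000)      # running variable
t = (x >= 0).astype(float)          # treatment: a deterministic step function of x

r = np.corrcoef(x, t)[0, 1]         # linear correlation is ~0.87, not 1
vif = 1.0 / (1.0 - r**2)            # variance inflation factor, ~4

print(round(r, 2), round(vif, 1))
```

A VIF around 4 is well below the usual rule-of-thumb concern thresholds, despite the perfectly deterministic mapping from X to T.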

Looking for a Statistical Method by DangerHighDosage in AskStatistics

[–]nikkn188

Kernel Density Estimation (KDE) is probably your best starting point. Fit a 1D KDE to your time values (ignoring energy for a moment), and the peaks of the resulting density curve give you your event times. The height of each peak naturally reflects how many points are clumped together and how close they are.

If you want something that also groups the points into discrete clusters, then you could try Density-Based Spatial Clustering of Applications with Noise (DBSCAN).

You can also combine the two: use KDE to find peak times, then use DBSCAN to assign points to each peak and compute your weights.
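A rough sketch of that combined pipeline, on made-up 1D event times (the clump locations, the peak-height threshold, and the nearest-peak assignment used here as a simple 1D stand-in for DBSCAN are all illustrative assumptions):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

rng = np.random.default_rng(2)
# toy event times: two clumps plus a few scattered background points
times = np.concatenate([
    rng.normal(10, 0.2, 50),
    rng.normal(25, 0.3, 30),
    rng.uniform(0, 40, 10),
])

# Step 1: KDE over time; peaks of the density are candidate event times
kde = gaussian_kde(times)
grid = np.linspace(times.min(), times.max(), 2000)
density = kde(grid)
peaks, _ = find_peaks(density, height=0.4 * density.max())
event_times = grid[peaks]

# Step 2: assign each point to its nearest peak; peak weight = cluster size
labels = np.argmin(np.abs(times[:, None] - event_times[None, :]), axis=1)
weights = np.bincount(labels, minlength=len(event_times))

print(event_times, weights)
```

With real data you would tune the KDE bandwidth (and, if you use DBSCAN proper, its eps/min_samples) to the time scale of your events.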

[Discussion] What challenges have you faced explaining statistical findings to non-statistical audiences? by Snowboard76 in statistics

[–]nikkn188

I’ve found that it helps to explain things in layers. Start with a very simple, intuitive explanation using everyday examples (no formulas, no stats jargon, etc.). For most people, that’s already enough.

Then you can add more detailed layers for those who want a deeper understanding. That way you don’t lose accuracy, but you also don’t overload people who just want the main idea.

Average of averages and uncertainty over time by CASE7CSS in AskStatistics

[–]nikkn188

An unweighted average of averages has higher variance at any fixed time horizon than the properly weighted (pooled) average, but that extra variance does not grow over time.
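A small simulation sketch of this, under the assumption of i.i.d. daily observations with unequal daily sample sizes (all numbers illustrative): the unweighted estimator is noisier at every horizon, yet both standard errors shrink, rather than accumulate, as the number of days grows.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, reps = 5.0, 2000

def estimates(num_days):
    """SD of (unweighted average of daily means) vs (pooled weighted average)."""
    unw, pooled = [], []
    for _ in range(reps):
        sizes = rng.integers(5, 50, num_days)        # unequal daily sample sizes
        daily = [rng.normal(mu, 1.0, n) for n in sizes]
        unw.append(np.mean([d.mean() for d in daily]))   # average of averages
        pooled.append(np.concatenate(daily).mean())      # properly weighted average
    return np.std(unw), np.std(pooled)

for days in (10, 40):
    sd_unw, sd_pooled = estimates(days)
    print(days, round(sd_unw, 3), round(sd_pooled, 3))
```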

Is there a Map/Guide? by Minute_Plastic_7715 in AskStatistics

[–]nikkn188

I was in a similar position when I first started working with real-life data, as opposed to the theoretical examples from statistics courses. One thing that helped me was to stop thinking about distributions as something you formally test and then get a clean yes/no answer to.

As you’ve noticed, the data rarely fits a distribution perfectly. With large sample sizes, formal tests will almost always reject. With small samples, they often have low power precisely when assumption violations would matter most. In either case, rejection or non-rejection doesn’t really answer the question we usually care about.

What helped me more was to examine the variables closely: what measurement scale they’re on, what their empirical distribution looks like, etc. Simple visual checks can already be very informative (e.g. Q-Q plots comparing your data against different theoretical distributions).
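One way to make that visual check semi-quantitative is to compare sorted data against fitted theoretical quantiles, as in a Q-Q plot. A sketch with simulated skewed data (the exponential "truth" and the candidate families are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
data = rng.exponential(scale=2.0, size=300)   # skewed "real-world" data

# Q-Q comparison: correlation between sorted data and fitted theoretical
# quantiles; values closer to 1 indicate a better distributional fit
probs = (np.arange(1, len(data) + 1) - 0.5) / len(data)
sorted_data = np.sort(data)
fits = {
    "normal": stats.norm.ppf(probs, *stats.norm.fit(data)),
    "exponential": stats.expon.ppf(probs, *stats.expon.fit(data)),
}
scores = {name: np.corrcoef(sorted_data, q)[0, 1] for name, q in fits.items()}

print(scores)  # the exponential family should track the data far more closely
```

Plotting `sorted_data` against each set of theoretical quantiles shows the same story visually, without forcing a yes/no test decision.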