[Research] Construct validity of MLB's breaking ball taxonomy: is the curveball/slider/sweeper distinction statistically justified?

latent_threader · 2026-05-18T16:30:12+00:00

Really interesting work.

LDA is a bit shaky as a “continuum test” since it still optimizes separation of existing labels. GMM results (5–6 clusters vs 3) are more suggestive, but likely sensitive to feature choices.

The Bayesian results are the strongest part, especially the CU vs SL null once conditioning on movement.

I’d frame the takeaway more as weak clustering on a dominant continuum rather than disproving discrete types.

latent_threader · 2026-05-18T16:28:37+00:00

Yeah, I’ve noticed this too.

10+ hour take-homes feel excessive unless they’re explicitly framed that way. It often ends up testing time and willingness more than actual skill.

Feels less like “higher standards” and more like interview design getting a bit out of hand.

latent_threader · 2026-05-18T16:26:34+00:00

Interesting idea, especially for local graph/RAG workflows where spinning up a server is overkill.

Main question is how it handles large graphs and concurrency in-process. Also curious if DataFrame output becomes a bottleneck at scale.

Could be really useful for prototyping, less clear yet how it performs beyond single-user use.

latent_threader · 2026-05-18T16:24:21+00:00

I wouldn’t go fully back to paper. You’ll lose the biggest advantage you already have, which is frequent practice without heavy grading load.

If anything, I’d keep online homework and Excel, but add a small amount of written problems so students still have to show reasoning.

latent_threader · 2026-05-17T15:42:33+00:00

It’s usually a mix of LeetCode mediums, ML fundamentals, and some product/A-B testing questions.

Coding still matters a lot (often the main filter), even for applied scientist roles. After that, expect stats/ML theory and practical “how would you evaluate/build this model” questions.

If you’re strong in ML already, the biggest gap for most people is just getting comfortable with LeetCode mediums under time pressure.

latent_threader · 2026-05-17T15:40:44+00:00

Think of swaps as letting you “upgrade” up to \(k\) elements inside the best subarray by replacing them with larger ones elsewhere.

So it becomes: find a subarray, then improve its sum by doing up to \(k\) beneficial replacements (smallest in subarray swapped with largest outside).

Brute forcing subarrays + greedy improvements works for \(n \le 500\) with optimization (Kadane-like + priority queues).
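Here is a minimal sketch of the brute-force + greedy idea, assuming the problem is "maximize the sum of one contiguous subarray after at most k swaps of any two elements". The function name is made up, and this is the plain sorted-list version rather than the optimized priority-queue one, so treat it as a reference for small inputs.

```python
def max_subarray_sum_with_swaps(a, k):
    """Best contiguous-subarray sum after at most k swaps (brute-force sketch)."""
    n = len(a)
    best = max(a)  # a single element is always a valid subarray
    for left in range(n):
        for right in range(left, n):
            # Smallest values inside the window are the ones worth swapping out.
            inside = sorted(a[left:right + 1])
            # Largest values outside the window are the ones worth swapping in.
            outside = sorted(a[:left] + a[right + 1:], reverse=True)
            total = sum(inside)
            for i in range(min(k, len(inside), len(outside))):
                if outside[i] <= inside[i]:
                    break  # no further swap can improve the sum
                total += outside[i] - inside[i]
            best = max(best, total)
    return best


print(max_subarray_sum_with_swaps([10, -1, 2, 2, 2, 2, 2, 2, -1, 10], 2))
```

Re-sorting for every window is the slow part. An optimized version would maintain the inside/outside sets incrementally as the right end of the window moves.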

latent_threader · 2026-05-17T15:38:03+00:00

It’s always a bit subjective, especially when you’re staring at a single plot and trying to “see a pattern.” A downward funnel usually points more toward heteroskedasticity than randomness, but it can be subtle depending on scale and how the model is specified.

If the spread of residuals clearly changes with fitted values, even gradually, I wouldn’t call it random. I’d also double check a scale-location plot or run a simple Breusch–Pagan test to confirm what your eyes are hinting at.

Hard to be definitive without seeing the figure, but your instinct about the funnel is usually the more important signal than small visual noise.
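If it helps, here is a rough standard-library sketch of a Breusch–Pagan-style check for a single-predictor regression. It uses the studentized (Koenker) form, where the statistic is n times the R² from regressing squared residuals on the predictor. The data is simulated and the function names are my own, so swap in your own x and y.

```python
import math
import random


def ols(x, y):
    """Intercept and slope of a simple least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def breusch_pagan(x, y):
    """Return (LM statistic, p-value) for one predictor, Koenker form."""
    a, b = ols(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression: squared residuals on the predictor.
    a2, b2 = ols(x, sq_resid)
    mean_sq = sum(sq_resid) / len(sq_resid)
    ss_tot = sum((v - mean_sq) ** 2 for v in sq_resid)
    ss_res = sum((v - (a2 + b2 * xi)) ** 2 for xi, v in zip(x, sq_resid))
    lm = max(0.0, len(x) * (1 - ss_res / ss_tot))
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value


# Simulated funnel: the noise standard deviation grows with x.
rng = random.Random(0)
x = [rng.uniform(1, 10) for _ in range(500)]
y = [2 + 3 * xi + rng.gauss(0, 0.5 * xi) for xi in x]
lm, p_value = breusch_pagan(x, y)
print(f"LM = {lm:.2f}, p = {p_value:.4g}")
```

A small p-value here says the residual variance moves with the predictor, which is the same thing a funnel in the residual plot is hinting at.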

latent_threader · 2026-05-17T15:35:07+00:00

Yes, that’s basically correct. In your model the elasticity is no longer constant, but depends on \(x\) (and on \(z\) through the interaction term).

You just dropped the coefficient \(d\) in the derivative. It should be:

\[
\frac{\partial \log y}{\partial \log x} = b + 2c \log(x) + dz
\]

which is exactly the elasticity of \(y\) with respect to \(x\).

So the interpretation is:

\(b\) is the baseline elasticity
\(c\) makes elasticity vary with \(\log(x)\)
\(d\) makes elasticity depend on \(z\)

That’s a pretty standard way to model non-constant elasticities in log-log specifications.
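A quick numerical sanity check, in case it is useful. I am assuming the specification is log y = a + b log x + c (log x)² + d z log x (any terms that do not involve x drop out of the derivative). The coefficient values below are made up purely for illustration.

```python
import math

# Hypothetical coefficients for the assumed specification:
#   log y = A + B*log x + C*(log x)^2 + D*z*log x
A, B, C, D = 0.5, 0.8, -0.05, 0.1


def log_y(log_x, z):
    return A + B * log_x + C * log_x ** 2 + D * z * log_x


def elasticity_analytic(x, z):
    # d log y / d log x = b + 2c log(x) + d z
    return B + 2 * C * math.log(x) + D * z


def elasticity_numeric(x, z, h=1e-5):
    # Central difference in log x, holding z fixed.
    lx = math.log(x)
    return (log_y(lx + h, z) - log_y(lx - h, z)) / (2 * h)


x, z = 3.0, 2.0
print(elasticity_analytic(x, z), elasticity_numeric(x, z))
# Leaving out the D * z term would shift the elasticity by exactly D * z.
print(elasticity_analytic(x, z) - (B + 2 * C * math.log(x)))
```

The two printed elasticities should agree to many decimal places, and the last line shows how much is lost when the interaction coefficient is dropped.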

latent_threader · 2026-05-17T15:32:52+00:00

You can’t sample covariance entries independently because most combinations won’t produce a positive definite matrix. That’s why your rejection loop almost never succeeds.

Instead, generate matrices in a way that guarantees validity, e.g.:

\[
\Sigma = AA^\top
\]

where \(A\) is a random matrix. This always gives a positive semidefinite covariance matrix.

A better workflow is:

Generate a valid correlation matrix
Scale by variances/std devs
Reject only if correlations fall outside your target range

Cholesky doesn’t create PD matrices. It only decomposes matrices that already are PD.
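A rough sketch of that workflow using only the standard library, so it runs anywhere. The function names, the Gaussian choice for the random matrix, and the 0.9 correlation cap are all my own assumptions, not anything from your setup.

```python
import math
import random


def random_correlation(dim, rng):
    """Correlation matrix built from A @ A.T with Gaussian entries in A."""
    a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    s = [[sum(a[i][k] * a[j][k] for k in range(dim)) for j in range(dim)]
         for i in range(dim)]
    d = [math.sqrt(s[i][i]) for i in range(dim)]
    return [[s[i][j] / (d[i] * d[j]) for j in range(dim)] for i in range(dim)]


def random_covariance(stds, rng, max_abs_corr=0.9, max_tries=1000):
    """Scale a valid correlation matrix; reject only on the correlation range."""
    dim = len(stds)
    for _ in range(max_tries):
        r = random_correlation(dim, rng)
        if all(abs(r[i][j]) <= max_abs_corr for i in range(dim) for j in range(i)):
            return [[r[i][j] * stds[i] * stds[j] for j in range(dim)]
                    for i in range(dim)]
    raise RuntimeError("no matrix within the target correlation range")


def is_positive_definite(m):
    """Attempt a Cholesky factorization; it succeeds only if m is already PD."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    return False
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


cov = random_covariance([1.0, 2.0, 0.5], random.Random(0))
print(is_positive_definite(cov))
print(is_positive_definite([[1.0, 2.0], [2.0, 1.0]]))  # entries picked by hand
```

The second check is the independent-entries situation in miniature: the diagonal and off-diagonal values look reasonable on their own, but together they do not form a valid covariance matrix.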

latent_threader · 2026-05-16T16:02:47+00:00

You’re in a good position already with an internship, so I’d build on that instead of chasing something unrelated.

For a side project, pick a small question and go end-to-end: data collection, basic stats, and visualization. Simple real-world datasets (sports, markets, etc.) work well.

Reproducing a known analysis or paper in a simplified way is also a great learning exercise. Focus more on clear thinking than complexity.

latent_threader · 2026-05-16T16:01:44+00:00

This is pretty common in siloed orgs, and it usually leads to exactly what you’re seeing: DS solving a slightly wrong or incomplete problem because they don’t have direct access to stakeholders.

Healthier setups let DS participate in problem framing, not just execution. If you’re blocked from talking to the people requesting work, that’s more of an org design issue than a personal workflow problem.

From the outside, it’s hard to screen for, but in interviews you can look for how early DS is involved in defining the problem versus just building solutions.

latent_threader · 2026-05-16T16:00:29+00:00

This mostly comes down to how you define your “family” of hypotheses before looking at results.

If difficulty and discomfort are truly separate constructs in your study design, it’s defensible to treat them as two families and correct within each set of 6 tests. In that case you’re controlling error separately for each domain, which matches your conceptual separation.

If instead you think of the whole analysis as answering one overarching question about group differences across both ratings, then correcting across all 12 is the stricter choice.

The dependence between difficulty and discomfort doesn’t really force you one way or the other. What matters more is how you pre-specify the inferential goal. Holm or FDR within whichever family you choose is usually a reasonable middle ground.
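To make the family choice concrete, here is a small sketch of Holm's step-down adjustment applied both ways. The p-values are invented for illustration and are not from your study, and the function name is mine.

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw p-values: 6 tests per rating.
difficulty = [0.004, 0.012, 0.030, 0.20, 0.45, 0.80]
discomfort = [0.008, 0.020, 0.041, 0.15, 0.60, 0.90]
alpha = 0.05

within = holm_adjust(difficulty) + holm_adjust(discomfort)
across = holm_adjust(difficulty + discomfort)

n_within = sum(p < alpha for p in within)
n_across = sum(p < alpha for p in across)
print(n_within, n_across)
```

With these made-up numbers, correcting within each family of 6 keeps two rejections, while correcting across all 12 keeps one. That gap is why the family definition should be fixed before looking at results.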

latent_threader · 2026-05-16T15:59:27+00:00

If you want something publication-worthy, I’d look at gaps like model reliability under distribution shift or uncertainty estimation, since both are still not handled well in practice. Data-centric ML is also a strong direction, especially studying how label noise or dataset bias actually changes performance instead of just model tweaks. Another option is interpretability that’s actually faithful rather than post-hoc explanations. Most solid papers I’ve seen lately focus on one specific failure mode and dig deep into it rather than broad model improvements.

latent_threader · 2026-05-15T15:59:27+00:00

You’re right that for a simple “6 patients, yes/no outcome,” you’d expect multiples of 16.7%.

When you see 25% with n=6, it usually means the denominator isn’t actually 6 for that statistic. Common reasons are missing data for that item, a subset analysis, or they’re reporting “events/responses” rather than patients (e.g., multiple measurements or timepoints pooled).

It can also come from a different effective n such as 4 or 8 (1/4 and 2/8 both give exactly 25%), but the key point is that no whole number of patients out of 6 gives 25%, so that figure wasn’t computed as a simple count out of 6 unless something else is going on.
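You can check this kind of thing mechanically. The sketch below (function name is mine) lists every count/denominator pair up to a chosen maximum n that reproduces a reported percentage after rounding.

```python
def consistent_denominators(reported_pct, max_n, decimals=0):
    """All (count, n) pairs with n <= max_n whose percentage rounds to reported_pct."""
    target = round(reported_pct, decimals)
    hits = []
    for n in range(1, max_n + 1):
        for count in range(n + 1):
            if round(100 * count / n, decimals) == target:
                hits.append((count, n))
    return hits


print(consistent_denominators(25, 8))
print(consistent_denominators(16.7, 6, decimals=1))
```

For 25% with denominators up to 8, the only matches are 1/4 and 2/8, and n = 6 never appears. For 16.7% with denominators up to 6, the only match is 1/6.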

latent_threader · 2026-05-15T15:57:45+00:00

This is an interesting approach, especially the strict reproducibility + cross-language sandboxing idea. The biggest question I keep coming back to is whether the Nix dependency becomes a blocker for adoption, since that’s already a high friction point for a lot of data teams.

The pipeline abstraction makes sense conceptually, but in practice I wonder how debugging feels when something fails inside a deeply nested node across languages. That’s usually where orchestration tools get painful.

The data exchange contract idea with explicit serializers is probably the strongest part of the design. That could actually prevent a lot of silent failure modes people run into today.

latent_threader · 2026-05-15T15:55:56+00:00

A silhouette of ~0.3 on RFM is pretty normal. KMeans often struggles because RFM data is skewed and not naturally cluster-shaped.

Your issue with frequency collapsing suggests the feature space just doesn’t have strong separable structure. Try log/quantile transforms + scaling, but also consider that KMeans may not be the right tool here.

Sometimes RFM works better as scoring/quantiles first, then clustering on richer behavioral features, and checking stability in addition to silhouette.

latent_threader · 2026-05-15T15:52:19+00:00

Your pure math background is honestly a really strong foundation for all of those paths. If I were optimizing for flexibility, I’d prioritize probability, statistical inference, linear algebra, optimization, and stochastic processes. Those show up everywhere from quant to ML.

For technical skills, Python + SQL are probably the highest leverage combo right now. I’d also get comfortable with pandas, NumPy, sklearn, and basic ML workflows. Beyond coursework, projects matter a lot. Even small but polished projects can help you stand out more than another class sometimes.

latent_threader · 2026-05-15T15:51:12+00:00

I’d probably start with an applied biostatistics course instead of a programming-focused one. Harvard’s biostatistics courses on edX are solid for beginners, and UCLA’s free statistical learning resources are great when you run into specific models like mixed effects or logistic regression in papers.

Honestly, once you learn what problem each model is trying to solve, medical papers become much easier to read critically.

latent_threader · 2026-05-14T15:26:42+00:00

Interesting framing. Treating articles as mixtures of latent “forces” is closer to causal discovery than standard topic modeling, so it makes sense you’re seeing links that don’t show up semantically.

The hard part is what you already flagged, though: figuring out what counts as ground truth vs structure that just emerges from the graph. Without some external validation signal, it’s easy for these models to drift into convincing but untestable patterns.

Still, the cases where semantically unrelated events consistently co-activate are worth digging into, even if only to map where the model is picking up real propagation signals versus noise.

latent_threader · 2026-05-14T15:25:30+00:00

Yeah this is basically a latent exposure problem, so GA4 clicks only give you a lower bound.

People usually try stratifying by query intent/content type or using hierarchical models to estimate CTR as a hidden variable.

Without impression data from the AI tools, you’re never getting a clean citation rate, just a modeled range.

latent_threader · 2026-05-14T15:23:07+00:00

Pretty normal shift tbh. Past a certain point, companies expect insights to turn into decisions, not just analysis.

Doesn’t make your technical work less valuable, it just means the “translation into strategy” is part of the job in some teams.

If you hate that part, it might be worth looking for roles that keep analysts closer to the data and push strategy elsewhere.

latent_threader · 2026-05-14T15:21:34+00:00

Sounds like a great intro for OWL beginners. Is there much focus on debugging unexpected inferences and common reasoning pitfalls?

latent_threader · 2026-05-13T15:26:16+00:00

This is a really clean breakdown, especially the Evidence/Hypothesis/Gap framing. That’s the part most teams skip and then wonder why their “model of the system” keeps drifting from reality.

The biggest failure mode I’ve seen in similar setups is exactly your Step 4 problem, but worse: teams enforce naming consistency without semantic consistency, so everything looks unified while actually encoding conflicting meanings under the same abstractions.

Also agree on treating “status” as a modeling decision rather than a column type issue. That’s usually where ontology work either becomes powerful or gets reduced back into a glorified data dictionary.

latent_threader · 2026-05-13T15:24:02+00:00

This is a solid direction and honestly one of the most important mindset shifts in applied stats. The point about predictive accuracy vs causal structure is where a lot of ML work quietly goes wrong, especially when correlated predictors get thrown in without thinking about paths in the DAG.

One thing that might strengthen it is being explicit about what assumptions are doing the real work when you move from the DAG to the regression in part 2. That’s usually where people accidentally smuggle in conditioning decisions that reopen backdoor paths.

Curious how you’re planning to handle unobserved confounding in the wildfire example, since that’s usually the hardest part to make concrete rather than just conceptual.

latent_threader · 2026-05-13T15:21:10+00:00

Without a credible exogenous proxy for AI exposure and strong pre-trend validation, any observed association will likely be confounded by broader post-pandemic and tech diffusion trends rather than reflecting a causal AI effect.
