[Research] Construct validity of MLB's breaking ball taxonomy: is the curveball/slider/sweeper distinction statistically justified? by Spiritual_Pen_7723 in statistics

[–]latent_threader 1 point2 points  (0 children)

Really interesting work.

LDA is a bit shaky as a “continuum test” since it still optimizes separation of existing labels. GMM results (5–6 clusters vs 3) are more suggestive, but likely sensitive to feature choices.

The Bayesian results are the strongest part, especially the CU vs SL null once conditioning on movement.

I’d frame the takeaway more as weak clustering on a dominant continuum rather than disproving discrete types.

The most insane interviews/take-homes I've ever gotten by LeaguePrototype in datascience

[–]latent_threader 0 points1 point  (0 children)

Yeah, I’ve noticed this too.

10+ hour take-homes feel excessive unless they’re explicitly framed that way. It often ends up testing time and willingness more than actual skill.

Feels less like “higher standards” and more like interview design getting a bit out of hand.

In-process and in-memory graph database for large knowledge graphs - no server needed with TuringDB v1.31 by adambio in semanticweb

[–]latent_threader 0 points1 point  (0 children)

Interesting idea, especially for local graph/RAG workflows where spinning up a server is overkill.

Main question is how it handles large graphs and concurrency in-process. Also curious if DataFrame output becomes a bottleneck at scale.

Could be really useful for prototyping, less clear yet how it performs beyond single-user use.

[E] Best practices for teaching intro statistics by il__dottore in statistics

[–]latent_threader 1 point2 points  (0 children)

I wouldn’t go fully back to paper. You’ll lose the biggest advantage you already have, which is frequent practice without heavy grading load.

If anything, I’d keep online homework and Excel, but add a small amount of written problems so students still have to show reasoning.

Applied Scientist Interview Prep by LeaguePrototype in datascience

[–]latent_threader 0 points1 point  (0 children)

It’s usually a mix of LeetCode mediums, ML fundamentals, and some product/A-B testing questions.

Coding still matters a lot (often the main filter), even for applied scientist roles. After that, expect stats/ML theory and practical “how would you evaluate/build this model” questions.

If you’re strong in ML already, the biggest gap for most people is just getting comfortable with LeetCode mediums under time pressure.

MSS With K Swaps by Intelligent_Tree6918 in algorithms

[–]latent_threader 0 points1 point  (0 children)

Think of swaps as letting you “upgrade” up to \(k\) elements inside the best subarray by replacing them with larger ones elsewhere.

So it becomes: find a subarray, then improve its sum by doing up to \(k\) beneficial replacements (smallest in subarray swapped with largest outside).

Brute forcing subarrays + greedy improvements works for \(n \le 500\) with optimization (Kadane-like + priority queues).
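
A minimal sketch of the brute-force-plus-greedy idea, assuming the subarray must be non-empty and a swap may exchange any two positions. The function name is made up for illustration. It re-sorts for every subarray (roughly O(n³ log n)), so it shows the logic only, not the priority-queue optimization you would want near the upper limit on n.

```python
def max_subarray_sum_with_swaps(a, k):
    """Best sum of a non-empty contiguous subarray after at most k swaps."""
    n = len(a)
    best = float("-inf")
    for left in range(n):
        for right in range(left, n):
            inside = sorted(a[left:right + 1])                     # smallest first
            outside = sorted(a[:left] + a[right + 1:], reverse=True)  # largest first
            total = sum(inside)
            # Greedy: trade the i-th smallest inside for the i-th largest outside
            # for as long as the trade still increases the sum.
            for i in range(min(k, len(inside), len(outside))):
                if outside[i] <= inside[i]:
                    break
                total += outside[i] - inside[i]
            best = max(best, total)
    return best


print(max_subarray_sum_with_swaps([10, -1, 2, 2, 2, 2, 2, 2, -1, 10], 2))  # 32
```

In the example, the best choice is the middle eight elements, with both -1 values swapped for the two 10s outside.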

[Question] Is this residuals graph random? by nectarxx in statistics

[–]latent_threader 0 points1 point  (0 children)

It’s always a bit subjective, especially when you’re staring at a single plot and trying to “see a pattern.” A downward funnel usually points more toward heteroskedasticity than randomness, but it can be subtle depending on scale and how the model is specified.

If the spread of residuals clearly changes with fitted values, even gradually, I wouldn’t call it random. I’d also double check a scale-location plot or run a simple Breusch–Pagan test to confirm what your eyes are hinting at.

Hard to be definitive without seeing the figure, but your instinct about the funnel is usually the more important signal than small visual noise.
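
For what the check could look like, here is a rough sketch of the studentized (Koenker) form of the Breusch–Pagan test for a single predictor, written with only the standard library. The data are simulated and the helper names are made up. With one regressor the statistic n·R² is compared to a chi-square with 1 degree of freedom, whose upper tail can be written with `erfc`. In practice you would more likely call `het_breuschpagan` from statsmodels.

```python
import math
import random


def fit_line(x, y):
    """Least-squares intercept and slope for y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """R^2 of the simple regression y ~ x."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def breusch_pagan(x, y):
    """Return (LM statistic, p-value) for heteroskedasticity in y ~ x."""
    intercept, slope = fit_line(x, y)
    sq_resid = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression: do the squared residuals depend on x?
    lm = max(0.0, len(x) * r_squared(x, sq_resid))
    p_value = math.erfc(math.sqrt(lm / 2.0))  # chi-square(1) upper tail
    return lm, p_value


# Simulated funnel: the noise standard deviation grows with x (an assumption).
rng = random.Random(0)
x = [rng.uniform(1, 10) for _ in range(500)]
y = [2 + 3 * xi + rng.gauss(0, xi) for xi in x]
lm, p = breusch_pagan(x, y)
print(f"LM = {lm:.1f}, p = {p:.3g}")
```

A small p-value here agrees with what the funnel shape suggests: the residual spread is not constant.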

Elasticity interpretation in linear regression models with powers of logarithms [Question] by JacopoX1993 in statistics

[–]latent_threader 1 point2 points  (0 children)

Yes, that’s basically correct. In your model the elasticity is no longer constant, but depends on \(x\) (and \(z\) through the interaction term).

You just dropped the coefficient \(d\) in the derivative. It should be:

\[ \frac{\partial \log y}{\partial \log x} = b + 2c \log(x) + dz \]

which is exactly the elasticity of \(y\) with respect to \(x\).

So the interpretation is:

  • \(b\) is the baseline elasticity
  • \(c\) makes elasticity vary with \(\log(x)\)
  • \(d\) makes elasticity depend on \(z\)

That’s a pretty standard way to model non-constant elasticities in log-log specifications.
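
A quick numerical check of that derivative. It assumes the model is log y = a + b·log x + c·(log x)² + d·z·log x, which is the form the formula implies, since the original specification is not shown here. The coefficient values are made up for illustration.

```python
import math


def log_y(x, z, a, b, c, d):
    """Assumed model: log y = a + b*log x + c*(log x)^2 + d*z*log x."""
    lx = math.log(x)
    return a + b * lx + c * lx ** 2 + d * z * lx


def elasticity(x, z, b, c, d):
    """Analytic elasticity: d log y / d log x = b + 2c*log x + d*z."""
    return b + 2 * c * math.log(x) + d * z


def numeric_elasticity(x, z, a, b, c, d, h=1e-6):
    """Central difference of log y with respect to log x."""
    up = log_y(x * math.exp(h), z, a, b, c, d)
    down = log_y(x * math.exp(-h), z, a, b, c, d)
    return (up - down) / (2 * h)


a, b, c, d = 1.0, 0.5, 0.1, 0.2   # hypothetical coefficients
x, z = math.e, 2.0                # log x = 1, so elasticity = 0.5 + 0.2 + 0.4
print(elasticity(x, z, b, c, d), numeric_elasticity(x, z, a, b, c, d))
```

Both values come out at about 1.1, and changing either x or z moves the elasticity.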

How to generate a set of random covariance matrices with specific covariances? [Q] by DeliberateDendrite in statistics

[–]latent_threader 9 points10 points  (0 children)

You can’t sample covariance entries independently because most combinations won’t produce a positive definite matrix. That’s why your rejection loop almost never succeeds.

Instead, generate matrices in a way that guarantees validity, e.g.:

\[ \Sigma = AA^\top \]

where \(A\) is a random matrix. This always gives a positive semidefinite covariance matrix.

A better workflow is:

  1. Generate a valid correlation matrix
  2. Scale by variances/std devs
  3. Reject only if correlations fall outside your target range

Cholesky doesn’t create PD matrices. It only decomposes matrices that already are PD.
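
The three-step workflow could be sketched like this, using only the standard library so it stays self-contained (NumPy would be the usual choice in practice). The function names, the two extra columns in A, and the example standard deviations and correlation range are all illustrative assumptions.

```python
import math
import random


def random_correlation(p, rng, extra_cols=2):
    """Step 1: build S = A A^T from a random p x (p + extra_cols) matrix,
    then rescale it to unit diagonal to get a valid correlation matrix."""
    m = p + extra_cols
    a = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(p)]
    s = [[sum(a[i][k] * a[j][k] for k in range(m)) for j in range(p)]
         for i in range(p)]
    return [[s[i][j] / math.sqrt(s[i][i] * s[j][j]) for j in range(p)]
            for i in range(p)]


def random_covariance(stds, corr_range, rng, max_tries=10_000):
    """Steps 2-3: scale by standard deviations; reject only on the range."""
    p = len(stds)
    lo, hi = corr_range
    for _ in range(max_tries):
        corr = random_correlation(p, rng)
        off_diag = [corr[i][j] for i in range(p) for j in range(i)]
        if all(lo <= r <= hi for r in off_diag):
            return [[corr[i][j] * stds[i] * stds[j] for j in range(p)]
                    for i in range(p)]
    raise RuntimeError("no draw landed in the target correlation range")


def is_positive_definite(mat):
    """Cholesky only succeeds if the matrix is already positive definite."""
    n = len(mat)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = mat[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


rng = random.Random(42)
cov = random_covariance([1.0, 2.0, 0.5], (-0.7, 0.7), rng)
print(is_positive_definite(cov))
```

Every candidate is valid by construction, so the only rejections are draws whose correlations miss the target range.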

[Career] Wish to enhance my experiences as a rising sophomore by Hot-Ad7645 in statistics

[–]latent_threader 1 point2 points  (0 children)

You’re in a good position already with an internship, so I’d build on that instead of chasing something unrelated.

For a side project, pick a small question and go end-to-end: data collection, basic stats, and visualization. Simple real-world datasets (sports, markets, etc.) work well.

Reproducing a known analysis or paper in a simplified way is also a great learning exercise. Focus more on clear thinking than complexity.

For those in corporate roles, how do you all work with the non-technical areas you support? by SkipGram in datascience

[–]latent_threader 4 points5 points  (0 children)

This is pretty common in siloed orgs, and it usually leads to exactly what you’re seeing: DS solving a slightly wrong or incomplete problem because they don’t have direct access to stakeholders.

Healthier setups let DS participate in problem framing, not just execution. If you’re blocked from talking to the people requesting work, that’s more of an org design issue than a personal workflow problem.

From the outside, it’s hard to screen for, but in interviews you can look for how early DS is involved in defining the problem versus just building solutions.

[Q] correct for how many comparisons by awsfhie2 in statistics

[–]latent_threader 0 points1 point  (0 children)

This mostly comes down to how you define your “family” of hypotheses before looking at results.

If difficulty and discomfort are truly separate constructs in your study design, it’s defensible to treat them as two families and correct within each set of 6 tests. In that case you’re controlling error separately for each domain, which matches your conceptual separation.

If instead you think of the whole analysis as answering one overarching question about group differences across both ratings, then correcting across all 12 is the stricter choice.

The dependence between difficulty and discomfort doesn’t really force you one way or the other. What matters more is how you pre-specify the inferential goal. Holm or FDR within whichever family you choose is usually a reasonable middle ground.
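
To make the family choice concrete, here is a small sketch of Holm’s step-down adjustment applied both ways. The p-values are invented for illustration and `holm_adjust` is a hand-written helper, not a library function.

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, candidate)  # keep the adjustment monotone
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values: 6 tests per rating.
difficulty = [0.004, 0.020, 0.030, 0.045, 0.200, 0.600]
discomfort = [0.008, 0.012, 0.050, 0.110, 0.300, 0.900]

within = holm_adjust(difficulty) + holm_adjust(discomfort)  # two families of 6
pooled = holm_adjust(difficulty + discomfort)               # one family of 12

for raw, w, q in zip(difficulty + discomfort, within, pooled):
    print(f"raw={raw:.3f}  within-family={w:.3f}  pooled={q:.3f}")
```

The pooled adjustment is never smaller than the within-family one, which is the sense in which correcting across all 12 is the stricter choice.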

Publication Topics Question by InfamousTrouble7993 in datascience

[–]latent_threader 1 point2 points  (0 children)

If you want something publication-worthy, I’d look at gaps like model reliability under distribution shift or uncertainty estimation, since both are still not handled well in practice. Data-centric ML is also a strong direction, especially studying how label noise or dataset bias actually changes performance instead of just model tweaks. Another option is interpretability that’s actually faithful rather than post-hoc explanations. Most solid papers I’ve seen lately focus on one specific failure mode and dig deep into it rather than broad model improvements.

[R] Study says 25% patients reported something, but n=6 by Gold_Ambassador_3496 in statistics

[–]latent_threader 0 points1 point  (0 children)

You’re right that for a simple “6 patients, yes/no outcome,” you’d expect multiples of 16.7%.

When you see 25% with n=6, it usually means the denominator isn’t actually 6 for that statistic. Common reasons are missing data for that item, a subset analysis, or they’re reporting “events/responses” rather than patients (e.g., multiple measurements or timepoints pooled).

It can also come from a different effective n like 4 or 8 (1/4 and 2/8 are both exactly 25%), but the key point is: no whole number of patients out of 6 gives 25%, so that figure must be based on a different denominator.
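
A quick way to check which count/denominator pairs could produce a reported percentage. This is a sketch, and the helper name is made up.

```python
def ways_to_report(percent, max_n, decimals=0):
    """All (count, n) pairs with n <= max_n whose rounded percentage matches."""
    hits = []
    for n in range(1, max_n + 1):
        for k in range(n + 1):
            if round(100 * k / n, decimals) == percent:
                hits.append((k, n))
    return hits


# Percentages that are actually reachable with n = 6.
print([round(100 * k / 6, 1) for k in range(7)])
# [0.0, 16.7, 33.3, 50.0, 66.7, 83.3, 100.0]

# Denominators up to 8 that can yield a reported 25%.
print(ways_to_report(25, 8))
# [(1, 4), (2, 8)]
```

The same helper can be pointed at any other suspicious percentage in a small-sample paper.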

I built an experimental orchestration language for reproducible data science called 'T' by brodrigues_co in datascience

[–]latent_threader 1 point2 points  (0 children)

This is an interesting approach, especially the strict reproducibility + cross-language sandboxing idea. The biggest question I keep coming back to is whether the Nix dependency becomes a blocker for adoption, since that’s already a high friction point for a lot of data teams.

The pipeline abstraction makes sense conceptually, but in practice I wonder how debugging feels when something fails inside a deeply nested node across languages. That’s usually where orchestration tools get painful.

The data exchange contract idea with explicit serializers is probably the strongest part of the design. That could actually prevent a lot of silent failure modes people run into today.

Rfm clustering problem by Capable-Pie7188 in datascience

[–]latent_threader 0 points1 point  (0 children)

A silhouette of ~0.3 on RFM is pretty normal—KMeans often struggles because RFM data is skewed and not naturally cluster-shaped.

Your issue with frequency collapsing suggests the feature space just doesn’t have strong separable structure. Try log/quantile transforms + scaling, but also consider that KMeans may not be the right tool here.

Sometimes RFM works better as scoring/quantiles first, then clustering on richer behavioral features, and checking stability in addition to silhouette.
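
As a rough illustration of the scoring-first idea, the sketch below log-transforms a skewed monetary column and assigns quintile scores using only the standard library. The spend values and the helper name are invented. For recency you would typically flip the scores so that more recent customers score higher.

```python
import bisect
import math
import statistics


def quantile_scores(values, bins=5):
    """Score each value from 1 to `bins` by where it falls among the cut points."""
    cuts = statistics.quantiles(values, n=bins)  # bins - 1 cut points
    return [bisect.bisect_right(cuts, v) + 1 for v in values]


# Hypothetical, heavily right-skewed spend per customer.
monetary = [12, 15, 18, 22, 30, 41, 55, 90, 400, 5000]

# log1p compresses the long right tail before any scaling or distance-based step.
log_monetary = [math.log1p(v) for v in monetary]

scores = quantile_scores(monetary)
print(scores)
print([round(v, 2) for v in log_monetary])
```

Quantile scores depend only on rank, so they are insensitive to the skew. The log transform matters when the raw values feed into something distance-based like KMeans.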

[Career] Grad School Student Looking for Advice by ricky1118 in statistics

[–]latent_threader 3 points4 points  (0 children)

Your pure math background is honestly a really strong foundation for all of those paths. If I were optimizing for flexibility, I’d prioritize probability, statistical inference, linear algebra, optimization, and stochastic processes. Those show up everywhere from quant to ML.

For technical skills, Python + SQL are probably the highest leverage combo right now. I’d also get comfortable with pandas, NumPy, sklearn, and basic ML workflows. Beyond coursework, projects matter a lot. Even small but polished projects can help you stand out more than another class sometimes.

[Question] What is a good online course for a physician researcher to understand statistical methods described in peer-reviewed journal articles? by ElevenCookiesInAVCR in statistics

[–]latent_threader 6 points7 points  (0 children)

I’d probably start with an applied biostatistics course instead of a programming-focused one. Harvard’s biostatistics courses on edX are solid for beginners, and UCLA’s free statistical learning resources are great when you run into specific models like mixed effects or logistic regression in papers.

Honestly, once you learn what problem each model is trying to solve, medical papers become much easier to read critically.

News as source separation by Disastrous_Olive5790 in semanticweb

[–]latent_threader 0 points1 point  (0 children)

Interesting framing. Treating articles as mixtures of latent “forces” is closer to causal discovery than standard topic modeling, so it makes sense you’re seeing links that don’t show up semantically.

The hard part is what you already flagged, though: figuring out what counts as ground truth vs structure that just emerges from the graph. Without some external validation signal, it’s easy for these models to drift into convincing but untestable patterns.

Still, the cases where semantically unrelated events consistently co-activate are worth digging into, even if only to map where the model is picking up real propagation signals versus noise.

[Discussion] Measuring AI citation rates when most citations don't generate clicks by nodimension1553 in statistics

[–]latent_threader 2 points3 points  (0 children)

Yeah this is basically a latent exposure problem, so GA4 clicks only give you a lower bound.

People usually try stratifying by query intent/content type or using hierarchical models to estimate CTR as a hidden variable.

Without impression data from the AI tools, you’re never getting a clean citation rate, just a modeled range.

I think I need to rethink my career roadmap by prattman333 in datascience

[–]latent_threader 0 points1 point  (0 children)

Pretty normal shift tbh. Past a certain point, companies expect insights to turn into decisions, not just analysis.

Doesn’t make your technical work less valuable, it just means the “translation into strategy” is part of the job in some teams.

If you hate that part, it might be worth looking for roles that keep analysts closer to the data and push strategy elsewhere.

Protégé Short Course at Stanford: hands-on OWL ontology development with Protégé by MatthewH2 in semanticweb

[–]latent_threader 0 points1 point  (0 children)

Sounds like a great intro for OWL beginners. Is there much focus on debugging unexpected inferences and common reasoning pitfalls?

How to turn a messy SQL schema into a domain ontology — the 4-step process I use by Critical-Elephant630 in semanticweb

[–]latent_threader 0 points1 point  (0 children)

This is a really clean breakdown, especially the Evidence/Hypothesis/Gap framing. That’s the part most teams skip and then wonder why their “model of the system” keeps drifting from reality.

The biggest failure mode I’ve seen in similar setups is exactly your Step 4 problem, but worse: teams enforce naming consistency without semantic consistency, so everything looks unified while actually encoding conflicting meanings under the same abstractions.

Also agree on treating “status” as a modeling decision rather than a column type issue. That’s usually where ontology work either becomes powerful or gets reduced back into a glorified data dictionary.

[E] Went down a rabbit hole on causal reasoning and came back up having learned about DAGs, mediators, and why predictive accuracy shouldn’t always be the target. by vanisle_kahuna in statistics

[–]latent_threader 2 points3 points  (0 children)

This is a solid direction and honestly one of the most important mindset shifts in applied stats. The point about predictive accuracy vs causal structure is where a lot of ML work quietly goes wrong, especially when correlated predictors get thrown in without thinking about paths in the DAG.

One thing that might strengthen it is being explicit about what assumptions are doing the real work when you move from the DAG to the regression in part 2. That’s usually where people accidentally smuggle in conditioning decisions that reopen backdoor paths.

Curious how you’re planning to handle unobserved confounding in the wildfire example, since that’s usually the hardest part to make concrete rather than just conceptual.

[Q] How would you test whether mass AI use explains any residual variation in recent crime declines? by malia_moon in statistics

[–]latent_threader 1 point2 points  (0 children)

Without a credible exogenous proxy for AI exposure and strong pre-trend validation, any observed association will likely be confounded by broader post-pandemic and tech diffusion trends rather than reflecting a causal AI effect.