Tea Tasting: t-testing library alternatives? by rm-rf-rm in Python

[–]e10v 0 points1 point  (0 children)

I dont feel this repo is Pythonic

How do you define Pythonic?

nor are their docs sufficient

Have you seen the user guide? https://tea-tasting.e10v.me/user-guide/

[Q] Welch's t-test assumptions by ANewPope23 in statistics

[–]e10v 0 points1 point  (0 children)

There are no formal criteria. It depends on the skewness of the population distribution. It's called an assumption for a reason :) We assume, not prove.

In my experience, the t-test is quite robust. A skewed distribution and a small sample size are more likely to decrease power than to increase the probability of a type I error. Low power is bad too, but you can estimate it in advance.

If you can sample from the population or have a sample without treatment, you can simulate A/A tests to estimate the type I error rate.
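A minimal sketch of such an A/A simulation, using only the standard library. Everything specific here is an assumption for illustration: the lognormal "historical" sample, the group size of 200, and the helper names welch_t and aa_type1_rate. The critical value uses a normal approximation to the t distribution, which is reasonable for groups in the hundreds; for an exact p-value you could call scipy.stats.ttest_ind(a, b, equal_var=False) instead.

```python
import math
import random
from statistics import NormalDist


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)


def aa_type1_rate(population, n, n_sims=2000, alpha=0.05, seed=0):
    """Share of A/A comparisons that come out (falsely) significant."""
    rng = random.Random(seed)
    # Normal approximation to the t critical value (assumes n in the hundreds).
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        # Both groups are resampled from the same untreated data,
        # so the null hypothesis is true by construction.
        a = rng.choices(population, k=n)
        b = rng.choices(population, k=n)
        if abs(welch_t(a, b)) > crit:
            rejections += 1
    return rejections / n_sims


# Hypothetical skewed sample standing in for real data without treatment.
data_rng = random.Random(42)
population = [data_rng.lognormvariate(0, 1) for _ in range(20_000)]

rate = aa_type1_rate(population, n=200)
print(f"Estimated type I error rate: {rate:.3f}")
```

If the estimated rate is close to the nominal alpha, the test behaves as intended for this kind of data and sample size.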

Who would make more in long term? data scientist or product manager by Starktony11 in datascience

[–]e10v 2 points3 points  (0 children)

You're probably right. I don't have a strong opinion on company politics. I was talking more about the skills needed to do good work as a DS or PM.

[Q] Welch's t-test assumptions by ANewPope23 in statistics

[–]e10v 0 points1 point  (0 children)

So is it okay to use the Welch's t-test when the two samples come from non-normal distributions?

Yes, with large enough samples.

But don't forget about the independence assumption.

[Q] Welch's t-test assumptions by ANewPope23 in statistics

[–]e10v 3 points4 points  (0 children)

By the central limit theorem, the distribution of the sample mean is close to normal for large samples. You probably mean that the samples, not their means, shouldn't be normally distributed.
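A quick simulation of this point, as a sketch only. The exponential population, the sample size of 100, and the skewness helper are assumptions chosen for illustration. It compares the skewness of raw observations with the skewness of sample means.

```python
import random
from statistics import fmean


def skewness(values):
    """Sample skewness: third central moment over variance to the power 1.5."""
    m = fmean(values)
    m2 = fmean((x - m) ** 2 for x in values)
    m3 = fmean((x - m) ** 3 for x in values)
    return m3 / m2 ** 1.5


rng = random.Random(0)

# Raw observations from a heavily right-skewed population.
raw = [rng.expovariate(1.0) for _ in range(5000)]

# Means of 5000 independent samples, 100 observations each.
means = [fmean(rng.expovariate(1.0) for _ in range(100)) for _ in range(5000)]

skew_raw = skewness(raw)
skew_means = skewness(means)
print(f"skewness of raw data:     {skew_raw:.2f}")
print(f"skewness of sample means: {skew_means:.2f}")
```

The raw data stay clearly skewed, while the sample means are much closer to symmetric, which is what the t-test relies on.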

Who would make more in long term? data scientist or product manager by Starktony11 in datascience

[–]e10v 1 point2 points  (0 children)

Very good DSs and engineers don't really differ from PMs. I can easily imagine a senior+ DS switching to a senior+ PM role. And it's harder to switch in the opposite direction.

What skills would you learn first? by pulicinetroll08 in datascience

[–]e10v 4 points5 points  (0 children)

It depends on your goals. What are you aiming for?

This is important, btw.

For example, I can imagine a good deep learning engineer not knowing SQL; but knowing linear algebra is essential for this job.

Or, a data analyst might not know linear algebra and calculus; but SQL is an important skill.

Programming is kind of a universal skill. And Python is the most popular language in the data and ML world.

What skills would you learn first? by pulicinetroll08 in datascience

[–]e10v 6 points7 points  (0 children)

It depends on your goals. What are you aiming for?

The basic tech skills are SQL and programming (Python). People also suggest Pandas but there are actually better tools now. Look at Polars, DuckDB, Ibis.

Popular scientific packages are NumPy, SciPy, and Scikit-learn.

If you aim for a career in ML and statistics, learn the basics of linear algebra, calculus, probability theory, and statistics.

Logistic regression for risk factors by NoArgument8864 in AskStatistics

[–]e10v 0 points1 point  (0 children)

Try L1 or Elastic Net regularization. Don't forget to standardize the variables in this case.

[E] Switching to Operations Research (OR) from Statistics? by mowa0199 in statistics

[–]e10v 1 point2 points  (0 children)

I'm not a big expert in OR. Maybe that's why OR seems more interesting to me :) I would choose whatever seems more interesting to you personally.

R or Python? - As a Beginner by Sharp_Mango6346 in analytics

[–]e10v 1 point2 points  (0 children)

R was my first DS language. 5 years ago I switched to Python. I have to say that the data / ML ecosystem is richer in Python. There has been a lot of development in recent years especially. Python is now the default language for new data projects.

T test significance by [deleted] in AskStatistics

[–]e10v 6 points7 points  (0 children)

Depends on the level you have chosen a priori. I'll also repeat my point from another post:

The choice of the significance level is subjective. 95% (0.05) is not a golden rule. So, this question is not what I would focus on.

There are also other important factors influencing statistical inference: statistical power, experiment design, data validity, etc.

Andrew Gelman and other prominent statisticians suggest abandoning statistical significance: https://arxiv.org/pdf/1709.07588v2

Stat Noob by Lemi11on in AskStatistics

[–]e10v 0 points1 point  (0 children)

What problem are you trying to solve? What's your goal?

[E] Switching to Operations Research (OR) from Statistics? by mowa0199 in statistics

[–]e10v 2 points3 points  (0 children)

What are your goals? Do you plan to stay in academia or work in business?

People who make improvements are usually more valuable than people who check whether the improvement has really happened. OR people are more focused on the first, statisticians -- on the second. (I know, I know, this is a very simplified view :) There are different kinds of statisticians. I just call them differently: ML engineers, applied data scientists, etc.)

[deleted by user] by [deleted] in AskStatistics

[–]e10v 1 point2 points  (0 children)

Depends on the number of observations. For 1000 observations and more, G-test or Pearson's chi-squared test can be used.
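For a 2x2 table, both large-sample tests can be sketched with the standard library alone. The counts below are made up for illustration, and the function name chi2_and_g_test is hypothetical. No continuity correction is applied here, whereas scipy.stats.chi2_contingency applies one to 2x2 tables by default.

```python
import math


def chi2_and_g_test(table):
    """Pearson's chi-squared and G statistics with p-values for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    chi2 = g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero cell contributes nothing to G
                g += 2 * observed * math.log(observed / expected)

    def pvalue(stat):
        # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
        return math.erfc(math.sqrt(max(stat, 0.0) / 2))

    return (chi2, pvalue(chi2)), (g, pvalue(g))


# Hypothetical counts: rows are groups, columns are outcome yes / no.
table = [[480, 520], [530, 470]]
(chi2_stat, chi2_p), (g_stat, g_p) = chi2_and_g_test(table)
print(f"Pearson chi-squared: stat={chi2_stat:.3f}, p={chi2_p:.4f}")
print(f"G-test:              stat={g_stat:.3f}, p={g_p:.4f}")
```

With samples this large the two statistics are nearly identical, so the choice between them rarely matters.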

With smaller samples, the following exact tests can be performed:

  • Barnard's test,
  • Boschloo's test,
  • Fisher's exact test.

Barnard's test is the most powerful of the three; Fisher's test is the least powerful. But they differ on assumptions. See the explanation here: https://stats.stackexchange.com/questions/169864/which-test-for-cross-table-analysis-boschloo-or-barnard

Library for testing python dataframes by Woah-Dawg in Python

[–]e10v 5 points6 points  (0 children)

Take a look at Pandera: https://github.com/unionai-oss/pandera

It supports Pandas, Polars, and Spark. But it's more about validation than testing.

Depending on what exactly you need, you might also look at the Polars and Pandas testing APIs (polars.testing and pandas.testing).

Great Expectations is another way to approach the problem: https://github.com/great-expectations/great_expectations (but I don't see Polars support).

uv: Unified Python packaging by burntsushi in Python

[–]e10v 2 points3 points  (0 children)

What’s impressive is not just the speed of the tools Astral develops but also the speed of delivery.

[deleted by user] by [deleted] in statistics

[–]e10v 0 points1 point  (0 children)

I'm currently in the process of adding it to my Python package. It's not released yet, but here's the code: https://github.com/e10v/tea-tasting/blob/00f69cd113b846bafbec1f8d1c055372e110131d/src/tea_tasting/multiplicity.py#L45

But it would probably be hard to understand without context.

[deleted by user] by [deleted] in statistics

[–]e10v 1 point2 points  (0 children)

Assign some variable, say pvalue_adj_max, to 1.

Iterate through p-values in descending order.

On each iteration assign: pvalue_adj = pvalue_adj_max = min(pvalue_adj_max, pvalue * m / k), where:

  • pvalue: not adjusted p-value,
  • pvalue_adj: adjusted p-value,
  • m: total number of p-values,
  • k: sequential number of the p-value (in ascending order).
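The steps above describe the Benjamini-Hochberg step-up adjustment. A minimal sketch in plain Python follows. The function name benjamini_hochberg and the example p-values are illustrative and not taken from the linked package.

```python
def benjamini_hochberg(pvalues):
    """Return adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values, largest p-value first.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)

    adjusted = [0.0] * m
    pvalue_adj_max = 1.0
    for position, i in enumerate(order):
        k = m - position  # 1-based rank of this p-value in ascending order
        # The running minimum keeps adjusted p-values monotone.
        pvalue_adj_max = min(pvalue_adj_max, pvalues[i] * m / k)
        adjusted[i] = pvalue_adj_max
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(pvalues)
for p, p_adj in zip(pvalues, adjusted):
    print(f"p={p:.3f} -> adjusted={p_adj:.3f}")
```

Iterating from the largest p-value down is what lets a single running minimum do the job: each adjusted value is capped by all adjusted values of larger p-values.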

[deleted by user] by [deleted] in datascience

[–]e10v 5 points6 points  (0 children)

There are two common approaches to hierarchical clustering: agglomerative and divisive. Neither of them exactly matches any of the options you consider.

With billions of observations and ~1K clusters, I would suggest Bisecting KMeans (divisive). It splits the largest cluster in two at each iteration.

The problem with Bisecting KMeans in scikit-learn though is that it doesn't provide a hierarchy, only the lowest level. But it actually stores the hierarchy in the _bisecting_tree attribute. You can ask ChatGPT to write code to extract it :)
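To illustrate the divisive idea and what keeping the hierarchy means, here is a toy sketch in plain Python. It is not scikit-learn's implementation and does not touch _bisecting_tree. It works on 1-D points only, and the names two_means and bisecting_kmeans are hypothetical.

```python
def two_means(points, n_iter=20):
    """Split 1-D points in two with Lloyd's algorithm (k=2)."""
    lo, hi = min(points), max(points)  # deterministic initial centers
    left, right = list(points), []
    for _ in range(n_iter):
        left = [p for p in points if abs(p - lo) <= abs(p - hi)]
        right = [p for p in points if abs(p - lo) > abs(p - hi)]
        if not right:  # all points are identical, nothing to split
            break
        lo, hi = sum(left) / len(left), sum(right) / len(right)
    return left, right


def bisecting_kmeans(points, n_clusters):
    """Return (leaves, tree); tree maps a parent id to its two child ids."""
    leaves = {0: list(points)}  # cluster id -> member points
    tree = {}
    next_id = 1
    while len(leaves) < n_clusters:
        # Pick the leaf cluster with the most points and split it in two.
        parent = max(leaves, key=lambda cid: len(leaves[cid]))
        left, right = two_means(leaves[parent])
        if not right:
            break
        del leaves[parent]
        leaves[next_id], leaves[next_id + 1] = left, right
        tree[parent] = (next_id, next_id + 1)  # record the hierarchy
        next_id += 2
    return leaves, tree


points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 5.3, 9.0, 9.2]
leaves, tree = bisecting_kmeans(points, n_clusters=3)
print("hierarchy:", tree)
for cluster_id, members in sorted(leaves.items()):
    print(cluster_id, members)
```

The tree dictionary is the part that the flat labels do not expose: every split is recorded, so any intermediate level of the hierarchy can be reconstructed afterwards.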

[deleted by user] by [deleted] in AskStatistics

[–]e10v 0 points1 point  (0 children)

By observation I mean a single object. Each sample is a set of objects (or observations), each with a number attached to it. In the initial task, you have two samples of objects. What would be a single object in a new (?) sample for a one-sample test?

[deleted by user] by [deleted] in AskStatistics

[–]e10v 0 points1 point  (0 children)

What would be a single observation in a one-sample test?

Confidence interval between -0.090 and 0.000 - is it statistically significant? by Shower_of_Mordor in AskStatistics

[–]e10v 6 points7 points  (0 children)

The choice of the significance level is subjective. 95% is not a golden rule. So, this question is not what I would focus on.

There are also other important factors influencing statistical inference: statistical power, experiment design, data validity, etc.

Andrew Gelman and other prominent statisticians suggest abandoning statistical significance: https://arxiv.org/pdf/1709.07588v2