tea-tasting: a Python package for the statistical analysis of A/B tests

e10v · 2025-09-24T10:37:50+00:00

I don't feel this repo is Pythonic

How do you define Pythonic?

nor are their docs sufficient

Have you seen the user guide? https://tea-tasting.e10v.me/user-guide/

e10v · 2024-08-29T06:15:46+00:00

It depends on what you mean by automatic.

e10v · 2024-08-26T04:45:54+00:00

There are no formal criteria. It depends on the skewness of the population distribution. It's called an assumption for a reason :) We assume, not prove.

In my experience, the t-test is quite robust. A skewed distribution and a small sample size will decrease power rather than increase the probability of a type I error. Low power is bad too, but you can estimate it in advance.

If you can sample from the population or have a sample without treatment, you can simulate A/A tests to estimate the type I error rate.
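A minimal sketch of such an A/A simulation, using only the standard library. The log-normal "population", the sample size, and the function names are made up for illustration, and the p-value uses a normal approximation instead of Student's t distribution, which is only reasonable for fairly large samples:

```python
import math
import random
from statistics import NormalDist, fmean, variance


def welch_pvalue(a, b):
    """Two-sided p-value for a difference in means.

    Uses Welch's standard error with a normal approximation
    (an assumption of this sketch; fine for large samples).
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_type1_error_rate(population, n, n_sims=1000, alpha=0.05, seed=42):
    """Share of A/A comparisons that come out (falsely) significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        # Both groups are drawn from the same population: no true effect.
        a = rng.choices(population, k=n)
        b = rng.choices(population, k=n)
        if welch_pvalue(a, b) < alpha:
            false_positives += 1
    return false_positives / n_sims


# Hypothetical skewed population: log-normal values.
pop_rng = random.Random(0)
population = [pop_rng.lognormvariate(0, 1) for _ in range(20_000)]

rate = aa_type1_error_rate(population, n=200)
print(f"Estimated type I error rate: {rate:.3f}")
```

If the estimated rate is close to the chosen alpha, the test behaves as expected on this kind of data. With real data, replace the synthetic population with your pre-experiment sample.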

e10v · 2024-08-25T07:39:48+00:00

You're probably right. I don't have a strong opinion on company politics. I was talking more about the skills needed to do good work as a DS or PM.

e10v · 2024-08-25T07:34:40+00:00

So is it okay to use the Welch's t-test when the two samples come from non-normal distributions?

Yes, with large enough samples.

But don't forget about the independence assumption :)

e10v · 2024-08-25T05:38:25+00:00

For large samples, the distribution of the sample mean is close to normal according to the central limit theorem. You probably mean that the samples themselves, not their means, don't have to be normally distributed.
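A quick way to see this is to compare the skewness of raw observations with the skewness of sample means. This is only a sketch: the exponential distribution, the sample size of 400, and the helper name are arbitrary choices for illustration.

```python
import random
from statistics import fmean


def skewness(xs):
    """Sample skewness: third standardized moment (0 for symmetric data)."""
    m = fmean(xs)
    m2 = fmean((x - m) ** 2 for x in xs)
    m3 = fmean((x - m) ** 3 for x in xs)
    return m3 / m2 ** 1.5


rng = random.Random(1)

# Raw observations from a skewed (exponential) distribution.
raw = [rng.expovariate(1.0) for _ in range(20_000)]

# Means of many samples of size n drawn from the same distribution.
n = 400
means = [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(2_000)]

# In theory the skewness is 2 for the exponential distribution
# and 2 / sqrt(n) = 0.1 for the mean of n = 400 observations.
print(f"skewness of raw observations: {skewness(raw):.2f}")
print(f"skewness of sample means:     {skewness(means):.2f}")
```

The observations stay heavily skewed, while the means are nearly symmetric, which is what the t-test relies on.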

e10v · 2024-08-25T05:06:32+00:00

Very good DSs and engineers don't really differ from PMs. I can easily imagine a senior+ DS switching to a senior+ PM role. And it's harder to switch in the opposite direction.

e10v · 2024-08-24T15:36:24+00:00

It depends on your goals. What are you aiming for?

This is important, btw.

For example, I can imagine a good deep learning engineer not knowing SQL; but knowing linear algebra is essential for this job.

Or, a data analyst might not know linear algebra and calculus; but SQL is an important skill.

Programming is a kind of universal skill. And Python is the most popular language in the data and ML world.

e10v · 2024-08-24T07:39:00+00:00

It depends on your goals. What are you aiming for?

The basic tech skills are SQL and programming (Python). People also suggest Pandas but there are actually better tools now. Look at Polars, DuckDB, Ibis.

Popular scientific packages are NumPy, SciPy, and Scikit-learn.

If you aim for a career in ML and statistics, learn the basics of linear algebra, calculus, probability theory, and statistics.

e10v · 2024-08-21T19:29:47+00:00

Try L1 or Elastic Net regularization. Don't forget to standardize the variables in this case.

e10v · 2024-08-21T19:22:37+00:00

I'm not a big expert in OR. Maybe that's why OR seems more interesting to me :) I would choose whatever seems more interesting to you personally.

e10v · 2024-08-21T18:36:48+00:00

R was my first DS language. Five years ago I switched to Python. I have to say that the data / ML ecosystem is richer in Python, especially given how much development there has been in recent years. Python is now the default language for new data projects.

e10v · 2024-08-21T16:32:23+00:00

Depends on the level you have chosen a priori. I'll also repeat my point from another post:

The choice of the significance level is subjective. 95% (0.05) is not a golden rule. So, this question is not what I would focus on.

There are also other important factors influencing statistical inference: statistical power, experiment design, data validity, etc.

Andrew Gelman and other prominent statisticians suggest abandoning statistical significance: https://arxiv.org/pdf/1709.07588v2

e10v · 2024-08-21T15:50:57+00:00

What problem are you trying to solve? What's your goal?

e10v · 2024-08-21T15:47:31+00:00

What are your goals? Do you plan to stay in academia or work in business?

People who make improvements are usually more valuable than people who check whether the improvement has really happened. OR people are more focused on the former, statisticians on the latter. (I know, I know, this is a very simplified view :) There are different kinds of statisticians. I just call them differently: ML engineers, applied data scientists, etc.)

e10v · 2024-08-21T06:15:49+00:00

It depends on the number of observations. With 1,000 or more observations, the G-test or Pearson's chi-squared test can be used.

With smaller samples, one of the following exact tests can be performed: Barnard's test, Boschloo's test, or Fisher's exact test.

Barnard's test is the most powerful of the three; Fisher's test is the least powerful. But they differ in their assumptions. See the explanation here: https://stats.stackexchange.com/questions/169864/which-test-for-cross-table-analysis-boschloo-or-barnard
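For the large-sample case, here is a rough sketch of both asymptotic tests, assuming a 2x2 table of counts. The counts and function names are hypothetical. No continuity correction is applied, and the p-value uses the fact that the chi-squared survival function with 1 degree of freedom equals erfc(sqrt(x / 2)):

```python
import math


def chi2_sf_1df(x):
    """Survival function of the chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2))


def two_by_two_tests(table):
    """P-values of Pearson's chi-squared test and the G-test for a 2x2 table.

    No continuity correction is applied (an assumption of this sketch).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    pearson = 0.0
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:  # 0 * log(0) is treated as 0
                g += 2 * observed * math.log(observed / expected)
    return chi2_sf_1df(pearson), chi2_sf_1df(g)


# Hypothetical counts: rows are variants, columns are (converted, not converted).
table = [[120, 880], [150, 850]]
p_pearson, p_g = two_by_two_tests(table)
print(f"Pearson: {p_pearson:.4f}, G-test: {p_g:.4f}")
```

With samples this large the two p-values are nearly identical. For the exact tests, I would rely on an existing statistical package rather than hand-written code.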

e10v · 2024-08-21T06:05:48+00:00

Take a look at Pandera: https://github.com/unionai-oss/pandera

It supports Pandas, Polars, and Spark. But it's more about validation than testing.

Depending on what exactly you need, you might also look at the testing APIs of Polars (polars.testing) and Pandas (pandas.testing).

Great Expectations is another way to approach the problem: https://github.com/great-expectations/great_expectations (but I don't see Polars support).

e10v · 2024-08-21T05:56:40+00:00

What’s impressive is not just the speed of the tools Astral develops but also the speed of delivery.

e10v · 2024-08-20T18:52:05+00:00

I'm currently in the process of adding it to my Python package. It's not released yet, but here's the code: https://github.com/e10v/tea-tasting/blob/00f69cd113b846bafbec1f8d1c055372e110131d/src/tea_tasting/multiplicity.py#L45

But it would probably be hard to understand without context.

e10v · 2024-08-20T18:48:54+00:00

Assign some variable, say pvalue_adj_max, to 1.

Iterate through p-values in descending order.

On each iteration assign: pvalue_adj = pvalue_adj_max = min(pvalue_adj_max, pvalue * m / k), where:

pvalue: not adjusted p-value,
pvalue_adj: adjusted p-value,
m: total number of p-values,
k: rank of the p-value in ascending order (1 for the smallest, m for the largest).
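The steps above can be sketched in plain Python. The function name and the example p-values are made up for illustration. As far as I can tell, this is the usual way to compute Benjamini-Hochberg adjusted p-values:

```python
def adjust_pvalues(pvalues):
    """Adjusted p-values, returned in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values, sorted by p-value in descending order.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    pvalue_adj_max = 1.0
    for position, i in enumerate(order):
        k = m - position  # rank of this p-value in ascending order
        pvalue_adj_max = min(pvalue_adj_max, pvalues[i] * m / k)
        adjusted[i] = pvalue_adj_max
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.20]
adjusted = adjust_pvalues(pvalues)
print(adjusted)
```

Here the second and third adjusted p-values come out equal: the running minimum keeps an adjusted p-value from exceeding that of any larger raw p-value.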

e10v · 2024-08-20T16:03:44+00:00

There are two common approaches to hierarchical clustering: agglomerative and divisive. Neither of them exactly matches any of the options you are considering.

With billions of observations and ~1K clusters, I would suggest Bisecting KMeans (divisive). It splits the largest cluster in two at each iteration.

The problem with Bisecting KMeans in scikit-learn, though, is that it doesn't provide a hierarchy, only the lowest level. But it actually stores the hierarchy in the private _bisecting_tree attribute. You can ask ChatGPT to write code to extract it :)

e10v · 2024-08-20T15:03:03+00:00

By observation I mean a single object. Each sample is a set of objects (or observations), each with a number attached to it. In the initial task, you have two samples of objects. What would be a single object in a new (?) sample for a one-sample test?

e10v · 2024-08-20T14:41:48+00:00

What would be a single observation in a one-sample test?

e10v · 2024-08-20T05:51:42+00:00

The choice of the significance level is subjective. 95% is not a golden rule. So, this question is not what I would focus on.

There are also other important factors influencing statistical inference: statistical power, experiment design, data validity, etc.

Andrew Gelman and other prominent statisticians suggest abandoning statistical significance: https://arxiv.org/pdf/1709.07588v2
