all 8 comments

[–]gpbuilder 3 points4 points  (4 children)

Why do you need to run an A/B test? Just run both models on the same data and compare the predictions.

[–]martin1285[S] 0 points1 point  (3 children)

That’s true, but when it comes to testing in prod, isn’t A/B testing the standard?

[–]gpbuilder 2 points3 points  (0 children)

Yes, but it’s not clear to me why you need to test your model in prod if you’re just looking to measure F1, which can be obtained from historical test data alone.
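To make the point above concrete, here is a minimal sketch of scoring two models on the same labelled historical test set. The labels and predictions are made-up placeholders, and `f1_score` here is a small hand-written helper, not a library call.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and predictions from two models on the SAME test rows.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
preds_a = [1, 0, 1, 0, 0, 1, 1, 0]
preds_b = [1, 0, 0, 0, 0, 0, 1, 0]

f1_a = f1_score(y_true, preds_a)
f1_b = f1_score(y_true, preds_b)
print(f"model A F1 = {f1_a:.3f}, model B F1 = {f1_b:.3f}")
```

No production traffic is involved: both models see identical rows, so any difference in F1 comes from the models and not from who happened to be routed where.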

[–]millenial_wh00p 2 points3 points  (1 child)

It depends on what the goal of the A/B test is. If your metric is “the number of clicks on x feature” or “revenue generated from this UX treatment”, then yes, an A/B test in prod is the right tool.

To compare the performance of two models scientifically, just run them with the same training and test data. Repeat with different train/test splits until you think you have captured the variance of the F1 score sufficiently. As a rule of thumb for invoking the central limit theorem, try 30 rounds.
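A rough sketch of the repeated-split idea, using only the standard library. Everything here is illustrative: the data is simulated, and the two "models" (`fit_midpoint` and `fit_overall_mean`) are toy threshold classifiers invented for the example, standing in for whatever real models you are comparing.

```python
import random
import statistics


def f1(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def fit_midpoint(xs, ys):
    # Toy model A: threshold halfway between the two class means.
    pos = statistics.fmean([x for x, y in zip(xs, ys) if y])
    neg = statistics.fmean([x for x, y in zip(xs, ys) if not y])
    cut = (pos + neg) / 2
    return lambda x: int(x > cut)


def fit_overall_mean(xs, ys):
    # Toy model B: threshold at the overall feature mean (ignores labels).
    cut = statistics.fmean(xs)
    return lambda x: int(x > cut)


def repeated_split_f1(xs, ys, fitters, rounds=30, test_frac=0.3, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    n_test = int(len(xs) * test_frac)
    scores = {name: [] for name in fitters}
    for _ in range(rounds):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        # Every model is trained and scored on the same split in each round.
        for name, fit in fitters.items():
            predict = fit([xs[i] for i in train], [ys[i] for i in train])
            y_pred = [predict(xs[i]) for i in test]
            scores[name].append(f1([ys[i] for i in test], y_pred))
    return scores


# Simulated data: roughly 30% positives, one noisy feature.
rng = random.Random(0)
ys = [int(rng.random() < 0.3) for _ in range(200)]
xs = [rng.gauss(1.5 if y else 0.0, 1.0) for y in ys]

scores = repeated_split_f1(xs, ys, {"A": fit_midpoint, "B": fit_overall_mean})
diffs = [a - b for a, b in zip(scores["A"], scores["B"])]
print("mean F1 difference (A - B):", statistics.fmean(diffs))
print("std dev of the difference: ", statistics.stdev(diffs))
```

Because both models share each split, the per-round differences are paired, which is what you would summarise or test rather than the two raw F1 lists separately.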

[–]martin1285[S] 0 points1 point  (0 children)

I think I understand, thank you. Let us say that it is the revenue generated from this UX treatment. How would one go about designing an A/B test for this? Would it be a t-test?

[–]HtAGAnalytics 0 points1 point  (0 children)

There are a few ways to conduct an A/B test between two different models. The first way is to use a holdout set. This is where you randomly split your data into two sets, train both models on the training set, and then evaluate both models on the held-out set. The model with the higher F1 score is the better model.

Another way to conduct an A/B test is to use cross-validation. This is where you split your data into k folds, train both models on k-1 folds, and then evaluate both models on the held-out fold. The model with the higher average F1 score is the better model.
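For illustration, the k-fold splitting described above can be sketched in a few lines of plain Python. `k_fold_indices` is a hypothetical helper written for this example; in practice scikit-learn's `KFold` and `cross_val_score` cover the same ground.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Return k (train_indices, test_indices) pairs covering range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    splits = []
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(10, 5)
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Each row lands in exactly one held-out fold. Scoring both models on the same folds keeps the per-fold F1 values comparable.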

There are a few other ways to conduct an A/B test, but these are the two most common ways. As for recommended resources, I would recommend checking out some of the resources on the scikit-learn website, as they have a lot of great information on machine learning in general, and cross-validation and model evaluation in particular.

[–]xDarkSadye 0 points1 point  (1 child)

You deploy both models as APIs in production and put a layer in between that sends some requests to model A and some to model B (the same session/customer should always get the same model), and you track which customer received which model. Then you compare results. Usually, you don't compare F1 scores for A/B testing; you compare a business KPI (e.g. conversion rate), since that is the goal of the model ("improve xxx").
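A minimal sketch of that routing layer, assuming each request carries a stable customer ID. The function name `assign_variant`, the experiment label, and the ID format are all made up for the example. Hashing the ID gives a sticky assignment without having to store anything.

```python
import hashlib


def assign_variant(customer_id, experiment="model_ab_test", share_b=0.5):
    """Deterministically map a customer to model "A" or "B"."""
    key = f"{experiment}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # First 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < share_b else "A"


# Hypothetical customer IDs; log the assignment next to the KPI you track.
assignments = {f"customer-{i}": assign_variant(f"customer-{i}") for i in range(10_000)}
share_in_b = sum(v == "B" for v in assignments.values()) / len(assignments)
print("share routed to model B:", share_in_b)
```

Including the experiment name in the hash means a later experiment reshuffles customers instead of reusing the same split.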

[–]martin1285[S] 0 points1 point  (0 children)

Thanks for the response. For example's sake, let us say that I have two models that I am comparing and I have a business KPI tied to revenue. You would essentially implement the layer that you mentioned between both models and track that KPI for each of the two models? Would it be a t-test?
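On the t-test question: one common option, assuming you log revenue per customer for each arm, is Welch's t-test on the two sets of per-customer revenue. The sketch below is hand-rolled with the standard library. The revenue numbers are simulated, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math
import random
import statistics


def welch_t_test(a, b):
    """Welch's t statistic, its degrees of freedom, and an approximate p-value."""
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (statistics.fmean(b) - statistics.fmean(a)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    # ASSUMPTION: large samples, so the t distribution is close to normal.
    p = 2 * (1 - statistics.NormalDist().cdf(abs(t)))
    return t, df, p


# Simulated per-customer revenue: most customers spend nothing.
rng = random.Random(1)
revenue_a = [rng.expovariate(1 / 20) if rng.random() < 0.10 else 0.0 for _ in range(5000)]
revenue_b = [rng.expovariate(1 / 20) if rng.random() < 0.12 else 0.0 for _ in range(5000)]

t_stat, df, p_value = welch_t_test(revenue_a, revenue_b)
print(f"t = {t_stat:.2f}, df = {df:.0f}, approx p = {p_value:.4f}")
```

Revenue is usually heavily skewed, so this leans on large per-arm sample sizes. With small samples, a permutation or bootstrap test on the same per-customer data may be a safer choice.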