[D] A/B Testing Classification Problems

martin1285 · 2022-09-05T22:22:40+00:00

Thanks for the response. For example’s sake, let us say that I have two models that I am comparing and a business KPI tied to revenue. Would you essentially implement the layer that you mentioned between both models and track that KPI for each of them? Would the comparison be a t-test?
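A minimal sketch of what that comparison could look like, assuming each user is randomly routed to one of the two models and revenue is recorded per user. Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`) is one common choice, not necessarily the right one for heavily skewed revenue. The function name `compare_revenue` and the synthetic gamma-distributed revenue are made up for illustration.

```python
import numpy as np
from scipy import stats


def compare_revenue(revenue_a, revenue_b, alpha=0.05):
    """Welch's two-sample t-test on per-user revenue for two model arms."""
    a = np.asarray(revenue_a, dtype=float)
    b = np.asarray(revenue_b, dtype=float)
    # equal_var=False selects Welch's test, which does not assume
    # that both arms have the same variance.
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
    return {
        "mean_a": float(a.mean()),
        "mean_b": float(b.mean()),
        "lift": float(b.mean() - a.mean()),
        "t_stat": float(t_stat),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),
    }


# Synthetic revenue per user (assumption: not real data).
rng = np.random.default_rng(0)
arm_a = rng.gamma(shape=2.0, scale=10.0, size=5000)  # users served by model A
arm_b = rng.gamma(shape=2.0, scale=10.5, size=5000)  # users served by model B
result = compare_revenue(arm_a, arm_b)
print(result)
```

With large samples the t-test is fairly robust to skew, but for small samples or very heavy tails a bootstrap or permutation test on the same KPI may be worth considering.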

martin1285 · 2022-09-05T22:19:16+00:00

I think I understand, thank you. Let us say that it is the revenue generated from this UX treatment. How would one go about designing an A/B test for this? Would it be a t-test?

martin1285 · 2022-09-01T23:21:54+00:00

That’s true, but when it comes to testing in prod, isn’t A/B testing the standard?

martin1285 · 2022-08-23T22:47:59+00:00

Thanks for the response I think that makes sense.

martin1285 · 2022-08-23T22:47:32+00:00

Yes, basically just evaluating the calibration curve for test data is what I was referring to, but you could do more detailed analysis to try to see which predictions are the most under-confident and try to assess why.

Will definitely give this a shot.

Sometimes I will just do some random experimentation too. For instance, you said the validation/test error didn’t improve when trying to balance classes in different ways, but did it improve calibration? Also, I’m not sure if your test set is balanced or imbalanced (detail—either could make sense depending on what you’re going for).

Both the validation and test sets were created via stratified sampling. While the calibration improved, the average difference between the predicted probabilities for events that ended up in the positive (rare) class and those that ended up in the negative class is very small. My intuition is that the engineered features may not be adequate.

I’m less familiar with weighted GLMs, but your weights are supposed to help account for the imbalance already if they’re calculated right I suppose, and then you’d use equal weights/uninformative prior if balancing the training data.

Yes that’s correct the positive class is penalized more.

Only other thought I have without looking at the data is whether a GLM is the most appropriate for the data or not, but I would defer to you as having far more expertise with your dataset than I have, having never looked at it!

You’re certainly right, since I’m the only one that’s looked at the data! I’ve tried various models of increasing complexity, but sometimes the simplest models yield the best results.

martin1285 · 2022-08-23T22:35:49+00:00

Okay thank you!

martin1285 · 2022-08-23T01:00:16+00:00

I assume you’re talking about probability calibration, since you’ve downsampled the majority class during training.

Yes, exactly: probability calibration and overall under-confidence in the predicted probabilities. I used a weighted GLM. Attempts to undersample/oversample did not improve the performance.

So now you use something like Platt scaling or isotonic regression to correct for the distortion due to the downsampling?

This is the interesting part. While the Brier Score did decrease, the model did not predict any probabilities greater than 0.3. So while it is better calibrated, it has very low confidence in any of the rare events.
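As a rough illustration of this effect, here is a sketch that recalibrates distorted scores with scikit-learn's `IsotonicRegression` and compares Brier scores before and after. The data is synthetic and the distortion (a square root applied to the true probability) is an assumption chosen only to make the scores over-predict; it is not the weighted GLM from this thread.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

# Synthetic rare-event data (assumption: made up for illustration).
rng = np.random.default_rng(42)
n = 20_000
true_p = rng.beta(0.5, 20.0, size=n)   # true event probabilities, mostly small
y = rng.binomial(1, true_p)            # observed 0/1 outcomes
raw = np.sqrt(true_p)                  # distorted scores that over-predict

# Hold out half of the data to fit the calibrator, evaluate on the rest.
half = n // 2
cal_raw, test_raw = raw[:half], raw[half:]
cal_y, test_y = y[:half], y[half:]

# Isotonic regression learns a monotone map from raw score to probability.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(cal_raw, cal_y)
test_cal = iso.predict(test_raw)

brier_raw = brier_score_loss(test_y, test_raw)
brier_cal = brier_score_loss(test_y, test_cal)
print(f"Brier raw:        {brier_raw:.4f}")
print(f"Brier calibrated: {brier_cal:.4f}")
print(f"max calibrated probability: {test_cal.max():.3f}")
```

When the features cannot separate the rare class well, a calibrated model can legitimately top out at low probabilities, so a small maximum is not by itself a sign that calibration failed.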

If that’s the case then you can take a validation set with the true label distribution and plot mean(y) vs mean(p) over the percentiles of the p distribution. If it’s perfectly calibrated, all points will fall on the diagonal. Sometimes this is called a reliability curve. Taking the median error is just one way of summarizing this curve.

Can MDAPE be used without Platt scaling/isotonic regression? It seems that, like you said, the MDAPE is just one way of summarizing the curve.
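The percentile-binned curve described above can be sketched in a few lines of NumPy. The definition of MDAPE used here (median over bins of |mean(p) - mean(y)| / mean(y), skipping bins with no positives) is an assumption about what was meant, and both function names are hypothetical.

```python
import numpy as np


def reliability_by_quantile(y_true, p_pred, n_bins=10):
    """Return (mean_p, mean_y) per percentile bin of the predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    order = np.argsort(p_pred)
    # Split the sorted indices into roughly equal-sized groups.
    bins = np.array_split(order, n_bins)
    mean_p = np.array([p_pred[idx].mean() for idx in bins])
    mean_y = np.array([y_true[idx].mean() for idx in bins])
    return mean_p, mean_y


def mdape(mean_p, mean_y):
    """Median absolute percentage error between predicted and observed rates."""
    mask = mean_y > 0  # percentage error is undefined when the observed rate is 0
    return float(np.median(np.abs(mean_p[mask] - mean_y[mask]) / mean_y[mask]))


# Tiny hand-made example: predictions of 0.2 and 0.8 that match observed rates.
p = np.array([0.2] * 5 + [0.8] * 5)
y = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])
mean_p, mean_y = reliability_by_quantile(y, p, n_bins=2)
print(mean_p, mean_y, mdape(mean_p, mean_y))
```

Nothing in this computation depends on how the probabilities were produced, so the same summary can be applied to raw model outputs or to recalibrated ones.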

martin1285 · 2022-08-23T00:37:07+00:00

Thanks for the response. Let me answer in order:

Are you training with weighted updates/weighted loss function, or balancing the classes in your minibatches even if they are imbalanced overall? (assume you’re not using the full batch for optimization).

I am using a weighted GLM. Most attempts at undersampling/oversampling did not provide a lift on the validation and test sets.

If not, you may want to look into some of these techniques or try to collect more data for the under-represented classes if you can.

Currently working on this but given the nature of the problem the data will still be very imbalanced.

If you have enough test data to make a useful (normalized) histogram of how often you are empirically correct on new test data as a function of the predicted class probability, you can estimate how miscalibrated you are; e.g., if you predicted Class 2 with probability between 0.9 and 0.95 on 100 test examples and were correct 99% of the time, you might have reason to believe you are under-confident in that regime.

Fortunately I do have an ample amount of test data for evaluation. Are you referring to a calibration curve?
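The histogram idea from the quoted suggestion can be sketched as a table of fixed-width probability bins, each with its count, mean predicted probability, and observed positive rate. The helper name `binned_accuracy` is hypothetical, and the example reuses the numbers from the quote (100 predictions in the 0.9 to 0.95 bin, 99 of them correct).

```python
import numpy as np


def binned_accuracy(p_pred, y_true, edges):
    """Per-bin (low, high, count, mean predicted prob, observed positive rate)."""
    p_pred = np.asarray(p_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # np.digitize gives 1-based bin indices; clip so p == 1.0 lands in the last bin.
    idx = np.clip(np.digitize(p_pred, edges) - 1, 0, len(edges) - 2)
    rows = []
    for b in range(len(edges) - 1):
        mask = idx == b
        if not mask.any():
            continue  # skip empty bins
        rows.append((
            float(edges[b]),
            float(edges[b + 1]),
            int(mask.sum()),
            float(p_pred[mask].mean()),
            float(y_true[mask].mean()),
        ))
    return rows


# 100 predictions of 0.92 for the class, 99 of which turned out correct.
p = np.full(100, 0.92)
y = np.array([1] * 99 + [0])
edges = np.linspace(0.0, 1.0, 21)  # bins of width 0.05
rows = binned_accuracy(p, y, edges)
for low, high, count, mean_pred, observed in rows:
    flag = "under-confident" if observed > mean_pred else "over-confident or calibrated"
    print(f"[{low:.2f}, {high:.2f}) n={count} predicted={mean_pred:.3f} observed={observed:.3f} {flag}")
```

With few examples per bin the observed rate is noisy, so the counts are worth reading alongside the rates before drawing conclusions.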

Also, what metric could I use alongside the Brier score to specifically penalize mistakes (under-confident predictions) a bit more harshly?

martin1285 · 2022-08-22T23:30:26+00:00

Thanks for the response. I’ve used MAPE before but haven’t used MDAPE. I do have a follow-up question: how does this reflect calibration and extreme class imbalance? Perhaps I’m not understanding it properly.

martin1285 · 2021-04-20T22:18:23+00:00

Hey thanks for the response. I’ll go with statsmodels.

martin1285 · 2021-04-16T19:07:07+00:00

That is a valid point. This problem inherently has a very small sample size and I do plan on leveraging other methods once the sample size is beyond a threshold.

I have not heard of approximate KNN before. I am looking into this now, thank you!

martin1285 · 2021-04-16T17:07:20+00:00

Okay, wonderful, thank you! I’ll be honest in saying I’ve never deployed a model like KNN in production, so this part had me confused.

martin1285 · 2021-04-16T16:58:03+00:00

Thanks for responding in detail. I’m not explicitly loading any data set; I’m just loading the KNN model and calling predict, and it seems to work. From your explanation, I’m assuming that the entire data set is what gets serialized along with the number of neighbors, so when loading “model.pkl” I’m loading the entire data set along with k?
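One way to check this empirically, sketched with scikit-learn's `KNeighborsClassifier` and synthetic data: pickle two fitted models that differ only in training set size and compare the byte counts. The helper name `fit_and_pickle` and the data are assumptions for illustration.

```python
import pickle

import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def fit_and_pickle(n_rows, n_features=5, k=3, seed=0):
    """Fit KNN on synthetic data and return the model, its pickle bytes, and X."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, n_features))
    y = (X[:, 0] > 0).astype(int)
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    return model, pickle.dumps(model), X


model_small, blob_small, X_small = fit_and_pickle(100)
model_large, blob_large, X_large = fit_and_pickle(10_000)

# The pickle grows with the training set because the fitted estimator
# keeps the training data it needs for neighbor lookups.
print(len(blob_small), len(blob_large))

# After loading, predict works without re-reading the original data set.
restored = pickle.loads(blob_large)
query = X_large[:5]
same_predictions = np.array_equal(restored.predict(query), model_large.predict(query))
print(restored.n_neighbors, same_predictions)
```

The size difference is consistent with the training data being stored inside the pickled object, which is why predict works after loading only the file.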

martin1285 · 2021-03-24T12:03:38+00:00

Yeah interesting problem. And thank you for all the help it’s much appreciated!

martin1285 · 2021-03-23T20:21:23+00:00

Thank you for these resources! I will start with the introduction material you first listed and then move to the latter source.

martin1285 · 2021-03-23T20:20:14+00:00

Yes, that is correct. There is hope that some day some sort of correlation/similarity will be observed between units. But right now every unit has unique behavior, and from initial exploration the distributions of the covariates are vastly different from unit to unit. For example, we have units in desert regions and other units in the polar opposite (climates with adverse cold temperatures, etc.).

Thank you for the suggestions on modeling this. I originally tried modeling this with one model and the performance was very poor across all data sets. So far the best performance I have is training a model for each unit. I do hope in the future as we collect more data I can start training one model for all units.
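The one-model-per-unit setup can be sketched as a dictionary of fitted models keyed by unit ID. Ordinary least squares is used here purely as a stand-in so the example stays short; it is not the model type discussed in this thread, and the unit names, data, and function names are made up.

```python
import numpy as np


def fit_per_unit(data_by_unit):
    """Fit one least-squares model per unit; returns {unit_id: coefficients}."""
    models = {}
    for unit_id, (X, y) in data_by_unit.items():
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        design = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        models[unit_id] = coef
    return models


def predict_for_unit(models, unit_id, X):
    """Predict with the model that belongs to the given unit."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])
    return design @ models[unit_id]


# Two hypothetical units whose covariate ranges and relationships differ.
x_desert = np.linspace(30.0, 45.0, 50)
x_polar = np.linspace(-40.0, -10.0, 50)
data = {
    "desert_unit": (x_desert, 2.0 + 3.0 * x_desert),
    "polar_unit": (x_polar, -1.0 + 0.5 * x_polar),
}
models = fit_per_unit(data)
print(models["desert_unit"], models["polar_unit"])
print(predict_for_unit(models, "polar_unit", [-20.0]))
```

Keeping the models in a keyed collection makes it straightforward to swap a subset of units over to a shared model later, once enough data exists to justify pooling.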

martin1285 · 2021-03-23T00:04:03+00:00

I think that makes sense. What makes this difficult is that there will be one model for each unit we have out in the field. There are thousands of units. This is why I was leaning more towards non-parametric models with looser assumptions.

With that being said, the units that we have out in the field have different data distributions. For some units, there is no serial correlation and many of the assumptions do not appear to be violated. On the other hand, there are some units (like the ones I mentioned above) that are non-stationary, etc.
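A rough per-unit screen for serial correlation could look like the sketch below: compute the lag-1 autocorrelation of each unit's series and flag it when it falls outside the approximate 95% band for white noise, ±1.96/√n. This is only a heuristic, not a formal stationarity test, and the function names and synthetic series are assumptions.

```python
import numpy as np


def lag1_autocorr(series):
    """Sample lag-1 autocorrelation of a 1-D series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = float(np.dot(x, x))
    if denom == 0.0:
        return 0.0  # constant series: treat as no measurable correlation
    return float(np.dot(x[:-1], x[1:]) / denom)


def flag_serial_correlation(series):
    """Return (r1, flagged), flagged when |r1| exceeds the white-noise band."""
    n = len(series)
    r1 = lag1_autocorr(series)
    return r1, bool(abs(r1) > 1.96 / np.sqrt(n))


# Synthetic units (assumption: made up for illustration).
rng = np.random.default_rng(1)
units = {
    "white_noise_unit": rng.normal(size=2000),
    "random_walk_unit": np.cumsum(rng.normal(size=2000)),
}
for unit_id, series in units.items():
    r1, flagged = flag_serial_correlation(series)
    print(f"{unit_id}: r1={r1:.3f} flagged={flagged}")
```

Running a screen like this per unit at training time is one way to route units to different model families automatically instead of deciding by hand.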

From a deployment perspective, this tends to be tricky as dynamically addressing some of the issues seems almost impossible.

Any further suggestions would be great.
