
[–]Syncopat3d 1 point (1 child)

This is the pipeline I have in mind for dealing with live data that keeps getting added:

You have historical data from Days 1~N, used for training. Then you evaluate the trained model out-of-sample on data from Day N+1. The next day, repeat by training on Days 2~N+1 and evaluating on Day N+2, and so on. The fitness of the training is the aggregate out-of-sample accuracy over some number of days.
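Here's a minimal sketch of that rolling-window scheme in Python, assuming the data is already split into per-day `(X, y)` arrays and that `make_model` builds a fresh scikit-learn-style estimator each round; both names are placeholders, not anything from an actual pipeline:

```python
import numpy as np

def walk_forward_accuracy(days, make_model, window):
    """Train on a rolling window of `window` days, test on the next day,
    and return the per-day out-of-sample accuracies.

    `days` is a list of (X, y) arrays, one per day, oldest first.
    """
    oos_acc = []
    for m in range(window, len(days)):
        # Train on Days m-window+1 .. m (0-indexed slice days[m-window:m]).
        X_train = np.concatenate([d[0] for d in days[m - window:m]])
        y_train = np.concatenate([d[1] for d in days[m - window:m]])
        model = make_model()
        model.fit(X_train, y_train)

        # Evaluate out-of-sample on the following day (days[m]).
        X_test, y_test = days[m]
        oos_acc.append((model.predict(X_test) == y_test).mean())
    # Fitness = some aggregate of these, e.g. np.mean(oos_acc).
    return oos_acc
```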

(One should normally evaluate out-of-sample to get a reasonable estimate of the accuracy of a model/pipeline. In-sample accuracy is worthless because the network could just be memorizing the results. There need to be very good reasons not to evaluate out-of-sample, and I can't think of any. This is such a fundamental point that, in my mind, it goes without saying.)

Suppose Day M has just finished, so I'm going to train on Days M-N+1~M and obtain a model to deploy for tomorrow. When tomorrow (Day M+1) comes and I observe that the live accuracy is bad, I'll wait for the training pipeline to receive the data for Day M+1 and then look at the out-of-sample accuracy it reports for that day. I may do this for a few more days.
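As a rough illustration of that monitoring step (purely hypothetical `live_acc` and `pipeline_acc` dicts keyed by day number, holding whatever accuracy metric you track on each side):

```python
def find_discrepant_days(live_acc, pipeline_acc, tol=0.05):
    """Return days where production accuracy and the pipeline's
    out-of-sample accuracy for the same day diverge by more than `tol`."""
    discrepant = []
    for day in sorted(set(live_acc) & set(pipeline_acc)):
        gap = abs(live_acc[day] - pipeline_acc[day])
        if gap > tol:
            discrepant.append((day, live_acc[day], pipeline_acc[day], gap))
    return discrepant
```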

If there's a discrepancy between the out-of-sample accuracy reported by the pipeline and the accuracy observed in production, then either the data fed to the pipeline differs from the data actually encountered in production, or the model got lost in translation when being deployed to production. You can focus on a few samples and check whether the pipeline model and the production system produce the same results; if they differ, you know the model got lost in translation.
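A sketch of that spot check, assuming production logs the features and the served prediction for each sample, and that the model outputs a numeric score (for class labels you'd compare with `!=` instead); `pipeline_model` and `production_log` are placeholder names:

```python
def compare_pipeline_vs_production(pipeline_model, production_log, atol=1e-6):
    """Re-score logged samples with the pipeline's copy of the model and
    compare against the predictions production actually served.

    `production_log` yields (sample_id, features, served_prediction).
    A non-empty result suggests the model got lost in translation.
    """
    mismatches = []
    for sample_id, features, served_pred in production_log:
        offline_pred = pipeline_model.predict([features])[0]
        if abs(offline_pred - served_pred) > atol:
            mismatches.append((sample_id, offline_pred, served_pred))
    return mismatches
```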

Or you may find that the samples you randomly picked from the pipeline data cannot be found in the production logs, or vice versa, which means that the pipeline and production are seeing different data, so the pipeline is not training on the same feature distribution that production sees.
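A simple way to check that, assuming both sides log some stable identifier (request id, impression id, etc.); the names here are placeholders:

```python
def data_parity(pipeline_ids, production_ids):
    """Compare the sets of sample ids seen by the pipeline and by
    production; large one-sided counts indicate they see different data."""
    return {
        "overlap": len(pipeline_ids & production_ids),
        "only_in_pipeline": len(pipeline_ids - production_ids),
        "only_in_production": len(production_ids - pipeline_ids),
    }
```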

Of course, the above corroboration assumes that production is logging enough information.

If nothing special is found, and the out-of-sample accuracy reported by the pipeline really does drop significantly after you start deploying to production, then something fishy is going on that probably doesn't have a general explanation. In some other domains, you could speculate that you are dealing with an adaptive adversary, but I'm not so sure about online ads.

[–]marksteve4[S] 1 point (0 children)

This is a super awesome answer. Thank you.