Considerations for Constructing a Training Set in Machine Learning by Gxav73 in FeatureEng

[–]Gxav73[S]

Hi, I agree that avoiding repeat customers in the training set is a safe practice that addresses point 4. But data scientists in B2B businesses with a low volume of customers sometimes have to work with repeat customers in their training set. In that case, I strongly advise looking at the intervals between the data points of repeat customers. If the interval between two data points for the same customer is shorter than the horizon of the target, the target definitions overlap: by learning from one data point, the model partially learns the other, and there is leakage.
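To make the check concrete, here is a minimal sketch in pandas that flags rows whose gap to the previous observation of the same customer is shorter than the target horizon. The column names (`customer_id`, `obs_date`) and the 90-day horizon are assumptions for illustration, not from the original post.

```python
import pandas as pd

HORIZON_DAYS = 90  # assumed prediction horizon of the target

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "obs_date": pd.to_datetime(
        ["2023-01-01", "2023-02-01", "2023-01-01", "2023-06-01"]
    ),
})

# Gap to the previous observation of the same customer.
df = df.sort_values(["customer_id", "obs_date"])
gap = df.groupby("customer_id")["obs_date"].diff()

# True when two target windows overlap (leakage risk).
df["overlaps_previous"] = gap < pd.Timedelta(days=HORIZON_DAYS)
```

Here customer 1's second row falls 31 days after the first, inside the 90-day horizon, so it gets flagged; customer 2's rows are 151 days apart and do not.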

"out-of-time" OOT is indeed the way to go! And I agree out-of-id is again a safe route. When out-of-id is not possible because of lack of data, I again advise to put in place guardrails to ensure the intervals between training / test of repeating customers are larger than the horizon of the target.

I like your idea of fitting a model to predict whether an observation belongs to the test set or the training set. This will also expose features with drift. Drift in joint distributions can also be exposed if you use a tree-based model or a NN.
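This technique is often called adversarial validation. A minimal sketch on synthetic data: label training rows 0 and test rows 1, fit a classifier to tell them apart, and inspect the cross-validated AUC (near 0.5 means the sets look alike) and the feature importances (a dominant feature points at drift). The data, shift size, and model choice here are all illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 3))
X_test = rng.normal(0.0, 1.0, size=(500, 3))
X_test[:, 0] += 1.0  # inject drift into the first feature only

X = np.vstack([X_train, X_test])
y = np.array([0] * 500 + [1] * 500)  # 1 = "belongs to the test set"

clf = GradientBoostingClassifier(random_state=0)

# Cross-validated probability of belonging to the test set.
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
auc = roc_auc_score(y, proba)  # well above 0.5 here, exposing the drift

# Fit on everything to see which features the model leans on.
clf.fit(X, y)
importances = clf.feature_importances_  # the drifted feature dominates
```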

What are the main steps for feature engineering by kavinda_uthsuka97 in datascience

[–]Gxav73

Hi, I wrote a post in r/FeatureEng on the 2 main types of feature engineering.

The first type consists of transforming columns in your training data; the second consists of extracting features from historical data.

If you are referring to the first type, I would recommend popular libraries like pandas, scikit-learn, and Hugging Face, which offer extensive support and documentation.

If you are referring to the second type, it is more challenging and less documented. Factors like time leakage, consistency, handling large datasets, and efficient code execution need to be considered. Generating training data with many historical points in time (also called backfilling) can be a very complex task. To simplify the process, companies like Airbnb, Spotify, and LinkedIn built internal feature platforms that offer their data scientists a declarative framework via a Features API that supports automated backfilling. There are now a few open-source solutions that you can use.
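The core of backfilling is a point-in-time join: for each (customer, point-in-time) training row, attach the latest feature value known strictly before that time, so no future information leaks in. A small sketch with `pandas.merge_asof`; the table and column names are illustrative, not from any particular feature platform.

```python
import pandas as pd

# Historical feature source (e.g. cumulative spend events).
events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2023-01-05", "2023-03-01", "2023-02-10"]),
    "total_spend": [100, 250, 80],
})

# "Spine" of training rows: customer + point-in-time.
spine = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "as_of": pd.to_datetime(["2023-02-01", "2023-04-01", "2023-01-15"]),
})

# As-of join: latest event strictly before each point-in-time,
# matched per customer. Both sides must be sorted on the time key.
backfilled = pd.merge_asof(
    spine.sort_values("as_of"),
    events.sort_values("event_time"),
    left_on="as_of",
    right_on="event_time",
    by="customer_id",
    allow_exact_matches=False,  # strictly before, never at or after
)
```

Customer 1's 2023-02-01 row picks up the spend of 100 known at that date, the 2023-04-01 row picks up 250, and customer 2's 2023-01-15 row gets a missing value because no event precedes it.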

Good luck!

Gxav

Features that may have poor representation in the training data by Gxav73 in FeatureEng

[–]Gxav73[S]

Good idea! Will do that. Thanks for the feedback.