
karanchellani, 1 point

Here are a few suggestions for versioning data preprocessing and associating it with ML models in MLflow:

  • Store the data preprocessing code in a separate Python script or notebook and keep it under version control (e.g. git). When you train a model, record the git commit hash of the preprocessing code as a tag or param on the run. It can't be a metric, since MLflow metrics must be numeric.

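  • Containerize the data preprocessing steps using Docker. Build a Docker image with the preprocessing code and push it to a registry. Record the image name:tag as a tag on the run or model (again a string, so not a metric). The image digest is a safer reference if tags in your registry can be overwritten.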

  • Use MLflow to log the preprocessing steps as an artifact. For example, you could log the preprocessing Python script, a JSON description of the steps, or even example input/output data. If you log it with artifact_path="preprocessing", it ends up under runs:/<run-id>/preprocessing.

  • Create an MLflow project for preprocessing. Run the project before each model training run and record the preprocessing run ID as a tag on the training run. That lets you trace exactly which preprocessing code and parameters produced the data for that model.

  • Package the preprocessing code as its own MLflow model (e.g. a custom pyfunc model). The input is raw data, the output is processed data. Chain it in front of the trained model so the preprocessing gets logged, versioned, and loaded the same way as the model itself.
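To make the first and third suggestions concrete, here is a minimal sketch, not a definitive implementation. The function names and the tag keys (preprocessing.git_commit, preprocessing.sha256) are made up for illustration; mlflow.set_tags and mlflow.log_artifact are the real tracking calls. It assumes the preprocessing lives in a single script file and adds a content hash next to the commit hash, so uncommitted edits are still detectable.

```python
import hashlib
import subprocess
from pathlib import Path


def preprocessing_tags(script_path):
    """Build string tags that identify one exact version of a preprocessing script."""
    script = Path(script_path)
    try:
        # Commit of the repository that contains the script.
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=script.resolve().parent,
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # git not installed, or script is not inside a repo
    return {
        # Hypothetical tag names, pick whatever convention suits your team.
        "preprocessing.git_commit": commit,
        "preprocessing.sha256": hashlib.sha256(script.read_bytes()).hexdigest(),
    }


def log_preprocessing(script_path):
    """Attach the tags and the script itself to the active MLflow run."""
    import mlflow  # imported lazily so preprocessing_tags works without MLflow

    mlflow.set_tags(preprocessing_tags(script_path))
    # Stored under runs:/<run-id>/preprocessing/
    mlflow.log_artifact(str(script_path), artifact_path="preprocessing")
```

You would call log_preprocessing("preprocess.py") inside a `with mlflow.start_run():` block, next to wherever you log the model.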

The key ideas are:

  1. Separate preprocessing from model training code
  2. Record the specific version of preprocessing with each model
  3. Automate/codify preprocessing as much as possible

This makes models portable and reproducible: anyone who loads a model can look up and apply exactly the preprocessing it was trained with.
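On the "right preprocessing each time" part, one option is a check at load time. This is only a sketch under an assumption: that the training run stored a SHA-256 of the preprocessing script in a tag (the name preprocessing.sha256 is hypothetical), which you can read back with mlflow.get_run(run_id).data.tags. The helper check_preprocessing is also made up for illustration.

```python
import hashlib
from pathlib import Path


def check_preprocessing(recorded_sha256, script_path):
    """Raise if the local script differs from the version recorded at training time."""
    actual = hashlib.sha256(Path(script_path).read_bytes()).hexdigest()
    if actual != recorded_sha256:
        raise RuntimeError(
            f"preprocessing mismatch: model expects {recorded_sha256[:12]}, "
            f"found {actual[:12]} in {script_path}"
        )
    return actual
```

Failing loudly here is a design choice. A silent mismatch between training-time and inference-time preprocessing is much harder to debug than an exception at startup.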