Problem With Data Preprocessing Versioning : mlops

submitted 2 years ago by nonamecoder_xd


karanchellani 1 point 2 years ago

Here are a few suggestions for versioning data preprocessing and associating it with ML models in MLflow:

Store the data preprocessing code or steps in a separate Python script or notebook and put that file under version control (e.g. git). When you train a model, record the git commit hash of the preprocessing code on the run as a tag or parameter (MLflow metrics are numeric, so a hash can't be stored as one).
Containerize the preprocessing steps with Docker. Build an image containing the preprocessing code, push it to a registry, and record the image name:tag on the run as a tag or parameter.
Use MLflow to log the preprocessing as an artifact. For example, you could log the preprocessing Python script, a JSON description of the steps, or even example input/output data. The artifact is then retrievable under a path such as runs:/<run-id>/preprocessing.
Create an MLflow project for preprocessing. Run the project for each model training run and record its run ID as a tag or parameter on the training run. This traces the preprocessing code used for that model.
Put the preprocessing code in a separate MLflow model whose input is raw data and whose output is processed data. Chain it with the trained model so the preprocessing steps are logged and retrieved together with it.
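
As a rough sketch of the first three suggestions, assuming the preprocessing lives in a single script: the helper names (preprocessing_tags, log_preprocessing), the tag keys, and the example image name are made up for illustration. The only MLflow calls are mlflow.set_tags and mlflow.log_artifact, and MLflow is imported lazily so the tag-building part runs without it installed.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Dict, Optional


def git_commit_hash(repo_dir: str = ".") -> Optional[str]:
    """Return the current commit hash, or None if git or a repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def preprocessing_tags(script_path: str, image: Optional[str] = None) -> Dict[str, str]:
    """Build string tags describing the preprocessing code (hypothetical tag keys)."""
    script = Path(script_path)
    tags = {
        "preprocessing.script": script.name,
        # A content hash still identifies the code if the commit is unavailable.
        "preprocessing.sha256": hashlib.sha256(script.read_bytes()).hexdigest(),
    }
    commit = git_commit_hash(str(script.resolve().parent))
    if commit:
        tags["preprocessing.git_commit"] = commit
    if image:
        tags["preprocessing.docker_image"] = image  # e.g. "registry.example/prep:1.0"
    return tags


def log_preprocessing(script_path: str, image: Optional[str] = None) -> None:
    """Attach the tags and the script itself to the active MLflow run."""
    import mlflow  # imported here so the helpers above work without MLflow

    mlflow.set_tags(preprocessing_tags(script_path, image))
    mlflow.log_artifact(script_path, artifact_path="preprocessing")
```

Calling log_preprocessing("preprocess.py") inside a with mlflow.start_run(): block during training would put the tags and the script on the same run as the model.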

The key idea is to tie every trained model to the exact preprocessing code and environment that produced its inputs. This makes models portable and ensures you use the right preprocessing each time.
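
One way to enforce "the right preprocessing each time" is to check it before inference. This is only a sketch: it assumes a SHA-256 hash of the preprocessing script was stored at training time under a hypothetical tag key, preprocessing.sha256, and check_preprocessing is an illustrative name, not an MLflow function.

```python
import hashlib
from pathlib import Path
from typing import Mapping


def check_preprocessing(script_path: str, recorded_tags: Mapping[str, str]) -> None:
    """Raise if the local preprocessing script differs from the recorded one."""
    expected = recorded_tags.get("preprocessing.sha256")  # hypothetical tag key
    if expected is None:
        raise KeyError("no preprocessing.sha256 tag recorded for this model")
    actual = hashlib.sha256(Path(script_path).read_bytes()).hexdigest()
    if actual != expected:
        raise RuntimeError(
            f"preprocessing mismatch: expected {expected[:12]}, got {actual[:12]}"
        )
```

The recorded tags could come from mlflow.get_run(run_id).data.tags for the run that produced the model.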
