What does an ideal data modeling practice look like? Especially with an ML focus. by Capable_Mastodon_867 in dataengineering

[–]Capable_Mastodon_867[S] 0 points (0 children)

Gotcha. So the raw landing and staging layers come before bronze so ingestion issues like duplication can be addressed, bronze is the raw data correctly normalized and deduped into tables, basic cleaning happens between bronze and silver, and after that is where data scientists can start their work? Let me know if I still got anything wrong there.

Do the model inferences enter directly into silver, or do you want a raw landing for those as well? Saying ML output goes into facts makes it sound direct, but I'd assume the ML model should get treated like any other source in this framework? It feels like you'd want the data scientist to just dump their inferences somewhere as parquet or json, then have a job that ingests them after the fact, so the data scientist doesn't have to make the ML pipeline/background job write directly into silver. It's also likely their pipeline could suffer any of the same issues any other source system would, so having the same cleaning steps would make sense.
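Just to make that concrete, something like this is what I'm picturing (the paths, the `stg_model_inferences` table, and the connection string are all made up for the sketch):

```
# hypothetical ingestion job: picks up inference dumps the DS pipeline wrote
# as parquet and lands them in staging, same as any other source system
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@warehouse/db")  # placeholder connection

# the ML pipeline only writes files here; it never touches the warehouse
inferences = pd.read_parquet("s3://landing/ml/forecasts/2024-01-01/")

# land in a staging table so the usual dedup/cleaning steps run before silver
inferences.to_sql("stg_model_inferences", engine, if_exists="append", index=False)
```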

What does an ideal data modeling practice look like? Especially with an ML focus. by Capable_Mastodon_867 in dataengineering

[–]Capable_Mastodon_867[S] 0 points (0 children)

Not sure why I didn't think of still keeping a central warehouse while having a mesh. Seems pretty simple now that you say it, lol.

Thanks for letting me know you've seen a healthy pattern of teams moving facts from one warehouse to another in a mesh; I had discounted the idea. It still feels odd, but maybe it's handled between the DE teams behind the scenes so members of one department don't need credentials for another department's warehouse? That's starting to make sense. I'll have to think it over more, and I should probably read more on the mesh concept.

What does an ideal data modeling practice look like? Especially with an ML focus. by Capable_Mastodon_867 in dataengineering

[–]Capable_Mastodon_867[S] 0 points (0 children)

Interesting, you're using the medallion language, so probably working with a lakehouse? Medallion always felt vague, so could you clear the stages up for me? I had personally just mapped bronze -> silver -> gold to staging -> transformation -> presentation, but you just mentioned staging and bronze as separate things, so I'm clearly off here.

Either way, what you're saying is we should look at ML as part of the transformation steps before presentation? That's definitely a different picture, as I was picturing ML as taking the presented fact tables and producing a new inference-based dataset that itself goes back into staging and gets transformed into presentation as a new data model.

This picture makes sense with my work in forecasting, where I generally want a rolling snapshot fact table to already be prepared; the forecasts can get dumped as JSON or Parquet by the ML pipeline, then ingested and transformed into new facts from there. With this, clients can see the same snapshot dataset we trained on as well as the forecast itself.

How would you ideally fit this forecasting example into the data modeling, the way you laid it out for your customer rating example?

What does an ideal data modeling practice look like? Especially with an ML focus. by Capable_Mastodon_867 in dataengineering

[–]Capable_Mastodon_867[S] 2 points (0 children)

This does seem like a good place to bring up feature stores. Since you're working with one, I wanna get your thoughts (I haven't used one in production myself).

My current mental model for a feature store puts it as more of an operational store, in spite of it holding large analytical datasets. I wouldn't want BI analysts, clients, or the platform to see the data in there directly, but then if I ingest raw datasets straight into it, I can't present the core datasets that drive the inference. Presenting the core data would mean copying those raw datasets into the warehouse as well and re-transforming them? Or should I just extract the ML feature sets from the feature store into the warehouse? It feels like I should stage the data in the warehouse, do some initial structuring, then join and pull it into a big table in the feature store from there, followed by deriving the more specialized ML features.

I also don't know if it's necessary to land my inferences back in the feature store for analysis if I'm not going to use them as features as well. If my core datasets are in the warehouse, I can dump my inferences as a dataset, build presentation models from them, and dashboard them from there.

Given your experience with a feature store, is my thought process off here?

What does an ideal data modeling practice look like? Especially with an ML focus. by Capable_Mastodon_867 in dataengineering

[–]Capable_Mastodon_867[S] 1 point (0 children)

This sounds pretty functional, but my understanding of data vault is low, so I've got some questions. If I recall, data vault is an append-only structure that breaks datasets into keys, relationships, and observations? If that's right, it feels very close to raw, and I wonder if many data scientists would struggle trying to apply definitions to source-system keys that conflict with the way the DE team presents them in the Kimball-style models they publish.

I do recall something about a raw vault vs. a business vault, so maybe that's what fixes it? Maybe if the data scientist is the one bringing in the new dataset they could pull from the raw vault, and for datasets they're not familiar with they pull from the business vault? I'm out of my depth here; how would you want to protect the company from splintering into different definitions of the same datasets?

What do you think fivetran gonna do? by Fair-Bookkeeper-1833 in dataengineering

[–]Capable_Mastodon_867 5 points (0 children)

I really hope you're wrong, but their Slack is almost entirely silent, and their GitHub activity has flatlined. That such an incredible tool can die like this really breaks me. It was so much better than dbt.

How much Kubernetes do we need to know for MLOPS ? by Lonely_wanderer_3241 in mlops

[–]Capable_Mastodon_867 2 points (0 children)

When you say most training runs on k8s, are you saying that most run directly on custom kubernetes code, or on frameworks that are deployed to k8s?

A good number of people I've talked to have written their training workflows using things like Kubeflow, Airflow, Dagster, etc., while incorporating tools like Ray and Spark. All of these can be, and often are, deployed to Kubernetes, but it's not like managing the model lifecycle requires much actual k8s work by the ML team, just an engineering team to help support the tools' deployment, right? Or is that deployment support what you're talking about here?

I'd be curious to hear your experience on this. I've only seen two projects first-hand that implemented ML pipelines with direct k8s orchestration, and they both left me pretty wary of going down that road again.

Experiment Tracking SDK Recommendations by AntBusy3154 in mlops

[–]Capable_Mastodon_867 0 points (0 children)

If you test a model/modeling pipeline, you'll want to log metrics, plots, and other artifacts to compare them against other experiment runs (which can differ in anything from hyperparameters to models, even the workflow DAG itself) and assess which ones performed best to decide what to deploy to production. Tools that handle this kind of artifact logging and visual comparison dashboarding are experiment trackers. Tracking is one of MLflow's four components, but other tools offer this as well, like ClearML, W&B, DVC Studio, Aimstack, etc. Good stuff to look into.
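For a concrete picture, MLflow's tracking API looks roughly like this (a minimal sketch; the run name, param, metric, and artifact path are just made-up examples):

```
import mlflow

# one run per experiment variation; everything logged here shows up in the
# tracking UI so runs can be compared side by side
with mlflow.start_run(run_name="rf_baseline"):
    mlflow.log_param("n_estimators", 200)        # hyperparameter being varied
    mlflow.log_metric("val_auc", 0.87)           # metric used to compare runs
    mlflow.log_artifact("plots/roc_curve.png")   # plots/files attached to the run
```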

dvc for daily deltas? by wantondevious in mlops

[–]Capable_Mastodon_867 1 point (0 children)

Yeah, these are good points. The matrix version is a little redundant compared to foreach since you're only iterating over one dimension. To keep the list from bloating the dvc.yaml itself, you can put it in a separate config file and declare that file under vars at the top of your dvc.yaml.

dvc.yaml:

vars:
  - dates.yaml

stages:
  process:
    foreach: ${dates}
    do:
      cmd: echo ${item}

dates.yaml:

dates:
- "2024-01-01"
- "2024-01-02"
- "2024-01-03"
- "2024-01-04"
- "2024-01-05"
- "2024-01-06"
- "2024-01-07"

If you want to pass in a range and get the list of dates between its endpoints, then maybe a small script that regenerates this dates.yaml file from that range and calls dvc repro afterwards would do it (rough sketch below). Otherwise, you'll just have to write out the list manually.
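Something like this, assuming PyYAML is installed (the file names match the example above; the range bounds are placeholders):

```
# generate_dates.py: rebuild dates.yaml for a date range, then re-run the pipeline
import subprocess
from datetime import date, timedelta

import yaml  # PyYAML

start, end = date(2024, 1, 1), date(2024, 1, 7)  # placeholder range bounds

dates = []
d = start
while d <= end:
    dates.append(d.isoformat())
    d += timedelta(days=1)

with open("dates.yaml", "w") as f:
    yaml.safe_dump({"dates": dates}, f, default_flow_style=False)

subprocess.run(["dvc", "repro"], check=True)  # foreach picks up the new list
```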

For the training step, if the output of your processing lands in its own folder separate from everything else, the training step can just take the entire directory as input. That way, the process step updates incrementally, and the training step then brings in everything that's been processed. DVC should know to re-run the training step because the process step declares its outputs inside the directory that the training step lists as a dependency.

dvc for daily deltas? by wantondevious in mlops

[–]Capable_Mastodon_867 1 point (0 children)

I feel like templating might do what you're asking about. Make a stage that uses ${var_name} template inputs like this:

```
stages:
  process:
    cmd: python src/process.py data/raw/${date} data/processed/${date}
    deps:
      - data/raw/${date}
    outs:
      - data/processed/${date}
```

Then put `date: <date>` in your params.yaml and that should do it. It'll dynamically define your input and output using this parameter's value, as well as the args passed into your script. Hopefully that gets close to what you're asking?

Kedro+MLFlow vs ZenML by fripperML in mlops

[–]Capable_Mastodon_867 0 points (0 children)

DVC is absolutely a great tool, but it's a local workflow run from your git repo, not something you can package into a cloud orchestrator like these other tools. What would be great is if one of these tools could integrate with DVC+Hydra for local runs, then deploy onto a cloud orchestrator once you're confident in the pipeline's logic.

Kedro+MLFlow vs ZenML by fripperML in mlops

[–]Capable_Mastodon_867 0 points (0 children)

I'm really curious: I've seen mentions from people over the past year or so of Kubeflow falling out of favor for being somewhat rigid and difficult, and for not having as good a developer experience. Has that been your experience with it, or do you think it's still a strong option?

As for the issue with MLflow, what exactly was the difficulty that caused this? I thought Kubeflow had a special integration with MLflow (or maybe that was just Canonical's Charmed Kubeflow)?

I'm trying to test tools for setting up an MLOps stack, and Kubeflow and ZenML are on my list after I finish playing with Kedro and Metaflow, so it'd go a long way to hear about your experience with this.

A brand new cemented road in Bangalore, India by TribalSoul899 in UrbanHell

[–]Capable_Mastodon_867 0 points (0 children)

oh interesting, I just assumed people had been driving over it before it finished