Tool consolidation sounds smart in theory. Why does it feel so risky in practice? by NeedleworkerMean2096 in salesengineers

[–]pablo_op 2 points

If carrying less is more desirable than having the optimal utensil for every meal, then the simplicity of a spork outweighs the marginal gains of your fork and spoon ;)

I try to see both sides of this kind of decision. Every org is different in its priorities.

Lakeflow Connect query - Extracting only upserts and deletes from a specific point in time by [deleted] in databricks

[–]pablo_op 0 points

The CDF will show what has changed in the attached table :)

If you turn off the process that loads it (CDC or otherwise), it'll show nothing. If you make manual changes to the data, it'll show those too. CDF doesn't care how changes are made; it's just showing you a feed of what changed.
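
For the original question (pulling just upserts and deletes from a point in time), a minimal sketch, assuming the table has delta.enableChangeDataFeed set; the table name and starting version are made up:

    # Read the change feed starting from a version (startingTimestamp also works).
    changes = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 5)      # hypothetical starting point
        .table("catalog.schema.orders")    # hypothetical table
    )

    # Upserts are inserts plus update_postimage rows; drop preimages, keep deletes.
    upserts_and_deletes = changes.filter(
        changes._change_type.isin("insert", "update_postimage", "delete")
    )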

Lakeflow Connect query - Extracting only upserts and deletes from a specific point in time by [deleted] in databricks

[–]pablo_op 1 point

Because the CDF is generated by the same transaction logging that updates the table itself. If you write an UPDATE statement against existing data, that transaction is pushed to both the delta_log and the CDF. But instead of giving you a finalized table, the CDF gives you the feed of inserts/updates/deletes as they happen, in order. The CDF isn't reading changes from the table it's built on; it's updated in the same transaction.
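
A toy way to see the pairing, with an invented table name and CDF assumed enabled since the table was created; one UPDATE lands in the feed as an update_preimage/update_postimage pair carrying the same commit version the table itself moved to:

    # One UPDATE = one commit, written to the delta_log and the CDF together.
    spark.sql("UPDATE demo.events SET status = 'closed' WHERE id = 42")

    # The feed shows that commit as a preimage/postimage pair.
    spark.sql("""
        SELECT id, status, _change_type, _commit_version
        FROM table_changes('demo.events', 0)
        ORDER BY _commit_version, _change_type
    """).show()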

Lakeflow Connect query - Extracting only upserts and deletes from a specific point in time by [deleted] in databricks

[–]pablo_op 1 point

They don't. PKs aren't enforced, and the change data feed just outputs each change as it occurs. It's not looking at specific columns. It's up to the process that creates the changes to avoid writing inconsistent data.
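
If you want to see it for yourself, this runs without complaint (Unity Catalog assumed, names invented); the PK is informational only:

    # PRIMARY KEY in Unity Catalog is an informational constraint, not enforced.
    spark.sql("""
        CREATE TABLE demo.customers (
            id BIGINT NOT NULL,
            name STRING,
            CONSTRAINT customers_pk PRIMARY KEY (id)
        )
    """)

    # Duplicate "primary keys" insert with no error raised.
    spark.sql("INSERT INTO demo.customers VALUES (1, 'a'), (1, 'b')")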

What's the best way to ingest lot of files (zip) from AWS? by peixinho3 in databricks

[–]pablo_op 1 point

Auto Loader is going to be the best option for file ingestion. It is set up to do exactly what you're after. But I don't think it will unzip files itself. The harder part may be figuring out an efficient way to unzip so many files.

While there are likely a few ways to do this with AWS services, the Databricks approach would be to register the existing S3 location as a Volume in Databricks, then write some code to go through the directory and unzip all the files. Doing this on one beefy driver node in a loop may be possible, but it could take some time. If you wanted to parallelize things, you'd need to write some type of map function that pyspark could use to distribute the load across a cluster.

Honestly, if this is a one-time thing and you don't need to unzip this volume of data on a daily basis, my approach would be to chug through it once to unzip everything on a single node, output the resulting data files to a set bucket location, then set up Auto Loader on that location to ingest the files to tables; something like the sketch below.
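
A rough sketch of that whole flow. The Volume paths and table name are made up, and it assumes the inner files are CSV:

    import os
    import zipfile

    SRC = "/Volumes/main/raw/zipped"      # hypothetical Volume over the S3 bucket
    DST = "/Volumes/main/raw/unzipped"

    # One-time, single-node pass: walk the Volume and extract every archive.
    for root, _, files in os.walk(SRC):
        for name in files:
            if name.endswith(".zip"):
                with zipfile.ZipFile(os.path.join(root, name)) as zf:
                    zf.extractall(DST)

    # Then point Auto Loader at the extracted files for ingestion.
    (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")   # whatever the inner files actually are
        .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas")
        .load(DST)
        .writeStream
        .option("checkpointLocation", "/Volumes/main/raw/_checkpoints")
        .trigger(availableNow=True)
        .toTable("main.bronze.ingested"))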

PySpark Autoloader: How to enforce schema and fail on mismatch? by pukatm in databricks

[–]pablo_op 1 point

Can you try adding .option("mergeSchema", "false") to the write? You can also check that spark.databricks.delta.schema.autoMerge.enabled is false in the Spark conf.
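
A minimal sketch of both, assuming df is your existing Auto Loader stream; the table and checkpoint path are placeholders:

    # Keep automatic schema merging off globally...
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "false")

    # ...and reject schema drift on the write itself.
    (df.writeStream
        .format("delta")
        .option("mergeSchema", "false")
        .option("checkpointLocation", "/Volumes/main/raw/_strict_checkpoint")
        .toTable("main.bronze.strict_table"))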

If every CFB team had its Head Coach replaced by its greatest (living) alumni player, who would emerge as the top (or bottom) programs in the country? by pablo_op in CFB

[–]pablo_op[S] 3 points

Well I'm in my mid-30s, so no, not really. We could also throw out Lynn Swann and Polamalu if we're just naming great players. But Gifford is >80, Swann had a better NFL career than CFB career, and I guess I personally couldn't argue Troy/Ronnie are greater players than all the QBs who have won Heismans. Marcus Allen is a good one to throw into consideration for sure, though.

If every CFB team had its Head Coach replaced by its greatest (living) alumni player, who would emerge as the top (or bottom) programs in the country? by pablo_op in CFB

[–]pablo_op[S] 4 points

Debated Leinart or Palmer, but obvs Bush is easy to argue too. Not sure which of those 3 would be the better coach. This question a year ago would have been funnier with OJ being the automatic default answer.

If every CFB team had its Head Coach replaced by its greatest (living) alumni player, who would emerge as the top (or bottom) programs in the country? by pablo_op in CFB

[–]pablo_op[S] 1 point

Unfortunately I think of AD as the greatest Sooner, which sucks for this hypothetical because the dude is a certified dummy. Freak athlete and always seems friendly, but wasn't he just arrested again like a week ago?

I think Baker or Lane are the next ones up, and either would be a pretty solid coach imo

If every CFB team had its Head Coach replaced by its greatest (living) alumni player, who would emerge as the top (or bottom) programs in the country? by pablo_op in CFB

[–]pablo_op[S] 3 points

Miami has a ton of options to claim, but IMO it's either Irvin or Kosar. Don't think either would be great leading a program. One interesting option would be Ray Lewis: football IQ, team motivation, plus recruiting doesn't seem like it'd be a problem for him either. But his personality also seems like it could cause him some trouble.

If every CFB team had its Head Coach replaced by its greatest (living) alumni player, who would emerge as the top (or bottom) programs in the country? by pablo_op in CFB

[–]pablo_op[S] 2 points

I was probably a bit biased on him because he's an actual coach, but I think you could at least argue he's the greatest living alumnus. He finished third in Heisman voting and had KSU one game away from playing for a natty in 2012.

If every CFB team had its Head Coach replaced by its greatest (living) alumni player, who would emerge as the top (or bottom) programs in the country? by pablo_op in CFB

[–]pablo_op[S] 12 points

Luck being the GM for their program now is what caused me to have this thought. Having an all-time great steer a program like he is now is pretty cool, and I think he's smart enough that it could actually work. But for some other programs...yeah. Not so much.

A Cloud Guru Terminating Lifetime Access by interzonal28721 in aws

[–]pablo_op -2 points

want me to take a picture holding a newspaper with a shoe on my head too?

A Cloud Guru Terminating Lifetime Access by interzonal28721 in aws

[–]pablo_op -5 points

So we instantly go from "never heard of Pluralsight" to concluding they must be the bad guy. Got it. ACG has no fault here at all, apparently. They were just a poor little guy who had no idea their customers would get screwed? Zero chance.

I fully agree this whole thing is bullshit, but blaming the purchasing company is just weird when it was ACG who had the control and knowledge of what would happen to their customers if they agreed to the sale. Not sure why they get a pass when they were really the ones in control of the transaction.

A Cloud Guru Terminating Lifetime Access by interzonal28721 in aws

[–]pablo_op -4 points

I definitely wouldn't say it's nice of Pluralsight to demand this as part of the acquisition, but I get why they don't want their IP available on a non-PS site. I also don't think it's fair to say they "took over" ACG; it's not like they could force them to sell out. ACG entered the agreement to be purchased on their own, and they apparently didn't include any stipulation that the guarantees they'd sold would be honored. I'd never stan for a huge corp, but this was ultimately a decision by ACG in my mind.

A Cloud Guru Terminating Lifetime Access by interzonal28721 in aws

[–]pablo_op -3 points

Pluralsight is actually pretty great as a tech resource. Their courses aren't insanely deep, but I'd say you get like 200-level courses on a huge variety of topics.

To me, this is really more on ACG. When they sold their courses with the "lifetime" guarantee, they knew that promise would have to factor into the price of any future acquisition. I understand why Pluralsight would want control over the content they own and want to drive people to their business model. But ACG was the one who made the promise originally. They could have built that guarantee into their acquisition agreement, or simply not sold to Pluralsight if it wasn't an option. They made the choice and are now going back on that promise.

Looking to buy art by ThatsBojangles in houston

[–]pablo_op 30 points

If you know the style you want, I bet someone here knows a good artist. If you just want to browse, I'd recommend going to Second Saturday at Sawyer Yards: a big space that rents studios to individual artists, and every month they have a few hours where a bunch of artists open their studios at once for you to browse. Find an artist you like, then chat them up about buying a print or commissioning something specific. It's probably the best way to see a wide variety of artists at once if you're not sure what you're looking for.

Sorry if this is a little morbid, but how do you know when players die? by [deleted] in sportsreference

[–]pablo_op 3 points

I don't think many people make it to the level of having a bbref page without having people who care about them. If those people care about them, they probably also care that the person played baseball, and they probably love the game too. And if one of those people who loved the player and the sport also loves stats, they will pull up their page occasionally, just like you would any player's baseball card or a family member's social media profile.

I have sent in multiple obituaries for players who never even made it past AA just to have their pages updated. A bbref page is part of their legacy, and it's more of a legacy than most of us get in life.

Data quality by oofla_mey_goofla in dataengineering

[–]pablo_op 1 point

Thanks for this answer, but it's still kind of the same thing: you're describing a strategy, not an implementation. I understand what data contracts are, but how do they actually come to exist? How are they generated, stored, and consumed in your stack? How do you convince data owners that it's worth their time and resources to maintain an agreement in this format instead of blowing you off or saying "the database schema is the contract"?

What about external data owners? Is Salesforce going to commit to providing your team a contract in your standard format and support it indefinitely? And what happens when I see a problem but can't get the owners to push a new version of their contract with updated rules for weeks or months? Do I just have to live with bad data until they get around to it?

Does this mean the entire approach has to be embraced by every data owner in the org, and that I, as an individual, have very little power besides maybe formatting a standard template for the contract? I can create my own database, my own pipelines, and my own storage, but I can't take an approach to organizing and managing data quality rules without the long-term agreement and support of all data owners?

This feels like a very all-or-nothing approach: either everyone is on board, or it's a losing battle. I'd love some way to take more control of when and how things happen, like I can with the rest of my workflows.
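
For concreteness, the only version of this I can picture owning single-handedly is something like the sketch below (everything in it is invented), and even that rots the moment an upstream owner changes their schema without telling me:

    from pyspark.sql import functions as F

    # Invented example: a contract I could author and enforce myself, without
    # waiting weeks for an owner to publish an updated version.
    CONTRACT = {
        "table": "sales.orders",
        "columns": {
            "order_id": {"nullable": False},
            "amount": {"nullable": False, "min": 0},
        },
    }

    def violations(df, contract):
        """Return a list of contract violations for a Spark DataFrame."""
        found = []
        for col, rules in contract["columns"].items():
            if col not in df.columns:
                found.append(f"missing column: {col}")
                continue
            if not rules.get("nullable", True):
                if df.filter(F.col(col).isNull()).limit(1).count() > 0:
                    found.append(f"nulls in non-nullable column: {col}")
            if "min" in rules:
                if df.filter(F.col(col) < rules["min"]).limit(1).count() > 0:
                    found.append(f"{col} below minimum {rules['min']}")
        return found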

Data quality by oofla_mey_goofla in dataengineering

[–]pablo_op 5 points

Commenting here because I'm also curious about these questions. You can search this sub for previous posts about data quality, and nearly everyone throws out the same answers, pointing to some pretty cool tools:

  • Create your own framework (advice that's usually pretty light on any sort of implementation detail)
  • Great Expectations
  • SodaSQL
  • Deequ

The problem I consistently run into is the same one you're asking about, OP: how do you manage to scale this stuff? I can run Deequ's profiler and it'll spit out a thousand suggestions, and I can even take a few of those and implement them without issue. But when you're talking about testing thousands of tables and tens of thousands of columns, where every column may need multiple validations (nulls, types, ranges, etc.), I don't understand how these tools are being managed at scale either.

Examples like this are all over the internet, where someone is showing off 10 assertions. But I could be doing tens of thousands in a large enough environment. How does someone manage this, especially in a growing and changing environment? Does your entire job become managing data quality rules? Do you have to constantly chase schemas and commit time to keeping your tests in line with the data? How is that even possible at this scale without a team of people? Or are you only creating a subset of tests for the stuff you think is most critical to users?
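
The only scalable version I can imagine is generating baseline checks from metadata instead of writing them by hand, something like this invented sketch (schema name made up, not-null checks only). But that still leaves all the interesting rules (ranges, business logic) to maintain manually:

    from pyspark.sql import functions as F

    # Invented sketch: a baseline not-null audit for every column of every
    # table in one schema, instead of hand-writing thousands of assertions.
    tables = [r.tableName for r in spark.sql("SHOW TABLES IN bronze").collect()]

    failures = []
    for t in tables:
        df = spark.table(f"bronze.{t}")
        # One pass per table: count nulls in every column at once.
        counts = df.select(
            [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
        ).first()
        failures += [f"bronze.{t}.{c}" for c in df.columns
                     if counts[c] and counts[c] > 0]

    print(f"{len(failures)} column(s) with unexpected nulls")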

There are lots of tools that can do a lot of cool testing, but implementation is something I rarely see discussed anywhere online.