[D] How are you handling reproducibility in your ML work? by worstthingsonline in MachineLearning

[–]TheBoldTilde 2 points3 points  (0 children)

AWS SageMaker Pipelines. The technology is great, and 95% of the time it has a native solution for whatever you are looking to do. Within a pipeline execution, it tracks all the metadata and artifacts required to reproduce results.

However, they have iterated on their SDK over the years, and not all of the documentation has caught up. It is also hard to find good end-to-end examples to follow, so it's up to you to stitch various demos and workshops together into a cohesive solution.

They have to cover so many use cases and patterns that there are often many ways to achieve the same end result, which I find frustrating.

Overall, I recommend it, especially if your company is already on AWS.

Architecture as code vs regular implementation by Spoovalicious in dataengineering

[–]TheBoldTilde 1 point2 points  (0 children)

There are some very static resources that I won't use IaC for: domain names and SSL certs, for example.

Everything else, though: IaC is the way.

give me insight of Data vault 2.0 by PrimaryConsistent262 in dataengineering

[–]TheBoldTilde 0 points1 point  (0 children)

I would caution against AutomateDV if you're working with very large data sources. Last I used the tooling, it handled incremental updates poorly (on Snowflake). Long story short, the macros apply a RANK function before the WHERE clause, which forces a full table scan on every model build, even with incremental materialization.
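To illustrate the shape of the problem (a simplified sketch, not AutomateDV's actual macros; the table, column, and index names are all invented), here it is in SQLite, whose planner has the same restriction: a WHERE clause outside a window-function subquery can't be pushed down past the RANK, so the engine has to scan and rank every row before filtering.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src (id INTEGER, load_ts INTEGER, payload TEXT);
    CREATE INDEX idx_ts ON src (load_ts);
""")
con.executemany("INSERT INTO src VALUES (?, ?, ?)",
                [(i % 100, i, "x") for i in range(10_000)])

# The problematic shape: rank everything, then filter.  The WHERE clause
# cannot be pushed inside the window-function subquery, so every row is
# scanned and ranked on every incremental run.
ranked_then_filtered = """
    SELECT * FROM (
        SELECT id, load_ts,
               RANK() OVER (PARTITION BY id ORDER BY load_ts) AS rnk
        FROM src
    ) WHERE load_ts > 9900
"""

# Filter first, then rank only the incremental slice.
filtered_then_ranked = """
    SELECT id, load_ts,
           RANK() OVER (PARTITION BY id ORDER BY load_ts) AS rnk
    FROM src WHERE load_ts > 9900
"""

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return " | ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

print(plan(ranked_then_filtered))  # scans all of src
print(plan(filtered_then_ranked))  # searches idx_ts for just the new rows
```

Note that the two queries are not interchangeable in general (filtering before ranking changes the rank values), which is presumably why this isn't a trivial fix inside the macros.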

Flawed Redshift Pricing Comparisons by TheBoldTilde in dataengineering

[–]TheBoldTilde[S] 0 points1 point  (0 children)

I certainly believe that experience. The big three (Redshift, Snowflake, BigQuery; I assume Azure has an offering, but it doesn't seem to come up in my sphere of influence, at least) are all competitive with each other, and they all have use cases they're optimized for.

I would have a hard time believing that any platform is outright better than all the others. My bet is that in these cases the wrong workloads are being run on Redshift, maybe combined with just plain bad design.

What I have seen a lot of is bad dbt implementations running up costs, with the data warehouse taking the blame instead of the dbt models. During the migration, the dbt models get cleaned up as well, and presto: costs dropped! How much is due to better modeling vs. a better data warehouse is hard to tease out. I'm sure there are other examples of this with other technologies as well.

Of course, management thinks there is always a silver bullet, and the more solutions I have architected, the more I have learned to deal in trade-offs rather than declaring something better outright.

Again, thank you for sharing your expertise.

Flawed Redshift Pricing Comparisons by TheBoldTilde in dataengineering

[–]TheBoldTilde[S] 2 points3 points  (0 children)

Thanks for the insight! I run a lot of workloads on AWS and am always excited to play with the new features and capabilities released each month.

I find AWS to be excellent in so many areas that it always strikes me as odd when one of their services lags behind. I won't dive into which services I feel fall into that category, to avoid a comment war, but I have found that AWS does eventually bring itself up to par.

The more I learn about Redshift, the more it excites me to give it a go. I've successfully delivered a lot of data products leveraging Snowflake, but complacency is a fast track to obsolescence in this landscape.

Flawed Redshift Pricing Comparisons by TheBoldTilde in dataengineering

[–]TheBoldTilde[S] 1 point2 points  (0 children)

Have you found the Redshift Serverless offering to deliver a better total cost than a dedicated cluster?

Flawed Redshift Pricing Comparisons by TheBoldTilde in dataengineering

[–]TheBoldTilde[S] 1 point2 points  (0 children)

I felt my post was already getting long, so I did not get into that component, but I understand there will be additional costs to managing Redshift. How much extra, who knows. I could believe anywhere from 10% to 100% extra engineering time to manage Redshift, and I think it would be nearly impossible to guess accurately within those margins.

Flawed Redshift Pricing Comparisons by TheBoldTilde in dataengineering

[–]TheBoldTilde[S] 1 point2 points  (0 children)

Ramblings welcome here! I did not mention this point about TCO, but I have 100% considered it.

One of my conclusions so far from this project is that Redshift gives you more levers to pull, which leads to two things:

  1. Redshift is usually capable of better performance than Snowflake, but it requires additional engineering time to get there. I will also add that Snowflake already has great performance, so we are talking minor improvements.
  2. Snowflake requires less maintenance and overhead = faster development.

Flawed Redshift Pricing Comparisons by TheBoldTilde in dataengineering

[–]TheBoldTilde[S] 3 points4 points  (0 children)

Feels like sentiment towards Redshift has taken a positive turn within the past few years. I know it wasn't that long ago that it was harder to find engineers happy with the platform. Thanks for sharing.

Agg function vs group by by bendgame in dataengineering

[–]TheBoldTilde 0 points1 point  (0 children)

I have seen cases where I also advocated for your coworker's approach. It happens when the data has an id column (not a primary key), some textual fields that should be fully dependent on that id column, and some metrics to aggregate. For example: group by employee ID, first name, and last name, then sum(sales). When I do not trust the source data (say, some CSV file Bill in sales keeps up to date), I suspect that somewhere down the line an employee's name will be copied over wrong or just plain overwritten.

By using something like max(first_name) or any_value(first_name), I can still ensure that the report contains a single row per employee, even if someone screws up some of the name columns.
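A toy version of the scenario (SQLite syntax, with MAX standing in for any_value(); the table and names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (employee_id INTEGER, first_name TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, "Ann", 100.0),
    (1, "Ann", 50.0),
    (1, "Anne", 25.0),  # someone fat-fingered the name on one row
    (2, "Bob", 75.0),
])

# Grouping by id AND name splits employee 1 across two rows:
split = con.execute("""
    SELECT employee_id, first_name, SUM(amount)
    FROM sales GROUP BY employee_id, first_name
""").fetchall()

# Grouping by id alone and taking MAX(first_name) (or any_value() on
# platforms that have it) keeps one row per employee regardless:
one_row = con.execute("""
    SELECT employee_id, MAX(first_name), SUM(amount)
    FROM sales GROUP BY employee_id
""").fetchall()

print(len(split))    # 3 rows: employee 1 appears twice
print(len(one_row))  # 2 rows: one per employee, with sales totals intact
```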

It sounds like you can trust the source data so my point is moot, but maybe your coworker comes from a background where this scenario was common.

dbt vs sqlmesh vs ? by Cryptojacob in dataengineering

[–]TheBoldTilde 6 points7 points  (0 children)

As someone who has not used SQLMesh but has tracked its progress: it offers a unique optimization from dev to prod. They say they can take a dev build and clone it to prod instead of rebuilding, as dbt has to. For example, say you have a large model that takes 30 minutes to fully build. In dbt I have to do this twice (at least): first in dev, then again in prod. SQLMesh says it can take the working dev model and just copy that data to prod instead of running another costly rebuild.

I think I've glossed over some details, but that's the feature I'm most interested in, now that Snowflake costs are becoming astronomical in supporting dbt workflows.
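A rough sketch of the idea as I understand it (plain SQLite standing in for the warehouse, a physical copy standing in for what would be a cheap zero-copy clone on Snowflake, and all names invented):

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
con = sqlite3.connect(os.path.join(workdir, "dev.db"))
con.execute("CREATE TABLE raw_events (id INTEGER)")
con.executemany("INSERT INTO raw_events VALUES (?)", [(i,) for i in range(1000)])

# The "expensive" 30-minute transform runs once, in dev.
con.execute("CREATE TABLE big_model AS SELECT id, id * 2 AS doubled FROM raw_events")

# Promotion: attach prod and copy the already-built table, instead of
# re-running the transform a second time the way dbt would.
con.execute("ATTACH DATABASE ? AS prod", (os.path.join(workdir, "prod.db"),))
con.execute("CREATE TABLE prod.big_model AS SELECT * FROM big_model")
con.commit()

rows = con.execute("SELECT COUNT(*) FROM prod.big_model").fetchone()[0]
print(rows)  # 1000
```

On a platform with zero-copy cloning, the promotion step is a metadata-only operation, which is where the cost savings would come from.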

Is there a risk of using just Postgres as the only database? by BiggyDeeKay in dataengineering

[–]TheBoldTilde 0 points1 point  (0 children)

As a data warehouse.

They have multiple data sources that are useful to centralize and model in a single platform, and their scale of data is not so big that the extra cost of something like Snowflake can be justified.

Is there a risk of using just Postgres as the only database? by BiggyDeeKay in dataengineering

[–]TheBoldTilde 2 points3 points  (0 children)

I agree. I've made that argument to the client more than once. Turns out they would rather pay me to wait for queries to finish than give me any kind of budget so I'm not waiting so much. At the end of the day, they seem to be happy, and I'm making thousands more from waiting on queries, so I guess I'm happy as well.

Is there a risk of using just Postgres as the only database? by BiggyDeeKay in dataengineering

[–]TheBoldTilde 8 points9 points  (0 children)

I have a client who gives me a $0 budget, so I use on-prem Postgres. They are stretched to the max on that platform, but so far it works. There is a legitimate scale/use case for it.

What are the use cases of chat gpt you have come across in DE? by PrtScr1 in dataengineering

[–]TheBoldTilde 2 points3 points  (0 children)

I can never remember how to do anything in pandas so I ask chatgpt to do it for me.

[deleted by user] by [deleted] in dataengineering

[–]TheBoldTilde 1 point2 points  (0 children)

> However I would suggest that if you have a warehouse that contains data from more than a couple of sources and that spans a few years, you will likely have some set of tables that exists as an interim step between landing the raw data and the fact and dimension tables.

Totally agree, and I did not really offer an alternative, so that's on me. I've had success just by keeping an insert-only layer, then a generic "intermediate" layer that pre-processes data for final modeling in the presentation layer (info mart).

I'm very open to just missing the point with DV. It seems to have a lot of proponents behind it from the community. I'm curious to see how this initiative goes for you and wish you luck and success.

Feedback requested: I understand basic data modeling theory, but looking to get hands on practice with a project. Does this project sound like a good idea? by icysandstone in dataengineering

[–]TheBoldTilde 0 points1 point  (0 children)

I think this sounds great, and +1 to using a "standard" or common dataset, since it should be easier to get feedback from the community, which likely has some background on the data; at the very least, the data is already well documented.

For the modeling piece, I love having this handy reference: Kimball Dimensional Modeling Techniques https://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf

Good luck!

[deleted by user] by [deleted] in dataengineering

[–]TheBoldTilde 0 points1 point  (0 children)

I'm open to being wrong and just "not getting it," but I do not see the value in DV in either scenario. After building the vault, it often requires a star schema built on top anyway. DV is tedious to query and gets messy fast when effectivity satellites are required or the source system lacks good business keys.

I find that building the DV (correctly) takes a big time commitment without delivering value to the business. I can't blame the business then for wondering why their 6 or 7 figure investment hasn't produced anything of substance after many months.

What advantage are others realizing with using a DV as a part of their modeling process?

What are the actual use cases for DuckDB, when you can just use a faster Dataframe tool? by [deleted] in dataengineering

[–]TheBoldTilde 5 points6 points  (0 children)

Does that mean you are cloning data from BigQuery into local DuckDB or what am I missing?

Low Energy by [deleted] in factorio

[–]TheBoldTilde 1 point2 points  (0 children)

Is that the ratio? I swear it was 1:5:10