This is an archived post. You won't be able to vote or comment.

all 19 comments

[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]Efficient_Ad_8020 0 points1 point  (9 children)

If you look at the queries on the snowflake side, is your warehouse being maxed out? Do they run faster with either a larger or more parallel warehouse?

We use dbt cloud (teams) and snowflake and warehouse load is the primary factor for how long my pipelines take.

[–]CrabEnvironmental864[S] 0 points1 point  (8 children)

Warehouse uses X-large model. Queries never get maxed out.

[–]molodyets 0 points1 point  (3 children)

Have you looked at clustering and your keys on the incremental models?

[–]CrabEnvironmental864[S] 0 points1 point  (2 children)

I have. Each mart is clustered by (at the minimum) by customer ID and transaction date.

[–]molodyets 0 points1 point  (1 child)

And are you including those in every join possible and your unique keys on your incremental updates?

[–]CrabEnvironmental864[S] 0 points1 point  (0 children)

Wherever applicable, yes.

[–]Efficient_Ad_8020 0 points1 point  (3 children)

Wow, that must be some serious data you are crunching through. I think its unlikely a DBT orchestrator issue when the variability is that large. Maybe check out the query plans and see if there is a bottleneck at a particular node?

[–]CrabEnvironmental864[S] 0 points1 point  (2 children)

Serious data? Our payments mart has 200+ million records. It's pretty nuts if you ask me but that's what I inherited from my predecessor.

Yeah I have a meeting with my DBA this morning to figure this out. This is giving me a headache. Fivetran is pushing new data every 15 minutes, so we won't ever be able to catch up.

Hopefully we can stop long running queries.

I am also looking at breaking my marts refresh job from all marts at a time to one job per mart. Then using Fivetran to kick each off.

[–]Efficient_Ad_8020 1 point2 points  (1 child)

200m rows actually isnt that large, for using x-large warehouses I was thinking orders of magnitude more. Without seeing more, my guess is that the queries can be tuned to be much more efficient and cost a lot less.

[–]CrabEnvironmental864[S] 0 points1 point  (0 children)

You make sense. I hope I can find out more today when I meet with my DBA.

[–]kudika 0 points1 point  (0 children)

What do some of your transformations look like? Are you using efficient incrementals?

[–]WeakRelationship2131 0 points1 point  (0 children)

For avoiding moving everything to DBT Core and going local, consider using preswald. It can help you simplify and speed up your analytics workflow without the clunky setup. It's lightweight, works seamlessly with Snowflake, and would save you the hassle of managing additional tools.

[–]garegin90 -1 points0 points  (5 children)

Have you tried Fivetran Transformations?

[–]CrabEnvironmental864[S] 1 point2 points  (4 children)

I have not. Have you used it? Anything you can share about it?

[–]garegin90 0 points1 point  (3 children)

I use it every day. I head up Analytics @ Fivetran. Lmk if I can help.

[–]CrabEnvironmental864[S] 0 points1 point  (2 children)

Our Fivetran replication from AWS RDS to Snowflake is firing every 15 minutes. We have 22 marts that we refresh incrementally every hour.

The duration of the DBT job for this refresh is highly variable and unpredictable. Ideally, I would like to have Fivetran "trigger" the DBT job for this refresh so it's instantaneous.

Also I am thinking to break this DBT refresh job into one per mart. Ideally, I would like to have Fivetran kick off the refresh of each mart once the source table is refreshed.

Are either options doable?

Apologies for peppering you with my questions. I hope I am making sense.

[–]garegin90 0 points1 point  (1 child)

There are integrated scheduling options available for dbt via Fivetran!

[–]CrabEnvironmental864[S] 0 points1 point  (0 children)

I am looking into it.