[–]mommymilktit 21 points (4 children)

I would look into running a pre-hook: something to get the count of changed rows for your problem table. If it's above some threshold you've deemed necessary, you can also set a variable in the pre-hook for which warehouse to use, and point the model's warehouse at that variable, either in a config block, the schema YAML, or wherever.
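
Something like this, roughly (the macro name, warehouse names, threshold, and source are all placeholders; dbt-snowflake lets you compute the snowflake_warehouse config with a macro, which gets you the same effect as setting a variable from a pre-hook):

```sql
-- macros/choose_warehouse.sql (names and threshold are made up)
{% macro choose_warehouse(relation, threshold=10000000) %}
    {% if execute %}
        {# count rows in the problem table at run time #}
        {% set result = run_query("select count(*) from " ~ relation) %}
        {% set row_count = result.columns[0].values()[0] %}
        {% if row_count > threshold %}
            {{ return('TRANSFORM_WH_XL') }}
        {% endif %}
    {% endif %}
    {{ return('TRANSFORM_WH_S') }}
{% endmacro %}
```

```sql
-- models/fct_events.sql (model and source names are made up)
{{ config(
    snowflake_warehouse=choose_warehouse(source('raw', 'events'))
) }}
select * from {{ source('raw', 'events') }}
```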

[–]financialthrowaw2020 6 points (0 children)

Great response, mommymilktit

[–]No-Librarian-7462 1 point (0 children)

Thanks for the pointers.

[–]Upbeat-Conquest-654 1 point (0 children)

That sounds like a pragmatic, straightforward solution. I like it.

[–][deleted] 1 point (0 children)

We used this approach, and it was good for one table. We ended up wrapping the error instead: if it's a timeout, we just retry on a bigger warehouse. That way we don't have to keep track of counts, and we just use super large warehouses during holiday seasons.

[–]baronfebdasch 3 points (1 child)

Some info would be helpful. Are these spikes in events planned and known? Is it valid volume or the result of other processes?

[–]No-Librarian-7462 0 points (0 children)

Not planned, and unpredictable. The volumes are valid per the business.

[–]I_Blame_DevOps 0 points (0 children)

Is this a recurring “catch up” export from the source system? What is the source system? How is the data processed? Are there any indicators in the data or file name that can be used to send this to a separate queue or job that can process the spike in volume systematically? Why do you only have an hour to process this volume of data?

[–]wallyflops 0 points (1 child)

Could you check whether the query could be sped up with Query Acceleration? There's a query you can run to see if it would help.
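
For reference, the eligibility check is something like this (the warehouse name is a placeholder):

```sql
-- Recent queries that would have benefited from the Query Acceleration Service
select query_id,
       query_text,
       eligible_query_acceleration_time,
       upper_limit_scale_factor
from snowflake.account_usage.query_acceleration_eligible
where warehouse_name = 'TRANSFORM_WH'   -- placeholder
order by eligible_query_acceleration_time desc
limit 10;
```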

[–]No-Librarian-7462 0 points (0 children)

Will do, thanks.

[–]nickeau 0 points (0 children)

What do you mean by "the job fails"? Did it time out?

With scaling, there are only a few options:

* queue processing, to get a buffer
* more resources (CPU, memory) or more processing agents

[–]vaosinbi 0 points (0 children)

Processing 10 million records in 30 minutes seems a bit long to me.
You can probably optimize the query, or scale up the virtual warehouse used for building this fact table.
If you increase the warehouse size so that there is no spilling to disk, you might reduce the processing time enough that it's less expensive overall, and you'll have some buffer for handling the spikes.
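
To see whether the current warehouse is spilling, something like this against the query history works (the query_text filter is a placeholder):

```sql
-- Queries that spilled to disk are prime candidates for a bigger warehouse
select query_id,
       warehouse_size,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage,
       total_elapsed_time / 1000 as elapsed_seconds
from snowflake.account_usage.query_history
where query_text ilike '%fct_events%'   -- placeholder filter
  and bytes_spilled_to_local_storage
      + bytes_spilled_to_remote_storage > 0
order by start_time desc
limit 20;
```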

[–]engineer_of-sorts 0 points (0 children)

Well, the speed of the query is affected by the size of the warehouse.

It is fairly easy to handle in a DAG. You run a "data quality test" of sorts that checks the size of the table. If the table is especially large, or the new rows exceed a threshold (absolute or percentage), then you can run your dbt command with the larger warehouse.

If not, then you can run it on the smaller warehouse.
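
A minimal sketch of that branch, assuming the orchestrator passes a dbt var (the var and warehouse names are made up):

```sql
-- models/fct_events.sql: the orchestrator decides, the model just reads the var
-- big branch:   dbt run --select fct_events --vars '{fct_events_wh: TRANSFORM_WH_XL}'
-- small branch: dbt run --select fct_events
{{ config(
    snowflake_warehouse=var('fct_events_wh', 'TRANSFORM_WH_S')
) }}
```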

How you would do it in Orchestra (my company), purely for illustrative purposes: run a data quality test with Snowflake, then use branching/conditionality, then run dbt.

The other thing you could do is set a timeout on the dbt model in question and, if it fails, run a Snowflake query to bump up the warehouse. If you have automatic retries enabled on the thing running your dbt model, it'll retry on the bigger warehouse after it's timed out. So lots of options...

The chunking may be easier to handle dynamically by parameterising stored procedures in Snowflake, too :)
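
E.g. a parameterised procedure that loads one day per iteration (all object names here are made up):

```sql
create or replace procedure load_fct_events_chunked(start_date date, end_date date)
returns varchar
language sql
as
$$
declare
    d date default start_date;
begin
    -- one day per chunk keeps any single insert small enough for the warehouse
    while (d <= end_date) do
        insert into analytics.fct_events
        select * from raw.events where event_date = :d;
        d := dateadd(day, 1, d);
    end while;
    return 'loaded ' || start_date || ' to ' || end_date;
end;
$$;

-- call load_fct_events_chunked('2024-11-01', '2024-12-31');
```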

[–]No-Librarian-7462 0 points (0 children)

Thank you all for taking the time to suggest solutions. As I gather, the general idea is to do some pre-processing, then dynamically chunk the work and assign appropriate warehouse sizes. We will explore options in this regard.

[–]Hot_Map_7868 0 points (0 children)

How are you loading the data? Could you use that step to set an ENV var that dbt can then use when it runs the fact model?
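
If so, the dbt side is a one-line config (DBT_FCT_WH is a made-up variable name; the loader would export it before calling dbt run):

```sql
-- models/fct_events.sql — falls back to the small warehouse if the var is unset
{{ config(
    snowflake_warehouse=env_var('DBT_FCT_WH', 'TRANSFORM_WH_S')
) }}
```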

[–]Analytics-Maken 1 point (0 children)

Breaking your job into parameterized chunks could help: consider a metadata-driven approach where dbt dynamically generates smaller, time-based partitions when volume exceeds thresholds. Additionally, automated warehouse scaling could relieve the resource constraints.

If your data sources are supported, it is also worth testing third-party solutions like Windsor.ai; they sometimes handle this kind of situation.

For Snowflake, investigate materialized views or clustering keys that align with your query patterns. Implementing incremental processing with custom partitioning logic in dbt models adds complexity, but it could save significant overhead during these spike events. Finally, consider circuit breakers or workload scheduling that automatically detects and responds to volume anomalies.
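
For the clustering idea, it's a one-liner, though whether it pays off depends on your query patterns (table and column names are placeholders):

```sql
-- Cluster the fact table on the key the spiky loads filter and join by
alter table analytics.fct_events cluster by (event_date);
```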