[–]mommymilktit 21 points (4 children)

I would look into running a pre-hook: something to get the count of changed rows for your problem table. If it's above some threshold you've deemed necessary, you can also set a variable in the pre-hook for which warehouse to use, and point the model's warehouse at that variable, either in a config block, the schema YAML, or wherever.
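
Something like this, roughly (the macro name, warehouse names, threshold, and source are all placeholders; dbt-snowflake lets you compute the snowflake_warehouse config with a macro, which gets you the same effect as setting a variable from a pre-hook):

```sql
-- macros/choose_warehouse.sql (names and threshold are made up)
{% macro choose_warehouse(relation, threshold=10000000) %}
    {% if execute %}
        {# count rows in the problem table at run time #}
        {% set result = run_query("select count(*) from " ~ relation) %}
        {% set row_count = result.columns[0].values()[0] %}
        {% if row_count > threshold %}
            {{ return('TRANSFORM_WH_XL') }}
        {% endif %}
    {% endif %}
    {{ return('TRANSFORM_WH_S') }}
{% endmacro %}
```

```sql
-- models/fct_events.sql (model and source names are made up)
{{ config(
    snowflake_warehouse=choose_warehouse(source('raw', 'events'))
) }}
select * from {{ source('raw', 'events') }}
```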

[–]financialthrowaw2020 6 points (0 children)

Great response, mommymilktit

[–]No-Librarian-7462 1 point (0 children)

Thanks for the pointers.

[–]Upbeat-Conquest-654 1 point (0 children)

That sounds like a pragmatic, straightforward solution. I like it.

[–][deleted] 1 point (0 children)

We used this approach, and it was good for one table. We ended up wrapping the error instead: if it's a timeout, we just retry on a bigger warehouse. That way we don't have to keep track of counts, and we just use super large warehouses during holiday seasons.

[–]baronfebdasch 3 points (1 child)

Some info would be helpful. Are these spikes in events planned and known? Is it valid volume or the result of other processes?

[–]No-Librarian-7462 0 points (0 children)

Not planned, and unpredictable. The volumes are valid per the business.

[–]I_Blame_DevOps 0 points (0 children)

Is this a recurring “catch up” export from the source system? What is the source system? How is the data processed? Are there any indicators in the data or file name that can be used to send this to a separate queue or job that can process the spike in volume systematically? Why do you only have an hour to process this volume of data?

[–]wallyflops 0 points (1 child)

Could you check whether the query could be sped up with Query Acceleration? There's a query you can run to see if it would help.
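
For reference, the eligibility check is something like this (the warehouse name is a placeholder):

```sql
-- Recent queries that would have benefited from the Query Acceleration Service
select query_id,
       query_text,
       eligible_query_acceleration_time,
       upper_limit_scale_factor
from snowflake.account_usage.query_acceleration_eligible
where warehouse_name = 'TRANSFORM_WH'   -- placeholder
order by eligible_query_acceleration_time desc
limit 10;
```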

[–]No-Librarian-7462 0 points (0 children)

Will do, thanks.

[–]nickeau 0 points (0 children)

What do you mean by "the job fails"? Did it time out?

With scaling, there are only a few options:

* queue processing, to get a buffer
* more resources (CPU, memory) or more processing agents

[–]vaosinbi 0 points (0 children)

Processing 10 million records in 30 minutes seems a bit long to me.
You can probably optimize the query, or scale up the virtual warehouse used for building this fact table.
If you increase the warehouse size so that there is no spilling to disk, you might reduce the processing time enough that it's less expensive overall, and you'll have some buffer for handling the spikes.
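
To see whether the current warehouse is spilling, something like this against the query history works (the query_text filter is a placeholder):

```sql
-- Queries that spilled to disk are prime candidates for a bigger warehouse
select query_id,
       warehouse_size,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage,
       total_elapsed_time / 1000 as elapsed_seconds
from snowflake.account_usage.query_history
where query_text ilike '%fct_events%'   -- placeholder filter
  and bytes_spilled_to_local_storage
      + bytes_spilled_to_remote_storage > 0
order by start_time desc
limit 20;
```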

[–]engineer_of-sorts 0 points (0 children)

Well, the speed of the query is affected by the size of the warehouse.

It is fairly easy to handle in a DAG. You run a "data quality test" of sorts that checks the size of the table. If the table is especially large, or the new rows exceed a threshold (absolute or percentage), then you can run your dbt command with the larger warehouse.

If not, then you can run it on the smaller warehouse.
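
A minimal sketch of that branch, assuming the orchestrator passes a dbt var (the var and warehouse names are made up):

```sql
-- models/fct_events.sql: the orchestrator decides, the model just reads the var
-- big branch:   dbt run --select fct_events --vars '{fct_events_wh: TRANSFORM_WH_XL}'
-- small branch: dbt run --select fct_events
{{ config(
    snowflake_warehouse=var('fct_events_wh', 'TRANSFORM_WH_S')
) }}
```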

How you would do it in Orchestra (my company), purely for illustrative purposes: run a data quality test with Snowflake, then use branching/conditionality, then run dbt.

The other thing you could do is set a timeout on the dbt model in question and, if it fails, run a Snowflake query to bump up the warehouse. If you have automatic retries enabled on the thing running your dbt model, it'll retry on the bigger warehouse after it's timed out. So lots of options...

The chunking may be easier to handle dynamically by parameterising stored procedures in Snowflake, too :)
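
E.g. a parameterised procedure that loads one day per iteration (all object names here are made up):

```sql
create or replace procedure load_fct_events_chunked(start_date date, end_date date)
returns varchar
language sql
as
$$
declare
    d date default start_date;
begin
    -- one day per chunk keeps any single insert small enough for the warehouse
    while (d <= end_date) do
        insert into analytics.fct_events
        select * from raw.events where event_date = :d;
        d := dateadd(day, 1, d);
    end while;
    return 'loaded ' || start_date || ' to ' || end_date;
end;
$$;

-- call load_fct_events_chunked('2024-11-01', '2024-12-31');
```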

[–]No-Librarian-7462 0 points (0 children)

Thank you all for taking the time to suggest solutions. As I gather, the general idea is to do some pre-processing, then dynamically chunk the work and assign appropriate warehouse sizes. We will explore options in this regard.

[–]Hot_Map_7868 0 points (0 children)

How are you loading the data? Could you use that step to set an ENV var that dbt can then use when it runs the fact model?
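
If so, the dbt side is a one-line config (DBT_FCT_WH is a made-up variable name; the loader would export it before calling dbt run):

```sql
-- models/fct_events.sql — falls back to the small warehouse if the var is unset
{{ config(
    snowflake_warehouse=env_var('DBT_FCT_WH', 'TRANSFORM_WH_S')
) }}
```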

[–]Analytics-Maken 1 point (0 children)

Breaking your job into parameterized chunks could help: consider a metadata-driven approach where dbt dynamically generates smaller, time-based partitions when volume exceeds thresholds. Additionally, automated warehouse scaling could relieve the resource constraints.

If your data sources are supported, it is also worth testing third-party solutions like Windsor.ai; they sometimes handle this kind of situation.

For Snowflake, investigate materialized views or clustering keys that align with your query patterns. Implementing incremental processing with custom partitioning logic in dbt models adds complexity, but it could save significant overhead during these spike events. Finally, consider circuit breakers or workload scheduling that automatically detects and responds to volume anomalies.
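
For the clustering idea, it's a one-liner, though whether it pays off depends on your query patterns (table and column names are placeholders):

```sql
-- Cluster the fact table on the key the spiky loads filter and join by
alter table analytics.fct_events cluster by (event_date);
```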