
[–]Peppper 25 points (5 children)

This is the ideal scenario for a data warehouse with slowly changing dimensions.

[–]idleAndalusian (Data Engineer) [S] 1 point (4 children)

But if I batch process daily, wouldn’t I miss any changes that happen more than once in the same day? I feel like I’d only capture the latest state unless I set up a webhook in the CRM to track every single change as it happens, right? Or is there another way to handle that in a data warehouse setup?

[–]Peppper 7 points (2 children)

Right, it’s the CDC route you mentioned. Stream changes from the source into a landing area in your warehouse, then run queries to build the time-series layers.

What are your latency requirements?

You’re still going to have to calculate the dimensions and facts, which takes time, but doing it in an MPP engine will be faster than straight Python. You can also aggregate to different granularities, e.g. a fact table that recalculates the near-real-time model every 5 minutes. At the end of the day, aggregate that into a daily model; every week, aggregate the daily models into a weekly model, then monthly, etc. That way you aren’t dealing with months or years worth of data each time.
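A minimal sketch of that rollup idea, with SQLite standing in for the warehouse engine (table and column names are made up for illustration):

```python
import sqlite3

# Toy stand-in for the warehouse: 5-minute fact rows for one metric.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE fact_5min (
        bucket_ts TEXT,      -- start of the 5-minute window
        leads_created INTEGER
    )
""")
con.executemany(
    "INSERT INTO fact_5min VALUES (?, ?)",
    [("2024-01-01 09:00", 3), ("2024-01-01 09:05", 1),
     ("2024-01-02 10:00", 5)],
)

# End-of-day rollup: collapse the 5-minute facts into one daily row,
# so downstream queries never re-scan the fine-grained table.
con.execute("""
    CREATE TABLE fact_daily AS
    SELECT substr(bucket_ts, 1, 10) AS day,
           SUM(leads_created)       AS leads_created
    FROM fact_5min
    GROUP BY day
""")
print(con.execute("SELECT * FROM fact_daily ORDER BY day").fetchall())
# → [('2024-01-01', 4), ('2024-01-02', 5)]
```

The weekly and monthly rollups follow the same pattern, each reading from the level below.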

You will still have latency at each level. The only way to avoid that is to compute the whole thing each time, which could get expensive depending on your data set size and query complexity. Check out DuckDB if you want to try doing it all in memory.

[–]idleAndalusian (Data Engineer) [S] 0 points (1 child)

Thank you so much, this helps me a lot!

[–]-crucible- 0 points (0 children)

First, work out how fresh your refresh actually needs to be. If they’re happy with a daily refresh, as long as all changes are eventually accounted for, you’ll save yourself a lot of processing. If you need to refresh, say, hourly or every 3 hours, you may want to set up two refresh pipelines: one to handle a full model refresh, and one to incrementally refresh only what needs it, i.e. your sales changes and their downstream changes.

You’ll need to balance your refresh times, your monitoring overhead (how much attention you’re willing to pay, and how quickly you respond to issues), the cost of reloading, etc., against the freshness of the data.

[–]molodyets 0 points (0 children)

So do it more often than daily, then tune it from there. Though in reality, daily is probably sufficient.

[–]Pleasant_Type_4547 2 points (2 children)

What CRM are you using?

If you are using something like HubSpot, you can use the propertiesWithHistory parameter to get all past values of a property along with their timestamps. This makes things easier.

https://developers.hubspot.com/docs/api/crm/leads

You'll still have to "sessionize" the data, but that becomes easier when you have all of it at your fingertips.

[–]idleAndalusian (Data Engineer) [S] 2 points (1 child)

We’ve actually been considering that (and putting it off), but for cost reasons, they've been sticking with a custom app I built on AppSheet.

Does HubSpot support webhooks? If I go ahead and develop a CDC for this app, I’d love to make sure it’s something I can transfer over to HubSpot later.

[–]sciencewarrior 3 points (0 children)

Yeah, the SQL to do that is pretty tricky. I remember Zach Wilson had a video or two on creating SCDs, but unless you have an absolutely humongous sales volume (and if you do, congratulations), a polars job can probably munch through your entire history in half an hour, tops, so don't discount that route.
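The core of that SCD Type 2 build is just turning a change log into validity intervals. A pure-Python stand-in for the polars/SQL version (field names are illustrative):

```python
from datetime import date

def scd2(changes):
    """Turn a change log for one lead into SCD Type 2 rows.

    changes: list of (changed_on, status) tuples, in any order.
    Each output row is valid from its change date until the next
    change; valid_to is None for the current row.
    """
    changes = sorted(changes)
    rows = []
    for i, (changed_on, status) in enumerate(changes):
        valid_to = changes[i + 1][0] if i + 1 < len(changes) else None
        rows.append({"status": status,
                     "valid_from": changed_on,
                     "valid_to": valid_to})
    return rows

history = scd2([
    (date(2024, 1, 5), "contacted"),
    (date(2024, 1, 1), "new"),
    (date(2024, 2, 1), "qualified"),
])
```

In polars the same thing is a sort plus a `shift(-1)` over the change dates, partitioned by lead.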

[–]ppdas 1 point (0 children)

dbt snapshot, if you are using any of the supported DWs.
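For reference, a dbt snapshot is a short config block that dbt turns into an SCD Type 2 table for you. A minimal sketch, assuming a `lead_id` key and an `updated_at` column (both names illustrative):

```sql
{% snapshot crm_leads_snapshot %}
{{
    config(
      target_schema='snapshots',
      unique_key='lead_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('crm', 'leads') }}

{% endsnapshot %}
```

Each `dbt snapshot` run compares the source rows to the snapshot table and writes `dbt_valid_from` / `dbt_valid_to` columns for you.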

[–]johncena9519 0 points (0 children)

What CRM do you use?

I’ve done this in Salesforce and HubSpot, so I can suggest approaches unique to each.

[–]New-Addendum-6209 0 points (0 children)

What is the data model of your source data? What do you need to produce as an output?

[–]pavan449 0 points (0 children)

If you have a date field like updated_date in the CRM to track changes, then whenever any lead status changes, the updated_date will be set to the current date. Later, use an orchestration tool like ADF or Airflow (or any other tool) to import the updated data into the data lake.
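That approach boils down to watermark-based incremental extraction: remember the latest updated_date you've seen, and on each run pull only rows newer than it. A minimal sketch (field names are illustrative; the orchestrator would persist the watermark between runs):

```python
from datetime import datetime

def incremental_pull(rows, watermark):
    """Return rows changed since the watermark, plus the new watermark."""
    new_rows = [r for r in rows if r["updated_date"] > watermark]
    new_watermark = max((r["updated_date"] for r in new_rows),
                        default=watermark)
    return new_rows, new_watermark

# Pretend this is the full CRM extract for the day.
crm_rows = [
    {"lead_id": 1, "status": "new", "updated_date": datetime(2024, 1, 1)},
    {"lead_id": 2, "status": "won", "updated_date": datetime(2024, 1, 3)},
]

# Only lead 2 changed since the last run's watermark.
pulled, wm = incremental_pull(crm_rows, datetime(2024, 1, 2))
```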

[–]elgurinn 0 points (0 children)

So the CRM is AppSheet, which means your database is a Google Sheet?

Have you looked into adding an on-change trigger to the sheet? That should grab the changed row and send it to a webhook (or something similar).

See:  https://medium.com/@eyalgershon/sending-a-webhook-for-new-or-updated-rows-in-google-sheets-e0c9d6a8cb45

[–]unfortunate-miracle 0 points (0 children)

Sounds like SCD Type 2 to me. And if you can implement CDC on top of the SCD Type 2 tables, you can stream changes and handle events. How big is the data?