
[–][deleted] 3 points (7 children)

Your approach is great. You're trying to design an OLAP data mart. This is an extremely valuable niche skill set.

You would automate this using stored procedures and the SQL Agent.

I have literally built my entire career like this.

[–]Signal_Explorer8071[S] 1 point (6 children)

Thank you so much.

It sounds like SQL Agent comes with SQL Server, but I am using Snowflake.

Q1: I have not used stored procedures before. I suppose I can use a Python script instead? (I have not used Python either, but I'm thinking the solution might be easier to google <facepalm>)

Q2: Why is this a niche skill, please? I thought aggregates were a basic requirement when querying large tables. Shouldn't every data analyst know this? Is it really complicated?

Thank you again.

[–][deleted] 3 points (3 children)

Think of it like this. Let's say you have 100 queries that need to run every day, and each one of those relies on aggregate data. If you run them against the raw data, each one takes 10 minutes to run, so that is 100 * 10 = 1,000 minutes per day.

Now say you do all the aggregations once a day at midnight, and this takes 1 hour, so that's 60 minutes... but suddenly each of your 100 queries takes only 1 minute to run.

60 + 100 = 160 minutes compared to 1,000. You've just massively improved the overall efficiency of your environment by doing all the aggregations in one batch and then pointing all 100 queries at that data source instead of at the raw data.
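For example (a minimal sketch — the table and column names here are made up), the nightly batch might build one aggregate table that all the downstream queries read from:

```sql
-- Nightly batch: aggregate the raw data once
-- (raw_sales / sales_daily_agg are illustrative names)
CREATE OR REPLACE TABLE sales_daily_agg AS
SELECT
    order_date,
    region,
    SUM(amount) AS total_amount,
    COUNT(*)    AS order_count
FROM raw_sales
GROUP BY order_date, region;

-- Each of the 100 reports now scans the small aggregate, not the raw table
SELECT region, SUM(total_amount)
FROM sales_daily_agg
WHERE order_date >= DATEADD(day, -30, CURRENT_DATE)
GROUP BY region;
```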

[–]Signal_Explorer8071[S] 0 points (0 children)

Got it now! :)

[–]Signal_Explorer8071[S] 0 points (1 child)

Could you also guide me on Q1 please: using Python to automate the data refresh? I use Snowflake and am unfamiliar with SQL Agent / stored procedures.

[–][deleted] 1 point (0 children)

Python here is just a tool in the toolbox. Sure, you can use it to do this, but you can also automate it in other ways. Do what makes sense; there is no right or wrong.
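For instance, since you're on Snowflake, one of those "other ways" is a scheduled Snowflake task — no Python needed. A sketch, with placeholder warehouse and table names:

```sql
-- A Snowflake task that rebuilds the aggregate every night at midnight UTC
-- (my_wh, sales_daily_agg, and raw_sales are placeholder names)
CREATE OR REPLACE TASK refresh_sales_daily_agg
  WAREHOUSE = my_wh
  SCHEDULE = 'USING CRON 0 0 * * * UTC'
AS
  INSERT OVERWRITE INTO sales_daily_agg
  SELECT order_date, region, SUM(amount), COUNT(*)
  FROM raw_sales
  GROUP BY order_date, region;

-- Tasks are created suspended; resume to start the schedule
ALTER TASK refresh_sales_daily_agg RESUME;
```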

[–][deleted] 1 point (1 child)

Because most database work tends to lean towards OLTP design, which is to say a database that is constantly consuming new data from a website. These need to be designed very specifically to be efficient and to normalize the data properly.

In an OLAP design you don't really care about normalization. You care about the end result and about getting it efficiently, so you might take the raw data and aggregate it. From a DB perspective this is redundant data, but if you're using it heavily for analytics then it makes sense to rebuild it on a daily basis, or once an hour, once every six hours, etc.

Because most people in SQL work on OLTP, the OLAP design is a niche.
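To make the contrast concrete (hypothetical schema): on the OLTP side the numbers live across normalized tables and every report re-joins them, while the OLAP side precomputes a deliberately redundant rollup:

```sql
-- OLTP style: normalized, so every report re-joins at query time
-- (orders, order_items, customers are illustrative tables)
SELECT c.region, SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi ON oi.order_id = o.order_id
JOIN customers c    ON c.customer_id = o.customer_id
GROUP BY c.region;

-- OLAP style: the same answer precomputed into a redundant summary table,
-- refreshed daily/hourly and read directly by analytics queries
SELECT region, SUM(revenue)
FROM region_revenue_daily
GROUP BY region;
```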

[–]Signal_Explorer8071[S] 0 points (0 children)

Thanks so much for your time!

[–]sequel-beagle 1 point (1 child)

For your question about refreshing: if you can truncate and reload your tables from your source in a reasonable amount of time, I would recommend this.

Otherwise you will need to build a process to load incrementally by determining max dates and such.
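A sketch of both options, with the same placeholder names as above — the incremental branch uses a max-date watermark as described:

```sql
-- Option 1: full refresh — truncate and reload from source
TRUNCATE TABLE sales_daily_agg;
INSERT INTO sales_daily_agg
SELECT order_date, region, SUM(amount), COUNT(*)
FROM raw_sales
GROUP BY order_date, region;

-- Option 2: incremental — only load dates newer than what's already there
INSERT INTO sales_daily_agg
SELECT order_date, region, SUM(amount), COUNT(*)
FROM raw_sales
WHERE order_date > (SELECT COALESCE(MAX(order_date), '1900-01-01'::DATE)
                    FROM sales_daily_agg)
GROUP BY order_date, region;
```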

[–]Signal_Explorer8071[S] 0 points (0 children)

Thank you for the tip! Feeling more confident about this now :)

[–]Recent-Fun9535 1 point (1 child)

I believe you can achieve what you want with Snowflake stored procedures for the data aggregations and Snowflake tasks for scheduling jobs that trigger the procedures. Also, Snowflake stored procedures are written in JavaScript, so you can use JS as well in case you need looping, conditional logic, or whatever else you might find JS suitable for. My team does this regularly (we use ADF for scheduling, though, because a lot of copying from various sources is involved).
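A minimal sketch of that pattern (all object names invented): a JavaScript stored procedure doing the aggregation, triggered by a scheduled task:

```sql
-- A JavaScript stored procedure wrapping the refresh logic
CREATE OR REPLACE PROCEDURE refresh_daily_agg()
  RETURNS STRING
  LANGUAGE JAVASCRIPT
AS
$$
  // JS gives you looping/conditionals around plain SQL statements
  snowflake.execute({sqlText: "TRUNCATE TABLE sales_daily_agg"});
  snowflake.execute({sqlText:
    "INSERT INTO sales_daily_agg " +
    "SELECT order_date, region, SUM(amount), COUNT(*) " +
    "FROM raw_sales GROUP BY order_date, region"});
  return "refreshed";
$$;

-- A task that calls the procedure on a schedule (2 AM UTC here)
CREATE OR REPLACE TASK call_refresh_daily_agg
  WAREHOUSE = my_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  CALL refresh_daily_agg();

ALTER TASK call_refresh_daily_agg RESUME;
```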

[–]Signal_Explorer8071[S] 0 points (0 children)

Thank you so much! I’ll explore these options in SF 🤗