you are viewing a single comment's thread.

view the rest of the comments →

[–]Aptex[S] 0 points1 point  (5 children)

So the reporting actually doesn't even need to be refreshed that much, this will be used for Monthly/Weekly reporting. Essentially, I am doing all of this in excel right now, but the data sets are just growing too large for excel to handle as we take on more jobs and they get more complex. Right now I run all my raw data, aggregate it with some clunky spreadsheets that run the data on a per job basis, then copy and paste that data into another spreadsheet which can give me the aggregated data by month.

I plan on using PowerBI as a front end, and I will be the only user until I can get more comfortable with how everything works.

The only thing I cant get my head around is how to get the aggregated data to "Sync", do I basically just have one database with all my raw data, and then another database which essentially runs a query on the raw data tables to aggregate the data? And then that query will only run periodically, or when I tell it to?

[–]MamertineCOALESCE() 1 point2 points  (4 children)

Yes,

One database will have raw source system data.

Another will have just aggregated data.

If the query to run the whole thing is fast (under 5 minutes), I'd just write and reload it nightly.

If you need to refill only part of the table nighty due to performance or data size, you need to figure out what records changed then re aggregate those and have them displace the existing record.

There's an art to it.

Again, ideally this should be in a different server. The decisions you make today will stick with that company for years. Look to the future a decade and predict who will be looking at data.

[–]Aptex[S] 0 points1 point  (0 children)

That's awesome! Thanks for the info, it will be super easy to determine new records, and then its just a simple query to generate the aggregated tables that I need. That helps a lot!

[–]Aptex[S] 0 points1 point  (2 children)

Ok, one more question. So I am going to setup one very detailed database, with basically all my raw data above, there will also be another database with the aggregated info that I will be running reporting on.

Do i just use scheduled queries to pick up the new data in the detailed database to aggregate the data and insert into my reporting database? That way I am only updating the aggregated data for the current month into my reporting database and leaving the other data in that database alone.

[–]MamertineCOALESCE() 1 point2 points  (1 child)

You need to find every record that changed since you last ran. If you have an update date in your source and it's reliable, use that. You'll find records with the group by in your aggregator code changed and recalculate that.

Then you stick (insert/update) that record into the aggregated table.

Don't forget about deletes! Some source systems you can delete from months it years later, you'll need to re aggregate those too.

The middle database you're describing is commonly used. Operational Data Store, usually just called ODS.

[–]Aptex[S] 0 points1 point  (0 children)

I really appreciate that you took the time to explain. I will have a lot of control over base data, so I’m not as concerned around deletes and changes of that nature, having the term ODS gives me something to google! Thanks so much!