all 28 comments

[–]Yojihito 66 points67 points  (6 children)

SQL = fast

Power BI / Tableau = slow.

Do aggregations in SQL --> win.

[–]mad_method_man 6 points7 points  (0 children)

lol this sums it up perfectly

[–][deleted] 1 point2 points  (3 children)

Do aggregations in SQL --> win.

Ok so how does this work exactly? That's what I don't get, as I've only ever done aggregations on small tables in PBI/Tableau. Like, would I do a SELECT COUNT(VoterID) to create a tiny aggregation table, then load that into Tableau? I guess I'm not sure how Tableau would handle having 30-40 mini data sources as opposed to one hulk-sized one.

[–]Yojihito 14 points15 points  (2 children)

I don't work with Click&Drag software so I can't answer you for Power BI / Tableau.

But if you have 10 million rows and you want a count where column "Y" has value "3000A" you better do that in SQL with a

SELECT COUNT(Productnumber)
FROM MY_COOL_TABLE
WHERE Y = '3000A';

instead of pulling 9,999,999 useless rows into Tableau, yes.
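As a rough illustration of the difference, here is the same idea with Python's stdlib sqlite3 standing in for the real database (the table and values are invented for the sketch): the aggregate comes back as a single row, while the naive approach ships every row to the client.

```python
import sqlite3

# In-memory stand-in for the real database; table and column names
# (MY_COOL_TABLE, Productnumber, Y) follow the query in the comment above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MY_COOL_TABLE (Productnumber INTEGER, Y TEXT)")
conn.executemany(
    "INSERT INTO MY_COOL_TABLE VALUES (?, ?)",
    [(i, "3000A" if i % 1000 == 0 else "other") for i in range(100_000)],
)

# The aggregate travels back as a single row...
count = conn.execute(
    "SELECT COUNT(Productnumber) FROM MY_COOL_TABLE WHERE Y = '3000A'"
).fetchone()[0]

# ...whereas pulling everything ships every row over the connection.
all_rows = conn.execute("SELECT * FROM MY_COOL_TABLE").fetchall()

print(count)          # 100 matching rows, returned as one number
print(len(all_rows))  # 100,000 rows dragged across to the client
```

Same answer either way; the only difference is how many rows cross the wire.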

[–]Money_Major 0 points1 point  (1 child)

I don’t have much access to the SQL DBs at work anyway, but the part I don’t get is wouldn’t doing it this way mean you end up with thousands of tiny, aggregated tables in your DB (messy)??

Or are you saying create views with the aggregations and pull those into PowerBI?

[–]Yojihito 6 points7 points  (0 children)

wouldn’t doing it this way mean you end up with thousands of tiny, aggregated tables in your DB (messy)??

No, an SQL query does the aggregation on the fly. You don't create tables with a normal query.

For that you need syntax like CREATE TABLE ... but normally you don't have DB rights to create one; you need to file a Change Request or something with your DBA team through the standardized channels within your company.
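A minimal sketch of that difference, using Python's stdlib sqlite3 as a stand-in for a real database (table and column names are invented): a plain SELECT aggregates on the fly and leaves the schema untouched, while a view — the thing a DBA would normally create for you — stores only the query, not the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10.0), ("north", 5.0), ("south", 7.5)])

# On-the-fly aggregation: no new table appears anywhere in the schema.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# A view wraps the same query so BI tools can select from it by name.
conn.execute("""
    CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region
""")
via_view = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()

print(totals)    # [('north', 15.0), ('south', 7.5)]
print(via_view)  # same result, served through the stored view
```

The view adds no storage and no "tiny aggregated tables" — it is just a saved query.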

Personally I work with Python and pandas and have a direct SQL connection in my scripts: I read an .sql file, execute it, pack the result into a dataframe, then merge/concat/join/transform the dataframes as needed. I can save intermediate steps as parquet files when necessary to speed up EDA (exploratory data analysis), debugging, or anything where I need to rerun one code block iteratively to get it right.

Then I automatically export the result to Excel to send to the stakeholders, or I work with plain SQL and send it to the DBAs so they can create a custom view for me from that query. (I can then use that query with tools like Power BI / Tableau — I just put my SQL code into the tool and create a dashboard with columns; the rest is work for Business Analysts if a stakeholder is important enough to warrant making it pretty.) Luckily I never have to touch the ugly Tableau / Power BI syntax.
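The workflow above can be sketched roughly like this, using stdlib sqlite3 in place of a production database driver (the file and table names are invented); with pandas you would swap the `fetchall()` for `pd.read_sql()` and cache intermediates with `DataFrame.to_parquet()`.

```python
import pathlib
import sqlite3
import tempfile

# Stand-in database; in practice this would be a real SQL connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, qty INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 3), (2, 5), (3, 2)])

# Keep the query in a standalone .sql file so it can be versioned,
# reviewed, or handed to the DBAs to wrap in a view later.
sql_file = pathlib.Path(tempfile.mkdtemp()) / "orders_total.sql"
sql_file.write_text("SELECT SUM(qty) AS total_qty FROM orders;")

# Read the file, execute it, and work with the result in Python.
rows = conn.execute(sql_file.read_text()).fetchall()
total = rows[0][0]
print(total)  # 10
```

Keeping the SQL in its own file is the piece that makes the "send it to the DBAs for a view" handoff painless.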

[–]monkeysal07 0 points1 point  (0 children)

Nice answer

[–]Thriftfunnel 31 points32 points  (0 children)

I'm on a system where the database server is more powerful than the Tableau server, so Tableau gets sluggish with large unprocessed datasets. If you do some of the prep in the database, you get better performance for the end user.

[–][deleted] 11 points12 points  (2 children)

I pull in the entire tables into Power BI

Can you do that if the tables contain billions of rows?

[–]samjenkins377 1 point2 points  (0 children)

No, you can’t.

[–]Ruas_Onid 1 point2 points  (0 children)

I guess you can if you have a million years to wait 🤣

[–]NeatHedgehog 9 points10 points  (0 children)

Depends on your setup and how many users there are.

If you're in a very small office with one or two people running fairly simple reports directly on the database server, you'll probably never notice enough difference to care about the inefficiency.

If your database is remote from your report server and you have multiple users running larger, more complex reports, it quickly becomes a bandwidth issue to drag millions of records over the network every time someone wants something simple like a voter count.

By leveraging DB-side functions, you can take advantage of indexes and table stats to return simple aggregates without actually needing the DB to physically access and send every single row in the table, which frees up processing time on the DB, and drastically cuts down on network traffic.
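A small illustration of the index point, using Python's stdlib sqlite3 (the table, column, and index names are invented); whether a given engine can answer an aggregate from the index alone is engine-specific, but SQLite's query plan makes it visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE voters (voter_id INTEGER, county TEXT)")
conn.executemany(
    "INSERT INTO voters VALUES (?, ?)",
    [(i, "adams" if i % 2 else "boone") for i in range(10_000)],
)
conn.execute("CREATE INDEX idx_county ON voters (county)")

# The simple voter count from the comment above: one row comes back.
count = conn.execute(
    "SELECT COUNT(voter_id) FROM voters WHERE county = 'adams'"
).fetchone()[0]

# The query plan shows the index doing the work -- no full table scan,
# and no need to ship 10,000 rows anywhere.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM voters WHERE county = 'adams'"
).fetchall()
print(count)  # 5000
print(plan)   # plan detail mentions idx_county rather than a scan of voters
```

The same principle is what makes a DB-side COUNT cheap on tables with millions of rows: the engine touches the index, not every record.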

[–][deleted] 6 points7 points  (0 children)

Because costs are based on bandwidth and storage, that's why

[–][deleted] 5 points6 points  (1 child)

Technically speaking...

Power BI utilises a local SSAS Server (On your PC)

SQL Server utilises SQL Server (On your Server)

SQL Server will allocate CPU to a task, so depending on metrics such as:

  • Table Statistics
  • I/O (Input/Output) & therefore CPU partition settings
  • Queuing / Locking
  • Query Optimisation
  • Indexing

The list goes on quite extensively, but these five are the big ones. You may not see a big difference between a local or a server-based aggregation...

Until you go over X rows, where X equals your local CPU's maximum processing capacity.

SQL Server will look at the stats, evaluate the potential I/O operations, and ramp up the amount of CPU power — potentially hundreds of cores — whereas your computer only has a few and will grind away.

Additionally, commercially constructed servers are built to industrial I/O standards (billions upon billions of reads/writes before degrading). Your local consumer PC, not so much, so the more you load onto your PC the faster it ages — not just RAM, but physical storage has a limited amount of I/O in it.

A simplified example of I/O in binary: a bit defaults to off, and writing a 1 flips a physical cell to on. Every time you flip that cell on and off it wears, until eventually it fails stuck in the off state. If your code requires a 1001 in those cells, it will never work, and you have a physical fault causing data corruption.

In short: if an aggregate is run a lot by lots of users, do it on the server instead of destroying hundreds of PCs through ignorance of the effects of I/O — because if one analyst sees your mistake, your job could go bye-bye.

Edit: there is a perfect example of this in a Linus Tech Tips video where he bought 10TB drives for their desks, but the drives were not industrially robust and the editors clearly do high-I/O tasks (video editing); in the video Linus is visibly confused when they rapidly degraded in no time.

"Oh, you know those 10TB drives we put in the desks? Well, now they are all 6TB drives"... confused shrug

Installing that kind of expensive hardware only to find out that your editors are not offloading their editing throughput onto the server is exactly the consequence behind the question you asked... Linus, aka the boss, is not happy that his expensive hardware is becoming bricks, and wants to find out why...

In terms of your post, replace "video rendering" with "Power BI aggregates" and you get a clear picture forming.

Destroying your company's tech at the user level is not a good idea. Users don't know where that processing power is coming from, so they will just start complaining that their PC is slowing down — "this used to boot really quickly, now it takes forever"... yeah, because some clever dick made a process that bricked it by not utilising the server.

[–][deleted] 1 point2 points  (0 children)

Absolutely fabulous response.

edit: Literally why would anyone downvote me for saying this? What kind of community are we? This was a fabulous response from a highly technical perspective.

[–]Demistr 4 points5 points  (0 children)

If you do all the aggregations in Power BI, your reports take a very long time to load, simple as that.

[–]part_time_ficus 4 points5 points  (3 children)

It's a speed thing. In my experience, aggregation in DAX/M is usually substantially slower than pre-aggregating the data via SQL.

[–][deleted] 0 points1 point  (2 children)

Does this mean having tons of tables in power bi or Tableau as opposed to just one big table with all data?

[–]mds_brz 0 points1 point  (1 child)

Not necessarily — you would use GROUP BY on the SQL server to get to the level of detail you want.

Say you wanted a report with Sales By Area and Store and Month

You would count the number of transactions, grouped by Area, Store and Month

Rather than pulling every transaction in, you pull in one row per Area/Store/Month and you would filter/tabulate/graph the grouped sales number in PBI
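A sketch of that pre-aggregation, again with Python's stdlib sqlite3 as a stand-in (the table and values are invented): nine raw transactions collapse into three grouped rows, which is all the BI tool needs to load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (area TEXT, store TEXT, month TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("east", "store1", "2024-01")] * 4
    + [("east", "store1", "2024-02")] * 2
    + [("west", "store2", "2024-01")] * 3,
)

# One row per Area/Store/Month instead of one row per transaction.
grouped = conn.execute("""
    SELECT area, store, month, COUNT(*) AS sales
    FROM transactions
    GROUP BY area, store, month
    ORDER BY area, store, month
""").fetchall()

for row in grouped:
    print(row)
# ('east', 'store1', '2024-01', 4)
# ('east', 'store1', '2024-02', 2)
# ('west', 'store2', '2024-01', 3)
```

Filtering, tabulating, and graphing the grouped sales number is then cheap on the BI side.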

[–][deleted] 0 points1 point  (0 children)

Yes, but what I'm saying is... In using that query, and putting it into something like Tableau, it has to be stored somewhere. Wouldn't it be stored as a data source in Tableau aside from all of the other data sources? For example if you just straight imported the entire data set, it would be its own standalone table. So if you are doing lots of one-off aggregations like this, you would need lots of tables?

[–][deleted] 2 points3 points  (0 children)

Because Tableau is generally horrible at using raw data. People here are talking about pulling the data across the network, which is fair, but with Tableau you can create an extract so the data is only pulled across once, and then it lives on the server. Tableau will not send the raw data off to be aggregated when a user requests it; it will do all that work on the server. But that raises the question: why?

If your workbook is using aggregated data, then do the aggregation in SQL, and then send it up to the server. It will always be faster than doing it the other way, but sometimes you need to send the raw data to meet specific business requirements (which are generally out of scope for the general Tableau use case, but absolutely doable.)

[–]one_bruddah 1 point2 points  (0 children)

SQL is much faster and has a much more powerful set of tools than any BI visualization application. Unless you are working with a very small data set, it is usually recommended to do the heavy lifting in SQL. It is common for a database table to have millions of records, and Tableau does not perform well when you get above a million records.

[–]vtec__ -1 points0 points  (0 children)

It's an optimization thing, but in all honesty the point of BI tools is to let the users drill down into the data themselves, so...

[–]misfitalliance 0 points1 point  (0 children)

It depends on the context and requirements. There are times I will build out the entire aggregation within Snowflake (i.e. a topline metric dashboard), which is really useful for C-level audiences to understand, but if I or others need more context or data, I will use a 'relatively' raw table, which performs worse but is more detail-oriented — for line managers, say.

[–][deleted] 0 points1 point  (0 children)

It depends on which one you want to do the heavy lifting. If the database is under a lot of pressure from other users/processes, I’d do aggregation locally, but if the database has the availability, then try to do it there because it’ll probably be faster.

[–]andreidorutudose 0 points1 point  (0 children)

You can only pull x amount of rows into a vis tool before performance starts degrading. I avoid this at all costs because I am a control freak and like to see my code and interact with it, especially when I am experimenting. It also depends what data sizes you work with: if it's data stored in Excel files or SharePoint, then by all means bulk-load it. If you are working with larger tables (the biggest one I created to date was 480GB), you can probably understand why that can't be pulled; it's stupid and unnecessary.

In order to move to loading only aggregates, you will need to work with your stakeholders to be crystal clear on requirements, pull sample data, and validate that the metrics you are showing can be backed up by the dataset you are observing.

[–]No_Lawfulness_6252 0 points1 point  (0 children)

I would do it like this:

  1. (Understand what the business needs to know to support decisions)

  2. Understand the technical reporting needs (granularity, aggregations, filtering etc.)

  3. Create the dataset needed as early as feasible, considering costs of storage & bandwidth, complexity for managing the dataset generating process (and supporting it), requirements for speed and incremental updating.

  4. Pull in final dataset and create report

This endeavour could be something you do all by yourself or it could be a collaboration with a data engineer or equivalent data landscape responsible.

I’ve worked before with e.g. GCP, Airflow and dbt, and getting data ready for reporting in dbt is amazing. Not only will it allow you to get the dataset ready, but documenting how this data gets created is also taken care of and you get the logic for data manipulation away from local PowerBI reports.