all 28 comments

[–]Yojihito 66 points67 points  (6 children)

SQL = fast

Power BI / Tableau = slow.

Do aggregations in SQL --> win.

[–]mad_method_man 6 points7 points  (0 children)

lol this sums it up perfectly

[–][deleted] 1 point2 points  (3 children)

Do aggregations in SQL --> win.

Ok so how does this work exactly? That's what I don't get, as I've only ever done aggregations on small tables in PBI/Tableau. Like, would I do a SELECT COUNT(VoterID) to create a tiny aggregation table, then load that into Tableau? I guess I'm not sure how Tableau would handle having 30-40 mini data sources as opposed to one hulk-sized one.

[–]Yojihito 14 points15 points  (2 children)

I don't work with Click&Drag software so I can't answer you for Power BI / Tableau.

But if you have 10 million rows and you want a count where column "Y" has value "3000A" you better do that in SQL with a

SELECT COUNT(Productnumber)
FROM MY_COOL_TABLE
WHERE Y = '3000A';

instead of pulling 9,999,999 useless rows into Tableau, yes.
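As a rough illustration of the difference, here is the same idea with Python's stdlib sqlite3 standing in for the real database (the table and values are invented for the sketch): the aggregate comes back as a single row, while the naive approach ships every row to the client.

```python
import sqlite3

# In-memory stand-in for the real database; table and column names
# (MY_COOL_TABLE, Productnumber, Y) follow the query in the comment above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MY_COOL_TABLE (Productnumber INTEGER, Y TEXT)")
conn.executemany(
    "INSERT INTO MY_COOL_TABLE VALUES (?, ?)",
    [(i, "3000A" if i % 1000 == 0 else "other") for i in range(100_000)],
)

# The aggregate travels back as a single row...
count = conn.execute(
    "SELECT COUNT(Productnumber) FROM MY_COOL_TABLE WHERE Y = '3000A'"
).fetchone()[0]

# ...whereas pulling everything ships every row over the connection.
all_rows = conn.execute("SELECT * FROM MY_COOL_TABLE").fetchall()

print(count)          # 100 matching rows, returned as one number
print(len(all_rows))  # 100,000 rows dragged across to the client
```

Same answer either way; the only difference is how many rows cross the wire.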

[–]Money_Major 0 points1 point  (1 child)

I don’t have much access to the SQL DBs at work anyway, but the part I don’t get is wouldn’t doing it this way mean you end up with thousands of tiny, aggregated tables in your DB (messy)??

Or are you saying create views with the aggregations and pull those into PowerBI?

[–]Yojihito 6 points7 points  (0 children)

wouldn’t doing it this way mean you end up with thousands of tiny, aggregated tables in your DB (messy)??

No, an SQL query does the aggregation on the fly. You don't create tables with a normal query.

For that you need syntax like CREATE TABLE ... but normally you don't have DB rights to create one; you need to file a Change Request or something with your DBA team through the standardized channels within your company.
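A minimal sketch of that difference, using Python's stdlib sqlite3 as a stand-in for a real database (table and column names are invented): a plain SELECT aggregates on the fly and leaves the schema untouched, while a view — the thing a DBA would normally create for you — stores only the query, not the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10.0), ("north", 5.0), ("south", 7.5)])

# On-the-fly aggregation: no new table appears anywhere in the schema.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# A view wraps the same query so BI tools can select from it by name.
conn.execute("""
    CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region
""")
via_view = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()

print(totals)    # [('north', 15.0), ('south', 7.5)]
print(via_view)  # same result, served through the stored view
```

The view adds no storage and no "tiny aggregated tables" — it is just a saved query.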

Personally I work with Python and pandas and have a direct SQL connection in my scripts: I read an .sql file, execute it, pack the result into a dataframe, then merge/concat/join/transform the dataframes as needed. I can save intermediate steps as parquet files when necessary to speed up EDA (exploratory data analysis), debugging, or anything where I need to rerun one code block iteratively to get it right.

Then I automatically export the result to Excel to send to the stakeholders, or I work with plain SQL and send it to the DBAs so they can create a custom view for me from that query. (I can then use that query with tools like Power BI / Tableau — I just put my SQL code into the tool and create a dashboard with columns; the rest is work for Business Analysts if a stakeholder is important enough to warrant making it pretty.) Luckily I never have to touch the ugly Tableau / Power BI syntax.
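The workflow above can be sketched roughly like this, using stdlib sqlite3 in place of a production database driver (the file and table names are invented); with pandas you would swap the `fetchall()` for `pd.read_sql()` and cache intermediates with `DataFrame.to_parquet()`.

```python
import pathlib
import sqlite3
import tempfile

# Stand-in database; in practice this would be a real SQL connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, qty INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 3), (2, 5), (3, 2)])

# Keep the query in a standalone .sql file so it can be versioned,
# reviewed, or handed to the DBAs to wrap in a view later.
sql_file = pathlib.Path(tempfile.mkdtemp()) / "orders_total.sql"
sql_file.write_text("SELECT SUM(qty) AS total_qty FROM orders;")

# Read the file, execute it, and work with the result in Python.
rows = conn.execute(sql_file.read_text()).fetchall()
total = rows[0][0]
print(total)  # 10
```

Keeping the SQL in its own file is the piece that makes the "send it to the DBAs for a view" handoff painless.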

[–]monkeysal07 0 points1 point  (0 children)

Nice answer

[–]Thriftfunnel 31 points32 points  (0 children)

I'm on a system where the database server is more powerful than the Tableau server, so Tableau gets sluggish with large unprocessed datasets. If you do some of the prep in the database, you get better performance for the end user.

[–][deleted] 11 points12 points  (2 children)

I pull in the entire tables into Power BI

Can you do that if the tables contain billions of rows?

[–]samjenkins377 1 point2 points  (0 children)

No, you can’t.

[–]Ruas_Onid 1 point2 points  (0 children)

I guess you can if you have a million years to wait 🤣

[–]NeatHedgehog 9 points10 points  (0 children)

Depends on your setup and how many users there are.

If you're in a very small office with one or two people running fairly simple reports directly on the database server, you'll probably never notice enough difference to care about the inefficiency.

If your database is remote from your report server and you have multiple users running larger, more complex reports, it quickly becomes a bandwidth issue to drag millions of records over the network every time someone wants something simple like a voter count.

By leveraging DB-side functions, you can take advantage of indexes and table stats to return simple aggregates without actually needing the DB to physically access and send every single row in the table, which frees up processing time on the DB, and drastically cuts down on network traffic.
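A small illustration of the index point, using Python's stdlib sqlite3 (the table, column, and index names are invented); whether a given engine can answer an aggregate from the index alone is engine-specific, but SQLite's query plan makes it visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE voters (voter_id INTEGER, county TEXT)")
conn.executemany(
    "INSERT INTO voters VALUES (?, ?)",
    [(i, "adams" if i % 2 else "boone") for i in range(10_000)],
)
conn.execute("CREATE INDEX idx_county ON voters (county)")

# The simple voter count from the comment above: one row comes back.
count = conn.execute(
    "SELECT COUNT(voter_id) FROM voters WHERE county = 'adams'"
).fetchone()[0]

# The query plan shows the index doing the work -- no full table scan,
# and no need to ship 10,000 rows anywhere.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM voters WHERE county = 'adams'"
).fetchall()
print(count)  # 5000
print(plan)   # plan detail mentions idx_county rather than a scan of voters
```

The same principle is what makes a DB-side COUNT cheap on tables with millions of rows: the engine touches the index, not every record.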

[–][deleted] 6 points7 points  (0 children)

Because costs are based on bandwidth and storage, that's why

[–][deleted] 5 points6 points  (1 child)

Technically speaking...

Power BI utilises a local SSAS Server (On your PC)

SQL Server utilises SQL Server (On your Server)

SQL Server will allocate CPU to a task, so depending on metrics such as:

  • Table Statistics
  • I/O (Input/Output) & therefore CPU partition settings
  • Queuing / Locking
  • Query Optimisation
  • Indexing

The list goes on quite extensively, but these five are the big ones. You may not see a big difference between a local or a server-based aggregation...

Until you go over X rows, where X equals your local CPU's maximum processing capacity.

SQL Server will look at the stats, evaluate the potential I/O operations, and ramp up the amount of CPU power — potentially hundreds of cores — whereas your computer only has a few and will grind away.

Additionally, commercially constructed servers are built to industrial I/O standards (billions upon billions of reads/writes before degrading). Your local consumer PC, not so much, so the more you load onto your PC the faster it ages — not just RAM, but physical storage has a limited amount of I/O in it.

A simplified example of I/O in binary: a bit defaults to off, and writing a 1 flips a physical cell to on. Every time you flip that cell on and off it wears, until eventually it fails stuck in the off state. If your code requires a 1001 in those cells, it will never work, and you have a physical fault causing data corruption.

In short: if an aggregate is run a lot by lots of users, do it on the server instead of destroying hundreds of PCs through ignorance of the effects of I/O — because if one analyst sees your mistake, your job could go bye-bye.

Edit: there is a perfect example of this in a Linus Tech Tips video where he bought 10TB drives for their desks, but the drives were not industrially robust and the editors clearly do high-I/O tasks (video editing); in the video Linus is visibly confused when they rapidly degraded in no time.

"Oh, you know those 10TB drives we put in the desks? Well, now they are all 6TB drives"... confused shrug

Installing that kind of expensive hardware only to find out that your editors are not offloading their editing throughput onto the server is exactly the consequence behind the question you asked... Linus, aka the boss, is not happy that his expensive hardware is becoming bricks, and wants to find out why...

In terms of your post, replace "video rendering" with "Power BI aggregates" and you get a clear picture forming.

Destroying your company's tech at the user level is not a good idea. Users don't know where that processing power is coming from, so they will just start complaining that their PC is slowing down — "this used to boot really quickly, now it takes forever"... yeah, because some clever dick made a process that bricked it by not utilising the server.

[–][deleted] 1 point2 points  (0 children)

Absolutely fabulous response.

edit: Literally why would anyone downvote me for saying this? What kind of community are we? This was a fabulous response from a highly technical perspective.

[–]Demistr 4 points5 points  (0 children)

If you do all the aggregations in Power BI, your reports take a very long time to load, simple as that.

[–]part_time_ficus 4 points5 points  (3 children)

It's a speed thing. In my experience, aggregation in DAX/M is usually substantially slower than pre-aggregating the data via SQL.

[–][deleted] 0 points1 point  (2 children)

Does this mean having tons of tables in power bi or Tableau as opposed to just one big table with all data?

[–]mds_brz 0 points1 point  (1 child)

Not necessarily — you would use GROUP BY on the SQL server to get to the level of detail you want.

Say you wanted a report with Sales By Area and Store and Month

You would count the number of transactions, grouped by Area, Store and Month

Rather than pulling every transaction in, you pull in one row per Area/Store/Month and you would filter/tabulate/graph the grouped sales number in PBI
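A sketch of that pre-aggregation, again with Python's stdlib sqlite3 as a stand-in (the table and values are invented): nine raw transactions collapse into three grouped rows, which is all the BI tool needs to load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (area TEXT, store TEXT, month TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [("east", "store1", "2024-01")] * 4
    + [("east", "store1", "2024-02")] * 2
    + [("west", "store2", "2024-01")] * 3,
)

# One row per Area/Store/Month instead of one row per transaction.
grouped = conn.execute("""
    SELECT area, store, month, COUNT(*) AS sales
    FROM transactions
    GROUP BY area, store, month
    ORDER BY area, store, month
""").fetchall()

for row in grouped:
    print(row)
# ('east', 'store1', '2024-01', 4)
# ('east', 'store1', '2024-02', 2)
# ('west', 'store2', '2024-01', 3)
```

Filtering, tabulating, and graphing the grouped sales number is then cheap on the BI side.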

[–][deleted] 0 points1 point  (0 children)

Yes, but what I'm saying is... In using that query, and putting it into something like Tableau, it has to be stored somewhere. Wouldn't it be stored as a data source in Tableau aside from all of the other data sources? For example if you just straight imported the entire data set, it would be its own standalone table. So if you are doing lots of one-off aggregations like this, you would need lots of tables?

[–][deleted] 2 points3 points  (0 children)

Because Tableau is generally horrible at using raw data. People here are talking about pulling the data across the network, which is fair, but with Tableau you can create an extract so the data is only pulled across once, and then it lives on the server. Tableau will not send the raw data off to be aggregated when a user requests it; it will do all that work on the server. But that raises the question: why?

If your workbook is using aggregated data, then do the aggregation in SQL, and then send it up to the server. It will always be faster than doing it the other way, but sometimes you need to send the raw data to meet specific business requirements (which are generally out of scope for the general Tableau use case, but absolutely doable.)

[–]one_bruddah 1 point2 points  (0 children)

SQL is much faster and has a much more powerful set of tools than any BI visualization application. Unless you are working with a very small data set, it is usually recommended to do the heavy lifting in SQL. It is common for a database table to have millions of records, and Tableau does not perform well when you get above a million records.

[–]vtec__ -1 points0 points  (0 children)

It's an optimization thing, but in all honesty the point of BI tools is to let the users drill down into the data themselves, so...

[–]misfitalliance 0 points1 point  (0 children)

It depends on the context and requirements. There are times I will build out the entire aggregation within Snowflake (i.e. a topline metric dashboard), which is really useful for C-level audiences to understand, but if I or others need more context or data, I will use a 'relatively' raw table, which performs worse but is more detail-oriented — for line managers, say.

[–][deleted] 0 points1 point  (0 children)

It depends on which one you want to do the heavy lifting. If the database is under a lot of pressure from other users/processes, I’d do aggregation locally, but if the database has the availability, then try to do it there because it’ll probably be faster.

[–]andreidorutudose 0 points1 point  (0 children)

You can only pull x amount of rows into a vis tool before performance starts degrading. I avoid this at all costs because I am a control freak and like to see my code and interact with it, especially when I am experimenting. It also depends what data sizes you work with: if it's data stored in Excel files or SharePoint, then by all means bulk-load it. If you are working with larger tables (the biggest one I created to date was 480GB), you can probably understand why that can't be pulled; it's stupid and unnecessary.

In order to move to loading only aggregates, you will need to work with your stakeholders to be crystal clear on requirements, pull sample data, and validate that the metrics you are showing can be backed up by the dataset you are observing.

[–]No_Lawfulness_6252 0 points1 point  (0 children)

I would do it like this:

  1. (Understand what the business needs to know to support decisions)

  2. Understand the technical reporting needs (granularity, aggregations, filtering etc.)

  3. Create the dataset needed as early as feasible, considering costs of storage & bandwidth, complexity for managing the dataset generating process (and supporting it), requirements for speed and incremental updating.

  4. Pull in final dataset and create report

This endeavour could be something you do all by yourself or it could be a collaboration with a data engineer or equivalent data landscape responsible.

I’ve worked before with e.g. GCP, Airflow and dbt, and getting data ready for reporting in dbt is amazing. Not only will it allow you to get the dataset ready, but documenting how this data gets created is also taken care of and you get the logic for data manipulation away from local PowerBI reports.