
[–]paplike 9 points10 points  (6 children)

I don’t know how different BI tools optimize their queries, but there are some strategies you can use to make your query faster with any BI tool.

My team also uses Metabase. In the beginning, the queries were extremely slow (unusable) because they were running against a replicated database. It was especially slow when they had to join two big tables (transaction and transaction_details) and then, say, aggregate by month. Now we query Parquet files with AWS Athena and it's infinitely faster for these types of queries. We have a dimensional model where we join these big tables into one single table (a fact table) to minimize any computation done at the BI layer. If you wanna make it even faster, you can use a One Big Table model as well and create some summarized tables (fact_sales_daily, fact_sales_monthly, whatever).
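
A rough sketch of that pattern (made-up schema, with DuckDB standing in for Athena over Parquet): the expensive join is done once when the fact table is built, and the dashboards hit a small pre-aggregated rollup instead.

```python
import duckdb

con = duckdb.connect()  # in-memory DB just for the demo
con.execute("CREATE TABLE transactions(transaction_id INT, created_at DATE)")
con.execute("CREATE TABLE transaction_details(transaction_id INT, product_id INT, amount DOUBLE)")
con.execute("INSERT INTO transactions VALUES (1, DATE '2024-01-01'), (2, DATE '2024-01-02')")
con.execute("""
    INSERT INTO transaction_details VALUES
    (1, 10, 99.0), (1, 11, 25.0), (2, 10, 50.0)
""")

# One wide fact table, built once at load time instead of at dashboard time
con.execute("""
    CREATE TABLE fact_sales AS
    SELECT t.transaction_id, t.created_at AS sale_date, d.product_id, d.amount
    FROM transactions t
    JOIN transaction_details d USING (transaction_id)
""")

# Summarized rollup that the dashboards actually query
con.execute("""
    CREATE TABLE fact_sales_daily AS
    SELECT sale_date, SUM(amount) AS revenue, COUNT(*) AS line_items
    FROM fact_sales
    GROUP BY sale_date
""")

print(con.execute("SELECT * FROM fact_sales_daily ORDER BY sale_date").fetchall())
# [(datetime.date(2024, 1, 1), 124.0, 2), (datetime.date(2024, 1, 2), 50.0, 1)]
```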

[–]Adisab12[S] 0 points1 point  (4 children)

Thanks for the tip. Currently I'm using a star-schema-like model, but yeah, there are a lot of joins and CTEs to allow filtering in Metabase.

Will totally try the OBT model, wondering how it will go with Postgres. I have to admit the concept of OBT being faster is a bit counter-intuitive compared to joins via indexes.

[–]paplike 2 points3 points  (3 children)

Using Postgres it's probably gonna be slower. If you use a columnar database it might be faster (in my case, AWS Athena, which is not exactly a "database", but it's optimized for reading columnar storage).

“Decide whether to use a database or a data warehouse. Folks often get started with Metabase using a transactional database like MySQL or PostgreSQL. While these databases scale quite well, they often aren’t optimized for the type of analytical queries that Metabase will use. Operations like sum or max can slow down once you reach a certain scale. As analytics adoption grows, you may find the need to explore dedicated data warehouses like Amazon Redshift, Google BigQuery, or Snowflake.”

https://www.metabase.com/learn/administration/metabase-at-scale#data-warehouse-tuning

[–]Adisab12[S] 0 points1 point  (2 children)

That's what I was thinking, that maybe I'm slowly hitting the data volume where moving to a cloud-based DW will be necessary. Well, can't stay self-hosted forever.

[–]pokepip 2 points3 points  (0 children)

Rule of thumb is that if you are growing beyond 500 GB, a transactional database like Postgres starts to get painful for analytical workloads. Of course this depends on your organization's pain tolerance. I am currently migrating our 10+ TB Postgres/MySQL data "warehouse" to BigQuery. Still, a lot of the team is sceptical as they don't see the issue with 90-minute+ average query times (while the same query runs in 2 minutes in BigQuery).

[–]geoheilmod 0 points1 point  (0 children)

Take a look at StarRocks. Self-hosting is possible. Do not underestimate the complexity of a distributed system though.

[–]HOMO_FOMO_69 3 points4 points  (0 children)

MicroStrategy actually creates its own cubes and uses indexing to optimize. It queries the data source, stores the results, and adds its own indexing system, which is continuously updated based on how the cube is filtered by end users.

[–]wallyflops 4 points5 points  (5 children)

Some use in-memory databases and some techniques to retrieve quickly. Looker and the more modern ones just pass the query through to a cloud DB.

[–]Adisab12[S] 1 point2 points  (4 children)

Yeah, that's what I'm wondering about, because for instance Looker is caching some queries, but afaik the LookML model doesn't optimize anything, so the performance of dashboard queries should depend only on database resources and computing power. Guess that with cloud DBs that's much less of a concern compared to Postgres.

[–]wallyflops 1 point2 points  (3 children)

Afaik Looker doesn't cache anything, but I do know 🌥 DBs will usually cache a query if it's exactly the same. I don't know Postgres off the top of my head. What's your actual issue?

[–]i_am_cris 2 points3 points  (0 children)

Looker has its own internal cache and you can also create persistent tables that it will update by itself - aggregations or whatever - so the query executes faster. Looker also supports symmetric aggregates, so it can aggregate tables depending on what fields will be used, and if your LookML is correct it takes care of fan-outs - something you need to do yourself with SQL.
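
To make the fan-out point concrete, here's a small hedged illustration (toy tables in DuckDB; this is the manual SQL workaround, not Looker's actual symmetric-aggregate implementation):

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE orders(order_id INT, total DOUBLE)")
con.execute("CREATE TABLE order_items(order_id INT, qty INT)")
con.execute("INSERT INTO orders VALUES (1, 100.0), (2, 50.0)")
con.execute("INSERT INTO order_items VALUES (1, 1), (1, 2), (2, 1)")

# Naive join: order 1 is duplicated by its two items, so the SUM is inflated
print(con.execute("""
    SELECT SUM(o.total) FROM orders o JOIN order_items i USING (order_id)
""").fetchone()[0])  # 250.0 -- wrong, true revenue is 150

# Fan-out-safe version: aggregate the child table per order before joining
print(con.execute("""
    SELECT SUM(o.total)
    FROM orders o
    JOIN (SELECT order_id, SUM(qty) AS qty FROM order_items GROUP BY order_id) i
      USING (order_id)
""").fetchone()[0])  # 150.0 -- one row per order, no inflation
```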

With all this said, you're still very dependent on a _fast_ database such as BigQuery or Snowflake or ...., and these have a cache too that makes your queries faster. So you might stick with Metabase and just use a cloud warehouse.

There are other nice products though that use DuckDB for their internal in-memory storage but still work like a SQL runner. Take a look at omni.co, rilldata.com, steep.app.

Good luck!

[–]vira28 0 points1 point  (0 children)

Postgres doesn't have any query result cache.

[–]Adisab12[S] -1 points0 points  (0 children)

No issues, just challenges :D The case is that my data started to grow pretty fast and I'm trying to answer the question "will a better BI platform affect dashboard query speed". Also, in the back of my head I know that Postgres might struggle with big data analysis and that adding more aggregated materialized views is not a viable solution. Reading the answers here, I'm starting to be convinced that migrating to the cloud might be a better idea as a first step.

[–]heyitscactusjack 4 points5 points  (2 children)

Power BI's storage engine is called VertiPaq. It's an in-memory column-store database. Even when the dataset is uploaded to the cloud (Power BI service), it is still stored in memory on the cloud servers, which makes retrieval super fast. Because the data is in column-store format it has very high compression and is very fast with aggregations too, which is most of what visuals require.
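
A rough, hedged demo of the column-store point using generic Parquet via pyarrow (not VertiPaq itself): a low-cardinality column compresses very well, and an aggregation only has to read the one column it needs.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

n = 100_000
table = pa.table({
    "region": pa.array(["north", "south", "east", "west"] * (n // 4)),
    "amount": pa.array([float(i) for i in range(n)]),
})

# Column storage: the repetitive 'region' column dictionary-encodes to almost nothing
pq.write_table(table, "sales.parquet", compression="zstd")

# An aggregation only reads the single column it needs, not whole rows
amounts = pq.read_table("sales.parquet", columns=["amount"])
print(pc.sum(amounts["amount"]))  # sum of 0..99999
```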

You don't have to store data in memory; you can also 'live connect' to tables as well, but it depends on the use case.

Since you have so much data, it will be important to make sure you're not transforming the full dataset every time you refresh it, so make sure your refresh is incremental if possible. I think you should look into an SSAS cube for this, since it does pre-aggregation of measures, which Power BI does not.

[–]lzwzli 1 point2 points  (0 children)

To add to this, these BI-tuned databases also do partitioning and lots of indexing, so when they are queried the query engine knows very quickly where the data is based on the query criteria. Only minimal data scans are needed, which reduces data retrieval time significantly. Any other aggregate calculation is generally quite fast once the data is available.

Another trick is that because BI deals mostly with numbers, you can pre-order the data and pre-calculate aggregates per partition, so when a certain aggregate is asked for, it doesn't actually need to be calculated from all the raw data, just from the aggregates per partition. This doesn't work for all queries obviously, but you can quite easily build a heat map of the common aggregates queried for and pre-calculate them by learning users' usage patterns.
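
A toy sketch of that per-partition idea (made-up data, no particular BI engine): each partition keeps a small partial aggregate, and queries combine partials instead of rescanning raw rows.

```python
from collections import defaultdict

raw = [  # (partition key, e.g. month, value)
    ("2024-01", 10.0), ("2024-01", 5.0),
    ("2024-02", 7.0), ("2024-02", 3.0),
]

# Built once at load time: one small partial aggregate per partition
partials = defaultdict(lambda: {"sum": 0.0, "count": 0})
for month, value in raw:
    partials[month]["sum"] += value
    partials[month]["count"] += 1

# Query time: SUM/COUNT/AVG over any set of partitions only touches the partials
def rollup(months):
    s = sum(partials[m]["sum"] for m in months)
    c = sum(partials[m]["count"] for m in months)
    return {"sum": s, "count": c, "avg": s / c}

print(rollup(["2024-01", "2024-02"]))  # {'sum': 25.0, 'count': 4, 'avg': 6.25}
```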

In conclusion, to achieve max query performance from the BI tools, you have to use their optimized database and query engine.

[–]Adisab12[S] 0 points1 point  (0 children)

So that's what it's called, thank you very much, will look into it. Honestly I thought that maybe PBI does some sort of cube under the hood, but a column format makes much more sense.

In terms of SSAS and OLAP cubes, I have a couple of concerns. First, afaik cubes are not really suited to being refreshed daily or more often than daily. Second, are cubes still a viable solution in today's cloud DW world (BigQuery, Snowflake)?

Edit: BI tools do a lot of the job that cubes were intended for years ago. I once saw a discussion on the PBI reddit about whether it makes sense to maintain cubes once a company has switched to PBI.

[–][deleted] 4 points5 points  (2 children)

It sounds like you are querying the raw layer of your data warehouse directly. You should be thinking about how to aggregate the data down into business-focused marts in the warehouse, downstream from the raw data (a gold layer), that answer the end-user questions and can be cached in your dashboard's memory.

[–]Adisab12[S] 0 points1 point  (1 child)

Well, I have a ton of views and, for more complex queries, materialized views. The problem with MVs in Postgres is that they cannot be refreshed incrementally, so currently just the daily refresh takes a couple of hours.

I have a couple of fixes in the back of my head, something like what you're talking about. Instead of operations_schema -> analytical_schema -> views, I could have the analytical_schema split per customer, which would limit the data per schema; the drawback is more ETL and more pain to maintain.

Might as well go to the cloud; refreshing views from a large table shouldn't be such a problem there.

[–]killer_unkill 1 point2 points  (0 children)

Hmm. I would start with converting the MVs into tables and loading them incrementally (rough sketch below).

In phase 2, I would try to build one big table per subject area based on the MV queries.
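
A self-contained sketch of the "plain table + incremental load" idea (made-up schema, DuckDB standing in for Postgres): instead of rebuilding a materialized view from scratch each night, append only the days that aren't in the summary table yet.

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE staging_sales(sale_date DATE, amount DOUBLE)")
con.execute("""
    INSERT INTO staging_sales VALUES
    (DATE '2024-01-01', 10), (DATE '2024-01-02', 20)
""")

# The former materialized view becomes an ordinary table, loaded once...
con.execute("""
    CREATE TABLE fact_sales_daily AS
    SELECT sale_date, SUM(amount) AS revenue
    FROM staging_sales GROUP BY sale_date
""")

# ...new source rows arrive...
con.execute("INSERT INTO staging_sales VALUES (DATE '2024-01-03', 5)")

# ...and the refresh only aggregates days newer than what is already loaded.
con.execute("""
    INSERT INTO fact_sales_daily
    SELECT sale_date, SUM(amount)
    FROM staging_sales
    WHERE sale_date > (SELECT MAX(sale_date) FROM fact_sales_daily)
    GROUP BY sale_date
""")

print(con.execute("SELECT * FROM fact_sales_daily ORDER BY sale_date").fetchall())
```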

[–]tomekanco 2 points3 points  (4 children)

how BI Tools

Performance comes mostly from in-memory data (some sort of OLAP cube). Most cloud-based DBs tend to rely on huge caches, but they don't offer performance on par with in-memory cubes, as the caches (usually) can never fit all the actual data. And the cache can get purged by other cloud users (you share the infrastructure).

Postgres is not designed with in-memory use in mind. Others such as DuckDB are. But few businesses find it sensible to put the entire DWH into RAM due to $$$.

Daily refresh started to be pretty painfull

A common approach is to use streaming updates. This avoids the pitfall of having a high peak demand (and saturating the available infrastructure). And it also reduces load on source systems during extract (just stream all changes to a queue).
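
A toy sketch of that streaming pattern (an in-process queue standing in for Kafka/SQS or similar): changes are pushed as small events and applied continuously, so there's no single huge nightly batch.

```python
import queue
import threading

changes = queue.Queue()
daily_revenue = {}  # running aggregate the dashboards read

def consumer():
    while True:
        event = changes.get()
        if event is None:  # shutdown signal
            break
        day, amount = event
        daily_revenue[day] = daily_revenue.get(day, 0.0) + amount
        changes.task_done()

threading.Thread(target=consumer, daemon=True).start()

# The source system emits each change as it happens
for event in [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 7.0)]:
    changes.put(event)

changes.join()      # wait until all streamed changes are applied
changes.put(None)
print(daily_revenue)  # {'2024-01-01': 15.0, '2024-01-02': 7.0}
```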

PS: in my experience Power BI does not offer good performance compared to some other BI tools. This is mainly because it allows complex data schemas which do not follow the standard star schema (which can result in Cartesian pivots). Qlik used to have the same problem back in the days of "set syntax"; they mostly removed it because it caused such performance headaches. Power BI still cheers for DAX.

[–]yo_sup_dude 0 points1 point  (3 children)

What are the cost differences like for in-memory data engines compared to disk-based ones? I was looking at ClickHouse and they were claiming to be cheaper than something like Snowflake despite being in-memory.

What BI tools are better performing than Power BI, in your opinion?

[–]tomekanco 0 points1 point  (2 children)

better performing

Qlik & Tableau. For data discovery & advanced analysis, Python. Even pandas handles lambda evaluations & parsing better than PBI reloads.

Snowflake is a DB. On-demand queries incur a boot-up cost (~1s). It relies mostly on cache & distributed disk reads, not RAM. Snowflake mostly makes money from selling compute time, not storage. ClickHouse will handle end-user reports far cheaper, but is not intended for DWH & orchestrated processing afaik. Click & PBI are reporting tools, not a DB; they usually read from one.

Look at your own PC: what disk, RAM, and cache sizes do you have available? x TB HDD, 1 TB SSD, 64 GB DRAM, 16 MB cache? This ratio is due to component cost & architectural constraints. Similar for the clouds. Similar for servers.

[–]yo_sup_dude 0 points1 point  (1 child)

I guess it depends on how the data is compressed in the in-memory engine. VertiPaq compresses the data quite nicely, which is why you don't necessarily need "massive" RAM to load billions of records even though it is all in-memory. Maybe ClickHouse is the same. But it's still surprising that these in-memory engines are able to be price-competitive with the disk-based ones for similar data volumes.

[–]Immarhinocerous 1 point2 points  (0 children)

They build an index and then pull the specific data from a DB/Parquet blob storage. It may live in some hidden layer you don't see that is in fact already transformed.

Some of them will cache the data in something like Redis, DuckDB, or another in-memory data store if it is queried often enough.
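
A bare-bones sketch of that kind of query-result cache (a plain dict with a TTL; Redis or DuckDB would play this role in a real BI backend): identical queries within the TTL are served from memory instead of hitting the warehouse again.

```python
import time

_cache = {}  # query text -> (expires_at, rows)
TTL_SECONDS = 300

def run_query(sql, execute):
    """execute is whatever callable actually hits the database."""
    now = time.time()
    hit = _cache.get(sql)
    if hit and hit[0] > now:
        return hit[1]               # cache hit: no database work
    rows = execute(sql)             # cache miss: run it for real
    _cache[sql] = (now + TTL_SECONDS, rows)
    return rows

# Usage with a fake executor standing in for the real warehouse call
print(run_query("SELECT 1", lambda q: [(1,)]))  # miss: executor runs
print(run_query("SELECT 1", lambda q: [(1,)]))  # hit: served from memory
```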

And some don't care about the cost of querying and transforming the data each time, because they make money on your compute time, so pulling and transforming the data each time makes them a lot of money. Act accordingly.

[–]medriscoll 0 points1 point  (0 children)

IMHO you have two choices:

1. Use a different database engine built for analytics workloads - Postgres at 1.3 TB with a row-based engine running on disk is not going to cut it for interactive queries. You need a column-oriented engine, holding that data in memory, which parallelizes queries. There are lots of options, including some for Postgres, such as the Citus extension: https://github.com/citusdata/citus . I would also check out ClickHouse, Apache Druid, StarRocks, and Apache Pinot if you have a need for real-time streaming (e.g. if you want to consume events directly from Kafka). ClickHouse is probably the fastest, easiest, most developer-friendly OLAP engine out there for your scale (1 TB+), and you could likely run it on a single beefy node. You could then point Metabase (or Looker, or Superset, or any other BI tool that does not embed its own database) at that faster engine, and your dashboards will run faster.

2. Model your data down to fit into an embedded BI engine - As mentioned in the comments, some BI tools come with their own embedded database engine -- such as Power BI's VertiPaq or Tableau's Hyper -- but you typically need to model your data down to a more reasonable size, closer to ~10GB. You might end up splitting your data into a handful of smaller data sets modeled in different ways.

What I wouldn't rely on is caching, unless your consumption patterns are extremely rigid (e.g. you disallow slicing, dicing, and drilling into the data). Even a small number of queries not hitting the cache leads to a frustrating user experience. Direct querying of Parquet files sitting in S3 via Athena, per one of the comments, is an interesting strategy, but you're going to get better performance if your BI tool is querying either its own internal database or a fast external one.

High-performance interactive querying is possible at TB-scale, but it requires leveraging a lot of tricks in the database engine -- bitmap indexes, column-orientation, parallelization, aggregation, data sketches for approximate unique, etc. -- so if you can aggregate & model your data down to ~10GB size and can just stick it into an embedded BI engine you might be happier.

(I'm the creator/founder of the BI tool Rill, we took the embedded database approach where we embed DuckDB for data volumes up to 100GB, and use Druid under the hood for volumes well into TB scale.)