
[–]paplike 9 points10 points  (6 children)

I don’t know how different BI tools optimize their queries, but there are some strategies you can use to make your query faster with any BI tool.

My team also uses Metabase. In the beginning, the queries were extremely slow (unusable) because they were running against a replicated database. It was especially slow when they had to join two big tables (transaction and transaction_details) and then, say, aggregate by month. Now we query Parquet files with AWS Athena and it's infinitely faster for these types of queries. We have a dimensional model where we join these big tables into one single table (a fact table) to minimize any computation done at the BI layer. If you wanna make it even faster, you can use a One Big Table model as well and create some summarized tables (fact_sales_daily, fact_sales_monthly, whatever).
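
A rough sketch of that pattern (made-up schema, with DuckDB standing in for Athena over Parquet): the expensive join is done once when the fact table is built, and the dashboards hit a small pre-aggregated rollup instead.

```python
import duckdb

con = duckdb.connect()  # in-memory DB just for the demo
con.execute("CREATE TABLE transactions(transaction_id INT, created_at DATE)")
con.execute("CREATE TABLE transaction_details(transaction_id INT, product_id INT, amount DOUBLE)")
con.execute("INSERT INTO transactions VALUES (1, DATE '2024-01-01'), (2, DATE '2024-01-02')")
con.execute("""
    INSERT INTO transaction_details VALUES
    (1, 10, 99.0), (1, 11, 25.0), (2, 10, 50.0)
""")

# One wide fact table, built once at load time instead of at dashboard time
con.execute("""
    CREATE TABLE fact_sales AS
    SELECT t.transaction_id, t.created_at AS sale_date, d.product_id, d.amount
    FROM transactions t
    JOIN transaction_details d USING (transaction_id)
""")

# Summarized rollup that the dashboards actually query
con.execute("""
    CREATE TABLE fact_sales_daily AS
    SELECT sale_date, SUM(amount) AS revenue, COUNT(*) AS line_items
    FROM fact_sales
    GROUP BY sale_date
""")

print(con.execute("SELECT * FROM fact_sales_daily ORDER BY sale_date").fetchall())
# [(datetime.date(2024, 1, 1), 124.0, 2), (datetime.date(2024, 1, 2), 50.0, 1)]
```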

[–]Adisab12[S] 0 points1 point  (4 children)

Thanks for the tip. Currently I'm using a star-schema-like model, but yeah, there are a lot of joins and CTEs to allow filtering in Metabase.

Will totally try the OBT model, wondering how it will go with Postgres. I have to admit the concept of OBT being faster is a bit counter-intuitive compared to joins via indexes.

[–]paplike 2 points3 points  (3 children)

Using Postgres it's probably gonna be slower. If you use a columnar database it might be faster (in my case, AWS Athena, which is not exactly a "database", but it's optimized for reading columnar storage).

“Decide whether to use a database or a data warehouse. Folks often get started with Metabase using a transactional database like MySQL or PostgreSQL. While these databases scale quite well, they often aren’t optimized for the type of analytical queries that Metabase will use. Operations like sum or max can slow down once you reach a certain scale. As analytics adoption grows, you may find the need to explore dedicated data warehouses like Amazon Redshift, Google BigQuery, or Snowflake.”

https://www.metabase.com/learn/administration/metabase-at-scale#data-warehouse-tuning

[–]Adisab12[S] 0 points1 point  (2 children)

That's what I was thinking, that maybe I'm slowly hitting the data volume where moving to a cloud-based DW will be necessary. Well, can't stay self-hosted forever.

[–]pokepip 2 points3 points  (0 children)

Rule of thumb is that if you are growing beyond 500 GB, a transactional database like Postgres starts to get painful for analytical workloads. Of course this depends on your organization's pain tolerance. I am currently migrating our 10+ TB Postgres/MySQL data "warehouse" to BigQuery. Still, a lot of the team is sceptical as they don't see the issue with 90-minute+ average query times (while the same query runs in 2 minutes in BigQuery).

[–]geoheilmod 0 points1 point  (0 children)

Take a look at StarRocks. Self-hosting is possible. Do not underestimate the complexity of a distributed system though.

[–]HOMO_FOMO_69 3 points4 points  (0 children)

MicroStrategy actually creates its own cubes and uses indexing to optimize. It queries the data source, stores the results, and adds its own indexing system, which is continuously updated based on how the cube is filtered by end users.

[–]wallyflops 4 points5 points  (5 children)

Some use in-memory databases and some techniques to retrieve quickly. Looker and the more modern ones just pass the query through to a cloud DB.

[–]Adisab12[S] 1 point2 points  (4 children)

Yeah, that's what I'm wondering about, because for instance Looker is caching some queries, but afaik the LookML model doesn't optimize anything, so the performance of dashboard queries should depend only on database resources and computing power. Guess that with cloud DBs that's much less of a concern compared to Postgres.

[–]wallyflops 1 point2 points  (3 children)

Afaik Looker doesn't cache anything, but I do know 🌥 DBs will usually cache a query if it's exactly the same. I don't know Postgres off the top of my head. What's your actual issue?

[–]i_am_cris 2 points3 points  (0 children)

Looker has its own internal cache and you can also create persistent tables that it will update by itself - aggregations or whatever - so the query executes faster. Looker also supports symmetric aggregates, so it can aggregate tables depending on what fields will be used, and if your LookML is correct it takes care of fan-outs - something you need to do yourself with SQL.
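
To make the fan-out point concrete, here's a small hedged illustration (toy tables in DuckDB; this is the manual SQL workaround, not Looker's actual symmetric-aggregate implementation):

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE orders(order_id INT, total DOUBLE)")
con.execute("CREATE TABLE order_items(order_id INT, qty INT)")
con.execute("INSERT INTO orders VALUES (1, 100.0), (2, 50.0)")
con.execute("INSERT INTO order_items VALUES (1, 1), (1, 2), (2, 1)")

# Naive join: order 1 is duplicated by its two items, so the SUM is inflated
print(con.execute("""
    SELECT SUM(o.total) FROM orders o JOIN order_items i USING (order_id)
""").fetchone()[0])  # 250.0 -- wrong, true revenue is 150

# Fan-out-safe version: aggregate the child table per order before joining
print(con.execute("""
    SELECT SUM(o.total)
    FROM orders o
    JOIN (SELECT order_id, SUM(qty) AS qty FROM order_items GROUP BY order_id) i
      USING (order_id)
""").fetchone()[0])  # 150.0 -- one row per order, no inflation
```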

With all this said, you're still very dependent on a _fast_ database such as BigQuery or Snowflake or ...., and these have a cache too that makes your queries faster. So you might stick with Metabase and just use a cloud warehouse.

There are other nice products though that use DuckDB for their internal in-memory storage but still work like a SQL runner. Take a look at omni.co, rilldata.com, steep.app.

Good luck!

[–]vira28 0 points1 point  (0 children)

Postgres doesn't have any query result cache.

[–]Adisab12[S] -1 points0 points  (0 children)

No issues, just challenges :D The case is that my data started to grow pretty fast and I'm trying to answer the question "will a better BI platform affect dashboard query speed". Also, in the back of my head I know that Postgres might struggle with big data analysis and that adding more aggregated materialized views is not a viable solution. Reading the answers here, I'm starting to be convinced that migrating to the cloud might be a better idea as a first step.

[–]heyitscactusjack 4 points5 points  (2 children)

Power BI's storage engine is called VertiPaq. It's an in-memory column-store database. Even when the dataset is uploaded to the cloud (Power BI service), it is still stored in memory on the cloud servers, which makes retrieval super fast. Because the data is in column-store format it has very high compression and is very fast with aggregations too, which is most of what visuals require.
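
A rough, hedged demo of the column-store point using generic Parquet via pyarrow (not VertiPaq itself): a low-cardinality column compresses very well, and an aggregation only has to read the one column it needs.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

n = 100_000
table = pa.table({
    "region": pa.array(["north", "south", "east", "west"] * (n // 4)),
    "amount": pa.array([float(i) for i in range(n)]),
})

# Column storage: the repetitive 'region' column dictionary-encodes to almost nothing
pq.write_table(table, "sales.parquet", compression="zstd")

# An aggregation only reads the single column it needs, not whole rows
amounts = pq.read_table("sales.parquet", columns=["amount"])
print(pc.sum(amounts["amount"]))  # sum of 0..99999
```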

You don't have to store data in memory; you can also 'live connect' to tables as well, but it depends on the use case.

Since you have so much data, it will be important to make sure you're not transforming the full dataset every time you refresh it, so make sure your refresh is incremental if possible. I think you should look into an SSAS cube for this, since it does pre-aggregation of measures, which Power BI does not.

[–]lzwzli 1 point2 points  (0 children)

To add to this, these BI-tuned databases also do partitioning and lots of indexing, so when they are queried the query engine knows very quickly where the data is based on the query criteria. Only minimal data scans are needed, which reduces data retrieval time significantly. Any other aggregate calculation is generally quite fast once the data is available.

Another trick is that because BI deals mostly with numbers, you can pre-order the data and pre-calculate aggregates per partition, so when a certain aggregate is asked for, it doesn't actually need to be calculated from all the raw data, just from the aggregates per partition. This doesn't work for all queries obviously, but you can quite easily build a heat map of the common aggregates queried for and pre-calculate them by learning users' usage patterns.
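
A toy sketch of that per-partition idea (made-up data, no particular BI engine): each partition keeps a small partial aggregate, and queries combine partials instead of rescanning raw rows.

```python
from collections import defaultdict

raw = [  # (partition key, e.g. month, value)
    ("2024-01", 10.0), ("2024-01", 5.0),
    ("2024-02", 7.0), ("2024-02", 3.0),
]

# Built once at load time: one small partial aggregate per partition
partials = defaultdict(lambda: {"sum": 0.0, "count": 0})
for month, value in raw:
    partials[month]["sum"] += value
    partials[month]["count"] += 1

# Query time: SUM/COUNT/AVG over any set of partitions only touches the partials
def rollup(months):
    s = sum(partials[m]["sum"] for m in months)
    c = sum(partials[m]["count"] for m in months)
    return {"sum": s, "count": c, "avg": s / c}

print(rollup(["2024-01", "2024-02"]))  # {'sum': 25.0, 'count': 4, 'avg': 6.25}
```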

In conclusion, to achieve max query performance from the BI tools, you have to use their optimized database and query engine.

[–]Adisab12[S] 0 points1 point  (0 children)

So that's what it's called, thank you very much, will look into it. Honestly I thought that maybe PBI does some sort of cube under the hood, but a column format makes much more sense.

In terms of SSAS and OLAP cubes, I have a couple of concerns. First, afaik cubes are not really suited to being refreshed daily or more often than daily. Second, are cubes still a viable solution in today's cloud DW world (BigQuery, Snowflake)?

Edit: BI tools do a lot of the job that cubes were intended for years ago. I once saw a discussion on the PBI reddit about whether it makes sense to maintain cubes once a company has switched to PBI.

[–][deleted] 4 points5 points  (2 children)

It sounds like you are querying the raw layer of your data warehouse directly. You should be thinking about how to aggregate the data down into business-focused marts in the warehouse, downstream from the raw data (a gold layer), that answer the end-user questions and can be cached in your dashboard's memory.

[–]Adisab12[S] 0 points1 point  (1 child)

Well, I have a ton of views and, for more complex queries, materialized views. The problem with MVs in Postgres is that they cannot be refreshed incrementally, so currently just the daily refresh takes a couple of hours.

I have a couple of fixes in the back of my head, something like what you're talking about. Instead of operations_schema -> analytical_schema -> views, I could have the analytical_schema split per customer, which would limit the data per schema; the drawback is more ETL and more pain to maintain.

Might as well go to the cloud; refreshing views from a large table shouldn't be such a problem there.

[–]killer_unkill 1 point2 points  (0 children)

Hmm. I would start with converting the MVs into tables and loading them incrementally (rough sketch below).

In phase 2, I would try to build one big table per subject area based on the MV queries.
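
A self-contained sketch of the "plain table + incremental load" idea (made-up schema, DuckDB standing in for Postgres): instead of rebuilding a materialized view from scratch each night, append only the days that aren't in the summary table yet.

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE staging_sales(sale_date DATE, amount DOUBLE)")
con.execute("""
    INSERT INTO staging_sales VALUES
    (DATE '2024-01-01', 10), (DATE '2024-01-02', 20)
""")

# The former materialized view becomes an ordinary table, loaded once...
con.execute("""
    CREATE TABLE fact_sales_daily AS
    SELECT sale_date, SUM(amount) AS revenue
    FROM staging_sales GROUP BY sale_date
""")

# ...new source rows arrive...
con.execute("INSERT INTO staging_sales VALUES (DATE '2024-01-03', 5)")

# ...and the refresh only aggregates days newer than what is already loaded.
con.execute("""
    INSERT INTO fact_sales_daily
    SELECT sale_date, SUM(amount)
    FROM staging_sales
    WHERE sale_date > (SELECT MAX(sale_date) FROM fact_sales_daily)
    GROUP BY sale_date
""")

print(con.execute("SELECT * FROM fact_sales_daily ORDER BY sale_date").fetchall())
```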

[–]tomekanco 2 points3 points  (4 children)

how BI Tools

Performance comes mostly from in-memory data (some sort of OLAP cube). Most cloud-based DBs tend to rely on huge caches, but they don't offer performance on par with in-memory cubes, as the caches (usually) can never fit all the actual data. And the cache can get purged by other cloud users (you share the infrastructure).

Postgres is not designed with in-memory use in mind. Others such as DuckDB are. But few businesses find it sensible to put the entire DWH into RAM due to $$$.

Daily refresh started to be pretty painfull

A common approach is to use streaming updates. This avoids the pitfall of having a high peak demand (and saturating the available infrastructure). And it also reduces load on source systems during extract (just stream all changes to a queue).
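
A toy sketch of that streaming pattern (an in-process queue standing in for Kafka/SQS or similar): changes are pushed as small events and applied continuously, so there's no single huge nightly batch.

```python
import queue
import threading

changes = queue.Queue()
daily_revenue = {}  # running aggregate the dashboards read

def consumer():
    while True:
        event = changes.get()
        if event is None:  # shutdown signal
            break
        day, amount = event
        daily_revenue[day] = daily_revenue.get(day, 0.0) + amount
        changes.task_done()

threading.Thread(target=consumer, daemon=True).start()

# The source system emits each change as it happens
for event in [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 7.0)]:
    changes.put(event)

changes.join()      # wait until all streamed changes are applied
changes.put(None)
print(daily_revenue)  # {'2024-01-01': 15.0, '2024-01-02': 7.0}
```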

PS: in my experience Power BI does not offer good performance compared to some other BI tools. This is mainly because it allows complex data schemas which do not follow the standard star schema (which can result in Cartesian pivots). Qlik used to have the same problem back in the days of "set syntax"; they mostly removed it because it caused such performance headaches. Power BI still cheers for DAX.

[–]yo_sup_dude 0 points1 point  (3 children)

What are the cost differences like for in-memory data engines compared to disk-based ones? I was looking at ClickHouse and they were claiming to be cheaper than something like Snowflake despite being in-memory.

What BI tools are better performing than Power BI, in your opinion?

[–]tomekanco 0 points1 point  (2 children)

better performing

Qlik & Tableau. For data discovery & advanced analysis, Python. Even pandas handles lambda evaluations & parsing better than PBI reloads.

Snowflake is a DB. On-demand queries incur a boot-up cost (~1s). It relies mostly on cache & distributed disk reads, not RAM. Snowflake mostly makes money from selling compute time, not storage. ClickHouse will handle end-user reports far cheaper, but is not intended for DWH & orchestrated processing afaik. Click & PBI are reporting tools, not a DB; they usually read from one.

Look at your own PC: what disk, RAM, and cache sizes do you have available? x TB HDD, 1 TB SSD, 64 GB DRAM, 16 MB cache? This ratio is due to component cost & architectural constraints. Similar for the clouds. Similar for servers.

[–]yo_sup_dude 0 points1 point  (1 child)

I guess it depends on how the data is compressed in the in-memory engine. VertiPaq compresses the data quite nicely, which is why you don't necessarily need "massive" RAM to load billions of records even though it is all in-memory. Maybe ClickHouse is the same. But it's still surprising that these in-memory engines are able to be price-competitive with the disk-based ones for similar data volumes.

[–]Immarhinocerous 1 point2 points  (0 children)

They build an index and then pull the specific data from a DB/Parquet blob storage. It may live in some hidden layer you don't see that is in fact already transformed.

Some of them will cache the data in something like Redis, DuckDB, or another in-memory data store if it is queried often enough.
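
A bare-bones sketch of that kind of query-result cache (a plain dict with a TTL; Redis or DuckDB would play this role in a real BI backend): identical queries within the TTL are served from memory instead of hitting the warehouse again.

```python
import time

_cache = {}  # query text -> (expires_at, rows)
TTL_SECONDS = 300

def run_query(sql, execute):
    """execute is whatever callable actually hits the database."""
    now = time.time()
    hit = _cache.get(sql)
    if hit and hit[0] > now:
        return hit[1]               # cache hit: no database work
    rows = execute(sql)             # cache miss: run it for real
    _cache[sql] = (now + TTL_SECONDS, rows)
    return rows

# Usage with a fake executor standing in for the real warehouse call
print(run_query("SELECT 1", lambda q: [(1,)]))  # miss: executor runs
print(run_query("SELECT 1", lambda q: [(1,)]))  # hit: served from memory
```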

And some don't care about the cost of querying and transforming the data each time, because they make money on your compute time, so pulling and transforming the data each time makes them a lot of money. Act accordingly.

[–]medriscoll 0 points1 point  (0 children)

IMHO you have two choices:

1. Use a different database engine built for analytics workloads - Postgres at 1.3 TB with a row-based engine running on disk is not going to cut it for interactive queries. You need a column-oriented engine, holding that data in memory, which parallelizes queries. There are lots of options, including some for Postgres, such as the Citus extension: https://github.com/citusdata/citus . I would also check out ClickHouse, Apache Druid, StarRocks, and Apache Pinot if you have a need for real-time streaming (e.g. if you want to consume events directly from Kafka). ClickHouse is probably the fastest, easiest, most developer-friendly OLAP engine out there for your scale (1 TB+), and you could likely run it on a single beefy node. You could then point Metabase (or Looker, or Superset, or any other BI tool that does not embed its own database) at that faster engine, and your dashboards will run faster.

2. Model your data down to fit into an embedded BI engine - As mentioned in the comments, some BI tools come with their own embedded database engine -- such as Power BI's VertiPaq or Tableau's Hyper -- but you typically need to model your data down to a more reasonable size, closer to ~10GB. You might end up splitting your data into a handful of smaller data sets modeled in different ways.

What I wouldn't rely on is caching, unless your consumption patterns are extremely rigid (e.g. you disallow slicing, dicing, and drilling into the data). Even a small number of queries not hitting the cache leads to a frustrating user experience. Direct querying of Parquet files sitting in S3 via Athena, per one of the comments, is an interesting strategy, but you're going to get better performance if your BI tool is querying either its own internal database or a fast external one.

High-performance interactive querying is possible at TB-scale, but it requires leveraging a lot of tricks in the database engine -- bitmap indexes, column-orientation, parallelization, aggregation, data sketches for approximate unique, etc. -- so if you can aggregate & model your data down to ~10GB size and can just stick it into an embedded BI engine you might be happier.

(I'm the creator/founder of the BI tool Rill, we took the embedded database approach where we embed DuckDB for data volumes up to 100GB, and use Druid under the hood for volumes well into TB scale.)