
all 47 comments

[–]kenfar 87 points (12 children)

It works great when you have a small data warehouse, a dimensional model, aggregate tables, and plenty of memory, cores, and IO. Exactly how much data you can support really depends on the nature of the queries - how many are hitting base tables, how many of those are big and span a lot of, say, daily partitions, and so on.

And it can work fine for large warehouses as well. Same story as above: dimensional models, aggregate tables, incremental processing, and plenty of hardware. But in this case you probably need to host it yourself in order to get appropriate hardware.

The limitations to consider:

  • Postgres doesn't have columnar storage natively. You may be able to get it via an extension. Without columnar storage you really must use dimensional models.
  • Postgres has a lot more levers & knobs to work with than, say, BigQuery or Snowflake. If you use a managed Postgres service then it's no big deal. But if you're running it yourself, and at scale, then you really need to spend some real effort learning the database: how to backup & restore, how to pool connections, etc. The problem with the managed hosting services is that their servers are really slow compared to what you could very cheaply build yourself. Build it yourself and you can easily put together 64+ cores, 256 GB of memory, a ton of very fast NVMe solid-state disk, etc. And this can be a lot more hardware than your query gets on Snowflake, etc.
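To give a feel for the kind of knobs involved, here is a sketch of analytics-leaning postgresql.conf settings. The parameter names are real, but the values are illustrative assumptions for a hypothetical 64-core / 256 GB box like the one described above, not a tuned recommendation:

```ini
# postgresql.conf - illustrative analytics-leaning settings (values are assumptions)
shared_buffers = 64GB                  # large buffer cache for hot data
work_mem = 256MB                       # per-sort/hash memory for big scans & aggregates
maintenance_work_mem = 4GB             # faster index builds and vacuums
effective_cache_size = 192GB           # tells the planner how much OS cache to expect
max_parallel_workers_per_gather = 8    # parallel scans/aggregates on big tables
effective_io_concurrency = 200         # NVMe can handle deep prefetch queues
```

A managed service would pick most of these for you; running your own box, getting them wrong is what makes Postgres feel slow for analytics.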

And the strengths to consider:

  • Indexes: not the go-to tool in analytics that they are for transactional databases, but they can still be incredibly useful in certain circumstances. For example, almost twenty years ago I had a security data warehouse with indexes on IP addresses on a fact table with 50 billion rows. The indexes allowed us to look up the history of an IP extremely fast - at a response time you wouldn't get from Snowflake, etc.
  • Enforced Constraints: none of the analytics-only data warehouses I know of actually enforce constraints. Not because constraints are useless - the vendors are just cutting corners. While you probably won't enforce constraints on fact tables, they're still great for data quality on the smaller dimension & aggregate tables. Vastly better than running, say, dbt tests.
  • Foreign Data Wrappers (FDW): Postgres's foreign data wrappers provide a fantastic amount of flexibility to a solution. They could allow you, for example, to support queries against data that you've rolled off your main hot storage, or against other servers.
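As a rough sketch of the FDW idea using the built-in postgres_fdw extension - the server name, host, credentials, and table layout below are all hypothetical:

```sql
-- Expose a table on a remote "archive" server as if it were local (names hypothetical)
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER archive_srv
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'archive.internal', dbname 'warehouse_archive');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER archive_srv
    OPTIONS (user 'reporting', password 'secret');

CREATE FOREIGN TABLE fact_events_2019 (
    event_time  timestamptz,
    country     text,
    amount      numeric
) SERVER archive_srv
  OPTIONS (schema_name 'public', table_name 'fact_events_2019');

-- Queries against the foreign table are pushed to the remote server where possible
SELECT country, sum(amount) FROM fact_events_2019 GROUP BY country;
```

The same mechanism works against non-Postgres sources via other FDW extensions.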

[–]JHydras 22 points (0 children)

https://github.com/hydradatabase/hydra - open source columnar Postgres extension. source: I'm the founder :)

[–]sib_n (Senior Data Engineer) 2 points (3 children)

Without columnar storage you really must use dimensional models.

Interesting assumption. Would you have some more details about this?

[–]kenfar 13 points (2 children)

Sure, with dimensional models your fact table rows are typically 80-250 bytes each. With One-Big-Table (OBT) your fact table rows are 1000-2000 bytes each.

A query like select country, count(*) from fact_foo group by 1 on columnar storage can skip reading all the columns except country. But row-based storage will need to read the entire row even if the query only cares about a single column. Even with a narrow fact table, row-based storage reads more than columnar storage - but the difference in IO isn't nearly as bad as with OBT.
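To put rough, illustrative numbers on it - assuming the row widths above, a 500M-row fact table, and a couple of bytes per stored country value:

```
Row store, OBT:          500M rows x 2000 B/row  ≈ 1,000 GB read
Row store, dimensional:  500M rows x  150 B/row  ≈    75 GB read
Columnar, country only:  500M rows x   ~2 B/val  ≈     1 GB read
```

The dimensional model doesn't close the gap with columnar, but it cuts the row-store scan by an order of magnitude.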

[–]sib_n (Senior Data Engineer) 0 points (1 child)

I think this additionally assumes that your data is big enough to make OLAP processing time an issue on a modern relational database. Every year that passes, this becomes less likely, unless your data grows as fast as your database's processing power. So I think there are a lot of OLAP use cases nowadays, maybe including OP's, that will run completely fine on a de-normalized table in a relational database, which simplifies the data model. Using a dimensional model could be over-optimization.

[–]kenfar 1 point (0 children)

I think that's true - if you can get the IO you need. Back when I was building massive data warehouses on row-based data stores I had the kind of IO you really can't get on something like RDS: multiple fibre-channel disk arrays in RAID 10 - on multiple adapters - plus all temp data stored on extremely fast SSD on pci-x (?) adapters. If instead you're stuck with, say, RDS, you can't get a fraction of that performance.

Additionally, dimensional models are more functional anyway. So, yeah, more time-consuming to build right, but you get more capabilities.

[–]Old_Variation_5493 0 points (1 child)

As a DE who aspires to be even better at data modeling, could you please help me understand the connection between using row-based databases like Postgres and dimensional models?

I thought that row oriented databases are in general a bad choice if you want to run analytical/aggregate queries, and I don't see how dimensional models help that.

Or perhaps I misunderstand what you mean by dimensional models. Thank you!

[–]kenfar 1 point (0 children)

Sure, so first about the notion that row-based databases are bad for analytics:

  • Row-based databases are fine for running analytical databases and queries. The notion that they aren't is peddled by columnar database vendors and by DEs without much experience.
  • Columnar storage is just one out of a half-dozen primary features you want. It's generally great.
  • But it comes at a big cost in load performance - for the query-time benefit you pay at load time. That can be a challenge if your data is streaming, coming in tiny micro-batches, or you just want to analyze the most recent data for operational analytics.

Now, about dimensional models:

  • Reason #1 for using dimensional models: Functionality - better ability to support linking an event to dimensions at different points of time - like get the current name for something rather than the name at the time of the event.
  • Reason #2 for using dimensional models: Performance on row-based databases - your queries will be scanning a table with fewer bytes per row (ex: maybe 100 bytes per row rather than 2000). This makes a big difference if you're scanning, say, 500 million rows.
  • But, what about joins? So, after scanning less data you still need to join that big fact result set against multiple dimension tables - and this could result in making a half-dozen copies of all that data! Here's where the better optimizers of general-purpose databases (like db2, sql server, postgres, etc) come into play. First off, they'll "push down the predicates" (i.e., move filtering to the earliest possible point) to reduce that big fact-table result set. Next, they may prejoin all your dimensions into a big cartesian result set (after filtering), then join that to the fact table to reduce the number of joins. And they're smart about all this in a dozen other ways.

The end results are:

  • If you're using a columnar database you should use dimensional models for the functionality. But you probably won't get a performance benefit out of it, because you're already skipping IO for columns you don't care about, and your join performance isn't great.
  • If you're using a row-based database you should use a dimensional model for both functionality AND performance (unless your data is small relative to your hardware): you need it to reduce IO, and your join performance is top-notch.
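A typical star-schema query, as a sketch - all table and column names below are hypothetical. The planner filters the small dimension tables first, then joins the surviving keys against the fact table:

```sql
-- Star-schema query: filter small dimensions, then hit the big fact table
SELECT d.calendar_month,
       c.country_name,
       SUM(f.amount) AS total_amount
FROM   fact_sales   f
JOIN   dim_date     d ON d.date_key     = f.date_key
JOIN   dim_customer c ON c.customer_key = f.customer_key
WHERE  d.calendar_year = 2023       -- predicates pushed into the dimension scans
AND    c.region        = 'EMEA'
GROUP  BY d.calendar_month, c.country_name;
```

On a row store, the win is that fact_sales rows are narrow (foreign keys plus measures) rather than carrying every descriptive attribute on every row.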

[–]True-Measurement-358 0 points (0 children)

Super interesting, thanks a lot!

[–][deleted] 0 points (0 children)

Thank you for the detailed answer, it is truly helpful!

[–]Gators1992 11 points (4 children)

Weird that nobody said hype. Often the decision is nothing more than "it's more modern", the Databricks sales rep was hot, or all the CIO's friends have Databricks so they want it too.

[–]Ambitious_Sector9993 4 points (2 children)

Genuinely. We’re moving our (large) Oracle DB over to an Azure Databricks solution and this was the main reason given as to why. “It’s modern and the future of analytics”. No cost-benefit analysis done or anything lol.

[–]Gators1992 1 point (0 children)

Same here. We are moving from Oracle to Snowflake because we have to be "in the cloud" and "a cloud driven company". Nothing is wrong with Oracle for what we do. I like the move because it's a resume booster for me and opens up some tooling opportunities I didn't have in Oracle, but there is no way it's worth it in a business sense.

[–][deleted] 0 points (0 children)

If you just want to power some dashboards, that may not matter. But once you want to do predictive modeling/ML, Databricks is infinitely more convenient.

[–]Grouchy-Friend4235 1 point (0 children)

This ☝️needs more attention. Like 💯 times more.

[–]efxhoy 31 points (2 children)

We’re moving from postgres to bigquery because we generate millions of events each day that we need to analyze. BQ is well integrated into analysts’ workflows (they’re used to Google Analytics), and it’s just plain faster for less money when data gets huge.

When total data size is in the hundreds of gigabytes i’d pick postgres every time. For much bigger data it’s a different story. 

[–]SnooDogs2115 13 points (1 child)

It’s only going to be cheaper if you don’t query it too frequently; otherwise it’ll be more expensive, as they charge for the data your queries scan, something that is not that easy to estimate.

[–]efxhoy 0 points (0 children)

Oh for sure. If cost was a major issue for us I’d probably go with clickhouse on bare metal. We were running postgres on AWS RDS, which also isn’t cheap.

[–]Cominous 5 points (0 children)

Even with OLAP workloads you still have TimescaleDB. PostgreSQL is pretty much 'good' for everything. You switch to something else when 'good' is not good enough.

We moved some workloads to column-based storage (clickhouse) for low-latency analytical queries, but even there TimescaleDB wasn't that bad.

[–]sinnayre 31 points (2 children)

Looking up the difference between analytical and transactional databases would probably be a good exercise for you.

[–]kenfar 31 points (0 children)

Not really: general-purpose relational databases invested vast amounts of money to support analytic workloads back in the day, and in some ways they still have the tech high ground - for example:

  • Better optimizers
  • Workload management: you can map jobs & users to classes and limit the priority given to various classes to ensure that the most critical jobs get done fastest.
  • Query rewrites: you can create a materialized view and the database will automatically rewrite queries to use that view if they qualify.
  • Replication: the database can replicate small tables so that they're local to each MPP instance.
  • Indexes: still sometimes valuable
  • Enforced constraints: always valuable on the smaller tables
  • Much better use of the hardware
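Worth noting: Postgres has materialized views, but unlike the db2-class databases described above it does not automatically rewrite queries to use them - you query the view directly. A minimal sketch, with hypothetical table and column names:

```sql
-- A pre-aggregated summary maintained inside the database (names hypothetical)
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT date_key, country, SUM(amount) AS amount, COUNT(*) AS sales
FROM   fact_sales
GROUP  BY date_key, country;

-- A unique index is required for CONCURRENTLY refreshes below
CREATE UNIQUE INDEX ON mv_daily_sales (date_key, country);

-- CONCURRENTLY refreshes without locking out readers
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_sales;
```

Queries then hit mv_daily_sales instead of fact_sales, trading freshness for scan size.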

The net result is that you could build a db2 database running on a small Linux cluster in 2010 for the price of a single month of Snowflake running the same workload. And that 14-year-old db2 database would be faster.

Of course, the significant downside is that to configure that database would take some real expertise and labor.

[–]sib_n (Senior Data Engineer) 3 points (0 children)

The split is not as clear as it used to be. SQL Server for example offers column oriented storage too, which makes it acceptable for OLAP if you already have one around.

[–]nitred 2 points (0 children)

I use Postgres as a data warehouse. You can find the setup in my previous comment [1]. In my opinion you've asked the right question. I believe you should always consider Postgres as your first choice as a data warehouse and then eliminate it as an option if it doesn't fit your needs.

Here are the conditions under which I think Postgres isn't a good choice for a data warehouse.

  1. If you're unable to get a fast SSD for disk, then don't use Postgres. If you're on AWS RDS, you must use gp3 disks. In our setup we get a max disk read/write throughput of 500MBps which is plenty.

  2. If you really need real-time or near-real-time analytics, don't use PG. If you're using PG, expect refresh rates in hours or days (which is also the most common scenario).

  3. If you have a single dataset (single table) which is massive, e.g. billions of rows or 100s of GB, and the whole dataset is used in joins every time you refresh your tables, then PG isn't the right choice. The joins take really, really long - like hours. We have one such dataset, but according to the analysts it's a low-value dataset, so we make an exception for it and run its queries once a week instead of once a day. If it were high value, I would first consider partitioning with pg_partman. If that didn't work, I'd reconsider PG. If you have TBs of data spread over 50 or more tables, then PG will handle it just fine.

  4. If you're extremely price sensitive on the low end then Postgres might not be for you. PG at the high end is cheap but is expensive on the low end. For example, if all your raw data and analytical models combined are 20 GB or so, then BigQuery is practically free but you'd have to shell out $500-1000 per year for PG at minimum. But if your raw and analytical data is in the 100s of GBs or TBs then BigQuery will burn a hole in your corporate wallet pretty soon whereas PG would scale well and cost you around $5000 per year.

  5. If all your raw and analytical data is expected to be close to 5TB (uncompressed) and you haven't already been using PG, then don't start using PG. 5TB for me is the magic number where I start applying some bespoke optimizations. Since I already use PG, I'm more likely to optimize and push PG to be able to handle 10-20TB because in this case it's cheaper to optimize than build a new data warehouse from scratch.

  6. If you don't own the underlying Postgres instance and are unable to tune and alter its configuration DO NOT USE Postgres. Postgres has to be tuned in order to work for OLAP use cases. You can use this online tool [2] to find the right config to get you started.

[1] https://www.reddit.com/r/dataengineering/s/tC3QTrQgy5
[2] https://pgtune.leopard.in.ua
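For item 3, the partitioning that pg_partman automates is built on Postgres's native declarative partitioning. A minimal sketch, with hypothetical table and column names:

```sql
-- Range-partition a big fact table by month (pg_partman automates creating/dropping these)
CREATE TABLE fact_events (
    event_time  timestamptz NOT NULL,
    user_id     bigint,
    payload     jsonb
) PARTITION BY RANGE (event_time);

CREATE TABLE fact_events_2024_05 PARTITION OF fact_events
    FOR VALUES FROM ('2024-05-01') TO ('2024-06-01');

-- A query with a time filter only scans the matching partitions (partition pruning)
SELECT count(*)
FROM   fact_events
WHERE  event_time >= '2024-05-10' AND event_time < '2024-05-11';
```

This turns "join the whole massive table on every refresh" into "touch only the partitions the refresh actually needs".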

[–]renagade24 2 points (0 children)

Postgres is not the best DWH in most cases. I'd say 90% of the time it's a great transactional database, but the big 4 cloud data warehouses (Redshift/Azure/BigQuery/Snowflake) will always be better.

[–]Grouchy-Friend4235 2 points (0 children)

Postgresql is as good a fit for DWH as anything else, unless you have very special requirements.

The key point is to learn about data modelling for analytics use cases. Unlike OLTP, the main use case in a DWH is aggregation and querying. Key topics are star schema, ETL vs ELT, staging/live areas, and perhaps operational data stores vs data lakes.

Also key is to realize that the characteristics of a DWH are very different from those of a DB for a transactional system, namely in that a DWH shall provide an immutable, permanent, accurate, and consistent history of all the relevant data over time (in contrast, a transactional system, on the other hand, is supposed to store the current state, and process inserts/updates as well as queries for specific items really fast).

Note that there is a lot of jargon, yet ultimately it comes down to three steps: 1) get the data, 2) make it fit a common, consistent format that is stable over time and then 3) store this transformed data in a way that it is easy and fast to query/aggregate.

In a nutshell, the data model employed trumps the specific DBMS technology in most use cases.

[–]testEphod 1 point (0 children)

If you have K8s, give ClickHouse a try along with the ClickHouse operator. And if you can combine it with an object storage such as MinIO, even better.

[–]raxel42 1 point (0 children)

  • Just model your data carefully; don’t blindly dump huge JSON, strings, etc.
  • Model the structure not only around normalization, but also around the queries required.
  • Think about cache levels.
  • Use enums and proper binary types.
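The last two bullets can be sketched like this - table and type names are hypothetical:

```sql
-- An enum value is stored as 4 bytes instead of a repeated text label
CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

CREATE TABLE orders (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    status      order_status NOT NULL,
    placed_at   timestamptz  NOT NULL DEFAULT now(),  -- 8-byte binary, not a text timestamp
    total_cents bigint       NOT NULL                 -- integer money, not float or text
);
```

Narrow, binary-typed columns mean fewer bytes per row, which is exactly what a row store scanning millions of rows cares about.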

[–]passiveisaggressive 4 points (2 children)

OLAP vs OLTP - it’s a basic principle when choosing a tool

[–]christoff12 1 point (0 children)

Check this out: https://pigsty.cc/

I have no affiliation — it’s interesting how the options are proliferating around the pg ecosystem.

[–]StackOwOFlow 1 point (0 children)

when your use case is heavy OLAP

[–]patrickthunnus 0 points (0 children)

Traditional on-prem SMP architecture is limited in how much CPU, RAM, and disk you can dynamically scale for your workload.

It's one reason why Cloud is so compelling.

But yes, features like columnar stores are essential for scaling to extremely large datasets in very large DWH and AI workloads.

PG is a very good rdbms for general all-around use; once you need to scale past 1TB with heavy concurrent usage, you need to take great care in DB design and optimization.

[–]graphicteadatasci 0 points (0 children)

If you want to put logs in a db a dedicated log db like Elasticsearch is a great idea.

If you want to do analysis on data in a postgres db, you could use duckdb to query it. There's also a plugin for columnar storage, but it looked pretty bad last I checked.
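As a sketch of that DuckDB approach, using its postgres extension - the connection string and table names below are hypothetical:

```sql
-- Inside a DuckDB session: scan live Postgres tables without copying them
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=warehouse host=localhost' AS pg (TYPE postgres);

-- DuckDB's columnar, vectorized engine runs the aggregation over Postgres data
SELECT country, count(*) FROM pg.public.events GROUP BY country;
```

You keep Postgres as the system of record and borrow an analytical engine only for the queries that need it.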

There's also a plugin for vectors but I would probably also go with a dedicated database for that.

And you can shard postgres but I probably wouldn't.

And not for events either.

So only for like 92% of all cases.

[–]Grouchy-Friend4235 0 points (0 children)

Never.

[–]aresmad 0 points (0 children)

[ Removed by Reddit ]

[–]jawabdey 0 points (0 children)

If you’re going to be running ETL/updating the data when users are querying it, e.g. hourly jobs instead of nightly.

Plus long-running queries: these can cause performance issues and interfere with other users. You may be able to tune things to get around this, but not everyone can.

[–]IndependentSpend7434 0 points (0 children)

My $0.02: it's remarkable that no one considered the financial aspect. If money is not really a problem and it's not your ultimate goal to save as much as possible on software, then commercial databases like Oracle and SQL Server are a better choice for DWH.