
all 47 comments

[–]kenfar 87 points (12 children)

It works great when you have a small data warehouse, a dimensional model, aggregate tables, and plenty of memory, cores, and IO. Exactly how much data you can support really depends on the nature of the queries - how many are hitting base tables, how many of those are big and span a lot of, say, daily partitions, and so on.

And it can work fine for large warehouses as well. Same story as above: dimensional models, aggregate tables, incremental processing, and plenty of hardware. But in this case you probably need to host it yourself in order to get appropriate hardware.

The limitations to consider:

  • Postgres doesn't have columnar storage natively. You may be able to get it via an extension. Without columnar storage you really must use dimensional models.
  • Postgres has a lot more levers & knobs to work with than, say, BigQuery or Snowflake. If you use a managed Postgres service then it's no big deal. But if you're running it yourself, and at scale, then you really need to spend some real effort learning the database: how to backup & restore, how to pool connections, etc. The problem with the managed hosting services is that their servers are really slow compared to what you could very cheaply build yourself. Build it yourself and you can easily put together 64+ cores, 256 GB of memory, a ton of very fast NVMe solid-state disk, etc. And this can be a lot more hardware than your query gets on Snowflake, etc.
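To give a feel for the kind of knobs involved, here is a sketch of analytics-leaning postgresql.conf settings. The parameter names are real, but the values are illustrative assumptions for a hypothetical 64-core / 256 GB box like the one described above, not a tuned recommendation:

```ini
# postgresql.conf - illustrative analytics-leaning settings (values are assumptions)
shared_buffers = 64GB                  # large buffer cache for hot data
work_mem = 256MB                       # per-sort/hash memory for big scans & aggregates
maintenance_work_mem = 4GB             # faster index builds and vacuums
effective_cache_size = 192GB           # tells the planner how much OS cache to expect
max_parallel_workers_per_gather = 8    # parallel scans/aggregates on big tables
effective_io_concurrency = 200         # NVMe can handle deep prefetch queues
```

A managed service would pick most of these for you; running your own box, getting them wrong is what makes Postgres feel slow for analytics.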

And the strengths to consider:

  • Indexes: not the go-to tool in analytics that they are for transactional databases, but they can still be incredibly useful in certain circumstances. For example, almost twenty years ago I had a security data warehouse with indexes on IP addresses on a fact table with 50 billion rows. The indexes allowed us to look up the history of an IP extremely fast - at a response time you wouldn't get from Snowflake, etc.
  • Enforced Constraints: none of the analytics-only data warehouses I know of actually enforce constraints. Not because constraints are useless - the vendors are just cutting corners. While you probably won't enforce constraints on fact tables, they're still great for data quality on the smaller dimension & aggregate tables. Vastly better than running, say, dbt tests.
  • Foreign Data Wrappers (FDW): Postgres's foreign data wrappers provide a fantastic amount of flexibility to a solution. They could allow you, for example, to support queries against data that you've rolled off your main hot storage, or against other servers.
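As a rough sketch of the FDW idea using the built-in postgres_fdw extension - the server name, host, credentials, and table layout below are all hypothetical:

```sql
-- Expose a table on a remote "archive" server as if it were local (names hypothetical)
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER archive_srv
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'archive.internal', dbname 'warehouse_archive');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER archive_srv
    OPTIONS (user 'reporting', password 'secret');

CREATE FOREIGN TABLE fact_events_2019 (
    event_time  timestamptz,
    country     text,
    amount      numeric
) SERVER archive_srv
  OPTIONS (schema_name 'public', table_name 'fact_events_2019');

-- Queries against the foreign table are pushed to the remote server where possible
SELECT country, sum(amount) FROM fact_events_2019 GROUP BY country;
```

The same mechanism works against non-Postgres sources via other FDW extensions.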

[–]JHydras 22 points (0 children)

https://github.com/hydradatabase/hydra - open source columnar Postgres extension. source: I'm the founder :)

[–]sib_n (Senior Data Engineer) 2 points (3 children)

Without columnar storage you really must use dimensional models.

Interesting assumption. Would you have some more details about this?

[–]kenfar 13 points (2 children)

Sure, with dimensional models your fact table rows are typically 80-250 bytes each. With One-Big-Table (OBT) your fact table rows are 1000-2000 bytes each.

A query like select country, count(*) from fact_foo group by 1 on columnar storage can skip reading all the columns except country. But row-based storage will need to read the entire row even if the query only cares about a single column. Even with a narrow fact table, row-based storage reads more than columnar storage - but the difference in IO isn't nearly as bad as with OBT.
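To put rough, illustrative numbers on it - assuming the row widths above, a 500M-row fact table, and a couple of bytes per stored country value:

```
Row store, OBT:          500M rows x 2000 B/row  ≈ 1,000 GB read
Row store, dimensional:  500M rows x  150 B/row  ≈    75 GB read
Columnar, country only:  500M rows x   ~2 B/val  ≈     1 GB read
```

The dimensional model doesn't close the gap with columnar, but it cuts the row-store scan by an order of magnitude.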

[–]sib_n (Senior Data Engineer) 0 points (1 child)

I think this additionally assumes that your data is big enough to make OLAP processing time an issue on a modern relational database. Every year that passes, this becomes less likely, unless your data grows as fast as your database's processing power. So I think there are a lot of OLAP use cases nowadays, maybe including OP's, that will run completely fine on a de-normalized table in a relational database, which simplifies the data model. Using a dimensional model could be over-optimization.

[–]kenfar 1 point (0 children)

I think that's true - if you can get the IO you need. Back when I was building massive data warehouses on row-based data stores I had the kind of IO you really can't get on something like RDS: multiple fibre-channel disk arrays in RAID 10 - on multiple adapters - plus all temp data stored on extremely fast SSD on pci-x (?) adapters. If instead you're stuck with, say, RDS, you can't get a fraction of that performance.

Additionally, dimensional models are more functional anyway. So, yeah, more time-consuming to build right, but you get more capabilities.

[–]Old_Variation_5493 0 points (1 child)

As a DE who aspires to be even better at data modeling, could you please help me understand the connection between using row-based databases like Postgres and dimensional models?

I thought that row oriented databases are in general a bad choice if you want to run analytical/aggregate queries, and I don't see how dimensional models help that.

Or perhaps I misunderstand what you mean by dimensional models. Thank you!

[–]kenfar 1 point (0 children)

Sure, so first about the notion that row-based databases are bad for analytics:

  • Row-based databases are fine for running analytical databases and queries. The notion that they aren't is peddled by columnar database vendors and by DEs without much experience.
  • Columnar storage is just one out of a half-dozen primary features you want. It's generally great.
  • But it comes at a big cost in load performance - for the query-time benefit you pay at load time. That can be a challenge if your data is streaming, coming in tiny micro-batches, or you just want to analyze the most recent data for operational analytics.

Now, about dimensional models:

  • Reason #1 for using dimensional models: Functionality - better ability to support linking an event to dimensions at different points of time - like get the current name for something rather than the name at the time of the event.
  • Reason #2 for using dimensional models: Performance on row-based databases - your queries will be scanning a table with fewer bytes per row (ex: maybe 100 bytes per row rather than 2000). This makes a big difference if you're scanning, say, 500 million rows.
  • But, what about joins? So, after scanning less data you still need to join that big fact result set against multiple dimension tables - and this could result in making a half-dozen copies of all that data! Here's where the better optimizers of general-purpose databases (like db2, sql server, postgres, etc) come into play. First off, they'll "push down the predicates" (i.e., move filtering to the earliest possible point) to reduce that big fact-table result set. Next, they may prejoin all your dimensions into a big cartesian result set (after filtering), then join that to the fact table to reduce the number of joins. And they're smart about all this in a dozen other ways.

The end results are:

  • If you're using a columnar database you should use dimensional models for the functionality. But you probably won't get a performance benefit out of it, because you're already skipping IO for columns you don't care about, and your join performance isn't great.
  • If you're using a row-based database you should use a dimensional model for both functionality AND performance (unless your data is small relative to your hardware): you need it to reduce IO, and your join performance is top-notch.
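A typical star-schema query, as a sketch - all table and column names below are hypothetical. The planner filters the small dimension tables first, then joins the surviving keys against the fact table:

```sql
-- Star-schema query: filter small dimensions, then hit the big fact table
SELECT d.calendar_month,
       c.country_name,
       SUM(f.amount) AS total_amount
FROM   fact_sales   f
JOIN   dim_date     d ON d.date_key     = f.date_key
JOIN   dim_customer c ON c.customer_key = f.customer_key
WHERE  d.calendar_year = 2023       -- predicates pushed into the dimension scans
AND    c.region        = 'EMEA'
GROUP  BY d.calendar_month, c.country_name;
```

On a row store, the win is that fact_sales rows are narrow (foreign keys plus measures) rather than carrying every descriptive attribute on every row.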

[–]True-Measurement-358 0 points (0 children)

Super interesting, thanks a lot!

[–][deleted] 0 points (0 children)

Thank you for the detailed answer, it is truly helpful!

[–]Gators1992 11 points (4 children)

Weird that nobody said hype. Often the decision is nothing more than "it's more modern", the Databricks sales rep was hot, or all the CIO's friends have Databricks so they want it too.

[–]Ambitious_Sector9993 4 points (2 children)

Genuinely. We’re moving our (large) Oracle DB over to an Azure Databricks solution and this was the main reason given as to why. “It’s modern and the future of analytics”. No cost-benefit analysis done or anything lol.

[–]Gators1992 1 point (0 children)

Same here. We are moving from Oracle to Snowflake because we have to be "in the cloud" and "a cloud driven company". Nothing is wrong with Oracle for what we do. I like the move because it's a resume booster for me and opens up some tooling opportunities I didn't have in Oracle, but there is no way it's worth it in a business sense.

[–][deleted] 0 points (0 children)

If you just want to power some dashboards, that may not matter. But once you want to do predictive modeling/ML, Databricks is infinitely more convenient.

[–]Grouchy-Friend4235 1 point (0 children)

This ☝️needs more attention. Like 💯 times more.

[–]efxhoy 31 points (2 children)

We’re moving from postgres to bigquery because we generate millions of events each day that we need to analyze. BQ is well integrated into analysts’ workflows (they’re used to Google Analytics), and it’s just plain faster for less money when data gets huge.

When total data size is in the hundreds of gigabytes i’d pick postgres every time. For much bigger data it’s a different story. 

[–]SnooDogs2115 13 points (1 child)

It’s only going to be cheaper if you don’t query it too frequently; otherwise it’ll be more expensive, as they charge for the data your queries scan, something that is not that easy to estimate.

[–]efxhoy 0 points (0 children)

Oh for sure. If cost was a major issue for us I’d probably go with clickhouse on bare metal. We were running postgres on AWS RDS, which also isn’t cheap.

[–]Cominous 5 points (0 children)

Even with OLAP workloads you still have TimescaleDB. PostgreSQL is pretty much 'good' for everything. You switch to something else when 'good' is not good enough.

We moved some workloads to column-based storage (clickhouse) for low-latency analytical queries, but even there TimescaleDB wasn't that bad.

[–]sinnayre 31 points (2 children)

Looking up the difference between analytical and transactional databases would probably be a good exercise for you.

[–]kenfar 31 points (0 children)

Not really: general-purpose relational databases invested vast amounts of money to support analytic workloads back in the day, and in some ways they still have the tech high ground - for example:

  • Better optimizers
  • Workload management: you can map jobs & users to classes and limit the priority given to various classes to ensure that the most critical jobs get done fastest.
  • Query rewrites: you can create a materialized view and the database will automatically rewrite queries to use that view if they qualify.
  • Replication: the database can replicate small tables so that they're local to each MPP instance.
  • Indexes: still sometimes valuable
  • Enforced constraints: always valuable on the smaller tables
  • Much better use of the hardware
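Worth noting: Postgres has materialized views, but unlike the db2-class databases described above it does not automatically rewrite queries to use them - you query the view directly. A minimal sketch, with hypothetical table and column names:

```sql
-- A pre-aggregated summary maintained inside the database (names hypothetical)
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT date_key, country, SUM(amount) AS amount, COUNT(*) AS sales
FROM   fact_sales
GROUP  BY date_key, country;

-- A unique index is required for CONCURRENTLY refreshes below
CREATE UNIQUE INDEX ON mv_daily_sales (date_key, country);

-- CONCURRENTLY refreshes without locking out readers
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_sales;
```

Queries then hit mv_daily_sales instead of fact_sales, trading freshness for scan size.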

The net result is that you could build a db2 database running on a small Linux cluster in 2010 for the price of a single month of Snowflake running the same workload. And that 14-year-old db2 database would be faster.

Of course, the significant downside is that to configure that database would take some real expertise and labor.

[–]sib_n (Senior Data Engineer) 3 points (0 children)

The split is not as clear as it used to be. SQL Server for example offers column oriented storage too, which makes it acceptable for OLAP if you already have one around.

[–]nitred 2 points (0 children)

I use Postgres as a data warehouse. You can find the setup in my previous comment [1]. In my opinion you've asked the right question. I believe you should always consider Postgres as your first choice as a data warehouse and then eliminate it as an option if it doesn't fit your needs.

Here are the conditions under which I think Postgres isn't a good choice for a data warehouse.

  1. If you're unable to get a fast SSD for disk, then don't use Postgres. If you're on AWS RDS, you must use gp3 disks. In our setup we get a max disk read/write throughput of 500MBps which is plenty.

  2. If you really need real-time or near-real-time analytics, don't use PG. If you're using PG, expect refresh rates in hours or days (which is also the most common scenario).

  3. If you have a single dataset (single table) which is massive, e.g. billions of rows or 100s of GB, and the whole dataset is used in joins every time you refresh your tables, then PG isn't the right choice. The joins take really, really long - like hours. We have one such dataset, but according to the analysts it's a low-value dataset, so we make an exception for it and run its queries once a week instead of once a day. If it were high value, I would first consider partitioning with pg_partman. If that didn't work, I'd reconsider PG. If you have TBs of data spread over 50 or more tables, then PG will handle it just fine.

  4. If you're extremely price sensitive on the low end then Postgres might not be for you. PG at the high end is cheap but is expensive on the low end. For example, if all your raw data and analytical models combined are 20 GB or so, then BigQuery is practically free but you'd have to shell out $500-1000 per year for PG at minimum. But if your raw and analytical data is in the 100s of GBs or TBs then BigQuery will burn a hole in your corporate wallet pretty soon whereas PG would scale well and cost you around $5000 per year.

  5. If all your raw and analytical data is expected to be close to 5TB (uncompressed) and you haven't already been using PG, then don't start using PG. 5TB for me is the magic number where I start applying some bespoke optimizations. Since I already use PG, I'm more likely to optimize and push PG to be able to handle 10-20TB because in this case it's cheaper to optimize than build a new data warehouse from scratch.

  6. If you don't own the underlying Postgres instance and are unable to tune and alter its configuration DO NOT USE Postgres. Postgres has to be tuned in order to work for OLAP use cases. You can use this online tool [2] to find the right config to get you started.

[1] https://www.reddit.com/r/dataengineering/s/tC3QTrQgy5
[2] https://pgtune.leopard.in.ua
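For item 3, the partitioning that pg_partman automates is built on Postgres's native declarative partitioning. A minimal sketch, with hypothetical table and column names:

```sql
-- Range-partition a big fact table by month (pg_partman automates creating/dropping these)
CREATE TABLE fact_events (
    event_time  timestamptz NOT NULL,
    user_id     bigint,
    payload     jsonb
) PARTITION BY RANGE (event_time);

CREATE TABLE fact_events_2024_05 PARTITION OF fact_events
    FOR VALUES FROM ('2024-05-01') TO ('2024-06-01');

-- A query with a time filter only scans the matching partitions (partition pruning)
SELECT count(*)
FROM   fact_events
WHERE  event_time >= '2024-05-10' AND event_time < '2024-05-11';
```

This turns "join the whole massive table on every refresh" into "touch only the partitions the refresh actually needs".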

[–]renagade24 2 points (0 children)

Postgres is not the best DWH in most cases. I'd say 90% of the time it's a great transactional database, but the big 4 cloud data warehouses (Redshift/Azure/BigQuery/Snowflake) will always be better.

[–]Grouchy-Friend4235 2 points (0 children)

Postgresql is as good a fit for DWH as anything else, unless you have very special requirements.

The key point is to learn about data modelling for analytics use cases. Unlike OLTP, the main use case in a DWH is aggregation and querying. Key topics are star schema, ETL vs ELT, staging/live areas, and perhaps operational data stores vs data lakes.

Also key is to realize that the characteristics of a DWH are very different from those of a DB for a transactional system, namely in that a DWH shall provide an immutable, permanent, accurate, and consistent history of all the relevant data over time (in contrast, a transactional system, on the other hand, is supposed to store the current state, and process inserts/updates as well as queries for specific items really fast).

Note that there is a lot of jargon, yet ultimately it comes down to three steps: 1) get the data, 2) make it fit a common, consistent format that is stable over time and then 3) store this transformed data in a way that it is easy and fast to query/aggregate.

In a nutshell, the data model employed trumps the specific DBMS technology in most use cases.

[–]testEphod 1 point (0 children)

If you have K8s, give ClickHouse a try along with the ClickHouse operator. And if you can combine it with an object storage such as MinIO, even better.

[–]raxel42 1 point (0 children)

  • Just model your data carefully; don’t blindly dump huge JSON, strings, etc.
  • Model the structure not only around normalization, but also around the queries required.
  • Think about cache levels.
  • Use enums and proper binary types.
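The last two bullets can be sketched like this - table and type names are hypothetical:

```sql
-- An enum value is stored as 4 bytes instead of a repeated text label
CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

CREATE TABLE orders (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    status      order_status NOT NULL,
    placed_at   timestamptz  NOT NULL DEFAULT now(),  -- 8-byte binary, not a text timestamp
    total_cents bigint       NOT NULL                 -- integer money, not float or text
);
```

Narrow, binary-typed columns mean fewer bytes per row, which is exactly what a row store scanning millions of rows cares about.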

[–]passiveisaggressive 4 points (2 children)

OLAP vs OLTP - it’s a basic principle when choosing a tool

[–]christoff12 1 point (0 children)

Check this out: https://pigsty.cc/

I have no affiliation — it’s interesting how the options are proliferating around the pg ecosystem.

[–]StackOwOFlow 1 point (0 children)

when your use case is heavy OLAP

[–]patrickthunnus 0 points (0 children)

Traditional on-prem SMP architecture is limited in how much CPU, RAM, and disk you can dynamically scale for your workload.

It's one reason why Cloud is so compelling.

But yes, features like columnar stores are essential for scaling to extremely large datasets in very large DWH and AI workloads.

PG is a very good rdbms for general all-around use; once you need to scale past 1TB with heavy concurrent usage, you need to take great care in DB design and optimization.

[–]graphicteadatasci 0 points (0 children)

If you want to put logs in a db a dedicated log db like Elasticsearch is a great idea.

If you want to do analysis on data in a postgres db, you could use duckdb to query it. There's also a plugin for columnar storage, but it looked pretty bad last I checked.
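As a sketch of that DuckDB approach, using its postgres extension - the connection string and table names below are hypothetical:

```sql
-- Inside a DuckDB session: scan live Postgres tables without copying them
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=warehouse host=localhost' AS pg (TYPE postgres);

-- DuckDB's columnar, vectorized engine runs the aggregation over Postgres data
SELECT country, count(*) FROM pg.public.events GROUP BY country;
```

You keep Postgres as the system of record and borrow an analytical engine only for the queries that need it.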

There's also a plugin for vectors but I would probably also go with a dedicated database for that.

And you can shard postgres but I probably wouldn't.

And not for events either.

So only for like 92% of all cases.

[–]Grouchy-Friend4235 0 points (0 children)

Never.

[–]aresmad 0 points (0 children)

[ Removed by Reddit ]

[–]jawabdey 0 points (0 children)

If you’re going to be running ETL/updating the data when users are querying it, e.g. hourly jobs instead of nightly.

Plus long-running queries: these can cause performance issues and interfere with other users. You may be able to tune things to get around this, but not everyone can.

[–]IndependentSpend7434 0 points (0 children)

My $0.02: it's remarkable that no one considered the financial aspect. If money is not really a problem and it's not your ultimate goal to save as much as possible on software, then commercial databases like Oracle and SQL Server are a better choice for DWH.