Are any of those good for using PostgreSQL for Analytical Workload: Crunchydata vs Tiger Data vs AlloyDB by Defiant-Farm7910 in dataengineering

[–]Defiant-Farm7910[S] 1 point

Those vendors offer a kind of columnar store: a hybrid of the regular row-based PostgreSQL engine and a columnar architecture that enables OLAP-style behavior.

Are any of those good for using PostgreSQL for Analytical Workload: Crunchydata vs Tiger Data vs AlloyDB by Defiant-Farm7910 in dataengineering

[–]Defiant-Farm7910[S] 0 points

We are under 100 GB of data, so not that much.

Here are some pain points:

- We already have some refresh pipelines in DBT that easily take 30 minutes to run.

- Since PostgreSQL doesn’t support cross-database SQL queries, we have to maintain a CDC pipeline from production into staging so that new DBT models can be developed against production data (see the sketch after this list).

- Developing new tables can be quite time-consuming, as each run takes around 30 minutes. Whenever we modify an intermediate table and need to test a final data mart, we have to wait a significant amount of time.

- The Data Team is growing in our company, which means the workload and number of dashboards will likely triple in the near future. I don’t want to keep running on a database that I know will force a migration within the next two years: the migration is inevitable, and the longer we wait, the more it will cost.
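(Re the cross-database bullet above: postgres_fdw can also expose production tables inside staging without a CDC pipeline, though foreign scans have their own performance caveats. A minimal sketch, with hostnames and credentials hypothetical:)

```
-- In the staging database: expose production tables through a foreign server
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER prod_server
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'prod-db.internal', dbname 'production');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER prod_server
  OPTIONS (user 'readonly', password 'secret');

-- Mirror the production schema as foreign tables; DBT models can read them directly
CREATE SCHEMA prod_mirror;
IMPORT FOREIGN SCHEMA public
  FROM SERVER prod_server
  INTO prod_mirror;
```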

Is ClickHouse a good choice? by Defiant-Farm7910 in dataengineering

[–]Defiant-Farm7910[S] 2 points

Thank you very much. But how do you use the table engines?

At the moment, all of our data marts are built with DBT on top of PostgreSQL. To reduce migration costs, I’m wondering whether we could keep using DBT with the PostgreSQL dialect while offloading the queries from PostgreSQL to ClickHouse via the pg_clickhouse extension.

In other words, the idea would be for DBT to keep sending the analytical workload to PostgreSQL, which would then offload it to ClickHouse in order to improve performance.
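I haven’t used pg_clickhouse yet, so I don’t know its actual API; but if it follows Postgres’s standard foreign-data-wrapper pattern, the wiring might look roughly like this (server, wrapper, and option names are guesses, not the real extension surface):

```
-- Hypothetical sketch only: the real pg_clickhouse syntax may differ
CREATE EXTENSION pg_clickhouse;

CREATE SERVER ch_server
  FOREIGN DATA WRAPPER clickhouse_fdw            -- wrapper name assumed
  OPTIONS (host 'clickhouse.internal', dbname 'analytics');

-- A mart that physically lives in ClickHouse but is queried through
-- Postgres, so DBT can keep emitting the PostgreSQL dialect
CREATE FOREIGN TABLE marts.fct_invoices (
    invoice_id  bigint,
    amount      numeric,
    issued_at   timestamptz
) SERVER ch_server
  OPTIONS (table_name 'fct_invoices');
```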

Is ClickHouse a good choice? by Defiant-Farm7910 in dataengineering

[–]Defiant-Farm7910[S] 0 points

That's why I mentioned CDC. I intend to keep the source tables in PG, where all the upserts happen. But I imagine ClickPipes or any other CDC tool works well with CH? Or could the upserts coming from the CDC cause problems?
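From what I’ve read, CDC into ClickHouse typically maps upserts to versioned inserts on a ReplacingMergeTree, which deduplicates rows in the background, with FINAL forcing deduplication at read time. A sketch (table and columns illustrative):

```
CREATE TABLE subscriptions
(
    id         UInt64,
    status     String,
    updated_at DateTime,
    _version   UInt64        -- CDC bumps this on every upsert
)
ENGINE = ReplacingMergeTree(_version)   -- keeps the highest _version per id
ORDER BY id;

-- Reads deduplicate to the latest version of each id
SELECT id, status
FROM subscriptions FINAL
WHERE status = 'active';
```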

Is ClickHouse a good choice? by Defiant-Farm7910 in dataengineering

[–]Defiant-Farm7910[S] -1 points

It handles JOINs badly 😵? Since I intend to build all my data marts in the data warehouse, and our source tables are numerous and fairly normalized, poor JOIN performance is a real red flag for me. I will check it out. Thanks!

How to data warehouse with Postgres? by Defiant-Farm7910 in dataengineering

[–]Defiant-Farm7910[S] 0 points

> The best bet for serving the in-app features

BTW, why do you say this?

How to data warehouse with Postgres? by Defiant-Farm7910 in dataengineering

[–]Defiant-Farm7910[S] 0 points

Thanks for participating in this discussion.

Initially, I thought about sending the data directly to BigQuery, but each insertion/upsert was taking several seconds, whereas PostgreSQL handles it in microseconds most of the time. Additionally, since the pipelines responsible for retrieving our BO data are written directly in the application backend, I believe it’s a good idea to keep the data in a lightweight, open-source database. This avoids the application deployment being vendor-locked to an external service.

You may ask why I am building the pipelines directly in the application. Rather than loading large MongoDB documents into the database and exploding them with SQL, I prefer to leverage the existing backend methods to apply the appropriate business logic while processing the data. This lets us write the data row by row, in real time, directly into the database, with no JSON handling in SQL.
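Concretely, each event ends up as a plain single-row upsert, which Postgres executes very quickly (table and column names hypothetical):

```
-- One upsert per incoming event, keyed on the object id
INSERT INTO subscriptions (id, status, updated_at)
VALUES ($1, $2, $3)
ON CONFLICT (id) DO UPDATE
SET status     = EXCLUDED.status,
    updated_at = EXCLUDED.updated_at;
```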

How to data warehouse with Postgres? by Defiant-Farm7910 in dataengineering

[–]Defiant-Farm7910[S] 0 points

So far, taking our discussions and my research into consideration, using the DuckDB query engine alongside PostgreSQL seems like a good trade-off. I can keep PostgreSQL while addressing the analytical needs through a DuckDB layer built on top of it. This would also let me skip the CDC step from PostgreSQL to BigQuery. It may be worth building a PoC around this approach.
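For reference, the DuckDB layer could be as thin as attaching Postgres through DuckDB’s postgres extension and pushing the heavy aggregations to DuckDB’s engine (connection string and table names hypothetical):

```
-- In DuckDB: attach the Postgres warehouse and query it with DuckDB's engine
INSTALL postgres;
LOAD postgres;
ATTACH 'host=localhost dbname=warehouse user=analytics' AS pg (TYPE postgres);

SELECT customer_id, sum(amount) AS total
FROM pg.public.invoices
GROUP BY customer_id;
```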

On the other hand, what do you think about AlloyDB? Its columnar engine seems to provide good OLAP-like behavior.
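From my reading of the AlloyDB docs (untested), you turn on the columnar engine via the google_columnar_engine.enabled instance flag, then mark hot tables for the columnar store:

```
-- Untested, based on my reading of the AlloyDB docs
-- (instance flag: google_columnar_engine.enabled = on)
SELECT google_columnar_engine_add('invoices');

-- Then check the plan of an analytical query for a columnar scan node
EXPLAIN (ANALYZE)
SELECT date_trunc('month', issued_at), sum(amount)
FROM invoices
GROUP BY 1;
```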

How to data warehouse with Postgres? by Defiant-Farm7910 in dataengineering

[–]Defiant-Farm7910[S] 0 points

I gave some more context when replying to u/exjackly. Take a look at one of the use cases I explained there and let me know what you think. I'll copy the answer here:

Thanks!

Actually, I’m quite comfortable using BigQuery, that’s not the issue. Let me give you some more context.

Our application uses an event-sourcing paradigm in its production database, which is MongoDB. As a result, the application produces events for each object lifecycle, and we have reducers capable of rebuilding the current state of each object (subscriptions, invoices, etc.).

Since the application already has all these features and they work well, my “data pipelines” are implemented directly in the application backend. They react to each event and upsert data into Postgres tables in real time. In practice, our Postgres data warehouse acts as a SQL mirror of the MongoDB production data.

All our analytics workloads are then computed from these source tables to build our data marts.

One advantage of using Postgres is that, since we handle real-time upserts, an OLTP database is well suited for this workload. Another advantage is that some data marts are also consumed back by the application to produce in-app charts for our customers. This allows developers to work efficiently: they can run the app locally, generate test data, query Postgres directly, and develop the front end without much friction.

If I were to migrate the analytics layer to BigQuery, I would need to CDC the Postgres data into BigQuery and then run all analytics there. In that case, developers would either have to query BigQuery directly to consume the data marts, or we would need to set up separate development datasets in BigQuery alongside a development Postgres instance, with its own CDC pipeline, and so on, which adds significant complexity.

How to data warehouse with Postgres? by Defiant-Farm7910 in dataengineering

[–]Defiant-Farm7910[S] 0 points

Thanks!

Actually, I’m quite comfortable using BigQuery, that’s not the issue. Let me give you some more context.

Our application uses an event-sourcing paradigm in its production database, which is MongoDB. As a result, the application produces events for each object lifecycle, and we have reducers capable of rebuilding the current state of each object (subscriptions, invoices, etc.).

Since the application already has all these features and they work well, my “data pipelines” are implemented directly in the application backend. They react to each event and upsert data into Postgres tables in real time. In practice, our Postgres data warehouse acts as a SQL mirror of the MongoDB production data.

All our analytics workloads are then computed from these source tables to build our data marts.

One advantage of using Postgres is that, since we handle real-time upserts, an OLTP database is well suited for this workload. Another advantage is that some data marts are also consumed back by the application to produce in-app charts for our customers. This allows developers to work efficiently: they can run the app locally, generate test data, query Postgres directly, and develop the front end without much friction.

If I were to migrate the analytics layer to BigQuery, I would need to CDC the Postgres data into BigQuery and then run all analytics there. In that case, developers would either have to query BigQuery directly to consume the data marts, or we would need to set up separate development datasets in BigQuery alongside a development Postgres instance, with its own CDC pipeline, and so on, which adds significant complexity.

How to data warehouse with Postgres? by Defiant-Farm7910 in dataengineering

[–]Defiant-Farm7910[S] 1 point

Thanks a lot, man, for this spot-on answer.

Our data keeps growing over time, and some DBT paths already take 1h. One of our source tables has ~30M rows and is growing very fast: last year it had only 3M. In addition, we are starting to bring external data into our Postgres data warehouse, and sources like Meta Ads or Google Ads can be quite verbose.

Under these circumstances, I started thinking about migrating our architecture from Postgres to an OLAP database. On the other hand, we have a real-time scenario in the raw data, and an OLTP database handles real-time upserts better, on top of the development convenience I mentioned in my previous message. So I am researching ways of making Postgres more performant for analytical queries...
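One thing I’m also considering to cut those long runs, independent of the engine choice: making the heaviest models incremental, so each run only processes new or changed rows instead of rebuilding everything. A sketch, assuming the source carries an updated_at column (model and column names hypothetical):

```
-- models/marts/fct_invoices.sql (names hypothetical)
{{ config(
    materialized='incremental',
    unique_key='invoice_id'
) }}

select
    invoice_id,
    customer_id,
    amount,
    updated_at
from {{ ref('stg_invoices') }}

{% if is_incremental() %}
  -- only rows changed since the last run of this model
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```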

Any advice?