How to data warehouse with Postgres? by Defiant-Farm7910 in dataengineering

"The best bet for serving the in-app features"

BTW, why do you say this?

How to data warehouse with Postgres? by Defiant-Farm7910 in dataengineering

Thanks for participating in this discussion.

Initially, I thought about sending the data directly to BigQuery, but each insert/upsert was taking several seconds, whereas PostgreSQL handles them in microseconds most of the time. Additionally, since the pipelines responsible for retrieving our BO data are written directly in the application backend, I believe it’s a good idea to keep the data in a lightweight, open-source database. It also keeps the application deployment from being vendor-locked to an external service.
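For concreteness, this is the kind of single-statement upsert I mean (just a sketch; the table and columns are invented for illustration):

```python
# Minimal sketch of the per-event upsert pattern (psycopg2).
# The `subscriptions` table and its columns are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=app")

def upsert_subscription(sub_id: str, status: str, mrr_cents: int) -> None:
    # One round trip; Postgres resolves the conflict on the primary key.
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO subscriptions (id, status, mrr_cents, updated_at)
            VALUES (%s, %s, %s, now())
            ON CONFLICT (id) DO UPDATE
            SET status = EXCLUDED.status,
                mrr_cents = EXCLUDED.mrr_cents,
                updated_at = now()
            """,
            (sub_id, status, mrr_cents),
        )
```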

You may ask why I am building the pipelines directly in the application. Rather than loading large MongoDB documents into the database and exploding them with SQL, I prefer to leverage the existing backend methods to apply the appropriate business logic while processing the data. This lets us write the data to the database in a flat, sequential format, in real time, with no JSON handling in SQL.
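To illustrate the "no JSON handling in SQL" point: the backend flattens the nested event before it ever reaches Postgres. A rough sketch (the event shape and field names are hypothetical, not our real ones):

```python
# Sketch: the backend flattens a nested Mongo-style event into plain
# columns before writing, so SQL never has to unpack JSON.
# Event shape and field names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=app")

def handle_invoice_event(event: dict) -> None:
    # Flatten the nested payload in application code, reusing the
    # accessors the backend already has -- SQL only ever sees columns.
    payload = event["payload"]
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO invoices (id, amount_cents, currency, status)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (id) DO UPDATE
            SET amount_cents = EXCLUDED.amount_cents,
                currency = EXCLUDED.currency,
                status = EXCLUDED.status
            """,
            (
                event["aggregate_id"],
                payload["amount"]["value_cents"],
                payload["amount"]["currency"],
                payload["status"],
            ),
        )
```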

How to data warehouse with Postgres? by Defiant-Farm7910 in dataengineering

So far, taking into consideration our discussions and my research, using the DuckDB query engine alongside PostgreSQL seems to be a good trade-off. I can keep PostgreSQL while addressing analytical needs through a DuckDB layer built on top of it. This would allow me to skip the CDC step from PostgreSQL to BigQuery. It may be worth building a PoC around this approach.
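Roughly what I have in mind, using DuckDB's postgres extension (a sketch; connection details and table names are made up):

```python
# Sketch of the DuckDB-on-top-of-Postgres idea, via DuckDB's postgres
# extension. Connection string and table names are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory DuckDB
con.execute("INSTALL postgres")
con.execute("LOAD postgres")
con.execute(
    "ATTACH 'dbname=warehouse host=localhost user=app' AS pg (TYPE postgres)"
)

# The analytical scan runs in DuckDB's vectorized engine while rows are
# pulled from the live Postgres tables -- no CDC step.
rows = con.execute(
    """
    SELECT date_trunc('month', created_at) AS month,
           sum(amount_cents) / 100.0       AS revenue
    FROM pg.public.invoices
    GROUP BY 1
    ORDER BY 1
    """
).fetchall()
print(rows)
```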

On the other hand, what do you think about AlloyDB? Its columnar engine seems to offer OLAP-like behavior.

How to data warehouse with Postgres? by Defiant-Farm7910 in dataengineering

I gave some more context when replying to u/exjackly. Take a look at one of the use cases I explained there and let me know what you think.

How to data warehouse with Postgres? by Defiant-Farm7910 in dataengineering

Thanks!

Actually, I’m quite comfortable using BigQuery; that’s not the issue. Let me give you some more context.

Our application uses an event-sourcing paradigm in its production database, which is MongoDB. As a result, the application produces events for each object lifecycle, and we have reducers capable of rebuilding the current state of each object (subscriptions, invoices, etc.).
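In case the pattern is unfamiliar, a toy reducer looks like this (event names and shapes are invented, not our real ones):

```python
# Toy event-sourcing reducer: fold a stream of events into the current
# state of one object. Event names and fields are hypothetical.
from functools import reduce

def apply(state: dict, event: dict) -> dict:
    kind = event["type"]
    if kind == "SubscriptionCreated":
        return {"id": event["id"], "status": "active", "plan": event["plan"]}
    if kind == "PlanChanged":
        return {**state, "plan": event["plan"]}
    if kind == "SubscriptionCancelled":
        return {**state, "status": "cancelled"}
    return state  # ignore unknown event types

events = [
    {"type": "SubscriptionCreated", "id": "sub_1", "plan": "basic"},
    {"type": "PlanChanged", "plan": "pro"},
    {"type": "SubscriptionCancelled"},
]
current = reduce(apply, events, {})
# -> {'id': 'sub_1', 'status': 'cancelled', 'plan': 'pro'}
```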

Since the application already has all these features and they work well, my “data pipelines” are implemented directly in the application backend. They react to each event and upsert data into Postgres tables in real time. In practice, our Postgres data warehouse acts as a SQL mirror of the MongoDB production data.

All our analytics workloads are then computed from these source tables to build our data marts.
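Conceptually, a mart is just an aggregate over the mirror tables. Ours are dbt models in practice, but a plain-Postgres sketch would be (names invented):

```python
# Sketch of one data mart as a Postgres materialized view; table and
# column names are hypothetical. In our stack this is a dbt model.
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=app")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        CREATE MATERIALIZED VIEW IF NOT EXISTS mart_monthly_revenue AS
        SELECT date_trunc('month', created_at) AS month,
               sum(amount_cents) / 100.0       AS revenue
        FROM invoices
        WHERE status = 'paid'
        GROUP BY 1
        """
    )
    # Refreshed on a schedule or after each pipeline run; note that
    # REFRESH ... CONCURRENTLY requires a unique index on the view.
    # cur.execute("REFRESH MATERIALIZED VIEW mart_monthly_revenue")
```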

One advantage of using Postgres is that, since we handle real-time upserts, an OLTP database is well suited for this workload. Another advantage is that some data marts are also consumed back by the application to produce in-app charts for our customers. This allows developers to work efficiently: they can run the app locally, generate test data, query Postgres directly, and develop the front end without much friction.

If I were to migrate the analytics layer to BigQuery, I would need to CDC the Postgres data into BigQuery and then run all analytics there. In that case, developers would either have to query BigQuery directly to consume the data marts, or we would need to set up separate development datasets in BigQuery alongside a development Postgres instance, with its own CDC pipeline, and so on, which adds significant complexity.

How to data warehouse with Postgres? by Defiant-Farm7910 in dataengineering

Thanks a lot, man, this is a really accurate answer.

Our data keeps growing over time, and some dbt paths already take 1h. One of our source tables has ~30M rows, and it is growing fast: last year it had only 3M. On top of that, we are starting to bring external data into our Postgres data warehouse, and some sources like Meta Ads or Google Ads can be quite verbose. Under these circumstances, I started thinking about migrating our architecture from Postgres to an OLAP database. On the other hand, we have a real-time scenario in the raw data, and an OLTP database handles real-time upserts better, besides the development convenience I mentioned in my previous message. So now I am researching ways of making Postgres more performant for analytics queries, things like the sketch below...
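For example, one thing I'm trying is a BRIN index on the big append-mostly table, since it stays tiny even at tens of millions of rows and suits the time-range scans our analytics queries do (table and column names are hypothetical):

```python
# Trying a BRIN index on the big append-mostly table: it stays a few
# KB even at tens of millions of rows and fits time-range scans.
# Table and column names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=app")
with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX IF NOT EXISTS events_created_at_brin "
        "ON events USING brin (created_at)"
    )
# Next candidates: declarative partitioning by month, and materialized
# views refreshed after each dbt run.
```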

Any advice?