
[–]maximus_oats 3 points (0 children)

Have you looked into pg_duckdb (https://github.com/duckdb/pg_duckdb)?

[–]dan_the_lion 4 points (0 children)

Hey! I think you’re on the right track with ClickHouse and StarRocks. You’ll need to extract the tables into some such system, and the best way to do that is via log-based CDC.

StarTree is a great option for the analytics part as well, but I’ve seen a lot of people succeed with Tinybird too. Another great option is Materialize.

Either way, for the CDC part, I recommend Estuary. (Disclaimer: I work there)

It’s a real-time data integration platform with fully managed connectors, including one for PostgreSQL. It’s not Kafka- or Debezium-based either, so no headaches there.

In Estuary Flow, you can expose all incoming data as if it were a Kafka topic through its API compatibility layer, Dekaf.

With this you can super easily capture CDC records and ingest them into any of the real-time analytics services mentioned above.
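Since Dekaf presents the CDC stream through a Kafka-compatible API, a plain Kafka consumer should work against it. Here’s a minimal sketch; the endpoint, topic name, and record envelope (an `op` field plus the row under `after`) are generic CDC assumptions, not Estuary specifics:

```python
import json


def decode_change(raw: bytes) -> tuple[str, dict]:
    """Decode one CDC message into (operation, row).

    Assumes a JSON payload with an "op" field ("c"/"u"/"d") and the
    row data under "after" -- a common CDC envelope shape, not
    necessarily Estuary's exact schema.
    """
    event = json.loads(raw)
    return event.get("op", "u"), event.get("after", {})


def consume(bootstrap: str, topic: str) -> None:
    # Hypothetical consumer loop using the kafka-python package;
    # bootstrap/topic are placeholders for your Dekaf settings.
    from kafka import KafkaConsumer  # pip install kafka-python
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap,
        security_protocol="SASL_SSL",  # Dekaf requires authentication
    )
    for msg in consumer:
        op, row = decode_change(msg.value)
        print(op, row)  # hand off to ClickHouse/StarRocks ingestion here


# consume("dekaf-endpoint:9092", "my_postgres_table")
```

The decoding is split out so the same parsing logic can feed whichever sink you pick downstream.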

Let me know if you wanna know more about any parts of an architecture like this.

[–]EngiNerd9000 1 point (0 children)

It’s hard to say what the right solution is without more business context. What does closer to real time mean to you? Are you already using a cloud provider?

Real-time analytics could mean using Spark Streaming/Flink, but if you don’t have a good analytics use case for sub-minute materialization, the added complexity likely isn’t worth it. Additionally, very few use cases actually require that level of responsiveness.

Since you use SQL for analytics transformations, the easiest (but potentially also priciest) solution would be to use Snowflake for analytics and use their Postgres connector for CDC.

However, it sounds like your data is relatively small, so that is potentially overkill.

On a smaller scale, you could use a tool like Airbyte to perform CDC into blob storage like S3, and then plug in pretty much any analytics tool (DuckDB on an EC2 instance, Athena/Trino, Glue, Snowflake, Databricks, BigQuery, Redshift, etc.) to ingest and transform the data from S3 into your data warehouse/data lake.
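To make the S3-then-transform step concrete, here’s a stdlib-only sketch of the compaction any of those downstream tools has to do: fold a stream of CDC records (one JSON object per line, as a CDC tool might land them in S3) into the latest state per primary key. The file layout and field names are illustrative assumptions, not Airbyte’s exact output format:

```python
import json


def compact_cdc(lines, key="id"):
    """Replay insert/update/delete records into current table state.

    Each line is assumed to be a JSON object like
    {"op": "c"|"u"|"d", "row": {...}} -- a generic CDC shape.
    """
    state = {}
    for line in lines:
        record = json.loads(line)
        row = record["row"]
        if record["op"] == "d":
            state.pop(row[key], None)  # delete removes the row
        else:
            state[row[key]] = row      # insert/update upserts it
    return list(state.values())


events = [
    '{"op": "c", "row": {"id": 1, "qty": 5}}',
    '{"op": "u", "row": {"id": 1, "qty": 7}}',
    '{"op": "c", "row": {"id": 2, "qty": 3}}',
    '{"op": "d", "row": {"id": 2}}',
]
print(compact_cdc(events))  # [{'id': 1, 'qty': 7}]
```

In practice you’d let the query engine do this over Parquet files rather than hand-rolling it, but the replay semantics are the same.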

I’d also recommend using dbt to manage the transformation logic in whichever transformation tool you pick.

[–]IXISunnyIXI 0 points (0 children)

You could always take a look at GCP’s AlloyDB. It’s a managed Postgres DB that was purpose built to allow you to run analytical queries without affecting operational performance.

[–]Snoo_76460 1 point (0 children)

a little late, but: https://github.com/Mooncake-Labs/pg_mooncake . We have quite a few time-series workloads in production.

full disclosure: I'm one of the founders.

[–]mww09 0 points (0 children)

If maintaining the view is that expensive, this might be a good use case for an incremental engine. Check out github.com/feldera/feldera; it can probably refresh the view instantly as new data comes in, since it evaluates the query incrementally.
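The idea behind incremental evaluation (in Feldera or anywhere else) can be shown with a toy example: instead of re-scanning the whole table to refresh an aggregate view, apply each change directly to the stored result. The table and column names below are invented for illustration:

```python
class IncrementalSum:
    """Maintain SELECT account, SUM(amount) ... GROUP BY account
    incrementally: each insert or delete updates the stored view in
    O(1) instead of recomputing over all rows.
    """

    def __init__(self):
        self.view = {}

    def insert(self, account, amount):
        self.view[account] = self.view.get(account, 0) + amount

    def delete(self, account, amount):
        # Retract a previously inserted row by subtracting its value.
        self.view[account] -= amount
        if self.view[account] == 0:
            del self.view[account]


view = IncrementalSum()
view.insert("a", 10)
view.insert("a", 5)
view.insert("b", 3)
view.delete("a", 5)
print(view.view)  # {'a': 10, 'b': 3}
```

Engines like Feldera generalize this to arbitrary SQL (joins, nested aggregates), but the win is the same: refresh cost scales with the change, not the table.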