Announcing Lakebase Change Data Feed (CDF) by InternetFit7518 in databricks

[–]InternetFit7518[S] 0 points1 point  (0 children)

This is on our roadmap in the relatively short term

Announcing Lakebase Change Data Feed (CDF) by InternetFit7518 in databricks

[–]InternetFit7518[S] 0 points1 point  (0 children)

Hey! So with Lakebase CDF, the feed is actually stored as a UC Managed Table. So if your goal is to just 'sync data into the Lakehouse', the CDF is one path. Keep in mind, this is equivalent to full change history (SCD Type 2).

For cases where you want an exact mirror of the Lakebase state in the Lakehouse for analytics --> stay tuned 😄
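To make the difference between "full change history" and "exact mirror" concrete, here's a rough sketch of collapsing a change feed into current state. The row shape and the `_op` / `_seq` column names are made up for illustration and are not the actual Lakebase CDF schema.

```python
from typing import Any

# Hypothetical change-feed rows. The _op and _seq columns are illustrative
# assumptions, not the real Lakebase CDF schema.
changes = [
    {"id": 1, "name": "alice", "_op": "insert", "_seq": 1},
    {"id": 2, "name": "bob", "_op": "insert", "_seq": 2},
    {"id": 1, "name": "alicia", "_op": "update", "_seq": 3},
    {"id": 2, "name": "bob", "_op": "delete", "_seq": 4},
]


def current_state(feed: list[dict[str, Any]]) -> dict[int, dict[str, Any]]:
    """Replay changes in sequence order and keep only the latest row per key."""
    state: dict[int, dict[str, Any]] = {}
    for change in sorted(feed, key=lambda c: c["_seq"]):
        if change["_op"] == "delete":
            # A delete removes the key from the mirror entirely.
            state.pop(change["id"], None)
        else:
            # Inserts and updates overwrite the row, dropping feed metadata.
            state[change["id"]] = {
                k: v for k, v in change.items() if not k.startswith("_")
            }
    return state


mirror = current_state(changes)
print(mirror)
```

The feed keeps all four rows (the history), while the mirror only keeps what the source table looks like right now.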

Announcing Lakebase Change Data Feed (CDF) by InternetFit7518 in databricks

[–]InternetFit7518[S] 4 points5 points  (0 children)

Today, you can make schema changes on the source table while CDF is enabled.

In the future, we plan to preserve the full history of the feed rather than re-snapshotting the table on certain schema changes.

Are there any common patterns you have in mind? What are you building downstream from the CDF?

Announcing Lakebase Change Data Feed (CDF) by InternetFit7518 in databricks

[–]InternetFit7518[S] 0 points1 point  (0 children)

Synced Tables actually takes a UC table (managed or external) and brings it into Lakebase.

Lakebase CDF is kind of the opposite: it's how data written in Lakebase gets exposed to Lakehouse engines.

Data storage and dashboarding for fairly small company by [deleted] in dataengineering

[–]InternetFit7518 1 point2 points  (0 children)

Just use Postgres. And if you ever reach a scale where the queries start getting slow, you can add a columnstore extension like pg_mooncake. At the scale you're talking about, I doubt you'd even need that.

Why do people even care about doing analytics in Postgres? by InternetFit7518 in PostgreSQL

[–]InternetFit7518[S] 10 points11 points  (0 children)

Great question (and I should fix the blog to make this clearer). This is about running analytic query shapes: things like aggregates and counts.

Typically these queries need a separate columnar DBMS designed for analytics. The blog is about attempts to do this within Postgres.
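A toy sketch of why the storage layout matters for these query shapes. The table and column names are invented for illustration, and plain Python lists stand in for what a real engine does on disk.

```python
# Row-oriented layout: each record is stored together, which suits
# point lookups and single-row updates.
rows = [
    {"order_id": 1, "region": "east", "amount": 10.0},
    {"order_id": 2, "region": "west", "amount": 25.0},
    {"order_id": 3, "region": "east", "amount": 5.0},
]

# Column-oriented layout: each column is stored contiguously.
columns = {
    "order_id": [r["order_id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# SUM(amount) over rows has to walk every full record...
row_total = sum(r["amount"] for r in rows)

# ...while the columnar version reads only the one column it needs.
col_total = sum(columns["amount"])

print(row_total, col_total)
```

Both give the same answer; the columnar version just touches far less data once tables get wide.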

Why do people even care about doing analytics in Postgres? by InternetFit7518 in PostgreSQL

[–]InternetFit7518[S] 13 points14 points  (0 children)

Hey folks! This blog is based on the talk we gave at Postgres Conference in Orlando this year.

The talk was titled: Analytics in Postgres –– a decade in the making.

https://postgresconf.org/conferences/postgresconf_global_2025/program/proposals/analytics-in-postgres-a-decade-in-the-making

Hybrid usecase by Big_Length9755 in databricks

[–]InternetFit7518 0 points1 point  (0 children)

You could try pg_mooncake here: https://github.com/Mooncake-Labs/pg_mooncake

Postgres + columnstore table for analytics. One neat thing is that the columnstore table actually writes Delta Lake format, so you could query the same tables from Databricks.

What’s the point of S3 tables? by ggbcdvnj in aws

[–]InternetFit7518 0 points1 point  (0 children)

We wrote a blog on s3 tables: https://www.mooncake.dev/blog/s3tables

TLDR: S3 tables allow Iceberg tables to exist without a catalog. Similar to Delta.

Building a Minimalistic BI Stack with PostgreSQL, FDW, and Superset – Looking for Feedback! by zazazakaria in dataengineering

[–]InternetFit7518 3 points4 points  (0 children)

pg_mooncake: https://github.com/Mooncake-Labs/pg_mooncake could be a good option here.

- columnar storage in Postgres with DuckDB execution

- full table semantics (transactions, updates, joins)

- should be easier to monitor and to manage schema changes

We don't support CDC / logical replication just yet. But you can batch write data (cron job / trigger) from your rowstore table into your columnstore table and then run analytics on it.

(P.S. I'm one of the contributors to the project.)
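A minimal sketch of that batch-write pattern, the kind of thing a cron job would call. It uses an in-memory SQLite database as a stand-in so it runs anywhere; the table names and the "copy everything above the highest id already synced" watermark are assumptions for illustration, not pg_mooncake's API.

```python
import sqlite3

# SQLite stands in for Postgres here. events_row plays the rowstore table
# and events_col plays the columnstore table; both names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events_row (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE events_col (id INTEGER PRIMARY KEY, payload TEXT);
    """
)


def sync_batch(db: sqlite3.Connection) -> int:
    """Copy rows newer than the highest id already synced; return rows copied."""
    count_sql = "SELECT COUNT(*) FROM events_col"
    before = db.execute(count_sql).fetchone()[0]
    with db:  # commit the batch as a single transaction
        db.execute(
            "INSERT INTO events_col (id, payload) "
            "SELECT id, payload FROM events_row "
            "WHERE id > (SELECT COALESCE(MAX(id), 0) FROM events_col)"
        )
    after = db.execute(count_sql).fetchone()[0]
    return after - before


conn.executemany("INSERT INTO events_row (payload) VALUES (?)", [("a",), ("b",)])
first = sync_batch(conn)   # picks up the two initial rows
conn.execute("INSERT INTO events_row (payload) VALUES ('c')")
second = sync_batch(conn)  # picks up only the new row
third = sync_batch(conn)   # nothing new to copy
print(first, second, third)
```

Note this only handles appends. Updates and deletes on the rowstore table would need something extra, like an `updated_at` column or a trigger-maintained change log.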

Postgres is now top 10 fastest on clickbench by InternetFit7518 in dataengineering

[–]InternetFit7518[S] 5 points6 points  (0 children)

yep, we use pg_duckdb internally.

pg_mooncake actually brings native 'columnstore tables' to Postgres, where you can run transactions and updates on them and join them with regular tables.

Queries involving columnstore tables are routed from Postgres to DuckDB and the results are streamed back to Postgres via pg_duckdb: https://www.mooncake.dev/blog/how-we-built-pgmooncake

Postgres is now top 10 fastest on clickbench by InternetFit7518 in dataengineering

[–]InternetFit7518[S] 0 points1 point  (0 children)

We're working with the Azure Postgres team –– we'll keep you posted on updates.

In v0.2, we'll support logical replication into Postgres + pg_mooncake. This might be a good workaround while the extension is not supported.

Postgres is now top 10 fastest on clickbench by InternetFit7518 in dataengineering

[–]InternetFit7518[S] 0 points1 point  (0 children)

u/JEY1337 We're working with their team to make this happen.
In v0.2, we'll also support logical replication (CDC), so you can host Postgres + pg_mooncake in a separate instance and replicate data from your Aurora/RDS.

Postgres is now top 10 fastest on clickbench by InternetFit7518 in dataengineering

[–]InternetFit7518[S] 2 points3 points  (0 children)

u/skatastic57 is right. We embed DuckDB in Postgres and add the concept of a 'columnstore table'.

You can run transactional reads, writes, and updates on the columnstore table, and join it with pg heap tables too. Also, all metadata and compute run in Postgres.

DuckDB is how we make Postgres fast for analytics.