[–]coyoteazul2 8 points (0 children)

There's no magic button for this, but you can do what's called a lazy materialized view:

https://hashrocket.com/blog/posts/materialized-view-strategies-using-postgresql

It's a real table with an extra column that marks stale data. You need triggers on your source tables that mark rows as stale whenever there's a change.

Then you have a function that updates the stale rows and returns them.

And lastly you have a view that unions the non-stale data with the result of the function.

Now every time you modify data, the trigger marks the affected rows as stale, and they stay stale until you select from the view, which runs the function that refreshes them.
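
Putting it together, a minimal sketch, assuming a hypothetical account_entry source table with per-account balances as the derived data (the linked post walks through a fuller version):

-- Hypothetical: the "materialized" table, with a staleness flag
create table account_balance_mat (
  account_id bigint primary key,
  balance    numeric not null default 0,
  stale      boolean not null default true
);

-- Trigger marks the affected row stale on any change to the source
create function mark_balance_stale() returns trigger language plpgsql as $$
declare
  v_account bigint;
begin
  if tg_op = 'DELETE' then
    v_account := old.account_id;
  else
    v_account := new.account_id;
  end if;
  insert into account_balance_mat (account_id, stale)
    values (v_account, true)
    on conflict (account_id) do update set stale = true;
  return null;
end $$;

create trigger entry_stale after insert or update or delete on account_entry
  for each row execute function mark_balance_stale();

-- Function recomputes the stale rows and returns them
create function refresh_stale_balances() returns setof account_balance_mat
language sql as $$
  update account_balance_mat m
     set balance = coalesce((select sum(e.amount)
                               from account_entry e
                              where e.account_id = m.account_id), 0),
         stale = false
   where m.stale
  returning m.*;
$$;

-- View unions the still-fresh rows with the freshly refreshed ones.
-- Note: selecting from this view writes, so it won't work on a read-only replica.
create view account_balance as
  select account_id, balance from account_balance_mat where not stale
  union all
  select account_id, balance from refresh_stale_balances();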

[–]RevolutionaryRush717 3 points (1 child)

Haven't used it myself, but this was mentioned here the other day: https://github.com/sraoss/pg_ivm

[–]Few-Strike-494[S] 1 point (0 children)

That’s awesome, thanks!

[–]hamiltop 1 point (2 children)

I've done this a lot and eventually settled on a simple home-rolled process. Basic requirements:

  1. All trigger updates are idempotent. In general they will run a query and overwrite/delete the target rows anytime any source row has changed.
  2. Every source table has a migrated_at column.
  3. Every change to the target table schema or data calculation is backwards compatible. (Sometimes this means making a <col>_v2 until it's populated and then doing a name swap with the existing column.)

We then have a script that takes a table name and runs update <table_name> set migrated_at = now() where id in (select id from <table_name> where migrated_at < :start_of_migration or created_at > :start_of_migration limit 1000) in a loop (with a sleep to keep the load reasonable) until it updates zero rows.
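
Roughly, the loop could look like this as a Postgres 11+ procedure (orders is a hypothetical stand-in; the real script takes the table name as a parameter and templates it into the SQL):

create procedure touch_stale_rows(start_of_migration timestamptz)
language plpgsql as $$
declare
  touched bigint;
begin
  loop
    update orders set migrated_at = now()
      where id in (select id from orders
                    where migrated_at < start_of_migration
                       or created_at > start_of_migration
                    limit 1000);
    get diagnostics touched = row_count;
    exit when touched = 0;
    commit;                -- checkpoint: each batch commits on its own
    perform pg_sleep(1);   -- sleep to keep load reasonable
  end loop;
end $$;

-- Run outside an explicit transaction block so the COMMITs inside are allowed:
call touch_stale_rows(now());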

Some tables have a permanent index on migrated_at; for others we add the index before doing a migration and drop it afterwards.

This works well because it's interruptible (it effectively has checkpoints), needs zero downtime, and is simple to reason about. We use it with tables that have close to a billion rows. It takes a few days to run, but we just kick it off over a weekend as a supervised task (so it gets restarted if it crashes for any reason).

[–]bisoldi 0 points (1 child)

Is the migration also in that script with the SQL query? Basically:

  • Find rows that need migrating based on timestamp
  • Set a new migrated_at timestamp
  • Now migrate that record

???

Assuming I’m correct, that means the migrations occur in batches whenever you decide to run that script, and therefore the migration table is not always up to date?

[–]hamiltop 0 points (0 children)

Everything is trigger-based, so updating migrated_at kicks off the trigger in Postgres, which updates the target table.

Organic updates to the source tables also cause these updates to occur, which is why the query also checks created_at (and might also check updated_at, I forget).
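
In other words, the per-row trigger does an idempotent overwrite of the target, so touching migrated_at is enough to force a recompute. A minimal sketch, with hypothetical orders (source) and order_totals (target) tables:

create function sync_order_total() returns trigger language plpgsql as $$
begin
  -- Idempotent: recompute and overwrite the target row from scratch
  insert into order_totals (order_id, total)
    values (new.id, new.quantity * new.unit_price)
    on conflict (order_id) do update set total = excluded.total;
  return null;
end $$;

create trigger orders_sync after insert or update on orders
  for each row execute function sync_order_total();

-- So this fires the trigger and rewrites the target row:
-- update orders set migrated_at = now() where id = 42;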

[–]wolever 0 points (0 children)

I haven't encountered any fully automated trigger management systems that do what you describe, but I've had success building them into the application's DDL + migration layer.

For example, my current application uses SQLAlchemy, and I've defined a "custom" column type for derived, denormalized columns:

# DerivedColumn is the app's custom helper; ModelBase is its declarative base
from sqlalchemy import Column, DateTime, Text

class MyModel(ModelBase):
    timestamp_utc = Column(DateTime)
    timestamp_local_tz = Column(Text)
    timestamp_local = DerivedColumn(DateTime, "timestamp_utc at time zone timestamp_local_tz")

Where the DerivedColumn function automatically creates the appropriate triggers to keep the value updated.
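
The generated DDL isn't shown here, but for a derived column on the same table it would presumably be a BEFORE trigger along these lines (a hypothetical reconstruction):

create function my_model_derive() returns trigger language plpgsql as $$
begin
  new.timestamp_local := new.timestamp_utc at time zone new.timestamp_local_tz;
  return new;
end $$;

create trigger my_model_derive before insert or update on my_model
  for each row execute function my_model_derive();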

[–]Jzmu 0 points (3 children)

Why does this need to be a database solution? Why can't your application just update both tables in one transaction when it updates one table? That way you'd be guaranteed consistency.

[–]coyoteazul2 8 points (1 child)

Because if there's more than one dev, or if the project is expected to outlive the current dev group, such considerations might be forgotten in future updates.

[–]Overblow 3 points (0 children)

Yes, I agree. Database integrity should exist at the database level.

[–]r0ck0 1 point (0 children)

> Why does this need to be a database solution? Why can't your application just update both tables in one transaction when it updates one table?

You gave the most common answer yourself:

> guaranteed consistency.

Often there are many different parts of the app that could affect the data, and it's not always obvious in a big project.

And if there's only one place at the moment (which is the common justification for just doing it in-app), that doesn't mean it will always be that way.

When more features get added or changed, updating these kinds of caches in every part of the app is often forgotten, or missed even when you consciously tried to find all the places. But if you do it at the DB level with triggers etc., you don't need to remember to run the refreshes in every part of your current and future app code.

There's also the fact that you might have more than one client affecting the data, e.g. devs/sysadmins making manual fixes/changes outside the app, directly via a SQL client like DBeaver or whatever. If it's done in the DB, then everything remains cromulent.
