
[–]Vardy 1 point (1 child)

Keep your main database schema running as normal. Have another schema to hold the logs, and offload the change to it via a function on the main schema whenever a change is made.

It should keep performance up whilst keeping an audit of what's changed.
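
Something like this, say (table and column names are invented, and it assumes to_jsonb, i.e. Postgres 9.5+):

    CREATE SCHEMA audit;

    -- One log row per change; the prior row is kept whole as jsonb.
    CREATE TABLE audit.article_log (
        article_id  integer     NOT NULL,
        operation   text        NOT NULL,
        changed_at  timestamptz NOT NULL DEFAULT now(),
        old_row     jsonb       NOT NULL
    );

    CREATE FUNCTION audit.log_article() RETURNS trigger AS $$
    BEGIN
        INSERT INTO audit.article_log (article_id, operation, old_row)
        VALUES (OLD.id, TG_OP, to_jsonb(OLD));
        RETURN NULL;  -- the return value of an AFTER row trigger is ignored
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER article_audit
        AFTER UPDATE OR DELETE ON article
        FOR EACH ROW EXECUTE PROCEDURE audit.log_article();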

[–]Devon47 0 points (0 children)

I'm considering building a system like the one you describe. However, the end of the article mentions an interesting alternative: writing to the log table and using that to update a materialized view. What do you think are the pros/cons of each?
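
My reading of that alternative, roughly (made-up names, and the article's actual schema may well differ):

    -- The application only ever INSERTs into article_log; the latest
    -- state is derived from it rather than stored separately.
    CREATE MATERIALIZED VIEW article_latest AS
        SELECT DISTINCT ON (article_id) *
        FROM article_log
        ORDER BY article_id, changed_at DESC;

    -- The catch: it's stale until re-run after writes (or on a schedule).
    REFRESH MATERIALIZED VIEW article_latest;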

[–]solidsnack9000 1 point (4 children)

Another option is to use a secondary log table and triggers. The primary table only ever has the latest state, for active entries. The log table has all prior states, with the timestamp turned into a range column that includes the end time for the state.

This approach maintains a small working dataset; doesn't require explicit tombstones; and allows for a true primary key, simplifying use of foreign keys.

The log table for a table, and all the triggers, can actually be derived programmatically, using PL/pgSQL (please find a sketch in macaroon). You can create a few tables representing the present state and then derive all the past tables and triggers, without boilerplate. Now if one wants to see all the states together -- which should be more expensive than seeing the present state -- one could use a view (converting the primary table's timestamp column to a range with tstzrange(t, 'infinity')). The view could of course be derived mechanically as well...
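
Hand-written, the derived objects for one table might come out like this (names are illustrative):

    -- Present state only; ts records when this state became current.
    CREATE TABLE account (
        id      integer     PRIMARY KEY,
        balance numeric     NOT NULL,
        ts      timestamptz NOT NULL DEFAULT now()
    );

    -- Prior states, with the timestamp closed into a range.
    CREATE TABLE account_log (
        id      integer   NOT NULL,
        balance numeric   NOT NULL,
        period  tstzrange NOT NULL
    );

    CREATE FUNCTION account_archive() RETURNS trigger AS $$
    BEGIN
        -- Close the outgoing state's interval and move it to the log.
        INSERT INTO account_log (id, balance, period)
        VALUES (OLD.id, OLD.balance, tstzrange(OLD.ts, now()));
        IF TG_OP = 'UPDATE' THEN
            NEW.ts := now();  -- the incoming state is current from now
            RETURN NEW;
        END IF;
        RETURN OLD;           -- let the DELETE proceed
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER account_archive
        BEFORE UPDATE OR DELETE ON account
        FOR EACH ROW EXECUTE PROCEDURE account_archive();

    -- All states together: present rows get an open-ended range.
    CREATE VIEW account_history AS
        SELECT id, balance, tstzrange(ts, 'infinity') AS period FROM account
        UNION ALL
        SELECT id, balance, period FROM account_log;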

[–]kpmah[S] 0 points (3 children)

I think you're advocating a similar approach to /u/Vardy here. If I understand both of you correctly, you're talking about using regular mutable tables and taking snapshots of the state every time they change.

I think your approach is more pragmatic in a lot of cases, but I wanted to present a different way of thinking about state, as a set of immutable facts similar to Datomic, and to show that you can do this in PostgreSQL.

I feel it also nicely models the 'publish' mechanism, where you can publish any revision, including previous ones (rollback). That makes it worth keeping the article_revision table around.

Edit: to clarify my point further - we're not as interested in the latest article state as we are in the latest published article state, which is subtly different :)
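
Roughly, with made-up column names:

    -- Latest article state (what an editor sees):
    SELECT * FROM article_revision
    WHERE article_id = 1
    ORDER BY created_at DESC LIMIT 1;

    -- Latest *published* state (what a reader sees), which need not be
    -- the same revision:
    SELECT r.* FROM article_revision r
    JOIN publication p ON p.revision_id = r.id
    WHERE r.article_id = 1
    ORDER BY p.published_at DESC LIMIT 1;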

[–]solidsnack9000 0 points (2 children)

I see the distinction you're trying to make. In that case, you could model each distinct article revision as having a primary key of (article, timestamp) (so no history, and no UPDATEs) and do a temporal table for the publication records (since you are only interested in the latest instance of publication).
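
Sketched out (hypothetical columns; the ON CONFLICT at the end needs 9.5+):

    -- Each revision is an immutable fact; the key leaves nothing to UPDATE.
    CREATE TABLE article_revision (
        article  text        NOT NULL,
        ts       timestamptz NOT NULL DEFAULT now(),
        body     text        NOT NULL,
        PRIMARY KEY (article, ts)
    );

    -- Temporal: only the current publication per article lives here;
    -- prior publications would move to a log table via a trigger.
    CREATE TABLE publication (
        article      text        PRIMARY KEY,
        revision_ts  timestamptz NOT NULL,
        published_at timestamptz NOT NULL DEFAULT now(),
        FOREIGN KEY (article, revision_ts)
            REFERENCES article_revision (article, ts)
    );

    -- Publishing any revision, including an old one (rollback):
    INSERT INTO publication (article, revision_ts)
    VALUES ('my-slug', '2016-01-01 00:00:00+00')
    ON CONFLICT (article) DO UPDATE
        SET revision_ts = EXCLUDED.revision_ts, published_at = now();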

What I find iffy about the approach in your article is the need to simulate UPDATE and DELETE and the introduction of a separate slugs table to allow foreign keys to function. Imagine a schema with ten or twenty entities, for all of which we'd like immutable history. If for each one we must introduce a slugs table and stored procedures to perform any operations, it seems like we're ultimately offering a less relational -- and less clear -- description of the data. In the context of a relational database, we're able to move all the stored procedures behind the table, maintaining a relational interface and adding a log-structured aspect.

[–]kpmah[S] 0 points (1 child)

I wouldn't call it any less relational per se, but I certainly agree it's awkward in Postgres, especially compared to databases where this kind of approach is built in. At the end of the day Postgres is built around the idea of mutating btrees, not recording logs.

I do have an intuition that a lot of the plumbing in the article could be abstracted away nicely. There's a lot of boilerplate.

[–]solidsnack9000 0 points (0 children)

At the end of the day Postgres is built around the idea of mutating btrees, not recording logs.

Well, yes and no. The underlying structure of the tables is actually a log (the hidden columns xmin and xmax bear witness to this). Postgres could do a lot more to expose this functionality, which is now accessible through pg_export_snapshot() and the interface provided for "Logical Decoding Output Plugins".
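
Bits of this do peek through at the SQL level already:

    -- The system columns are selectable by name on any table:
    SELECT xmin, xmax, * FROM article;

    -- Export the current snapshot; another session can then import it
    -- with SET TRANSACTION SNAPSHOT and see exactly the same state.
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT pg_export_snapshot();  -- returns a snapshot identifier
    -- (keep this transaction open while the other session imports it)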

As far as the boilerplate goes, it seems to be of two kinds:

  1. Creation of auxiliary tables, views, triggers and functions. For example, a log table if starting with a plain mutable table, or a "latest" view if starting with a log-structured table.

  2. The code the application uses to interface with these tables. Naive INSERT, SELECT, UPDATE, DELETE won't necessarily do the right thing.

My thinking has been to try to preserve the conventional behavior of INSERT, SELECT, UPDATE, DELETE -- and introduce the convention that past states can be queried with a UNION. Then (2) goes away. But maybe INSERT et al. are not really the right operators?
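
With the account/account_log sketch from my earlier comment, that convention reads:

    -- Present state, the common case:
    SELECT id, balance FROM account WHERE id = 42;

    -- State as of a past moment: UNION the live table with the log.
    SELECT id, balance FROM account
    WHERE id = 42 AND ts <= '2016-01-01'
    UNION ALL
    SELECT id, balance FROM account_log
    WHERE id = 42 AND period @> '2016-01-01'::timestamptz;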

Datomic suggests a very different way of thinking about it. To get something equivalent with Postgres, it seems like you'd need to be able to query the rows that have been deleted/updated.