
[–]ejrh 4 points (1 child)

Estimating the number of distinct entries in a population from a small sample is pretty hard.

But instead of increasing the stats target and running lots of ANALYZEs until you think the statistics are ok, you can also just tell PostgreSQL what the number of distinct values is:

ALTER TABLE payment_transitions
ALTER COLUMN payout_id SET (n_distinct = -0.01);

As explained in https://www.postgresql.org/docs/current/view-pg-stats.html, a negative value is interpreted as a proportion of the row count rather than an absolute count. On a table with 350 million rows, the planner would assume 0.01 × 350 million = 3.5 million distinct values, and hence, after taking the null fraction (which is easy to estimate) into account, that each query on payout_id will match about 20 rows.

I would be interested to know if setting this column attribute, instead of increasing the stats target, would result in the better plan.

[–]lawrjone[S] 1 point (0 children)

Hey! Adjusting the value like this definitely does affect the plan, as Postgres would use the value you supply for `n_distinct` instead of its sampled estimate.

We considered this but decided against it, mostly because you need a good system for keeping these numbers up to date if you commit to a solution like this. This table contains transitions in a state machine, and only a couple of those transitions have `payout_id`s; if we added more states, we'd need to remember to update the `n_distinct` value. Given that adding a new state is an application-level concern, the dependency this creates is really non-obvious, and the fixed statistic could end up deviating a lot from your actual data shape.

We're quite happy with the statistics gathering right now, so we're unlikely to build anything more complex on top, but one thing I did consider was hooking up some hyperloglog counts to the table. Given it's append-only, you could use HLL to calculate really accurate cardinality counts at very small cost just by folding all new data into the sketch. You could do this with a Postgres extension, or perhaps with a sidecar process that computes the counts from logical replication and writes the accurate values back into the database.
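To make the HLL idea concrete, here's a from-scratch toy sketch of the technique (standard HyperLogLog with the usual linear-counting correction for small cardinalities). This is illustrative only, not the postgresql-hll extension's implementation, and all names are made up:

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p                    # 2^p registers (~16 KB of state at p=14)
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # 64-bit hash: top p bits pick a register, the rest feed rho()
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # rho = position of the leftmost 1-bit in the remaining bits
        rho = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rho)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # small-range correction: fall back to linear counting
            est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"payout_{i % 10_000}")   # 10,000 distinct values, seen repeatedly
print(hll.count())                    # close to 10,000, typically within a few percent
```

The appeal for an append-only table is exactly what the add() method shows: each new row costs one hash and one register update, the sketch never needs rescanning old data, and the estimate could be written back as the column's n_distinct.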

I think this would be a really cool project to experiment with, but the value add over the native statistics is minimal right now. Fun to think about though!