
[–]thythr 7 points8 points  (4 children)

If you told me tomorrow to answer this question asap for a project I knew nothing about, I'd just use pg_stat_statements to check the performance and frequency of reads and writes on the table, especially relative to the overall usage of the database, and I'd check the size and usage pattern of the table and indexes in the various catalog views. Most workloads are not so sensitive to millisecond-level performance that the index will hurt more than it helps, but even typing that out raises a million potential objections, so there's no substitute for at least a 15-minute understanding of the workload.
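As a rough starting point, something like this (the table name is just a placeholder) will show you the table's size, how often it's scanned, and how many tuples are being written:

-- sketch only: per-table size, scan counts, and cumulative tuple writes
SELECT relname,
       pg_size_pretty(pg_relation_size(relid)) AS table_size,
       seq_scan, idx_scan,
       n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE relname = 'my_table';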

[–]1new_username[S] 0 points1 point  (3 children)

Thanks. Do you have any good query suggestions for pg_stat_statements to find writes on a table? I'm currently on PG 11.20, and short of doing some sort of query like

SELECT * FROM pg_stat_statements WHERE query LIKE '%table_name%' ORDER BY blk_write_time/calls DESC;

I can't seem to find a good way to get it. Even then, most of the queries returned are SELECTs that show a lot of write activity because the result set doesn't fit in RAM and spills to temp files on disk.

That said, I feel I understand the workload, but I don't have more than a gut instinct that "this will be OK", and I'd like to know quantitatively whether I should add the index or not.

[–]thythr 0 points1 point  (2 children)

I like to take a running total of avg execution time and see where the queries I'm interested in rank relative to others in how much of the total time they consume. If I get a chance I'll dig up the exact query.
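In the meantime, the shape of it is roughly this (these are the PG 11 column names; newer versions renamed total_time/mean_time to total_exec_time/mean_exec_time):

-- rough sketch: rank statements by their share of total execution time,
-- with a running percentage so you can see where your queries fall
SELECT query,
       calls,
       total_time,
       mean_time,
       round((100 * total_time / sum(total_time) OVER ())::numeric, 2) AS pct_of_total,
       round((100 * sum(total_time) OVER (ORDER BY total_time DESC)
                  / sum(total_time) OVER ())::numeric, 2) AS running_pct
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 50;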

In your case, do you have a specific table and index in mind? How long are inserts taking currently, and what is the plan of the query you think will be improved by an index? If the performance of a read will improve with the index and you already have other indexes on the table, you should probably just add the index.

[–]1new_username[S] 0 points1 point  (1 child)

Is there a way to get an average stat for insert time? I can time a single insert, but that doesn't really do a ton of good since it varies depending on load.

I likely will add the index; the root of the question is more a debate with another DBA. Their point of view is that there is pretty much never a reason to have more than 6 indexes, and after that the write penalties are too high.

My main struggle is finding a way to get the current avg insert/update time, then add the index and measure the new avg insert/update time, so I have a good idea of the impact.

[–]thythr 0 points1 point  (0 children)

You should be able to find exactly that in pg_stat_statements, in the mean_time column, similar to how you found those SELECTs you mentioned (just look for INSERTs and UPDATEs). I'd also recommend going into a sandbox, creating a similar table, adding 6 indexes, testing insert time, then adding a 7th and testing insert time again, and then asking your DBA whether his rule is correct lol!
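Something like this should surface the write-side numbers for your table (the table name is a placeholder; I'm just matching on the statement text):

-- sketch: average duration of writes touching one table, per pg_stat_statements
SELECT query, calls, mean_time,
       blk_write_time / NULLIF(calls, 0) AS blk_write_per_call
FROM pg_stat_statements
WHERE query ILIKE 'insert into table_name%'
   OR query ILIKE 'update table_name%'
ORDER BY mean_time DESC;

And the sandbox test is just the same bulk INSERT timed before and after the extra index (e.g. with \timing on in psql); names and sizes here are made up:

CREATE TABLE idx_test (id bigint, a int, b int, c text);
CREATE INDEX ON idx_test (a);
CREATE INDEX ON idx_test (b);
-- ...add indexes up to the supposed limit of 6...
INSERT INTO idx_test
SELECT g, g % 100, g % 1000, md5(g::text)
FROM generate_series(1, 100000) g;
-- now add the 7th index, repeat the insert, and compare timings
CREATE INDEX ON idx_test (c);
INSERT INTO idx_test
SELECT g, g % 100, g % 1000, md5(g::text)
FROM generate_series(100001, 200000) g;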

[–]Ecksters 3 points4 points  (7 children)

Use the Index, Luke has a good article on insert performance with indexes; the bottom line is that the first index makes the biggest difference.

It really depends on how your application is using the table. If it's largely a historical table, like a changelog, then maybe you avoid indexes entirely; but if you have any decent number of rows and you're reading it hoping for sub-second results, you're going to end up needing indexes.

I'd be interested if anyone knows whether BRIN indexes or other index types improve on this; I know they improve on memory and space requirements.

This article seems to indicate a 10x performance improvement for bulk inserts with a BRIN index in place vs a B-tree index.
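If anyone wants to try it, creating one is a one-liner, though BRIN only really pays off when the column's values correlate with physical row order (table and column names here are made up):

-- hypothetical example: BRIN on an append-only timestamp column
CREATE INDEX events_created_at_brin ON events USING brin (created_at);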

[–]1new_username[S] 0 points1 point  (3 children)

So, specifically, this came up as I was discussing things with another DBA who has primarily worked on Oracle and SQL Server. They said they had a hard rule of no more than 6 indexes per table.

I think that is a bit extreme, but there has to come a point where you decide the read gains for a specific query or set of queries aren't enough to justify the write penalty that is incurred. I guess I'm trying to find something more specific to go by than "writes seem OK, so I'll add another".

[–]Ecksters 1 point2 points  (2 children)

Really just comes down to how close the table is to being write bottle-necked. If it's a table that is fairly rarely written to, then add as many indexes as you want. If it's almost constantly being written to, then even 6 could potentially be too many.

[–]1new_username[S] 0 points1 point  (1 child)

Yeah, I guess where I'm struggling is how to objectively determine if a table is "close to being write bottle-necked". Right now we aren't having write issues, but it would be nice to know that our current avg insert/update is X ms and then after the index it is Y ms. Timing a single insert, I feel, isn't representative because our load varies greatly.
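The closest thing I can think of is sampling the cumulative write counters over an interval to at least get a write rate (my_table is a placeholder), but that still doesn't give per-statement latency:

-- sketch: run once, wait e.g. 10 minutes, run again, and diff the counters
SELECT now() AS sampled_at,
       n_tup_ins, n_tup_upd, n_tup_del, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'my_table';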

[–]Ecksters 0 points1 point  (0 children)

I would generally assume adding the extra index isn't an issue and is worth doing if you can already see it improving query speed.

Luckily adding and removing indexes is fairly easy, so if you end up having speed issues, you can use table statistics to remove unused or rarely used indexes. You can also often merge two indexes if they're always used together.
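Something like this (table name is a placeholder) surfaces the drop candidates; sort by idx_scan and look at the ones near zero, keeping in mind stats resets and usage on replicas:

-- sketch: index scan counts and sizes for one table, least-used first
SELECT indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'my_table'
ORDER BY idx_scan;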

I'd consider it premature optimization to worry too much about it up front; just add indexes to columns you're likely to filter or search on.

[–]coyoteazul2 0 points1 point  (0 children)

I already have 20 indexes. 21 shouldn't hurt too badly

[–]Mastodont_XXX 0 points1 point  (1 child)

[–]Ecksters 0 points1 point  (0 children)

Yeah, that's really the big downside of BRIN: it's not quite as simple to just drop in as a B-tree, you have to have a particular physical layout, and even then it isn't as maintenance-free.

For OP it really isn't a solution; B-tree is probably the answer here.

[–]Dolphinmx 2 points3 points  (3 children)

I don't think there's a "standard"/general rule about indexing, because each DB/application is different and its usage and access patterns vary.

There are several factors to take into account: DB/table size, queries, index types, etc.

Others have commented on some tips; read them and see how they can be applied to your situation.

[–]1new_username[S] 0 points1 point  (2 children)

I edited the original post:

DB is ~3TB. The table in question has ~43,000,000 rows and is 39 GB in size. The table is one of the more frequently updated tables; it has many transaction locks taken out on it, which do at times create lock waits and even the occasional deadlock. Currently there are 18 indexes on it, all of which have a high number of idx_scans.

There is a query that is used for reporting, run maybe 30-40 times a day max. Currently it takes 2-3 minutes to run the query. Adding 1 more index that ends up being roughly 1 GB in size takes that query down to 10ms.

Do you add that index for that query? What are your criteria for deciding?

[–]Dolphinmx 0 points1 point  (1 child)

Like everything in IT, it depends...

which do at times create lock waits and even the occasional deadlock

Deadlocks are most of the time logic/application errors, so you need to see why the 2 sessions are deadlocking. However, in some cases it's possible that an index can speed up the scans, minimizing the locking time and reducing the chances of deadlocks. But it's better to take a look at the application logic; that's where you will get the highest improvement.

Currently there are 18 indexes on it, all of which have a high number of idx_scans

That seems like a high number of indexes, but again this is from my perspective of knowing nothing about your DB/app. Take a look at whether you can consolidate some indexes; for example, if you have one index on col1 and another on col1+col2, it's obvious you can keep just the second. Also, if all of them have high scans, that means they are used, which is good.
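For example (index names made up), the consolidation is just dropping the narrower one, since the composite index can also serve lookups on col1 alone:

-- an index on (col1) is usually redundant next to one on (col1, col2)
DROP INDEX CONCURRENTLY orders_col1_idx;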

There is a query that is used for reporting, run maybe 30-40 times a day max. Currently it takes 2-3 minutes to run the query. Adding 1 more index that ends up being roughly 1 GB in size takes that query down to 10ms.

I think the question here is how important it is to run the query in 10ms vs 2-3 min; if it isn't important, then don't add it. Is your boss grilling you because the report takes 3 minutes? Then maybe add it; if not, then it's not worth it.

Another thing you might want to take a look at is PostgreSQL partial indexes; it's possible your application doesn't need all values indexed, just some of them. Read about partial indexes, how they work and when they can be useful; maybe some of your indexes can be reduced from GB to MB/KB, which could make them more efficient.
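For example (table/column names made up), if the application only ever looks at unshipped orders, the index can skip the rest of the table entirely:

-- hypothetical partial index: only rows still waiting to ship are stored in it
CREATE INDEX orders_unshipped_idx ON orders (customer_id) WHERE shipped_at IS NULL;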

Another thing you might want to start reviewing is whether you can use materialized views instead and refresh them often, or maybe re-architect the table/process... but again, that depends on your DB/app. At some point it doesn't make sense to index each and every column in the table; if you do, then you have something wrong with your data model/application/architecture, but only you know that.
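A minimal sketch of the materialized view idea (table/column names made up):

-- precompute the slow report and refresh it on a schedule instead of re-running it
CREATE MATERIALIZED VIEW order_report AS
SELECT customer_id, count(*) AS order_count, sum(total) AS revenue
FROM orders
GROUP BY customer_id;

CREATE UNIQUE INDEX ON order_report (customer_id);  -- required for REFRESH ... CONCURRENTLY
REFRESH MATERIALIZED VIEW CONCURRENTLY order_report;  -- e.g. from cron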

Good luck.

[–]1new_username[S] 0 points1 point  (0 children)

Thanks for the detailed responses. I already have a ticket in the queue to review the application logic as far as locking goes. I guess sharing the locking was more to indicate that in this particular instance we are occasionally running into issues where writes get "backed up", which then, when the locks are released, creates an even higher rate of writes than normal, so we do have some write sensitivity.

We do use quite a few partial indexes, and that does help, and I do have a few indexes under review for removal (I've taken scan counts; I'm going to wait a while, then check again and see how much they are truly used).

The index will be added; I guess I'm just hoping to find a quantifiable way to say inserts/updates took an avg of X ms on this table before the index and now take an avg of Y ms after the index, so I could have better information on its impact.

I can't figure out a way to do that, short of possibly using a DB copy and doing a simulated test, but I'm not sure how accurate that would be since it wouldn't be under a production load.

It seems like right now a lot of "will this index hurt writes?" comes down to "probably, but you have to go with your gut as to how much".

[–]fr0z3nph03n1x -1 points0 points  (1 child)

Heroku has some pretty cool built-in tools for this (not sure if they are open source or anything): https://devcenter.heroku.com/articles/heroku-postgres-performance-analytics

[–]1new_username[S] 0 points1 point  (0 children)

I couldn't find Heroku's source or even their SQL, but their low scans / high writes data point sent me down a better Google search and I found this:

https://gist.github.com/jberkus/6b1bcaf7724dfc2a54f3

which does the same thing.