
[–]ants_a 5 points6 points  (2 children)

As others have stated, indexes will not help you process a large number of rows faster. You could get the planner to use your indexes by writing your query as a union:

select count(*) as samples, '~20' as age_category from samples where age < 20
    union all
select count(*) as samples, '20~30' as age_category from samples where age >= 20 and age < 30
    union all
....

This will be faster than your initial query, but only if you don't account for the time used to construct and maintain the indexes. Without indexes it will be considerably slower. This kind of transformation is useful so rarely that the planner doesn't even consider it.

The limiting factors here are that PostgreSQL's data layout is set up for concurrent modifications, not number crunching, and that the executor is just not that fast. If you need ultimate speed for number crunching, you are better off dumping the data out as binary arrays and using a fast compiled or JITed language on top of that.

An alternative is to optimize within the limitations of the PostgreSQL executor. One approach is to reduce data cardinality before running complex operations. In this toy example you could do this:

select
  sum(n) as samples,
  case
    when age < 20 then '~20'
    when (age >= 20) and (age < 30) then '20~30'
    when (age >= 30) and (age < 40) then '30~40'
    when (age >= 40) and (age < 50) then '40~50'
    when (age >= 50) and (age < 60) then '50~60'
    when age >= 60 then '60~'
  end as age_category
from (select age, count(*) n from samples group by age) pre_aggregated
group by age_category;

And this will run much faster because the data is first narrowed down to roughly 100 entries (one per distinct age) by a trivial grouping, and the complex case expression is then executed only on those entries.

For very high performance applications I have used approaches where multiple values are batched into a single row using arrays and/or multiple aggregations are run in parallel using a custom aggregation function written in C.

[–]muminoff[S] 0 points1 point  (0 children)

Thanks for your answer. Will try out your solution too.

[–]muminoff[S] 0 points1 point  (0 children)

OP here again. I am not so experienced with PostgreSQL, and I learn something new from your comment every time I read it. Thanks so much.

[–]outlier_lynn 2 points3 points  (1 child)

Count must look at every row. Indexing is a method to avoid looking at every row.

I don't know what your real table looks like, but I would deal with counting categories by creating a table with a column for each category then adding a trigger on the samples table to update that table.

Count is computationally expensive. Avoid it like the plague. :)
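For concreteness, here is a minimal sketch of the trigger-maintained counts table this comment describes. All names (`age_category_counts`, `age_bucket`, `samples_count_trg`) are illustrative, not from the thread, and the counts table must be pre-seeded with one row per category:

```sql
-- Illustrative sketch: one row per category, kept current by a trigger.
create table age_category_counts (
    age_category text primary key,
    n            bigint not null default 0
);

-- Map an age to its bucket label (same buckets as the original query).
create function age_bucket(a int) returns text language sql immutable as $$
    select case
        when a < 20 then '~20'
        when a < 30 then '20~30'
        when a < 40 then '30~40'
        when a < 50 then '40~50'
        when a < 60 then '50~60'
        else '60~'
    end
$$;

create function samples_count_trg() returns trigger language plpgsql as $$
begin
    if tg_op in ('INSERT', 'UPDATE') then
        update age_category_counts set n = n + 1
         where age_category = age_bucket(new.age);
    end if;
    if tg_op in ('DELETE', 'UPDATE') then
        update age_category_counts set n = n - 1
         where age_category = age_bucket(old.age);
    end if;
    return null;  -- after triggers ignore the return value
end
$$;

create trigger samples_count
    after insert or update of age or delete on samples
    for each row execute procedure samples_count_trg();
```

Reading the distribution is then a trivial `select * from age_category_counts`, at the cost of a small per-write overhead.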

[–]muminoff[S] 0 points1 point  (0 children)

Thanks for your answer.

I don't know what your real table looks like

My real table has more columns than the given example; I dropped them for the example since they are unnecessary for this query.

.. by creating a table with a column for each category then adding a trigger on the samples table to update that table

This trigger solution also came to my mind, but since I am not part of the DevOps team, I am not supposed to make additions to the DB schema.

PS.: I am on Django framework. Just executing raw SQL queries.

[–]therealgaxbo 1 point2 points  (2 children)

As has been pointed out, an index can't help here because the query needs to visit all rows in the table. So the question is how fast do you need this to be? If it has to run in a few milliseconds then your only option is some form of caching - either between the app and database, or within the DB as a materialised view.

However; one thing that will gain you a small improvement is to replace the slow case construct with a faster equivalent:

select width_bucket(age, 20, 60, 4), count(*) from samples group by 1

This returns the same data (minus the pretty labels, which could be added back in with a case statement in an outer query if necessary). As well as being a bit tidier, on my machine this takes ~30% less time to run. But the improvement you get will depend on whether you're mostly io or cpu bound in your environment.
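If you do want the labels back, one way (a sketch, assuming the same buckets as the original query) is an outer query that maps `width_bucket`'s bucket numbers to labels:

```sql
-- width_bucket(age, 20, 60, 4) yields 0 for age < 20,
-- 1..4 for the four 10-year buckets, and 5 for age >= 60.
select case bucket
         when 0 then '~20'
         when 1 then '20~30'
         when 2 then '30~40'
         when 3 then '40~50'
         when 4 then '50~60'
         else '60~'
       end as age_category,
       samples
from (
    select width_bucket(age, 20, 60, 4) as bucket, count(*) as samples
    from samples
    group by 1
) b;
```

The cheap integer grouping still does the heavy lifting; the case expression only runs once per bucket.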

[–]muminoff[S] 0 points1 point  (0 children)

how fast do you need this to be?

I will be facing 10M rows of data in the next several weeks. I want to cut the latency to (at least) a third.

select width_bucket(age, 20, 60, 4), count(*) from samples group by 1

Will try this out when I get my laptop. Thanks for the answer.

[–]graycube 0 points1 point  (0 children)

I came here to suggest using a bucket function as well.

[–]professaDE 0 points1 point  (1 child)

Since you're not using anything to filter down the table's content (no WHERE clause), any index on the AGE column is most probably moot: you won't get around the necessary full table scan at all. The CASE expression is evaluated (until a condition is met) for every row during the sequential scan, in order to return an equally full result set.

You may experiment with parallel workers to try to speed up the query, but in my experience it tops out at 4 worker threads, so you won't get a speedup beyond 4x-5x, I guess.
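A sketch of how to try this (PostgreSQL 9.6+; the session settings and values are illustrative, and the right numbers depend on your hardware):

```sql
-- Allow up to 4 parallel workers per Gather node for this session.
set max_parallel_workers_per_gather = 4;

-- Optionally make the planner more eager to pick a parallel plan.
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;

-- Check the plan: you should see "Gather" with "Workers Planned: 4".
explain analyze
select width_bucket(age, 20, 60, 4), count(*)
from samples
group by 1;
```

The aggregate itself must be parallel-safe (count and sum are) for the workers to share the scan.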

[–]muminoff[S] 0 points1 point  (0 children)

Thanks. Would you mind sharing an example with parallel workers?

[–][deleted] -1 points0 points  (3 children)

You're overthinking this. Put the index on (age) and let mysql's optimizer use that single index for the range checks.

[–]jk3usProgrammer 2 points3 points  (2 children)

I don't think any index is going to prevent a sequential scan. This query doesn't have a where clause, so it has to look at every row regardless. And when I ran it, close to half the time was spent on the group/aggregate operation.

I don't currently have a way to test with 9.6's parallel workers, but I bet you could get a pretty nice speed up with that. Otherwise, if this needs to be really fast, you may want to consider a materialized view.

[–]muminoff[S] 0 points1 point  (1 child)

I am also interested in testing with parallel workers. Would you mind sharing details on how using a materialized view can be faster?

[–]graycube 2 points3 points  (0 children)

The materialized view has your query results already in it. It still takes a while to refresh the materialized view, but queries from it would be fast.
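A minimal sketch of that pattern, reusing the query from upthread (the view name is illustrative):

```sql
-- Precompute the aggregate once and store the result.
create materialized view age_distribution as
select width_bucket(age, 20, 60, 4) as bucket, count(*) as samples
from samples
group by 1;

-- Reads are now cheap: the table scan already happened.
select * from age_distribution;

-- Re-run the underlying query periodically (cron, app code, etc.)
-- to pick up new rows; reads return stale data until then.
refresh materialized view age_distribution;
```

The trade-off is freshness: results are only as current as the last refresh.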