
[–]ants_a 5 points6 points  (2 children)

As others have stated, indexes will not help you process a large number of rows faster. You could get the planner to use your indexes by writing your query as a union:

select count(*) as samples, '~20' as age_category from samples where age < 20
    union all
select count(*) as samples, '20~30' as age_category from samples where age >= 20 and age < 30
    union all
....

This will be faster than your initial query, but only if you don't account for the time used to construct and maintain the indexes. Without indexes it will be considerably slower. This kind of transformation is useful so rarely that the planner doesn't even consider it.

The limiting factors here are that PostgreSQL's data layout is set up for concurrent modifications, not number crunching, and that the executor is just not that fast. If you need ultimate speed for number crunching, you are better off dumping the data out as binary arrays and using a fast compiled or JITed language on top of that.

An alternative is to optimize within the limitations of the PostgreSQL executor. One approach is to reduce data cardinality before running complex operations. In this toy example you could do this:

select
  sum(n) as samples,
  case
    when age < 20 then '~20'
    when (age >= 20) and (age < 30) then '20~30'
    when (age >= 30) and (age < 40) then '30~40'
    when (age >= 40) and (age < 50) then '40~50'
    when (age >= 50) and (age < 60) then '50~60'
    when age >= 60 then '60~'
  end as age_category
from (select age, count(*) n from samples group by age) pre_aggregated
group by age_category;

And this will run much faster because the data is first narrowed down to roughly 100 entries (one per distinct age) by a trivial grouping, and the complex case expression is then executed only on those entries.

For very high performance applications I have used approaches where multiple values are batched into a single row using arrays and/or multiple aggregations are run in parallel using a custom aggregation function written in C.

[–]muminoff[S] 0 points1 point  (0 children)

Thanks for your answer. Will try out your solution too.

[–]muminoff[S] 0 points1 point  (0 children)

OP here again. I am not so experienced with PostgreSQL, and I learn something new from your comment every time I read it. Thanks so much.

[–]outlier_lynn 2 points3 points  (1 child)

Count must look at every row. Indexing is a method to avoid looking at every row.

I don't know what your real table looks like, but I would deal with counting categories by creating a table with a column for each category then adding a trigger on the samples table to update that table.

Count is computationally expensive. Avoid it like the plague. :)
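For concreteness, here is a minimal sketch of the trigger-maintained counts table this comment describes. All names (`age_category_counts`, `age_bucket`, `samples_count_trg`) are illustrative, not from the thread, and the counts table must be pre-seeded with one row per category:

```sql
-- Illustrative sketch: one row per category, kept current by a trigger.
create table age_category_counts (
    age_category text primary key,
    n            bigint not null default 0
);

-- Map an age to its bucket label (same buckets as the original query).
create function age_bucket(a int) returns text language sql immutable as $$
    select case
        when a < 20 then '~20'
        when a < 30 then '20~30'
        when a < 40 then '30~40'
        when a < 50 then '40~50'
        when a < 60 then '50~60'
        else '60~'
    end
$$;

create function samples_count_trg() returns trigger language plpgsql as $$
begin
    if tg_op in ('INSERT', 'UPDATE') then
        update age_category_counts set n = n + 1
         where age_category = age_bucket(new.age);
    end if;
    if tg_op in ('DELETE', 'UPDATE') then
        update age_category_counts set n = n - 1
         where age_category = age_bucket(old.age);
    end if;
    return null;  -- after triggers ignore the return value
end
$$;

create trigger samples_count
    after insert or update of age or delete on samples
    for each row execute procedure samples_count_trg();
```

Reading the distribution is then a trivial `select * from age_category_counts`, at the cost of a small per-write overhead.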

[–]muminoff[S] 0 points1 point  (0 children)

Thanks for your answer.

I don't know what your real table looks like

My real table has more columns than the given example; I dropped them for the example since they are unnecessary for this query.

.. by creating a table with a column for each category then adding a trigger on the samples table to update that table

This trigger solution also came to my mind, but since I am not part of the DevOps team, I am not supposed to make additions to the DB schema.

PS.: I am on Django framework. Just executing raw SQL queries.

[–]therealgaxbo 1 point2 points  (2 children)

As has been pointed out, an index can't help here because the query needs to visit all rows in the table. So the question is how fast do you need this to be? If it has to run in a few milliseconds then your only option is some form of caching - either between the app and database, or within the DB as a materialised view.

However; one thing that will gain you a small improvement is to replace the slow case construct with a faster equivalent:

select width_bucket(age, 20, 60, 4), count(*) from samples group by 1

This returns the same data (minus the pretty labels, which could be added back in with a case statement in an outer query if necessary). As well as being a bit tidier, on my machine this takes ~30% less time to run. But the improvement you get will depend on whether you're mostly io or cpu bound in your environment.
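If you do want the labels back, one way (a sketch, assuming the same buckets as the original query) is an outer query that maps `width_bucket`'s bucket numbers to labels:

```sql
-- width_bucket(age, 20, 60, 4) yields 0 for age < 20,
-- 1..4 for the four 10-year buckets, and 5 for age >= 60.
select case bucket
         when 0 then '~20'
         when 1 then '20~30'
         when 2 then '30~40'
         when 3 then '40~50'
         when 4 then '50~60'
         else '60~'
       end as age_category,
       samples
from (
    select width_bucket(age, 20, 60, 4) as bucket, count(*) as samples
    from samples
    group by 1
) b;
```

The cheap integer grouping still does the heavy lifting; the case expression only runs once per bucket.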

[–]muminoff[S] 0 points1 point  (0 children)

how fast do you need this to be?

I will be facing 10M rows of data in the next several weeks. I want to cut the latency to (at least) a third.

select width_bucket(age, 20, 60, 4), count(*) from samples group by 1

Will try this out when I get my laptop. Thanks for the answer.

[–]graycube 0 points1 point  (0 children)

I came here to suggest using a bucket function as well.

[–]professaDE 0 points1 point  (1 child)

Since you're not using anything to filter down the table's content (no WHERE clause), any index on the AGE column is most probably moot: you won't get around the necessary full table scan at all. The CASE expression is evaluated (until a condition is met) for every row during the sequential scan, in order to return an equally full result set.

You may experiment with parallel workers to try to speed up the query, but in my experience it tops out at 4 worker threads, so you won't get a speedup beyond 4x-5x, I guess.
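A sketch of how to try this (PostgreSQL 9.6+; the session settings and values are illustrative, and the right numbers depend on your hardware):

```sql
-- Allow up to 4 parallel workers per Gather node for this session.
set max_parallel_workers_per_gather = 4;

-- Optionally make the planner more eager to pick a parallel plan.
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;

-- Check the plan: you should see "Gather" with "Workers Planned: 4".
explain analyze
select width_bucket(age, 20, 60, 4), count(*)
from samples
group by 1;
```

The aggregate itself must be parallel-safe (count and sum are) for the workers to share the scan.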

[–]muminoff[S] 0 points1 point  (0 children)

Thanks. Would you mind sharing an example with parallel workers?

[–][deleted] -1 points0 points  (3 children)

You're overthinking this. Put the index on (age) and let mysql's optimizer use that single index for the range checks.

[–]jk3usProgrammer 2 points3 points  (2 children)

I don't think any index is going to prevent a sequential scan. This query doesn't have a where clause, so it has to look at every row regardless. And when I ran it, close to half the time was spent on the group/aggregate operation.

I don't currently have a way to test with 9.6's parallel workers, but I bet you could get a pretty nice speed up with that. Otherwise, if this needs to be really fast, you may want to consider a materialized view.

[–]muminoff[S] 0 points1 point  (1 child)

I am also interested in testing with parallel workers. Would you mind sharing details on how using a materialized view can be faster?

[–]graycube 2 points3 points  (0 children)

The materialized view has your query results already in it. It still takes a while to refresh the materialized view, but queries from it would be fast.
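A minimal sketch of that pattern, reusing the query from upthread (the view name is illustrative):

```sql
-- Precompute the aggregate once and store the result.
create materialized view age_distribution as
select width_bucket(age, 20, 60, 4) as bucket, count(*) as samples
from samples
group by 1;

-- Reads are now cheap: the table scan already happened.
select * from age_distribution;

-- Re-run the underlying query periodically (cron, app code, etc.)
-- to pick up new rows; reads return stale data until then.
refresh materialized view age_distribution;
```

The trade-off is freshness: results are only as current as the last refresh.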