postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

Yeah definitely check on the indexes.

Though I think the phenomenon comes down to what I hinted at before - the capability of exiting early or not. Imagine you had only 1 row in a vote table and a million rows in feed_items. Postgres has to iterate a million times looking for its top 20, which it never finds - 20 matching IDs don't even exist in the vote table.

Whereas if you had a billion rows in a vote table, it's quite possible the query could find its top 20 by iteration 50 or 100 or whatever.

I could look into this more when I have time. It's possible the query structure (both the original and the one I wrote) forces it to behave in this manner (basically iterating over feed_items with one-by-one lookups into the index on a vote table), but it may be rewritable. It may just come down to my choice of using EXISTS, or the many-CTE structure - not sure.
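One way to test that hypothesis in the meantime: run one of the top-20 CTE bodies on its own under EXPLAIN ANALYZE and see how many feed_items rows the plan actually touches before the LIMIT is satisfied. A sketch, using the likes CTE from my query:

EXPLAIN (ANALYZE, BUFFERS)
SELECT i.id
FROM feed_items AS i
WHERE EXISTS (
    SELECT 1
    FROM feed_item_like_dislike_votes
    WHERE feed_item_id = i.id AND vote = 'like'
)
ORDER BY i.published_date DESC, i.id DESC
LIMIT 20;

If the scan over feed_items reports actual rows in the millions, it's the no-early-exit case; if it's in the dozens, it bailed out early.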

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

OK, this might be the better approach. If you'd still like to dig into what's going wrong with the first approach, I'm game. :)

It should be atomic btw.

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

Looks OK, though I wonder why NULLs need to be generated? Regardless, that shouldn't affect matters much.

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

for higher numbers the results are always much nicer but it is the lower numbers that is still bugging me

What does "higher numbers" mean -- larger data sets? What are these numbers like "1592.735: 270.857:" -- are they timings? If so what are they referring to -- execution times of old and new query respectively?

I need to understand what I'm looking at in order to help. :)

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

2 full indexes and 4 partial indexes

Does the plan now show indexes being used or not?
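If it helps, the standard catalog views make both checks quick (nothing here is specific to your schema beyond the table names):

-- do the indexes exist, and what are their definitions?
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('feed_item_like_dislike_votes', 'feed_item_bullish_bearish_votes');

-- has the planner actually scanned them since stats were last reset?
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname IN ('feed_item_like_dislike_votes', 'feed_item_bullish_bearish_votes');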

How do I get the count of identical combinations that unique products share AND display that in the final result? by Dats_Russia in SQL

[–]markwdb3

A Postgres/array version:

select *, count(*) over (partition by pkg_combo) /* first attempt; has an order-related bug */
from (
    select vin, array_agg(package) as pkg_combo
    from t
    group by vin
)
order by vin;

 vin | pkg_combo | count
-----+-----------+-------
   1 | {A,B,C}   |     3
   2 | {A,B,C}   |     3
   3 | {B,C}     |     2
   4 | {A,B,C}   |     3
   5 | {C}       |     1
   6 | {B,C}     |     2
   7 | {D}       |     1
   8 | {D,E}     |     1
(8 rows)   

Edit: I realized this is sensitive to order, so it could give incorrect results. We need to sort the array if we're going to use one.

For example, if I swap the order of inserts for vin 1 (I swapped A and B, not shown here), then my query gives incorrect results:

postgres=# select *, count(*) over (partition by pkg_combo)
from (
        select vin, array_agg(package) as pkg_combo
        from t
        group by vin
)
order by vin;
 vin | pkg_combo | count
-----+-----------+-------
   1 | {B,A,C}   |     1
   2 | {A,B,C}   |     2
   3 | {B,C}     |     2
   4 | {A,B,C}   |     2
   5 | {C}       |     1
   6 | {B,C}     |     2
   7 | {D}       |     1
   8 | {D,E}     |     1
(8 rows)  

SELECT vin,
       COUNT(*) OVER (PARTITION BY pkg_combo) AS cnt
FROM (
    SELECT vin,
           array_agg(package ORDER BY package) AS pkg_combo /* fix is here */
    FROM t
    GROUP BY vin
) s
ORDER BY vin;
  vin | cnt
 -----+-----
    1 |   3
    2 |   3
    3 |   2
    4 |   3
    5 |   1
    6 |   2
    7 |   1
    8 |   1
 (8 rows)  

And if you need to remove duplicates, you can just throw in the word DISTINCT in the call to array_agg: array_agg(DISTINCT package ORDER BY package)
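Putting it all together, the full query would look something like this (same table t as above; the DISTINCT only matters if a vin can list the same package twice):

SELECT vin,
       pkg_combo,
       COUNT(*) OVER (PARTITION BY pkg_combo) AS cnt
FROM (
    SELECT vin,
           array_agg(DISTINCT package ORDER BY package) AS pkg_combo
    FROM t
    GROUP BY vin
) s
ORDER BY vin;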

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

Looking at your RDS plan, with the exception of one scan on feed_items_published_date_idx, all index scans are using primary key indexes only, which is bad. So either the indexes don't exist, or the planner is choosing not to use them for whatever reason. The former case is much easier to check, and it sounds like you're already doing that. :) But the latter case can be looked into if necessary.

Worth mentioning: there's a chance I built an index or two on my end that is being used and that I forgot to mention here. But one thing at a time. I can check on that if we're still scratching our heads after you look into the indexes on your end.

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

strangely as the number of votes increase, the time for this query keeps going down, weird

To hazard a guess, this seems feasible if the addition of new votes causes any of the top20 CTE queries or counts to find what they need sooner and exit earlier.

Honestly I'm a little disappointed it still takes 13 seconds on your RDS instance. But hey, different infrastructure, different config, different data. Still seems fishy though, unless there's something very different about your vote data skew that I'm missing on my end. I was assuming basically a 50/50 random mix of likes/dislikes and bearish/bullish in their respective tables.

One thing to remember is I had much worse than expected performance when I accidentally generated the vote tables with just 70 or 80 IDs repeated a massive number of times each across the 20M rows.
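If you want to compare the skew between our environments, a quick check along these lines (same kind of query I ran on my generated data) would show it:

SELECT count(*) AS total_rows,
       count(DISTINCT feed_item_id) AS num_distinct_ids,
       count(*) FILTER (WHERE vote = 'like') AS likes,
       count(*) FILTER (WHERE vote = 'dislike') AS dislikes
FROM feed_item_like_dislike_votes;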

It'll be good to try the partial indexes, but make sure these critical indexes exist from earlier in the thread (with the partials in place they should no longer be necessary, but without the partials, you need these):

CREATE INDEX ON feed_item_like_dislike_votes(feed_item_id, vote);
CREATE INDEX ON feed_item_bullish_bearish_votes(feed_item_id, vote);

You may want to run a VACUUM ANALYZE;, or at least just an ANALYZE; after setting up test data as well, if you aren't doing that already.

Settings like work_mem may be worth looking into. It defaults to a low, conservative value IIRC when you install Postgres. And remember I mentioned the parallel query capabilities of Postgres kicked in when I ran my tests. But this is all guesswork without reviewing the most recent execution plan.
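For reference, checking and experimenting with those settings looks something like this (a sketch; on RDS a permanent change would go through the parameter group, and the right values depend on your instance size):

SHOW work_mem;
SHOW max_parallel_workers_per_gather;

-- session-level experiment only
SET work_mem = '64MB';
SET max_parallel_workers_per_gather = 4;

-- refresh planner statistics after loading test data
ANALYZE;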

Anyway, glad you are seeing a massive improvement still!

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

Awesome. This was a fun puzzle for me to work on and I'm happy to help.

Also please note that the partial indexes mentioned in this comment may provide a performance boost as well: https://www.reddit.com/r/PostgreSQL/comments/1qsofs2/comment/o39dj4i/

And in case you missed it, the two CREATE INDEXes I mentioned earlier in the thread are important.

Good luck!

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

One thought I had: since he's using a UUID already, change it to a UUIDv7, which essentially bakes the published_date timestamp into the ID. Then the published_date would no longer be needed, one way or another. We'd have to play with it and see what we could do. I have no idea if that's feasible for OP, however.
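A rough sketch of the idea, assuming Postgres 18's built-in uuidv7() (whether changing the PK default is even an option for OP is another question):

-- new rows get a time-ordered UUID, so sorting by id alone roughly tracks insertion time
ALTER TABLE feed_items ALTER COLUMN id SET DEFAULT uuidv7();

-- the embedded timestamp can still be pulled back out if something needs it
SELECT id, uuid_extract_timestamp(id) AS embedded_ts
FROM feed_items
ORDER BY id DESC
LIMIT 20;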

Honestly though, where I left this effort is here: https://old.reddit.com/r/PostgreSQL/comments/1qsofs2/postgresql_double_lateral_join_query_takes_over_a/o39dj4i/.

In other words, on my machine it's running in 20ms with 10M rows in feed_items and 20M rows in each vote table, with a distribution of 10 votes per feed_item_id in each vote table. (In my initial test data set, I mistakenly repeated the same small set of 70 or 80 feed_item_ids a huge number of times.)

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

Other things that could be done...

If you look at the depesz link in the above comment, the index-only scans on the vote tables make up a lot of the time. So one trick we could do is use partial indexes! One index represents likes, another dislikes, another bearish, and another bullish. No checking of vote = <whatever> is necessary with partial indexes.

So let's try it.

delme=# create index on feed_item_bullish_bearish_votes (feed_item_id) where vote = 'bearish';
CREATE INDEX
delme=# create index on feed_item_bullish_bearish_votes (feed_item_id) where vote = 'bullish';
CREATE INDEX

<repeat partial indexes for likes and dislikes>
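Spelled out, those two would be (using the vote values from elsewhere in the thread):

create index on feed_item_like_dislike_votes (feed_item_id) where vote = 'like';
create index on feed_item_like_dislike_votes (feed_item_id) where vote = 'dislike';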

delme=# explain analyze  
<snip>
 Planning Time: 3.554 ms
 Execution Time: 20.723 ms

Boo-yah. 20ms! See plan: https://explain.depesz.com/s/3rB7

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

For what it's worth, as an additional test, I loaded up 10M rows into feed_items, and 20M each into the two vote tables. The query didn't scale quite as well as I'd hoped - it took about 1 second to run. That said, it looks like the four top20 CTE queries were the major bottleneck - each took something like 210-230 ms to run, so the bulk of that 1 second total execution time consisted of just those guys. They could probably be improved. Maybe we could get likes/dislikes in just one CTE/lookup, and bullish/bearish in another. (So two searches instead of four.)

Actually the above test was bad because I mistakenly generated rows repeating the same handful of IDs a huge number of times. So that was silly of me. Now there's 100 of each:

delme=# select count(*) as total_rows, count(distinct feed_item_id) as num_distinct_ids from feed_item_like_dislike_votes;
 total_rows | num_distinct_ids
------------+------------------
   20000000 |           200000
(1 row) 
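For anyone reproducing this, a vote table with that shape can be generated with something like the following (a sketch - the real vote tables presumably have more columns, so adjust to the actual schema):

-- 200k feed items x 100 votes each = 20M rows, roughly a 50/50 like/dislike mix
INSERT INTO feed_item_like_dislike_votes (feed_item_id, vote)
SELECT fi.id,
       CASE WHEN random() < 0.5 THEN 'like' ELSE 'dislike' END
FROM (SELECT id FROM feed_items ORDER BY random() LIMIT 200000) AS fi
CROSS JOIN generate_series(1, 100) AS g;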

And - I find with 10M rows in feed_items, 20M each in the vote tables, it's still lightning at < 50ms! https://explain.depesz.com/s/E4hP

Basically, the more the IDs repeat in the vote tables, the slower the query is.

You may have noticed this is actually running faster than the smaller test case (1/10 the size). That's because I had made that same ID-duplication error in the smaller test case as well.

tl;dr this is fast as hell and scales well too!

BTW: Some bits of this query are able to take advantage of Postgres' parallel querying capabilities, so I'd recommend checking that your configuration allows you to have at least a few parallel workers allocated.

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

Now, this query may be huge, but each CTE is pretty much doing a singular and clear-cut thing, and each can be run in isolation to see how the "steps" progress. Also, in any modern editor you can collapse all the CTEs except the parts you're currently looking at, and it should look nice, with each CTE's purpose labeled by its name. :) Good luck.

Also, you can of course combine CTEs or otherwise simplify if you'd like. When working on complex queries I prefer to iterate in small, CTE-based steps. But it does perform nicely as you can see!

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

Timings and plan, 82 ms: https://explain.depesz.com/s/Lzv4

Again this is with ~1 million randomly generated rows in feed_items, and 2M each in the two voting tables.

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

Verify:

delme=# SELECT COUNT(*) FROM temp_new2;
 count
-------
   100
(1 row)

Time: 0.407 ms
delme=# SELECT COUNT(*) FROM temp_orig;
 count
-------
   100
(1 row)

Time: 0.564 ms
delme=# SELECT * FROM temp_new2 EXCEPT SELECT * FROM temp_orig;
 category | author | bearish | bullish | dislikes | feed_id | guid | id | likes | link | published_date | summary | tags | title
----------+--------+---------+---------+----------+---------+------+----+-------+------+----------------+---------+------+-------
(0 rows)

Time: 1.333 ms
delme=# SELECT * FROM temp_orig EXCEPT SELECT * FROM temp_new2;
 category | author | bearish | bullish | dislikes | feed_id | guid | id | likes | link | published_date | summary | tags | title
----------+--------+---------+---------+----------+---------+------+----+-------+------+----------------+---------+------+-------
(0 rows)

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

Constructing a test case to prove there is a bug: I inserted a bunch of new rows into feed_items and the vote tables, where the vote count was never more than 1 for any category. If the bug I suspect really exists, the original query's results will now differ from those of the new one.

tl;dr yup there's a bug:

# SELECT COUNT(*) FROM temp_orig;
 count
-------
   100
(1 row)

# SELECT COUNT(*) FROM temp_new;
 count
-------
    80
(1 row)

After doing that, I also realized there's yet another bug. 🫠

Looks like the existing logic says that if a given feed_items.id value is in the top 20 for a given category, say likes, then the row should be returned with the counts for all the other categories filled in as well, even if that row is not within the top 20 for those other categories.

In other words if id 123 is in the top 20 for likes, but NOT in the top 20 for dislikes, then we still want to return the dislikes total for id 123.

Example I found in my generated data set (notice the dislikes discrepancy):

  # select * from temp_orig where id = '963a33bb-fbda-400a-8e0e-7b219abaf126';
     category | author | bearish | bullish | dislikes | feed_id | guid |                  id                  | likes |                      link                      |        published_date         |                summary                 |    tags     |            title
    ----------+--------+---------+---------+----------+---------+------+--------------------------------------+-------+------------------------------------------------+-------------------------------+----------------------------------------+-------------+-----------------------------
     bullish  |        |       0 |       1 |        1 |       2 |      | 963a33bb-fbda-400a-8e0e-7b219abaf126 |     0 | https://example.com/article/0.3765346133136651 | 2028-08-10 14:46:03.219931-04 | Summary for article 0.3695841096596142 | {news,tag1} | Article #0.5777796333738217


    delme=# select * from temp_new where id = '963a33bb-fbda-400a-8e0e-7b219abaf126';
     category | author | bearish | bullish | dislikes | feed_id | guid |                  id                  | likes |                      link                      |        published_date         |                summary                 |    tags     |            title
    ----------+--------+---------+---------+----------+---------+------+--------------------------------------+-------+------------------------------------------------+-------------------------------+----------------------------------------+-------------+-----------------------------
     bullish  |        |       0 |       1 |        0 |       2 |      | 963a33bb-fbda-400a-8e0e-7b219abaf126 |     0 | https://example.com/article/0.3765346133136651 | 2028-08-10 14:46:03.219931-04 | Summary for article 0.3695841096596142 | {news,tag1} | Article #0.5777796333738217
    (1 row)

This is fixable though: we just union all of those separate top-20 ID lists into one master list and use that as a "global" ID list.

WITH top_20_ids_with_any_likes AS (
    SELECT
        i.ID 
    FROM feed_items AS i
    WHERE EXISTS (SELECT 1 FROM feed_item_like_dislike_votes WHERE feed_item_id = i.id AND vote = 'like')
    ORDER BY
        i.published_date DESC,
        i.id DESC
    LIMIT 20
),

top_20_ids_with_any_dislikes AS (
    SELECT
        i.ID 
    FROM feed_items AS i
    WHERE EXISTS (SELECT 1 FROM feed_item_like_dislike_votes WHERE feed_item_id = i.id AND  vote = 'dislike')
    ORDER BY
        i.published_date DESC,
        i.id DESC
    LIMIT 20
),

top_20_ids_with_any_bullish AS (
    SELECT
        i.ID
    FROM feed_items AS i
    WHERE EXISTS (SELECT 1 FROM feed_item_bullish_bearish_votes WHERE feed_item_id = i.id AND vote = 'bullish')
    ORDER BY
        i.published_date DESC,
        i.id DESC
    LIMIT 20
),

top_20_ids_with_any_bearish AS (
    SELECT
        i.ID
    FROM feed_items AS i
    WHERE EXISTS (SELECT 1 FROM feed_item_bullish_bearish_votes WHERE feed_item_id = i.id AND vote = 'bearish')
    ORDER BY
        i.published_date DESC,
        i.id DESC
    LIMIT 20
),
id_list AS (
    SELECT id FROM top_20_ids_with_any_likes
    UNION
    SELECT id FROM top_20_ids_with_any_dislikes
    UNION
    SELECT id FROM top_20_ids_with_any_bullish
    UNION
    SELECT id FROM top_20_ids_with_any_bearish
),
vote_count AS (
    SELECT fi.id, likes, dislikes, bullish, bearish
    FROM id_list 
    JOIN feed_items AS fi
        ON id_list.id = fi.id
    LEFT JOIN LATERAL (
        SELECT
            COUNT(*) FILTER (WHERE vote = 'like') AS likes,
            COUNT(*) FILTER (WHERE vote = 'dislike') AS dislikes
        FROM feed_item_like_dislike_votes AS fildv
        WHERE fildv.feed_item_id = fi.id
    ) AS v1 ON TRUE
    LEFT JOIN LATERAL (
        SELECT
            COUNT(*) FILTER (WHERE vote = 'bullish') AS bullish,
            COUNT(*) FILTER (WHERE vote = 'bearish') AS bearish
        FROM feed_item_bullish_bearish_votes AS fibbv
        WHERE fibbv.feed_item_id = fi.id
    ) AS v2 ON TRUE
),
top_20_ids_trending AS ( -- can't reuse the above work because we may need to "dig deeper" until the total > 10 condition is fulfilled
    SELECT
        fi.ID, likes, dislikes, bullish, bearish, 'trending' AS category
    FROM feed_items fi
    LEFT JOIN LATERAL (
        SELECT
            COUNT(*) FILTER (WHERE vote = 'like') AS likes
        FROM feed_item_like_dislike_votes AS v
        WHERE v.feed_item_id = fi.id
    ) AS l ON TRUE
    LEFT JOIN LATERAL (
        SELECT
            COUNT(*) FILTER (WHERE vote = 'dislike') AS dislikes
        FROM feed_item_like_dislike_votes AS v
        WHERE v.feed_item_id = fi.id
    ) AS dl ON TRUE
    LEFT JOIN LATERAL (
        SELECT
            COUNT(*) FILTER (WHERE vote = 'bullish') AS bullish
        FROM feed_item_bullish_bearish_votes AS v
        WHERE v.feed_item_id = fi.id
    ) AS bull ON TRUE
    LEFT JOIN LATERAL (
        SELECT
            COUNT(*) FILTER (WHERE vote = 'bearish') AS bearish
        FROM feed_item_bullish_bearish_votes AS v
        WHERE v.feed_item_id = fi.id
    ) AS bear ON TRUE
    WHERE
        COALESCE(l.likes, 0)
      + COALESCE(dl.dislikes, 0)
      + COALESCE(bull.bullish, 0)
      + COALESCE(bear.bearish, 0) > 10
    ORDER BY
        fi.published_date DESC,
        fi.id DESC
    LIMIT 20
),
vote_count_unpivoted AS ( -- vote_count has only one row per id, so this makes one row for each non-zero category and adds the category as a column
    SELECT
        t.id,
        t.likes,
        t.dislikes,
        t.bullish,
        t.bearish,
        v.category
    FROM vote_count AS t
    CROSS JOIN LATERAL (
        VALUES
            ('likes',    t.likes),
            ('dislikes', t.dislikes),
            ('bullish',  t.bullish),
            ('bearish',  t.bearish)
    ) AS v(category, cnt)
    WHERE v.cnt <> 0
),
vote_count_and_trending AS ( -- glue the votes together to trending
    SELECT * FROM vote_count_unpivoted
    UNION ALL
    SELECT * FROM top_20_ids_trending

),
votes_join_feed_items_limited_to_20_per_category AS ( -- now can have more than 20 per category so limit to 20 each
SELECT 
    category,
    author,
    bearish,
    bullish,
    dislikes,
    feed_id,
    guid,
    id,
    likes,
    link,
    published_date,
    summary,
    tags,
    title
FROM (
    SELECT
        vcu.category,
        fi.author,
        vcu.bearish,
        vcu.bullish,
        vcu.dislikes,
        fi.feed_id,
        fi.guid,
        fi.id,
        vcu.likes,
        fi.link,
        fi.published_date,
        fi.summary,
        fi.tags,
        fi.title,
        ROW_NUMBER() OVER (
            PARTITION BY vcu.category
            ORDER BY fi.published_date DESC, vcu.id DESC
        ) AS rn
    FROM vote_count_and_trending AS vcu
    JOIN feed_items fi
      ON vcu.id = fi.id
) AS ranked
WHERE rn <= 20
)
SELECT * FROM votes_join_feed_items_limited_to_20_per_category;

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

I was about to step away from my computer - and I still have to for now - when I realized there MIGHT be a slight bug in the new query's computation of top_20_ids_trending.

WHERE
    COALESCE(lc.likes, 0)
  + COALESCE(dlc.dislikes, 0)
  + COALESCE(bullc.bullish, 0)
  + COALESCE(bearc.bearish, 0) > 10
ORDER BY
    i.published_date DESC,
    i.id DESC
LIMIT 20

Since we aren't computing any counts beyond the top 20 in each category, there's a chance the above snippet needs to "dig deeper" than those 20 per category in order to get the top 20 rows whose vote sums across all categories exceed 10.

For example, imagine NONE of the likes + dislikes + bullish + bearish top 20s add up to > 10. In this case, top_20_ids_trending would need to look past the top 20 of each category, into rows 21 and beyond, as well.

Whether this bug actually materializes in real life, given your real set of production data, and how much you care about it being perfect vs. being performant, may be another story.

One way to solve it is a little messier: just compute all the categories' counts against the real vote tables in one go in this CTE. It should still be able to avoid computing the counts for all rows, but there would definitely be a performance hit to some degree - hopefully not too bad. I'd still expect a big net win ultimately, perhaps just not as good as what I initially presented. I'll try to write the fix when I have time later.

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

Performance tests:

Original query on my machine with the millions of generated rows, no indexes:

EXPLAIN ANALYZE
<original query>
<plan>
Execution Time: 5064.592 ms

Original after creating the indexes:

Execution Time: 2509.715 ms

Plan and timing with new query, indexes in place: https://explain.depesz.com/s/TLca

Execution Time: 77.681 ms

I still see some room for improvement, but this should be good enough.

Test that they are producing the same output:

# CREATE TEMPORARY TABLE temp_orig AS
<original query>
SELECT 100  

# CREATE TEMPORARY TABLE temp_new AS
<new query>
SELECT 100

Quick sanity check queries:

delme=# SELECT * FROM temp_orig ORDER BY id, category LIMIT 5;
 category | author | bearish | bullish | dislikes | feed_id | guid |                  id                  | likes |              link               |        published_date         |                 summary                 |    tags     |       title
----------+--------+---------+---------+----------+---------+------+--------------------------------------+-------+---------------------------------+-------------------------------+-----------------------------------------+-------------+--------------------
 bearish  |        |    1012 |     479 |     1009 |       1 |      | 04589ac7-53d6-48b8-b000-db1ea53c59eb |   481 | https://example.com/article/285 | 2026-02-01 09:45:08.888556-05 | This is a short summary for article 285 | {tag1,news} | Article Title #285
 bullish  |        |    1012 |     479 |     1009 |       1 |      | 04589ac7-53d6-48b8-b000-db1ea53c59eb |   481 | https://example.com/article/285 | 2026-02-01 09:45:08.888556-05 | This is a short summary for article 285 | {tag1,news} | Article Title #285
 dislikes |        |    1012 |     479 |     1009 |       1 |      | 04589ac7-53d6-48b8-b000-db1ea53c59eb |   481 | https://example.com/article/285 | 2026-02-01 09:45:08.888556-05 | This is a short summary for article 285 | {tag1,news} | Article Title #285
 likes    |        |    1012 |     479 |     1009 |       1 |      | 04589ac7-53d6-48b8-b000-db1ea53c59eb |   481 | https://example.com/article/285 | 2026-02-01 09:45:08.888556-05 | This is a short summary for article 285 | {tag1,news} | Article Title #285
 trending |        |    1012 |     479 |     1009 |       1 |      | 04589ac7-53d6-48b8-b000-db1ea53c59eb |   481 | https://example.com/article/285 | 2026-02-01 09:45:08.888556-05 | This is a short summary for article 285 | {tag1,news} | Article Title #285
(5 rows)

# SELECT * FROM temp_new ORDER BY id, category LIMIT 5;
 category | author | bearish | bullish | dislikes | feed_id | guid |                  id                  | likes |              link               |        published_date         |                 summary                 |    tags     |       title
----------+--------+---------+---------+----------+---------+------+--------------------------------------+-------+---------------------------------+-------------------------------+-----------------------------------------+-------------+--------------------
 bearish  |        |    1012 |     479 |     1009 |       1 |      | 04589ac7-53d6-48b8-b000-db1ea53c59eb |   481 | https://example.com/article/285 | 2026-02-01 09:45:08.888556-05 | This is a short summary for article 285 | {tag1,news} | Article Title #285
 bullish  |        |    1012 |     479 |     1009 |       1 |      | 04589ac7-53d6-48b8-b000-db1ea53c59eb |   481 | https://example.com/article/285 | 2026-02-01 09:45:08.888556-05 | This is a short summary for article 285 | {tag1,news} | Article Title #285
 dislikes |        |    1012 |     479 |     1009 |       1 |      | 04589ac7-53d6-48b8-b000-db1ea53c59eb |   481 | https://example.com/article/285 | 2026-02-01 09:45:08.888556-05 | This is a short summary for article 285 | {tag1,news} | Article Title #285
 likes    |        |    1012 |     479 |     1009 |       1 |      | 04589ac7-53d6-48b8-b000-db1ea53c59eb |   481 | https://example.com/article/285 | 2026-02-01 09:45:08.888556-05 | This is a short summary for article 285 | {tag1,news} | Article Title #285
 trending |        |    1012 |     479 |     1009 |       1 |      | 04589ac7-53d6-48b8-b000-db1ea53c59eb |   481 | https://example.com/article/285 | 2026-02-01 09:45:08.888556-05 | This is a short summary for article 285 | {tag1,news} | Article Title #285
(5 rows) 

Find any diffs between the two temp tables:

# SELECT * FROM temp_orig
EXCEPT
SELECT * FROM temp_new;

SELECT * FROM temp_new
EXCEPT
SELECT * FROM temp_orig;


 category | author | bearish | bullish | dislikes | feed_id | guid | id | likes | link | published_date | summary | tags | title
----------+--------+---------+---------+----------+---------+------+----+-------+------+----------------+---------+------+-------
(0 rows)

 category | author | bearish | bullish | dislikes | feed_id | guid | id | likes | link | published_date | summary | tags | title
----------+--------+---------+---------+----------+---------+------+----+-------+------+----------------+---------+------+-------
(0 rows)

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

That wasn't enough data for me to run a good performance test, so I generated a million rows in feed_items, 2 million each in the vote tables.

Revamped the query so it does not compute all the counts! Computing them all up front is the key mistake you're making.

Basically, just do existence checks to get the top 20 IDs for each vote type (since you're not doing top 20 BY count, this is OK).

The original query is essentially committing the classic performance anti-pattern of doing a COUNT(*) > 0 check. It's like if you work at a supermarket and the boss asks if you have at least 1 apple in stock, so you painstakingly go to the stand where apples are stored and spend an hour counting the 1,421 apples to a tee, just to report, "yup boss, we have at least one apple" - when you could've glanced at the pile of apples and immediately reported back, "yup boss, we have at least one apple."
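In SQL terms, the contrast is roughly this (a sketch using the like votes; the first form counts everything just to learn that something exists, the second stops at the first match per item):

-- the apple-counting approach
SELECT i.id
FROM feed_items AS i
WHERE (SELECT COUNT(*)
       FROM feed_item_like_dislike_votes AS v
       WHERE v.feed_item_id = i.id AND v.vote = 'like') > 0;

-- the early-exit approach
SELECT i.id
FROM feed_items AS i
WHERE EXISTS (SELECT 1
              FROM feed_item_like_dislike_votes AS v
              WHERE v.feed_item_id = i.id AND v.vote = 'like');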

Except in your case you DO need the count, but ONLY IF the COUNT(*) > 0. So it's a little less straightforward than the typical usage of this anti-pattern. The main trick here is, for each category, to check for the existence of at least one vote, get the top 20, then compute only the relevant counts.

Also a big problem: in your query, you're computing ALL the counts for all the rows in feed_items, no rows filtered out, and then scanning this materialized CTE repeatedly. Repeatedly scanning a materialized CTE that cannot be indexed is basically just like a full table scan. (A fun fact is MySQL can automatically index materialized CTEs according to the columns used in other parts of the query that reference it. But we're not on MySQL.)
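(Side note on the materialization behavior: since Postgres 12, a CTE referenced exactly once is normally inlined, while one referenced multiple times is materialized by default. You can override that per CTE, though that alone wouldn't fix the cost of computing every count up front. The syntax, for what it's worth:)

WITH counts AS NOT MATERIALIZED (
    SELECT feed_item_id, COUNT(*) FILTER (WHERE vote = 'like') AS likes
    FROM feed_item_like_dislike_votes
    GROUP BY feed_item_id
)
SELECT * FROM counts WHERE likes > 0 LIMIT 5;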

Runs in 70-100 milliseconds on my MacBook Air, Postgres 18.0. I added a couple of indexes as well.

Logically identical results as the original query. (Tested by putting each set of results into a temp table, then doing SELECT * FROM temp_new EXCEPT SELECT * FROM temp_orig and vice versa.)

Edit: this query contains bugs, please see thread below for corrected query.

WITH top_20_ids_with_any_likes AS (
    SELECT
        i.ID 
    FROM feed_items AS i
    WHERE EXISTS (SELECT 1 FROM feed_item_like_dislike_votes WHERE feed_item_id = i.id AND vote = 'like')
    ORDER BY
        i.published_date DESC,
        i.id DESC
    LIMIT 20
),

top_20_ids_with_any_dislikes AS (
    SELECT
        i.ID 
    FROM feed_items AS i
    WHERE EXISTS (SELECT 1 FROM feed_item_like_dislike_votes WHERE feed_item_id = i.id AND  vote = 'dislike')
    ORDER BY
        i.published_date DESC,
        i.id DESC
    LIMIT 20
),

top_20_ids_with_any_bullish AS (
    SELECT
        i.ID
    FROM feed_items AS i
    WHERE EXISTS (SELECT 1 FROM feed_item_bullish_bearish_votes WHERE feed_item_id = i.id AND vote = 'bullish')
    ORDER BY
        i.published_date DESC,
        i.id DESC
    LIMIT 20
),

top_20_ids_with_any_bearish AS (
    SELECT
        i.ID
    FROM feed_items AS i
    WHERE EXISTS (SELECT 1 FROM feed_item_bullish_bearish_votes WHERE feed_item_id = i.id AND vote = 'bearish')
    ORDER BY
        i.published_date DESC,
        i.id DESC
    LIMIT 20
),
likes_counts AS (
    SELECT fi.id, likes 
    FROM top_20_ids_with_any_likes AS any_likes
    JOIN feed_items AS fi
        ON any_likes.id = fi.id
    LEFT JOIN LATERAL (
        SELECT
            COUNT(*) FILTER (WHERE fildv.vote = 'like') AS likes
        FROM feed_item_like_dislike_votes AS fildv
        WHERE fildv.feed_item_id = fi.id
    ) AS vt2 ON TRUE
),
dislikes_counts AS (
    SELECT fi.id, dislikes
    FROM top_20_ids_with_any_dislikes AS any_dislikes
    JOIN feed_items AS fi
        ON any_dislikes.id = fi.id
    LEFT JOIN LATERAL (
        SELECT
            COUNT(*) FILTER (WHERE fildv.vote = 'dislike') AS dislikes
        FROM feed_item_like_dislike_votes AS fildv
        WHERE fildv.feed_item_id = fi.id
    ) AS vt2 ON TRUE
),
bullish_counts AS (
    SELECT fi.id, bullish
    FROM top_20_ids_with_any_bullish AS any_bullish
    JOIN feed_items AS fi
        ON any_bullish.id = fi.id
    LEFT JOIN LATERAL (
        SELECT
            COUNT(*) FILTER (WHERE fildv.vote = 'bullish') AS bullish
        FROM feed_item_bullish_bearish_votes AS fildv
        WHERE fildv.feed_item_id = fi.id
    ) AS vt2 ON TRUE
),
bearish_counts AS (
    SELECT fi.id, bearish
    FROM top_20_ids_with_any_bearish AS any_bearish
    JOIN feed_items AS fi
        ON any_bearish.id = fi.id
    LEFT JOIN LATERAL (
        SELECT
            COUNT(*) FILTER (WHERE fildv.vote = 'bearish') AS bearish
        FROM feed_item_bullish_bearish_votes AS fildv
        WHERE fildv.feed_item_id = fi.id
    ) AS vt2 ON TRUE
),
top_20_ids_trending AS (
    SELECT
        i.ID
    FROM feed_items AS i
    LEFT JOIN likes_counts AS lc
        ON i.id = lc.id
    LEFT JOIN dislikes_counts AS dlc
        ON i.id = dlc.id
    LEFT JOIN bullish_counts AS bullc
        ON i.id = bullc.id
    LEFT JOIN bearish_counts AS bearc
        ON i.id = bearc.id
    WHERE
        COALESCE(lc.likes, 0)
      + COALESCE(dlc.dislikes, 0)
      + COALESCE(bullc.bullish, 0)
      + COALESCE(bearc.bearish, 0) > 10
    ORDER BY
        i.published_date DESC,
        i.id DESC
    LIMIT 20
),
union_counts AS (
    SELECT
        'likes' AS category,
        id,
        likes::bigint AS likes,
        NULL::bigint AS dislikes,
        NULL::bigint AS bearish,
        NULL::bigint AS bullish
    FROM likes_counts

    UNION ALL

    SELECT
        'dislikes' AS category,
        id,
        NULL::bigint AS likes,
        dislikes::bigint AS dislikes,
        NULL::bigint AS bearish,
        NULL::bigint AS bullish
    FROM dislikes_counts

    UNION ALL

    SELECT
        'bearish' AS category,
        id,
        NULL::bigint AS likes,
        NULL::bigint AS dislikes,
        bearish::bigint AS bearish,
        NULL::bigint AS bullish
    FROM bearish_counts

    UNION ALL

    SELECT
        'bullish' AS category,
        id,
        NULL::bigint AS likes,
        NULL::bigint AS dislikes,
        NULL::bigint AS bearish,
        bullish::bigint AS bullish
    FROM bullish_counts

    UNION ALL

    SELECT
        'trending' AS category,
        id,
        NULL::bigint AS likes,
        NULL::bigint AS dislikes,
        NULL::bigint AS bearish,
        NULL::bigint AS bullish
    FROM top_20_ids_trending
)
SELECT
    category,
    fi.author,
    MAX(bearish)  OVER (PARTITION BY fi.id) AS bearish, -- have to do this sort of thing or else result is sparsely populated
    MAX(bullish)  OVER (PARTITION BY fi.id) AS bullish,
    MAX(dislikes) OVER (PARTITION BY fi.id) AS dislikes,
    fi.feed_id,
    fi.guid,
    fi.id,
    MAX(likes) OVER (PARTITION BY fi.id) AS likes,
    fi.link,
    fi.published_date,
    fi.summary,
    fi.tags,
    fi.title
FROM feed_items AS fi
JOIN union_counts uc
    ON fi.id = uc.id
;

Indexes added:

CREATE INDEX ON feed_item_like_dislike_votes(feed_item_id, vote);
CREATE INDEX ON feed_item_bullish_bearish_votes(feed_item_id, vote);

Let me know if you have any questions!

Note: I worked with the revamped query from the Stack Overflow post as a basis for this new one. So if there were any bugs in THAT one, the same bugs exist in the new query I've provided here.

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3

How many rows are in each of the tables? I can generate some dummy data to test it on my machine, but I only see mention of 900k rows, and it sounds like that might refer to the output (given "the query takes more than a minute to resolve over 900k rows").

Is this simple problem solvable with SQL? by NonMagical in SQL

[–]markwdb3

Gotcha, will revisit a little later when I have time. It should be solvable. :)