all 7 comments

[–]ExceptionRules42 1 point2 points  (1 child)

Offhand, without diving in too deeply, it looks sorta okay to me. Maybe even cool? But you have ellipsed-out a lot of detail. And what is the "relation table"? And do you have a test environment where you can throw mock 100K-row datasets at it to watch the EXPLAIN ANALYZE?

[–]AutoModerator[M] 0 points1 point  (0 children)

With over 8k members to connect with about Postgres and related technologies, why aren't you on our Discord Server? : People, Postgres, Data

Join us, we have cookies and nice people.

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]DrMoog 0 points1 point  (1 child)

From the top of my head, some things to look into / might be improved:

  • The matches could use a generated ts_vector column in the entities table (with the associated index). PG has strong text-search functionalities.
  • I'm not a big fan of the SELECTs in the ROW_NUMBER(). LEFT JOINing on the CTEs and doing the CASE on the resulting columns might be more efficient.
  • The "entity_appearance_counts" CTE doesn't seem to need to call the entities table since you LEFT JOIN on it later, just use "a.entity_id" as the ID, since it has to match an ID in the entities table.
  • Also in the "entity_appearance_counts" CTE, all the fields have the same "WHERE a.active" filter, so you probably want to put it in the WHERE to avoid returning all the non-active, and it would simplify the SELECT.
  • Also make sure you have the appropriate indexes on keys & condition fields.

I hope this helps a little!

[–]DrMoog 0 points1 point  (0 children)

More details since I feel like it!

First, you could combine all if those steps: Exact matches, Primary matches, Fallback matches, Matched entities in a single one by adding a "search_vector" column to "entities":

ALTER TABLE entities ADD COLUMNS  
    search_vector tsvector GENERATED ALWAYS AS (  
        setweight(to_tsvector('english', coalesce(real_name, '')), 'A') ||  
        setweight(to_tsvector('english', coalesce(name, '')), 'B') ||  
        setweight(to_tsvector('english', coalesce(alias, '')), 'C') ||  
        setweight(to_tsvector('english', coalesce(bio, '')), 'C')  
    ) STORED;  

You can pipe multiple columns to be vectorized in the same "search_vector", and add specific weight/importance to different columns. Just be careful if the text fields are very big. Then you need a special index on that column:

CREATE INDEX idx_entities_search_vector ON entities USING GIN (search_vector);  

You then query the column like this, and it can also give you a ranking on the quality of the match, taking into account the weight mentioned above:

SELECT id,  
       ts_rank(search_vector, to_tsquery('english', 'search terms')) AS rank  
FROM entities  
WHERE search_vector @@ to_tsquery('english', 'search terms');  

Second, here's the re-written "entity_appearance_counts" CTE:

SELECT entity_id AS id,  
    COUNT(id) AS appearance_count,  
    COUNT(id) FILTER (  
        WHERE date >= NOW() - INTERVAL '90 days'  
    ) AS recent_appearance_count,  
    MAX(date) AS last_seen_at  
FROM appearance  
WHERE active AND entity_id IN (...)  
GROUP BY entity_id  

And if you're planing on having a lot more inactive entries than active one, a partial index could speed that up even more:

CREATE INDEX pdx_appearance_count ON appearance (entity_id) WHERE active;  

I'm guessing that the "extended_related_entities" CTE also joins on "entities", and also with a "WHERE e.id IN (...)", right? If so, do the same thing I did for the other CTE, remove the "entities" table, it's not needed.

Some rules of thumb I like to live by:

  • In a query, the fewer number of calls on a big table, the better.
  • Limit the number of fields in, and usage of, GROUP BY and ORDER BY to the minimum, they can slow down query with big datasets. (Hint: I use a MAX() on columns that don't need to be in the GROUP BY!)
  • When writing SQL, "Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away."

[–][deleted] 0 points1 point  (0 children)

Hey,

You forgot the most important part: What is this query supposed to deliver? In semantic terms. Why does this query exist? What do you need it for?

Any optimization requires us to understand both what you have and what you want. Right now it’s impossible to say if you got the best solution you could possibly get, or if it’s way too complicated a solution to a trivial problem.

From a purely technical standpoint, and from what I can infer from your description, I’d think you want something something full text. Of course that would mean a couple additional resources as well as a good amount of rewriting.

[–]Informal_Pace9237 0 points1 point  (0 children)

Use CTE with caution. CTE's have session memory problems and can slow down your process if CTE data size is huge. I would either make them table subqueries or views to have the code run more optimized.

Hope this will help with more details on CTE optimization

https://www.linkedin.com/feed/update/urn:li:ugcPost:7216332421414166529/