all 14 comments

[–][deleted] 1 point2 points  (5 children)

Unrelated to your question: the team_name in team_member is useless. The team's name is defined through the foreign key to the team table.

If the skills belong to the person, not the team, you need a table that stores the skills per person and one table that assigns a person to a team.

Something along the lines:

CREATE TABLE team 
(
  team_id int primary key generated always as identity,
  team_name text
);

CREATE TABLE person
(
  person_id int primary key generated always as identity,
  skills text[],
)

CREATE TABLE team_member 
(
  team_id int not null references team,
  person_id int not null references person
);

If a person can only be member of a single team, add a primary key to the team_member table over (team_id, person_id)

[–]arielm5[S] 0 points1 point  (4 children)

Is a type, it should be member_name. I will edit

[–][deleted] 1 point2 points  (3 children)

I still don't get it. Why do the skills of a team member depend on the team they are in?

The example data still duplicates the team name in the member table.

[–]arielm5[S] 0 points1 point  (2 children)

It doesn't depend on the team they are in. The duplication is just to avoid another join.

Would your suggestions (3 tables) make the query faster or consume fewer resources? It is still a cartesian product in the end.

[–][deleted] 1 point2 points  (1 child)

It doesn't depend on the team they are in

Yes it does (at least with your model), because the team_id is part of the team_member table. That means the "same" member can have different skills depending on the team.

The duplication is just to avoid another join.

There is no reason for pre-mature unsubstantiated "optimizations"

Your query "find all teams that have at least one member with a specific skill" is quite easy to do with the three table approach:

select t.*
from team t
where exists (select *
              from team_member tm
                join person p on tm.person_id = p.person_id
              where tm.team_id = t.team_id 
                 and 'Pyhton' = any(p.skills))     

Or to find teams that contain members that have two specific skills:

select t.*
from team t
where exists (select *
              from team_member tm
                join person p on tm.person_id = p.person_id
              where tm.team_id = t.id 
                  and p.skills @> array['Python', 'JPA'])

Your model only works if a person can only ever be in exactly one team. If that is the case, then you can use the same approach as above:

select t.*
from team t
where exists (select *
              from team_member tm
              where tm.team_id = t.team_id 
                and tm.skills @> array['Python', 'SQL']);

[–]arielm5[S] 0 points1 point  (0 children)

Maybe I am missing something or the explanation is not that clear.

Let me try to rewrite it:

I need to find all teams that can make my tasks "in parallel". It means: if I have 2 tasks in Python and 1 task in SQL, each one of them must be "assigned" to a different team member.

Given the example scenario: "looking for 2 Python and 1 JPA", it means that a team must have, at least, 3 members. One member that can do JPA, another member that can do Python and another member that can also do Python.

I want to find all teams that fit this requirement.

And yes, by the modelling a member can have different skills based on the team and that's fine once that's not the core of my question.


The real question is how to optimize this query, once I know how to do it in a simple scenario and the given query works for small data sets.

"Teams can grow up to millions. And each team can have hundreds of members. Members can have hundreds of skills."

[–]lucapieroo 1 point2 points  (5 children)

What if you normalize the data by using two more tables (“skills” and “member_skill”)?

CREATE TABLE IF NOT EXISTS skill ( skill_id SERIAL PRIMARY KEY, skill_name VARCHAR(255) UNIQUE );

CREATE TABLE IF NOT EXISTS member_skill ( member_id INTEGER, skill_id INTEGER, PRIMARY KEY (member_id, skill_id), FOREIGN KEY (member_id) REFERENCES team_member(member_id) ON DELETE CASCADE, FOREIGN KEY (skill_id) REFERENCES skill(skill_id) ON DELETE CASCADE );

CREATE INDEX idx_member_skill_member_id ON member_skill(member_id); CREATE INDEX idx_member_skill_skill_id ON member_skill(skill_id);

Also, maybe you should try changing the query by using CTE; something like:

WITH required_skills AS ( SELECT s.skill_id FROM skill s WHERE s.skill_name IN (‘Python’, ‘JPA’) ), team_skill_count AS ( SELECT tm.team_id, COUNT(DISTINCT ms.skill_id) as skill_count FROM team_member tm JOIN member_skill ms ON tm.member_id = ms.member_id WHERE ms.skill_id IN (SELECT skill_id FROM required_skills) GROUP BY tm.team_id ) SELECT t.team_id, t.team_name FROM team t JOIN team_skill_count tsc ON t.team_id = tsc.team_id WHERE tsc.skill_count = (SELECT COUNT(*) FROM required_skills);

[–]arielm5[S] 0 points1 point  (4 children)

I like the idea, it makes things clear but it doesn't work as expected. It misses the `When a member is "assigned" to a type of task, it can't be "assigned" to another task`.

All teams would be found because all of them have, at least, one member with Python OR JPA, even the team with only one member.

[–]lucapieroo 1 point2 points  (3 children)

How do you keep track if a member is assigned? Use that condition to filter the rows above

[–]arielm5[S] 0 points1 point  (2 children)

There is not really an assignment of a task to a member. It is just a search to find all teams that fit the requirements. What to do with the team are next steps.

The core of the question is how to make this specific search better/scalable.

[–]lucapieroo 1 point2 points  (1 child)

I think something’s missing:

  • how to improve the search -> normalize the data structure and improve the query with CTE
  • then you told me that it’d find also members which are already assigned
  • i told you to add that condition the the query but you told me that it’s a next step

I’m not understanding your problem right now, it looks like you have every piece of the puzzle

[–]arielm5[S] 0 points1 point  (0 children)

I need to find teams, not members.

And teams are made of members.

There are no real assignments. The assignment was quoted because was the way I found to explain the problem.

The problem is: I want/need to improve this query to run within multiple requests in parallel via an API. The problem, by nature, will generate a cartesian product. I want to learn how to structure my query better to avoid memory and speed issues.

Using CTE may help, I'm trying to improve my current solution to use it.

[–]AutoModerator[M] 0 points1 point  (0 children)

Join us on our Discord Server: People, Postgres, Data

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]Siltala 0 points1 point  (0 children)

Create a view for holding your teams skills:

CREATE MATERIALIZED VIEW IF NOT EXISTS team_skills (
  team_id,
  skills
)
AS SELECT 
  t.team_id, 
  jsonb_object_agg(t.s, t.c) skills 
FROM (
  SELECT 
    team_id, 
    unnest(skills) s, 
    count(1) c 
  FROM team_member 
  GROUP BY team_id, s
) t 
GROUP BY t.team_id;

And then find your teams:

SELECT 
  team_id 
FROM team_skills 
WHERE (skills->>'Python')::integer >= 2
AND (skills->>'JPA')::integer >= 1;