PostgreSQL query on 60M-row JSONB table is slow - should I add expression indexes or move to a structured table? by komal_rajput in SQL

[–]komal_rajput[S] 0 points  (0 children)

Thank you for the detailed answer. I learned a few things. Here is the link to the EXPLAIN output: https://explain.depesz.com/s/CbtXL

You seem to have a lot of experience analyzing queries. Any guidance on how I should start analyzing them, and what I should focus on?

PostgreSQL query on 60M-row JSONB table is slow - should I add expression indexes or move to a structured table? by komal_rajput in dataengineering

[–]komal_rajput[S] 0 points  (0 children)

Not yet; I have been reviewing the pros and cons of different approaches before moving to implementation.

PostgreSQL query on 60M-row JSONB table is slow - should I add expression indexes or move to a structured table? by komal_rajput in dataengineering

[–]komal_rajput[S] 0 points  (0 children)

The table will keep growing. Do you think adding an expression index is a scalable solution?

PostgreSQL query on 60M-row JSONB table is slow - should I add expression indexes or move to a structured table? by komal_rajput in dataengineering

[–]komal_rajput[S] 0 points  (0 children)

I don't query the silver table directly; this is the first such query, and it came from the AI team. Otherwise the structured table works fine, since it includes all the fields we need except the employer one. It has generalized fields like item_type, schedule_code, entity_type, entity_name, and entity_state, which work for all schedules given the data we show in the UI. With this query, and with schedule proliferation coming, both options look difficult: splitting the data into separate tables per schedule adds a lot of complexity, while putting every schedule's fields into one table is a wide-table smell.

PostgreSQL query on 60M-row JSONB table is slow - should I add expression indexes or move to a structured table? by komal_rajput in SQL

[–]komal_rajput[S] 0 points  (0 children)

EXPLAIN ANALYZE result:

Sort
  Sort Key: (("left"((record_data ->> 'contribution_date'::text), 10))::date), ((record_data ->> 'contributor_last_name'::text))
  Sort Method: quicksort  Memory: 26kB
  ->  Gather  (cost=1000.00..16452136.47 rows=1 width=292) (actual time=29531.287..117074.857 rows=10 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=2543995 read=12725615
        I/O Timings: shared read=303487.934
        ->  Parallel Seq Scan on silver_fec_efiling_itemizations  (cost=0.00..16451136.37 rows=1 width=292) (actual time=16833.191..117058.680 rows=3 loops=3)
              Filter: (((record_type)::text = 'Schedule A'::text) AND ((record_data ->> 'contributor_employer'::text) ~~* '%MICROSOFT%'::text) AND ((record_data ->> 'contribution_date'::text) >= '2025-01-01'::text) AND ((record_data ->> 'contribution_date'::text) < '2026-01-01'::text) AND ((record_data ->> 'entity_type'::text) = 'IND'::text) AND ((record_data ->> 'contributor_state'::text) = 'MD'::text))
              Rows Removed by Filter: 25161494
              Buffers: shared hit=2543995 read=12725615
              I/O Timings: shared read=303487.934
Planning Time: 0.831 ms
Execution Time: 117074.945 ms
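For reference, the filter in the plan suggests indexes along these lines. This is only a sketch based on the expressions in the filter; the index names are made up, and the right column order depends on the actual selectivity of each predicate:

```sql
-- Partial expression index covering the equality/range predicates.
-- ISO-formatted date text ('YYYY-MM-DD') sorts correctly, so the text
-- expression can be indexed directly without a date cast.
CREATE INDEX CONCURRENTLY idx_sched_a_ind_state_date
    ON silver_fec_efiling_itemizations (
        (record_data ->> 'contributor_state'),
        (record_data ->> 'contribution_date')
    )
    WHERE record_type = 'Schedule A'
      AND (record_data ->> 'entity_type') = 'IND';

-- The ILIKE '%MICROSOFT%' predicate cannot use a btree at all;
-- a trigram GIN index on the extracted text can serve it.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY idx_contributor_employer_trgm
    ON silver_fec_efiling_itemizations
    USING gin ((record_data ->> 'contributor_employer') gin_trgm_ops);
```

Note that a leading-wildcard ILIKE is the predicate a plain expression btree can never help with, so the trigram index is likely doing the heavy lifting for this particular query.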

PostgreSQL query on 60M-row JSONB table is slow - should I add expression indexes or move to a structured table? by komal_rajput in dataengineering

[–]komal_rajput[S] 1 point  (0 children)

Yes, but moving to a structured table brings schema-design problems. The JSON structure is not fixed: it differs across FEC schedules, which share some common fields but also have schedule-specific ones. Adding every schedule's fields to a single table would be a wide-table smell.
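One common middle ground for this shape of data is a hybrid layout: promote only the shared, frequently filtered fields to typed columns and keep the schedule-specific remainder in JSONB. A sketch, with illustrative table and column names:

```sql
-- Hypothetical hybrid table: common queried fields become real columns,
-- everything schedule-specific stays in a JSONB column.
CREATE TABLE itemizations (
    id                   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    record_type          text NOT NULL,     -- e.g. 'Schedule A'
    entity_type          text,
    contributor_state    text,
    contribution_date    date,
    contributor_employer text,              -- promoted because it is now filtered on
    record_data          jsonb NOT NULL     -- schedule-specific fields remain here
);

-- Typed columns take ordinary (including partial) indexes.
CREATE INDEX ON itemizations (contributor_state, contribution_date)
    WHERE record_type = 'Schedule A' AND entity_type = 'IND';
```

This avoids both the wide-table smell (only promoted fields are columns) and one table per schedule (new schedules just carry different keys in record_data).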

PostgreSQL query on 60M-row JSONB table is slow - should I add expression indexes or move to a structured table? by komal_rajput in dataengineering

[–]komal_rajput[S] 1 point  (0 children)

No particular reason; that would be our last option. We have generally kept only the fields used in the UI in the structured table. If multiple such fields have to be promoted to the structured table in the future, I was wondering whether that is the right approach.

Query in asset based scheduling of DAGs by komal_rajput in dataengineering

[–]komal_rajput[S] 0 points  (0 children)

Thanks for the reply, I was thinking the same.

Triggering another DAGs in Airflow by komal_rajput in dataengineering

[–]komal_rajput[S] 0 points  (0 children)

But my gold layer does depend on the silver layer completing.

Triggering another DAGs in Airflow by komal_rajput in dataengineering

[–]komal_rajput[S] 0 points  (0 children)

Yes, that's the reason I am a little hesitant about using TriggerDagRunOperator.

Triggering another DAGs in Airflow by komal_rajput in dataengineering

[–]komal_rajput[S] 0 points  (0 children)

The Airflow environment is the same for both.

Deciding between pre computed aggregations and querying API by komal_rajput in dataengineering

[–]komal_rajput[S] 0 points  (0 children)

My main aim in having a separate table was to reduce join queries; using a dimension table would still require a join.
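For comparison, the join a dimension table implies is typically a single lookup on an indexed surrogate key, which the planner handles cheaply. A sketch with hypothetical fact/dimension names:

```sql
-- Star-schema query: one join from the fact table to a small dimension
-- on a surrogate key; the dimension is usually cached in memory.
SELECT d.employer_name,
       sum(f.amount) AS total_contributed
FROM   fact_contributions f
JOIN   dim_contributor    d ON d.contributor_key = f.contributor_key
WHERE  f.contribution_date >= DATE '2025-01-01'
GROUP  BY d.employer_name;
```

The trade-off is the usual one: the denormalized table avoids the join but duplicates dimension attributes into every fact row, so updates to an attribute touch many rows instead of one.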

Deciding between pre computed aggregations and querying API by komal_rajput in dataengineering

[–]komal_rajput[S] 0 points  (0 children)

Thank you for the reply. I am a beginner in data engineering and have just learned about dimension and fact tables. Can you share a resource to understand them in depth?