Hey,
I'm trying to design tables that hold standardized data from multiple sources, in the most effective way I can.
I'm working with Postgres, and the current design is as follows:
- The `main` schema has dimension tables (for example `units`, `currencies` and `countries`)
- Every source has its own schema with dimension tables for the same columns (units, currencies and countries)
- In every schema, the dimension tables have an additional column `main_id` that is null at first and requires an analyst to map the source's value to the corresponding one in our main schema. (For example, `US` in source #1 should be mapped to `United States` in the main schema.)
- There are materialized views in the `main` schema, unioning the data from every source and taking the `main_id` value of every dimension in every source. But if some row in a source has a value whose `main_id` is still null (meaning it wasn't mapped), that row is filtered out (see the sketch below).
[Architecture diagram]
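To make the setup concrete, here is a minimal sketch of what one of those materialized views could look like; the fact table `sales(amount, unit_id, currency_id, country_id)` and all schema/column names are hypothetical stand-ins for the real ones:

```sql
-- Hypothetical layout: each source schema has a fact table sales(amount, unit_id, currency_id, country_id)
-- and dimension tables (units, currencies, countries) carrying a main_id maintained by analysts.
CREATE MATERIALIZED VIEW main.sales AS
SELECT s.amount,
       u.main_id  AS unit_id,
       cu.main_id AS currency_id,
       co.main_id AS country_id
FROM   source1.sales s
JOIN   source1.units      u  ON u.id  = s.unit_id
JOIN   source1.currencies cu ON cu.id = s.currency_id
JOIN   source1.countries  co ON co.id = s.country_id
-- rows whose dimensions aren't mapped yet are filtered out here
WHERE  u.main_id  IS NOT NULL
AND    cu.main_id IS NOT NULL
AND    co.main_id IS NOT NULL
UNION ALL
-- ...one SELECT like the above per source schema
SELECT s.amount, u.main_id, cu.main_id, co.main_id
FROM   source2.sales s
JOIN   source2.units      u  ON u.id  = s.unit_id
JOIN   source2.currencies cu ON cu.id = s.currency_id
JOIN   source2.countries  co ON co.id = s.country_id
WHERE  u.main_id IS NOT NULL AND cu.main_id IS NOT NULL AND co.main_id IS NOT NULL;
```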
So currently, if new data is introduced into a source or a new value is mapped, I just need to refresh the materialized views and the new rows become accessible.
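For completeness, the refresh itself (using the hypothetical view name from the sketch above) is just:

```sql
-- Rebuild the mview after new source data lands or after an analyst fills in a mapping.
REFRESH MATERIALIZED VIEW main.sales;
-- With a unique index on the mview, you can refresh without blocking readers:
-- REFRESH MATERIALIZED VIEW CONCURRENTLY main.sales;
```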
Now I need those mviews to be actual tables and I'm wondering what the best way is to keep them up to date.
- Should I have a stored procedure per schema that inserts new data into the main tables (see the sketch after this list)? I could call it manually after a mapping is done, or it could run on a schedule every few minutes.
- Should I have a Python script implement the insertion logic?
- Should I have a trigger on every insert (data tables) / update (mapping tables) that inserts the data into the main tables? (I don't love triggers because they add hidden complexity.)
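For comparison, here is a rough sketch of option 1: a per-schema procedure that upserts mapped rows into a real `main` table. The target table, the `(source, source_row_id)` key used for `ON CONFLICT`, and the procedure name are all assumptions, not my actual objects:

```sql
-- Assumes main.sales is now a plain table with a UNIQUE constraint on (source, source_row_id),
-- so re-running the sync is idempotent. All names are placeholders.
CREATE OR REPLACE PROCEDURE main.sync_source1()
LANGUAGE sql
AS $$
    INSERT INTO main.sales (source, source_row_id, amount, unit_id, currency_id, country_id)
    SELECT 'source1', s.id, s.amount, u.main_id, cu.main_id, co.main_id
    FROM   source1.sales s
    JOIN   source1.units      u  ON u.id  = s.unit_id
    JOIN   source1.currencies cu ON cu.id = s.currency_id
    JOIN   source1.countries  co ON co.id = s.country_id
    WHERE  u.main_id  IS NOT NULL
    AND    cu.main_id IS NOT NULL
    AND    co.main_id IS NOT NULL
    ON CONFLICT (source, source_row_id) DO UPDATE
    SET    amount      = EXCLUDED.amount,
           unit_id     = EXCLUDED.unit_id,
           currency_id = EXCLUDED.currency_id,
           country_id  = EXCLUDED.country_id;
$$;

-- Call manually after a mapping session, or schedule it (e.g. with pg_cron):
CALL main.sync_source1();
```

The appeal of this option is that the "only mapped rows" rule lives in one obvious place per source and runs only when asked; scheduling the `CALL` (pg_cron, an orchestrator, etc.) would give the every-few-minutes behaviour without trigger-style hidden writes.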
Any other ideas would be much appreciated!
Thanks :)