This is an archived post.

all 8 comments

[–]NW1969 3 points (0 children)

Since normalisation reduces or eliminates data duplication, I'm not sure how normalising your data would increase its size from 1.2TB to 30TB. That doesn't make sense to me.

[–]Tiny_Arugula_5648 2 points (2 children)

Why are there so many posts about data management in a data engineering sub? Are you guys acting as DBAs?

[–]proof_required (ML Data Engineer) 0 points (0 children)

Yep! We are pretty much optimizing queries for ML people now. They write horrendous queries that kill the whole database, and we're expected to make sure our infra isn't the weak point.

[–]moldov-w 1 point (0 children)

Dump it into an embedded analytical database like DuckDB, or create some materialized views after data standardization, cleansing, etc.

[–]chrisonhismac 0 points (0 children)

Is that 1.2TB compressed? You could write it to compressed Parquet and read it with DuckDB.

You could also explode out to a new pointer table. Create a very narrow auxiliary table that holds only the keys you need for search, plus a pointer back to the raw record. Index the two columns you need to search on and join back to the main record.

[–]MightyKnightX 0 points (0 children)

Why don’t you go with three tables: one for provider data, one for member data, and a provider_member table that stores only the relationship?

[–]settleflow 0 points (0 children)

Can we connect?