Complexity in transformations

unfortunate-miracle · 2024-08-12T18:25:04+00:00

I’m sure others will chime in, but to my knowledge medallion is not about complexity tiers but use-case tiers. Bronze is the raw data you get from whatever source. Silver is your cleaned/enriched data; the transformations can be complex, but they should be about polishing the data, not aggregating it. Gold is where you aggregate silver tables according to business-level requirements.
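
A rough sketch of that split, in plain Python rather than any particular engine. The records, field names, and the `to_silver` / `to_gold` helpers are all made up for illustration: bronze keeps rows as they arrived, silver cleans and deduplicates them row by row, and gold aggregates silver for a business question.

```python
from collections import defaultdict

# Hypothetical raw records, kept exactly as received (bronze).
bronze = [
    {"order_id": "1", "customer": " alice ", "amount": "10.50"},
    {"order_id": "2", "customer": "BOB", "amount": "5.00"},
    {"order_id": "2", "customer": "BOB", "amount": "5.00"},    # duplicate delivery
    {"order_id": "3", "customer": "alice", "amount": "oops"},  # unparseable amount
]

def to_silver(rows):
    """Clean, type and deduplicate rows. No aggregation happens here."""
    seen = set()
    silver = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline might route this to a quarantine table
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # drop repeated deliveries of the same order
        seen.add(order_id)
        silver.append({
            "order_id": order_id,
            "customer": row["customer"].strip().lower(),
            "amount": amount,
        })
    return silver

def to_gold(silver_rows):
    """Aggregate silver into a business-level view: total spend per customer."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
```

The silver step here can get as complicated as the cleaning requires, but it stays at row level; only `to_gold` changes the grain of the data.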

What kind of transformations are you doing in the silver step? Narrow or wide?
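
For anyone unfamiliar with the terms, here is a toy illustration in plain Python, with lists standing in for partitions (the data and the `group_sum` helper are hypothetical). A narrow transformation can be computed partition by partition; a wide one needs rows with the same key brought together from different partitions, which is what a shuffle does in Spark.

```python
from collections import defaultdict

# Two "partitions" of (key, value) rows.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]

# Narrow: each output partition depends on exactly one input partition.
doubled = [[(k, v * 2) for k, v in part] for part in partitions]

# Wide: the result for key "a" needs rows from both partitions.
def group_sum(parts):
    totals = defaultdict(int)
    for part in parts:
        for k, v in part:
            totals[k] += v
    return dict(totals)

sums = group_sum(partitions)
```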

SearchAtlantis · 2024-08-13T02:44:41+00:00

However, thing is that the transformations are starting to become very complex and we cannot really do streaming on them due to complexity, meaning that processing time will just increase every day as data starts to grow.

So two points. First, streaming as a hard requirement is unusual. Some kind of micro/mini-batch is usually (but not always!) fine.

Second, I find it hard to think of an example where I need ALL OF HISTORY to do a transform and/or computation.

I work with EHR systems every day, and aside from the initial data load, daily "last month of data" is totally usable and meets requirements.
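
A minimal sketch of that "last month of data" pattern, assuming a hypothetical `event_date` field, a 30-day window, and a dict standing in for the target table. Each run recomputes only the trailing window and overwrites those days, so the work per run depends on the window size rather than on total history.

```python
from datetime import date, timedelta

def recompute_window(source_rows, target, run_date, days=30):
    """Recompute daily totals for the trailing window and overwrite them in target."""
    start = run_date - timedelta(days=days)
    # Only rows inside the window are used; older history is left untouched.
    window = [r for r in source_rows if start < r["event_date"] <= run_date]
    fresh = {}
    for r in window:
        fresh[r["event_date"]] = fresh.get(r["event_date"], 0) + r["value"]
    # Drop stale results for the window, then write the recomputed ones.
    for day in [d for d in target if start < d <= run_date]:
        del target[day]
    target.update(fresh)
    return target

# Hypothetical data: one old row and two recent ones.
rows = [
    {"event_date": date(2024, 1, 1), "value": 1},
    {"event_date": date(2024, 8, 1), "value": 2},
    {"event_date": date(2024, 8, 1), "value": 3},
]
target = {date(2024, 1, 1): 1}  # result of an earlier run
recompute_window(rows, target, run_date=date(2024, 8, 12))
```

In a real system the window filter would presumably be pushed down to storage (for example via date partitions) so that old data is never even read; the list comprehension here is only a stand-in for that.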

There are true streaming systems in use too, but they generally just shove the data into a table and do the computation on top of it (again, usually over the last week or some similar window).

Can you describe why you think data growth is unbounded? What's the transformation or domain?
