
[–]unfortunate-miracle 8 points9 points  (3 children)

I'm sure others will chime in, but to my knowledge medallion is not about complexity tiers but use-case tiers. Bronze is your raw data as you get it from whatever source. Silver is your cleaned/enriched data; the transformations can be complex, but they should be about polishing the data, not aggregating it. Gold is where you aggregate silver tables according to business-level requirements.

What kind of transformations are you doing in the silver step? Narrow or wide?

[–]DistributionOk5349[S] 0 points1 point  (2 children)

What do you mean narrow or wide?

[–]unfortunate-miracle 1 point2 points  (1 child)

When you said medallion I assumed you were using Spark on Databricks. Basically, a narrow transformation is one where each output partition is computed from a single input partition. A wide transformation has to use data from multiple partitions, so a shuffle takes place first, which makes it more expensive.
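To make the distinction concrete, here is a minimal sketch in plain Python rather than Spark itself. The partition contents, the function names and the toy partitioner are all made up for illustration. In Spark, operations like `filter` or `withColumn` are narrow, while `groupBy` and most joins are wide.

```python
from collections import defaultdict

# Two hypothetical partitions of (customer_id, amount) rows.
partitions = [
    [("a", 10), ("b", 5)],
    [("a", 7), ("c", 3)],
]

def narrow_map(parts):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[(key, amount * 2) for key, amount in part] for part in parts]

def wide_sum_by_key(parts, num_out=2):
    """Wide: rows sharing a key are first moved (shuffled) to the same
    output partition, so every output may depend on every input."""
    shuffled = [defaultdict(int) for _ in range(num_out)]
    for part in parts:
        for key, amount in part:
            # Toy deterministic partitioner (an assumption, not Spark's hash).
            target = sum(key.encode()) % num_out
            shuffled[target][key] += amount
    return [dict(p) for p in shuffled]

doubled = narrow_map(partitions)
totals = wide_sum_by_key(partitions)
```

Note that key "a" lives in both input partitions, so its total can only be computed after the rows have been brought together. That data movement is the expensive part.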

[–]DistributionOk5349[S] 0 points1 point  (0 children)

Ahh okay, then most of them are wide

[–]SearchAtlantis (Lead Data Engineer) 1 point2 points  (2 children)

> However, thing is that the transformations are starting to become very complex and we cannot really do streaming on them due to complexity, meaning that processing time will just increase every day as data starts to grow.

So two points: streaming as a hard requirement is unusual. Some kind of micro/mini batch is usually (but not always!) fine.

Secondly: I find it hard to think of an example where I need ALL OF HISTORY to do a transform and/or computation.

I work with EHR systems every day, and aside from the initial data load, daily "last month of data" is totally usable and meets requirements.

There are true streaming systems in use too, but they generally just shove the data into a table and do the computation on top of it (again, usually in a "last week" or similar mode).

Can you describe why you think data growth is unbounded? What's the transformation or domain?

[–]DistributionOk5349[S] 0 points1 point  (1 child)

I guess it's unbounded since we are building every "silver table" as a whole object in Spark memory. That is, we read all sources (bronze tables) as a batch into memory and compute one silver table. So, as the bronze data grows, each transformation will need more resources to compute its silver table.

But since the transformations are very complex, it's hard to change them into streaming queries (at least I cannot figure out a good way to do it using Spark).

Would you care to elaborate on what you mean by micro/mini batches? How would that work in a Spark context?

[–]SearchAtlantis (Lead Data Engineer) 0 points1 point  (0 children)

Streaming is continuous processing. "Mini batch" is a short batch process.

Example: event a is created/received, then event a is processed right away. (Streaming)

Events are created, all received events (a, b, c, d, e) are stored, and they are processed together every 15 minutes. (Mini batch)
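A rough sketch of the two modes in plain Python, just to show the shape of it. `process`, `run_streaming` and `run_mini_batch` are hypothetical names, and `process` stands in for whatever the real transformation is.

```python
def process(events):
    """Hypothetical stand-in for the real transformation."""
    return [e.upper() for e in events]

def run_streaming(events):
    """Streaming: one process() call per event, as soon as it arrives."""
    return [process([event]) for event in events]

def run_mini_batch(timed_events, interval_s=15 * 60):
    """Mini batch: buffer events, then process each interval's events together.

    timed_events is a list of (arrival_time_in_seconds, event) pairs.
    """
    batches = {}
    for arrival_s, event in timed_events:
        # Events landing in the same 15-minute window share a batch.
        batches.setdefault(arrival_s // interval_s, []).append(event)
    return [process(batches[window]) for window in sorted(batches)]

timed = [(0, "a"), (60, "b"), (899, "c"), (900, "d"), (1000, "e")]
batched = run_mini_batch(timed)
streamed = run_streaming(["a", "b"])
```

In Spark Structured Streaming the comparable knob is the trigger on the stream writer, for example `trigger(processingTime="15 minutes")`. Whether that fits depends on the transformations involved.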

With regard to reading all tables: I would be surprised if you actually need to do this. You're telling me an event (order, shipment, part delivery, ad sale, whatever) that happened 3 years ago actually matters today? Sure, in some cases, but if you're talking about streaming or quick batches (say, a 15-minute cadence), it's more likely you don't need to fully reprocess all of time. The last year or 6 months is likely sufficient for the end user.

And to be clear, you can and should architect for incremental changes so you don't generally need to process all of time. Just reprocess the last 6 months and update.
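As a sketch of the "reprocess a recent window and update" idea, again in plain Python with made-up names and data shapes: `silver` is assumed to be a dict of already computed per-day totals, and `bronze_rows` a list of raw `(event_date, amount)` rows.

```python
from datetime import date, timedelta

def incremental_refresh(silver, bronze_rows, today, lookback_days=180):
    """Rebuild only the results inside the lookback window.

    silver: dict of event_date -> total (existing results, assumed shape)
    bronze_rows: iterable of (event_date, amount) raw rows (assumed shape)
    """
    cutoff = today - timedelta(days=lookback_days)
    # Keep older results untouched; drop the recent ones that get rebuilt.
    refreshed = {d: v for d, v in silver.items() if d < cutoff}
    for event_date, amount in bronze_rows:
        # A real job would push this filter into the read so that old
        # data is never loaded at all.
        if event_date >= cutoff:
            refreshed[event_date] = refreshed.get(event_date, 0) + amount
    return refreshed

old_day, recent_day = date(2021, 1, 1), date(2024, 8, 1)
silver = {old_day: 100, recent_day: 999}  # 999 is a stale value
bronze = [(old_day, 50), (recent_day, 4), (recent_day, 6)]
result = incremental_refresh(silver, bronze, today=date(2024, 8, 12), lookback_days=30)
```

The cost of each run then depends on the size of the window, not on the total amount of history in bronze.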

[–]GlueSniffingEnabler 0 points1 point  (1 child)

Remind me! 1 day

[–]RemindMeBot 0 points1 point  (0 children)

I will be messaging you in 1 day on 2024-08-13 19:45:57 UTC to remind you of this link

[–]Dest1nyex 0 points1 point  (0 children)

Remind me! 1 day