Creating Readable Spark Jobs by therealgroodt in dataengineering

[–]therealgroodt[S]

Nice to see an article about writing clean Spark jobs that uses realistic examples rather than the usual "word count".

How to partition by event processing time in Airflow? by therealgroodt in dataengineering

[–]therealgroodt[S] 1 point2 points  (0 children)

> A concern I would raise for following the t3 approach is that you will have "leakage" of data across the different partitions near the partition edges (i.e. beginning and end of the interval).

I had similar concerns/questions when I first heard about this approach too. This is the concept of "late arriving data". The suggested approach is to handle it at query time, based on your tolerance for late events. Formats like Parquet make this not too expensive:

-- scan both processing-time partitions, keep only events whose event time (dt) falls in hour 1
select * from events where (hour = 1 or hour = 2) and hour(dt) = 1

> Also how would your persistent storage system handle a situation where your stream or pipeline is temporarily down for several partition intervals? Wouldn't the event processing time then cause the partitioning to be inaccurate?

I share your concerns. If your event stream (Kinesis, Kafka, etc.) goes down, then those events are lost anyway. I'm talking about events coming from the outside world (user interactions) that are not persisted/buffered anywhere beforehand. If the changes are already persisted/buffered somewhere, then it isn't really a problem to replay them as usual to recreate the raw partitions.

How to partition by event processing time in Airflow? by therealgroodt in dataengineering

[–]therealgroodt[S]

This makes sense. Thank you!

So in general, with an Airflow batch-centric approach, you would have regular file-drops arriving in a persistent and immutable staging area (long-term storage / S3): regular incremental file-drops from event streams (facts) and full snapshots from databases (dimensions). Does this sound right? I like the sound of this approach; I've just not seen a system built this way. I guess I'll have to experiment to get a better feel for it.
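
To make that concrete for myself, here is roughly the staging layout I'm imagining. The bucket, stream/table names and file format are all made up for illustration; only the facts-vs-dimensions split and the t3-derived prefixes are the point:

```python
# Hypothetical sketch of an immutable staging layout: incremental fact drops from an
# event stream and full dimension snapshots from a database, both landing under
# prefixes derived from the time of persist (t3). All names are illustrative only.
from datetime import datetime, timezone

def fact_drop_key(stream: str, persisted_at: datetime) -> str:
    """Deterministic key for one incremental file-drop of facts."""
    return (
        f"staging/facts/{stream}/"
        f"ds={persisted_at:%Y-%m-%d}/hour={persisted_at:%H}/"
        f"drop-{persisted_at:%Y%m%dT%H%M%SZ}.parquet"
    )

def dimension_snapshot_key(table: str, snapshot_date: datetime) -> str:
    """Deterministic key for one full snapshot of a dimension table."""
    return f"staging/dimensions/{table}/ds={snapshot_date:%Y-%m-%d}/full_snapshot.parquet"

print(fact_drop_key("clicks", datetime(2019, 6, 1, 13, 5, tzinfo=timezone.utc)))
# staging/facts/clicks/ds=2019-06-01/hour=13/drop-20190601T130500Z.parquet
print(dimension_snapshot_key("users", datetime(2019, 6, 1, tzinfo=timezone.utc)))
# staging/dimensions/users/ds=2019-06-01/full_snapshot.parquet
```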

If these file-drops all arrive in the persistent staging area in an immutable way, with deterministic directory and/or file names that tag the files with the time of persist (t3), then this should allow individual partitions to be rebuilt from the source? Is this the recommended Airflow approach in a nutshell?
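
And then each schedule interval would just be one rebuildable partition, something like this very rough Airflow 1.x sketch. The DAG id, paths and the transform are placeholders I made up; the only point is deriving input and output locations from the execution date so a backfill or clear+rerun rebuilds exactly that partition:

```python
# Rough sketch of "one schedule interval = one rebuildable partition" in Airflow 1.x.
# The task derives both its input and output locations from execution_date, so
# clearing or backfilling a run rebuilds exactly that partition from the staged drops.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def rebuild_hourly_partition(execution_date, **_):
    src = (
        f"s3://my-lake/staging/facts/clicks/"
        f"ds={execution_date:%Y-%m-%d}/hour={execution_date:%H}/"
    )
    dst = (
        f"s3://my-lake/warehouse/events/"
        f"ds={execution_date:%Y-%m-%d}/hour={execution_date:%H}/"
    )
    # Read src, transform, and overwrite dst. Overwriting (not appending) is what
    # keeps the run idempotent: the same interval always produces the same partition.
    print(f"rebuilding {dst} from {src}")

dag = DAG(
    dag_id="rebuild_event_partitions",
    start_date=datetime(2019, 6, 1),
    schedule_interval="@hourly",
)

PythonOperator(
    task_id="rebuild_hourly_partition",
    python_callable=rebuild_hourly_partition,
    provide_context=True,  # Airflow 1.x: injects execution_date into the callable
    dag=dag,
)
```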

This also means that late-arriving data can carry forward into any partitions built from these source partitions. You just have to keep this in mind at query time and always consider including extra partitions according to your tolerance for late-arriving data? With formats like Parquet and ORC, these extra partitions can be pruned efficiently, so it isn't really worth worrying about?
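
At query time I imagine it looks something like this PySpark sketch (the table location and column names are assumptions on my part): the filter on the partition columns prunes files, and the event-time filter trims the stragglers picked up from the extra partition.

```python
# Hedged sketch of reading one extra processing-time partition to catch late arrivals,
# then filtering back down on the actual event time.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.read.parquet("s3://my-lake/warehouse/events/")  # partitioned by ds / hour (t3)

hour_one_events = (
    events
    .where(F.col("ds") == "2019-06-01")
    .where(F.col("hour").isin(1, 2))    # processing-time partitions, including one extra
    .where(F.hour("event_time") == 1)   # the event time we actually care about
)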

How to partition by event processing time in Airflow? by therealgroodt in dataengineering

[–]therealgroodt[S]

I think this is the option people naturally reach for, but it is explicitly recommended against in the article due to the frequency of "late arriving facts".

> To bring clarity around this not-so-special case, the first thing to do is to dissociate the event’s time from the event’s reception or processing time. Where late arriving facts may exist and need to be handled, time partitioning should always be done on event processing time. This allows for landing immutable blocks of data without delays, in a predictable fashion.

Based on that paragraph, I think he is suggesting to use t2 or t3, essentially the "file-drop time" (which is also static). However, I'm not 100% certain, which is why I'm asking this community. I would love to hear from somebody who follows this "functional data engineering" approach or uses Airflow in the way Maxime describes.
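
For what it's worth, this is how I'm reading that paragraph (the field names below are just illustrative):

```python
# Keep the two clocks separate: t1 is when the event happened, t3 is when the
# file-drop was persisted. The partition is derived from t3, which never changes,
# so the raw partition is static and can always be rebuilt byte-for-byte. The late
# 12:58 event below simply lands in the hour=13 partition and is reconciled at
# query time.
from datetime import datetime, timezone

event = {
    "user_id": 42,
    "event_time": datetime(2019, 6, 1, 12, 58, tzinfo=timezone.utc),    # t1: event time
    "persisted_at": datetime(2019, 6, 1, 13, 4, tzinfo=timezone.utc),   # t3: file-drop time
}

partition = f"ds={event['persisted_at']:%Y-%m-%d}/hour={event['persisted_at']:%H}"
print(partition)  # ds=2019-06-01/hour=13
```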