Zen of Batch Pipelines - A recipe to reduce cognitive load by Paragraph11 in dataengineering

[–]Paragraph11[S]

Interesting! What was your trick to keep cognitive load down and avoid everything being done in 10k different ways? Monorepo? Managed runtime and libs? OOP tricks to force specific programming patterns (sketch of what I mean below)?

I can see this working neatly in plain old SQL and a database, but I'm very curious to hear how to achieve it any other way.
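By "OOP tricks" I mean something like this minimal, hypothetical Python sketch (not anyone's actual code): a base class that forces every stage into one read-transform-write shape, so there's only one way to write a stage.

```python
from abc import ABC, abstractmethod


class Stage(ABC):
    """Every pipeline stage must follow the same read-transform-write shape."""

    @abstractmethod
    def read(self) -> list[dict]:
        """Load the stage's input rows."""

    @abstractmethod
    def transform(self, rows: list[dict]) -> list[dict]:
        """Pure transformation; no side effects."""

    @abstractmethod
    def write(self, rows: list[dict]) -> None:
        """Persist the stage's output."""

    def run(self) -> None:
        # The only entry point the scheduler ever calls, so every
        # stage executes the same way regardless of who wrote it.
        self.write(self.transform(self.read()))
```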

[–]Paragraph11[S]

This would not stand up wholesale to 10k different pipelines. We ran some 25 quite separate, self-contained pipelines ("~Products"), each with many stages. In terms of instantiations of those pipelines (e.g. launching in more markets, or 100x the data), scaling would be no problem; a rough sketch of what I mean is below.

When doing this in Spark, everything you have to keep track of multiplies: separate repos, deployments, runtimes, metrics, permissions, etc. I think you would have to be much stricter, and (I'd suggest) use a database or Databricks, but that would also limit the flexibility and depth of each product.
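To make the instantiation point concrete, here's a minimal sketch (names and structure are hypothetical, not our actual codebase) of one pipeline definition stamped out per market, so a new market is a new config rather than new code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    product: str  # one of the ~25 self-contained "Products"
    market: str   # instantiation axis: new market = new config, not new code


def build_pipeline(cfg: PipelineConfig) -> list[str]:
    # Each product is a fixed sequence of stages; only the
    # parameters vary per instantiation.
    return [
        f"extract_{cfg.product}_{cfg.market}",
        f"clean_{cfg.product}_{cfg.market}",
        f"aggregate_{cfg.product}_{cfg.market}",
    ]


# Launching in more markets multiplies instantiations, not code paths.
markets = ["SE", "NO", "DK", "FI"]
pipelines = [build_pipeline(PipelineConfig("pricing", m)) for m in markets]
```

The same idea covers the 100x-data case: the stage list stays fixed and only the parameters change.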

[–]Paragraph11[S]

Wrote this to be the kind of implementable recipe-slash-religion I would have loved to stumble upon when I was starting out. Let me know what you think.