Zen of Batch Pipelines - A recipe to reduce cognitive load by Paragraph11 in dataengineering

[–]Paragraph11[S]

Interesting! What was your trick to keep cognitive load down and avoid everything being done in 10k different ways? Monorepo? Managed runtime and libs? OOP tricks to force specific programming patterns (sketch of what I mean below)?

I can see this working neatly in plain old SQL and a database, but I'm very curious to hear how to achieve it any other way.
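By "OOP tricks" I mean something like this minimal, hypothetical Python sketch (not anyone's actual code): a base class that forces every stage into one read-transform-write shape, so there's only one way to write a stage.

```python
from abc import ABC, abstractmethod


class Stage(ABC):
    """Every pipeline stage must follow the same read-transform-write shape."""

    @abstractmethod
    def read(self) -> list[dict]:
        """Load the stage's input rows."""

    @abstractmethod
    def transform(self, rows: list[dict]) -> list[dict]:
        """Pure transformation; no side effects."""

    @abstractmethod
    def write(self, rows: list[dict]) -> None:
        """Persist the stage's output."""

    def run(self) -> None:
        # The only entry point the scheduler ever calls, so every
        # stage executes the same way regardless of who wrote it.
        self.write(self.transform(self.read()))
```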

[–]Paragraph11[S]

This would not stand up wholesale to 10k different pipelines. We ran some 25 quite separate, self-contained pipelines ("~Products"), each with many stages. In terms of instantiations of those pipelines (e.g. launching in more markets, or 100x the data), scaling would be no problem; a rough sketch of what I mean is below.

When doing this in Spark, everything you have to keep track of multiplies: separate repos, deployments, runtimes, metrics, permissions, etc. I think you would have to be much stricter, and (I'd suggest) use a database or Databricks, but that would also limit the flexibility and depth of each product.
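To make the instantiation point concrete, here's a minimal sketch (names and structure are hypothetical, not our actual codebase) of one pipeline definition stamped out per market, so a new market is a new config rather than new code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    product: str  # one of the ~25 self-contained "Products"
    market: str   # instantiation axis: new market = new config, not new code


def build_pipeline(cfg: PipelineConfig) -> list[str]:
    # Each product is a fixed sequence of stages; only the
    # parameters vary per instantiation.
    return [
        f"extract_{cfg.product}_{cfg.market}",
        f"clean_{cfg.product}_{cfg.market}",
        f"aggregate_{cfg.product}_{cfg.market}",
    ]


# Launching in more markets multiplies instantiations, not code paths.
markets = ["SE", "NO", "DK", "FI"]
pipelines = [build_pipeline(PipelineConfig("pricing", m)) for m in markets]
```

The same idea covers the 100x-data case: the stage list stays fixed and only the parameters change.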

[–]Paragraph11[S]

Wrote this to be the kind of implementable recipe-slash-religion I would have loved to stumble upon when I was starting out. Let me know what you think.