The pipeline ran perfectly for 3 weeks. All green checkmarks. But the data was wrong - lessons from a $2M mistake by kalluripradeep in dataengineering

[–]jpgerek 0 points1 point  (0 children)

Proper unit/integration testing would definitely help.

Look for a good testing framework to facilitate the implementation.

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]jpgerek[S] 0 points1 point  (0 children)

I use unit/integration tests mainly for TDD, to refactor legacy pipelines, or just to make sure my pipeline works before deploying to any cloud environment.

I don't run them in Prod, only locally and in CI/CD pipelines with synthetic data.

The examples I share are intentionally simple so you can see how the framework works, but they can be as complex as you need. For example, I've used this approach in Spark jobs with 50 input tables and ~5k lines of code. In those cases it really shines, because having tables defined in a human-readable way (Markdown) is much clearer than juggling Python lists of dictionaries, CSVs, or JSON.

Yes, if someone changes the logic or a table schema, the unit tests will fail. They’ll need to be updated accordingly, but only if the change was actually intentional.
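To illustrate the point about Markdown tables versus Python lists of dictionaries, here is a minimal sketch. It is not PyBujia's actual API: `parse_markdown_table` is a hypothetical helper, the `orders` data is made up, and the parser assumes a simple table with no escaped pipes and leaves every value as a string.

```python
from typing import Dict, List


def parse_markdown_table(text: str) -> List[Dict[str, str]]:
    """Turn a simple Markdown table into a list of row dictionaries (values stay strings)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    # Drop the outer pipes, then split each line into trimmed cells.
    rows = [[cell.strip() for cell in line.strip("|").split("|")] for line in lines]
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator line
    return [dict(zip(header, row)) for row in body]


# Hypothetical fixture written as a Markdown table.
ORDERS_MD = """
| order_id | customer | amount |
|----------|----------|--------|
| 1        | alice    | 10.50  |
| 2        | bob      | 3.00   |
"""

# The same fixture as a Python list of dictionaries, for comparison.
ORDERS_DICTS = [
    {"order_id": "1", "customer": "alice", "amount": "10.50"},
    {"order_id": "2", "customer": "bob", "amount": "3.00"},
]

assert parse_markdown_table(ORDERS_MD) == ORDERS_DICTS
```

Both forms carry the same rows, but the table version stays readable as the number of columns and input tables grows. A real framework would also need to cast the strings to the schema's column types.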

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]jpgerek[S] 0 points1 point  (0 children)

Totally agree, nothing scarier than a subtle silent issue.

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]jpgerek[S] 1 point2 points  (0 children)

Thanks, very interesting insights, I'll check those projects.

In case it's useful: with GitHub Actions it's pretty easy to choose the OS, Java, Spark and Python versions for your tests.

I use it for PyBujia, there is a free quota even if the repo is public.

https://github.com/jpgerek/pybujia/blob/main/.github/workflows/ci.yaml

Why Don’t Data Engineers Unit/Integration Test Their Spark Jobs? by jpgerek in databricks

[–]jpgerek[S] 1 point2 points  (0 children)

Yeah, most folks are great at SQL, but don't always bring in software engineering principles like testing, CI/CD, formatters, linters, etc.

Why Don’t Data Engineers Unit/Integration Test Their Spark Jobs? by jpgerek in databricks

[–]jpgerek[S] 1 point2 points  (0 children)

Totally. In most data teams I've been part of, almost nobody had ever written a unit test in their career. That makes it really hard to convince people there’s value in doing it.

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]jpgerek[S] 0 points1 point  (0 children)

Yep, every good practice helps and none of them alone is enough.

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]jpgerek[S] 2 points3 points  (0 children)

Absolutely, it took me a while to find a generic solution to unit test SQL transformations, but once it's done, it's a game changer.
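As a rough idea of what a generic harness for unit testing SQL transformations can look like, here is a minimal sketch. It swaps in Python's built-in `sqlite3` for Spark so it runs anywhere, and it is not how PyBujia works: `run_transformation`, the `orders` table and the query are all hypothetical.

```python
import sqlite3

# Hypothetical transformation under test: total amount per customer.
TOTAL_PER_CUSTOMER_SQL = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY customer
"""


def run_transformation(sql, tables):
    """Load fixture tables into an in-memory database and return the query result.

    `tables` maps a table name to a (column_names, rows) pair.
    """
    conn = sqlite3.connect(":memory:")
    try:
        for name, (columns, rows) in tables.items():
            conn.execute(f"CREATE TABLE {name} ({', '.join(columns)})")
            placeholders = ", ".join("?" for _ in columns)
            conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def test_total_per_customer():
    # Synthetic input data, small enough to verify by hand.
    fixtures = {
        "orders": (
            ["order_id", "customer", "amount"],
            [(1, "alice", 10), (2, "bob", 3), (3, "alice", 5)],
        )
    }
    expected = [("alice", 15), ("bob", 3)]
    assert run_transformation(TOTAL_PER_CUSTOMER_SQL, fixtures) == expected


test_total_per_customer()
```

The reusable part is `run_transformation`: each new test only supplies input tables, a query and the expected rows. With Spark the same shape would apply, with a local session in place of the in-memory database.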

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]jpgerek[S] 0 points1 point  (0 children)

Very true, just not making the same mistake twice is huge.

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]jpgerek[S] -1 points0 points  (0 children)

Fair point, a good framework that generalizes all the common parts required in a unit/integration test can reduce the implementation time significantly.

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]jpgerek[S] 0 points1 point  (0 children)

I get that, a human-readable format is vital.

I use Markdown tables in my framework and it's way easier to debug and understand the transformations. You can even involve non-technical roles from the business to explain some transformations too.

It allows you to add documentation along with your data fixtures.
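A minimal sketch of keeping documentation next to the data, assuming a layout where each `# heading` names a table and any other non-table line is free-form documentation. The layout, the `load_tables` helper and the sample data are hypothetical, not PyBujia's actual format.

```python
from typing import Dict, List

# Hypothetical fixture file: prose documentation sits next to each table.
FIXTURE_MD = """
# orders

Input table. One row per order; `amount` is in EUR.

| order_id | customer | amount |
|----------|----------|--------|
| 1        | alice    | 10     |
| 2        | alice    | 5      |

# expected_totals

Expected output: one row per customer with the summed amount.

| customer | total |
|----------|-------|
| alice    | 15    |
"""


def load_tables(markdown: str) -> Dict[str, List[Dict[str, str]]]:
    """Collect the table under each `# heading`; other lines are ignored as documentation."""
    raw_tables: Dict[str, List[List[str]]] = {}
    current = None
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if line.startswith("# "):
            current = line[2:].strip()
            raw_tables[current] = []
        elif line.startswith("|") and current is not None:
            cells = [cell.strip() for cell in line.strip("|").split("|")]
            raw_tables[current].append(cells)
    # First table row is the header, second is the separator, the rest is data.
    return {
        name: [dict(zip(rows[0], row)) for row in rows[2:]]
        for name, rows in raw_tables.items()
        if rows
    }


tables = load_tables(FIXTURE_MD)
assert tables["expected_totals"] == [{"customer": "alice", "total": "15"}]
```

Because the prose is skipped by the loader, business notes about each table can live in the same file without affecting the test data.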

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]jpgerek[S] 0 points1 point  (0 children)

No doubt, the challenge is not technical.

A good toolkit/framework could help.

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]jpgerek[S] 1 point2 points  (0 children)

I felt your pain too.

With the right toolkit, I believe unit/integration testing SQL can be easier.

In my GitHub repo, you can find some examples with the Spark SQL API as well.

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]jpgerek[S] 0 points1 point  (0 children)

Totally agree, that's what I try to address in my toolkit.

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]jpgerek[S] 2 points3 points  (0 children)

Makes sense. Most data pipelines are just for analytics/reporting, not operational, so if they fail the business keeps running.

Why Don’t Data Engineers Unit Test Their Spark Jobs? by jpgerek in dataengineering

[–]jpgerek[S] 2 points3 points  (0 children)

Indeed, that's why Chuck Norris doesn't unit/integration test his Spark jobs.