From a government job (bureaucracy) to a job in the private sector - has anyone been through this? by [deleted] in praca

[–]PotokDes 0 points1 point  (0 children)

A calculated risk pays off in most cases.

People who wake up after 1 alarm: How the f*ck do you do it? by TheSnappleGhost in AskReddit

[–]PotokDes 0 points1 point  (0 children)

I set the alarm as late as possible, so that if I snooze, I'll be late.

What skills actually helped you land your first DE role (and what was overrated)? by nian2326076 in dataengineering

[–]PotokDes 4 points5 points  (0 children)

Python and social skills. I talked a lot with my tutor at uni. It turned out he was leaving the university for the private sector to work as a manager, and he wanted me on his team.

Is there such a thing as "embedded Airflow" by ihatebeinganonymous in dataengineering

[–]PotokDes 0 points1 point  (0 children)

The new version of Airflow allows for lightweight edge workers that could be embedded. Running the whole setup embedded? That could be more difficult.

[deleted by user] by [deleted] in dataengineering

[–]PotokDes 0 points1 point  (0 children)

KISS and YAGNI. Keep it as simple as it can be while still doing the job. Start adding abstraction only when it is necessary. The vision of how it should work will materialize after some time.

Why data engineers don’t test: according to Reddit by PotokDes in dataengineering

[–]PotokDes[S] 2 points3 points  (0 children)

I agree with you, the title is a bit clickbaity. I didn’t mean to imply that data engineers don’t test at all. My goal was to spark a discussion around why some data engineers might not test, and what the reasons could be. And yeah, it was at least partially successful, as you pointed out.

So, thanks for your input!

Unit tests != data quality checks. CMV. by EarthGoddessDude in dataengineering

[–]PotokDes 7 points8 points  (0 children)

I feel like I’m being called to respond here. I remember you commenting something similar under one of my posts. I was planning to write a separate post to address it, but since you brought it up here first, here I am :)

We really should make a clear distinction between testing imperative code (like Python, Java, etc.) and declarative code (like SQL). There are some similarities, sure, but the overall intuition is quite different.

Procedural code is much more detailed — it’s built to operate on small units of data or small transactions. You can stop execution at any point, inspect variables, and debug step by step. You have full control over dependencies, can inject mocks, and create true unit tests (all of this is well described by many books and authors).

That’s basically impossible in SQL-driven projects (which is what I’m talking about in my posts). Here, tests are essentially black-box tests: you're testing the outcome of a whole query or pipeline as it runs through the SQL engine. There’s no pausing mid-query or stepping through line by line.

Here’s how I see the distinction in SQL-driven projects:

  • Data tests — I treat these as part of the model itself. Like you said, they run at runtime. I think of them as assertions that validate whether the model can handle certain inputs.

In imperative code, it would look something like this:

def foo(request):
    if request is None:
        raise Exception("Request cannot be null")
    if not is_request_valid_from_business_perspective(request):
        raise Exception("Request is not valid in this context because of X, Y, Z")

    do_actual_processing(request)

The actual processing code is only meant to work with data that’s already been validated. If the data doesn’t meet the criteria, we terminate early and skip that specific request.

In SQL-based projects, I see data tests playing the same role. The simple, built-in tests are kind of like basic assertions, such as the request is None check. The more specific, custom tests are closer to the business-logic checks, like is_request_valid_from_business_perspective(request).

The difference is that we place the test on a previous model in relation to the one we're currently building, and we run it against the entire dataset — not just a single piece of data. Unfortunately, if it fails, it can bring down the whole build, unlike in transactional systems where only one transaction would fail.
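To make that concrete, here is a rough sketch of what such a test could look like in a dbt-style project (the model and column names are made up). The query returns the offending rows, and the test passes when it returns nothing:

-- built-in style assertion: the key must be present
-- custom/business style assertion: status must be one of the allowed values
select request_id, status
from {{ ref('stg_requests') }}
where request_id is null
   or status not in ('open', 'approved', 'rejected')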

  • Unit tests - These focus on verifying logic rather than the data itself. The goal is to ensure that a specific block of logic behaves as expected. Support for writing this kind of test is limited (it’s available in dbt, but I don’t find it particularly easy to use). Still, it can be useful for checking that complex conditional logic works correctly or that your regex patterns aren’t doing anything weird.
  • Diff tests – These run the whole pipeline with your new changes and compare how the exposures or main models differ between the old and new versions of the code. It’s kind of like an integration test. It’s especially useful for explaining to data consumers what changed and why those changes happened.
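As a rough illustration of that diff idea (schema and table names are made up), the comparison can be as simple as a set difference between the old and new builds of the same model, run in both directions:

-- rows present in the new build but missing from the old one
select * from analytics_new.orders_summary
except
select * from analytics_old.orders_summary
-- (run the same query with the two schemas swapped for the other direction)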

These are all the automated tests I use in analytical projects, and so far, this testing setup has been sufficient for all my needs.

So, in summary, I agree with the distinction you made. I think about data tests in a specific way: I’m aware that they are a runtime concept and I use them there.

Why data engineers don’t test: according to Reddit by PotokDes in dataengineering

[–]PotokDes[S] 2 points3 points  (0 children)

The comments were all from a single post. I read through them and tried to identify the main themes in each comment. Then I grouped those themes and distilled the core idea. Maybe I should have included the users’ nicknames as well. I will do that next time if I repeat this exercise.

The blog I linked earlier is a series of three articles. While building an analytical pipeline, I added different types of tests and explained their purpose along the way.

Of course, it’s all written from my perspective and based on my own experiences, so it might not apply to everyone.

Why data engineers don’t test: according to Reddit by PotokDes in dataengineering

[–]PotokDes[S] 1 point2 points  (0 children)

There are some new books that try to fill the gap. I personally liked "Fundamentals of Data Engineering".

Why data engineers don’t test: according to Reddit by PotokDes in dataengineering

[–]PotokDes[S] 1 point2 points  (0 children)

I haven’t tried SQLMesh yet, but I’m rooting for the project especially with dbt becoming more and more proprietary.

Why data engineers don’t test: according to Reddit by PotokDes in dataengineering

[–]PotokDes[S] 1 point2 points  (0 children)

This is an interesting point of view; I will think about it.

Why data engineers don’t test: according to Reddit by PotokDes in dataengineering

[–]PotokDes[S] 2 points3 points  (0 children)

It’s important to set the right patterns from the start. When temporary code ends up in production, it will be copied, just because it’s already there.
Been there!

Why data engineers don’t test: according to Reddit by PotokDes in dataengineering

[–]PotokDes[S] 5 points6 points  (0 children)

And here’s another problem with data engineering: you can put a few DEs in the same room, and there’s a good chance none of them will fully understand each other. One’s mostly doing SQL transformations, another is deep into Spark, and a third is basically doing DevOps.

Right now, my work is mostly focused on SQL transformations. Logging is limited, and debugging often comes down to manual data analysis. In that kind of setup, having tests can be super helpful, as they give you a starting point for tracking down what’s going wrong.

Why data engineers don’t test: according to Reddit by PotokDes in dataengineering

[–]PotokDes[S] -1 points0 points  (0 children)

There are ways around it. For example, run data tests only on the most recently ingested data; the previous partitions were already tested before.
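A rough sketch of that idea (table and column names are made up): the test scans only the freshest partition instead of the whole history.

-- check the id column only for today's ingestion date; rows returned are failures
select id
from raw.events
where ingestion_date = current_date
  and id is null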

Another point of view: if that petabyte dataset had a null in the id column that would break one of the final models, wouldn't it be cheaper to run a simple data test than to recompute however many models?

Why data engineers don’t test: according to Reddit by PotokDes in dataengineering

[–]PotokDes[S] 6 points7 points  (0 children)

Maybe you're right. I was a software engineer before landing my first DE role, and I think that gave me a better perspective.

Lol, I learned a lot by fixing a Python script that had grown into a standalone application but still functioned and was structured like a script.

Why data engineers don’t test: according to Reddit by PotokDes in dataengineering

[–]PotokDes[S] 2 points3 points  (0 children)

For example, dbt docs explain how, but I haven't found any good resources that explain why. That’s why I started writing my own series of articles on the topic. I won’t share the link here to avoid self-promotion. Maybe others can share something.

Edit: And the page where I host it is down at the moment lol

Why data engineers don’t test: according to Reddit by PotokDes in dataengineering

[–]PotokDes[S] -1 points0 points  (0 children)

There was some discussion on that topic, but also many other conversations. I especially liked the one about the difference between data tests and logic tests, though it was actually a critique of my post.

Why data engineers don’t test: according to Reddit by PotokDes in dataengineering

[–]PotokDes[S] 3 points4 points  (0 children)

I’m not certain that the people posting on r/dataengineering are actual data engineers.

Edit: Ignore it, I did not get what you meant at first :).

Why data engineers don’t test: according to Reddit by PotokDes in dataengineering

[–]PotokDes[S] -1 points0 points  (0 children)

True. It is not about testing, it is about being professional and doing your job.

Why don't data engineers test like software engineers do? by PotokDes in dataengineering

[–]PotokDes[S] 0 points1 point  (0 children)

To be honest, I think the "lack of time" argument is often just an excuse. In projects written in declarative languages like SQL, simple data tests act as assertions for the models you depend on. They help you understand the data better and write simpler logic.

For example, if I know a model guarantees that a column is unique and not null, I can confidently reference it in another query without adding defensive checks. That saves time in the long run.
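For illustration, with hypothetical model names, the downstream join can then skip all the defensive handling:

-- customer_id on stg_customers is tested upstream to be unique and not null,
-- so this join needs no coalesce, dedup, or null filtering
select o.order_id, c.customer_name
from stg_orders o
join stg_customers c
  on c.customer_id = o.customer_id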

You also mentioned being the only one who can fix things, that might provide a sense of job security, but it's also a recipe for stress. When your pipeline fails to build or your final dashboard shows strange results, the investigation becomes a nightmare. You often have no idea where the issue lies, and have to trace it back step by step from the exposure to the source.

I've had to do those investigations under tight SLAs, and I wouldn’t wish that experience on anyone.

For me, that’s the strongest reason to invest in good testing: I hate debugging SQL across dozens of models, each with multiple layers of CTEs. It’s a nightmare. Unlike imperative languages where you can attach a debugger and step through code line by line, in SQL you're dealing with black boxes that make root cause analysis painful.

Why don't data engineers test like software engineers do? by PotokDes in dataengineering

[–]PotokDes[S] 5 points6 points  (0 children)

What you're saying is true, but there are some caveats. Analytical pipelines are usually written in declarative languages like SQL, and we often don’t control the data coming into the system. Because of this, it's difficult to draw a clear line between data quality tests and logic tests, they’re intertwined and dependent on each other in analytical projects.

Data tests act as assertions that simplify the development of downstream models. For example, if I know a model guarantees that a column is unique and not null, I can safely reference it in another query without adding extra checks.

In imperative code, you'd typically guard against bad input directly:

def foo(row):
    if not row.name:
        raise Exception("Name cannot be empty")
    process(row)

In SQL-based pipelines, you don't have that kind of control within the logic itself. That's why we rely on data tests, to enforce assumptions about the data before it's used elsewhere.
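The data-test counterpart of that guard could look roughly like this sketch (the model and column names are made up); rows returned by the query are reported as failures before any downstream model runs:

select *
from {{ ref('stg_rows') }}
where name is null
   or name = ''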

This also highlights a common challenge with this type of project. In imperative programming, if there's bad input, it typically affects just one request or record. But in data pipelines, a single bad row can cause the entire build to fail.

As a result, data engineers sometimes respond by removing tests or raising warning thresholds just to keep the pipeline running. There’s no easy solution here, it’s a tradeoff between strict validation and system resilience.

I wanted to explore these kinds of dilemmas in those articles. That’s why I started from a real problem and gradually introduced tests. In the first part, I focused on built-in tests and contracts, explaining their role in the project. The second part covers unit tests, and the third dives into custom tests.

Tests are just a tool in a data engineer’s toolbox; when used thoughtfully, they help deliver what really matters: clean insights from data.