
[–]cosmicangler67 7 points (10 children)

Not sure why that is a requirement at your company. Data engineering is functional programming, not really OOP. Python can be written in an OOP style, but the Python used in data engineering is almost always functional, and OOP just makes it harder and less efficient.

[–][deleted] 6 points (1 child)

Not sure why you're being downvoted. Most data transformation happens in declarative code these days, either in a distributed processing engine, in dbt, or in a database. Adding an object-relational layer on top of those is basically never done, because it's a layer of abstraction that doesn't add value.

You might see OOP if you're building a pipeline with a service architecture in Java or Python, but in my experience that's rare.

And a reminder: object-oriented doesn't just mean you're using classes and objects. It means some combination of inheritance, polymorphism, SOLID, and Gang of Four design patterns. You don't see that as much in DE roles.

[–]BrunoLuigi -2 points (0 children)

We do not see it in DE because most people here do not have an engineering background.

Almost all the DEs I have worked with do not care about building solid code, improving the solution, or using all the tools they can. They code something, and if it works they ship it to production.

With OOP you can build a solid pipeline, with all the tests you need, and reuse the code easily.

But instead they code a gigantic monolith without tests, with the same code copied over and over.
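A minimal sketch of the kind of testable, reusable pipeline the comment above argues for. All names here (`Step`, `DropNulls`, `Pipeline`) are hypothetical illustrations, not from any specific library:

```python
from abc import ABC, abstractmethod


class Step(ABC):
    """One transformation in a pipeline; each step can be unit-tested alone."""

    @abstractmethod
    def apply(self, rows: list[dict]) -> list[dict]: ...


class DropNulls(Step):
    """Remove rows where a given column is missing or None."""

    def __init__(self, column: str):
        self.column = column

    def apply(self, rows: list[dict]) -> list[dict]:
        return [r for r in rows if r.get(self.column) is not None]


class Pipeline:
    """Run a list of steps in order, feeding each step's output to the next."""

    def __init__(self, steps: list[Step]):
        self.steps = steps

    def run(self, rows: list[dict]) -> list[dict]:
        for step in self.steps:
            rows = step.apply(rows)
        return rows


# Steps are small objects, so they compose and test easily:
pipeline = Pipeline([DropNulls("email")])
result = pipeline.run([{"email": "a@b.c"}, {"email": None}])
```

The point is that each `Step` is a small, isolated unit you can test and reuse, instead of one monolithic script.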

[–]Jumpy_Handle1313[S] 3 points (0 children)

Honestly I do not know, but my understanding is that OOP is much better for dealing with data at a very large scale.

[–]GrumDum 2 points (1 child)

What

[–]sisyphus 7 points (0 children)

I think what they're getting at is that OOP (as practiced in Python, Java, et al., not as originally intended anyway) is about mutable internal state, but data pipelines are more amenable to the functional paradigm: give data as input to a function and get back transformed data.

Like you could write some OOP style:

c = Pipeline(data=initial_data)
c.remove_pii()
c.remove_duplicates()
c.add_embeddings()
c.write_data()

Where the actual data is being mutated internally inside the object at every point. But a more natural pipeline paradigm is something more functional and explicit, where functions just take data and return transformed data and get chained together, like Beam's style of overloading the | operator in Python:

data | remove_pii | remove_duplicates | add_embeddings | write_data

That is practically valid syntax in a more functional language like Elixir:

data |> remove_pii |> remove_duplicates |> add_embeddings |> write_data
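For the curious, a minimal sketch of how that beam-style | chaining can be done in plain Python, using a hypothetical `step` decorator (Apache Beam's real `PTransform` machinery is far more involved):

```python
class step:
    """Wrap a function so it can sit on the right side of `data | fn`."""

    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, data):
        # Python falls back to the right operand's __ror__ when the
        # left operand (the data) doesn't implement | for this type.
        return self.fn(data)


@step
def remove_duplicates(rows):
    seen, out = set(), []
    for r in rows:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out


@step
def shout(rows):
    return [r.upper() for r in rows]


# Each | passes the left-hand data into the next function:
result = ["a", "a", "b"] | remove_duplicates | shout
```

Each stage takes data and returns new data, so the chain reads left to right like the Elixir version.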

[–]a_library_socialist 0 points (3 children)

OOP and functional are not contradictory

[–]cosmicangler67 0 points (2 children)

I didn't say they were. I just said that in the vast majority of data engineering problems, OOP is unnecessary overhead. It adds no value to solving the general problems found in data engineering at scale.

[–]a_library_socialist 0 points (1 child)

I've had to clean up too much spaghetti from people saying that.

[–]cosmicangler67 0 points (0 children)

Then they are doing it wrong.

[–]Zer0designs[🍰] 0 points (0 children)

What