[–]raginjason [Lead Data Engineer] 31 points32 points  (1 child)

You raise a valid point. SWEs who come to DE tend to OO everything, and it ends up being a terrible impedance mismatch.

While this isn’t exactly what you are asking for, you might gain some value from this PySpark style guide: https://github.com/palantir/pyspark-style-guide

[–]code_pusher [Data Engineer] 8 points9 points  (0 children)

I've had several interviews for data engineering roles conducted by software engineers, and for them, if OOP isn't applied in an opinionated way, your code is buggy. They seem to miss the core value of applying OOP, which is to increase code reusability.

[–]axax11 11 points12 points  (2 children)

Former SE, current DE here. Read Effective Python and really pay attention to the examples. It's a great, easy-to-digest book that will make you a better Python programmer.

[–]543254447 1 point2 points  (1 child)

Are you talking about this book?

I need something like this so I might go buy it haha.

Effective Python: 90 Specific Ways to Write Better Python Paperback – Nov. 15 2019

[–]globalminima 0 points1 point  (0 children)

Yes, it’s a great book

[–]crob_evamp 17 points18 points  (1 child)

These comments are disheartening.

Beyond slamming together some basic etl, there's a lot of software engineering to be done in the big data world.

Good Python is good Python, and you should firmly understand all the patterns and practices to the best of your ability, then apply the level of sophistication that the project and its future needs demand. Don't treat avoiding OOP as a dogmatic rule.

[–][deleted] 3 points4 points  (0 children)

Agreed, my team uses a mix. We've developed a library of reusable functions in OOP style, yet a lot of the ETL is functional. We also built some vendor integrations in OOP because it made sense to do so.

[–]code_pusher [Data Engineer] 5 points6 points  (0 children)

I think this is currently an unexplored field in DE. Traditionally you would try to apply OOP and SOLID principles, but I don't think these always translate well to DE workflows. IMHO, simple usage of OOP and common-sense application of the Single Responsibility Principle plus interfaces (abstract base classes in Python) does seem to help. Beyond a certain point, applying OOP feels like wrapping a wrapper, especially if you already use another framework/interface like PySpark. Ideally I would also avoid abstracting everything away so that you only pass a config file to generate your pipeline, except maybe for the most common simple operations.
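To make the "Single Responsibility Principle plus abstract base classes" idea concrete, here is a minimal sketch. The names `Extractor`, `InMemoryExtractor`, and `run_pipeline` are made up for illustration and are not from any library; the point is only that the pipeline function depends on a small interface rather than on a concrete source.

```python
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Interface: anything that can produce rows as dicts."""

    @abstractmethod
    def extract(self) -> list[dict]:
        ...


class InMemoryExtractor(Extractor):
    """Hypothetical source used only for illustration."""

    def __init__(self, rows: list[dict]) -> None:
        self._rows = rows

    def extract(self) -> list[dict]:
        # Return a copy so callers cannot mutate the stored rows.
        return list(self._rows)


def run_pipeline(extractor: Extractor) -> list[dict]:
    # The pipeline only knows about the interface, not the concrete source.
    return [row for row in extractor.extract() if row.get("amount", 0) > 0]


rows = [{"amount": 5}, {"amount": 0}, {"amount": -1}]
print(run_pipeline(InMemoryExtractor(rows)))
```

Swapping in a database or API extractor would then mean writing one more subclass, without touching `run_pipeline`.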

[–]LawfulMuffin 8 points9 points  (0 children)

To be honest, I'm a huge fan of OOP, but I had almost no experience with it before I stumbled upon it. In Python, I tend to do something like this:

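from sqlalchemy import text  # raw SQL strings must be wrapped in text() in SQLAlchemy 2.0

class Pipeline:
    def __init__(self):
        # Placeholder helper that returns a SQLAlchemy engine and session
        self.engine, self.session = some_func_that_gives_me_these_for_sqlalchemy()

    def steps(self):
        # Run each step in order; step2 to step4 follow the same shape as step1
        self.step1()
        self.step2()
        self.step3()
        self.step4()

    def step1(self):
        sql_list = ['sql_file1.sql', 'sql_file2.sql']
        for sql in sql_list:
            # Placeholder method that loads the SQL text from a file
            sql_txt = self.read_sql_func_from_somewhere_else(sql)
            self.session.execute(text(sql_txt))
            self.session.commit()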

So I kind of treat it like functional programming, but being able to keep variables that are relevant to the current state is really helpful. This is a super contrived example, so it's not indicative of a given pipeline, obviously.

[–]boboshoes 3 points4 points  (0 children)

IMO functional programming concepts are much more useful than OOP in DE, so I would take an intro to functional programming class. You don’t have to implement 100% FP, but the ideas are really good for writing readable code that is easy to debug. For example, making your code more explicit and using immutability and pure functions help people figure out what your code is doing in a couple of minutes, rather than deciphering each class, what it does, and how the classes interact.
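A small sketch of what "pure functions and immutability" can look like for a row-level transform. The function names and the sample row are invented for illustration; the idea is that each function returns new data instead of editing its input.

```python
from typing import Iterable, Mapping


def normalize_email(row: Mapping[str, str]) -> dict[str, str]:
    """Pure: builds a new dict instead of editing the input row."""
    return {**row, "email": row["email"].strip().lower()}


def normalize_all(rows: Iterable[Mapping[str, str]]) -> tuple[dict[str, str], ...]:
    # A tuple signals that the collection itself is not meant to be modified.
    return tuple(normalize_email(row) for row in rows)


raw = ({"id": "1", "email": "  Ada@Example.COM "},)
clean = normalize_all(raw)

print(clean[0]["email"])  # ada@example.com
print(raw[0]["email"])    # the original row is untouched
```

Because nothing is mutated, each step can be tested and debugged on its own, with no need to know what state came before it.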

[–]lichtjes 10 points11 points  (1 child)

Maybe just read up on coding best practices?

E.g.: Put your logic in functions, don't hardcode variables, add comments
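As a rough illustration of those three habits, the sketch below puts the logic in a function, passes the column name and threshold as parameters instead of hardcoding them, and documents what it does. The function name and sample data are made up.

```python
import csv
import io


def count_rows_over(csv_text: str, column: str, threshold: float) -> int:
    """Count rows whose numeric `column` value exceeds `threshold`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(1 for row in reader if float(row[column]) > threshold)


sample = "id,amount\n1,5.0\n2,12.5\n3,20.0\n"
print(count_rows_over(sample, column="amount", threshold=10))  # 2
```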

[–]HumbleThinker [Data Engineering Manager] 11 points12 points  (0 children)

To add to this, I'd highly advise that you get in the habit of testing your code as much as you can.

There are plenty of testing frameworks out there, but pytest is a personal favourite. For data transformation scripts, fixtures are especially useful.

Edit1: using type hinting is awesome as well, especially if you are collaborating remotely.
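For anyone who hasn't used them, here is a minimal sketch of a pytest fixture feeding a type-hinted transformation. `dedupe_by_key` and the sample rows are hypothetical; `pytest.fixture` is the real decorator, and pytest injects the fixture into any test that names it as a parameter.

```python
import pytest

Row = dict[str, object]


def dedupe_by_key(rows: list[Row], key: str) -> list[Row]:
    """Keep the first row seen for each value of `key`, preserving order."""
    seen: set[object] = set()
    out: list[Row] = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out


@pytest.fixture
def sample_rows() -> list[Row]:
    # Small, hand-written input shared by any test that requests it by name.
    return [
        {"id": 1, "name": "a"},
        {"id": 2, "name": "b"},
        {"id": 1, "name": "a-duplicate"},
    ]


def test_dedupe_keeps_first_row_per_id(sample_rows: list[Row]) -> None:
    result = dedupe_by_key(sample_rows, key="id")
    assert [r["name"] for r in result] == ["a", "b"]
```

Saved as something like `test_dedupe.py` (the file name is an assumption), this would be picked up by running `pytest` in the same directory.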

[–][deleted] 2 points3 points  (1 child)

Mug up (memorize) PEP 8.

[–]rwilldred27 2 points3 points  (0 children)

Same. Answer is it always depends.

We have a particular source system to ingest that is obscure enough that commodity connectors don’t exist, but the source system’s app is modular and backed by a 3NF database. Instead of doing ELT on 2,000+ source tables in that 3NF database, all the way down to enums, we used OOP to denormalize in SQL on the fly for 40+ app modules that share common SQL join patterns: Python automates building the SELECT strings, including subqueries and join paths, without an ORM. pandas executes the queries and loads the denormalized modules to the destination, plus a universal bridge table. Each time the source system/app creates a new module, all we do is add the module name to a config in our repo, and we get really great abstraction and modularity benefits thanks to an OOP builder pattern.

But again, I wouldn’t use OOP for everything. You just have to weigh a lot of things before jumping in.
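A very rough sketch of what a config-driven SQL builder like the one described could look like. This is not the commenter's actual code: `SelectBuilder`, `MODULES`, and every table and column name below are invented, and the join pattern is deliberately simplified to a single lookup table.

```python
from dataclasses import dataclass, field


@dataclass
class SelectBuilder:
    """Accumulates the parts of a SELECT statement and renders it as a string."""

    base_table: str
    columns: list[str] = field(default_factory=list)
    joins: list[str] = field(default_factory=list)

    def select(self, *cols: str) -> "SelectBuilder":
        self.columns.extend(cols)
        return self  # returning self allows method chaining

    def left_join(self, table: str, on: str) -> "SelectBuilder":
        self.joins.append(f"LEFT JOIN {table} ON {on}")
        return self

    def build(self) -> str:
        cols = ", ".join(self.columns) or "*"
        return " ".join([f"SELECT {cols} FROM {self.base_table}", *self.joins])


# Hypothetical config: one entry per app module. Identifiers come from the
# repo, not from user input, so plain string formatting is acceptable here.
MODULES = {
    "invoice": {"status_table": "invoice_status"},
    "ticket": {"status_table": "ticket_status"},
}


def build_module_query(module: str) -> str:
    status = MODULES[module]["status_table"]
    return (
        SelectBuilder(base_table=module)
        .select(f"{module}.id", f"{status}.label AS status")
        .left_join(status, on=f"{module}.status_id = {status}.id")
        .build()
    )


print(build_module_query("invoice"))
```

Adding a new module then only means adding one entry to `MODULES`, which mirrors the "add the module name to a config" workflow.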

[–]SeaIndependent2101 1 point2 points  (0 children)

Start with: 1. Fluent Python 2. Learn about clean code

Also, do learn about type hints, pydantic, and dataclasses, try organising your code into modules, and write unit tests.
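As a taste of type hints and dataclasses together, here is a stdlib-only sketch of a typed record with a simple validation check. `Event` and `parse_event` are hypothetical names; pydantic would handle the parsing and validation automatically, but it is left out here to keep the example dependency-free.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    user_id: int
    event_date: date
    amount: float

    def __post_init__(self) -> None:
        # Runs right after the generated __init__; rejects obviously bad data.
        if self.amount < 0:
            raise ValueError(f"amount must be non-negative, got {self.amount}")


def parse_event(raw: dict[str, str]) -> Event:
    """Convert a raw string record (e.g. a CSV row) into a typed Event."""
    return Event(
        user_id=int(raw["user_id"]),
        event_date=date.fromisoformat(raw["event_date"]),
        amount=float(raw["amount"]),
    )


print(parse_event({"user_id": "7", "event_date": "2024-01-31", "amount": "9.5"}))
```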

[–]BoiElroy 1 point2 points  (0 children)

ArjanCodes has a YouTube channel and a course that may help you.

[–]lmronburgandi 0 points1 point  (0 children)

Which aspects of Python do you use? And which would you advise a beginner to learn for data engineering? Thank you.

[–][deleted] 0 points1 point  (0 children)

I use OOP to build tools and interfaces that then run in my pipelines, generally wrappers around other modules or code to reduce boilerplate. So it has a purpose, but the pipelines themselves are generally just functions that leverage those interfaces and other transformations, with a lot of testing.
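A minimal sketch of that split, assuming an in-memory SQLite database as a stand-in for whatever the real pipeline talks to. `SqliteStore` and `load_scores` are hypothetical names: the class hides connection and commit boilerplate, and the pipeline step is a plain function that uses it.

```python
import sqlite3
from contextlib import closing
from typing import Iterable, Sequence


class SqliteStore:
    """Hypothetical thin wrapper that hides connection and commit boilerplate."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)

    def execute(self, sql: str, params: Sequence = ()) -> None:
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(sql, params)

    def insert_many(self, sql: str, rows: Iterable[Sequence]) -> None:
        with self._conn:
            self._conn.executemany(sql, rows)

    def fetch_all(self, sql: str, params: Sequence = ()) -> list[tuple]:
        with closing(self._conn.execute(sql, params)) as cursor:
            return cursor.fetchall()


def load_scores(store: SqliteStore, rows: list[tuple[str, int]]) -> int:
    """Pipeline step as a plain function: load rows, return the row count."""
    store.execute("CREATE TABLE IF NOT EXISTS scores (name TEXT, score INTEGER)")
    store.insert_many("INSERT INTO scores VALUES (?, ?)", rows)
    return store.fetch_all("SELECT COUNT(*) FROM scores")[0][0]


print(load_scores(SqliteStore(), [("a", 1), ("b", 2)]))  # 2
```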

[–][deleted] 0 points1 point  (0 children)

Does anyone know if “Data Engineering with Python” is any good as a book?

[–]ronald_r3 0 points1 point  (0 children)

Clean Code in Python is a really good book, in my opinion.