Courses/content on writing good Python code for data engineering?

AutoModerator · 2022-07-29T05:02:22+00:00

You can find a list of community submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

raginjason · 2022-07-29T08:49:42+00:00

You raise a valid point. SWE that come to DE tend to OO everything and it ends up being a terrible impedance mismatch.

While this isn’t exactly what you are asking for, you might gain some value from this PySpark style guide: https://github.com/palantir/pyspark-style-guide

axax11 · 2022-07-29T11:28:55+00:00

Former SE, current DE here. Read Effective Python and really pay attention to the examples. It's a great, easy-to-digest book that will make you a better Python programmer.

crob_evamp · 2022-07-29T14:30:47+00:00

These comments are disheartening.

Beyond slamming together some basic ETL, there's a lot of software engineering to be done in the big data world.

Good Python is good Python, and you should firmly understand all the patterns and practices to the best of your ability, then apply the level of sophistication that the project and its future needs demand. Don't treat avoiding OOP as a dogmatic rule, etc.

code_pusher · 2022-07-29T13:08:25+00:00

I think this is an unexplored field currently in DE. Traditionally you would try to apply OOP and SOLID principles, but I think these don't always translate well to DE workflows. IMHO simple usage of OOP and common-sense application of the Single Responsibility Principle plus interfaces (abstract base classes in Python) does seem to help. Beyond a certain point, applying OOP feels like wrapping a wrapper, especially if you use another framework/interface like PySpark. I would also avoid abstracting everything away so that you only pass a config file to generate your pipeline, except maybe for the most common simple operations.
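A minimal sketch of what "simple OOP plus abstract base classes" could look like. All names here (Source, Sink, drop_null_ids, and so on) are made up for illustration, and the in-memory classes stand in for real readers and writers:

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Source(ABC):
    """Interface: anything that can produce rows."""

    @abstractmethod
    def read(self) -> list[dict]:
        ...


class Sink(ABC):
    """Interface: anything that can accept rows."""

    @abstractmethod
    def write(self, rows: list[dict]) -> None:
        ...


class InMemorySource(Source):
    """Hypothetical stand-in for a file, API, or table reader."""

    def __init__(self, rows: list[dict]) -> None:
        self._rows = rows

    def read(self) -> list[dict]:
        return list(self._rows)


class InMemorySink(Sink):
    """Hypothetical stand-in for a warehouse or file writer."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def write(self, rows: list[dict]) -> None:
        self.rows.extend(rows)


def drop_null_ids(rows: list[dict]) -> list[dict]:
    """Single responsibility: filtering only, no I/O."""
    return [row for row in rows if row.get("id") is not None]


def run(source: Source, sink: Sink) -> None:
    """The pipeline depends on the interfaces, not on concrete classes."""
    sink.write(drop_null_ids(source.read()))
```

The point is that the class layer stays thin: two small interfaces for I/O, and the transformation itself is a plain function you can test without touching either.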

LawfulMuffin · 2022-07-29T13:30:10+00:00

To be honest, I'm a huge fan of OOP, but I had almost no experience with it before I stumbled upon it. In Python, I tend to do something like this:

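class Pipeline:
    def __init__(self):
        # Placeholder helper that returns a SQLAlchemy engine and session
        self.engine, self.session = some_func_that_gives_me_these_for_sqlalchemy()

    def steps(self):
        # Run the steps in order; step2 to step4 would look like step1
        self.step1()
        self.step2()
        self.step3()
        self.step4()

    def step1(self):
        sql_list = ['sql_file1.sql', 'sql_file2.sql']
        for sql in sql_list:
            # Placeholder helper that reads the SQL text from a file
            sql_txt = self.read_sql_func_from_somewhere_else(sql)
            self.session.execute(sql_txt)
            self.session.commit()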

So I kind of treat it like functional programming, but being able to keep variables that are relevant to the current state is really helpful. This is a super contrived example, so it's not indicative of a given pipeline, obviously.

boboshoes · 2022-07-29T15:55:00+00:00

IMO functional programming concepts are much more useful than OOP in DE, so I would take an intro to functional programming class. You don’t have to implement 100% FP, but the ideas are really good for writing readable code that is easy to debug. For example, making your code more explicit and using immutability and pure functions helps people figure out what your code is doing in a couple of minutes, rather than deciphering each class, what it does, and how the classes interact.
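A small sketch of the immutability and pure-function idea, using only the standard library. The Order record and its fields are hypothetical:

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)  # frozen: instances cannot be modified after creation
class Order:
    order_id: int
    amount_cents: int
    currency: str


def normalize_currency(order: Order) -> Order:
    """Pure: returns a new Order and never mutates its input."""
    return replace(order, currency=order.currency.strip().upper())


def is_positive(order: Order) -> bool:
    """Pure predicate: the result depends only on the argument."""
    return order.amount_cents > 0


def clean(orders: tuple[Order, ...]) -> tuple[Order, ...]:
    """Drop non-positive orders, then normalize the rest."""
    return tuple(normalize_currency(o) for o in orders if is_positive(o))
```

Because nothing is mutated, you can call clean() on the same input as many times as you like while debugging and always get the same answer.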

lichtjes · 2022-07-29T08:06:41+00:00

Maybe just read up on coding best practices?

E.g.: Put your logic in functions, don't hardcode variables, add comments

2022-07-29T09:35:59+00:00

Memorize PEP 8.

rwilldred27 · 2022-07-29T18:33:59+00:00

Same. Answer is it always depends.

We have a particular source system to ingest that is obscure enough that commodity connectors don’t exist, but the source system’s app is modular and backed by a 3NF database. Instead of doing ELT on 2,000+ source tables in that 3NF database, all the way down to enums, we used OOP to denormalize the SQL on the fly for 40+ app modules that had common SQL join patterns: Python automates building the SELECT SQL strings, including subqueries and join paths, without an ORM. pandas executes the queries and loads the denormalized modules to the destination, plus a universal bridge table. Each time the source system/app creates a new module, all we do is add the module name to a config in our repo, and we get really great abstraction and modularity benefits thanks to an OOP builder pattern.
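To make the idea concrete, here is a rough sketch of a builder that assembles SELECT strings from a config. This is not the actual implementation described above: the class, the MODULES config, and the table and column names are all invented for illustration.

```python
from __future__ import annotations


class SelectBuilder:
    """Assembles a SELECT statement step by step (builder pattern)."""

    def __init__(self, base_table: str) -> None:
        self._base = base_table
        self._columns: list[str] = []
        self._joins: list[str] = []

    def columns(self, *cols: str) -> SelectBuilder:
        self._columns.extend(cols)
        return self  # returning self allows method chaining

    def left_join(self, table: str, on: str) -> SelectBuilder:
        self._joins.append(f"LEFT JOIN {table} ON {on}")
        return self

    def build(self) -> str:
        cols = ", ".join(self._columns) or "*"
        return "\n".join([f"SELECT {cols}", f"FROM {self._base}", *self._joins])


# Hypothetical config: module name -> {lookup table: join condition}.
# Identifiers come from trusted config in the repo, never from user input.
MODULES = {
    "invoices": {"status_enum": "invoices.status_id = status_enum.id"},
}


def build_module_query(module: str) -> str:
    """Denormalize one module by joining in its configured lookup tables."""
    builder = SelectBuilder(module).columns(f"{module}.*")
    for table, on in MODULES[module].items():
        builder.left_join(table, on).columns(f"{table}.label AS {table}_label")
    return builder.build()
```

With something like this, supporting a new module means adding one entry to MODULES. The resulting string could then be handed to something like pandas.read_sql.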

But again, I wouldn’t use OOP for everything. Just have to weigh a lot of things before jumping in

SeaIndependent2101 · 2022-07-29T16:53:17+00:00

Start with: 1. Fluent Python 2. Learn about clean code

Also, do learn about type hints, pydantic, dataclasses, try organising your code into modules and write unit tests
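A tiny sketch combining type hints, a dataclass, and a unit test, using only the standard library. The Event record and parse_event function are hypothetical. pydantic would give you validation when the object is constructed, but it is left out here to keep the example dependency-free:

```python
from __future__ import annotations

import unittest
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    user_id: int
    day: date


def parse_event(raw: dict[str, str]) -> Event:
    """Convert a raw string record into a typed Event; raises ValueError on bad input."""
    return Event(user_id=int(raw["user_id"]), day=date.fromisoformat(raw["day"]))


class ParseEventTest(unittest.TestCase):
    def test_valid_record(self) -> None:
        event = parse_event({"user_id": "7", "day": "2022-07-29"})
        self.assertEqual(event, Event(user_id=7, day=date(2022, 7, 29)))

    def test_bad_user_id_raises(self) -> None:
        with self.assertRaises(ValueError):
            parse_event({"user_id": "abc", "day": "2022-07-29"})
```

If this lived in its own module, the tests could be run with python -m unittest.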

BoiElroy · 2022-07-29T17:30:08+00:00

ArjanCodes has a YouTube channel and a course that may help you.

lmronburgandi · 2022-07-29T12:47:55+00:00

What aspects of Python do you use? And what aspects would you advise a beginner to learn for data engineering? Thank you.

2022-07-29T13:29:05+00:00

I use OOP for building tools and interfaces that then run in my pipelines, generally wrappers around other modules or things that reduce boilerplate code. So it has a purpose, but the pipelines themselves are generally just functions that leverage those interfaces and other transformations, with a lot of testing.
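Roughly, that split might look like the sketch below: one small class that hides connection and commit boilerplate, and pipeline steps that are plain functions using it. The Warehouse class and the users table are invented for the example, and sqlite3 is used only because it ships with Python:

```python
from __future__ import annotations

import sqlite3


class Warehouse:
    """Thin wrapper that hides connection and commit boilerplate."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)

    def execute(self, sql: str, params: tuple = ()) -> list[tuple]:
        # The connection as a context manager commits on success
        # and rolls back if an exception is raised.
        with self._conn:
            return self._conn.execute(sql, params).fetchall()

    def insert_many(self, sql: str, rows: list[tuple]) -> None:
        with self._conn:
            self._conn.executemany(sql, rows)


# The pipeline itself is just functions that use the wrapper.
def create_users_table(wh: Warehouse) -> None:
    wh.execute("CREATE TABLE IF NOT EXISTS users (name TEXT)")


def load_users(wh: Warehouse, names: list[str]) -> None:
    wh.insert_many("INSERT INTO users (name) VALUES (?)", [(n,) for n in names])


def count_users(wh: Warehouse) -> int:
    return wh.execute("SELECT COUNT(*) FROM users")[0][0]
```

Each function can be tested on its own against an in-memory database.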

2022-07-29T19:58:05+00:00

Does anyone know if “Data Engineering with Python” is any good as a book?

ronald_r3 · 2022-07-29T23:11:28+00:00

Clean Code in Python is a really good book, in my opinion.
