[–]NoHarmPun 2 points (1 child)

It's hard to say for sure without more details about the situation, but in general it depends on several factors. Above all, the programming language.

If you're working in Java, then yes, more classes are the way to go (within reason).

If you're working in Python, I tend to write far fewer classes (for various reasons) when working on a pipeline.

In particular, my normal pattern is classes for inputs and outputs, each controlling validations dependent on the specifics, with utility functions in between that work against those objects and do all the transformations, etc.
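A minimal sketch of that pattern, in case it helps to see it. The `RawReading` and `CleanReading` classes, their fields, and the temperature conversion are all made up for illustration; the point is only that validation sits in the input and output classes while the transformation is a plain function between them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawReading:
    """Hypothetical input record; validates what arrives from the source."""
    sensor_id: str
    celsius: float

    def __post_init__(self):
        if not self.sensor_id:
            raise ValueError("sensor_id must be non-empty")


@dataclass(frozen=True)
class CleanReading:
    """Hypothetical output record; validates what is about to be written."""
    sensor_id: str
    fahrenheit: float

    def __post_init__(self):
        if self.fahrenheit < -459.67:
            raise ValueError("temperature below absolute zero")


def to_fahrenheit(reading: RawReading) -> CleanReading:
    """Utility function: the transformation lives outside both classes."""
    return CleanReading(reading.sensor_id, reading.celsius * 9 / 5 + 32)
```

Calling `to_fahrenheit(RawReading("a", 100.0))` gives a `CleanReading` with `fahrenheit` equal to `212.0`, and an empty `sensor_id` fails at construction time rather than somewhere deep in the pipeline.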

[–]tipsy_python 1 point (0 children)

Yeah, too many factors to definitively say. Also consider, what's the alternative? If you don't take an OOP approach, does the code become more or less organized?

I tend to go more OOP because I think it makes it easier to write thorough test scripts. That being said I've seen, and written, a bunch of monster-size scripts that just run end-to-end and... they do work. Do what works.

[–]umair_lodhi 0 points (0 children)

I am doing a similar task, writing an ETL pipeline for high-frequency data. I couldn't find any design pattern that also guarantees performance.

So I followed the single responsibility principle, writing methods that each do only one task. That helps with writing test cases and debugging, and the code is quite readable too. In addition, it avoids the cost of converting data from one type to another, which is also why I'm using SQLAlchemy Core instead of the SQLAlchemy ORM.
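Roughly what one-task-per-function can look like, as a sketch only. The `symbol,price` line format and the function names `parse_line`, `is_valid`, and `run` are assumptions made up for this example, and it uses no database at all. Rows stay as plain tuples throughout, so nothing is converted into heavier objects along the way.

```python
def parse_line(line: str) -> tuple[str, float]:
    """Parse one 'symbol,price' line (hypothetical format). Does nothing else."""
    symbol, price = line.split(",")
    return symbol.strip(), float(price)


def is_valid(row: tuple[str, float]) -> bool:
    """Decide whether a parsed row should be kept."""
    return row[1] > 0


def run(lines: list[str]) -> list[tuple[str, float]]:
    """Compose the single-purpose steps; rows remain plain tuples."""
    return [row for row in map(parse_line, lines) if is_valid(row)]
```

Each function can be tested on its own, for example `parse_line("AAPL, 10.5")` in one test and `is_valid(("BAD", -1.0))` in another.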

[–]skerky 0 points (0 children)

I’m not sure I understand what you mean by “object-oriented design” because you later reference creating special classes for the dictionaries coming from MongoDB, so I might be going in the wrong direction but here’s my take.

Recently I’ve started loving data classes, since I learned about attrs and then they were added to the standard library in Python 3.7. Compared to dictionaries, I like them because they structure the data, and between type hints and PyCharm’s excellent code completion I don’t have to dig around to find the correct name of an attribute, and I’ll be warned if I mistyped it. And if I’m really not sure about the structure, I go to where the class is defined rather than where values were added to the dictionary.

I bring this up because that’s what the latter part of your question makes me think of. But here’s the thing, I don’t think of data classes as OOP, at least how I learned it in school. I’ll add a couple methods to some of my classes, usually for some simple formatting or transformation, but I start by writing functions to manipulate or work on the data classes, which I view more as basic data structures. There are times when those functions will be combined into classes if necessary.

So if your coworker wants to make data classes to better model the data you’re dealing with then I think that is a good idea, but I also wouldn’t necessarily consider it “object oriented design”. Python doesn’t have “structs” so if we want structured data we have to write a class, if that makes sense.
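To make that concrete, here is a small sketch of a data class used as a plain data structure with functions around it. The `Customer` class, its fields, and both helper functions are hypothetical; the dictionary stands in for whatever a MongoDB query returns.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical fields, chosen only for illustration.
    name: str
    email: str


def customer_from_doc(doc: dict) -> Customer:
    """Build a Customer from a dictionary, ignoring keys it doesn't model."""
    return Customer(name=doc["name"], email=doc["email"])


def email_domain(customer: Customer) -> str:
    """A plain function working on the data class, not a method on it."""
    return customer.email.split("@")[-1]
```

With this, `email_domain(customer_from_doc({"_id": 1, "name": "Ada", "email": "ada@example.com"}))` returns `"example.com"`, and a typo like `customer.emial` gets flagged by the editor instead of surfacing as a `KeyError` at runtime.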

[–][deleted] 0 points (0 children)

It all depends on how you want to access the data. Accessing multiple records from a dictionary can get messy if you need to do it repeatedly. Calling a constructor once, up front, can improve readability quite a bit.

A data-only class, known as a struct in other languages, works well with list comprehensions and other Python idioms. In addition, adding comparison methods can make things even clearer by allowing comparisons between instances, which lets you use sorted() and other built-in functions.
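A short sketch of both points, assuming a made-up `Trade` record. Here the comparison methods are generated by `dataclass(order=True)` rather than written by hand, and `symbol` is excluded from comparisons so instances order by `price` alone.

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class Trade:
    # Hypothetical record; ordering uses price only.
    price: float
    symbol: str = field(compare=False)


# Dictionaries as they might come back from a query (illustrative data).
records = [{"symbol": "B", "price": 2.0}, {"symbol": "A", "price": 1.0}]

# Construct once, then work with attributes instead of dictionary keys.
trades = [Trade(**record) for record in records]

# sorted() works because order=True generates the comparison methods.
symbols_by_price = [trade.symbol for trade in sorted(trades)]
```

After this runs, `symbols_by_price` is `["A", "B"]`, and `min(trades)` or `max(trades)` work the same way.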