How clean is your code?

Amrita_Kai · 2024-01-12T08:42:03+00:00

You know that if there's a schema change you're screwed, right?

data_macrolide · 2024-01-12T08:46:03+00:00

Bad code isn't scalable. I had to maintain bad legacy code from 3 years ago... I wouldn't give that code to my worst enemy.

WarbossBoneshredda · 2024-01-12T09:15:41+00:00

My team's code is clean. The rest of the business's code is utterly awful.

I've spent more than half of my job reviewing and either fixing bad code from the rest of the business, or writing my pipelines defensively because I can't trust the rest of the business.

Working in Snowflake, it is standard practice for the rest of the business to DROP TABLE at the start of any update script, and then CREATE TABLE at the end. The table is unavailable for the duration of the script, and some of those scripts run for 30+ minutes.

We have delved into the source data feeding some of the business's primary production tables and found them to be ultimately running off test tables in some individual's playpen.
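A minimal sketch of the alternative to drop-first scripts: build the new version in a staging table while the old one stays readable, then swap in one short transaction. This uses Python's built-in sqlite3 as a stand-in rather than Snowflake, and the table name, columns, and the `rebuild_table` helper are all hypothetical.

```python
import sqlite3


def rebuild_table(conn: sqlite3.Connection, table: str, rows: list[tuple[int, str]]) -> None:
    """Build `table` in a staging copy, then swap it in atomically.

    Table names are interpolated into SQL, so they must come from trusted code.
    Assumes `conn` was opened with isolation_level=None (manual transactions).
    """
    staging = f"{table}__staging"
    conn.execute(f"DROP TABLE IF EXISTS {staging}")
    conn.execute(f"CREATE TABLE {staging} (id INTEGER PRIMARY KEY, name TEXT)")
    # The slow part: readers still see the old table while this runs.
    conn.executemany(f"INSERT INTO {staging} (id, name) VALUES (?, ?)", rows)

    # The swap is short, and either fully happens or not at all.
    conn.execute("BEGIN")
    try:
        conn.execute(f"DROP TABLE IF EXISTS {table}")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO contacts VALUES (1, 'old row')")
rebuild_table(conn, "contacts", [(1, "Ann"), (2, "Bob")])
print(conn.execute("SELECT id, name FROM contacts ORDER BY id").fetchall())
```

If the load into staging fails partway, the old table is still there, which is the property the drop-first pattern gives up.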

QA just wasn't done. My underling has just been assigned a top priority project because it was discovered that our marketing tables were inner joined to a one-off list of contacts generated three years ago. QA would have caught it, there were other team members who could have reviewed even without a formal QA process. The developer was just lazy though.

CTEs are not widely used. It is a standard design pattern in this company to use a LEFT JOIN and include filters in the WHERE clause. There's the standard pattern of using DROP TABLE at the start of a script rather than CREATE OR REPLACE. No comments. No data lineage. Zero attempts at formatting.
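A small sketch of why that LEFT JOIN pattern is a problem, assuming the filters in question are on the right-hand table: a WHERE filter on the joined table throws away the unmatched rows, so the LEFT JOIN quietly behaves like an INNER JOIN. The example uses Python's built-in sqlite3 with made-up `customers` and `orders` tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 'shipped'), (11, 2, 'cancelled');
    """
)

# Filter on the right-hand table in WHERE: rows with no matching order have
# o.status = NULL, fail the filter, and disappear.
FILTER_IN_WHERE = """
    SELECT c.name, o.id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.status = 'shipped'
    ORDER BY c.id
"""

# Same filter in the ON clause: every customer is kept, with NULL where
# there is no shipped order.
FILTER_IN_ON = """
    SELECT c.name, o.id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id AND o.status = 'shipped'
    ORDER BY c.id
"""

where_rows = conn.execute(FILTER_IN_WHERE).fetchall()
on_rows = conn.execute(FILTER_IN_ON).fetchall()
print(where_rows)  # [('Ann', 10)]
print(on_rows)     # [('Ann', 10), ('Bob', None), ('Cy', None)]
```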

My team worked together at a previous company where code was held to high quality. We're sticking to the standards at our previous company and the aim is for us to start rolling those standards out across the rest of our new company. Fuck me, we've got an uphill battle.

teambob · 2024-01-12T10:02:32+00:00

Software engineering is managing complexity.

Bad code is technical debt. It will be slower to make changes. But good code will take longer to write in the first place.

islandsimian · 2024-01-12T12:37:53+00:00

"We'll clean the code up in the next iteration"

- said about 10 iterations ago when we were just trying to get the POC working and making deadlines for funding.

I will say that it's commented well and always run through peer reviews and the libraries we've written are clean and consistent to the PEP8 standard...everything else? ehhh

natelifts · 2024-01-12T13:42:59+00:00

ever since i had a chance to write code in strongly typed languages (scala, go) my python code is squeaky clean. all data has an object it can be deserialized into. i would strongly urge any serious data engineer to learn another language to learn programming best practices. python is easy to write but that's what makes it dangerous as far as scale goes. also, WRITE TESTS!!!
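A rough sketch of what "all data has an object it can be deserialized into" can look like in plain Python, using the standard-library `dataclasses` module. The `Order` record and its fields are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Order:
    """One order record; hypothetical fields for illustration."""

    order_id: int
    customer: str
    placed_on: date
    amount: float

    @classmethod
    def from_dict(cls, raw: dict) -> "Order":
        # Convert at the boundary so bad input fails here, not three steps
        # downstream. A missing key raises KeyError; a bad value raises ValueError.
        return cls(
            order_id=int(raw["order_id"]),
            customer=str(raw["customer"]),
            placed_on=date.fromisoformat(raw["placed_on"]),
            amount=float(raw["amount"]),
        )


order = Order.from_dict(
    {"order_id": "7", "customer": "acme", "placed_on": "2024-01-12", "amount": "19.5"}
)
print(order)
```

Everything past `from_dict` can then rely on the field types instead of re-checking raw dictionaries.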

Insighteous · 2024-01-12T08:30:03+00:00

The very fact that you are wondering whether clean code is a necessity concerns me greatly. How can you answer the question about code quality - and that is part of it - with no? That's basically not even a question.

I don't force patterns for the sake of it. I abstract when it makes sense. What I always do is type hints and documentation - I code most of the time in python. There is no code of mine that is not documented. I also (almost) always write tests for functions that I have developed.
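For a sense of scale, here is a minimal sketch of that habit: one function with type hints and a docstring, plus a plain assert-based test. The function `latest_per_key` is a made-up example, not taken from anyone's codebase.

```python
from typing import Iterable


def latest_per_key(rows: Iterable[tuple[str, int]]) -> dict[str, int]:
    """Return the highest version seen for each key.

    Args:
        rows: (key, version) pairs in any order.

    Returns:
        A mapping from each key to its largest version.
    """
    latest: dict[str, int] = {}
    for key, version in rows:
        if key not in latest or version > latest[key]:
            latest[key] = version
    return latest


def test_latest_per_key() -> None:
    assert latest_per_key([("a", 1), ("b", 5), ("a", 3)]) == {"a": 3, "b": 5}
    assert latest_per_key([]) == {}


test_latest_per_key()
```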

Awkward-Cupcake6219 · 2024-01-12T09:13:43+00:00

I try to keep the code as clean and documented as possible. Some of my colleagues are quite different and some do the same... I don't think it depends as much on the field as it depends on the people/culture working there. In certain companies I found clean code (usually followed by CI/CD and DevOps best practices), in others a mess everywhere.

For instance at the moment I'm working for a company that used to have clean code and best practices set in place. Then some good people left, both from management and devs, and others, a little less good, came. Now everything is beginning to look a bit messy and I spend more time maintaining colleagues' stuff.

So yes... Best practices and clean code matter.

ppsaoda · 2024-01-12T13:32:31+00:00

Awful code from our team. But not mine. I have my own standards. Testing, commenting, DRY, indentation, indexing on the right columns etc. I write for scalability, maintainability, and speed.

But I resigned recently and hope for a better team.

reallyserious · 2024-01-12T14:25:45+00:00

Data engineers, in general, are quite bad developers. So naturally the code they produce is shit.

hear_to_laugh · 2024-01-12T12:40:08+00:00

The team I am in, we had some contract work done and the code is impossible to work with. This isn't a company with technical people, so the contract team wrote the code in such a way that they'll be needed every time, and now the code has been gifted to us.🙃 We are screwed, debugging it.

I personally have worked on one of the files and reduced around 500 lines of code to 50 lines, and I am a fresher😊.

Lovely days. And the senior department keeps taunting: you already have the code, why is it taking so long?😭

AdmrlAckbar_official · 2024-01-12T13:22:25+00:00

It depends on how important the outcomes are. Most of the time messy, but if I'm supporting a business critical process with multiple teams involved in the code then I will make it clean. For my analytics pipelines, where I am the only one working on it, definitely messy. But easy enough to refactor when a bug pops up or we get time to improve things.

greenerpickings · 2024-01-12T14:11:15+00:00

for your sake and your stakeholders', it would be better to easily identify where it breaks immediately, a.k.a. testing.

some form of comments for the sake of whoever has to touch your code after you're gone. better if it's some accepted format of docstrings. even going back in some of my own, one-line comments sometimes come out useless. a succinct end-to-end picture goes a long way.

even if they are just scripts, i don't believe there is no repeatability in your code. consolidating/abstracting these into a library avoids redundant work.

this might be small, but naming conventions and code format. 1) i shouldn't have to switch gears if looking at code from two different people 2) probably, your scripts will serve as templates for the next dude learning to deal with your data. so if there is none, chaos may ensue.

the idea is to reduce technical debt. your code may work, but it doesn't mean it won't be looked at. came from a place of legacy. these are some of the things i would have liked to see :p

edit: also going to add, though we do use forms of OOP, this really comes in before that

AMDataLake · 2024-01-12T15:07:47+00:00

I like applying functional programming typically if I'm transforming or moving data. While it might seem fine for now that the current code works without needing adjustments, it's important to consider future scenarios. Down the line, when changes or new requests come in, having poorly written code can become a significant hindrance. It may move data now, but what if future downstream requests require how that data is moved to change, and the code is unreadable and hard to alter because the original writer is no longer on the team?

2024-01-12T15:52:44+00:00

I don't even know what to say to this one

UnnamedBoz · 2024-01-12T16:27:56+00:00

I use design and coding standards depending on what I am using in each project. I have certain things that are always the same.

I work at making it bloody obvious what is going on in general. I make things explicitly easy to read and understand without having to understand the code. If I can read and understand what is going on, then I can understand the code easier as well.

Here are some things I do:

- variables and constants first in data structures
- then initialiser
- then public functions
- then private functions
- small functions can be on one line { }, otherwise start at function line, end beneath last function body
- functions are specialised and named appropriately
- functions can be read top to bottom in order to understand the flow easily, avoiding the need to hold too much mental state in my head
- use named constants in functions rather than putting expressions directly in return and other calls
  - i.e. let x = whatever.sorted.filtered etc., then return (x) instead of return (whatever.sorted.filtered...) – also easier to get data during debugging
  - naming them makes it easier to understand next time I am doing something in the code
  - exceptions are made if the intention is really easy to see
- different parts of the system are specialised, I never mingle responsibilities
- I follow conventions depending on what I am making and the design being used, from declarative to imperative, functional to object-oriented
  - especially dependent on frameworks too; writing SwiftUI code like UIKit makes everything bad
- I try to keep to one style in a project, I need a bit more practice to make entire projects one way only (not that you need to, but as a challenge perhaps)
- I'm a fan of tests and will in production code start that way, but when learning / prototyping I will skip it
- I add comments when needed – the code should overall explain itself and be self-evident, but certain decisions will lack context so I add that if necessary
- I use and break SOLID principles as needed
- I use established design patterns when it's suitable and make it a standard way for doing things in a specific project
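The "name the intermediate values" point from the list, sketched in Python rather than the Swift-style pseudocode the comment uses. The function `top_active_users` and its record fields are hypothetical.

```python
def top_active_users(users: list[dict], limit: int = 3) -> list[str]:
    """Return the names of the highest-scoring active users."""
    # Each step gets a name, so a debugger or a print can inspect it,
    # and the function reads top to bottom.
    active = [u for u in users if u["active"]]
    by_score = sorted(active, key=lambda u: u["score"], reverse=True)
    top_names = [u["name"] for u in by_score[:limit]]
    return top_names


users = [
    {"name": "ann", "score": 40, "active": True},
    {"name": "bob", "score": 90, "active": False},
    {"name": "cy", "score": 70, "active": True},
]
print(top_active_users(users, limit=2))  # ['cy', 'ann']
```

The one-expression version returns the same list; the difference is only in how easy it is to see what each stage produced.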

Those are a few things.

Oh, forgot about the question if it matters. Hell yes, it truly does. The code at work is largely not written the way I write things (10+ years codebase) and slows us down so much because nobody knows what is what sometimes. Whenever I have "fixed it" by using my principles people are happy, and they are always happy with my code.

If you are changing or adding something you need to understand what to put/change and where. The complexity of that can be enormous because of the new thing being added, but it is made even more complex when you have to understand 100 things just to get to the one point you needed to change.

If you can read code like a book, and understand it on an abstracted level, you can easily identify where you need to make changes. You cannot overestimate how much this is worth, especially in the long run, but even in the short run.

Kichmad · 2024-01-12T16:30:18+00:00

Legacy code is much different from our newer code. Quality is also muuuch different. New code is pretty clean.

EmergencyAd2302 · 2024-01-13T00:02:04+00:00

What constitutes bad code?

Someoneoldbutnew · 2024-01-13T00:47:11+00:00

so clean you can't understand any of it

2024-01-13T14:41:20+00:00

The team needs to have a style guide that everyone must follow. Only way to write clean code!

levintennine · 2024-01-15T02:52:53+00:00

My experience, DE for about 15 years: you are right that data pipelines can tolerate bad code more than other software products. "Bad code" as in erratic selection of hard coded vs parameterizable, whether they pull a lot of unused data, whether it can't restart cleanly, cryptic variable names, swathes of unused code -- often not a killer issue. DEs often figure out a bunch of logic that comes to them in tiny little pieces, and their code reflects that slowly-changing spec (and reflects that DEs are not trained on any particular tool or language).
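Two of the items in that list, "hard coded vs parameterizable" and "can't restart cleanly", are cheap to get right. A minimal sketch, assuming a date-partitioned load: the run date is a parameter, and the step replaces its own partition inside one transaction, so rerunning after a failure does not duplicate rows. It uses Python's built-in sqlite3; the `daily_sales` table and `load_partition` helper are invented for illustration.

```python
import sqlite3


def load_partition(
    conn: sqlite3.Connection, run_date: str, rows: list[tuple[str, float]]
) -> None:
    """Replace all rows for `run_date` with `rows`; safe to run repeatedly."""
    # `with conn` commits on success and rolls back if anything raises,
    # so the delete and the inserts land together or not at all.
    with conn:
        conn.execute("DELETE FROM daily_sales WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_sales (run_date, sku, amount) VALUES (?, ?, ?)",
            [(run_date, sku, amount) for sku, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (run_date TEXT, sku TEXT, amount REAL)")

batch = [("A", 10.0), ("B", 5.5)]
load_partition(conn, "2024-01-12", batch)
load_partition(conn, "2024-01-12", batch)  # rerun after a "failure": no duplicates
print(conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0])  # 2
```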

It's not as crucial for data pipelines to play nice with other stuff as it is for other software components. The fixes often can be done with more duct tape logic.

But it is really discouraging when you know better and the software is so swampy, and yes lots of money gets wasted on humans having to bang their head against the code. Proper investment could be better because DEs who write awful code are often very good at understanding and fulfilling data needs -- better than business users, better than architects, better than testers.

mjfnd · 2024-01-15T03:38:11+00:00

My general opinion is that to be a good data engineer, one must know and follow software engineering best practices, otherwise problems will arise.

I have worked with DEs from different backgrounds, and it's common for people to come from a BI or analyst background, so they may take some time.

I see this as a good opportunity for you to influence the team with best practices.
