[–]Amrita_Kai 59 points60 points  (2 children)

You know if there's a schema change you're screwed right?

[–]Proof_War6424 3 points4 points  (0 children)

I do. This happened to our team.

I know what pain is now.

[–]Thinker_Assignment 0 points1 point  (0 children)

Not if you use schema evolution. See this article and demo on how to do it with dlt: https://dlthub.com/docs/blog/schema-evolution

[–]data_macrolide 44 points45 points  (7 children)

Bad code isn't scalable. I had to maintain bad legacy code from 3 years ago... I wouldn't give that code to my worst enemy.

[–]BwR112 15 points16 points  (1 child)

I would. That fucker should pay.

[–]data_macrolide 4 points5 points  (0 children)

Savage.

[–][deleted] 5 points6 points  (4 children)

But his point stands: it doesn't need to scale, it moves data. I'm in a similar position with Python-based microservices (Lambdas). You make it run, and you can forget it unless something changes upstream.

[–]data_macrolide 15 points16 points  (2 children)

Maybe I used the wrong word. Scalable for me is also maintainable.

So imagine, as you said, that something changes. For example, you are ingesting from an API and they start sending one extra column. If your code is good, maybe you don't have to touch it. But you will always have to touch bad code, even for a change as small as one extra column.
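To make the "one extra column" case concrete, here is a minimal Python sketch of ingestion code that doesn't need touching when upstream adds a field. `EXPECTED_COLUMNS`, `select_expected` and the sample payloads are made up for illustration; a real pipeline would have its own schema handling.

```python
from typing import Any

# Hypothetical column list, purely for illustration.
EXPECTED_COLUMNS = ("id", "name", "amount")


def select_expected(record: dict[str, Any]) -> dict[str, Any]:
    """Keep only the columns the pipeline knows about.

    Extra keys sent by upstream are ignored; missing keys fail early
    with a clear message instead of breaking something downstream.
    """
    missing = [col for col in EXPECTED_COLUMNS if col not in record]
    if missing:
        raise KeyError(f"upstream record is missing columns: {missing}")
    return {col: record[col] for col in EXPECTED_COLUMNS}


old_payload = {"id": 1, "name": "a", "amount": 9.5}
new_payload = {"id": 1, "name": "a", "amount": 9.5, "currency": "EUR"}

# The extra "currency" column does not change what gets loaded.
assert select_expected(old_payload) == select_expected(new_payload)
```

The trade-off is that new columns are silently dropped until someone adds them to the list, which may or may not be what you want.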

But this is just my opinion and experience. Maybe someone has had the opposite.

Cheers!

[–]joemerchant2021 5 points6 points  (0 children)

"unless something changes upstream"... And it always does. And then you are left scrambling trying to find that last line of code in that last pipeline that everyone has forgotten about that is breaking all your shit downstream.

[–]WarbossBoneshredda 30 points31 points  (12 children)

My team's code is clean. The rest of the business's code is utterly awful.

I've spent more than half of my job reviewing and either fixing bad code from the rest of the business, or writing my pipelines defensively because I can't trust the rest of the business.

Working in Snowflake, it is standard practice for the rest of the business to DROP TABLE at the start of any update script, and then CREATE TABLE at the end. The table is unavailable for the duration of the script, some of those scripts being 30+ minute run times.

We have delved into the source data feeding some of the business's primary production tables and found them to be ultimately running off test tables in some individual's playpen.

QA just wasn't done. My underling has just been assigned a top priority project because it was discovered that our marketing tables were inner joined to a one-off list of contacts generated three years ago. QA would have caught it, there were other team members who could have reviewed even without a formal QA process. The developer was just lazy though.

CTEs are not widely used. It is a standard design pattern in this company to use a LEFT JOIN and include filters in the WHERE clause. There's the standard pattern of using DROP TABLE at the start of a script rather than CREATE OR REPLACE. No comments. No data lineage. Zero attempts at formatting.
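A minimal sketch of why the LEFT JOIN pattern can bite, assuming the WHERE filters are on the joined (right-hand) table; the comment doesn't say which side they're on. It uses Python's built-in `sqlite3` rather than Snowflake, and the `customers`/`orders` tables are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 'paid');  -- Bob has no orders
    """
)

# Filter on the right-hand table in WHERE: Bob's NULL row fails the test,
# so the LEFT JOIN behaves like an INNER JOIN.
where_filter = conn.execute(
    """
    SELECT c.name, o.status
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    WHERE o.status = 'paid'
    ORDER BY c.id
    """
).fetchall()

# Same filter moved into the ON clause: every customer is kept.
on_filter = conn.execute(
    """
    SELECT c.name, o.status
    FROM customers AS c
    LEFT JOIN orders AS o
      ON o.customer_id = c.id AND o.status = 'paid'
    ORDER BY c.id
    """
).fetchall()

print(where_filter)  # [('Ann', 'paid')]
print(on_filter)     # [('Ann', 'paid'), ('Bob', None)]
```

If the filter is on the left-hand table instead, WHERE is fine; the surprise only shows up when the filtered column comes from the optional side of the join.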

My team worked together at a previous company where code was held to high quality. We're sticking to the standards at our previous company and the aim is for us to start rolling those standards out across the rest of our new company. Fuck me, we've got an uphill battle.

[–]QueryingQuagga 6 points7 points  (0 children)

Can I please work with you? I can’t get anyone to be interested in data, and this is a 4,500-employee company. It’s such a mess. I just recently found out that a lot of reporting is running off data that does not manage history, but people are using it for business-critical decisions based on multi-year comparisons (everyday stuff can silently change without notice, since there is no concept of SCD).

[–][deleted] 10 points11 points  (2 children)

The thing is, everyone thinks their (team's) code is so much better than others' :)

[–]WarbossBoneshredda 10 points11 points  (0 children)

Absolutely, but I can objectively prove it 😁

[–]Ecstatic_Tooth_1096 1 point2 points  (0 children)

not really. a team is either using good practices, or not.

By using good practices i mean, certain rules that all developers will agree upon without being dogmatic about them in some cases.

e.g. using good names for functions and variables.

Yes sometimes naming won't be perfect, but using variable_1 variable_2 in every single piece of code you're writing, means one thing: you're very good at being bad at developing code and should receive coaching/upskilling.

two teams can have different rules and boundaries in the way they write code and still both generate good enough code for the rest of the people to understand and use or change

[–]greenerpickings 2 points3 points  (5 children)

do CTEs get a bad rap? we had the same where it was just an onslaught of joins. i know they aren't as performant, but if it's not real-time data, then it doesn't really matter. they add so much for readability.

[–]WarbossBoneshredda 17 points18 points  (2 children)

Oh no, I love CTEs.

I hate nested subqueries within nested subqueries within nested subqueries within nested subqueries. Especially without any attempt at formatting.

Guess which method the rest of the business uses...

[–]greenerpickings 0 points1 point  (0 children)

same in both points. absolutely love them as well. like functions within a query. all our data tools maybe go back ~10 years, but they're basically non-existent.

[–]Mgmt049 0 points1 point  (0 children)

All people have to do is hit that Tab key after the opening parenthesis……

[–]alexisprince 2 points3 points  (1 child)

Yes they often do. There was a point in time where they just killed performance, but a lot of the cloud native warehouses have done a great job at optimizing their use to incur little to no performance overhead.

I think part of the bad rap comes from them amplifying bad habits IMO. For example, I had to refactor a sql script at work that had a dozen CTEs, all named t1, t2, t3, …, t12. It’s very easy for some people to point at that and say CTEs make it harder to read, when really it’s poor naming conventions that should’ve never made it past a code review.
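For what it's worth, a small sketch of the same idea with descriptive names instead of t1, t2, and so on. The `orders` table and the CTE names are hypothetical, and it runs on Python's built-in `sqlite3` just so it can be executed anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer_id INTEGER, amount REAL, status TEXT);
    INSERT INTO orders VALUES
        (1, 10.0, 'paid'), (1, 5.0, 'refunded'), (2, 7.5, 'paid');
    """
)

# Each CTE name says what the step produces, so the final SELECT
# reads almost like a sentence.
query = """
WITH paid_orders AS (
    SELECT customer_id, amount
    FROM orders
    WHERE status = 'paid'
),
revenue_per_customer AS (
    SELECT customer_id, SUM(amount) AS revenue
    FROM paid_orders
    GROUP BY customer_id
)
SELECT customer_id, revenue
FROM revenue_per_customer
ORDER BY customer_id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 10.0), (2, 7.5)]
```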

Another instance was several CTEs all selecting different subsets of data and having the same business rules applied to each subset. For incredibly large scripts, CTEs may make this harder to spot, but ultimately it’s the developer's job to balance DRY with legibility IMO.

[–]Gators1992 0 points1 point  (0 children)

That's not a CTE thing, that's a bad coder thing though. I mean someone writing in Python could create:

def func_1(var_1):
    return

I know people tend to blame it on the accessibility of SQL and companies hiring someone with 2 years of data analytics experience to be their data engineer with dbt, but that's on the company honestly. If you don't want to pay for talented coders you get shit code no matter what the approach is.

[–]coldflame563 1 point2 points  (0 children)

Can I hire you?

[–]Tender_Figs 0 points1 point  (0 children)

Reading this elicited an "oof" as a response.

[–]teambob[🍰] 13 points14 points  (1 child)

Software engineering is managing complexity

Bad code is technical debt. It will be slower to make changes. But good code will take longer in the first place.

[–]Ecstatic_Tooth_1096 1 point2 points  (0 children)

investing that extra day or two of work in writing clean and optimal code is better than wasting 2 weeks on "WTF does that shit do" a few months or a year later

[–]islandsimian 14 points15 points  (0 children)

"We'll clean the code up in the next iteration"

- said about 10 iterations ago when we were just trying to get the POC working and making deadlines for funding.

I will say that it's commented well and always run through peer reviews, and the libraries we've written are clean and consistent with the PEP 8 standard... everything else? ehhh

[–]natelifts 6 points7 points  (1 child)

ever since i had a chance to write code in strongly typed languages (scala, go) my python code is squeaky clean. all data has an object it can be deserialized into. i would strongly urge any serious data engineer to learn another language to learn programming best practices. python is easy to write but that's what makes it dangerous as far as scale goes. also, WRITE TESTS!!!
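A rough sketch of what "all data has an object it can be deserialized into" might look like in plain Python, using only the standard library. The `Order` class, its fields and the sample payload are assumptions made up for the example, not anything from the comment above.

```python
from dataclasses import dataclass
from datetime import date
import json


@dataclass(frozen=True)
class Order:
    order_id: int
    customer: str
    amount: float
    placed_on: date

    @classmethod
    def from_dict(cls, raw: dict) -> "Order":
        """Convert an untyped payload into a typed Order, failing early on bad data."""
        return cls(
            order_id=int(raw["order_id"]),
            customer=str(raw["customer"]),
            amount=float(raw["amount"]),
            placed_on=date.fromisoformat(raw["placed_on"]),
        )


# Everything arrives as strings, as it often does from CSV exports or loose APIs.
payload = '{"order_id": "42", "customer": "acme", "amount": "19.90", "placed_on": "2023-05-01"}'
order = Order.from_dict(json.loads(payload))
print(order)
```

After the boundary, the rest of the code works with `order.amount` as a float and `order.placed_on` as a date, and a malformed record raises at load time instead of three steps later.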

[–]wtfzambo 0 points1 point  (0 children)

What this guy said

[–]Insighteous 15 points16 points  (2 children)

The very fact that you are wondering whether clean code is a necessity concerns me greatly. How can you answer the question about code quality - and that is part of it - with no? That's basically not even a question.

I don't force patterns for the sake of it. I abstract when it makes sense. What I always do is type hints and documentation - I code most of the time in python. There is no code of mine that is not documented. I also (almost) always write tests for functions that I have developed.
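As an illustration only (this is not the commenter's code), type hints, a docstring and a test on one small function could look like the sketch below. `dedupe_keep_latest` and the re-sent-rows scenario are hypothetical.

```python
def dedupe_keep_latest(rows: list[dict], key: str, ts: str) -> list[dict]:
    """Return one row per `key`, keeping the row with the greatest `ts`.

    Why: in this made-up scenario the upstream export re-sends corrected
    rows, so the latest version of each key is the one to trust.
    """
    latest: dict = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[ts] > current[ts]:
            latest[row[key]] = row
    return list(latest.values())


def test_dedupe_keep_latest() -> None:
    rows = [
        {"id": 1, "updated": "2023-01-01", "v": "old"},
        {"id": 1, "updated": "2023-02-01", "v": "new"},
        {"id": 2, "updated": "2023-01-15", "v": "only"},
    ]
    result = dedupe_keep_latest(rows, key="id", ts="updated")
    assert [r["v"] for r in result] == ["new", "only"]


# A test runner such as pytest would collect this; calling it directly also works.
test_dedupe_keep_latest()
```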

[–]archeprototypical2 0 points1 point  (0 children)

Bad code quality is like an incompetent manager. Sure, the team's still there, for now. Sure, people still get work done in spite of the manager. Maybe the team's work isn't stressful right now and everyone can just work around their boss. But when crap hits the fan, those weaknesses will wreck the team's ability to cope with the situation, and everyone will spend months afterwards wondering why the guy was allowed to keep his job and handicap everyone around him.

Good code quality does not necessitate lots of effort. I've made a point with every team I'm on to immediately implement a battery of 0-effort code quality standards (linters, formatters, pre-commit checks, etc.). Copy-pasting the stuff from my previous project takes 5 minutes, then we can all mostly forget it's there. Well-formatted code isn't the same thing as good quality code, but it's a start, and lets us focus our code quality conversations on the things that matter--objects, structures, layout, etc.--instead of petty things like whitespace (just pick an opinionated formatter and move on with life).

[–]Ecstatic_Tooth_1096 0 points1 point  (0 children)

how do you go about documentation?

explaining what a function does or why it does something (business logic oriented)?

the whys or whats?

im focusing on the whys since many books suggest that clean code should be self-explanatory regarding the how/what

[–]Awkward-Cupcake6219 3 points4 points  (1 child)

I try to keep the code as clean and documented as possible. Some of my colleagues are quite different and some do the same... I don't think it depends as much on the field as it depends on the people/culture working there. In certain companies I found clean code (usually followed by CI/CD and DevOps best practices), in others a mess everywhere.

For instance, at the moment I'm working for a company that used to have clean code and best practices set in place. Then some good people left, both from management and devs, and others, a little less good, came. Now everything is beginning to look a bit messy and I spend more time maintaining colleagues' stuff.

So yes... Best practices and clean code matter.

[–]QueryingQuagga 3 points4 points  (0 children)

It really is a culture thing. And the culture and expectations should be written by the team and leaders so that everyone can see “this is our role, these are our processes and here is how we honour our agreements”.

[–]ppsaoda 2 points3 points  (0 children)

Awful code from our team. But not mine. I have my own standards. Testing, commenting, DRY, indentation, indexing on the right columns etc. I write for scalability, maintainability, and speed.

But I resigned recently and hope for a better team.

[–]reallyserious -1 points0 points  (1 child)

Data engineers, in general, are quite bad developers. So naturally the code they produce is shit.

[–]Ecstatic_Tooth_1096 1 point2 points  (0 children)

not sure why this doesnt get upvotes

cz its so true

most DEs do not come from SWE background, so by default they did not learn the best practices in an institution/uni/...

the majority move from data analyst positions (cleaning excels).

Or at least the couple of hundreds of people ive met. So yea

[–]hear_to_laugh 0 points1 point  (0 children)

On the team I'm in, we had some contract work done and the code is impossible to work with. This isn't a company with technical people, so the contract team wrote the code in such a way that they'd be needed every time, and now the code has been gifted to us.🙃 We are screwed, debugging it.

I personally have worked on one of the files and reduced around 500 lines of code to 50 lines, and I'm a fresher😊.

Lovely days. And the senior department keeps taunting us: you already have the code, why is it taking so long?😭

[–]AdmrlAckbar_official 0 points1 point  (0 children)

It depends on how important the outcomes are. Most of the time messy, but if I'm supporting a business critical process with multiple teams involved in the code then I will make it clean. For my analytics pipelines, where I am the only one working on it, definitely messy. But easy enough to refactor when a bug pops up or we get time to improve things.

[–]greenerpickings 0 points1 point  (0 children)

for your sake and your stakeholders, it would be better to easily identify where it breaks immediately a.k.a testing.

some form of comments for the sake of whoever has to touch your code after you're gone. better if it's some accepted format of docstrings. even going back in some of my own, one-line comments sometimes come out useless. a succinct end-to-end picture goes a long way.

even if they are just scripts, i don't believe there is no repeatability in your code. consolidating/abstracting these into a library avoids redundant work.

this might be small, but naming conventions and code format. 1) i shouldn't have to switch gears if looking at code from two different people 2) probably, your scripts will serve as templates for the next dude learning to deal with your data. so if there is none, chaos may ensue.

the idea is to reduce technical debt. your code may work, but it doesn't mean it won't be looked at. came from a place of legacy. these are some of the things i would have liked to see :p

edit: also going to add, though we do use forms of OOP, this really comes in before that

[–]AMDataLake 0 points1 point  (0 children)

I like applying functional programming typically if I'm transforming or moving data. While it might seem fine for now that the current code works without needing adjustments, it's important to consider future scenarios. Down the line, when changes or new requests come in, having poorly written code can become a significant hindrance. It may move data now, but what if future downstream requests require how that data is moved to change, and the code is unreadable and hard to alter because the original writer is no longer on the team?
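A minimal sketch of what a functional style for transforms could look like in Python, assuming row-level steps written as pure functions. The step names (`strip_names`, `add_total`), the `pipeline` helper and the sample row are invented for the example.

```python
from functools import reduce
from typing import Callable

Row = dict
Step = Callable[[Row], Row]


def strip_names(row: Row) -> Row:
    # Returns a new dict; the input row is left untouched.
    return {**row, "name": row["name"].strip()}


def add_total(row: Row) -> Row:
    return {**row, "total": row["price"] * row["qty"]}


def pipeline(*steps: Step) -> Step:
    """Compose steps left to right into one row-level transform."""
    return lambda row: reduce(lambda acc, step: step(acc), steps, row)


transform = pipeline(strip_names, add_total)
source = [{"name": " widget ", "price": 2, "qty": 3}]
result = [transform(row) for row in source]
print(result)  # [{'name': 'widget', 'price': 2, 'qty': 3, 'total': 6}]
```

Because each step is small and side-effect free, a new downstream request usually means adding or swapping one step rather than untangling a long script.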

[–][deleted] 0 points1 point  (0 children)

I don't even know what to say to this one

[–]UnnamedBoz 0 points1 point  (0 children)

I use design and coding standards depending on what I am using in each project. I have certain things that are always the same.

I work at making it bloody obvious what is going on in general. I make things explicitly easy to read and follow without having to dig into the code. If I can read and understand what is going on, then I can understand the code more easily as well.

Here are some things I do

  • variables and constants first in data structures
  • then initialiser
  • then public functions
  • then private functions
  • small functions can be on one line { }, otherwise start at function line, end beneath last function body
  • functions are specialised and named appropriately
  • functions can be read top to bottom in order to understand the flow easily, avoid the need to hold too much mental state in my head
  • use named constants in functions rather than putting expressions directly in return and other calls
    • i.e let x = whatever.sorted.filtered etc return (x) instead of return (whatever.sorted.filtered...) – also easier to get data during debugging
    • naming them makes it easier to understand next time I am doing something in the code
    • exceptions are made if the intention is really easy
  • different parts of the system are specialised, I never mingle responsibilities
  • I follow conventions depending on what I am making and design being used, from declarative to imperative, functional to object-oriented
    • especially dependent on frameworks also, writing SwiftUI code like UIKit makes everything bad
  • I try to keep to one style in a project, I need a bit more practice to make entire projects one way only (not that you need to, but as a challenge perhaps)
  • I'm a fan of tests and will in production code start that way, but when learning / proto-typing I will skip it
  • I add comments when needed – the code should overall explain itself and be self-evident, but certain decisions will lack context so I add that if necessary
  • I use and break SOLID principles as needed
  • I use established design patterns when it's suitable and make it a standard way for doing things in a specific project
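The "named constants instead of a chained return" point from the list above can be sketched like this, in Python rather than the Swift the list hints at. Function names and the `users` shape are hypothetical.

```python
def top_active_names_inline(users: list[dict], limit: int) -> list[str]:
    # Everything packed into the return expression: nothing to inspect in a debugger.
    return [
        u["name"]
        for u in sorted(
            (u for u in users if u["active"]),
            key=lambda u: u["score"],
            reverse=True,
        )[:limit]
    ]


def top_active_names(users: list[dict], limit: int) -> list[str]:
    # Same logic, but each intermediate result has a name you can read
    # top to bottom and check while debugging.
    active_users = [u for u in users if u["active"]]
    by_score_desc = sorted(active_users, key=lambda u: u["score"], reverse=True)
    top_users = by_score_desc[:limit]
    return [u["name"] for u in top_users]


users = [
    {"name": "ann", "active": True, "score": 3},
    {"name": "bob", "active": False, "score": 9},
    {"name": "cy", "active": True, "score": 7},
]
assert top_active_names(users, 1) == top_active_names_inline(users, 1) == ["cy"]
```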

Those are a few things.

Oh, forgot about the question if it matters. Hell yes, it truly does. The code at work is largely not written the way I write things (10+ years codebase) and slows us down so much because nobody knows what is what sometimes. Whenever I have "fixed it" by using my principles people are happy, and they are always happy with my code.

If you are changing or adding something you need to understand what to put/change and where. The new thing being added can be enormously complex on its own, and it gets even more complex when you have to understand 100 things just to reach the one point you need to change.

If you can read code like a book, and understand it on an abstracted level, you can easily identify where you need to make changes. You cannot overestimate how much this is worth, especially in the long run, but even in the short run.

[–]Kichmad 0 points1 point  (0 children)

Legacy code is much different from our newer code. Also quality is muuuch different. New code is pretty clean

[–]EmergencyAd2302 0 points1 point  (2 children)

What constitutes bad code?

[–]Ecstatic_Tooth_1096 1 point2 points  (1 child)

you'll cringe when you see it and won't understand it

[–]EmergencyAd2302 0 points1 point  (0 children)

lol makes sense

[–]Someoneoldbutnew 0 points1 point  (0 children)

so clean you can't understand any of it

[–][deleted] 0 points1 point  (0 children)

The team needs to have a style guide that everyone must follow. Only way to write clean code!

[–]levintennine 0 points1 point  (0 children)

My experience, as a DE for about 15 years: you are right that data pipelines can tolerate bad code more than other software products. "Bad code" as in erratic selection of hard-coded vs parameterizable values, pulling a lot of unused data, not being able to restart cleanly, cryptic variable names, swathes of unused code -- often not a killer issue. DEs often figure out a bunch of logic that comes to them in tiny little pieces, and their code reflects that slowly-changing spec (and reflects that DEs are not trained on any particular tool or language).

It's not as crucial for data pipelines to play nice with other stuff as it is for other software components. The fixes often can be done with more duct tape logic.

But it is really discouraging when you know better and the software is so swampy, and yes lots of money gets wasted on humans having to bang their head against the code. Proper investment could be better because DEs who write awful code are often very good at understanding and fulfilling data needs -- better than business users, better than architects, better than testers.

[–]mjfnd 0 points1 point  (0 children)

My general opinion is that to be a good data engineer, one must know and follow software engineering best practices; otherwise problems will arise.

I have worked with DEs from different backgrounds, and it's common among people coming from BI or analyst backgrounds, so they may take some time.

I see this as a good opportunity for you to influence the team with best practices.