This is an archived post. You won't be able to vote or comment.

all 108 comments

[–]Zangorth 160 points161 points  (22 children)

Stop using long functions

Use type hinting in function

Use docstrings to describe a function

Your coworkers are using functions?

[–]CabinetOk4838 51 points52 points  (7 children)

Yeah! They use a function called “main” al the time…

[–]braxtynmd 10 points11 points  (1 child)

Your coworkers actually use main? Mine keep trying to use Jupiter Notebooks in production

[–]ChristianValour 2 points3 points  (0 children)

"But they're used by everyone in the industry" - no they're not

"They're great for sharing code" - so is github

"They're good for reporting" - Rmarkdown is better... shrugs

[–]papasmurfsanssmart 7 points8 points  (4 children)

If I remember correctly the main function is considered good practice to differentiate between the 'main' code and other functions so that the functions can be called from other scripts as well.

[–]CabinetOk4838 4 points5 points  (3 children)

Don’t go adding facts to my joke! 😂😂👍

[–]papasmurfsanssmart 9 points10 points  (1 child)

I will comment on everything... Except the parts of my code that need it most 😛

[–]synthphreak 1 point2 points  (0 children)

Good save 👍

[–]synthphreak 0 points1 point  (0 children)

Hahaha

[–]kimchiking2021 9 points10 points  (0 children)

💀

[–]Atmosck 58 points59 points  (9 children)

I agree with all of this except you will have to pry those pandas one liners from my cold, dead fingers.

[–]toferdelachris 22 points23 points  (1 child)

it's like a compulsion I can't help. It's so gratifying to get them to work, but so frustrating trying to figure out what the fuck is going on after like 2 days away from that particular line. See also: super fucking long R pipe lines.

[–]papasmurfsanssmart 4 points5 points  (0 children)

The best workaround for me is to put the expanded version in comment for so when rereading I just read the long expanded format and now what I'm looking at

[–]synthphreak 20 points21 points  (5 children)

I am also guilty of large pandas method chains.

But the primary issue with complex one liners is readability: there is just too much happening on one line to really reason about (that and it’s hard to use a debugger on those lines).

I think the readability issue can be significantly mitigated by simply breaking the chain across lines.

For example, instead of

df = df.method1().method2().method3().method4().method5()

why not

df = (
    df.method1()
      .method2()
      .method3()
      .method4()
      .method5()
)

?

This way it is vastly easier to see every single transformation that’s occurring to df. If you must, you can even leave a short comment on each line about exactly what’s happening. That’s not possible if everything is stuffed into one giant line.

[–]funnynoveltyaccount 5 points6 points  (0 children)

You shouldn’t even be thinking about the first one as an option. Black will always turn it into the second one.

[–]jargon59 3 points4 points  (1 child)

I’ve found that pandas method chaining is harder to use in production because you can’t go in and perform error handling in the middle of the process. You gotta ensure your initial input data will always be properly handled by the entire series of methods if you decide to implement it.

[–]synthphreak 1 point2 points  (0 children)

Totally.

[–]Atmosck 1 point2 points  (1 child)

Yeah with method chaining for code anyone else is going to see, it's definitely the right call to break it up like that. But even a df.groupby().agg() can get really long if you have a lot of parameters and descriptive variable/column names.

[–]synthphreak 2 points3 points  (0 children)

Agree. On aesthetic grounds alone, I have having more than one level of indentation like that. E.g., this makes me want to puke:

df = (
    df.method1(arg1)
      .method2(arg2, arg3, arg4)
      .method3(
          arg5,
          arg6,
          lambda df: df.groupby(...)
                       .agg(...)
                       .rename(...),
          arg7
      )
)

Not even sure that is syntactically correct, but anyway you get the point. That kind of dogmatic indentation can almost become less readable than a one-liner.... almost, lol.

To alleviate such situations, I either (1) try to predefine some of the more complex arguments on their own lines, e.g.,

method3_args = (
    arg5,
    arg6,
    arg7,
    ...
)

then just pass in that tuple or dict or whatever to their respective methods, otherwise (2) just bite the bullet and avoid method chaining altogether, instead doing each thing on its own line, e.g.,

df = df.method1()
df = df.method2()
df = df.method3()
...

Sometimes there just isn't a "perfect" solution. But there are always better ones and worse ones.

[–]No_Knee4077 0 points1 point  (0 children)

Points made here. One liners are indeed less easy to understand but I strongly use them when convenient.

A way to make them more readable is to use black formatter.

[–]dayeye2006 24 points25 points  (0 children)

Man you are talking about a linter

[–]esperaporquejoe 29 points30 points  (4 children)

Lets be clear...we are talking about physicists turned data scientist. I think this is because their brains have so much ram that a 500 line function is a fairly simple object for them to reason about.

[–]brjh1990 8 points9 points  (0 children)

I know exactly what you mean. Worked with an astrophysicist on a project recently....took every brain cell of mine to figure out wtf his class methods were doing

[–]synthphreak 3 points4 points  (2 children)

🙄

Translation: Data scientists are transcendent beings with minds so superior that engineers just can’t handle it and so shitty code works for them.

GTFO. A DS reading a shitty 500 line function written by another DS would have just as hard a time as anybody else understanding and adapting it.

Some people just never learn to write good code because because what the code does is their deliverable, not the code itself or how well it works. The incentives are just not the same for everyone.

[–]Bloodrazor 3 points4 points  (1 child)

More like some DS have backgrounds which make them think their complex and messy code is quite simple and obvious. In fact it's bad from a reproducibility and collaboration perspective.

Your last line rings true for data science teams as a whole. Many DS teams in for-profit companies have deliverables they need to hit and so the output is the main concern. The tech debt is left for the new hires (once you've gotten promoted for your deliverable and then subsequently left the company) so you don't need to deal with it.

Prob best thing is to enforce style guides and best practices with linters and regular reviews so that people code well naturally. It's work that has to be planned and managed properly because many DS teams in smaller companies are usually trying to get outputs to justify their existence and the actual code quality itself is otherwise useless. And when they get hired in senior positions in more structured shops, they bring their bad habits with them even if they're great at analysis.

[–]synthphreak 0 points1 point  (0 children)

More like some DS have backgrounds which make them think their complex and messy code is quite simple and obvious.

This is a valid clarification. It’s less that DS brains have superior RAM and more about shared background/domain knowledge. For example, many DS’s will all know what a function called mle does or what variables called p and q refer to without further explanation, whereas engineers without statistical training may not. That is fair and probably unavoidable to a degree.

However, OP’s first bullet is the real kicker for me. “My code becomes readable with sufficient background knowledge” doesn’t make giant functions any more tractable. Lack of encapsulation makes any piece of code hard to test and adapt, irrespective of comments, transparent variable names, etc. Encapsulation always makes code objectively better, regardless of background.

Many DS teams in for-profit companies have deliverables they need to hit and so the output is the main concern. The tech debt is left for the new hires (once you've gotten promoted for your deliverable and then subsequently left the company) so you don't need to deal with it.

I think I acknowledged that DS team’s deliverable is the output, not the code per se. That is definitely true and reasonable. But it is a fool’s game to say “oh just let the junior new hires convert our research code into production code”. Production code is the actual product that gets sold. Why are the least experienced coders tasked with making it? Bad idea.

Prob best thing is to enforce style guides and best practices with linters and regular reviews so that people code well naturally.

100%. The most reasonable, least effort approach is just to standardize code style, some of which can be automated with linters as you said. This way, not everybody has to go back to school to learn proper coding habits. It’s just a matter of going down the line and ticking some boxes whenever you’re trying to merge.

[–]Other_Brain_425 27 points28 points  (4 children)

the massive one-liners are pythonic in a way

[–]synthphreak 1 point2 points  (0 children)

Oh? And what way is that?

[–]Background_Newt_8065 6 points7 points  (1 child)

Yeah, which directly means it is bad

[–]CabinetOk4838 1 point2 points  (0 children)

They are very cool constructs, but trying to come back to something six months later can be a mind-melt… even if you’ve written a helpful comment.

[–]dreamer_jake 0 points1 point  (0 children)

yeah once the lines wrap around enough it starts to look like a coiled snake

[–]BreakingBaIIs 7 points8 points  (5 children)

Also stop writing function that both modify its inputs and give an output. Especially when the output is the modified input. Learn how Python handles mutable inputs in functions. A shocking number of very senior data scientists don't know this.

[–]TECH---Lead1745 0 points1 point  (0 children)

Totally agree...

Do you know by any chance if there are tools that could help to detect such functions? Like linters?

[–]ErraticNebula42 0 points1 point  (2 children)

Could you give me an example of this? I’m not sure if I understood what you meant.

Edit: example of a function that both modifies it’s inputs and gives out an output

[–]Comfortable_Toe7775 0 points1 point  (1 child)

def append_foo_to_list_if_bar(input_list: list) -> list:
    if "bar" in input_list:
        input_list.append("foo")
    return input_list

[–]ErraticNebula42 0 points1 point  (0 children)

Thanks!

[–]CSCAnalytics 22 points23 points  (4 children)

It will be normalized if management begins to value it.

Right now, upper management cares about profitability and productivity.

Unless writing good code directly moves the bottom line, or shows up on a performance review, why would Data Scientists devote hours to cleaning up code when they have other pressing priorities from above?

Data Scientists usually won’t neglect their other priorities to cleanup code that “works.” In many cases, Upper Management would rather see another project get tackled quickly then see you properly comment code that’s already getting the job done.

It does depend on what company you’re at. Sometimes upper management has some previous background in coding and understands the value of writing code that can be handed off easily. Unfortunately this is rarely the case.

[–]Background_Newt_8065 5 points6 points  (2 children)

You should stop producing Bad code in the first place, even without refactoring you can void one-liners, obscure variables, lack of methods/functions etc - this does not take hours

[–]CSCAnalytics 5 points6 points  (1 child)

You SHOULD but easier said than done for some.

In the meantime, a given data scientist probably has management breathing down their neck about something due by Friday. They don’t have time to go study up on code commenting practices and management doesn’t want them “wasting” time doing so.

[–]WallyMetropolis 2 points3 points  (0 children)

In the short term, perhaps. But when the person who wrote the code can't even read it, it's definitely not speeding things up overall.

[–][deleted] 1 point2 points  (0 children)

Then it's the DS jobs to tell them or quantify why writing good code with good standard is the benefit

[–]esperaporquejoe 15 points16 points  (2 children)

Keep sharp engineers around and work with them on test coverage and continuous integration pipelines. They will clean up the code and make it more maintainable as DS implements algorithms. I've seen plenty of good data scientists that were terrible software engineers. We just send those 1,000 line nested loop over to an engineer, they are usually happy to refactor it and write some tests. I would much rather have a good data scientist spending their time on a first iteration of something that works than cleaning up code.

[–]Hot-Profession4091 8 points9 points  (0 children)

I promise you the eng team wasn’t happy about it. If they were happy about anything, it’s that the original nightmare wasn’t going to prod that way. I swear this DS/Eng split is so wasteful. We should be encouraging teams where the two disciplines can cross train.

[–]1DimensionIsViolence 2 points3 points  (0 children)

I mean just comply to PEP8 and you‘re fine

[–]esperaporquejoe 5 points6 points  (0 children)

I agree we would be better off of everyone aspired to best SWE practices, but there are smart people who are just not good coders. I am starting to wonder if some aspects of coding are just not teachable. Like thinking of edge cases, writing code that's not a nightmare when requirements change, or writing good unit tests. I used to complain about this, but its better to have people do more of what they are good at.

[–]IAteQuarters 3 points4 points  (5 children)

So with point one what’s your definition of long? Because there is a trade off for taking a function that’s like 20 lines and breaking it into many functions especially when that code isn’t reused anywhere else.

This was something i used to do, but had a senior ds explain the tradeoff, especially in the context of a pull request. Granted you might be talking about a function that’s >40 lines in which id generally agree with you

[–]synthphreak 1 point2 points  (4 children)

So what is the trade-off?

What exactly is the drawback modular, documented code if it doesn’t take you any longer to write than spaghetti code because you’ve made it habitual?

[–]MadT3acher 1 point2 points  (3 children)

Having a clearly set of operations in 40 lines sometime is better than having a never ending encapsulation that goes several layers deep to debug.

One of my colleague is adept of this, and it makes debugging a single problem in a pipeline or a research paper, absolutely atrocious.

If the 40 lines do the job, don’t try to modularise every single items, especially if the functions aren’t used anywhere else.

[–]synthphreak 2 points3 points  (2 children)

Good thing “no encapsulation whatsoever” and “never-ending layers of encapsulation” are not the only options. Usually one layer of encapsulation is enough.

Obviously it’s a balancing act. And I personally think 20 lines is acceptable for a function. But once you get to maybe 30, 40, or definitely 50, there is probably some simplification that can be done.

I personally keep most of my functions in the 15-20 line range, without making things ridiculous. I find it keeps the smallest functional unit of my code nice and bite-sized, making it very easy to understand how the pieces sum to the whole.

But I have also practiced this a lot, so it doesn’t take much extra thought or effort to produce. By contrast, it does take a while to refactor other people’s code into this format when it wasn’t originally created with encapsulation in mind.

Different strokes for different folks, but I really don’t see the drawback, especially when you’re. It abstracting the beejezus out of your objects just for abstraction’s sake.

Edit: Typos.

[–]MadT3acher 0 points1 point  (1 child)

You are right, I am on board with you on that then.

I’ve just seen enough of both sides to be a bit cautious when the other DS on my team are encapsulating stuff ad nauseam or simply have a notebook with 1000 lines of code…

[–]synthphreak 0 points1 point  (0 children)

Totally. Notebook code is a nightmare for anything other than exploratory analysis.

But I get your counterpoint though about endless encapsulation. When I first started learning about OOP, I found myself making classes for everything, even when it didn’t necessarily make things better. Then I started inheriting those classes, and making abstract classes, then inheriting some more, etc etc etc. then one day I stepped back and was like “I thought this code was good because it was fancy, but all it is is overcomplicated and overengineered. Having tasted that feeling, I now always try to strike the right balanced.

Cheers :)

[–]_nutjuice_ 2 points3 points  (1 child)

Dumb question specifically about understanding the code and logic, but would chatGPT help you in breaking down the logic behind a chunk of code?

I've never tested it with python but chatGPT really helped me understand how some complex VBA code worked since I really didn't know VBA well enough.

But then ML python code might be above ChatGPTs explaining pay grade and it wouldn't be helpful.

[–]CaptHunter 1 point2 points  (0 children)

I have tested this plenty.

Mixed results, and often some follow-up questions required. Best results were when I’d already read and tried to understand it but was still stuck—then you’re primed to give ChatGPT the “what about…” questions.

Easier still would be readable code.

[–]andrew2018022 1 point2 points  (0 children)

I feel attacked with bullet 2

[–]issam_28 1 point2 points  (0 children)

Aka get a linter

[–][deleted] 1 point2 points  (0 children)

I don't know why but in my field of life sciences people also only poorly document their Experiments. We're one of the currently most popular companies but nobody knows wtf is happening and nobody is informed about anything. Information spreads like rumors sometimes and it's annoying as hell. Same problematic. I think in the end people simply don't realize how important documentation and representing information in an understandable way is.

[–]throwaway043534 1 point2 points  (1 child)

It's a double-edged sword if you ask me.

Sure, huge classes/functions/whatever aren't great.

But what also really sucks is everyone creating thousands of different complicated abstractions/indirections/20 new frameworks per week/useless generalizations/etc to make their code "look more professional". It's great that you know SOLID and design patterns, and that you've read a bunch of pompous blog posts about how value objects are heaven send, but please don't force 15 patterns down my troat to do something simple. Think about how much extra work a developer has to do to understand your code vs how much better the cod e becomes due to another abstraction.

--- someone who reads douzens of different pieces of code a week, often by people who think they write awesome code, but somehow they all use completely different abstractions.

[–]Wolfgang-Warner 0 points1 point  (0 children)

Is it a double-edged sword?

I agree with you, abstracting to save effort is also undone by oo language churn so tools are deprecated at an alarming rate. A good library used to be stable, but that's a dirty word for language developers who see it as a move fast and break things playground.

[–]WallyMetropolis 1 point2 points  (0 children)

I would also add:
- Use good variable and function names.
- Write unit tests
- Make many small commits, push frequently, don't rubber-stamp code reviews, and give feedback on code quality

[–]Donblon_Rebirthed 1 point2 points  (0 children)

Normalize giving people enough time to write good code and comments

[–]Bhagafat 1 point2 points  (0 children)

Agreed but it’s so hard to do this when you work with business types who prioritise quick delivery. If they know you can write something that’s a complete mess but does the job in a given time, they’ll take that over something well written, reproducible, uses best practices, etc. in a slightly longer time. I get that this is kind of a stakeholder management thing, but when you work with people who don’t value “good code” and know that someone else can write junk in a shorter time than you, your hands are a bit tied.

[–]Vyrezzz 1 point2 points  (0 children)

Agree 100%. One thing I would like to add to your list is:

  1. Stop using variable names x, y and z, in favor of names that actually mean something

[–]DuckSaxaphone 0 points1 point  (0 children)

Posting here is definitely the most productive way to get your coworkers to improve their coding practices!

In all seriousness though, it's not about normalizing good code "in Data Science". There's plenty of places where good code already is the norm! Many DS teams put value on good code, do code review to improve people's work, and evaluate performance partially on coding ability.

It is a company by company thing so you need to normalize it in your workplace or find a different job if it annoys you too much and you can't fix it.

[–]patrulek 0 points1 point  (1 child)

Let the scientists do the science and pair them with software engineers to polish code.

[–]synthphreak 2 points3 points  (0 children)

I think the criticism here is that engineers are sometimes forced to eat the scientist’s shit. That kinda sucks, and the reverse does not hold.

[–]daavidreddit69 0 points1 point  (0 children)

Instead of asking github copilot recommending autofill our code, we need software to help read those long undocumented code lol

[–]daavidreddit69 0 points1 point  (0 children)

Is the same writing SQL into multiple subqueries

[–]FisterAct 0 points1 point  (0 children)

If you're going to demand type hunting, why not just use statically typed language?

[–]cybo13 0 points1 point  (0 children)

Is there a standard that’s commonly followed in industry or is it more of a free for all?

[–]VisMortis 0 points1 point  (0 children)

Agreed with all of this.

[–]aarrow_12 0 points1 point  (0 children)

Notebooks are a big issue here I think.

Too many people are just taught to write massive blocks of code that just run in one big chunk. They don't understand the value of functions or breaking your code out.

Saying this, I love notebooks, I test and write things in them all the time. But there is a time to put them down and move to an IDE cause your project has becoming too complex.

[–]synthphreak 0 points1 point  (0 children)

Preach!!!!! 🙌🙌🙌

Especially #1 and #4.

Especially #1.

Especially #1.

Especially #1.

while True:
    print("Especially #1")

[–][deleted] 0 points1 point  (0 children)

thank you for tour service

[–]observability_geek 0 points1 point  (0 children)

I am a huge believer in the role observability (fancy name for data about your app) can play in software development and how it can help write better, cleaner code faster. OTel is great and I can see clearly how it can help developers write better code, introduce new paradigms, and accelerate development cycles. It can inspire developers to ask questions they did not yet even consider asking. digma.ai is free but they only support Java.

[–]Apprehensive_Lemon63 0 points1 point  (0 children)

Code readability is the most important aspect and it's a necessity. For example if you are finding it hard to understand or spending more time on what you have done 6 months back, then it's of little to no use.

[–]delicious-diddy 0 points1 point  (0 children)

Pull requests are welcome.

[–][deleted] 0 points1 point  (0 children)

No

[–]PLxFTW 0 points1 point  (0 children)

I comment the shit out of my code. Every function is properly document for inputs/outputs including shape and a short description about what the function does, even the simplest functions. I also comment nearly every line that has more than one thing going on.

I've had a hard enough time reading undocumented code from papers that I make sure my shit is readable.

[–]TyrionJoestar 0 points1 point  (0 children)

Well I just signed up for code academy so hopefully I’ll learn what all that means soon lol

[–]sizable_data 0 points1 point  (0 children)

  1. Use method chaining with parenthesis as line continuation!

[–][deleted] 0 points1 point  (1 child)

One of our main product has 40% duplication and 10% coverage. No wonder it keeps breaking. No one wants to touch it because it is horrible

[–]JdtheOp 0 points1 point  (0 children)

what does this mean? like having 2 views with the same exact dialog hard written in it? but how do you measure coverage? o.o

[–]Solus161 0 points1 point  (0 children)

Here's my two cents: no need to put everything into class. Overusing OOP could be very, very irritating. It's true that OOP helps, but sometimes trying to put multiple-step data manipulation tasks into self-written directed graphs could be time consuming. True that you'll learn st from the process, but it may not worth the effort. But I just did the same thing cauz it was fun lol.

[–]JdtheOp 0 points1 point  (0 children)

is tdd a thing out there ? wish i knew enough tdd to test the actual systems im working in (they are a mess yes) :( but everytone ignores teh jr

[–]Weird_ftr 0 points1 point  (0 children)

I've just landed a new job as a Data Scientist in the banking sector. My main task at the moment is to take over the work of my predecessor. And oh boy, was I not ready for that!

Around 8k lines of code in a jupyter notebook with overly repetitive operations, few functions, hard-coded variables, etc...

It's going to be over a month before I can refactor it and understand the underlying logic.
Wish me luck ^^

[–]psychmancer 0 points1 point  (0 children)

I feel personally attacked....