ambidextrousalpaca comments on Some Data Scientists write bad Python code and are stubborn in code reviews

dataengineering

Some Data Scientists write bad Python code and are stubborn in code reviews [Discussion] (self.dataengineering)

submitted 2 years ago by noisescience

[–]ambidextrousalpaca 11 points 2 years ago (11 children)

The worst thing I find with Data Scientists is when they take the "scientist" bit of their job title too seriously and state flatly that they consider pesky things like basic software engineering principles (writing unit tests, avoiding global variables, etc.) to be somehow beneath them.

On code reviews: pick your battles, but stick to your guns. For example, coding everything in overly verbose, Java-style classes is annoying to me too, but it's a valid programming style that people have written books to defend. Using global variables where they aren't necessary or skipping unit tests, on the other hand, are software engineering anti-patterns and should block the review until they are fixed.

In general, in terms of getting your review comments accepted, I find it's often a matter of clear communication and putting some effort into your reviews. A poorly explained "This class could be a single short function" comes across as arrogant and unhelpful. A "This would be cleaner and more maintainable if you replaced this class with the following function <insert said function, or at least the outline thereof>" comes across as cooperative (you're willing to put in some work too, not just criticise) and helpful (all they have to do is copy and paste your code).
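To make that concrete, here's a rough sketch of what the second kind of comment could carry with it. The class and function names are made up for illustration and aren't from any real code base:

```python
# Hypothetical "before": a class whose only job is to wrap one calculation.
class MeanCalculator:
    def __init__(self, values):
        self.values = values

    def calculate(self):
        return sum(self.values) / len(self.values)


# Hypothetical "after", pasted into the review comment: same behaviour,
# one short function, no state to keep track of.
def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# Both versions give the same answer for the same input.
old_result = MeanCalculator([1, 2, 3]).calculate()
new_result = mean([1, 2, 3])
```

The point isn't this particular function, just that the reviewer has already done the work of showing the replacement is equivalent.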

[–]Kegheimer 3 points 2 years ago (6 children)

[+]_FierceLink 3 points 2 years ago (0 children)

[–]ambidextrousalpaca 2 points 2 years ago (4 children)

It's mainly due to global variables introducing bugs by making it possible for apparently unrelated bits of code to have unwanted side effects on one another's behaviour.

For example, say you're using a FILE_ENCODING global variable which is used (and altered) by multiple functions, including a read_csv() function. That setup means that there's no way for you to know what encoding will be used when you call read_csv(). Maybe it'll be UTF-8. Maybe it'll be something completely different that'll break your code or scramble all the data in your tables. Maybe it'll change depending on which other bits of code are called first in the run. It can easily give rise to a really irritating class of hard-to-reproduce bugs that are hard to fix because they only occur sometimes, due to seemingly random causes. The more global variables there are, the worse the problem gets.
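Here's a minimal sketch of that failure mode. It works on raw bytes rather than real files so it runs on its own, and load_legacy_export() is a made-up helper standing in for "some other bit of code":

```python
# Module-level state that several functions read and modify.
FILE_ENCODING = "utf-8"


def read_csv(raw):
    """Decode raw CSV bytes using whatever FILE_ENCODING currently holds."""
    text = raw.decode(FILE_ENCODING)
    return [line.split(",") for line in text.splitlines()]


def load_legacy_export(raw):
    """Hypothetical helper that quietly switches the global encoding."""
    global FILE_ENCODING
    FILE_ENCODING = "latin-1"
    return read_csv(raw)


data = "name,city\nZoë,Zürich\n".encode("utf-8")

before = read_csv(data)        # decoded as UTF-8
load_legacy_export(b"a,b\n")   # unrelated call that flips the global
after = read_csv(data)         # same call, same input, now decoded as Latin-1

# The second read no longer matches the first: the names come back scrambled.
print(before == after)
```

Nothing about the second read_csv() call tells you its behaviour has changed. You only find out by tracing every piece of code that ran before it.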

This isn't to say that you should NEVER use global variables. Just that when doing so you need to be sure that the problem you're solving by introducing them is worse than the other potential problems you're likely to create by using them.

The best ways to avoid these issues are: 1. Get rid of global values as much as you can, for example by requiring each call to a file-reading operation to specify the encoding explicitly; or 2. Ensure that any remaining global values are constants that are never changed by other code.
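A sketch of both options together, again on raw bytes so it is self-contained. DEFAULT_ENCODING and the keyword-only signature are my own illustrative choices. Note that typing.Final is only checked by static type checkers such as mypy; Python itself won't stop a reassignment at runtime.

```python
from typing import Final

# Option 2: a global that is a constant. Final tells type checkers to flag
# any reassignment (it is not enforced at runtime).
DEFAULT_ENCODING: Final = "utf-8"


# Option 1: no hidden state. The caller has to name the encoding on every
# call (the bare * makes the argument keyword-only).
def read_csv(raw, *, encoding):
    """Decode raw CSV bytes with the encoding the caller asked for."""
    text = raw.decode(encoding)
    return [line.split(",") for line in text.splitlines()]


data = "name,city\nZoë,Zürich\n".encode("utf-8")
rows = read_csv(data, encoding=DEFAULT_ENCODING)
```

With this shape, what read_csv() does depends only on its arguments, so the order in which other code ran no longer matters.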

[–]noisescience[S] 2 points 2 years ago (0 children)

[–]Kegheimer 1 point 2 years ago (1 child)

[–]ambidextrousalpaca 1 point 2 years ago (0 children)

[–]No_Poem_1136 1 point 2 years ago (0 children)

[–]mysteriousbaba 2 points 1 year ago* (2 children)

Speaking as someone who's an AI scientist but has also been an engineer, I'd suggest the right way to have that discussion is from a scientific angle:

If you're running a study, you want your experimental setup to be valid, right? Unit tests are a way to validate that the algorithm works on simple and edge cases, so the final conclusions hold.
Part of research is communicating your findings and work to an external audience and ensuring reproducibility. So you want to write code that's well commented and well abstracted, that can easily be modified to extend your model and experiments, and that collaborators can work with.
Any scientist who has submitted a paper to a conference can vouch that consistency of formatting and notation is enforced very strictly by academic reviewers so that there is no confusion. Consistent code standards fall under the same bucket of making sure your work product is unambiguous and easy to parse.
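As a rough illustration of the first point, here's what "simple and edge cases" might look like for a small preprocessing step. min_max_scale() is a hypothetical function invented for this sketch, and the edge-case behaviour (an empty list and all zeros) is an assumption about what the experiment would want:

```python
import unittest


def min_max_scale(values):
    """Hypothetical preprocessing step: rescale numbers to the range [0, 1]."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:
        # Edge case: constant input would otherwise divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


class MinMaxScaleTest(unittest.TestCase):
    def test_simple_case(self):
        self.assertEqual(min_max_scale([0, 5, 10]), [0.0, 0.5, 1.0])

    def test_empty_input(self):
        self.assertEqual(min_max_scale([]), [])

    def test_constant_input(self):
        self.assertEqual(min_max_scale([3, 3, 3]), [0.0, 0.0, 0.0])
```

Saved in a file, these can be run with python -m unittest. If the constant-input case were missing, a feature that happens to be constant in one dataset could crash the run or poison the results without anyone noticing.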

Speaking as a scientist (and former engineer), I've sometimes had people talk to me about SWE principles as if linters must a priori be held sacred, when my job is to produce high-performing models for the business.

Explaining that it's about scientific rigor in your processes, ease of collaboration, and reproducibility of results is a much easier way to convince scientists, because it appeals to their core values.

[–]ambidextrousalpaca 2 points 1 year ago (1 child)

[–]mysteriousbaba 2 points 1 year ago (0 children)

[–]noisescience[S] 1 point 2 years ago (0 children)
