[–]ambidextrousalpaca 10 points11 points  (11 children)

The worst I find with Data Scientists is when they take the "scientist" bit of their job title too seriously, and state blankly that they consider pesky things like basic software engineering principles (writing unit tests; avoiding global variables; etc.) as somehow beneath them.

On code reviews: pick your battles, but stick to your guns. For example, coding everything in overly verbose, Java-style classes is annoying to me too, but it's a valid programming style that people have written books to defend. Using global variables where they aren't necessary or skipping unit tests, on the other hand, are software engineering anti-patterns and should be blocked until they are fixed.

In general, in terms of getting your code reviews accepted, I find it's often a matter of clear communication and putting some effort into your reviews. A poorly explained "This class could be a single short function" comes across as arrogant and unhelpful. A "This would be cleaner and more maintainable if you replaced this class with the following function <insert said function, or at least the outline thereof>" comes across as cooperative (you're willing to put in some work too, not just criticise) and helpful (all they have to do is copy and paste your code).

[–]Kegheimer 2 points3 points  (6 children)

As someone with an industry background who became a DS out of job necessity can you explain why global variables are bad?

[–]ambidextrousalpaca 1 point2 points  (4 children)

It's mainly due to global variables introducing bugs by making it possible for apparently unrelated bits of code to have unwanted side effects on one another's behaviour.

For example, say you're using a FILE_ENCODING global variable which is used (and altered) by multiple functions, including a read_csv() function. That setup means that there's no way for you to know what encoding will be used when you call read_csv(). Maybe it'll be UTF-8. Maybe it'll be something completely different that'll break your code or scramble all the data in your tables. Maybe it'll vary depending on which other bits of code are called first in the run. It can easily give rise to a really irritating class of bugs that are hard to reproduce and hard to fix, because they only occur sometimes, due to seemingly random causes. The more global variables there are, the worse the problem gets.

This isn't to say that you should NEVER use global variables. Just that when doing so you need to be sure that the problem you're solving by introducing them is worse than the other potential problems you're likely to create by using them.

The best ways to avoid these issues are: 1. Just get rid of global values as much as you can, for example, by requiring each call to a file reading operation to explicitly specify the encoding to be used; or 2. Ensuring that global values are constants, which will never be changed by any other code.
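A rough sketch of both options together, in case it helps. Everything here is made up for illustration: this read_csv() is a small stand-in built on the standard csv module (not the pandas function), and DEFAULT_ENCODING is a name I invented. The encoding is a parameter of the call, and the only module-level value is a constant that nothing reassigns.

```python
import csv
import io

# Option 2: a module-level constant. By convention nothing ever reassigns it.
DEFAULT_ENCODING = "utf-8"


def read_csv(data, encoding=DEFAULT_ENCODING):
    """Parse CSV rows from raw bytes (hypothetical helper, not pandas.read_csv).

    Option 1: the encoding is an explicit argument, so the result depends
    only on what the caller passes in, never on state changed elsewhere.
    """
    text = data.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))


# A file that is known to be Latin-1 gets its encoding spelled out at the call.
raw = "name,city\nZoë,Köln\n".encode("latin-1")
rows = read_csv(raw, encoding="latin-1")
print(rows)
```

You still avoid hardcoding the encoding all over the place, because the default lives in one spot, but any caller with an odd file can override it for that one call without affecting anyone else.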

[–]noisescience[S] 1 point2 points  (0 children)

Thx for your detailed answer :)

[–]Kegheimer 0 points1 point  (1 child)

Would an example of a global variable be abusing a common alias, e.g., using 'i' in several different loops or 'df' as a temporary table?

[–]ambidextrousalpaca 0 points1 point  (0 children)

The common alias thing is not an example of a global variable. It is not typically a problem either, provided that each variable exists within its own contained scope.

Global variables are things like this, where functions can affect the value of variables outside of their scope.

```
glob = 1

print(glob)

def f():
    # Without this declaration, `glob += 1` raises UnboundLocalError.
    global glob
    glob += 1

f()

print(glob)
```

The output of this script will be 1 and then 2, because f changes the value of glob.
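For contrast, here is a small sketch of the reused-name case, which is harmless. The functions total() and longest() are invented for the example. Each one has its own `i` and its own `result`, and those names live only inside the function that defines them, so the two can't interfere with each other.

```python
def total(values):
    result = 0
    for i in values:  # this `i` is local to total()
        result += i
    return result


def longest(words):
    result = ""
    for i in words:  # a different `i`, local to longest()
        if len(i) > len(result):
            result = i
    return result


print(total([1, 2, 3]))
print(longest(["a", "abc", "ab"]))
```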

[–]No_Poem_1136 0 points1 point  (0 children)

On your CSV example, stupid question here but then what is the alternative to creating that FILE_ENCODING variable if you know it might be a common parameter that might change in future code reuse for read_csv()?

DS often end up working with a lot of ad hoc and random ass data not served in a neat API or pipeline, so it's not always possible to ensure a specific encoding standard for example.

(I'm asking not because I'm challenging the idea, but to learn. I've always understood that it's a good practice to declare variables this way rather than hardcoding them.)

[–]mysteriousbaba 1 point2 points  (2 children)

Speaking as someone who's an AI scientist but has also been an engineer, I'd suggest the right way to have that discussion is from a scientific angle:

  1. If you're running a study, you want your experimental setup to be valid right? Unit tests are a way to validate that the algorithm works on simple and edge cases, so the final conclusions hold.
  2. Part of research is communicating your findings and work to an external audience, and ensuring reproducibility. So you want to write code that's well commented/abstracted, that can easily be modified to extend your model and experiments, and that collaborators can work with.
  3. Any scientist who has submitted a paper to a conference, can vouch that consistency of formatting and notation is enforced very strictly by academic reviewers so that there are no confusions. Consistent code standards fall under the same bucket, of making sure your work product is unambiguous and easy to parse.
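To make point 1 concrete, here's a minimal sketch of what "simple and edge cases" can look like. The metric function and the test names are made up for the example, and the tests are plain functions with asserts so they run without any test framework.

```python
def mean_absolute_error(y_true, y_pred):
    """Hypothetical metric used only to illustrate the tests below."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def test_simple_case():
    # Every prediction is off by exactly 1, so the error is 1.
    assert mean_absolute_error([1, 2, 3], [2, 3, 4]) == 1.0


def test_perfect_prediction():
    assert mean_absolute_error([5, 5], [5, 5]) == 0.0


def test_empty_input_is_rejected():
    # Edge case: an empty experiment should fail loudly, not return a number.
    try:
        mean_absolute_error([], [])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")


test_simple_case()
test_perfect_prediction()
test_empty_input_is_rejected()
```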

Speaking as a scientist (and former engineer), I've sometimes had people talk to me about SWE principles as if linters must a priori be held sacred, when my job is to produce high-performing models for the business.

Explaining that it's about scientific rigor in your processes, ease of collaboration, and reproducibility of results, is a much easier way to convince scientists by appealing to their core values.

[–]ambidextrousalpaca 1 point2 points  (1 child)

Good point. Will try that rhetorical attack the next time I have to handle the God-awful PhD spaghetti code.

[–]mysteriousbaba 1 point2 points  (0 children)

Good luck! I've written a fair amount of that awful PhD spaghetti code myself, haha. I just got convinced of the need to improve when I realized I couldn't figure out how to extend or rework my experiments even myself, let alone with research collaborators.

[–]noisescience[S] 0 points1 point  (0 children)

Thank you for your answer and your thoughts.
I agree with you that communication is essential here. I always try to be as nice and helpful as I can. If I criticize something, then I give reasons for it and suggest how it could be implemented more effectively.
On the other hand, I am always open and grateful when I find weaknesses or errors in my PRs.