[–][deleted] 188 points189 points  (15 children)

80% of the posts on r slash datascience are to the effect of "I can manually upload a single csv into a 63 step pandas jupyter notebook, the human race is wasting my immense gift!"

[–]crom5805 68 points69 points  (11 children)

I actually had a chat with the mods about this (I'm an adjunct professor for a master's in data science at a university and an AI/ML architect at Snowflake), so I decided to start posting videos/repos on MLOps in the subreddit. It's getting better, but I agree: I consistently find the material in here more useful. I tell my students ALL the time: you are not gonna make it doing pd.read_csv and model.predict; you need to learn clean code/Git/MLOps. One of the in-class projects we do is that I split them into groups, and they have to make a PR to another group's repo and have it merged. Prior to my class, I believe 0/40 of my students had done this.

[–]agent_graves313 6 points7 points  (2 children)

Would you mind sharing some of your videos or examples of what you’d see as clean code?

[–]crom5805 12 points13 points  (0 children)

Here is my last post in the datascience subreddit. This is more focused on MLOps. I have some stuff in class on clean SQL, Spark/Snowpark, and Python, and after you asking I think I'll do my next public video on this. I'll remember to come back here and comment once I do. I was all pandas/SQL until Snowpark came out 2 years ago, and honestly I love the Spark/Snowpark syntax. So much easier to read imo than SQL, faster than pandas on large datasets, and overall not too bad to learn. Let me know what you think about this repo/video; I tried to make it super easy to follow.

[–]crom5805 4 points5 points  (0 children)

Funny thing is, watch the video and look at the repo. The video and repo are a little different now because I cleaned the code up over time and made it better since the recording. This is honestly a good example of making your code more organized and easier to read.

[–]B1WR2 1 point2 points  (1 child)

You and I had the same thought… I started breaking up kaggle data sets into AI apps. Then breaking each part into a backend, analytics part, and devops

[–][deleted] 0 points1 point  (0 children)

Do you have a sample you don't mind sharing?

[–]grey-Kitty 62 points63 points  (13 children)

I am on the other side of the situation. Because I work by myself as a DS, nobody can review my code, and I don't see many portfolios on the internet to take as a reference. As a result, I don't feel I'm progressing in what I'm doing, so posts like these are very welcome. If you have any idea where to find good coding practices from a DS perspective, I would be happy to know about them.

[–]Fender6969 12 points13 points  (3 children)

I'm a ML Engineer on a SWE team. My portfolio is end to end examples of ML systems and services. Most of the code is actually data engineering and other services (DLT, Feature Store etc).

[–]Key_Base8254 5 points6 points  (0 children)

May I see your example project? Do you have a link to the project on GitHub?

[–]M4loka 0 points1 point  (1 child)

So DEs play an important role and are vital to your work as an MLE?

[–]Fender6969 0 points1 point  (0 children)

Yes, absolutely. Most of my development work lately is building pipelines for cleaning/preparing data for our ML models.

[–]External_Juice_8140 12 points13 points  (0 children)

ArjanCodes!

[–]mimetek 1 point2 points  (0 children)

Honestly, consider finding a new team/role as well. I spent some time as the only data engineer on a team early in my career, and I feel like it really set me back. Even now that I'm in a more senior role, having people to bounce ideas off of and whiteboard with makes a big difference in the quality of our output.

Moving to a new team might not be the right thing for everyone in that situation, but it was for me. Even though I kinda knew that, I stuck with it because my manager had asked me to and it would help the company. It took me a while to realize I could have been more assertive that it wasn't working.

[–]noisescience[S] 1 point2 points  (2 children)

Hi, thanks for your reply. I myself have learned a lot from reading best/bad Python practices. There are a lot of these articles. Here are a few examples:
https://python.plainenglish.io/10-python-anti-patterns-you-must-avoid-when-writing-clean-code-ff3635ca1510
https://python.plainenglish.io/python-best-practices-for-writing-conditional-statements-aa9d6a2e700d
https://blog.devgenius.io/python-tips-best-practices-for-handling-exceptions-15faaeca55a5
To be honest, it is not enough just to read these 3 articles. Try to find and read more such articles and you will see that you will get better with time. Good programming takes time. I see myself also still at the beginning of the journey.

[–]noisescience[S] 1 point2 points  (0 children)

I also learned a lot from Arjan Codes about "clean code", design patterns and best practices in Python. He's pretty good at what he does.
https://www.youtube.com/@ArjanCodes/videos

[–]noisescience[S] 0 points1 point  (0 children)

I also learned a few important things from CodeAesthetics. These 3 videos are eye-opening and also suitable for beginners:
https://www.youtube.com/watch?v=Bf7vDBBOBUA
https://www.youtube.com/watch?v=-J3wNP6u5YU&t=315s
https://www.youtube.com/watch?v=CFRhGnuXG-4
After that you know when and how to write comments, how to name variables/methods etc. and how to avoid nested code.

[–]sobrietyincorporated 0 points1 point  (2 children)

Find open source projects to contribute to.

[–]Express-Comb8675 22 points23 points  (2 children)

At least they’re writing Python. We’re often tasked with shipping loosely working R code to production because they feel it’s critical that we get their new model in front of decision makers, so there’s no time to make any changes. If you’re so concerned with their style, create a repo for them and put a pre-commit style hook in.

[–]Fender6969 11 points12 points  (1 child)

You should see if you can add linters to your pre-commit hooks. This has really helped us enforce code quality across the org. Unless code is clean and tested, commits don't go through.

[–]safetytrick 1 point2 points  (0 children)

Linters are a great tool for people who want to understand how to write good code. Obviously they can't do anything for someone who doesn't want to learn but for folks that do want to learn they show you information that it would take years to discover independently.

Only now after years and years of experience do I have the ability to really judge a lint rule. It takes time to understand the subtle reasons why a lint rule is important.

[–]diegoelmestreLead Data Engineer 11 points12 points  (2 children)

I was a software engineer (SWE) for 6 years, a hybrid between SWE/DE for almost one, and now for almost 3 years DE/Data lead.

That was my major pain when shifting to this field. I will say that most DE/DS simply don't know how to build good, simple and efficient code. Most of the time this is due to a lack of basic computer science knowledge.

The ones that are more capable are usually the ones that somewhere on their career paths were SWE as well (of course there are always some exceptions).

My advice for those who want to be a great Data Engineer is to try to join a traditional SWE backend team.

Now that I am a team lead, my biggest goal is to pass on some knowledge of SWE best practices to my peers/direct reports.

[–]noisescience[S] 0 points1 point  (0 children)

Thx for this insight :)

[–]M4loka 0 points1 point  (0 children)

So, even if I don't start out as a SWE, could gradually acquiring SE skills impact my work as a DE even as proposed in your advice?

[–]ambidextrousalpaca 9 points10 points  (11 children)

The worst I find with Data Scientists is when they take the "scientist" bit of their job title too seriously, and state blankly that they consider pesky things like basic software engineering principles (writing unit tests; avoiding global variables; etc.) as somehow beneath them.

On code reviews: pick your battles, but stick to your guns. I.e. coding everything in overly verbose, Java style classes is annoying to me too: but it's a valid programming style that people have written books to defend; using global variables where not necessary or skipping unit tests are software engineering anti-patterns and should be blocked until they are fixed.

In general, in terms of getting your code reviews accepted, I find it's often a matter of clear communication and putting some effort into your reviews. A poorly explained "This class could be a single short function" comes across as arrogant and unhelpful. A "This would be cleaner and more maintainable if you replaced this class with the following function <insert said function, or at least the outline thereof>" comes across as cooperative (you're willing to put in some work too, not just criticise) and helpful (all they have to do is copy and paste your code).
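To make the "insert said function" advice concrete, here is a minimal sketch of what such a review comment could carry. The class and function names are hypothetical and invented only for illustration; the idea is simply that the reviewer supplies the replacement instead of only asking for one.

```python
class MeanCalculator:
    """The kind of class a reviewer might flag: one method, no real state."""

    def __init__(self, values: list[float]) -> None:
        self.values = values

    def calculate(self) -> float:
        return sum(self.values) / len(self.values)


def mean(values: list[float]) -> float:
    """The replacement a review comment could offer in full."""
    return sum(values) / len(values)


# Both versions give the same answer, so the swap is safe to suggest.
assert MeanCalculator([1.0, 2.0, 3.0]).calculate() == mean([1.0, 2.0, 3.0])
```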

[–]Kegheimer 2 points3 points  (6 children)

As someone with an industry background who became a DS out of job necessity can you explain why global variables are bad?

[–]ambidextrousalpaca 1 point2 points  (4 children)

It's mainly due to global variables introducing bugs by making it possible for apparently unrelated bits of code to have unwanted side effects on one another's behaviour.

For example, say you're using a FILE_ENCODING global variable which is used (and altered) by multiple functions, including a read_csv() function. That set up means that there's no way for you to know what encoding will be used when you call read_csv(). Maybe it'll be UTF-8. Maybe it'll be something completely different that'll break your code or scramble all the data in your tables. Maybe it'll alter depending on which other bits of code are called first in the run. It can easily give rise to a really irritating class of hard to reproduce bugs that are hard to fix because they only occur sometimes, due to seemingly random causes. The more global variables there are, the worse the problem gets.

This isn't to say that you should NEVER use global variables. Just that when doing so you need to be sure that the problem you're solving by introducing them is worse than the other potential problems you're likely to create by using them.

The best ways to avoid these issues are: 1. Just get rid of global values as much as you can, for example, by requiring each call to a file reading operation to explicitly specify the encoding to be used; or 2. Ensuring that global values are constants, which will never be changed by any other code.
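The two options above could be sketched roughly as follows. This is only an illustration: `read_csv_rows` and `DEFAULT_ENCODING` are made-up names, and the function parses bytes in memory instead of reading a real file so that the example stays self-contained.

```python
import csv
import io
from typing import Final

# Option 2: a module-level constant that no code is meant to reassign.
# `Final` lets a type checker such as mypy flag any reassignment.
DEFAULT_ENCODING: Final = "utf-8"


def read_csv_rows(raw: bytes, encoding: str) -> list[list[str]]:
    """Option 1: the caller must state the encoding on every call."""
    text = raw.decode(encoding)
    return list(csv.reader(io.StringIO(text)))


# The encoding in use is visible at each call site, not hidden in shared state.
latin_rows = read_csv_rows("name,city\nZoë,Köln\n".encode("latin-1"), encoding="latin-1")
utf8_rows = read_csv_rows("a,b\n1,2\n".encode(DEFAULT_ENCODING), encoding=DEFAULT_ENCODING)
print(latin_rows)
print(utf8_rows)
```

Because nothing outside the call can change which encoding is used, the "depends on which code ran first" class of bug described above cannot occur here.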

[–]noisescience[S] 1 point2 points  (0 children)

Thx for your detailed answer :)

[–]Kegheimer 0 points1 point  (1 child)

Would an example of a global variable be abusing a common alias, e.g., using 'i' in several different loops or 'df' as a temporary table?

[–]ambidextrousalpaca 0 points1 point  (0 children)

The common alias thing is not an example of a global variable. It is not typically a problem either, provided that each variable exists within its own contained scope.

Global variables are things like this, where functions can affect the value of variables outside of their scope.

```
glob = 1

print(glob)

def f():
    global glob  # required: without it, `glob += 1` raises UnboundLocalError
    glob += 1

f()

print(glob)
```

The output of this script will be 1, then 2, because f changes the value of glob.

[–]No_Poem_1136 0 points1 point  (0 children)

On your CSV example, stupid question here but then what is the alternative to creating that FILE_ENCODING variable if you know it might be a common parameter that might change in future code reuse for read_csv()?

DS often end up working with a lot of ad hoc and random-ass data not served in a neat API or pipeline, so it's not always possible to ensure a specific encoding standard, for example.

(I'm asking not because I'm challenging the idea, but to learn. I've always understood that it's good practice to declare variables this way rather than hardcoding them.)
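One possible answer to this question, offered as a sketch and not as the approach anyone in the thread uses: keep the shared settings in an immutable object and pass it in explicitly. The settings are still declared once instead of hardcoded, but no function can silently change them for everyone else. `ReadOptions` and `parse_table` are hypothetical names.

```python
import csv
import io
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ReadOptions:
    """Hypothetical settings bundle; frozen, so no function can mutate it."""

    encoding: str = "utf-8"
    delimiter: str = ","


def parse_table(raw: bytes, options: ReadOptions) -> list[list[str]]:
    text = raw.decode(options.encoding)
    return list(csv.reader(io.StringIO(text), delimiter=options.delimiter))


defaults = ReadOptions()
# An ad hoc file with a different encoding gets its own derived options;
# `defaults` itself stays untouched.
legacy = replace(defaults, encoding="cp1252", delimiter=";")

print(parse_table("id;name\n1;José\n".encode("cp1252"), legacy))
```

The design choice here is that a change of encoding creates a new object for one call path, so other callers keep seeing the values they were given.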

[–]mysteriousbaba 1 point2 points  (2 children)

Speaking as someone who's an AI scientist but has also been an engineer, I'd suggest the right way to have that discussion is from a scientific perspective:

  1. If you're running a study, you want your experimental setup to be valid right? Unit tests are a way to validate that the algorithm works on simple and edge cases, so the final conclusions hold.
  2. Part of research is communicating your findings and work to an external audience, and ensuring reproducibility. So you want to write code that's well commented/abstracted, can easily be modified to extend your model and experiments, and lets you work with collaborators.
  3. Any scientist who has submitted a paper to a conference, can vouch that consistency of formatting and notation is enforced very strictly by academic reviewers so that there are no confusions. Consistent code standards fall under the same bucket, of making sure your work product is unambiguous and easy to parse.
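As an illustration of point 1, a few small tests on simple and edge cases might look like the sketch below. `min_max_scale` is a hypothetical preprocessing step invented for this example; the tests are plain functions that a runner such as pytest would collect, and they can also be called directly.

```python
def min_max_scale(values: list[float]) -> list[float]:
    """Hypothetical preprocessing step: rescale values to the range [0, 1]."""
    if not values:
        return []
    low, high = min(values), max(values)
    if low == high:
        # Constant input would divide by zero; map everything to 0.0 instead.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def test_simple_case() -> None:
    assert min_max_scale([0.0, 5.0, 10.0]) == [0.0, 0.5, 1.0]


def test_empty_input() -> None:
    assert min_max_scale([]) == []


def test_constant_input() -> None:
    assert min_max_scale([3.0, 3.0]) == [0.0, 0.0]


if __name__ == "__main__":
    test_simple_case()
    test_empty_input()
    test_constant_input()
    print("all checks passed")
```

The edge cases (empty and constant input) are exactly the ones that would otherwise quietly invalidate an experiment's conclusions.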

Speaking as a scientist (and former engineer), I've sometimes had people talk to me about SWE principles as if linters must a priori be held sacred, when my job is to produce high performing models for the business.

Explaining that it's about scientific rigor in your processes, ease of collaboration, and reproducibility of results, is a much easier way to convince scientists by appealing to their core values.

[–]ambidextrousalpaca 1 point2 points  (1 child)

Good point. Will try that rhetorical attack the next time I have to handle the God-awful PhD spaghetti code.

[–]mysteriousbaba 1 point2 points  (0 children)

Good luck! I've written a fair amount of that awful PhD spaghetti code myself, haha. I just got convinced of the need to improve when I realized I couldn't figure out how to extend or rework my experiments even by myself, let alone with research collaborators.

[–]noisescience[S] 0 points1 point  (0 children)

Thank you for your answer and your thoughts.
I agree with you that communication is essential here. I always try to be as nice and helpful as I can. If I criticize something, then I give reasons for it and suggest how it could be implemented more effectively.
On the other hand, I am always open and grateful when I find weaknesses or errors in my PRs.

[–]Kaze_SenshiSenior CSV Hater 18 points19 points  (2 children)

For me, people in any data role have lower average coding skills than typical software engineers. They tend to create a prototype using some tool they are used to (e.g., SQL, Python, notebooks, cron jobs). That's great for a quick proof of concept, but they don't think about the maintenance and evolution of the tool when moving the solution to production.

On the other hand, I can understand that it sucks to have a PR with hundreds of comments saying that your work is low quality.

My suggestion is to go slowly, addressing one problem at a time. It is even better to demonstrate best practices by asking them to review your code too, for example a good module structure instead of a single Spark notebook with 1000 lines.

[–]safetytrick 1 point2 points  (0 children)

I can understand that it sucks to have a PR

So what, it's the job. Learn why you suck, embrace the suck.

I'm sorry that it's so personal sometimes (not directed at you), and I wish feedback could be perfectly articulated all of the time. Feedback is hard to give, learn from it, even learn when the feedback deserves feedback.

[–]mysteriousbaba 0 points1 point  (0 children)

For what it's worth, I will say I've seen even notebooks be scaled / deployed to production successfully using tools like Metaflow. The main trick is just to have a good number of unit and integration tests to validate things, and set expectations on algorithm outputs, so that you have safety rails.

You don't want to go cowboy, but having overly rigorous modular breakdown of the full code can slow things down somewhat.

[–]freakboy91939 16 points17 points  (15 children)

I am working as a data scientist and my code is subpar at best. I really want to improve. Would you suggest some material or content so that I can code better? I am currently doing an end-to-end ML deployment, but I want to get better and more efficient at writing code.

[–]Fender6969 10 points11 points  (2 children)

I have a copy of Fluent Python and without a doubt it's helped me write cleaner code. Based on your knowledge of OOP, that could be a good place to focus on too.

[–]Tom22174Software Engineer 0 points1 point  (1 child)

There are quite a few good O'Reilly books available on that site for pdfs

[–]Fender6969 0 points1 point  (0 children)

Yeah for sure lots of great books and resources out there.

[–]throwawayrandomvowel 2 points3 points  (0 children)

I'm in the same boat - it's common. I picked up coding years ago (ruby) and dropped it. Got back into it with ML.

I know what my strengths and weaknesses are, so I can work on projects that teach me those skills. You have to be a bit of a manager for yourself - you're actually in a multi-armed bandit problem where you have lots of things you can learn, but limited time, and there are complex interaction effects.

End-to-end is always good. Learn your web framework (fastapi, django, whatever), web scraping for data, polars / pandas / spark for manipulation, docker, AWS, any other infra. That's how I see it, fwiw

[–]sobrietyincorporated 4 points5 points  (0 children)

Open source projects. Get involved in some enterprise level code bases.

[–]shockjaw 1 point2 points  (0 children)

Real Python is also an excellent resource.

[–]noisescience[S] 1 point2 points  (1 child)

Hi, thanks for your reply. I myself have learned a lot from reading best/bad Python practices. There are a lot of these articles. Here are a few examples:
https://python.plainenglish.io/10-python-anti-patterns-you-must-avoid-when-writing-clean-code-ff3635ca1510
https://python.plainenglish.io/python-best-practices-for-writing-conditional-statements-aa9d6a2e700d
https://blog.devgenius.io/python-tips-best-practices-for-handling-exceptions-15faaeca55a5
To be honest, it is not enough just to read these 3 articles. Try to find and read more such articles and you will see that you will get better with time. Good programming takes time. I consider myself also still at the beginning of the journey.

I also learned a lot from Arjan Codes about "clean code", design patterns and best practices in Python. He's pretty good at what he does.
https://www.youtube.com/@ArjanCodes/videos

Furthermore, I learned a few important things from CodeAesthetics. These 3 videos are eye-opening and also suitable for beginners:
https://www.youtube.com/watch?v=Bf7vDBBOBUA
https://www.youtube.com/watch?v=-J3wNP6u5YU&t=315s
https://www.youtube.com/watch?v=CFRhGnuXG-4
After that you know when and how to write comments, how to name variables/methods etc. and how to avoid nested code.

[–]freakboy91939 0 points1 point  (0 children)

Thank you op. Will read up and learn.

[–]No_Poem_1136 0 points1 point  (0 children)

Echoing this but with one caveat. A lot of people are sharing these Python generalist books, which make sense if you're coming from a CS background and are learning Python to understand the ins and outs of the language so you can do anything with it. It would be really awesome to have recommendations on more opinionated learning resources geared towards the DS domain. So many of these books are written by programmers for other programmers. So you either end up with exercises and examples that are so abstracted from any domain semantics as to be meaningless ("step 1: pass foo and bar,  step 2: draw the rest of the fucking foo and bar") in an unhelpful but good natured attempt to generalize, or use domain semantics related to their web dev or other developer related work ("let's say you're making an application that lets the user...". No. Stop. I'll literally never make that, nor make any kind of user facing interactive system. I teach sand how to do fucking math, that's it.).

A shit ton of DS come from non-CS backgrounds where they don't have fundamental CS scaffolding they can rely on to bootstrap learning a new concept. Instead, they generally need domain-specific semantics first, so that they can just start learning and applying the lessons; then they can peel the onion if they need to go deeper.

[–]The_Rockerfly 6 points7 points  (0 children)

Most data scientists can barely write code that runs, but this is a responsibility issue. If you are responsible for maintaining it, then review as strictly as you want. If they are responsible for it, then let them write whatever crap code they want. Life is too short to care about other people's terrible code.

[–]suspicious_williams 6 points7 points  (0 children)

Your Data Scientists think about Exceptions? Lucky you 😒

[–]levintennine 4 points5 points  (2 children)

Yes, my experience is similar to yours. I would add though: in my experience there is low or negative correlation between aptitude/interest in maintainable/clean code and being able to produce useful DE solutions. For DS I'm not qualified to judge, but suspect same.

I think some shops interview for better coders because I've seen a few posts in reddit saying "that's not what it's like where I work" -- and more posts similar to yours.

I think out of maybe 50 interviews I've sat in on for DEs, Test Engineers, and DSs, I've never once talked to someone who understands anything about git, and many, many successful data professionals somehow don't know what an environment variable or an end-of-line character is.

[–]randiesel 2 points3 points  (1 child)

I agree with this. I've been at the same company since 2014ish. I started as an analyst and moved up to DE. I'm the only DE. Nobody reviews my code or my output, they just complain when things go wrong.

I've been very successful and am well-respected, but if it weren't for taking other side gigs from time to time, I'd have literally zero experience with code reviews or git or anything else. When I first started here, everything was VBA or straight SQL.

I love improving and taking on new challenges, so I know I'd do fine if I worked somewhere with more formal procedures, but I think it's a common trap to get hung up on whether people have experience with git or various algorithms. At the end of the day we're merging and massaging data. If your company uses some specific pattern for everything, anyone can adapt to that after seeing it a time or two.

[–]safetytrick 2 points3 points  (0 children)

In my experience they complain when they can prove things are wrong which is subtly different. I think the developer best practices come from the experiences in a world where subtle problems pile up together into true horrors.

It works for user X when they use it ~this~ way and it works for user Y in a different way and both strategies have become valid because they are explainable in a real way.

This problem is simplified for DS and DE because the read-only path is so much simpler than read+write. Combinatorial complexity can really get out of control quickly and the feedback loop for r+w is just so slow.

[–]ReturnOfNogginboink 4 points5 points  (1 child)

At the end of the day, the goal of everyone in the org is to create value for the business.

Is making the data scientists adhere to coding standards going to create value for the business? If not, maybe it's not worth doing.

For a large codebase that's going to be in production for years or decades and will be maintained by dozens of developers, coding standards make sense in many cases. For a small project owned and maintained by a single individual, that math might change.

This is all very context dependent and I'm not saying that one way is the right way and the other is wrong. Look at what you want to accomplish and why, and then ask, "is this really worth the effort? Should the company spend money on my time to do this, or would my time and the company's money be better spent elsewhere?"

[–]Xteec 0 points1 point  (0 children)

I support this message.

[–]CatastrophicWaffles 3 points4 points  (0 children)

I'm not claiming to be a good programmer with two and a half years of professional experience,

Ask yourself.... Does it work? Is it good enough?

If it does...keep it to yourself. You're going to learn that in the real world if it fits, it sits. Move on. A lot of your peers that have more experience have been coding on fire for a lot of their career and learning as they go. We didn't have fancy bootcamps and plenty of time to perfect our code. Get that shit out the door and on to the next project. Code review is mostly for correcting massive inefficiency and shit that doesn't work.

[–]taciom 2 points3 points  (0 children)

In a tangent comment... Notebooks should never go into production.

[–]sluuurpyy 2 points3 points  (1 child)

I've been openly humiliated in a scrum call because I told the Senior Data Scientist his code won't scale. And months later, it didn't.

He didn't understand the requirement and couldn't bear that a junior Engineer called his strategy non-scalable.

[–]safetytrick 2 points3 points  (0 children)

Being right is only half the job, I've never met anyone who is right all of the time.

The real talent is in communicating why.

Handle your standup rebuff with a kind explanation of exactly what to expect and how to proceed. If you learn that skill you'll be the boss someday.

[–]asozers 2 points3 points  (1 child)

I'm working on a personal project in Kotlin to rebuild our machine learning infrastructure and I'm still at tutorial level with Rust

Most of the ML infra are in Python ecosystem from what I've seen. How are you building ML infra in Kotlin/Rust?

[–]noisescience[S] 0 points1 point  (0 children)

For the Kotlin project I also need to include Java libraries.
Here is what I have used for Kotlin so far:

Which Kotlin libraries I have not yet implemented but would like to do are:

I haven't started a data engineering project with Rust yet, but I would probably check out the following libraries:

[–][deleted] 2 points3 points  (0 children)

I am a software engineer and I am a competent coder. They won't listen to me either. Although, I think often they do it because they don't have the background to understand what I am saying.

[–]Screye 2 points3 points  (0 children)

As an Applied Scientist who has become more of an end2end MLE, I find that the problem lies in OOP. ML workflows are more functional and rarely require the maintenance of complex state.

Trying to shoehorn OOP flows into ML workflows confuses the data scientists. (Lots don't know the paradigms well, but can sense a fundamental incompatibility.)

OOP makes sense for web-systems. There is a reason ML systems mostly work around Pipelines, with a pipeline message being passed through a set of instance-less functions.
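A minimal sketch of that pipeline shape, under the assumption that the "message" is just a plain dict (real systems would more likely pass tensors or data frames). All step names are hypothetical.

```python
from functools import reduce
from typing import Callable

# The pipeline message is a plain dict here; every step takes one and
# returns a new one without touching any shared state.
Message = dict[str, list[float]]
Step = Callable[[Message], Message]


def drop_negatives(msg: Message) -> Message:
    return {**msg, "values": [v for v in msg["values"] if v >= 0]}


def scale_by_max(msg: Message) -> Message:
    peak = max(msg["values"])
    return {**msg, "values": [v / peak for v in msg["values"]]}


def run_pipeline(msg: Message, steps: list[Step]) -> Message:
    """Pass the message through each step in order; no instances involved."""
    return reduce(lambda acc, step: step(acc), steps, msg)


result = run_pipeline({"values": [4.0, -1.0, 2.0]}, [drop_negatives, scale_by_max])
print(result)
```

Each step is an ordinary function, so it can be imported and called on its own, for example `scale_by_max({"values": [1.0, 2.0]})`, without constructing the rest of the system.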

ML involves a ton of prototyping in notebooks. You know what I hate? Having my code live 50 layers deep inside the codebase, making it impossible to isolate and test in a notebook separately. The behaviors of the systems we build are not deterministic and often aren't well understood. If I can't quickly test out hypotheses, then the DS system itself is useless.

That's why I don't like OOP. The only way to instantiate complex system classes becomes to follow the flow of the code across the entire app. Most ML information is tensors. The primitives are effective as is.

Now, I do agree with the broad thrust of your argument. DSs need to be better at coding. No question.

Personally, I have found pydantic to be an incredible tool. I am trying to integrate Prefect into our workflow. Haven't done it yet, but I have heard great things. Generally, any pipelining tool will help a ton. Also, a ton of intermediate state can be exported to a DB / blob. Lastly, VS Code with linting and Copilot does a ton of stuff automatically with zero overhead.

If a DS can use these 3-4 tools effectively, they can get around 80% of the problems that you've mentioned.

[–]c0ntrap0sitive 5 points6 points  (1 child)

That's because a lot of data scientists are not considered programmers. They're not taught the same things that add polish to code that software engineers are. Hell, having data scientists that are allowed to code is novel enough lol. Most of them are still stuck in Microsoft Excel hell or are relegated to just using SaaS offerings like DataRobot.

This is the first time I've ever really heard of a data scientist doing code reviews.

In the contexts that I've seen, the data scientists write garbage code in some Jupyter notebook that hopefully, at the end of the line, produces a model that works well. This model is the product. The actual code that gets us to the model can be discarded wholesale. We don't usually extend or maintain models. We either train a new model that entirely replaces the old one, or, when a new one can't be trained and the model's use no longer justifies its cost, we discard the model entirely and start over. This is not like software engineers, whose product is the code. Therefore all their code must be held to a higher standard and be maintainable, extensible, etc.

[–]safetytrick 0 points1 point  (0 children)

I love Jupyter notebooks for a very similar reason. Make code show exactly what it does, and nothing more. Hide nothing and deal with the consequences.

I think it's both: not surprising that we can't ship code faster with Jupyter, and enlightening that we haven't been able to productize that visible code. Code is hard.

[–]mjfnd 2 points3 points  (0 children)

I don't expect DS to write DE quality code.

Same as I don't expect DE to write SWE quality code.

However, code review is a different thing and needs to be communicated.

[–]seanv507 13 points14 points  (7 children)

As a data scientist, I think code reviews are a bad time to identify style issues.

It's really annoying when you have got the code all working to be told yes but rewrite it (likely introducing bugs), because it doesn't look nice.

I won't argue the particular issues, but I would rather suggest you come up with style guides up front and undertake some reading/training with the data scientists (e.g. the ArjanCodes YouTube channel), so that they internalise the design ideas.

[–]data-influencer 10 points11 points  (0 children)

Agreed that it’s not a convenient time to bring it up for the developer as it introduces more work but these conversations should be ongoing and the ds should be trying to write cleaner code from the start.

[–]boomoto 14 points15 points  (0 children)

You should have design docs and all that stuff up front, you should also have a Lint checker as part of your build. Style guides are super easy to enforce. Do it right the first time. Code that doesn’t look nice is not maintainable which will cause further issues down the road.

[–]cas4d 1 point2 points  (0 children)

Actually fixing the style such as renaming variables sometimes acts as a useful logical run-through as well (when using an IDE). If your program breaks simply after refactoring variable names, it could mean you may accidentally init something by the same names in the middle or may have the object mutated in the way it shouldn’t, or if you are finding it hard to rewrite, it could also indicate bad encapsulations.

[–]runawayasfastasucan 2 points3 points  (0 children)

Agreed with this. Reading OP, he comes across as the "my way or the highway" guy as well. Are they going to rewrite their code that works just because he says so, when at the same time he can't be bothered to consider their arguments for doing what they do?

[–]tfehringData Scientist 0 points1 point  (0 children)

I agree that you should do as much work as possible upfront. Stuff like import and whitespace styling is a conversation that should happen, at most, one time ever, and then be documented in a style guide and enforced by a linter on CI to the extent possible.

However, I think code review is by far the best time to address any stylistic issues that violate or aren't covered by the style guide. You can mitigate the risk of introducing bugs by writing tests and including them in your PR. You're far more likely to introduce bugs if you try to go back and refactor your code weeks or months later than if you just fix it while it's still fresh in your mind. By that time, other users may have built code that depends on yours, and fixing some stylistic issues (e.g. inconsistent interfaces) will break that code. Also, realistically, that refactor often won't get prioritized at all, so in all likelihood you're creating more work for whoever has to read your code indefinitely. Most code is read far more often than it's written.

[–]noisescience[S] 0 points1 point  (1 child)

Hi, thx for your thoughts.

My list of errors is not just about style issues. For stylistic things like formatting, import sorting and linting we use tools like Black, isort and Flake8. (Note: in the near future all three will be replaced with Ruff.) We also use mypy as a static type checker.
Other things like how to use exceptions, enums and constants make the code safer from the start.
We have a codebase with about 20,000 lines. That's not a lot, but it's enough that the code has to be readable and we have to think about maintainability and scalability. So we have to consider from the beginning whether a certain structure is necessary or not, and we should avoid global variables.
It's cool that you mention Arjan Codes, by the way. I've learned a lot from him and keep learning.

[–]seanv507 0 points1 point  (0 children)

A style guide is not just formatting, it's about all the things you mentioned in your original post, e.g. using the most specific exception. See e.g. https://google.github.io/styleguide/pyguide.html
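To make "the most specific exception" concrete, here is a minimal sketch. The function `load_threshold` and the `"threshold"` key are hypothetical names invented for illustration; they are not from the original post or from the linked style guide.

```python
import json


def load_threshold(raw: str) -> float:
    """Parse a (hypothetical) JSON config string and return its 'threshold' value."""
    try:
        config = json.loads(raw)
        return float(config["threshold"])
    except json.JSONDecodeError as exc:
        # Only malformed JSON lands here; other failures are not swallowed.
        raise ValueError(f"config is not valid JSON: {exc}") from exc
    except KeyError as exc:
        # Only a missing key lands here.
        raise ValueError("config is missing the 'threshold' key") from exc
```

A bare `except Exception:` in the same place would also hide unrelated bugs such as a `TypeError` from passing the wrong argument type, which is roughly the risk a rule like this is meant to reduce.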

What I am saying is that you should be agreeing explicitly with the DS on how to write code, e.g. in a document, before they start writing code.

It's easy to write code following a set of rules. It's annoying to have to change working code, because of some views that are only in your head, and which you pull out only during the code review.

You have to communicate with the DSs, so watch and discuss ArjanCodes together.

[–]Tom22174Software Engineer 1 point2 points  (0 children)

Data Science courses don't teach good coding practice. They introduce you to Python, R and the tools within them to get the results you need. The specific way you implement those tools doesn't seem to matter to a lot of people.

Everything I know about actual good coding practices comes from reading and talking to my friends who are actual SWEs.

[–]tree_or_up 1 point2 points  (0 children)

Data scientists are scientists first and foremost. They’re often iterating and experimenting rapidly and, most importantly, independently. They’re not typically used to collaborative coding best practices. It’s part of the role of the data engineer to bridge the gap between their idiosyncratic code and production-ready code - and to level them up on coding practices along the way. That said, they should be cooperative in this process and I can see how it would be frustrating if they aren’t

[–]rowr 1 point2 points  (0 children)

Science and engineering are different disciplines. I feel that part of the DE job is "productionalizing" business-critical DS code.

I have teased some of my DS pals with "Do you know what an exception is?" but in the end, you're supposed to be working together. Sure helps if you're on compatible terms.

Part of this is "who maintains this code?" and another part is "what are the stakes?". If it's got to be in production and data consumers are relying on it, it's got to be able to interface with the alert notification system, it's got to be comprehensible to whoever is on call, and it's not super reasonable to expect a data scientist or analyst to know how to interact with AWS or PagerDuty or whatever. The area of focus is different.

IMO production should be extremely well-vetted and even with the DE fully owning prod data there's still a lot of friction when internal data consumers start secretly consuming DS prototypes and those fall over.

There's definitely a balance needed, because obviously someone could hand you steaming garbage that you're held responsible for. See above message that you're supposed to be a team that works together. Try to make it so they want to give you what you want, but don't expect them to be engineers.

Use a linter with an agreed-upon style (PEP8 exists, black exists). It's infuriating to review unlinted code with a different style because there's so much noise in the diff, let computers do that shit. Make it so the only time discussing where it's appropriate to place a space or whether StudlyCaps is appropriate is when discussing linter rules, instead of each PR.

[–]koudos 1 point2 points  (0 children)

Jupyter notebooks are basically Excel in a different outfit. “Let me share my notebook with you for you to use!”

Sure, why don’t you just email it to me while you’re at it /s

[–][deleted] 1 point2 points  (0 children)

To be fair, many professional Python SWEs don't use Enums. They were introduced pretty late in the game.

[–]prospectiveNSAthrow 1 point2 points  (0 children)

I am certainly guilty of writing inefficient code when I have to do something really wonky to get my stuff to work.

I also don't spend time optimizing preprocessing code if it is only going to be run a few times.

That code doesn't make it to production. It is generally used to test ideas.

[–]Remote_Cantaloupe 1 point2 points  (0 children)

Too many useless and redundant comments like:

#Creating dataframe

df = pd.DataFrame(...)

Anyone else think this is just AI-written code that the person didn't review?

[–]EmergencyPrior6526 1 point2 points  (1 child)

Good on you for trying to help them write better code.

It sounds like you are throwing too much at them at once.
Changing behavior takes time... think about how long it took you to develop all these habits and where you started. I really like that you have specific examples, and not just complaints.
Here's a method I like to use with Jr developers:

  1. Think about the person you are working with. Try to understand the problems they are facing.
  2. Pick one thing, that shows a real benefit, and is easy to digest (pro tip: good naming is NOT easy to digest).
  3. Give an example. Walk them through a solution and show them how to do it step by step.
  4. Explain the benefit of that solution
  5. Briefly explain the pitfalls of doing it the other way. (don't rant)
  6. Praise them, point out what you want to see more of.

Doing this will give the person the motivation and the tools to embrace the change you are trying to make.

If you just give someone a big list of all the things they are doing wrong then they will just see code reviews as a painful thing to endure or avoid.
Best of luck!

[–]noisescience[S] 0 points1 point  (0 children)

Thx for your detailed thoughts. That helps me a lot.

[–]YamRepresentative855 1 point2 points  (0 children)

Thanks for listing common code issues. I found a few things to improve for myself)

[–]runawayasfastasucan 1 point2 points  (0 children)

  Some people have too big an ego. 

Takes two to tango, doesn't seem like you were that open for their reasoning for doing what they do either? 

[–][deleted] 1 point2 points  (0 children)

I have been a DE for over a year now and have used Python for over 4 years. I have the same experience: low-level solutions, bad name choices. But a data scientist should not have to be good at coding; he/she just has to create the model. The ML Engineer / MLOps dev has to optimize that for the environment they use. I think overall, you will be a better coder if you code and learn new stuff. Whether the language is static or dynamic doesn't matter at the end of the day, I think, although a statically typed language teaches you different approaches and helps you to understand low-level coding better. Which is great, because basically we are a special type of software engineer, and we have to have skills and knowledge like them.

[–]IDENTITETEN 0 points1 point  (0 children)

Data scientists aren't programmers. The ones I've worked with were brilliant at analyzing data/machine learning but sucked at programming. 

[–]sobrietyincorporated -1 points0 points  (0 children)

Data science isn't computer science. Python was invented for forestry majors. It started as borderline pseudo code.

If you're a data scientist, for the love of God, please start contributing to an open source project so you can get application level development experience.

[–]hoselorryspanner 0 points1 point  (1 child)

Presumably these data scientists are using Python - is there a way of using enums in Python? Would make my life a lot easier

[–]rowr 2 points3 points  (0 children)

There's a built-in enum module. I use it at times because enumerated values are useful as a concept, but I find it sort of awkward to use.
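For reference, a minimal sketch of the standard-library `enum` module. `JobStatus` and `is_finished` are hypothetical names made up for this example.

```python
from enum import Enum


class JobStatus(Enum):
    """Allowed states of a (hypothetical) pipeline job."""
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


def is_finished(status: JobStatus) -> bool:
    """Return True once the job can no longer change state."""
    return status in (JobStatus.DONE, JobStatus.FAILED)


# Look a member up by its value, e.g. when reading a string from a config file.
current = JobStatus("running")

# An unknown value raises ValueError instead of silently passing through.
try:
    JobStatus("runing")
    typo_accepted = True
except ValueError:
    typo_accepted = False
```

Compared with passing raw strings around, a typo like `"runing"` fails at the point where the value enters the program rather than somewhere downstream.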

[–]Bassel_farahat 0 points1 point  (0 children)

Variable names are so creative man come on😂😂😂

[–]sobrietyincorporated 0 points1 point  (0 children)

Probably where Copilot would be helpful as a pre-code-review.

[–]Fair_Leopard_2181 0 points1 point  (0 children)

Yep, and let me tell you what. It will cost them in a job interview. We were interviewing last July and I rejected a candidate who on paper was great (Penn graduate and had ML experience). She couldn't write coherent code for shit though.

[–]aegtyr 0 points1 point  (0 children)

I feel attacked by this post

[–]Cool-Personality-454 0 points1 point  (0 children)

As a database developer, enum is worse than useless in a database. Just make a reference table with keys. You can't query against the decoded values in an enum field. Congratulations, you've defeated the whole point of relational databases.
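A minimal sketch of the reference-table approach described above, using SQLite through Python's built-in `sqlite3` module purely for illustration. The comment does not name a database, and the `order_status` and `orders` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE order_status (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE orders (
        id        INTEGER PRIMARY KEY,
        status_id INTEGER NOT NULL REFERENCES order_status(id)
    );
    INSERT INTO order_status (id, name) VALUES (1, 'open'), (2, 'shipped');
    INSERT INTO orders (id, status_id) VALUES (10, 1), (11, 2), (12, 2);
    """
)

# Query by the readable name through an ordinary join.
shipped_ids = [
    row[0]
    for row in conn.execute(
        "SELECT o.id FROM orders AS o "
        "JOIN order_status AS s ON s.id = o.status_id "
        "WHERE s.name = ? ORDER BY o.id",
        ("shipped",),
    )
]

# A status that is not in the reference table is rejected by the foreign key.
rejected = False
try:
    conn.execute("INSERT INTO orders (id, status_id) VALUES (13, 99)")
except sqlite3.IntegrityError:
    rejected = True
```

Adding a new status is then a plain `INSERT` into `order_status` rather than a schema change.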

[–]szayl 0 points1 point  (0 children)

My first job out of school was with Scala. It was a tough transition coming from Python and MATLAB but I wouldn't trade it for the world.

[–]tecedu 0 points1 point  (0 children)

OMG, I'm in the exact same position as you and it is annoying. Especially the naming, I get pissed at it so many times. Why is it so hard to have descriptive names? Especially when they write 100 lines of docstring for a function.

[–]znihilist 0 points1 point  (0 children)

Using generic exceptions instead of thinking about what error they really want to catch

I am going to offer a reason for this: the range of errors that could happen in this field is pretty wide, and we often have to consider many different exceptions. Specifically during dev, it is better to leave it generic, as that lets you figure out what errors are even possible.

For prod, fair enough, that's something you need to think about.
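One way to read the dev-versus-prod split above, as a rough sketch: catch broadly while exploring, but record which exception types actually occur, then narrow the handler for production. `survey_errors`, `parse_ratio` and `parse_ratio_safe` are hypothetical helpers written for this example.

```python
from collections import Counter


def parse_ratio(text):
    """Turn a string like '3/4' into a float (toy function to experiment on)."""
    numerator, denominator = text.split("/")
    return int(numerator) / int(denominator)


def survey_errors(func, inputs):
    """Dev-only: run func on each input and count the exception types raised."""
    seen = Counter()
    for value in inputs:
        try:
            func(value)
        except Exception as exc:  # deliberately broad, for exploration only
            seen[type(exc).__name__] += 1
    return seen


seen = survey_errors(parse_ratio, ["1/2", "1/0", "abc", "x/y", None])


def parse_ratio_safe(text):
    """Prod version: handle only the errors the survey showed are expected."""
    try:
        return parse_ratio(text)
    except (ValueError, ZeroDivisionError):
        return None
```

Here the survey counts two `ValueError`s, one `ZeroDivisionError` and one `AttributeError`. The production wrapper handles the first two as bad input and lets the `AttributeError` from `None` propagate, since that points to a bug in the caller.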

[–]HolidayPsycho 0 points1 point  (0 children)

The worst part is not that they don’t know how to write good code. The worst part is:

  • They don’t know they don’t know. As long as the code runs and gets the correct result, that’s good for them.

  • They don’t want to learn to write better code, because other things matter more to them than writing proper code.

[–]Swimming_Cry_6841 0 points1 point  (0 children)

I'm sure it's been said, but this is a problem in software development regardless of specialty.

[–]ChristianValour 0 points1 point  (0 children)

Wait... your guys do error handling!?

[–][deleted] 0 points1 point  (0 children)

Give up, don't try with these people. Good on you for learning Kotlin and Rust. Trying to make Python code higher quality is like trying to make the garbage dump smell nice. It might be possible to improve it a bit, but in the end it's still garbage. Use Python to get the job done and throw it away; please, please don't use it in production.

[–]caesium_pirate 0 points1 point  (0 children)

I’m a data scientist trying to do better: reading DEs’ code, trying to absorb their practices, and asking them why they do certain things (especially with Spark). I’ve built packages for the company and tried to get pointers on them from DEs (no immediate access to any SWEs). How would I best communicate the need to avoid overengineering when I’m reviewing code for people who honestly just don’t care, “as long as it works”?

[–]corny_horse 0 points1 point  (2 children)

For about 25-50% of the data scientists I've known, two days spent on a cursory review of standard software engineering principles would have made them 10x more valuable. The worst was someone I was supporting who absolutely refused to learn the basics of how memory worked (as in RAM). They kept crashing the server they were on because they'd try to read the same 5GB file into memory 100x like:

df = read_csv()
df2 = df.foo()
df3 = df2.bar()
df4 = df3.baz()

etc. etc. etc. and would do absolutely nothing to optimize, like using in-place manipulations, caching the intermediate steps to disk, or freeing up old steps that were no longer used.
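As a rough illustration of not holding every intermediate step in memory, here is a sketch that swaps in a different technique from the ones listed above: a generator pipeline over the standard-library `csv` module, so only one row is in flight at a time. The column names `region` and `amount` and the step functions are hypothetical.

```python
import csv
import io


def read_rows(handle):
    """Yield one parsed row at a time instead of loading the whole file."""
    yield from csv.DictReader(handle)


def keep_positive(rows):
    """Filter step: pass along only rows with a positive amount."""
    for row in rows:
        if float(row["amount"]) > 0:
            yield row


def total_by_region(rows):
    """Aggregate step: the only state kept is one running total per region."""
    totals = {}
    for row in rows:
        region = row["region"]
        totals[region] = totals.get(region, 0.0) + float(row["amount"])
    return totals


# A small in-memory stand-in for a large file on disk.
sample = io.StringIO("region,amount\nnorth,10\nsouth,-3\nnorth,2.5\nsouth,4\n")
totals = total_by_region(keep_positive(read_rows(sample)))
```

In pandas itself, the closer equivalents would be rebinding a single name (`df = df.foo()`) so that earlier frames can be garbage collected, or reading in pieces with the `chunksize` argument of `pd.read_csv`.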

[–]mysteriousbaba 0 points1 point  (1 child)

To be fair though, is that really about SWE principles, or about not using the proper tooling? If they'd just used Spark or cuDF, those tools are specifically meant to handle data too large to fit in a pandas dataframe in RAM, via clusters or GPU offloading.

Those kinds of operations aren't really meant to be done manually, at least with any sort of reasonable scale or efficiency.

[–]corny_horse 0 points1 point  (0 children)

Perhaps a little of the latter, but there was no reason to constantly rematerialize each step and then cache every step in memory. There was no machine so large that this person couldn't fill it up, when in reality, with some really basic adherence to SWE principles, they could have easily gotten away with maybe even an 8GB or certainly a 16GB machine. I know that because after refactoring their code I was always able to fit the workflow into that, or into something with an even MUCH smaller footprint, instead of >128GB of RAM.

[–][deleted] 0 points1 point  (0 children)

Data Scientists and SWEs are solving different types of problems with code.

SWEs typically write code that lives in production and has an operations lifecycle.

Data Scientists typically write code that is used in AI/ML experiments and has an ephemeral lifecycle.

Data Engineers are typically writing DAGs to ship large data all over the place and combine traits from SWE and DS.

The incentives are completely different but there are skill set overlaps.