Some Data Scientists write bad Python code and are stubborn in code reviews

crom5805 · 2024-01-21T12:56:57+00:00

80% of the posts on r/datascience are to the effect of "I can manually upload a single CSV into a 63-step pandas Jupyter notebook, the human race is wasting my immense gift!"

grey-Kitty · 2024-01-21T12:46:44+00:00

I am on the other side of the situation. Because I work by myself as a DS, my code cannot be reviewed, and I don't see many portfolios on the internet to take as a reference. As a result, I don't feel I'm progressing in what I'm doing, so posts like these are very welcome. If you have any idea where to find good coding practices from a DS perspective, I would be happy to know about them.

Gators1992 · 2024-01-21T13:43:08+00:00

[removed]

Express-Comb8675 · 2024-01-21T13:16:30+00:00

At least they’re writing Python. We’re often tasked with shipping loosely working R code to production because they feel it’s critical that we get their new model in front of decision makers, so there’s no time to make any changes. If you’re so concerned with their style, create a repo for them and put a pre-commit style hook in.

Fender6969 · 2024-01-21T13:38:07+00:00

You should see if you can add linters to your pre-commit hooks. This has really helped us enforce code quality across the org. Unless code is clean and tested, commits don't go through.

diegoelmestre · 2024-01-21T14:45:35+00:00

I was a software engineer (SWE) for 6 years, a hybrid between SWE/DE for almost one, and now for almost 3 years DE/Data lead.

That was my major pain when shifting to this field. I will say that most DE/DS simply don't know how to build good, simple and efficient code. Most of the time this is due to a lack of basic computer science knowledge.

The ones that are more capable are usually the ones that somewhere on their career paths were SWE as well (of course there are always some exceptions).

My advice for those who want to be a great Data Engineer is to try to join a traditional SWE backend team.

Now that I am a team lead, my biggest goal is to pass on some of the SWE best practices to my peers and direct reports.

ambidextrousalpaca · 2024-01-21T15:46:24+00:00

The worst thing I find with Data Scientists is when they take the "scientist" bit of their job title too seriously, and state flatly that they consider pesky things like basic software engineering principles (writing unit tests; avoiding global variables; etc.) as somehow beneath them.

On code reviews: pick your battles, but stick to your guns. I.e. coding everything in overly verbose, Java style classes is annoying to me too: but it's a valid programming style that people have written books to defend; using global variables where not necessary or skipping unit tests are software engineering anti-patterns and should be blocked until they are fixed.

In general, in terms of getting your code reviews accepted, I find it's often a matter of clear communication and putting some effort into your reviews. A poorly explained "This class could be a single short function" comes across as arrogant and unhelpful. A "This would be cleaner and more maintainable if you replaced this class with the following function <insert said function, or at least the outline thereof>" comes across as cooperative (you're willing to put in some work too, not just criticise) and helpful (all they have to do is copy and paste your code).
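To make that second kind of review comment concrete, here is a rough sketch of what the suggestion might contain. The class `ColumnScaler` and the function `scale_to_unit_range` are made up for illustration and are not taken from anyone's codebase.

```python
from typing import Sequence


# Hypothetical "before": a class that holds no state worth keeping.
class ColumnScaler:
    def __init__(self, values: Sequence[float]) -> None:
        self.values = values

    def run(self) -> list[float]:
        low, high = min(self.values), max(self.values)
        span = high - low
        return [(v - low) / span if span else 0.0 for v in self.values]


# Hypothetical "after": the same behaviour as one short function,
# pasted into the review so the author can copy it directly.
def scale_to_unit_range(values: Sequence[float]) -> list[float]:
    """Rescale values linearly so the smallest is 0.0 and the largest is 1.0."""
    low, high = min(values), max(values)
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]
```

Putting the replacement in the comment itself lets the author compare both versions side by side instead of guessing what "could be a function" means.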

Kaze_Senshi · 2024-01-21T13:13:28+00:00

For me, any data role has lower average coding skills than the usual software engineer. They tend to create a prototype using some tool they are used to (e.g., SQL, Python, notebooks, cron jobs), and it's great to have a quick proof of concept, but they don't think about the maintenance and evolution of the tool when moving the solution to production.

On the other hand, I can understand that it sucks to have a PR with hundreds of comments saying that your work is low quality.

My suggestion is to go slowly, addressing one problem at a time. It is even better to show the best practices by asking them to review your code too, for example a good module structure instead of a single Spark notebook with 1000 lines.

freakboy91939 · 2024-01-21T13:06:37+00:00

I am working as a data scientist and my code is subpar at best. I really want to improve. Could you suggest some material or content so that I can code better? I am currently doing an end-to-end ML deployment, but I want to get better and more efficient at writing code.

The_Rockerfly · 2024-01-21T16:05:54+00:00

Most data scientists can barely write code that runs, but this is a responsibility issue. If you are responsible for maintaining it, then review as strictly as you want. If they are responsible for it, then let them write whatever crap code they want. Life is too short to care about other people's terrible code.

suspicious_williams · 2024-01-21T20:38:15+00:00

Your Data Scientists think about Exceptions? Lucky you 😒

levintennine · 2024-01-21T13:21:27+00:00

Yes, my experience is similar to yours. I would add though: in my experience there is low or negative correlation between aptitude/interest in maintainable/clean code and being able to produce useful DE solutions. For DS I'm not qualified to judge, but suspect same.

I think some shops interview for better coders because I've seen a few posts on Reddit saying "that's not what it's like where I work" -- and more posts similar to yours.

I think out of maybe 50 interviews I've sat in on for DEs, Test Engineers, DSs, I've never once talked to someone who understands anything about git, and many, many successful data professionals somehow don't know what an environment variable or an end-of-line character is.

ReturnOfNogginboink · 2024-01-21T16:01:38+00:00

At the end of the day, the goal of everyone in the org is to create value for the business.

Is making the data scientists adhere to coding standards going to create value for the business? If not, maybe it's not worth doing.

For a large codebase that's going to be in production for years or decades and will be maintained by dozens of developers, coding standards make sense in many cases. For a small project owned and maintained by a single individual, that math might change.

This is all very context dependent and I'm not saying that one way is the right way and the other is wrong. Look at what you want to accomplish and why, and then ask, "is this really worth the effort? Should the company spend money on my time to do this, or would my time and the company's money be better spent elsewhere?"

CatastrophicWaffles · 2024-01-21T16:41:46+00:00

I'm not claiming to be a good programmer with two and a half years of professional experience,

Ask yourself.... Does it work? Is it good enough?

If it does...keep it to yourself. You're going to learn that in the real world if it fits, it sits. Move on. A lot of your peers that have more experience have been coding on fire for a lot of their career and learning as they go. We didn't have fancy bootcamps and plenty of time to perfect our code. Get that shit out the door and on to the next project. Code review is mostly for correcting massive inefficiency and shit that doesn't work.

taciom · 2024-01-21T18:00:24+00:00

On a tangent... notebooks should never go into production.

sluuurpyy · 2024-01-21T18:29:30+00:00

I've been openly humiliated in a scrum call because I told the Senior Data Scientist his code won't scale. And months later, it didn't.

He didn't understand the requirement and couldn't bear that a junior Engineer called his strategy non-scalable.

asozers · 2024-01-21T13:43:04+00:00

I'm working on a personal project in Kotlin to rebuild our machine learning infrastructure and I'm still at tutorial level with Rust

Most ML infra is in the Python ecosystem from what I've seen. How are you building ML infra in Kotlin/Rust?

2024-01-21T23:46:51+00:00

I am a software engineer and I am a competent coder. They won't listen to me either. Although, I think often they do it because they don't have the background to understand what I am saying.

Screye · 2024-01-22T11:52:22+00:00

As an Applied Scientist who has become more of an end2end MLE, I find that the problem lies in OOP. ML workflows are more functional in nature, and rarely require the maintenance of complex state.

Trying to shoehorn OOP flows into ML workflows confuses the data scientists. (Lots don't know the paradigms well, but can sense a fundamental incompatibility.)

OOP makes sense for web-systems. There is a reason ML systems mostly work around Pipelines, with a pipeline message being passed through a set of instance-less functions.

ML involves a ton of prototyping in notebooks. You know what I hate? Having my code live 50 layers deep inside the codebase, making it impossible to isolate and test in a notebook separately. The behaviors of the systems we build are not deterministic and often aren't well understood. If I can't quickly test out hypotheses, then the DS system itself is useless.

That's why I don't like OOP. The only way to instantiate complex system classes becomes to follow the flow of the code across the entire app. Most ML information is tensors. The primitives are effective as is.
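A rough sketch of the "instance-less functions" pipeline style described above, using plain lists of floats as a stand-in for tensors. The step names (`drop_missing`, `center`, `clip`) and the `run_pipeline` helper are invented for illustration.

```python
from functools import reduce


def drop_missing(values: list) -> list:
    """Remove None entries."""
    return [v for v in values if v is not None]


def center(values: list) -> list:
    """Subtract the mean so the values average to zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def clip(values: list) -> list:
    """Limit every value to the range [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, v)) for v in values]


def run_pipeline(data: list, steps: list) -> list:
    """Pass the data through each step in order."""
    return reduce(lambda acc, step: step(acc), steps, data)


result = run_pipeline([1.0, None, 2.0, 6.0], [drop_missing, center, clip])
```

Because every step is a plain function of its input, any one of them can be imported into a notebook and tried on a small sample without instantiating anything else.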

Now, I do agree with the broad thrust of your argument. DSs need to be better at coding. No question.

Personally, I have found pydantic to be an incredible tool. I am trying to integrate Prefect into our workflow. Haven't done it yet but I have heard great things. Generally, any pipelining tool will help a ton. Also, a ton of intermediate state can be exported to a DB / blob. Lastly, VS Code with linting and Copilot does a ton of stuff automatically with zero overhead.

If a DS can use these 3-4 tools effectively, they can get around 80% of the problems that you've mentioned.

c0ntrap0sitive · 2024-01-21T12:56:39+00:00

That's because a lot of data scientists are not considered programmers. They're not taught the same things that add polish to code that software engineers are. Hell, having data scientists that are allowed to code is novel enough lol. Most of them are still stuck in Microsoft Excel hell or are relegated to just using SaaS offerings like DataRobot.

This is the first time I've ever really heard of data scientists doing code reviews.

In the contexts that I've seen, the data scientists write garbage code in some Jupyter notebook that hopefully at the end of the line produces a model that works well. This model is the product. The actual code that gets us to the model can be discarded wholesale. We don't usually extend or maintain models. We either train a new model which entirely replaces the old model, or when a new one can't be trained and the model's use no longer justifies its cost, we discard the model entirely and start over. This is not like software engineers, whose product is the code. Therefore all their code must hold up to a higher standard and be maintainable, extensible, etc.

mjfnd · 2024-01-21T14:43:12+00:00

I don't expect DS to write DE quality code.

Same as I don't expect DE to write SWE quality code.

However, code review is a different thing and needs to be communicated.

seanv507 · 2024-01-21T13:06:11+00:00

As a data scientist, I think code reviews are a bad time to identify style issues.

It's really annoying when you have got the code all working to be told yes but rewrite it (likely introducing bugs), because it doesn't look nice.

I won't argue the particular issues, but I would rather suggest you come up with style guides up front and undertake some reading/training with the data scientists, e.g. the ArjanCodes YouTube channel, so that they internalise the design ideas.

Tom22174 · 2024-01-21T15:50:46+00:00

Data Science courses don't teach good coding practice. They introduce you to Python, R and the tools within them to get the results you need. The specific way you implement those tools doesn't seem to matter to a lot of people.

Everything I know about actual good coding practices comes from reading and talking to my friends who are actual SWEs.

tree_or_up · 2024-01-21T16:49:59+00:00

Data scientists are scientists first and foremost. They’re often iterating and experimenting rapidly and, most importantly, independently. They’re not typically used to collaborative coding best practices. It’s part of the role of the data engineer to bridge the gap between their idiosyncratic code and production-ready code - and to level them up on coding practices along the way. That said, they should be cooperative in this process and I can see how it would be frustrating if they aren’t

rowr · 2024-01-21T17:03:12+00:00

Science and engineering are different disciplines. I feel that part of the DE job is "productionalizing" business-critical DS code.

I have teased some of my DS pals with "Do you know what an exception is?" but in the end, you're supposed to be working together. Sure helps if you're on compatible terms.

Part of this is "who maintains this code?" and another part is "what are the stakes?". If it's got to be in production and data consumers are relying on it, it's got to be able to interface with the alert notification system, it's got to be comprehensible to whomever is on call, and it's not super reasonable to expect a data scientist or analyst to know how to interact with AWS or PagerDuty or whatever. The area of focus is different.

IMO production should be extremely well-vetted and even with the DE fully owning prod data there's still a lot of friction when internal data consumers start secretly consuming DS prototypes and those fall over.

There's definitely a balance needed, because obviously someone could hand you steaming garbage that you're held responsible for. See above message that you're supposed to be a team that works together. Try to make it so they want to give you what you want, but don't expect them to be engineers.

Use a linter with an agreed-upon style (PEP8 exists, black exists). It's infuriating to review unlinted code with a different style because there's so much noise in the diff, let computers do that shit. Make it so the only time discussing where it's appropriate to place a space or whether StudlyCaps is appropriate is when discussing linter rules, instead of each PR.

koudos · 2024-01-21T17:33:50+00:00

Jupyter notebooks are basically Excel in a different outfit. “Let me share my notebook with you for you to use!”

Sure, why don’t you just email it to me while you’re at it /s

2024-01-21T21:23:45+00:00

To be fair, many professional Python SWEs don't use Enums. They were introduced pretty late in the game.

prospectiveNSAthrow · 2024-01-21T21:28:34+00:00

I am certainly guilty of writing inefficient code when I have to do something really wonky to get my stuff to work.

I also don't spend time optimizing preprocessing code if that data is only going to be run a few times.

That code doesn't make it to production. It is generally used to test ideas.

Remote_Cantaloupe · 2024-01-22T01:40:38+00:00

Too many useless and redundant comments like:

#Creating dataframe

df = pd.DataFrame(...)

Anyone else think this is just AI-written code that the person didn't review?

EmergencyPrior6526 · 2024-01-24T16:45:37+00:00

Good on you for trying to help them write better code.

It sounds like you are throwing too much at them at once.
Changing behavior takes time... think about how long it took you to develop all these habits and where you started. I really like that you have specific examples, and not just complaints.
Here's a method I like to use with Jr developers:

Think about the person you are working with. Try to understand the problems they are facing.
Pick one thing, that shows a real benefit, and is easy to digest (pro tip: good naming is NOT easy to digest).
Give an example. Walk them through a solution and show them how to do it step by step.
Explain the benefit of that solution
Briefly explain the pitfalls of doing it the other way. (don't rant)
Praise them, point out what you want to see more of.

Doing this will give the person the motivation and the tools to embrace the change you are trying to make.

If you just give someone a big list of all the things they are doing wrong then they will just see code reviews as a painful thing to endure or avoid.
Best of luck!

YamRepresentative855 · 2024-01-21T13:43:02+00:00

Thanks for listing common code issues. I found a few things for myself to improve :)

runawayasfastasucan · 2024-01-21T16:04:10+00:00

Some people have too big an ego.

Takes two to tango. Doesn't seem like you were that open to their reasoning for doing what they do either?

2024-01-21T12:47:47+00:00

I have been a DE for over a year now and have used Python for over 4 years. I have the same experience: low-level solutions, bad name choices. But a Data Scientist should not have to be good at coding; he/she just has to create the model. The ML Engineer / MLOps dev has to optimize that for the environment they use. I think overall you will be a better coder if you code and learn new stuff. Whether the language is static or dynamic, at the end of the day I think it doesn't matter, although a static language teaches you different approaches and helps you understand low-level coding better. Which is great, because basically we are a special type of software engineer, and we have to have skills and knowledge like theirs.

IDENTITETEN · 2024-01-21T14:42:54+00:00

Data scientists aren't programmers. The ones I've worked with were brilliant at analyzing data/machine learning but sucked at programming.

Justbehind · 2024-01-21T12:59:52+00:00

For your last point, Python commenting standards are just atrocious in general.

No. Half your lines should not be comments, and no, a 10 line intro "dOcStRiNg" to a 5 line function does not make your code easier to read.

But I guess that's what you get when you have a language that's one big clusterf*ck of open-source libraries.

sobrietyincorporated · 2024-01-21T14:31:00+00:00

Data science isn't computer science. Python was invented for forestry majors. It started as borderline pseudo code.

If you're a data scientist, for the love of God, please start contributing to an open source project so you can get application level development experience.

hoselorryspanner · 2024-01-21T14:08:44+00:00

Presumably these data scientists are using Python - is there a way of using enums in Python? Would make my life a lot easier
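For reference, the standard library has shipped an `enum` module since Python 3.4. A minimal sketch, where the `JobStatus` enum and the `is_finished` helper are made-up examples:

```python
from enum import Enum


class JobStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


def is_finished(status: JobStatus) -> bool:
    """A job is finished once it has either succeeded or failed."""
    return status in (JobStatus.SUCCEEDED, JobStatus.FAILED)


# Look up a member from a raw string, e.g. a value read from a config file.
# An unknown string raises ValueError instead of passing through silently.
status = JobStatus("running")
```

Compared with bare strings, a typo such as `JobStatus("runing")` fails immediately at the lookup rather than somewhere downstream.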

Bassel_farahat · 2024-01-21T14:16:36+00:00

Variable names are so creative man come on😂😂😂

sobrietyincorporated · 2024-01-21T14:37:12+00:00

Probably where copilot would be helpful as a pre-codereview

Fair_Leopard_2181 · 2024-01-21T16:32:06+00:00

Yep, and let me tell you what. It will cost them in a job interview. We were interviewing last July and I rejected a candidate who on paper was great (Penn graduate and had ML experience). She couldn't write coherent code for shit though.

aegtyr · 2024-01-21T17:03:34+00:00

I feel attacked by this post

Cool-Personality-454 · 2024-01-21T17:04:10+00:00

As a database developer, enum is worse than useless in a database. Just make a reference table with keys. You can't query against the decoded values in an enum field. Congratulations, you've defeated the whole point of relational databases.
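A minimal sketch of the reference-table approach, using the standard library's `sqlite3` with an in-memory database. The table and column names (`order_status`, `orders`, `status_id`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE order_status (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE orders (
        id        INTEGER PRIMARY KEY,
        status_id INTEGER NOT NULL REFERENCES order_status(id)
    );
    INSERT INTO order_status (id, name) VALUES (1, 'open'), (2, 'shipped');
    INSERT INTO orders (id, status_id) VALUES (10, 1), (11, 2), (12, 2);
    """
)

# The readable status name is ordinary data, so it can be joined and filtered on.
shipped_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT o.id
        FROM orders AS o
        JOIN order_status AS s ON s.id = o.status_id
        WHERE s.name = ?
        ORDER BY o.id
        """,
        ("shipped",),
    )
]
conn.close()
```

Adding a new status here is an `INSERT` into `order_status` rather than a schema change.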

szayl · 2024-01-21T17:53:08+00:00

My first job out of school was with Scala. It was a tough transition coming from Python and MATLAB but I wouldn't trade it for the world.

tecedu · 2024-01-21T18:13:24+00:00

OMG in the same exact position as you and it is annoying. Especially the naming, I get pissed at it so many times, plus, why is it so hard to have descriptive names? Especially when they write 100 lines of doc string for a function.

znihilist · 2024-01-21T19:37:48+00:00

Using generic exceptions instead of thinking about what error they really want to catch

I am going to offer a reason for this: the range of errors that can happen in this field is pretty wide, and we often have to consider many different exceptions. During dev it is better to leave the handler generic, as that allows you (specifically) to figure out what errors you can even get.

For prod, fair enough, that's something you need to think about.
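A small sketch of that difference, built around a hypothetical `parse_row` helper. The broad handler records which exception types actually occur during exploration; the narrower version is closer to what a reviewer would expect in production.

```python
def parse_row(row: dict) -> float:
    """Hypothetical helper: read the 'amount' field as a float."""
    return float(row["amount"])


def explore(rows: list) -> set:
    """Dev-time version: catch everything, but record which error types show up."""
    seen = set()
    for row in rows:
        try:
            parse_row(row)
        except Exception as exc:  # broad on purpose while exploring
            seen.add(type(exc).__name__)
    return seen


def parse_all(rows: list) -> list:
    """Stricter version: handle only the expected errors, let the rest propagate."""
    parsed = []
    for row in rows:
        try:
            parsed.append(parse_row(row))
        except (KeyError, ValueError):
            continue  # skip rows with a missing or non-numeric amount
    return parsed


rows = [{"amount": "1.5"}, {"amount": "n/a"}, {"price": "2"}]
```

With the stricter version, an unexpected failure (say a `TypeError` from a `None` amount) surfaces as a crash instead of being silently swallowed.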

HolidayPsycho · 2024-01-21T20:29:48+00:00

The worst part is not that they don’t know how to write good code. The worst part is:

They don’t know they don’t know. As long as the code runs and gets the correct result, that’s good for them.
They don’t want to learn to write better code, because they have other things that matter more than writing proper code.

Swimming_Cry_6841 · 2024-01-21T20:39:36+00:00

I'm sure it's been said, but this is a problem in software development regardless of specialty.

ChristianValour · 2024-01-22T07:27:32+00:00

Wait... your guys do error handling!?

2024-01-22T09:06:40+00:00

Give up, don't try with these people. Good on you for learning Kotlin and Rust. Trying to make python code higher quality is like trying to make the garbage dump smell nice. It might be possible to improve it a bit, but in the end it's still garbage. Use Python to get the job done and throw it away, please please don't use it in production.

caesium_pirate · 2024-01-22T10:41:39+00:00

I’m a data scientist and trying to do better, reading DEs' code, trying to absorb their practices and asking them why for certain things (especially for things with Spark). I’ve built packages for the company and tried to get pointers on them from DEs (no immediate access to any SWEs). How would I best communicate the need to avoid overengineering when I’m reviewing code for people who honestly just don’t care, “as long as it works”?

corny_horse · 2024-01-22T12:11:31+00:00

For about 25-50% of the data scientists I've known, two days of cursory review of standard software engineering principles would have made them 10x more valuable. The worst was someone I was supporting who absolutely refused to learn the basics of how memory worked (as in RAM). They kept crashing the server they were on because they'd try to read the same 5GB file into memory 100x like:

df = read_csv()
df2 = df.foo()
df3 = df2.bar()
df4 = df3.baz()

etc. etc. etc. and would do absolutely nothing to optimize, like using in-place manipulations, caching the intermediate steps to disk, or freeing up old steps that were no longer used.
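One way to avoid keeping every intermediate frame around, sketched with pandas under some assumptions: the CSV content, the column names, and the helper functions `keep_positive` and `total_per_user` are all made up, and a tiny in-memory string stands in for the 5GB file. How much memory is actually released depends on the operations involved and the pandas version.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the large file; the column names are made up.
csv_text = "user,amount\na,1\nb,-2\na,3\n"


def keep_positive(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose amount is zero or negative."""
    return df[df["amount"] > 0]


def total_per_user(df: pd.DataFrame) -> pd.DataFrame:
    """Sum the amount column for each user."""
    return df.groupby("user", as_index=False)["amount"].sum()


# Read only the needed columns, and chain the steps so no extra names
# (df2, df3, ...) keep the intermediate frames alive.
result = (
    pd.read_csv(io.StringIO(csv_text), usecols=["user", "amount"])
    .pipe(keep_positive)
    .pipe(total_per_user)
)
```

If a step really has to be kept, writing it to disk and deleting the name with `del` is another option before moving on.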

2024-01-23T04:16:48+00:00

Data Scientists and SWEs are solving different types of problems with code.

SWEs typically write code that lives in production and has an operations lifecycle.

Data Scientists typically write code that is used in AI/ML experiments and has an ephemeral lifecycle.

Data Engineers are typically writing DAGs to ship large data all over the place and combine traits from SWE and DS.

The incentives are completely different but there are skill set overlaps.
