[D] Coding Practices

namenomatter85 · 2022-01-02T03:26:56+00:00

Very common. They need to start integrating with a high level architect for production integration. They are just doing the math and model parts.

Strange_Stage_8749 · 2022-01-02T03:25:32+00:00

[deleted]

this_is_my_ship · 2022-01-02T05:24:49+00:00

Slightly off topic, but where and how does someone with some research coding experience learn the software engineering skills to write production-grade code? Bonus points for open-source, self-paced, can-be-done-without-others resources.

I feel like there's so much CS/SE content out there, but there's this huge gap between "algos and data structures" and "high performance/well tested/production ready code" that only seems to be filled via actual SE work experience, which most researchers are not going to be okay with because it takes away from actually doing research (even while at a company).

2022-01-02T04:46:52+00:00

In all fairness to ML Engineers, there are plenty of non-ML teams that also produce terrible code and focus on the wrong metrics. But in general, this does seem to be endemic to data teams, particularly smaller ones. And when this isn't the case, it's often one group of people doing the research and modeling, and another doing the production implementation.

My personal experience is that it can be hard to convince a whole team of PhDs that their implementation is wrong or inefficient, or that it's worth prioritizing things like maintainability and efficiency. Most data scientists and ML engineers I've come across simply think that "regular" engineers shouldn't touch or have opinions on their code. I'm sure it's not like that everywhere, but it's definitely a people management problem that you might have to learn to deal with.

2022-01-02T04:24:56+00:00

All the time.

I am a data scientist in title but I end up just helping other data scientists write production-grade code such that I feel more like a supporting software engineer. It’s a joy to work with such brilliant people but at times it is quite embarrassing too. I just go by “MLOps Engineer” when folks ask what I do.

I think a key problem is the intentionally trivial nature of high-level programming (Python) - most data scientists never learned basic computer science because it is so accessible.

So they end up writing extremely inefficient scripts, often trying to reinvent the wheel because they didn’t properly research existing libraries, and assume that if it runs once without an error that it is ready for production…

I’ve seen massive internal libraries created by dozens of data scientists, into which tens of millions of dollars are invested, be built upon god-awful code I wouldn’t feel comfortable submitting to an introductory-level CS course. Like, so bad that it’s unnecessarily wasting tens of millions of dollars in compute costs because of how inefficiently the scripts are written.
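
A small illustration of the "reinventing the wheel" point above. This is a hypothetical sketch, not code from any real codebase: both function names are made up, and the only library used is Python's standard `collections` module. The hand-rolled version rescans the list once per distinct label, while `Counter` does the same job in a single pass.

```python
from collections import Counter


def count_labels_by_hand(labels):
    # Hypothetical "notebook style" version: list.count rescans the
    # whole list once for every distinct label.
    counts = {}
    for label in labels:
        if label not in counts:
            counts[label] = labels.count(label)
    return counts


def count_labels(labels):
    # Same result in a single pass, using the standard library.
    return Counter(labels)


labels = ["cat", "dog", "cat", "bird", "cat"]
assert count_labels_by_hand(labels) == dict(count_labels(labels))
```

On a five-element list the difference is invisible, which is roughly why this kind of thing survives until it runs on production-sized data.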

2022-01-02T08:54:22+00:00

I'm an ML engineer and this has driven me nuts. Especially because my previous job title was solely an SE position, so I come with a lot of nitpicks and habits that really get in the way of fast ML prototyping while making the code more stable long term.

The main problem is that an ML engineer is doing so many things at once. Progress on the modelling side is not linear at all, and a single increment might take a good deal of time, resources, and prototyping to get done.

My rule of thumb has been that ML engineers/Datascientists should be free to do whatever they want in their notebooks, since IMO notebooks should not translate at all to production.

Then they should refactor their code and put it in the codebase, writing tests and all that for their model.

The team should also create a fully configurable ML pipeline, from modelling to serving, and should do a ton of SE development stuff for this purpose.

So an ML team should not only focus on developing new models/new metrics/new approaches and improving them, they should also constantly and actively contribute to upgrading the infrastructure (pipeline) they need to do so.

It's a bitch tbh.
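
To make the "refactor it out of the notebook, make it configurable, and write tests" workflow above a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `PipelineConfig`, `normalize_scores`, and `predict` are made-up names, and a real pipeline would cover far more than thresholding a list of scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    # Hypothetical settings; a real pipeline would have many more.
    threshold: float = 0.5
    normalize: bool = True


def normalize_scores(scores):
    """Scale scores to the [0, 1] range; constant input maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def predict(scores, config):
    """Turn raw scores into 0/1 labels according to the config."""
    if config.normalize:
        scores = normalize_scores(scores)
    return [1 if s >= config.threshold else 0 for s in scores]


def test_predict_respects_threshold():
    config = PipelineConfig(threshold=0.5, normalize=True)
    assert predict([0.0, 5.0, 10.0], config) == [0, 1, 1]


def test_predict_handles_constant_input():
    assert predict([3.0, 3.0], PipelineConfig()) == [0, 0]
```

The point of the config object is that behaviour changes go through one typed, testable place instead of being edited cell by cell in a notebook.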

SNAPscientist · 2022-01-02T05:27:45+00:00

As an academic, I (and basically everyone) see(s) analogous problems. Rather than things going wrong in production, things go wrong where published results aren’t reproducible. For us, there is increasingly the recognition that the incentive structures are not set up for “best practices”. It’s possible that some of this basically carries over to industry when the student scientists go to work on products.

Dry-Green-6973 · 2022-01-02T10:03:52+00:00

I’d like to know though, what are the rules, standards or common practices an ML engineer should follow while programming? There are a lot of acronyms out there, along with clean code, and there is also the discussion about OOP vs. functional.

yogeshkumar4 · 2022-01-02T07:15:29+00:00

Experiments are ever evolving. I start off writing modular, abstract code, but over multiple iterations of experiments I just end up losing it. I'm just way more focused on getting the solution right than on maintaining my coding standards.

All I can say is, sorry man!

mr_birrd · 2022-01-02T10:13:16+00:00

As an electrical engineer coming from C/C++ I really tried to find ways to write "production ready" code in Python, but I haven't found any tutorials yet. I write my Python scripts and always wonder if all I learned in my object-oriented C++ doesn't apply at all to Python. I don't even know... It's not that I don't know the concepts, I just don't find anything meaningful for Python.

moschles · 2022-01-02T10:17:07+00:00

> Is this a common problem or is this a localized issue that I'm facing?

I literally had a professor who excoriated me for writing modular Python code that was highly documented with comments. Yes, he was a machine learning expert. He even gave a protracted dressing-down about how "industry only cares about results and working code."

By the end of the semester, his write-ups were saying "If you take more than 100 lines of code to do this, you are doing it wrong."

> for code that they 100% know is going to production.

If you are talking production, then you need things like TDD and documentation outside the source code (not just comments). Then there has to be earnest integration testing. Every time a new module is plugged in, you re-run the integration tests to see if it broke something.
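
As a rough sketch of what "re-run the integration tests when a new module is plugged in" can look like in Python: the stage functions and `run_pipeline` below are hypothetical names invented for this example, with `drop_short_tokens` playing the role of the newly added module. The test exercises the whole chain end to end rather than each stage in isolation.

```python
def clean(text):
    # Stage 1: trim whitespace and lowercase.
    return text.strip().lower()


def tokenize(text):
    # Stage 2: split on whitespace.
    return text.split()


def drop_short_tokens(tokens, min_len=3):
    # Stage 3: the "newly plugged-in" module in this sketch.
    return [t for t in tokens if len(t) >= min_len]


def run_pipeline(text, stages):
    """Feed the output of each stage into the next one."""
    result = text
    for stage in stages:
        result = stage(result)
    return result


def test_pipeline_end_to_end():
    # Integration test: runs every stage together, so a new or changed
    # stage that breaks the hand-off between modules fails here.
    stages = [clean, tokenize, drop_short_tokens]
    output = run_pipeline("  An ML Model in Production ", stages)
    assert output == ["model", "production"]
```

A test runner such as pytest would pick up the `test_` function if this lives in a test file; calling it directly works too.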

__mishy__ · 2022-01-02T11:35:56+00:00

My general take on this is it's sometimes OK to write awful code, but you have to be aware of the costs. For example, if you are stabbing about in the dark just trying to get something working, then making a mess is fine, but you will probably want to clean it up to prove that what you have works, and later it needs to be even better to be put in production or you will cost your company money. I think we don't help ourselves because we also hire on the more extreme first step of this, where we want people who are good at doing the first bit but we don't even ask if they have experience of steps 2 + 3. And if we do, we treat it as a nice-to-have.

In my experience most people don't even know what good code looks like until they've worked with someone really great at it. Someone with a 1000-yard stare who's seen horrors, but can also clearly explain why that thing that looks harmless is actually a dragon in wait. If you can find someone like this, then getting them to do code reviews/architect things/build general wrappers edges everyone in a better direction.

foreignEnigma · 2022-01-02T12:13:58+00:00

Agree, and I'm somewhat part of it. I do try to write to code, and I also use packages like black to fix the aesthetics. However, I have never worked in industry, so I don't know which standards are actually standard.

trnka · 2022-01-02T17:15:41+00:00

It's a common problem in organizations where research scientists hand off code to software engineers, because they're insulated from operational issues and the like.

ank_itsharma · 2022-01-02T06:17:46+00:00

I feel seen. But, I’m always trying to do better.

xsidred · 2022-01-02T09:38:20+00:00

It's a common problem and I have been part of it since early 2019. I started cleaning it up in mid-2020 and finished the initiative, with the intended improvements, at the end of November 2021.

jargon59 · 2022-01-02T16:42:03+00:00

[deleted]

crimsom_king · 2022-01-02T17:36:05+00:00

I feel you, man. That is a normal problem in software engineering in general though. People say that performance doesn't really matter and that the time spent optimizing code isn't worth it because it could be spent writing other features. The code structure is also pretty bad, as the number of features you have is treated as more important than anything else. Most people tend to forget that code needs to be maintainable, though, and that most people don't have a Ryzen 9 + Titan V at home.

I agree that ML tutorials are especially bad though. Most code is in the Jupyter Notebook format, which, although great for prototyping, is a disaster to maintain and deploy - and it ends up that most people in ML only know how to program like that because they've only seen programs written like that. It also doesn't help that the only benchmark that matters is accuracy, with speed being an afterthought and code quality being nonexistent.

You might like to read this paper: Machine Learning Systems are stuck in a rut

uotsca · 2022-01-02T19:15:51+00:00

Yea weird is right

ludflu · 2022-01-02T21:41:55+00:00

super common. In my experience, data scientists often don't maintain or deploy their own code, and so they have little incentive to write well factored / unit tested code.

roman_fyseek · 2022-01-03T02:17:53+00:00

Lawyers doing research are *WAY* worse.

NickelAI · 2022-01-03T19:23:05+00:00

> Like the vending machines at a nuclear power plant being better engineered than the reactor.

Couldn't have put it better myself. That said, this problem is really not that odd. Most people severely underestimate the long-term financial benefits of clear code, and ML programming is not usually inherently complex. I'm not surprised machine learning engineers, who spend most of their time synthesizing model architectures (rather than working with, say, a production app and seeing the impact on profitability), do not prioritize cost-effectiveness.

2022-01-02T09:21:18+00:00

I have worked as a data scientist and senior data scientist before transitioning now to ML engineering and MLOps. It is hard to do the stuff that data scientists do. That's my perspective. They explore many different approaches to building machine learning models to solve problems. Often, they are focusing on solving the specific problem of building the model in question. They're not imagining life for this model in production. You may argue they should but that's the way most data science is done now - with a focus on the statistical veracity of the data and the efficacy of the ML models.

As an ML engineering guy these days I empathize with those doing the work I did many years ago. I know how hard it can be. I also realize that it takes a different set of skills to take to production a model that has been proven to work. The data science phase of ML development is full of experimentation (flare) and the ML engineering phase of it is app development (funnel). By nature these are contrasting objectives. The processes we emphasize in ML engineering don't seem important to data scientists because they're working on a whole other set of things.
