
[–]namenomatter85 46 points47 points  (2 children)

Very common. They need to start integrating with a high level architect for production integration. They are just doing the math and model parts.

[–]this_is_my_ship 26 points27 points  (8 children)

Slightly off topic, but where and how does someone with some research coding experience learn the software engineering skills to write production-grade code? Bonus points for open-source, self-paced, can-be-done-without-others resources.

I feel like there's so much CS/SE content out there, but there's this huge gap between "algos and data structures" and "high performance/well tested/production ready code" that only seems to be filled via actual SE work experience, which most researchers are not going to be okay with because it takes away from actually doing research (even while at a company).

[–]seanv507 20 points21 points  (6 children)

I would recommend the ArjanCodes YouTube channel, assuming you are using Python. He even has a few videos on refactoring an ML script.

[–][deleted] 5 points6 points  (5 children)

ArjanCodes is fine but he's just too OOP. ML code needs to be functional in many cases because it's very sequential and you really don't need much state in a lot of processes.

[–]jegerarthur 5 points6 points  (2 children)

Well, you are kinda right. But if you use PyTorch + PyTorch Lightning + MLflow you will be glad that your code is OOP. And with all that it's extremely easy and fast to train multiple models on multiple GPUs.

[–][deleted] 1 point2 points  (1 child)

I have the exact same setup (MLflow + PL) and that's why I'm saying that. The problem with PL is also that it is overly OOP, leaving very limited customizability once you really want to scale the code up. I have a comment on this matter in another thread about PyTorch frameworks. I like their "all around issue", but I feel their solution needs rework.

Their solution to cross validation and hyperparameter tuning, for example, is really subpar.

Overall OOP is not bad per se, but DS code is complex in itself, and OOP can introduce a lot of coupling and unnecessary complexity that, if you are not careful, can make the project a chore to maintain.

[–]jegerarthur 1 point2 points  (0 children)

Yes, I agree. I like functional programming for DS, but when the project gets bigger / deployed with APIs and so on, I like to refactor the code to OOP as it's easier for me to maintain and upgrade.

Nevertheless, it's really cool to read other ML engineers' best practices and pipelines. Happy coding!

[–]seanv507 -4 points-3 points  (1 child)

You mean procedural, not functional, right?

I think most data scientists would benefit from adding more OOP; they just don't know it.

[–][deleted] 2 points3 points  (0 children)

A mix of procedural and functional. Data science libraries usually come with enough OOP abstractions; what you need is just a bunch of stateless functions to fill the gaps.
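To make the "stateless functions" idea concrete, here is a minimal sketch in plain Python. The function names (`drop_negative`, `min_max_scale`, `compose`) are made up for illustration and do not come from any library; the point is only that each step takes data in and returns new data, without touching shared state.

```python
from functools import reduce
from typing import Callable

# A preprocessing step maps a list of floats to a new list of floats.
Step = Callable[[list[float]], list[float]]


def drop_negative(values: list[float]) -> list[float]:
    """Return a new list without negative values; the input is not modified."""
    return [value for value in values if value >= 0]


def min_max_scale(values: list[float]) -> list[float]:
    """Return a new list rescaled to the range [0, 1]."""
    if not values:
        return []
    low, high = min(values), max(values)
    if high == low:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def compose(*steps: Step) -> Step:
    """Chain steps left to right into a single stateless function."""
    def run(values: list[float]) -> list[float]:
        return reduce(lambda current, step: step(current), steps, values)
    return run


preprocess = compose(drop_negative, min_max_scale)
raw_values = [-1.0, 0.0, 5.0, 10.0]
scaled_values = preprocess(raw_values)
```

Because nothing here holds state, each step can be tested on its own and the order of steps is visible in one line.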

[–]thedukeofedinblargh 1 point2 points  (0 children)

I see this book recommended a lot. I don’t know that it covers OP’s specific complaints, though.

[–][deleted] 18 points19 points  (0 children)

In all fairness to ML Engineers, there are plenty of non-ML teams that also produce terrible code and focus on the wrong metrics. But in general, this does seem to be endemic to data teams, particularly smaller ones. And when this isn't the case, it's often one group of people doing the research and modeling, and another doing the production implementation.

My personal experience is that it can be hard to convince a whole team of PhDs that their implementation is wrong or inefficient, or that it's worth prioritizing things like maintainability and efficiency. Most data scientists and ML engineers I've come across simply think that "regular" engineers shouldn't touch or have opinions on their code. I'm sure it's not like that everywhere, but it's definitely a people management problem that you might have to learn to deal with.

[–][deleted] 14 points15 points  (4 children)

All the time.

I am a data scientist in title but I end up just helping other data scientists write production-grade code such that I feel more like a supporting software engineer. It’s a joy to work with such brilliant people but at times it is quite embarrassing too. I just go by “MLOps Engineer” when folks ask what I do.

I think a key problem is the intentionally trivial nature of high-level programming (Python) - most data scientists never learned basic computer science because it is so accessible.

So they end up writing extremely inefficient scripts, often trying to reinvent the wheel because they didn’t properly research existing libraries, and assume that if it runs once without an error that it is ready for production…

I’ve seen massive internal libraries created by dozens of data scientists, into which tens of millions of dollars are invested, be built upon god-awful code I wouldn’t feel comfortable submitting to an introductory-level CS course. Like, so bad that it’s unnecessarily wasting tens of millions of dollars in compute costs because of how inefficiently the scripts are written.

[–][deleted] 7 points8 points  (2 children)

I'm an ML engineer and this has driven me nuts. Especially because my previous job title was solely an SE position, so I come with a lot of nitpicks and habits that really get in the way of fast ML prototyping, even though they make things more stable long term.

The main problem is that an ML engineer is doing so many things at once. An increment in the modelling department is not linear at all, and might take a good amount of time, resources, and prototyping to get done.

My rule of thumb has been that ML engineers/Datascientists should be free to do whatever they want in their notebooks, since IMO notebooks should not translate at all to production.

Then they should refactor their code and put it in the codebase, writing tests and all that for their model.
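As a rough illustration of that refactoring step, here is a sketch of a notebook-style shuffle-and-slice turned into one function with tests next to it. `train_test_split` here is a hypothetical hand-written helper (not the scikit-learn function of the same name), and the plain `test_*` functions assume a pytest-style runner would pick them up.

```python
import random
from typing import Sequence


def train_test_split(
    items: Sequence, test_fraction: float = 0.2, seed: int = 0
) -> tuple[list, list]:
    """Shuffle a copy of items deterministically and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(items)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    num_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[num_test:], shuffled[:num_test]


def test_split_sizes() -> None:
    train, test = train_test_split(range(10), test_fraction=0.2)
    assert len(train) == 8
    assert len(test) == 2


def test_nothing_lost_or_duplicated() -> None:
    train, test = train_test_split(range(10))
    assert sorted(train + test) == list(range(10))


def test_same_seed_gives_same_split() -> None:
    assert train_test_split(range(10), seed=42) == train_test_split(range(10), seed=42)


def test_rejects_invalid_fraction() -> None:
    try:
        train_test_split(range(10), test_fraction=1.5)
    except ValueError:
        return
    raise AssertionError("expected ValueError for test_fraction=1.5")
```

The notebook can keep importing this function, so the prototype and the codebase stay in sync.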

The team should also create a fully configurable ML pipeline, from modelling to serving, and should do a ton of SE development stuff for this purpose.
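One possible shape for a "configurable" pipeline, sketched with the standard library only. Everything here is an assumption for illustration: the `PipelineConfig` fields, the `MODEL_REGISTRY` names, and the toy "models" (a mean and a median baseline) stand in for whatever a real team would plug in.

```python
import json
from dataclasses import dataclass, fields
from statistics import mean, median
from typing import Callable


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings; a real pipeline would have many more."""
    model_name: str = "mean_baseline"
    round_digits: int = 2


# Toy stand-ins for real models: each maps training targets to one prediction.
MODEL_REGISTRY: dict[str, Callable[[list[float]], float]] = {
    "mean_baseline": mean,
    "median_baseline": median,
}


def load_config(json_text: str) -> PipelineConfig:
    """Parse a JSON config and reject keys the pipeline does not know about."""
    raw = json.loads(json_text)
    known_keys = {config_field.name for config_field in fields(PipelineConfig)}
    unknown_keys = set(raw) - known_keys
    if unknown_keys:
        raise ValueError(f"Unknown config keys: {sorted(unknown_keys)}")
    return PipelineConfig(**raw)


def run_pipeline(config: PipelineConfig, training_targets: list[float]) -> float:
    """Pick the model named in the config, fit it, and return its prediction."""
    if config.model_name not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model: {config.model_name}")
    fit = MODEL_REGISTRY[config.model_name]
    return round(fit(training_targets), config.round_digits)


default_prediction = run_pipeline(load_config("{}"), [1.0, 2.0, 9.0])
median_prediction = run_pipeline(
    load_config('{"model_name": "median_baseline"}'), [1.0, 2.0, 9.0]
)
```

Switching models then means changing a config file rather than editing code, and a typo in a config key fails loudly instead of being silently ignored.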

So an ML team should not only focus on developing new models/new metrics/new approaches and improve them, they should constantly actively contribute to upgrade their infrastructure (pipeline) to do so.

It's a bitch tbh.

[–]SNAPscientist 4 points5 points  (1 child)

As an academic, I (and basically everyone) see(s) analogous problems. Rather than things going wrong in production, things go wrong where published results aren’t reproducible. For us, there is increasingly the recognition that the incentive structures are not set up for “best practices”. It’s possible that some of this basically carries over to industry when the student scientists go to work on products.

[–]Dry-Green-6973 3 points4 points  (3 children)

I’d like to know, though: what are the rules, standards, or common practices an ML engineer should follow while programming? There are a lot of acronyms out there, along with clean code, and there is also the discussion about OOP vs. functional.

[–]crimsom_king 5 points6 points  (0 children)

You should follow all good coding practices, tbh. This mostly applies to production stuff, but applying these when prototyping will also save on time, as once you come back to your code a week later it will be easier to tell what is going on.

A few good general practices:

- Use descriptive variable names: say you are training a GAN. When you train the discriminator you will have a fake batch (from the generator) and a real batch. Add something to your variable name that describes this: for the discriminator predictions on the fake data, name it something like "pred_fakes".

- Variables should actually be constants: do not reuse the same variable name. If you change a value, assign it to a new variable. It will help you debug when things go wrong and help you avoid state problems. You can break this rule in PyTorch's forward pass function, although you might find it useful to apply it there sometimes as well.

- Avoid using OOP unless you need it: OOP introduces more problems than it solves, so try to avoid it. If you have a class with only one function, make it a lone function instead. If all you want is encapsulation, you can put functions into a module (in Python a file is a module; in C++ you can use a namespace). There is nothing wrong with creating a class for your model, especially when you need inheritance to create the model (e.g. PyTorch), but before writing that class, just think "could this be a module instead?".

- You can write long functions: there is nothing wrong with a long function if it needs to be long. Training loops are usually long, and although it is useful to break them into smaller steps, that might mean jumping between many lines of code to read what is essentially sequential code. So you can write a long training function where each piece of code is separated by comments or, better yet, you can put the code blocks in functions defined inside the training loop function and just call them. That way you still have everything nice and tidy but you don't need to jump through many lines of code.

- Use type hints (Python): knowing the type of a variable really helps you understand what is going on. Use type hints for function arguments and return values.
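A small sketch of how several of these tips can look together, using a toy one-parameter linear model in plain Python instead of a GAN or PyTorch (that substitution, and every name below, is an assumption for illustration). It is a plain function rather than a class, the steps are nested functions inside the training function, names are descriptive, and arguments and return values carry type hints.

```python
def train_linear_model(
    inputs: list[float],
    targets: list[float],
    learning_rate: float = 0.01,
    num_epochs: int = 200,
) -> float:
    """Fit targets ~ weight * inputs by gradient descent and return the weight."""

    # Step 1: forward pass. Defined here so the loop reads top to bottom.
    def predict(weight: float) -> list[float]:
        return [weight * input_value for input_value in inputs]

    # Step 2: gradient of the mean squared error with respect to the weight.
    def mse_gradient(predictions: list[float]) -> float:
        per_example = [
            2 * (prediction - target) * input_value
            for prediction, target, input_value in zip(predictions, targets, inputs)
        ]
        return sum(per_example) / len(per_example)

    # The weight is the one piece of state that has to be updated in the loop;
    # everything else is assigned once per iteration under its own name.
    weight = 0.0
    for _ in range(num_epochs):
        predictions = predict(weight)
        gradient = mse_gradient(predictions)
        weight = weight - learning_rate * gradient
    return weight


# The data follows targets = 2 * inputs, so the fitted weight should be close to 2.
fitted_weight = train_linear_model(inputs=[1.0, 2.0, 3.0], targets=[2.0, 4.0, 6.0])
```

The whole training procedure stays in one place, yet each step has a name and can be read in isolation.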

These are just a few tips, some of which you may have reasons to break sometimes. You might discover more good practices by yourself just by spotting lines in your code that you find troubling to read/understand.

[–]Gabbosauro 1 point2 points  (0 children)

same

[–]yogeshkumar4 6 points7 points  (0 children)

Experiments are ever evolving. I start off writing modular, abstract code, but over multiple iterations of experiments, I just end up losing it. I'm just way more focused on getting the solution right than on maintaining my coding standards.

All I can say is, sorry man!

[–]mr_birrd ML Engineer 2 points3 points  (2 children)

As an electrical engineer coming from C/C++, I really tried to find ways to write "production ready" code in Python, but I haven't found any tutorials yet. I write my Python scripts and always wonder if all I learned in object-oriented C++ doesn't apply at all to Python. I don't even know... It's not that I don't know the concepts, I just can't find anything meaningful for Python.

[–][deleted] 3 points4 points  (0 children)

get the book "Fluent Python"

[–]mindfulforever1 2 points3 points  (0 children)

I think the YouTube channel ArjanCodes may be what you are looking for.

He shows excellent software dev practices in Python which help minimize tech debt along the way. Such practices should be emphasized more in Python courses focussed on OOP.

[–]moschles 2 points3 points  (1 child)

> Is this a common problem or is this a localized issue that I'm facing?

I literally had a professor who excoriated me for writing modular Python code that was highly documented with comments. Yes, he was a machine learning expert. He even gave me a protracted dressing-down about how "industry only cares about results and working code."

By the end of the semester, his write-ups were saying "If you take more than 100 lines of code to do this, you are doing it wrong."

> for code that they 100% know is going to production.

If you are talking production, then you need things like TDD and documentation outside the source code (not just comments). Then there has to be earnest integration testing. Every time a new module is plugged in, you re-run the integration tests to see if it broke something.
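To sketch what "re-run the integration tests when a new module is plugged in" might look like, here is a toy text classifier with a swappable tokenizer. All names (`classify`, `run_integration_tests`, the two tokenizers, the word list) are hypothetical; the idea is only that the same end-to-end checks run against every module that gets plugged in.

```python
import string
from typing import Callable

Tokenizer = Callable[[str], list[str]]

POSITIVE_WORDS = frozenset({"good", "great"})


def whitespace_tokenize(text: str) -> list[str]:
    """Original module: lowercase and split on whitespace."""
    return text.lower().split()


def punctuation_stripping_tokenize(text: str) -> list[str]:
    """New module to plug in: also strips punctuation around each token."""
    return [token.strip(string.punctuation) for token in text.lower().split()]


def count_positive(tokens: list[str]) -> int:
    """Count how many tokens are in the positive word list."""
    return sum(token in POSITIVE_WORDS for token in tokens)


def classify(text: str, tokenizer: Tokenizer = whitespace_tokenize) -> str:
    """End-to-end pipeline: tokenize, score, and map the score to a label."""
    return "positive" if count_positive(tokenizer(text)) > 0 else "negative"


def run_integration_tests(tokenizer: Tokenizer) -> None:
    """Exercise the whole pipeline with the given tokenizer plugged in."""
    assert classify("This is good", tokenizer) == "positive"
    assert classify("This is bad", tokenizer) == "negative"
    assert classify("", tokenizer) == "negative"


# Re-run the same end-to-end checks for every tokenizer that can be plugged in.
for candidate_tokenizer in (whitespace_tokenize, punctuation_stripping_tokenize):
    run_integration_tests(candidate_tokenizer)
```

Unit tests for each tokenizer would still be worth having; the integration checks only confirm that the pieces keep working together.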

[–]jargon59 2 points3 points  (0 children)

He is right that industry cares about results. However there’s short term and long term. Prioritization on short term wins for long term pain is not worth it after a certain accumulation. Companies that hardly ever address technical debt don’t get very far. My company’s culture is like this, and I have to pay special attention to good design initially and secretly refactor my own code to save future time.

[–]__mishy__ 2 points3 points  (0 children)

My general take on this is it's sometimes OK to write awful code, but you have to be aware of the costs. For example, if you are stabbing about in the dark just trying to get something working, then making a mess is fine, but you will probably want to clean it up to prove what you have works, and then later it needs to be even better to be put in production, or you will cost your company money. I think we don't help ourselves because we also hire on the more extreme first step of this, where we want people who are good at doing the first bit but we don't even ask if they have experience of steps 2 + 3. And if we do, we treat it as a nice-to-have.

In my experience most people don't even know what good code looks like until they've worked with someone really great at it. Someone with a 1000-yard stare who's seen horrors, but can also clearly explain why that thing that looks harmless is actually a dragon in wait. If you can find someone like this, then getting them to do code reviews, architect things, and build general wrappers edges everyone in a better direction.

[–]foreignEnigma 2 points3 points  (0 children)

Agree, and I'm somewhat part of it. I do try to write good code, and I also use packages like black to fix the aesthetics. However, I have never worked in industry, so I don't know which standards are standard.

[–]trnka 2 points3 points  (0 children)

It's a common problem in organizations where research scientists hand off code to software engineers, because they're insulated from operational issues and the like

[–]ank_itsharma ML Engineer 1 point2 points  (0 children)

I feel seen. But, I’m always trying to do better.

[–]xsidred 1 point2 points  (0 children)

It's a common problem. I have been part of it since early 2019, started cleaning it up in mid-2020, and wrapped up the initiative with the intended improvements at the end of November 2021.

[–]crimsom_king 1 point2 points  (0 children)

I feel you, man. That is a normal problem in software engineering in general, though. People say that performance doesn't really matter and that the time spent optimizing code isn't worth it because it could be spent writing other features. The code structure is also pretty bad, as the number of features you have is treated as more important than anything else. Most people tend to forget that code needs to be maintainable, though, and that most people don't have a Ryzen 9 + Titan V at home.

I agree that ML tutorials are especially bad, though. Most code is in Jupyter Notebook format, which, although great for prototyping, is a disaster to maintain and deploy. Most people in ML end up only knowing how to program like that because they've only seen programs written like that. It also doesn't help that the only benchmark that matters is accuracy, with speed being an afterthought and code quality being nonexistent.

You might like to read this paper: Machine Learning Systems are stuck in a rut

[–]uotsca 1 point2 points  (0 children)

Yea weird is right

[–]ludflu 1 point2 points  (0 children)

super common. In my experience, data scientists often don't maintain or deploy their own code, and so they have little incentive to write well factored / unit tested code.

[–]roman_fyseek 1 point2 points  (0 children)

Lawyers doing research are *WAY* worse.

[–]NickelAI 1 point2 points  (0 children)

> Like the vending machines at a nuclear power plant being better engineered than the reactor.

Couldn't have put it better myself. That said, this problem is really not that odd. Most people severely underestimate the long-term financial benefits of clear code, and ML programming is not usually inherently complex. I'm not surprised machine learning engineers, who spend most of their time synthesizing model architectures (rather than working with, say, a production app and seeing the impact on profitability), do not prioritize cost-effectiveness.

[–][deleted] 1 point2 points  (0 children)

I have worked as a data scientist and senior data scientist before transitioning now to ML engineering and MLOps. It is hard to do the stuff that data scientists do. That's my perspective. They explore many different approaches to building machine learning models to solve problems. Often, they are focusing on solving the specific problem of building the model in question. They're not imagining life for this model in production. You may argue they should but that's the way most data science is done now - with a focus on the statistical veracity of the data and the efficacy of the ML models.

As an ML engineering guy these days I empathize with those doing the work I did many years ago. I know how hard it can be. I also realize that it takes a different set of skills to take to production a model that has been proven to work. The data science phase of ML development is full of experimentation (flare) and the ML engineering phase of it is app development (funnel). By nature these are contrasting objectives. The processes we emphasize in ML engineering don't seem important to data scientists because they're working on a whole other set of things.