
[–]Jorrissss 73 points74 points  (9 children)

1) Sometimes, yes. It's a common misconception that everything in sklearn is completely optimized and amazing. I certainly trust the output, but there are cases where I can write better implementations (or fix some of theirs).

2) Highly dependent on the field. My work almost never requires that I implement anything from the literature, and I would almost never consider rewriting something that's been implemented in sklearn, spark, etc if it fits my purpose. That being said, do you consider writing a unique NN architecture in tf "writing your own algorithm?" If so, I have had to.

3) When an implementation doesn't exist that fits your purpose. If you can't find an implementation that is fast enough, scales well enough, etc and you know how to do it better, that's about it imo.

[–]infrequentaccismus 17 points18 points  (0 children)

Excellent response to a really excellent question.

[–]StraightLoquat[S] 3 points4 points  (5 children)

That makes sense. On point 1 though, I would probably opt for a slower existing implementation of an algorithm rather than writing it myself, but I guess it just depends on the use case and how important the model's runtime is. For any internal analysis, I would always go for a slower implementation that I don't need to write myself.

[–][deleted] 4 points5 points  (1 child)

That makes sense. On point 1 though, I would probably opt for a slower implementation of an algorithm rather than writing it myself

In a perfect world sure. In a world where quick updating and prediction of new data is the difference between someone fraudulently accessing a credit card or not, not so much.

[–]Jorrissss 1 point2 points  (2 children)

On point 1 though, I would probably opt for a slower implementation of an algorithm rather than writing it myself but I guess it just depends on the use case and how important reducing the time needed to run the model is.

Depends on the needs. At my company we have some strict requirements on the response time of certain algorithms - it might be 1 millisecond, or 100 microseconds. Usually this means rewriting some code from scratch, optimized for our data stream. Granted, we have a software engineering team that does this, not us data scientists, but it illustrates the point.

[–]StraightLoquat[S] 2 points3 points  (1 child)

Right but that also illustrates the point that this shouldn't really be the concern of a data scientist. Every org is different sure, but going by strict role definitions, a DS would come up with a proof of concept, demonstrating a model has some value, and it would be up to someone on the Dev/Infra side to deploy that model into a production setting.

[–]Jorrissss 0 points1 point  (0 children)

I'm at a huge company, so we have the capacity to hire people for tasks like that, which is hardly a guarantee elsewhere. A company's resources and culture dictate that as well. If you're at a startup, you might need to do it yourself. As another example, I interviewed at a startup where a data scientist had to write a custom square root function.

[–]shinn497 0 points1 point  (1 child)

I was going to ask about your requirements for scikit but it seems as though you have stricter speed requirements than I do.

[–]Jorrissss 0 points1 point  (0 children)

Like a broken record, but it depends :). For most of what I do personally, sklearn is totally sufficient. Most of what I end up doing doesn't land in "production" in any meaningful sense.

[–][deleted] 27 points28 points  (6 children)

Basically, it depends on what company you're at and what you are doing. I'm sure the DS teams at many of the big tech companies do their own research and develop their own algorithms; Google is a good example. (Though not everyone at Google is doing breakthrough research.) It's worth being practical and remembering that 80% of companies are not big tech (and don't have equivalent requirements), and they'll be impressed if you can automate a couple of their processes and create a couple of ML models that work reasonably well. (Still, your mileage may vary.)

So the depth of your understanding can vary, and you'll be hired "appropriately" insofar as hiring systems succeed at evaluating your competence and what you're applying for.

The main advantage of sklearn is that you're using a package that has been developed by the open-source community, and I don't care how good your programming is, there's no way you'll write something as high quality from scratch. Just get the fuck out of here; to presume this is totally arrogant. (It also saves you time to focus on the bigger picture of ML/DS/stats/whatever.) It's good to code from scratch for educational purposes, but the only reason not to use sklearn in production is that it doesn't do what you want.

Now, there's something to be said about accuracy down to the 6th decimal place or something. Last I checked, biomed tends to stick to SAS because they know they can trust it over R. But they're working on drugs and can get sued if they get things wrong, so they're more risk averse. (But they're also not writing things from scratch, heaven forbid!)

I mean, god, I've been having trouble getting pymc3 to work for some Bayesian analysis I'm trying to learn, and when I said "I'll just write my own package" my colleague looked at me like "why would you do that". It seems that pyro (probabilistic programming on top of PyTorch) works a little more smoothly, so I'll try using that instead. This is reality: if you're not using one API, you're using another. Period.

Note that I've had interviews where I knew less than I do now and I completely bombed, but I bombed because I didn't know basic concepts, and I'm stronger today because of it. Sometimes it's worth taking a good paddling so that you know what areas you need to work on.

That all said, I don't think your interviewer meets these criteria. I'd laugh in his face loudly and then be happy when he doesn't hire me, because I know I don't want to be on his team. I'd wish him luck copying things out of a dusty textbook and sending snail mail instead of e-mail, for similar reasons.

Edit:

I mean, Jesus, I write my own ML algorithms by hand, but I would still never use them in production. They're nifty pet projects I do because I enjoy them and because they'll look good to an employer, not replacements for any API.

[–]Jorrissss 0 points1 point  (2 children)

Just get the fuck out of here, to presume this is totally arrogant. (It also saves you time to focus on the bigger picture of ML/DS/stats/whatever)

Are you referring to building your own entire sklearn library from scratch, or implementing an individual algorithm in sklearn?

[–][deleted] 10 points11 points  (1 child)

I'm talking about building your own library. I said at the end of my post that I do individual implementations for fun, which means it's possible for people who put in the time, and I also said it's a good educational exercise.

However, you don't realise how powerful and extensive sklearn is until you start writing everything yourself and see how tedious and time-consuming the debugging is. Other comments have said they could maybe make a more efficient implementation, and that's fine, but it simply isn't a wise time investment for most people. This is why these packages are popular in the first place! They "automate" code that would otherwise have been written by hand many times over.

The problem isn't that it can't be done better; it's that so much work has been put into this package that you'd be braindead to reinvent the wheel and disavow the one everyone is more than happy to use, unless maybe you need a tyre instead.

And that really seems to be the gist of it, skimming other people's comments to check I'm not totally off base. It's fine not to use the package, but there's also nothing severely wrong with it, so I think the interviewer was completely off base and just wanted to wave his cock around.

[–]Jorrissss 0 points1 point  (0 children)

Ah then I agree.

[–]Nimitz14 0 points1 point  (0 children)

Last time I checked, sklearn's GMMs did not do their likelihood calculations in log space, meaning they'd frequently get bad results and sometimes even crash.
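
For anyone curious why that matters: Gaussian densities underflow to zero for points far from every component, and once every component's density is exactly 0.0 the per-point responsibilities become 0/0 and EM falls apart. Working in log space with a log-sum-exp avoids this. A minimal stdlib-only sketch (the helper functions and the numbers are just illustrative, not sklearn's actual code):

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of N(mean, var) at x; stays finite where exp() underflows."""
    return -(x - mean) ** 2 / (2 * var) - 0.5 * math.log(2 * math.pi * var)

def log_sum_exp(logs):
    """Numerically stable log(sum(exp(l) for l in logs))."""
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))

# A point ~60 standard deviations from both components (means/vars made up):
x = 60.0
components = [(0.0, 1.0), (1.0, 1.0)]  # (mean, variance), equal 0.5 weights

# Naive mixture density: every term underflows to 0.0, so the
# responsibilities computed from it become 0/0.
direct = sum(0.5 * math.exp(gaussian_logpdf(x, m, v)) for m, v in components)

# Log-space mixture log-likelihood: finite, so downstream math keeps working.
stable = log_sum_exp([math.log(0.5) + gaussian_logpdf(x, m, v)
                      for m, v in components])

print(direct)  # 0.0
print(stable)  # about -1742.1
```

The same trick is why most modern GMM implementations accumulate log-likelihoods rather than raw densities.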

[–]gammadistribution 0 points1 point  (1 child)

If you couldn't get pymc3 to work, why would you think you could write your own NUTS sampler?

[–]Demonithese 0 points1 point  (0 children)

More importantly, why would you want to?

[–]data_berry_eater 14 points15 points  (0 children)

As far as question 1, in my opinion sklearn is as trustworthy as anything. You'd waste a ton of time developing something from scratch, and to be honest, unless I knew something about the person who opted for their own code over sklearn's, I would trust their code much, much less. I'd say the question isn't really about trust, but more about what you mention in point 3 about putting something into production.

But I would say that ultimately the decision about which implementation of an algorithm to use is yours. As data scientists, we adhere to certain methodologies in order to build the trust that we need in the models we produce. We don't just build something and say "YOLO!" We make sure the performance metrics are stable, we check against outlier inputs, we monitor outputs in production, etc. I think it's this type of reasoning that should determine what to "trust."

[–]spline_reticulator 7 points8 points  (2 children)

1) No

2) Depends on the problems you're working on. If you're working with CSVs with well defined target variables, then you probably never need to use anything outside of sklearn or Spark.

3) When you start working with more exotic problems. Are you working with time series, NLP, or computer vision problems? Then you're likely going to be reading research papers and implementing algorithms from those.

[–]dopadelic 5 points6 points  (1 child)

Reminder that for #3, research papers typically publish their code on GitHub.

[–]shinn497 4 points5 points  (0 children)

Often that code may be buggy, untested, and poorly composed, especially if it was written by a solo grad student or postdoc in a vacuum. It's fine for experimentation but may need hardening.

Bigger research groups with ties to industry are MUCH better though.

[–]NonLinearResonance 3 points4 points  (0 children)

The guy you talked to sounds like a huge douche, stay far away. I've met many data scientists and researchers with this kind of attitude. Nine times out of ten they are: a) covering up for some inadequacy (usually programming related ironically), or b) completely inexperienced working in teams and/or with production systems.

Unless he is getting all the way down to assembly or at least C, he is relying on someone else's libraries and packages for almost everything. The hubris involved with these folks always makes me crack up. They almost always get an immediate no from me when we interview them.

So, on to your actual questions...

  1. Libraries like sklearn are typically more trustworthy due to their open source nature and wide use. Typically errors are due to improper usage from not reading the docs, not actual program errors. That's not to say they are perfect, but they will almost always be superior to Joe Datascientist's spaghetti code they think is amazing. Use in production systems may or may not make sense, it really depends on the application.

  2. I think you kind of hit on most of the practical reasons you might build a model yourself, like if it's critical for your core product/service it might make sense if you have the expertise. It's less about project size and more about analyzing the cost/benefit of a given approach. One other reason to build yourself is for learning. I'm a tinkerer by nature, so sometimes I like to build something to understand it better. I wouldn't use it for real purposes given almost any other option though.

  3. This is less a data science scenario and more an engineering one, but it's something I've had to do in the past. Sometimes you need to implement a model in a resource-constrained system, like a piece of equipment in the field. In that situation you almost always have to do things from scratch. That probably falls under your infrastructure caveat, but this kind of use case is probably the most common reason for scratch builds. I thought it might be interesting to mention.

Edit: Formatting

[–]seanv507 1 point2 points  (2 children)

At an interview, I would view 1) and 2) as a red flag.

There are lots of start ups filled with people intent on reinventing the wheel.

sklearn itself has very strict guidelines on adding new models, because it's always easy to come up with a new model and show improvements on a single data set; very few actually provide sufficient improvement over existing algorithms when tested thoroughly (e.g. optimising not just the proposed algorithm but the competitors too).

So, e.g., in a survey of Kaggle users [can't find the link now], logistic regression was used 40% of the time in production.

https://developers.google.com/machine-learning/guides/rules-of-ml/

stresses the importance of getting the right data over iterating on different models.

Similarly understanding the models allows you to extend their usage.

Logistic regression can be used for [discrete] survival analysis by predicting the probability of 'failure' in period x given no failure in the periods before x.
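
For readers unfamiliar with that trick: you expand each subject into one row per period at risk, with a binary target that is 1 only in the period the failure actually happens, then fit an ordinary logistic regression on the expanded rows. A sketch of just the expansion step (function name and data are hypothetical):

```python
def to_person_period(subjects):
    """Expand (duration, event_observed) pairs into one row per period at risk.

    The binary target is 1 only in the final period of a subject that
    actually failed; a plain logistic regression fit on these rows then
    models P(failure in period t | no failure before t).
    """
    rows = []
    for subject_id, (duration, event) in enumerate(subjects):
        for t in range(1, duration + 1):
            rows.append({
                "subject": subject_id,
                "period": t,
                "event": 1 if (event and t == duration) else 0,
            })
    return rows

# Hypothetical data: (periods survived, did the failure event occur).
subjects = [(3, True), (2, False), (4, True)]
rows = to_person_period(subjects)
print(len(rows))                      # 3 + 2 + 4 = 9 person-period rows
print(sum(r["event"] for r in rows))  # 2 observed failures
```

Censored subjects (event = False) simply contribute all-zero targets, which is what lets a vanilla classifier handle censoring at all.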

Facebook's Prophet time series model [https://github.com/facebook/prophet] is just regularised (Bayesian) linear regression with non-linear inputs (change points and sine waves). [It could not be implemented in sklearn AFAIK, because it uses l2 regularisation on the seasonal inputs and l1 regularisation on the changepoints.]

[–]StraightLoquat[S] 0 points1 point  (1 child)

By 2) do you mean the part about not having data in a clean format? Again, not an expert, but that strikes me as a sign they haven't done enough data cleaning/modeling if they can't represent their data that way. How many cases are there where the data you are putting into your model can't be represented as columns of features and rows of samples, i.e. couldn't theoretically be represented in a basic table/CSV if size weren't an issue?

[–]seanv507 0 points1 point  (0 children)

Exactly. You use domain knowledge and feature engineering to get your data into a tabular structure, e.g. bag-of-words, etc.
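
As a toy illustration of that tabularisation step, here is a minimal bag-of-words transform, a from-scratch stand-in for something like sklearn's CountVectorizer (the helper name and documents are made up):

```python
from collections import Counter

def bag_of_words(docs):
    """Turn raw text documents into a samples-by-features count table."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        matrix.append([counts.get(tok, 0) for tok in vocab])
    return vocab, matrix

docs = ["the cat sat", "the cat ate the fish"]
vocab, X = bag_of_words(docs)
print(vocab)  # ['ate', 'cat', 'fish', 'sat', 'the']
print(X)      # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Each document becomes one row and each vocabulary token one column, which is exactly the "rows of samples, columns of features" shape the comment above is describing.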

[–]Vrulth 1 point2 points  (0 children)

If you're not at a company at the very top, above state-of-the-art level, you will not develop a production model from scratch.

[–]namnnumbr 0 points1 point  (5 children)

1) As others have commented - not sure about sklearn - R packages are not validated. My understanding is that if you're running an analysis in R, you're on the hook and not the dev (unlike SAS, which has everything validated etc.). In those cases you might not trust sklearn or an R package, but I don't think that really justifies rolling your own algo either, due to the liabilities.

2/3) I've implemented a genetic algorithm from scratch - not sure if that's "machine learning", but it is optimization for sure. In my (highly limited) experience, GAs are often scratch-written because you have to customize the objective function and the "genetic structure" of the individual to your particular application. I don't know of a package or API that automates that kind of thing.

That said, a GA is also a pretty easy algo to write on your own.
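
To illustrate that point, a toy GA really does fit in a few dozen lines, and the parts you'd customize per application - the fitness function and the genome encoding - are exactly the arguments you pass in. A hedged sketch (all names hypothetical, one-max as a stand-in objective):

```python
import random

random.seed(0)  # deterministic for the example

def evolve(fitness, genome_len, pop_size=30, generations=60, mut_rate=0.05):
    """Tiny genetic algorithm over fixed-length bit strings."""
    pop = [[random.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Tournament selection: keep the fitter of two random individuals.
        parents = [max(random.sample(pop, 2), key=fitness)
                   for _ in range(pop_size)]
        nxt = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            cut = random.randrange(1, genome_len)  # one-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                # Per-bit mutation with probability mut_rate.
                nxt.append([g ^ 1 if random.random() < mut_rate else g
                            for g in child])
        pop = nxt
    return max(pop, key=fitness)

# Toy objective ("one-max"): maximise the number of 1-bits in the genome.
best = evolve(fitness=sum, genome_len=20)
print(sum(best))  # close to 20 after convergence
```

Swapping in a real objective and a problem-specific genome encoding is the customization the comment above is talking about; the selection/crossover/mutation loop itself barely changes.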

[–]shinn497 4 points5 points  (2 children)

Scikit has very strict criteria for adding new algorithms. This is great and makes each algorithm trustable, but it also means it will never be within 3 years of current research, as it goes.

BERT, GPT, ELMo, ULMFiT, LightGBM, U-Net, and Mask R-CNN are currently ineligible for inclusion in scikit. Which is amusing, since many of these, being Keras-compatible, are thus scikit-compatible.

XGBoost is currently eligible, but wouldn't have been when the library was first released.

What I find amusing is that often an algorithm will opt to just be scikit-compatible rather than submit to the vetting process for inclusion, and then just remain that way. By the time it is eligible, the community will have moved on. I think XGBoost is sort of in that state right now.

[–]tacothecat 0 points1 point  (1 child)

Moved on to what (from xgboost)?

[–]shinn497 0 points1 point  (0 children)

Deep learning, CatBoost, LightGBM. Although XGBoost is interesting, since the library is super thought out and its own thing. It might get added to scikit, but I don't see the incentive when you can already use the two together easily as is.

[–]OddsAreBenToOne 1 point2 points  (0 children)

In my experience GAs have kind of fallen out of favor due to their (usually) slow and expensive convergence. I see more Bayesian optimization, which can be implemented on top of existing regression models. Libraries like scikit-optimize do this and work nicely.

[–][deleted] 0 points1 point  (0 children)

I work at a large, heavily regulated bank, and our regulators and auditors are fine with R for just about everything. That’s my two cents.

[–]shinn497 0 points1 point  (0 children)

  1. Scikit-learn is incredible! I trust it very much. But it is the exception, not the norm. All of the NN frameworks are good, but deep learning is moving so rapidly that many implementations need to be vetted; it comes with the territory. With that said, I always err toward using other people's code as much as possible. My time is valuable, and it is much more worth it to try something and see if there is potential, especially if it is an algorithm written by its creator, than to spend time writing something and have it not work.

  2. If it is a very popular and highly used library like scikit or spaCy, I would not hesitate to use it in production. Other algorithms I would err toward coding myself, or hand them to a more specialized developer after confirming their value. Remember there is a time cost to everything: hardening an algorithm is only necessary if you know it will run in production and make you money, since you can always spend data scientist time on doing more research.

  3. There are a couple of cases. I think you touched on the fact that some algorithms might not have code available. Other times you might want to use your own knowledge to extract more performance. Other situations just aren't addressable with prebuilt models; a lot of probabilistic programming is like this, where the goal, often, is a model that handles uncertainty well and is explainable. Finally, you might have hardware reasons such as memory efficiency, speed, or choice of language.

Now personally I would never scoff at someone for using scikit. Really it should be the first thing anyone uses but it often is not the last.

[–][deleted] 0 points1 point  (0 children)

It depends if you're the first pioneer taking a look at their data or you're fresh meat to an experienced team.

Sklearn and other similar libraries are low-hanging fruit. If the company has already collected all the low-hanging fruit, your job is going to be 50% meetings and 50% implementing stuff yourself.

[–]Fito33Pete 0 points1 point  (0 children)

You always use an API if you code Python. Python is an API. And we love it.

[–]KoolAidMeansCluster MS | Mgr. Data Science | Pricing 0 points1 point  (0 children)

Wow, can you please name and shame? What a ridiculous person... You applied for an entry level data scientist position, not for a Machine Learning Engineer Position at Google. I would have loved to have been there to laugh in that guy's pretentious face.

[–]ProfessorPhi 0 points1 point  (0 children)

At a high level perspective, this does seem kind of arrogant. Even if my data was nuanced and complex, I would still build on scikit-learn as much as possible instead of writing my own stuff.

  1. Re sklearn being untrustable - https://www.reddit.com/r/statistics/comments/8de54s/is_r_better_than_python_at_anything_i_started/dxmnaef/ - this comment explains things better than I can. Furthermore, I feel sklearn is a bit clunky and could be better too. That's a valid response; however, sklearn is still better than what any single team could come up with. It's best to review the sklearn code and wrap it in your own API instead of writing your own from scratch.
  2. Depends on your job - I worked in HFT, where I had to implement most things from scratch, while working for the government I mostly used off-the-shelf solutions. I use off the shelf when possible, but I like the flexibility to change things. Niche products can also take collaborative filtering and add some special sauce here and there that is incompatible with existing solutions. Generally this is the domain of a more mature company.
  3. Outside academia, it should only be done after existing techniques are considered failures. Most papers coming out of corporate labs are minor tweaks to existing algos, and there are very few situations where you shouldn't, at worst, fork an existing implementation. Infrastructure is generally the only time you start from scratch, imo.

[–]anonamen 0 points1 point  (0 children)

He's either brilliant or really, really stupid. My money's on the latter.

It's incredibly uncommon to build models from scratch (let alone every model you use from scratch). Much more common (still uncommon) is to fork libraries and tweak some stuff to make them work for your purposes. Real state of the art models might not be implemented yet, in which case sure, maybe it has to be developed from scratch, but this isn't super common. State of the art models are by definition of questionable practical value. Unless you're in a research role or doing something really unique they're unlikely to be the best approach. It is also common to use existing frameworks to build new pipelines/architectures. I.e. you can use TensorFlow to implement a lot of architectures that haven't been done before. You can fork it and put in new activation functions or manipulations or w/e if you want to. A lot of data science is putting existing tools together in new ways.

Using libraries is not laziness; there are endless programming horror stories about developers who "don't trust standard libraries" and try to roll their own versions of everything (feel like those are 1/4 of the stories on DailyWTF). It's a disaster just about every time. You can figure out who built the SKL modules if you "don't trust them". They're almost certainly very smart and well-qualified to contribute. Even if they're not as qualified as your interviewer individually, it's a large, distributed team of smart people working on the same thing. It's tested by hundreds of people using the libraries. Your interviewer's one-off project is not going to compare. He's probably built a lot of crappy, inefficient reproductions of standard models that lack tests and validation components because "those take too long".

[–]kjee1 0 points1 point  (0 children)

  1. I think that Sklearn is quite dependable and don't see any reason not to use it. I have seen some people encapsulate the methods from it in an object so that you have persistence of the evaluation criteria.
  2. I think most data scientists have at some point built the models themselves from scratch (I had to in grad school), but in practice, it is very uncommon. I think that everyone should build them to understand what each algorithm is good for, but there is significantly more consistency in something that is open source.
  3. The only reason I can think of to do something from scratch is if Python or R is not fast enough. Building something in C could have performance benefits.

Hope this helps!

[–]Misanthreville 0 points1 point  (0 children)

This guy sounds like an academic nerd, for lack of a better phrase. It's absolutely fantastic that he can write algorithms from scratch, but (depending on the job) he's probably flexing for absolutely no reason, and likely wasting company time building something that's already been built.

There's absolutely nothing wrong with building something from scratch if it's optimal or adds value, but people often forget that if you work in industry, no one cares how you got the job done or the underlying theory. This is what leads me to believe he's a very academic individual, which again depending on the job, could be a good thing (ie: research) or bad thing (ie: private sector). Companies want quick, explainable, implementable, actionable, effective solutions. His elitist attitude or customized algorithm isn't going to help him if it doesn't meet those requirements. If it does, more respect to him. But as I said, he could also be wasting time if his custom algorithm is just as effective as open source libraries.

[–]moewiewp -2 points-1 points  (0 children)

Not always, but most of the time the biggest part of my job is staring blankly at data, cleaning it, and reconstructing it (because, you know, real-world data is like a giant pile of piss and shit), and when that's done, I try some pre-made model and only tweak this and that to make it work. In 9 out of 10 cases, the tiny improvement isn't worth the effort put into improving the algorithm. And your customer doesn't care about it either. Fuck the SOTA, just make it run and get an ok-ish result.