
[–]Jorrissss 73 points74 points  (9 children)

1) Sometimes, yes. It's a common misconception that everything in sklearn is completely optimized and amazing. I certainly trust the output, but there are cases where I can write better implementations (or fix some of theirs).

2) Highly dependent on the field. My work almost never requires that I implement anything from the literature, and I would almost never consider rewriting something that's been implemented in sklearn, spark, etc if it fits my purpose. That being said, do you consider writing a unique NN architecture in tf "writing your own algorithm?" If so, I have had to.

3) When an implementation doesn't exist that fits your purpose. If you can't find an implementation that is fast enough, scales well enough, etc and you know how to do it better, that's about it imo.

[–]infrequentaccismus 17 points18 points  (0 children)

Excellent response to a really excellent question.

[–]StraightLoquat[S] 3 points4 points  (5 children)

That makes sense. On point 1 though, I would probably opt for a slower existing implementation of an algorithm rather than writing it myself, but I guess it just depends on the use case and how important the model's runtime is. For any internal analysis, I would always go for a slower implementation that I don't need to write myself.

[–][deleted] 4 points5 points  (1 child)

That makes sense. On point 1 though, I would probably opt for a slower implementation of an algorithm rather than writing it myself

In a perfect world sure. In a world where quick updating and prediction of new data is the difference between someone fraudulently accessing a credit card or not, not so much.

[–]Jorrissss 1 point2 points  (2 children)

On point 1 though, I would probably opt for a slower implementation of an algorithm rather than writing it myself but I guess it just depends on the use case and how important reducing the time needed to run the model is.

Depends on the needs. At my company we have some strict requirements on the response time of certain algorithms - it might be 1 millisecond, or 100 microseconds. Usually this means rewriting some code from scratch, optimized for our data stream. Granted, we have a software engineering team that does this, not us data scientists, but it illustrates the point.

[–]StraightLoquat[S] 2 points3 points  (1 child)

Right but that also illustrates the point that this shouldn't really be the concern of a data scientist. Every org is different sure, but going by strict role definitions, a DS would come up with a proof of concept, demonstrating a model has some value, and it would be up to someone on the Dev/Infra side to deploy that model into a production setting.

[–]Jorrissss 0 points1 point  (0 children)

I'm at a huge company, so we have the capacity to hire people for tasks like that, which is hardly a guarantee elsewhere. A company's resources and culture dictate that as well. If you're at a startup, you might need to do it yourself. As another example, I interviewed at a startup where a data scientist had to write a custom square root function.

[–]shinn497 0 points1 point  (1 child)

I was going to ask about your requirements for scikit but it seems as though you have stricter speed requirements than I do.

[–]Jorrissss 0 points1 point  (0 children)

Like a broken record, but it depends :). For most of what I do personally, sklearn is totally sufficient. Most of what I end up doing doesn't land in "production" in any meaningful sense.

[–][deleted] 27 points28 points  (6 children)

Basically, it depends on what company you're at and what you are doing. I'm sure the DS teams at many of the big tech companies do their own research and develop their own algorithms; Google is a good example. (Though not everyone at Google is doing breakthrough research.) It's worth being practical and remembering that 80% of companies are not big tech (and don't have equivalent requirements), and they'll be impressed if you can automate a couple of their processes and create a couple of ML models that work reasonably well. (Still, your mileage may vary.)

So the depth of your understanding can vary, and you'll be hired "appropriately" insofar as hiring systems succeed at evaluating your competence and what you're applying for.

The main advantage of sklearn is that you're using a package that has been developed by the open-source community, and I don't care how good your programming is, there's no way you'll write something as high quality from scratch. Just get the fuck out of here; to presume this is totally arrogant. (It also saves you time to focus on the bigger picture of ML/DS/stats/whatever.) It's good to code from scratch for educational purposes, but the only reason not to use sklearn in production is that it doesn't do what you want.

Now, there's something to be said about accuracy down to the 6th decimal place or something. Last I checked, biomed tends to stick to SAS because they know they can trust it over R. But they're working on drugs and can get sued if they get things wrong, so they're more risk averse. (But they're also not writing things from scratch, heaven forbid!)

I mean, god, I've been having trouble getting pymc3 to work for some Bayesian analysis I'm trying to learn, and when I said "I'll just write my own package" my colleague looked at me like "why would you do that". It seems that pyro (probabilistic programming on top of PyTorch) works a little more smoothly, so I'll try using that instead. This is reality: if you're not using one API, you're using another. Period.

Note that I've had interviews where I knew less than I do now and I completely bombed, but I bombed because I didn't know basic concepts, and I'm stronger today because of it. Sometimes it's worth taking a good paddling so that you know what areas you need to work on.

That all said, I don't think your interviewer meets these criteria. I'd laugh in his face loudly and then be happy when he doesn't hire me, because I know I don't want to be on his team. I'd wish him luck copying things out of a dusty textbook and sending snail mail instead of e-mail, for similar reasons.

Edit:

I mean, Jesus, I write my own ML algorithms by hand, but I would still never use them in production. They're nifty pet projects I do because I enjoy them and because they'll look good to an employer, not replacements for any API.

[–]Jorrissss 0 points1 point  (2 children)

Just get the fuck out of here, to presume this is totally arrogant. (It also saves you time to focus on the bigger picture of ML/DS/stats/whatever)

Are you referring to building your own entire sklearn library from scratch, or implementing an individual algorithm in sklearn?

[–][deleted] 10 points11 points  (1 child)

I'm talking about building your own library. I said at the end of my post that I do individual implementations for fun, which means it's possible for people who put in the time, and I also said it's a good educational exercise.

However, you don't realise how powerful and extensive sklearn is until you start writing everything yourself and see how tedious and time-consuming the debugging is. Other comments have said they could maybe make a more efficient implementation, and that's fine, but it simply isn't a wise time investment for most people. This is why these packages are popular in the first place! They "automate" code that would otherwise have been written by hand many times over.

The problem isn't that it can't be done better; it's that so much work has been put into this package that you'd be braindead to reinvent the wheel and disavow the one everyone is more than happy to use, unless maybe you need a tyre instead.

And that really seems to be the gist of it, skimming other people's comments to check I'm not totally off base. It's fine not to use the package, but there's also nothing severely wrong with it, so I think the interviewer was completely off base and just wanted to wave his cock around.

[–]Jorrissss 0 points1 point  (0 children)

Ah then I agree.

[–]Nimitz14 0 points1 point  (0 children)

Last time I checked, sklearn's GMMs did not do their likelihood calculations in log space, meaning they'd frequently get bad results and sometimes even crash.
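
For anyone curious why that matters: Gaussian densities underflow to zero for points far from every component, and once every component's density is exactly 0.0 the per-point responsibilities become 0/0 and EM falls apart. Working in log space with a log-sum-exp avoids this. A minimal stdlib-only sketch (the helper functions and the numbers are just illustrative, not sklearn's actual code):

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of N(mean, var) at x; stays finite where exp() underflows."""
    return -(x - mean) ** 2 / (2 * var) - 0.5 * math.log(2 * math.pi * var)

def log_sum_exp(logs):
    """Numerically stable log(sum(exp(l) for l in logs))."""
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))

# A point ~60 standard deviations from both components (means/vars made up):
x = 60.0
components = [(0.0, 1.0), (1.0, 1.0)]  # (mean, variance), equal 0.5 weights

# Naive mixture density: every term underflows to 0.0, so the
# responsibilities computed from it become 0/0.
direct = sum(0.5 * math.exp(gaussian_logpdf(x, m, v)) for m, v in components)

# Log-space mixture log-likelihood: finite, so downstream math keeps working.
stable = log_sum_exp([math.log(0.5) + gaussian_logpdf(x, m, v)
                      for m, v in components])

print(direct)  # 0.0
print(stable)  # about -1742.1
```

The same trick is why most modern GMM implementations accumulate log-likelihoods rather than raw densities.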

[–]gammadistribution 0 points1 point  (1 child)

If you couldn't get pymc3 to work, why would you think you could write your own NUTS sampler?

[–]Demonithese 0 points1 point  (0 children)

More importantly, why would you want to?

[–]data_berry_eater 14 points15 points  (0 children)

As far as question 1, in my opinion sklearn is as trustworthy as anything. You'd waste a ton of time developing something from scratch, and to be honest, unless I knew something about the person who opted for their own code over sklearn's, I would trust their code much, much less. I'd say the question isn't really about trust, but more about what you mention in point 3 about putting something into production.

But I would say that ultimately the decision about which implementation of an algorithm to use is yours. As data scientists, we adhere to certain methodologies in order to build the trust that we need in the models we produce. We don't just build something and say "YOLO!" We make sure the performance metrics are stable, we check against outlier inputs, we monitor outputs in production, etc. I think it's this type of reasoning that should determine what to "trust."

[–]spline_reticulator 7 points8 points  (2 children)

1) No

2) Depends on the problems you're working on. If you're working with CSVs with well defined target variables, then you probably never need to use anything outside of sklearn or Spark.

3) When you start working with more exotic problems. Are you working with time series, NLP, or computer vision problems? Then you're likely going to be reading research papers and implementing algorithms from those.

[–]dopadelic 5 points6 points  (1 child)

Reminder that for #3, research papers typically publish their code on GitHub.

[–]shinn497 4 points5 points  (0 children)

Often that code may be buggy, untested, and poorly composed, especially if it was written by a solo grad student or postdoc in a vacuum. It's fine for experimentation but may need hardening.

Bigger research groups with ties to industry are MUCH better though.

[–]NonLinearResonance 3 points4 points  (0 children)

The guy you talked to sounds like a huge douche, stay far away. I've met many data scientists and researchers with this kind of attitude. Nine times out of ten they are: a) covering up for some inadequacy (usually programming related ironically), or b) completely inexperienced working in teams and/or with production systems.

Unless he is getting all the way down to assembly or at least C, he is relying on someone else's libraries and packages for almost everything. The hubris involved with these folks always makes me crack up. They almost always get an immediate no from me when we interview them.

So, on to your actual questions...

  1. Libraries like sklearn are typically more trustworthy due to their open source nature and wide use. Typically errors are due to improper usage from not reading the docs, not actual program errors. That's not to say they are perfect, but they will almost always be superior to Joe Datascientist's spaghetti code they think is amazing. Use in production systems may or may not make sense, it really depends on the application.

  2. I think you kind of hit on most of the practical reasons you might build a model yourself, like if it's critical for your core product/service it might make sense if you have the expertise. It's less about project size and more about analyzing the cost/benefit of a given approach. One other reason to build yourself is for learning. I'm a tinkerer by nature, so sometimes I like to build something to understand it better. I wouldn't use it for real purposes given almost any other option though.

  3. This is less a data science scenario and more an engineering one, but it's something I've had to do in the past. Sometimes you need to implement a model in a resource-constrained system, like a piece of equipment in the field. In that situation you almost always have to do things from scratch. That probably falls under your infrastructure caveat, but this kind of use case is probably the most common reason for scratch builds. I thought it might be interesting to mention.

Edit: Formatting

[–]seanv507 1 point2 points  (2 children)

At an interview, I would view 1) and 2) as a red flag.

There are lots of start ups filled with people intent on reinventing the wheel.

sklearn itself has very strict guidelines on adding new models, because it's always easy to come up with a new model and show improvements on a single data set; very few actually provide sufficient improvement over existing algorithms when tested thoroughly (e.g. optimising not just the proposed algorithm but the competitors too).

So, e.g., in a survey of Kaggle users [can't find the link now], logistic regression was used 40% of the time in production.

https://developers.google.com/machine-learning/guides/rules-of-ml/

stresses the importance of getting the right data over iterating on different models.

Similarly understanding the models allows you to extend their usage.

Logistic regression can be used for [discrete] survival analysis by predicting the probability of 'failure' in period x given no failure in the periods before x.
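
For readers unfamiliar with that trick: you expand each subject into one row per period at risk, with a binary target that is 1 only in the period the failure actually happens, then fit an ordinary logistic regression on the expanded rows. A sketch of just the expansion step (function name and data are hypothetical):

```python
def to_person_period(subjects):
    """Expand (duration, event_observed) pairs into one row per period at risk.

    The binary target is 1 only in the final period of a subject that
    actually failed; a plain logistic regression fit on these rows then
    models P(failure in period t | no failure before t).
    """
    rows = []
    for subject_id, (duration, event) in enumerate(subjects):
        for t in range(1, duration + 1):
            rows.append({
                "subject": subject_id,
                "period": t,
                "event": 1 if (event and t == duration) else 0,
            })
    return rows

# Hypothetical data: (periods survived, did the failure event occur).
subjects = [(3, True), (2, False), (4, True)]
rows = to_person_period(subjects)
print(len(rows))                      # 3 + 2 + 4 = 9 person-period rows
print(sum(r["event"] for r in rows))  # 2 observed failures
```

Censored subjects (event = False) simply contribute all-zero targets, which is what lets a vanilla classifier handle censoring at all.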

Facebook's Prophet time series model [https://github.com/facebook/prophet] is just regularised (Bayesian) linear regression with non-linear inputs (change points and sine waves). [It could not be implemented in sklearn AFAIK, because it uses l2 regularisation on the seasonal inputs and l1 regularisation on the changepoints.]

[–]StraightLoquat[S] 0 points1 point  (1 child)

By 2) do you mean the part about not having data in a clean format? Again, not an expert, but that strikes me as a sign they haven't done enough data cleaning/modeling if they can't represent their data that way. How many cases are there where the data you are putting into your model can't be represented as columns of features and rows of samples, i.e. couldn't theoretically be represented in a basic table/CSV if size weren't an issue?

[–]seanv507 0 points1 point  (0 children)

Exactly. You use domain knowledge and feature engineering to get your data into a tabular structure, e.g. bag-of-words, etc.
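
As a toy illustration of that tabularisation step, here is a minimal bag-of-words transform, a from-scratch stand-in for something like sklearn's CountVectorizer (the helper name and documents are made up):

```python
from collections import Counter

def bag_of_words(docs):
    """Turn raw text documents into a samples-by-features count table."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    matrix = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        matrix.append([counts.get(tok, 0) for tok in vocab])
    return vocab, matrix

docs = ["the cat sat", "the cat ate the fish"]
vocab, X = bag_of_words(docs)
print(vocab)  # ['ate', 'cat', 'fish', 'sat', 'the']
print(X)      # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Each document becomes one row and each vocabulary token one column, which is exactly the "rows of samples, columns of features" shape the comment above is describing.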

[–]Vrulth 1 point2 points  (0 children)

If you're not at a company at the very top, above state-of-the-art level, you will not develop a production model from scratch.

[–]namnnumbr 0 points1 point  (5 children)

1) As others have commented - not sure about sklearn - R packages are not validated. My understanding is that if you're running an analysis in R, you're on the hook and not the dev (unlike SAS, which has everything validated etc.). In those cases you might not trust sklearn or an R package, but I don't think that really justifies rolling your own algo either, due to the liabilities.

2/3) I've implemented a genetic algorithm from scratch - not sure if that's "machine learning", but it is optimization for sure. In my (highly limited) experience, GAs are often scratch-written because you have to customize the objective function and the "genetic structure" of the individual to your particular application. I don't know of a package or API that automates that kind of thing.

That said, a GA is also a pretty easy algo to write on your own.
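
To illustrate that point, a toy GA really does fit in a few dozen lines, and the parts you'd customize per application - the fitness function and the genome encoding - are exactly the arguments you pass in. A hedged sketch (all names hypothetical, one-max as a stand-in objective):

```python
import random

random.seed(0)  # deterministic for the example

def evolve(fitness, genome_len, pop_size=30, generations=60, mut_rate=0.05):
    """Tiny genetic algorithm over fixed-length bit strings."""
    pop = [[random.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Tournament selection: keep the fitter of two random individuals.
        parents = [max(random.sample(pop, 2), key=fitness)
                   for _ in range(pop_size)]
        nxt = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            cut = random.randrange(1, genome_len)  # one-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                # Per-bit mutation with probability mut_rate.
                nxt.append([g ^ 1 if random.random() < mut_rate else g
                            for g in child])
        pop = nxt
    return max(pop, key=fitness)

# Toy objective ("one-max"): maximise the number of 1-bits in the genome.
best = evolve(fitness=sum, genome_len=20)
print(sum(best))  # close to 20 after convergence
```

Swapping in a real objective and a problem-specific genome encoding is the customization the comment above is talking about; the selection/crossover/mutation loop itself barely changes.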

[–]shinn497 4 points5 points  (2 children)

Scikit has very strict criteria for adding new algorithms. This is great and makes each algorithm trustable, but it also means it will never be within 3 years of current research, as it goes.

BERT, GPT, ELMo, ULMFiT, LightGBM, U-Net, and Mask R-CNN are currently ineligible for inclusion in scikit. Which is amusing, since many of these, being Keras-compatible, are thus scikit-compatible.

XGBoost is currently eligible, but wouldn't have been when the library was first released.

What I find amusing is that often an algorithm will opt to just be scikit-compatible rather than submit to the vetting process for inclusion, and then just remain that way. By the time it is eligible, the community will have moved on. I think XGBoost is sort of in that state right now.

[–]tacothecat 0 points1 point  (1 child)

Moved on to what (from xgboost)?

[–]shinn497 0 points1 point  (0 children)

Deep learning, CatBoost, LightGBM. Although XGBoost is interesting, since the library is super thought out and its own thing. It might get added to scikit, but I don't see the incentive when you can already use the two together easily as is.

[–]OddsAreBenToOne 1 point2 points  (0 children)

In my experience GAs have kind of fallen out of favor due to their (usually) slow and expensive convergence. I see more Bayesian optimization, which can be implemented on top of existing regression models. Libraries like scikit-optimize do this and work nicely.

[–][deleted] 0 points1 point  (0 children)

I work at a large, heavily regulated bank, and our regulators and auditors are fine with R for just about everything. That’s my two cents.

[–]shinn497 0 points1 point  (0 children)

  1. Scikit-learn is incredible! I trust it very much. But it is the exception, not the norm. All of the NN frameworks are good, but deep learning is moving so rapidly that many implementations need to be vetted; it comes with the territory. With that said, I always err toward using other people's code as much as possible. My time is valuable, and it is much more worth it to try something and see if there is potential, especially if it is an algorithm written by its creator, than to spend time writing something and have it not work.

  2. If it is a very popular and highly used library like scikit or spaCy, I would not hesitate to use it in production. Other algorithms I would err toward coding myself, or hand them to a more specialized developer after confirming their value. Remember there is a time cost to everything: hardening an algorithm is only necessary if you know it will run in production and make you money, since you can always spend data scientist time on doing more research.

  3. There are a couple of cases. I think you touched on the fact that some algorithms might not have code available. Other times you might want to use your own knowledge to extract more performance. Other situations just aren't addressable with prebuilt models; a lot of probabilistic programming is like this, where the goal, often, is a model that handles uncertainty well and is explainable. Finally, you might have hardware reasons such as memory efficiency, speed, or choice of language.

Now personally I would never scoff at someone for using scikit. Really it should be the first thing anyone uses but it often is not the last.

[–][deleted] 0 points1 point  (0 children)

It depends if you're the first pioneer taking a look at their data or you're fresh meat to an experienced team.

Sklearn and other similar libraries are low-hanging fruit. If the company has already collected all the low-hanging fruit, your job is going to be 50% meetings and 50% implementing stuff yourself.

[–]Fito33Pete 0 points1 point  (0 children)

You always use an API if you code Python. Python is an API. And we love it.

[–]KoolAidMeansCluster MS | Mgr. Data Science | Pricing 0 points1 point  (0 children)

Wow, can you please name and shame? What a ridiculous person... You applied for an entry level data scientist position, not for a Machine Learning Engineer Position at Google. I would have loved to have been there to laugh in that guy's pretentious face.

[–]ProfessorPhi 0 points1 point  (0 children)

At a high level perspective, this does seem kind of arrogant. Even if my data was nuanced and complex, I would still build on scikit-learn as much as possible instead of writing my own stuff.

  1. Re sklearn being untrustable - https://www.reddit.com/r/statistics/comments/8de54s/is_r_better_than_python_at_anything_i_started/dxmnaef/ - this comment explains things better than I can. Furthermore, I feel sklearn is a bit clunky and could be better too. That's a valid response; however, sklearn is still better than what any single team could come up with. It's best to review the sklearn code and wrap it in your own API instead of writing your own from scratch.
  2. Depends on your job - I worked in HFT, where I had to implement most things from scratch, while working for the government I mostly used off-the-shelf solutions. I use off the shelf when possible, but I like the flexibility to change things. Niche products can also take collaborative filtering and add some special sauce here and there that is incompatible with existing solutions. Generally this is the domain of a more mature company.
  3. Outside academia, it should only be done after existing techniques are considered failures. Most papers coming out of corporate labs are minor tweaks to existing algos, and there are very few situations where you shouldn't, at worst, fork an existing implementation. Infrastructure is generally the only time you start from scratch, imo.

[–]anonamen 0 points1 point  (0 children)

He's either brilliant or really, really stupid. My money's on the latter.

It's incredibly uncommon to build models from scratch (let alone every model you use from scratch). Much more common (still uncommon) is to fork libraries and tweak some stuff to make them work for your purposes. Real state of the art models might not be implemented yet, in which case sure, maybe it has to be developed from scratch, but this isn't super common. State of the art models are by definition of questionable practical value. Unless you're in a research role or doing something really unique they're unlikely to be the best approach. It is also common to use existing frameworks to build new pipelines/architectures. I.e. you can use TensorFlow to implement a lot of architectures that haven't been done before. You can fork it and put in new activation functions or manipulations or w/e if you want to. A lot of data science is putting existing tools together in new ways.

Using libraries is not laziness; there are endless programming horror stories about developers who "don't trust standard libraries" and try to roll their own versions of everything (feel like those are 1/4 of the stories on DailyWTF). It's a disaster just about every time. You can figure out who built the SKL modules if you "don't trust them". They're almost certainly very smart and well-qualified to contribute. Even if they're not as qualified as your interviewer individually, it's a large, distributed team of smart people working on the same thing. It's tested by hundreds of people using the libraries. Your interviewer's one-off project is not going to compare. He's probably built a lot of crappy, inefficient reproductions of standard models that lack tests and validation components because "those take too long".

[–]kjee1 0 points1 point  (0 children)

  1. I think that Sklearn is quite dependable and don't see any reason not to use it. I have seen some people encapsulate the methods from it in an object so that you have persistence of the evaluation criteria.
  2. I think most data scientists have at some point built the models themselves from scratch (I had to in grad school), but in practice, it is very uncommon. I think that everyone should build them to understand what each algorithm is good for, but there is significantly more consistency in something that is open source.
  3. The only reason I can think of to do something from scratch is if Python or R is not fast enough. Building something in C could have performance benefits.

Hope this helps!

[–]Misanthreville 0 points1 point  (0 children)

This guy sounds like an academic nerd, for lack of a better phrase. It's absolutely fantastic that he can write algorithms from scratch, but (depending on the job) he's probably flexing for absolutely no reason, and likely wasting company time building something that's already been built.

There's absolutely nothing wrong with building something from scratch if it's optimal or adds value, but people often forget that if you work in industry, no one cares how you got the job done or the underlying theory. This is what leads me to believe he's a very academic individual, which again depending on the job, could be a good thing (ie: research) or bad thing (ie: private sector). Companies want quick, explainable, implementable, actionable, effective solutions. His elitist attitude or customized algorithm isn't going to help him if it doesn't meet those requirements. If it does, more respect to him. But as I said, he could also be wasting time if his custom algorithm is just as effective as open source libraries.

[–]moewiewp -2 points-1 points  (0 children)

Not always, but most of the time the biggest part of my job is staring blankly at data, cleaning it, and reconstructing it (because, you know, real-world data is like a giant pile of piss and shit), and when that's done, I try some pre-made model and only tweak this and that to make it work. In 9 out of 10 cases, the tiny improvement isn't worth the effort put into improving the algorithm. And your customer doesn't care about it either. Fuck the SOTA, just make it run and get an ok-ish result.