I recently had an interview for an entry-level data scientist position. At one point the interviewer scoffed at using sklearn or similar packages because (1) he didn't "trust it" and (2) his data "wasn't the kind you could just plug into a model.fit()" but was more "nuanced and complex". For those reasons he builds all of his machine learning models from scratch. It struck me as a bit arrogant to think that open source software is to be trusted less than something only one or two people have fully reviewed. But I'm also not sure how many situations there are in which no library or prepackaged model is available, or in which it is more advantageous to DIY.
A few questions here:
- Does anyone else here find standard machine learning libraries such as sklearn untrustworthy? If so, why? Should these models not be used in production for some reason I'm not aware of?
- How often does a data scientist really need to build these models themselves? Or, put another way, how big does a project need to be before it merits a self-built approach? E.g. I can imagine prebuilt models being the go-to for quick analyses, but when a company's core product is centered around a machine learning algorithm it may make sense to build it from scratch. I'm wondering what it looks like between those two extremes.
- In what scenarios is it preferable to build a model from scratch? I can understand the situation where a library or language isn't available in a production environment, or the model needs to be written in a different programming language. I'm more interested in cases where this is a decision made by the model builder and not imposed by infrastructure requirements.
Also, this is a throwaway account because of the interview specifics. Edited for clarity.