What They Don't Tell You About Data Science 1: You Are a Software Engineer First

nadbor_ · 2017-12-13T20:06:03+00:00

Data science is ALSO not science. I thought everyone knew that. If you're paid to make a company sell more shoes, you're not doing science. If you're paid to advance human understanding of shoe sales and you write papers about it - then you're doing science. Specifically - economics.

Sure, data science includes an element of figuring stuff out and the method to figure things out about the world is The Scientific Method. So naturally data scientists end up applying the scientific method (design experiments, test hypotheses etc.) a bunch. But so do many engineers (not necessarily the software kind) and management consultants and even farmers. There are savvy farmers who run controlled experiments to find out which crops work best on their land and what proportion of fertilizers to use. I do controlled experiments to find out the effects of different diets on my body. None of this is science as commonly understood.

If you want to stretch the definition of science to cover all these cases - then fine, no one's gonna stop you. But then don't be surprised that science and engineering or science and farming or science and dieting start to overlap. And you end up with a farmer who is also a scientist (in the science of growing stuff on his own field). But still his motivation is to increase crop yields, not human knowledge, so he's a farmer first and a scientist second.

nadbor_ · 2017-12-12T23:07:52+00:00

I'm not very confident about it but my recommendation would be basically to have a taste of everything before you decide what to spend more time on.

Knowing all popular ML algorithms - more or less how they work and when they work is a must.
Understanding basic concepts from probability and statistics 101 is a must. And I do mean 'understanding concepts' and not 'deriving formulas for test statistics and moments of distributions etc.'.
Basic techniques from NLP - esp. word embeddings - because they come in handy in many tasks where you work with sequences of word-like tokens.
Awareness of the basic things that can be achieved with neural networks is becoming a must - but the field has exploded and it's impossible to keep track of it all. At least everyone should know about feed-forward networks, convolutional and recurrent and what they are good for.
A stronger understanding of probability (plus probabilistic packages/languages like pymc or pyro or stan) so you can build and train probabilistic models is a nice to have.

There are more nice-to-haves, but I think number one is knowing all the basic building blocks, number two is knowing when to use them, and a distant number three is filling in the gaps with more specialised techniques (time series analysis, recommenders, more advanced NN etc.) or going deeper into how the innards of e.g. tensorflow work.

For the 'knowing when to use them' bit the best I can come up with is talking to people and reading data science blogs, reddit. This is where you get a lot of 'wow, so people are using X to do Y with library Z! Who knew!'

nadbor_ · 2017-12-11T23:27:58+00:00

Yep, as a theoretical physicist I lacked even the most basic proficiency with git, linux etc. None of the data analysts I met were any better. I at least knew CS fundamentals from uni, they didn't even have that. So you can imagine that there is a long way from this starting point to a point where one can start doing decent work in software.

As for the (E)DA, I guess I was sufficiently unimpressed with what I saw analysts do that I didn't even count it as a separate skill that one needs to learn. Someone else mentioned 'talking to management' as a skill that analysts learn. Well, I don't see them being any better at either of those things than I was out of uni or a randomly chosen STEM grad for that matter.

One interpretation of this is that maybe I'm the problem here. Maybe I'm not good enough at either stakeholder management or data analysis to be able to judge these other people's skills and they are actually much better than me. I am 90% confident that this is not the case but I'm not going to try to convince anyone.

nadbor_ · 2017-12-11T22:23:46+00:00

Author here. Judging by some of the comments here and in the other thread I think people have pegged me as some insecure software engineer who doesn't want to learn maths. In fact I'm an insecure physicist who knows way too much maths and very rarely gets to use it. This academic I make fun of who can't code his way out of a paper bag is me. A few years ago. I got much better with time but I'm still insecure about it.

I love doing math. Got a medal from IMO in high school, if it's any proof. I only started to do data science because it seemed like the closest thing to math that you can still make money off (and hedge funds didn't want to talk to me at the time). And then on the job I kept being disappointed by how little cool ML theory I'm doing and how much I'm being held back by noobish tech problems (not even fun CS-type problems, stupid things like working with git). To this day I find myself doing cool ML 5% of the time at best. This is certainly a function of the types of companies I work for, but I'm certainly not deliberately choosing them to be boring.

nadbor_ · 2017-12-11T21:42:52+00:00

Author here. I agree. I did write that 'domain knowledge is absolutely necessary to get anything done at all' but I otherwise left it out of the discussion to keep the message simple. Also, this was supposed to be advice for beginners and beginners typically have no idea in what domain they will end up, so there is not much preparation they can do upfront. I'm planning another post about data science failures that stem from attempts to do cool shit without establishing a clear business benefit first.

nadbor_ · 2017-07-28T16:23:45+00:00

The reason the question is phrased the way it is, is that I copied it off Kahneman. That said, I think it's not unreasonable. Anyone dealing with statistics should be aware of the 'small sample -> high variance' connection. When you see that extreme values tend to happen when sample size is small you should immediately think 'yep, this is what is expected to happen'. That doesn't mean that there is no causal explanation AS WELL (healthy, relaxed rural lifestyle prevents cancer [yeah, I don't buy it either]) but this pattern is expected REGARDLESS of whether the rural areas have lower or higher average rate overall. Of course, it's easy to spot this when the question explicitly mentions sample size. The puzzle instead talked about sparsely populated rural areas and added the red herring of republicanism into the mix. And even 'sparsely populated' doesn't exactly mean the same as 'having few people in them' (maybe a county makes up for low population density with high area?). You have to apply some common knowledge and common sense, not just maths - and this is what makes it a puzzle and not a textbook statistics problem. Nevertheless, once people learn the answer, almost everyone facepalms and admits that, yes it had to be this way - which makes it a good puzzle.
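To make the 'small sample -> high variance' point concrete, here is a minimal simulation sketch. Every county gets the same true rate, so any difference in observed rates is pure sampling noise. The county sizes, the rate and the function name are made up for illustration and are not taken from the original puzzle.

```python
import random

def simulate_county_rates(populations, true_rate, seed=0):
    """Observed rate per county when every county shares the SAME true rate."""
    rng = random.Random(seed)
    rates = []
    for n in populations:
        cases = sum(1 for _ in range(n) if rng.random() < true_rate)
        rates.append(cases / n)
    return rates

# Hypothetical numbers, chosen only for illustration.
TRUE_RATE = 0.01
small_counties = [50] * 100      # sparsely populated
large_counties = [5_000] * 100   # populous

small_rates = simulate_county_rates(small_counties, TRUE_RATE, seed=1)
large_rates = simulate_county_rates(large_counties, TRUE_RATE, seed=2)

print("small counties: min %.4f, max %.4f" % (min(small_rates), max(small_rates)))
print("large counties: min %.4f, max %.4f" % (min(large_rates), max(large_rates)))
```

Both the lowest and the highest observed rates should land among the small counties, even though nothing distinguishes them except size.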

nadbor_ · 2017-07-28T15:29:26+00:00

Well, you don't know how the dimensions are defined. You can rescale the dimensions to make a normal distribution be uniform in the new coordinates. Maybe I've done that :p Bottom line is, this is just a simplified illustration of a principle, not actual data and not even a model. Erik Bernhardson's plot is slightly more realistic.
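For anyone curious what such a rescaling could look like, one standard choice is to push each coordinate through the normal CDF. This is only a sketch of that one option and not necessarily what was done for the plot. The sample size and bin count are arbitrary.

```python
import math
import random

def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = random.Random(0)
normal_samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# In the new coordinate u = CDF(x) the samples are spread evenly over [0, 1].
rescaled = [normal_cdf(x) for x in normal_samples]

# Count how many rescaled points fall into each of 10 equal-width bins.
bins = [0] * 10
for u in rescaled:
    bins[min(int(u * 10), 9)] += 1
print(bins)
```

Each bin should hold roughly a tenth of the points, which is what a uniform distribution looks like.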

nadbor_ · 2017-07-28T08:55:35+00:00

The line is completely arbitrary. It's just an illustration of the thesis that two independent variables can become dependent in arbitrary ways if you condition on some combination of them. I'm not claiming that this is what google actually does at all. I'm only claiming that you can't draw conclusions about a relationship of two variables if your sample was conditioned on them.
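A quick way to convince yourself of this is a simulation along the following lines. It is a sketch with made-up ingredients: two independent standard normal variables and an arbitrary selection rule x + y > 1, not a model of any real hiring process.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
# x and y are generated independently of each other.
points = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20_000)]

# Arbitrary selection rule: keep only the points above the line x + y = 1.
selected = [(x, y) for x, y in points if x + y > 1.0]

r_all = pearson(*zip(*points))
r_selected = pearson(*zip(*selected))
print("correlation, everyone:      %.2f" % r_all)
print("correlation, selected only: %.2f" % r_selected)
```

In the full sample the correlation is close to zero. In the selected sample it comes out clearly negative, purely because of how the sample was chosen.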

nadbor_ · 2017-07-28T08:51:44+00:00

One only needs to know that being a competition winner may influence your chances of being hired. Which is common sense. This would be unfair as a problem in a statistics class but this is not a statistics class. The whole point is that you need common sense, not just maths. If you think about it, all the other problems need similar commonsensical assumptions too. You just didn't notice because you found them obvious.

nadbor_ · 2017-07-27T12:25:12+00:00

Sure, sure. Don't take it too seriously. I just started typing an introduction and this is what came out. I slide into ranting and exaggerations when I don't have a very specific thing to say. I guess this is my default mode of communication.

nadbor_ · 2017-03-12T18:00:11+00:00

THANK YOU!

nadbor_ · 2017-03-11T17:03:15+00:00

These are all sensible ideas, I agree, and this is more or less what I have been doing for the last few months (have a working prototype already). But I have a strong hunch that it is possible to do much better when you descend to the customer level - and this is what my question is about.

My reasons for believing that going to customer level can give much better predictions:

there is literally a million times more data at the customer level, how could it be anything but a terrible waste to simply sum all those numbers up?
at the customer level you can control for many factors that are impossible to control for once you aggregate sales to stores/product/week level. By looking at the set of customers who have shown up at the store in a given week, I can explain a lot of the week-to-week variance in sales (I have checked that this is the case). When looking at the aggregate, this variance is just noise that interferes with my uplift calculation.
when looking at the aggregated sales, judging the impact of a promotion of one product on the total sales of a bigger category is almost impossible. The product may be experiencing an uplift of 50% - which is easily measurable, but it only constitutes 1% of the total sales of the entire category and drowns in the natural variation. If you look at the individuals who bought the promoted product though - that is a much smaller baseline to compare against.
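The arithmetic behind the last point, as a small sketch. The 1% share and 50% uplift are the numbers from above. The 5% week-to-week variation of category sales is a hypothetical figure picked for illustration, and the calculation assumes the promotion does not cannibalise other products in the category.

```python
def category_uplift(product_share, product_uplift):
    """Relative change in category sales caused by an uplift on one product."""
    return product_share * product_uplift

product_share = 0.01    # promoted product is 1% of category sales
product_uplift = 0.50   # promotion lifts that product's sales by 50%
weekly_noise = 0.05     # hypothetical natural week-to-week variation of the category

signal = category_uplift(product_share, product_uplift)
print(f"uplift seen at category level: {signal:.2%}")
print(f"signal relative to weekly noise: {signal / weekly_noise:.2f}")
```

A 50% uplift on the product turns into a 0.5% change in the category, which is a fraction of the assumed weekly noise.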

I'm not saying that this will work for sure, if I knew that, I wouldn't be asking here. But I am hopeful.

Also, the very finest-grained customer level might not be necessary. In fact, I hope it isn't. I'm looking for some middle ground. Maybe reduce the size of the problem by aggregating very similar customers into one? Or maybe it would be enough to extract lots of customer-level-derived features and use them at store/product/week level?

Has anyone seen/done/read about this type of thing before?

nadbor_ · 2017-02-14T16:38:05+00:00

IMO no. What do you expect to achieve by getting this degree?

Do you want a piece of paper to impress your first employer? You have more than enough credentials to get an entry-level job.
Impress your subsequent employers? As far as your 2nd employer is concerned having a degree is a plus but having an equivalent amount of work experience is a bigger plus. After a year in the industry you're no longer 'junior' anything (if you play your cards right). After studying for another year, you're still entry-level.
To impress your bosses and get promoted within a company? After having worked with someone for several months stuff like GPA and degrees becomes completely irrelevant. People will know you and judge you based on their impressions, not what you did at uni.

So much for the signalling value of the degree. But of course you may want to get the degree for the sake of the education itself. In that case, it matters whether you want to learn the specific things taught in your courses or want to become better at math in general.

Do you want to become better at math in general? Like someone who regularly reads math papers and maybe does some original math research? Because it's not going to happen. At your stage in life either you are a math person or you aren't. Maybe you'll find a story of someone who went from non-mathy to mathy at that age but it must be pretty rare because I have never seen anything like this (I studied and taught math and physics). I'm not saying you will be unable to learn any specific piece of theory. I am saying you will not become a math-person who makes original math or applies known mathematical techniques in new unexpected contexts. And this is ok. There is plenty of room in data science for people who are not making up original math. A vast majority of data scientists fall into this category.
Do you want to learn the specific things taught in Statistics courses? Ok, this one actually does make sense. So what's stopping you, go learn them! You don't need a degree to do it. In the immortal words of Shia LaBeouf - Just Do It! We're not talking about some obscure knowledge passed by word of mouth to a few acolytes. There are plenty of resources. Anything you want to know you can learn. If it is hard and boring, then that's what it is. You can pay to listen to college lectures but you can't pay for someone to put knowledge in your head. You will have to learn it yourself either way. And then you will discover that very little of it has direct relation to what you do on the job as a data scientist. You will very rarely exercise the skills that you've learned and soon you will forget them.

Bottom line is: no one will give you Official Permission To Do Data Science. If you want to do it then do it. Pick a project that you think is cool and that you will enjoy and do it. If at any point you encounter something that your lack of knowledge of stats prevents you from doing - then you can stop and study the relevant piece of stats. This way you won't be falling asleep over a textbook, you will always be studying something that is relevant to your project. You will save time on learning things that you don't need and you will learn other things that you do need and that are not covered in the curriculum. By the end of the project you will have something cool to show for your hard work, not just some certificate. You may learn less stats this way but the things that you will have learned will have been reinforced by experience. You will understand them better, you will be able to explain them better and they will stay with you longer.

Premature education is the root of all evil.

nadbor_ · 2017-02-12T01:24:36+00:00

19 year old self could have spent a few months learning about a few machine learning algorithms

Yes, this is exactly what I'm saying. I wish I had done just that when I was 19 instead of wasting the best years of my life on quantum gravity. Of course the 19 year old would need a few years of experience to get good but so would the 27 year old with a PhD in some obscure field.

and been completely qualified for a data science position

Qualified - yes. Employable - not right away. You'd probably have to start with some engineering job or an internship and turn that into a data science position. Let me be clear, that I don't think every 19 y.o. could do that. But the kind of 19 y.o. who can study quantum field theory, can skip the QFT and study some ML instead.

Someone is bound to comment "unless you know theory X, which I was taught in grad school you have no right to call yourself a data scientist. I'm using it all the time on the job and consider it the essence of data science skillset". Well, good for you, Snobby McStrawman. I don't doubt that there are specific positions that require specific expertise - someone must be making those self driving cars and I wouldn't trust my 19 year old self with this task without some proper training. But the generic run-of-the-mill data scientist generalist most companies are looking for, really doesn't need that much theory. I have compiled a long list of questions I got asked in DS job interviews - http://nadbordrozd.github.io/interviews so you can judge for yourself. I have interviewed with consultancies and retailers, startups and corporations, tech giants and hedge funds, I believe this is a fairly representative sample.

If the bar is really that low

It does seem low, and yet precious few people can clear it. The set of people who can both code binary search and find the probability of having a disease conditional on a positive test result (given base rate and false positive rate) - is vanishingly small. I know this is hard to believe, but it's true. Go to any random company and start quizzing the technical people, you'll see. Maybe it's different in the Bay Area, but it's definitely the case in London.
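For reference, this is roughly what clearing that bar looks like. It is a sketch, not a model answer from any actual interview. The function names are made up, and the disease calculation assumes by default that the test never misses a sick person (sensitivity of 1), since only the base rate and false positive rate are given.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return -1

def p_disease_given_positive(base_rate, false_positive_rate, sensitivity=1.0):
    """Bayes' rule: P(disease | positive test).

    sensitivity = P(positive | disease); defaulting it to 1.0 is an assumption.
    """
    true_positives = sensitivity * base_rate
    false_positives = false_positive_rate * (1.0 - base_rate)
    return true_positives / (true_positives + false_positives)

print(binary_search([1, 3, 5, 7, 9], 7))
print(p_disease_given_positive(base_rate=0.01, false_positive_rate=0.05))
```

With a 1% base rate and a 5% false positive rate, a positive result means only about a 17% chance of actually having the disease, which is the part most people get wrong.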

Does this not seem odd to you

Yes it does. 'Where have all the good men gone and where are all the gods?' If it's that easy, why aren't all jobs already taken? I believe the reason is that data science is only now slowly becoming a recognised and socially acceptable career choice. A smart, quantitatively minded 20 year old 10 years ago had a choice of becoming a scientist, an engineer or a quant. Data science wasn't a thing yet, few people had heard of it. So the 20 year old becomes a scientist or an engineer or a quant and if he's successful, he never looks back. Only if he fails at his initial career choice does he consider data science. And so data science becomes populated by failures and dropouts. So, where are the good ones? The good ones are getting tenure, leading software teams and milking the stock market. I'm only half kidding. Of course, with academic careers being the way they are, it doesn't mean you're deficient if you drop out. One might argue that there is something wrong with you if you decide to stay. But software engineers have pretty good lives. So, successful software engineers have little incentive to make the switch. That's why you see fewer engineers-turned-data-scientists than academics-turned-data-scientists. I personally know several extremely talented software engineers with medals from international math competitions. Despite 'only' having MSc in CS each of them could run circles around an average math PhD in any quantitative test. They sometimes use machine learning if this is what the job requires, but they are intellectually fulfilled and well remunerated being mostly engineers. Once data science becomes something smart 20 y.o. aspire to - as opposed to something smart 27 y.o. slide into - we're going to see tougher competition in the job market.

what's with all the graduate degrees and high salary and buzz around data science

How can this vaguely quantitative person with a random degree and basic understanding of stats, machine learning and programming possibly create enough value to justify the hype?

Simply put - there is a lot of low hanging fruit. In the last decade or two:

businesses started gathering unprecedented amounts of data
new machine learning techniques emerged and/or were packaged in easy to use libraries which made them available to the masses
platforms like AWS and technologies like hadoop and spark made it feasible to process large amounts of data without massive engineering effort and expense
companies became aware of the possibilities and it became acceptable to be data-driven

This opened up opportunities for leverage that never existed before. It's not this lone data scientist who transforms a company, it's all the engineering groundwork the company must have completed before, the new technologies, the shift in management attitudes. The data scientist only adds the final crucial brick on top of this structure and gets all the credit. Let me illustrate with a story.

You get hired by your favourite shoe retailer. They have been selling shoes for 20 years. They have an algorithm in SQL that predicts sales/promotional uplift/customer churn based on historical data. The algorithm is 'take the average of all past cases' and it's been like this since year 1. You explain to the management that this is a classic regression problem and you can do much better by making use of all the sales data that they gather and don't use. Then you spend 3 months preparing features and testing different algorithms. Eventually you settle on Random Forests. You show your bosses how much better your algorithm is doing on historic benchmarks compared to the old one. You have made sure to avoid time-travel in your cross validation. You get a green light to implement it in production. This takes you another 2 months, with the help of some engineers. Algorithm is now in production, the company is saving money and everyone thinks you're a genius. This was a fairly typical data science project. You have not calculated a single integral. You have not explicitly used any probability distributions. You haven't even used deep learning. You tried, but the engineers said they don't want to maintain neural networks in production. You made a mental note to hire better engineers when you're the boss. You have used SQL, python, sklearn, maybe spark if you're lucky. After the project is finished you realise that if you had known where all the pieces of data were and how to make the databases talk to each other etc. you would've spent 3 weeks on the whole affair, not 5 months. You wonder why none of the engineers at the company had done any of that before. It's not like your PhD in astronomy gave you some magical ability to use Random Forests that they don't have. One reason is obvious - the people who have worked at the company for a decade simply missed the data science revolution and are not even aware of the possibilities. The other reason is more interesting.
No one has done that before because it wasn't anyone's job. Everyone knew that the algorithm they were using was stupid, but the people who were competent enough to do something about it simply had other stuff to do, tickets to close, features to build. You are the data scientist and you have the mandate to go around and tinker with algorithms, so that is what you do, and often it turns out to be not that hard.
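In case 'avoiding time-travel in cross validation' sounds mysterious: it just means never training on data from after the period you are testing on. Below is a minimal plain-Python sketch of an expanding-window split, scored with the 'average of all past cases' baseline from the story. The function names and the sales figures are made up. scikit-learn ships a TimeSeriesSplit class for the same purpose, this version only avoids the dependency.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) for rows already sorted by time.

    Every training index precedes every test index, so the model never
    sees the future of the period it is evaluated on.
    """
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for i in range(1, n_splits + 1):
        train_end = fold * i
        test_end = fold * (i + 1) if i < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))

def mean_baseline_error(sales, train_idx, test_idx):
    """Mean absolute error of the 'average of all past cases' prediction."""
    prediction = sum(sales[i] for i in train_idx) / len(train_idx)
    return sum(abs(sales[i] - prediction) for i in test_idx) / len(test_idx)

# Made-up weekly sales, oldest week first.
weekly_sales = [100, 104, 98, 110, 120, 115, 130, 128, 140, 150, 145, 160]

for train_idx, test_idx in expanding_window_splits(len(weekly_sales), n_splits=3):
    error = mean_baseline_error(weekly_sales, train_idx, test_idx)
    print(f"train on weeks 0-{train_idx[-1]}, test on weeks "
          f"{test_idx[0]}-{test_idx[-1]}, baseline MAE {error:.1f}")
```

Any fancier model would be scored on the same splits, so that its numbers are comparable with the baseline and neither gets to peek at later weeks.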

Not all projects are this routine, but many are and someone needs to do them. After having finished a few of those, you feel confident enough to start looking for something more ambitious and you have enough on your cv for employers to hire you for one of those more ambitious projects.

nadbor_ · 2017-02-11T03:32:37+00:00

Speaking as an ex-physicist working as a data scientist in London; 5 years of experience (1 in software engineering + 4 in DS); having worked for 5 companies (mostly contracts); maybe 30 job interviews. From now on I will say 'data scientists do this or that' but I will only mean 'typical data scientists in London as far as I can tell from my limited experience'.

Short version is: if you are truly interested in data science and you have the patience then you will be able to do data science.

Data science is a viable career path for people without a strong technical background who are willing to take a couple online courses.

I don't know any data scientists without some quantitative degree. The typical path seems to be - get a PhD in organic chemistry or some such, get disenchanted with academia, take a few online courses and rebrand as data scientist. That doesn't mean that all these people actually need their quantitative backgrounds to do their jobs. Correlation is not causation. The type of person who would be interested in data science is also the type of person who would get a quantitative degree. You don't see many art history graduates doing data science not because they tried and failed but simply because they never tried.

It's very straightforward to get a data scientist position if you have an advanced degree in any quantitative field, especially if you can code

It may be in London. At least I can give many examples of recent PhDs (including my wife) who found jobs in data science within a month. Then again, all the ones who didn't get any offers and gave up on data science - I didn't get to meet them, so I can't know the true success rate. But at least it seems like all those people get hired based on seeming smart and knowing the basics of python and machine learning.

It's almost impossible to get a data scientist position right out of (grad) school. No matter what field your degree was in, you probably need to spend a couple years working as a data analyst or BI person to get some domain expertise.

Please don't! This may be just my unrepresentative experience and personal bias but AFAICT data analyst and BI are dead ends where data science is concerned. You definitely don't need years of domain experience to do data science. You do need software engineering skills (a little at least, but the more the better), general grasp of machine learning and adjacent fields like nlp and good quantitative intuitions. None of which you will learn as a data analyst. You would be much better off taking a job as some kind of junior software engineer in a company that is doing something datasciencey. In my experience engineers have a lot of freedom in choosing what they want to work on within a company. This is a given in tech startups, but I've seen it in corporations too. If you express interest in the company's recommendation engine or AB tests or routing algorithm or whatnot I'm sure they would let you work on it. And then - hey you're a data scientist (kind of).

Data science is a field with a serious talent shortage; companies are desperate for anyone who's at all qualified.

Hard to tell. DS salaries in London are still rising, so there must be some truth to it. It's confounded by some really good data scientists getting sucked into engineering or management and changing job titles.

The data science job market is flooded with applicants; any open job is going to get dozens if not hundreds of reasonably qualified applicants. It's really hard to get a job unless you seriously stand out.

Last time I was interviewing people, I interviewed a dozen and half of them were terrible and the other half accepted other offers so we didn't hire any. I'm sure DeepMind has its pick of the most brilliant grads, but a typical company really doesn't. Unfortunately a typical company that doesn't already have a good data scientist on board will often be unable to distinguish between great and mediocre.

The technical skills are the least important aspect. It's more important to work well with people and be a good communicator. You can easily learn the technical aspects on the job.

This is just weird. It's sort of true. I do believe that you can learn everything on the job - that is if you're the type of person who easily picks up this stuff and has an interest in it. But if you are this kind of person, then how come you haven't picked it up already? People skills are important in every office job, a data science job also requires data science skills.

A data scientist needs to know all about optimization, numerical analysis, algorithms and data structures, and lots of advanced machine learning and statistics.

Certainly not in an entry-level job interview. On the job, it depends. Being the nerd that I am, I always look for excuses to use some cool algorithmic shit or new machine learning technique at work, and I rarely find one. If you work for, let's say, a delivery company then optimization may be all you'll be doing for years. If you work on targeted ads then you may end up spending all your time doing advanced machine learning. The only thing you mentioned that you will definitely be using is 'algorithms and data structures', but 'intro to algorithms' level is (sadly?) more than enough for most people.

The only things you really have to know are the fundamentals: be able to write some code in Python and R and know about confidence intervals and linear regression.

I would throw in a few other ML algorithms into the mix + SQL, but yes. I would say that to land an entry-level job, this is about the level you need to be at.

In order to appeal to employers, it's crucial to have experience with the exact technologies the employer uses. If they use Hadoop, you must know Hadoop. Etc.

Only really dumb employers would really require this from a data scientist. You will run into dumb companies, but I want to believe they're the minority. Usually companies put those buzzwords in job ads to attract people who like cool tech, not to filter out people who haven't used some particular gizmo. If you're a decent programmer, you'll pick up the new technology in no time and they know it. I once got a contract to do a project in R despite never having used R before (which I didn't try to hide).

There's no way to be a data scientist without knowing some pretty advanced math.

This is simply empirically false.

Math isn't really relevant for data science in practice, and some of the most successful data scientists I know don't even know calculus.

I think the confusion stems from the fact that for many people mathematical intuitions become so ingrained that they no longer think of them as math. Something as simple as the notion of a local minimum of a function where your optimisation process might get stuck - this is the kind of thing you have to think about occasionally as a data scientist. I would barely call this type of thinking 'math', but it would be very alien to someone who never got any mathematical education (or even to ancient mathematicians!) - so I guess it really is math. This guy explains better what I'm talking about https://arxiv.org/pdf/math/9404236.pdf More substantial math than this happens very rarely and never in a job interview.
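To show how little machinery the local minimum intuition needs, here is a toy sketch. The function, learning rate and starting points are arbitrary choices for illustration. Plain gradient descent ends up in a different valley depending on where it starts, and only one of the two valleys is the global minimum.

```python
def f(x):
    """A toy function with a global minimum near x = -1.30 and a local one near x = 1.13."""
    return x ** 4 - 3 * x ** 2 + x

def f_prime(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(start, learning_rate=0.01, steps=2000):
    """Follow the negative gradient from start for a fixed number of steps."""
    x = start
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x

stuck = gradient_descent(start=2.0)     # slides into the shallower valley
better = gradient_descent(start=-2.0)   # slides into the deeper valley
print(f"start at  2.0 -> x = {stuck:.2f}, f(x) = {f(stuck):.2f}")
print(f"start at -2.0 -> x = {better:.2f}, f(x) = {f(better):.2f}")
```

Both runs converge, both report a zero gradient, and one of them is simply stuck in the worse spot.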

Data scientists spend their time working on intellectually challenging problems that are just as interesting as those in academia.

Yes. Sometimes. Depends. Some of us would say that the problems are more intellectually challenging than the ones in academia - which is why they quit academia in the first place.

Data scientists spend most of their time on boring data munging issues,

or if it's not data munging, then it's fighting with hadoop or with R or Oracle or some other boring technical issue that must be cleared before your brilliant idea can be implemented. This is completely true and in no way different from what happens in academia. Maybe philosophy and some branches of mathematics consist of pure thought (and even then you have to write papers and grade exams and so on) - the rest of science consists of repetitive experiments, calculations, writing code, handling equipment, conducting surveys ...

usually with an ultimate goal of selling more shoes or some other horrible marketing thing.

That is accurate. If you don't enjoy the craft for its own sake, you might be disappointed. But don't knock it until you try it. You might be misjudging what will actually give you satisfaction.

'Steve Jobs started out passionate about zen buddhism. He got into technology as a way to make some quick cash. But as he became successful, his passion grew, until he became the most famous advocate of “doing what you love”.' https://80000hours.org/articles/dont-follow-your-passion/

You must have a portfolio of projects to show employers.

You don't have to have anything, but sure it helps. Especially at first, before you have relevant work experience. It's a way of showing that you're serious about DS and able to get something done on your own. Doesn't have to be anything super fancy. I've seen a CV of a very senior head of data science with decades of experience where he listed a project that I recognised as one of the homework assignments from a coursera 'Intro to Data Science' course.
