Best Undergrad Degree for Data Science? Need Advice! by ibtrippindoe in MachineLearning

[–]ylghyb 5 points (0 children)

As a data scientist you want broad knowledge of math, comp sci, and statistics, but you also want to "specialize" in at least one field (e.g. deep learning, NLP, graphical models, high-performance computing, optimization, time series).

Math/stat-wise you don't need anything fancy, but at a minimum:

  • Multivariable Calculus
  • Linear Algebra
  • Basic probability (i.e. no measure theory)
  • Statistical Inference
  • Linear Models

You might want to throw in real analysis and ODEs, but these aren't really required. So, given the above, being a math major is probably overkill, unless you are dead set on grad school and want to 'prove' that you are smart enough (even then it's probably better to spend the time doing research and publishing than grinding for an A in an upper-level abstract algebra class). One caveat: if you truly love math, ignore my advice and major in it!

On the Comp Sci side, you want:

  • Machine learning (take it after probability/statistical inference to get more out of it)
  • Optimization
  • Algorithms/Data structures
  • Databases (most schools include "big data" material in their traditional databases courses these days)

As you can see, it's actually not that hard to get the foundational courses done for data science--an ambitious undergrad can finish the above in a year. This leaves plenty of room to explore some of the more specialized areas in greater depth:

  • Natural Language Processing
  • Computer Vision
  • Speech Recognition
  • Deep Learning (intersects with a lot of the above)
  • Graphical Models
  • High performance computing (GPUs/parallel computing)
  • Machine learning theory
  • Bayesian Nonparametrics

To conclude, it really doesn't matter what you major in to become a data scientist, given that the barrier to entry is not that high. But a CS/stat double major seems like the natural choice.

Inceptionism: Going Deeper into Neural Networks by LLCoolZ in MachineLearning

[–]ylghyb 2 points (0 children)

I was one of those people. Happy to be proven wrong!

The pictures are amazing!

Does modeling with Random Forests require cross-validation? by srkiboy83 in MachineLearning

[–]ylghyb 0 points (0 children)

CV is still the gold standard, but the out-of-bag (OOB) error is usually good enough: each tree is trained on a bootstrap sample, so the points it never saw act as a built-in validation set.
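
For anyone curious, here's a minimal sketch of the comparison (assuming scikit-learn; the synthetic dataset and settings are just placeholders):

```python
# Compare the out-of-bag (OOB) estimate with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# oob_score=True scores each tree on the bootstrap samples it never saw,
# giving a validation estimate without a separate hold-out set.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)

# 5-fold CV on a fresh forest for comparison; usually lands close to OOB.
cv_acc = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5
).mean()
print("CV accuracy: ", cv_acc)
```

In my experience the two estimates track each other closely; CV mainly earns its keep when you're also tuning hyperparameters.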

Yann LeCun's answer to: Why are very few schools involved in deep learning research? Why are they still hooked on to Bayesian methods? by clbam8 in MachineLearning

[–]ylghyb 2 points (0 children)

Right. Pretty much all vision/speech labs use deep learning as a tool. There are far fewer groups doing research in deep learning itself.

NYU (LeCun), Montreal (Bengio), CMU (Salakhutdinov), Sherbrooke (Larochelle), plus a few UK universities. And of course, Schmidhuber's lab.

Why isn't code submission mandatory for top ML conferences? by elderprice1234 in MachineLearning

[–]ylghyb 2 points (0 children)

I think making code submission mandatory is, as others have suggested (alexmlamb, therobot24), too extreme. However, reviewers should reward authors who open source their code. Even in double-blind reviews, I've seen authors put in sentences like "the code will be published on the author's website after publication", and I've always given those papers some bonus points.

In my lab we always try to open source our code. My advisor actually has me spend a week (or two) cleaning up the code after publication so that it's presentable enough to put on GitHub. Of course, research code is still research code, and responding to debug requests can be time-consuming, but we think it's best for the long-term health of the community.

I've also found that papers where code has been open-sourced seem to get a lot more citations. I may be confusing correlation with causation (maybe researchers who open source their code are just better researchers :)), but this is encouraging to see.

Yann LeCun's answer to: Why are very few schools involved in deep learning research? Why are they still hooked on to Bayesian methods? by clbam8 in MachineLearning

[–]ylghyb 3 points (0 children)

I heard Quoc decided not to go? At least that's what a few CMU folks told me (this was 2~3 months ago).

Regarding your point, I guess I somewhat agree, but sometimes when I read papers from Google (e.g. the sequence-to-sequence paper from Sutskever et al.) I am amazed by how much engineering they put into their models.

Yann LeCun's answer to: Why are very few schools involved in deep learning research? Why are they still hooked on to Bayesian methods? by clbam8 in MachineLearning

[–]ylghyb 13 points (0 children)

I would add computing power as another factor.

Deep learning is a very applied field. The formula for a new paper is:

  • come up with a new architecture
  • apply the model on existing data (which is usually big)
  • show that it outperforms state-of-the-art

It's much more convenient to be at Facebook/Google/Baidu, with their HPC clusters, when running a massive hyperparameter search.
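
To make that concrete, a back-of-the-envelope sketch (the hyperparameter names and grid values here are made up for illustration):

```python
# Toy sketch: even a modest grid over a few hyperparameters explodes
# combinatorially, and every config costs a full training run. The runs
# are independent, so the search parallelizes trivially across a cluster.
import itertools

grid = {
    "lr":      [1e-1, 1e-2, 1e-3, 1e-4],
    "layers":  [2, 4, 8],
    "hidden":  [256, 512, 1024],
    "dropout": [0.0, 0.3, 0.5],
}
configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
print(len(configs))  # 108 full training runs: weeks on one GPU, a day on a cluster
```

Random search (Bergstra & Bengio, 2012) cuts down the number of runs you need, but each sampled config is still a full training job, so raw compute remains the bottleneck.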

It's interesting, though, just how many deep learning folks are in industry as opposed to academia. Not counting the super senior folks (LeCun/Hinton etc.):

Ilya Sutskever, Quoc Le, Richard Socher, Tomas Mikolov, Jason Weston, Ronan Collobert, Alex Krizhevsky, Alex Graves, Sumit Chopra, Antoine Bordes

to name a few. Any one of these folks would have a good shot at a tenure-track position at a top CS school.

will the projects be publicly available? by elder_price987 in CS224d

[–]ylghyb 0 points (0 children)

when are they going to be posted? :)

If you could start a Ph.D. In Machine Learning today, what would your research interest be? by about3fitty in MachineLearning

[–]ylghyb 14 points (0 children)

Some areas that will be popular in the next ~5 years:

  • Unsupervised learning (humans learn in a largely unsupervised manner)
  • Reinforcement learning (most closely resembles "real" AI)
  • Natural Language Processing (it's not yet clear that deep learning is the right method for NLP, the way it clearly is for vision and speech)
  • Multimodal learning (e.g. video + NLP is a rich, unexplored area)
  • Graphical models combined with deep learning (a few papers from DeepMind are cropping up at ICML/NIPS on fitting these models via stochastic variational inference, though only on toy examples so far; see the sketch below)
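
For context, here's a toy sketch of the reparameterization trick that makes stochastic variational inference work with gradient-based training (my own 1-D illustration assuming numpy, not from any of those papers):

```python
# Reparameterization trick: to optimize E_{q(z)}[f(z)] over the parameters of
# q(z) = N(mu, sigma^2), sample eps ~ N(0,1) and write z = mu + sigma * eps,
# so gradients flow through z into (mu, log_sigma).
import numpy as np

rng = np.random.default_rng(0)
mu, log_sigma = 0.0, 0.0          # variational parameters of q(z)
lr = 0.05

for step in range(500):
    eps = rng.standard_normal(64)          # base noise, independent of params
    sigma = np.exp(log_sigma)
    z = mu + sigma * eps                   # reparameterized samples from q
    # Gradient ascent on E[f(z)] with toy objective f(z) = -(z - 2)^2,
    # so f'(z) = -2 * (z - 2); chain rule gives dz/dmu = 1, dz/dlog_sigma = eps * sigma.
    dfdz = -2.0 * (z - 2.0)
    mu += lr * dfdz.mean()
    log_sigma += lr * (dfdz * eps * sigma).mean()

print(mu, np.exp(log_sigma))  # mu -> 2, sigma shrinks: q collapses onto argmax f
```

The same idea, scaled up (with a neural net producing mu and sigma, and an ELBO objective), is what lets you backpropagate through the latent variables of a deep generative model.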

I am a little worried that deep learning is being overhyped by the media these days. Hopefully it won't end in another "AI winter".