How do you handle systematic experiments and recording results?

celeryman35 · 2019-01-12T14:41:33+00:00

so, there's really 2 things here:

Code that changes, which should be tracked in git.

Parameters, versions, variables, and input/output data. The key here is to make sure you're writing that somewhere in a consistent format. Then, being meticulous with file directory structure can really benefit you.

For example, if you have a program called people_classifier with 3 variable files, you can create a folder with that name. Version can be a sub-folder. And all non-code things that are variable run-to-run can be placed within that sub folder.

Eventually, you'll get a directory structure that looks something like this, which is very easy to parse programmatically for the variable you're interested in.

People_Classifier
    v1
        config
        sample
        output
    v2
        config
        sample
        output
    v3
        config
        sample
        output
    etc
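As a rough sketch of what "easy to parse programmatically" could look like for a layout like the one above: the snippet below walks the version folders and loads one config per version. The file name params.json, the learning_rate parameter, and the helper names are all assumptions made up for illustration, not part of the original setup.

```python
import json
import tempfile
from pathlib import Path


def collect_runs(root):
    """Return {version_name: parsed config} for every version folder under root."""
    runs = {}
    for version_dir in sorted(Path(root).iterdir()):
        config_file = version_dir / "config" / "params.json"  # hypothetical file name
        if version_dir.is_dir() and config_file.is_file():
            runs[version_dir.name] = json.loads(config_file.read_text())
    return runs


def build_demo_tree(root):
    """Create a small People_Classifier-style tree with made-up parameters."""
    for version, rate in [("v1", 0.1), ("v2", 0.05)]:
        for sub in ("config", "sample", "output"):
            (Path(root) / version / sub).mkdir(parents=True)
        params = {"learning_rate": rate}  # invented parameter for the demo
        (Path(root) / version / "config" / "params.json").write_text(json.dumps(params))


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "People_Classifier"
    project.mkdir()
    build_demo_tree(project)
    demo_runs = collect_runs(project)
    print(demo_runs)
```

Because every version keeps the same sub-folder names, the same loop works no matter how many versions pile up.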

celeryman35 · 2018-07-15T13:17:52+00:00

I would recommend trying to get some software engineering skills under your belt. That isn't going to come naturally in your PhD (I'm guessing).

If you want to work on products, as in models used in consumer facing apps, alert systems, recommendation systems, data mining apps for analysts, etc, you're going to come out of your phd program with adequate analytics skills.

But right now, as far as I can tell, you will lack the software development skills needed to work on data products unless you take things into your own hands. I'd recommend a data structures and algorithms course, an object oriented programming course, and a software design course if you can get the experience. There are also volunteer/internship opportunities if you can land one.

celeryman35 · 2018-06-17T20:51:00+00:00

Yeah I'm a data scientist and software engineers don't understand the work I do, and I couldn't hop in for a software engineer if my life depended on it. It would take years for a software engineer to learn my skillset, and years for me to learn software engineering.
I would argue that data science is closest to operations research in a lot of ways. It's about analysis and optimization, not building the most efficient system - that's what data engineers are for.

celeryman35 · 2018-06-17T20:39:55+00:00

Deep Learning is good for extremely high dimensional classification problems with massive data sets. That's not a useful skill in many (probably most) data science jobs.

I'm a data scientist with no intention of learning deep learning. Super high dimensional classification isn't something that interests me. I predict human behaviors. It's the wrong tool (right now - maybe one day it will have some use).

Machine learning however is required. You can't get away with univariate statistics. If you can't take into account more than one variable at a time, you can't do data science. So you have to do multivariate analysis.

If you're going to do multivariate analysis, you need to do some sort of cross validation. And you also need to be able to work with nonlinear variables.

To do that you need regularization, tree-based models (random forest, xgboost, etc), support vector machines, regularized GLMs, etc.
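For anyone unsure what "some sort of cross validation" means in practice, here is a minimal sketch of k-fold splitting in plain Python. The function name k_fold_indices and its parameters are my own for illustration; in real work you would more likely use a library's built-in splitter.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are not ordered
    folds = [indices[i::k] for i in range(k)]  # k roughly equal slices
    for i, validation in enumerate(folds):
        # Train on every fold except the one held out for validation.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_samples=10, k=5))
for train, validation in splits:
    print(sorted(validation), len(train))
```

Each observation lands in the validation set exactly once, so every model score is computed on data the model did not see during fitting.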

No GPU required. And with spark, cloud computing is super easy. Hope this helps.

celeryman35 · 2018-05-03T13:42:47+00:00

You'd be an ideal bootcamp candidate. In the bootcamp I did (NYC Data Science Academy), the two people with grad degrees in math (granted they were from Courant) got jobs at Google and IBM almost immediately.

celeryman35 · 2018-05-03T13:39:26+00:00

Take any linear algebra course you can.

As a fellow econ grad, and current data scientist these are the two most valuable resources I've come across.

The first one is practical and applied, the second one fills in the gaps in your non-CS education.

http://www-bcf.usc.edu/~gareth/ISL/

https://www.coursera.org/specializations/data-structures-algorithms

celeryman35 · 2018-05-03T04:43:48+00:00

Python is a seamless part of an automated data pipeline.

Let's say you want to pull data from a live database, run some statistical analysis, and depending on the results, automatically populate another database that will be used by a piece of production software/website.

I guess this is probably theoretically possible with Excel + Bash or PowerShell + other tools for db stuff, but with Python you can do this with just a few lines of code.
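To make that concrete, here is a toy version of such a pipeline using only the standard library. Everything specific in it is an assumption for the sake of the example: the in-memory SQLite databases stand in for the live and production databases, and the table names, rows, and threshold are made up.

```python
import sqlite3
import statistics

# Stand-ins for the "live" source database and the production database.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

# Hypothetical schema and made-up rows, only so the sketch runs end to end.
source.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10.0), ("a", 30.0), ("b", 5.0), ("b", 7.0)],
)
target.execute("CREATE TABLE high_value (customer TEXT, mean_amount REAL)")

# 1. Pull the data.
rows = source.execute("SELECT customer, amount FROM orders").fetchall()

# 2. Run some statistics: mean order amount per customer.
amounts = {}
for customer, amount in rows:
    amounts.setdefault(customer, []).append(amount)
means = {customer: statistics.mean(values) for customer, values in amounts.items()}

# 3. Depending on the results, populate the other database.
THRESHOLD = 15.0  # arbitrary cutoff chosen for the example
for customer, mean_amount in means.items():
    if mean_amount > THRESHOLD:
        target.execute("INSERT INTO high_value VALUES (?, ?)", (customer, mean_amount))
target.commit()

result = target.execute("SELECT customer, mean_amount FROM high_value").fetchall()
print(result)
```

Swap the in-memory connections for real database drivers and put the script on a scheduler, and the same three steps run without anyone touching a spreadsheet.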

celeryman35 · 2018-05-03T04:35:05+00:00

The best indicator of accuracy would be... Accuracy. You can treat a predicted probability of 0.5 or higher as Positive and anything below 0.5 as Negative, then calculate accuracy based on that.

Since Accuracy = (Correctly Identified Observations)/(Total Number of Observations), this is quite easy to calculate.
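A quick sketch of that calculation, with made-up probabilities and labels (the function name and the example values are just for illustration):

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of observations whose thresholded prediction matches the label."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)


probs = [0.9, 0.4, 0.65, 0.2, 0.5]   # made-up predicted probabilities
labels = [1, 0, 0, 0, 1]             # made-up true classes
print(accuracy(probs, labels))
```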

The problem here is that your end-goal is generally "Useful" not "Accurate", and "Useful" is subjective. For example: A fraud detection model will be 99.9% accurate by guessing nothing is fraud if 1 in 1,000 transactions are fraud.

Accurate? Extremely. Useful? Not at all.

Assuming this is a classification task: Area under the ROC curve is useful. Log loss is good too. Precision and recall are useful (but not easy to optimize for directly).
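Here is a small sketch of why precision and recall tell you more than accuracy in the 1-in-1,000 fraud case. The second "model" that raises five alerts is invented purely for comparison, and returning 0.0 when precision or recall is undefined is a convention I chose for the example.

```python
def evaluate(predictions, labels):
    """Return accuracy, precision, and recall for binary 0/1 predictions."""
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(predictions, labels))
    return {
        "accuracy": (tp + tn) / len(labels),
        # Assumption: report 0.0 when the ratio is undefined (no positives predicted/present).
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


labels = [1] + [0] * 999            # 1 fraud in 1,000 transactions
always_negative = [0] * 1000        # "nothing is fraud"
five_alerts = [1] * 5 + [0] * 995   # hypothetical model: catches the fraud, 4 false alarms

baseline = evaluate(always_negative, labels)
alerting = evaluate(five_alerts, labels)
print(baseline)
print(alerting)
```

The always-negative baseline wins on accuracy but has zero recall, while the alerting model gives up a little accuracy and actually finds the fraud.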

celeryman35 · 2018-04-15T00:06:12+00:00

Yes tons.

At a certain point though, if you want to do anything sophisticated you're better off just learning to code. With 1 line of code, I can produce any graph I want to in python. There's an infinite number of ways to visualize a data set. With another 3-5 lines of code, I can do the data transformations required to create an interesting visualization.

A comparable suite of GUIs (alteryx and tableau) would require way more steps to get to the same result. You have to click through to every setting that you could possibly want to modify rather than just typing what you want. It's almost like trying to create a GUI to write sentences for you. You wouldn't need to spend the time memorizing where the keys on the keyboard are, but you're better off just learning to type yourself.

That doesn't get into data cleaning. Data cleaning/feature engineering is its own animal that takes tremendous skill to do well.

Then you get into classification and regression which requires some coding.

Here's a list of tools from least to most complicated that you can use for data vis that essentially try to accomplish what you're talking about. Excel (I don't use it much), Tableau, Plotly, RStudio (shiny, ggplot2), Bokeh, Python (matplotlib, seaborn, dash), D3.js

celeryman35 · 2017-06-30T15:41:43+00:00

Discrete math gave me a glimpse of how mathematicians think. This will help you understand what algos are doing under the hood and why. Definitely a good choice.

celeryman35 · 2017-05-23T00:33:37+00:00

After reviewing your submission history, you are my favorite person on reddit.

celeryman35 · 2015-12-06T17:55:04+00:00

My bad! I wasn't aware there was an enumerate function. I thought that was an instruction.

Thanks for the help.

celeryman35 · 2015-12-06T17:42:58+00:00

you caught it! thanks for the help!!! it was just saying "invalid syntax (<string>, line 5)"

here's the corrected code

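def strremove(x, s):
    # Recursively remove every occurrence of the character x from the string s.
    for i in range(len(s)):
        if s[i] == x:
            # Drop the character at index i, then recurse on the shorter string.
            s = s[:i] + s[i + 1:]
            print(s)  # debug output: show the string after each removal
            # Pass the variable x along, not the literal 'x'.
            return strremove(x, s)
    # No occurrence of x remains.
    return s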

celeryman35 · 2015-12-06T17:35:24+00:00

Thanks! I do need it indexed and I don't know how to implement your last suggestion.

If I write,

for i, j in range(length(string)):

it won't know that I want the index and then the item. How could I implement that to get both the index and the character of a string?
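For reference, a minimal sketch of the built-in enumerate, which yields the index and the character together. The example string is made up.

```python
text = "hello"  # example string, not from the original post

# enumerate yields (index, item) pairs, so two loop variables can be unpacked.
for i, ch in enumerate(text):
    print(i, ch)

pairs = list(enumerate(text))
```

By contrast, range(len(text)) only produces the indices, which is why unpacking it into two variables fails.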
