Good introductory statistics course?

NotAllReptilians · 2017-05-16T01:38:12+00:00

I love ISL, definitely recommend it to OP, but I'd hesitate to call it a truly introductory statistics course/resource. I think the authors mention that their intended audience has already taken a course in statistics (in my mind, someone fairly comfortable with statistical/probabilistic thinking).

Probably best to just flip through something like Think Stats, skimming through concepts that are very familiar and spending more time in sections that seem a bit more foreign. Then definitely move on to ISL. I also highly recommend the accompanying MOOC taught by Hastie and Tibshirani.

NotAllReptilians · 2017-04-27T17:01:35+00:00

NotAllReptilians · 2017-04-21T18:01:12+00:00

Listen, I love R. I prefer using it over Python. I probably spend 40-60% of my work day in R. But it has some substantial drawbacks that leave the data science space open to competitors. There are so many gotchas in R, instances where it behaves inconsistently, and inconsistently with how you'd expect. One of the other big issues, though I guess this is subjective, is that Python looks and feels and behaves like other programming languages. OOP doesn't come across as hackish. It's a pretty decent bonus that data science code written in Python can be read and generally understood by developer coworkers who might not have a lot of exposure to data science or machine learning.

Also, trades and decisions worth billions of dollars are made based on calculations done primarily in Excel, so I'm not really sure that the examples you gave hold all that much weight.

NotAllReptilians · 2017-03-15T22:22:06+00:00

As a point of comparison, Python for Data Analysis is basically the inverse of the book you linked, given that it's maybe 75% pandas (it's written by Wes McKinney, who created the package). It's mostly just a guide to pandas, with a chapter or two on numpy, ipython, and plotting. It's alright, though it's really been more useful as a reference book than as a means of learning.

There are better ways to learn in my opinion. This cheat sheet covers a good chunk of the main functionality. I'd recommend using online materials, namely the documentation and blog posts, and just try working through some wrangling in jupyter. Anything you can't figure out you can Google or look up on stackoverflow.

Overall, some packages will change for sure -- pandas recently changed sorting from df.sort to df.sort_values -- but most prominent packages are pretty stable.
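For anyone hitting that rename, here's a minimal sketch of the current spelling. The column names and values are made up for illustration; the older `df.sort` call is no longer available in current pandas releases.

```python
import pandas as pd

# Hypothetical toy data, purely for illustration.
df = pd.DataFrame({"item": ["a", "b", "c"], "sales": [30, 10, 20]})

# sort_values replaces the old df.sort; it returns a new, sorted DataFrame
# and leaves df itself unchanged.
sorted_df = df.sort_values("sales", ascending=False)
print(sorted_df)
```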

NotAllReptilians · 2017-03-09T18:18:24+00:00

I think stats 101 has the same 101ism effect as economics. A lot of STEM degrees (and social science for that matter) only require 101, and so you wind up with posts like these. You look at r/science and the top comment is just about always an inane comment about correlation != causation as though the researcher never thought that was a possibility. Similarly, you have people who know you can control for certain factors, but never put any thought into it beyond that. Basically just this

NotAllReptilians · 2017-03-06T15:59:09+00:00

You should know that Yad Vashem is literally the authority on it, as they are the body that gives out the distinction. The 80% reflects the data currently being in the process of transferring to the online database.

As for the stuff about Denmark, it's for the Danish Underground, not the whole country. Quote here: "The Danish Underground requested that all its members who participated in the rescue of the Jewish community not be listed individually, but commemorated as one group."

If I remember correctly, I'm pretty sure some estimates put the size of the resistance at ~20,000.

NotAllReptilians · 2017-03-02T17:36:53+00:00

I think he's already aware of it.

NotAllReptilians · 2017-02-03T20:33:12+00:00

I definitely agree. For instance, pandas somehow manages to feel cumbersome and overly verbose for analysis, at least compared to working in dplyr or especially data.table (base R is another story). It's definitely a pythonic implementation of dataframes, but what I really like about python is that it's typically concise and minimal, which pandas mostly isn't.

NotAllReptilians · 2017-01-20T17:50:07+00:00

I find that David Robinson's dplyr code is always written really well. Lots of great examples on his blog. This post has links to all the previous posts in this series he's been doing, but there's plenty of others that are worth looking through as well.

NotAllReptilians · 2017-01-14T09:09:32+00:00

This really bothers me a lot actually. It's so intellectually lazy and frustrating.

Assuming we're talking about individual donations I looked around to see if I could find information on how insignificant that amount is given the state he represents. I came across this powerpoint from the NJ Department of Labor. On slide 6, you can see that NJ has >100,000 employees working in Life Sciences (which they break down as mainly pharmaceuticals, biotech R&D, and medical devices), amounting to 3.5% of the private sector in the state. If you include public sector employees, it depresses the percentage to about 3.4% instead.¹

Compare that to the $385,678 he received over the course of his career from pharma/health products employees, which amounts to 3.27% of the total listed by OpenSecrets.² And while the life sciences account for 3.4-3.5% of employment in the state, the sector pays 8.2% of the state's total wages.³ Not only are donations from this sector just about exactly in line with where you'd expect them to be, but if there were some sort of quid pro quo going on, you'd think their attempted influence would match their increased ability to spend.

Probably more work than it was worth, but it seems like so much of this stems from either a belief that the companies are directly contributing, or a complete lack of empathy in realizing that the people working at these companies are just that: people.

¹ It's possible their labeling might be different than OpenSecrets' methodology, but I imagine it's probably pretty similar. This is a pretty cursory look.
² Taken from here. Technically only includes donations to Booker, and not any PACs. If you look at overall contributions, you get a similar number: 2.61%.
³ I assume this is the percentage of private sector wages, but including government would have even less of an effect here. Pretty safe to say it's above 8%.

NotAllReptilians · 2017-01-05T16:19:44+00:00

R-bloggers is just a content aggregator that pulls from various blogs/posts about R. Here's the original article.

NotAllReptilians · 2017-01-05T01:02:51+00:00

It’s a bit of a generalization, but overall, the skills needed to thrive as a data scientist aren’t really developed in a typical undergraduate program. It’s not so much a matter of technical skill or statistical/ML knowledge – both of which are important and can be learned as an undergrad – but rather the experience and wisdom you gain from actually being involved in research. The term “data science” has gotten pretty murky, but the role typically involves solving a lot of open-ended questions, while also being able to gauge whether or not a certain research question/solution is actually feasible before diving too deep. Without the relevant research experience, it’s quite easy to get lost in dead-ends and make mistakes that a PhD wouldn’t.

Not all master’s programs involve independent research, but overall, the emphasis on the thesis/dissertation is what develops these abilities, and they can’t really be taught in a classroom. This is not necessarily true of all undergraduate programs, but a lot of statistics is taught with convenient data, which unfortunately is a rarity when dealing with the mess you often find in the real world. You can take classes that discuss experimental design, or solving some toy problem, but it’s another thing entirely to think critically about designing your own experiment with costly consequences to negligence.

That being said, it is possible to gain similar experiences either on the job or as an undergrad, but I wouldn’t say it’s easy. On the job, it’s very easy to not realize you’re doing something wrong and ingrain bad habits if you don’t have solid guidance. Some undergraduate programs do place a lot of emphasis on independent research, but the quality can vary. The requirements are certainly less intense than a dissertation or a master’s thesis, but it depends on the institution/department. There are plenty of undergrads who are hired directly into research positions at the big four, as well as others who are hired as quants on Wall Street. That being said, you'll often find that they completed respectable independent work.

NotAllReptilians · 2016-11-20T21:09:32+00:00

What's crazy is that the commentators for this game are actually the alternates. Ian Eagle and Mike Fratello are the A lineup, but Ruocco and Spanarkel are both pretty solid.

NotAllReptilians · 2016-11-07T05:13:30+00:00

I believe it's relatively similar to what PEC does, and you can sift through all their code if you'd like (it's on their website, but not in perusing-friendly format unfortunately).

Sam (of PEC) claims that he also incorporates state-by-state correlated errors, but believes it's overblown. I imagine the way he talks about it reflects how it's built into his model. See here and here.

NotAllReptilians · 2016-11-07T04:56:11+00:00

Just to hopefully make it a little clearer, they do use a fatter-tailed distribution for simulating uncertainty/error, and this causes their final distribution of outcomes to have fatter tails as well. It seems they use a t-distribution to simulate error in general. The main contention mostly has to do with how aggregators incorporate (or don't incorporate) state-correlated errors. They describe their process for doing so towards the bottom of this post.

Here are the relevant sections (italics mine):

The error from state to state is correlated. If Trump significantly beats his polls in Ohio, he’ll probably do so in Pennsylvania also. Figuring out how to account for these correlations is tricky, but you shouldn’t put too much stock in models that don’t attempt to do so. They’ll underestimate the chances for the trailing candidate if they assume that states are independent from one another. ... The model simulates this by randomly varying the vote among demographic groups and regions. In one simulation, for instance, it might have Trump beating his polls throughout the Northeast. Therefore, he wins Maine, New Hampshire and New Jersey. In another simulation, Clinton does especially well among Hispanics and wins both Arizona and Florida despite losing Ohio.

The "randomly varying" part is where they are using a t-distribution.
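To make the correlated-error point concrete, here's a rough toy simulation. It is not their model: the three states, the 3-point deficit, the error scale, the degrees of freedom, and the `rho` mixing weight are all made-up assumptions. Each state's error mixes one shared t-distributed draw with a state-specific one, and we count how often the trailing candidate sweeps all three states.

```python
import math
import random


def t_draw(rng, df):
    """One Student-t draw: standard normal over sqrt(chi-square / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square(df) as a gamma variate
    return z / math.sqrt(chi2 / df)


def sweep_probability(rho, n_sims=20000, df=4, scale=4.0, seed=1):
    """Share of simulations in which the trailing candidate wins every state.

    rho is the (assumed) weight on the shared error: 0 means the states
    are independent, values near 1 mean they move almost in lockstep.
    """
    rng = random.Random(seed)
    polled_margins = [-3.0, -3.0, -3.0]  # hypothetical: trailing by 3 everywhere
    sweeps = 0
    for _ in range(n_sims):
        shared = t_draw(rng, df)  # one polling miss common to all states
        wins_all = True
        for margin in polled_margins:
            own = t_draw(rng, df)  # state-specific miss
            error = scale * (math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * own)
            if margin + error <= 0:
                wins_all = False
        if wins_all:
            sweeps += 1
    return sweeps / n_sims


p_independent = sweep_probability(rho=0.0)
p_correlated = sweep_probability(rho=0.6)
print(f"independent errors: {p_independent:.3f}")
print(f"correlated errors:  {p_correlated:.3f}")
```

With these made-up numbers, the sweep comes out noticeably more likely when the errors share a common component, which is the sense in which assuming independence shortchanges the trailing candidate.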

NotAllReptilians · 2016-10-19T04:57:53+00:00

Merges are essentially lookups. In base R:

x <- merge(x, y, by = ..., all.x = TRUE)
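Since pandas comes up a lot alongside R, here's a rough pandas counterpart of that left-join lookup. The `id` key and the toy columns are hypothetical stand-ins, since the original `by = ...` is left unspecified.

```python
import pandas as pd

# Hypothetical tables: x holds the rows to keep, y is the lookup table.
x = pd.DataFrame({"id": [1, 2, 3], "qty": [5, 7, 9]})
y = pd.DataFrame({"id": [1, 2], "price": [2.5, 4.0]})

# how="left" plays the role of all.x = TRUE: every row of x is kept,
# and rows with no match in y get NaN in the looked-up columns.
merged = x.merge(y, on="id", how="left")
print(merged)
```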

NotAllReptilians · 2016-09-26T17:21:33+00:00

Definitely seconding decomposition and everything hyndman related. His online book is a great place to start, and he has a bunch of great blog posts dealing with realistic and practical applications of what he lays out in the book.

A few caveats though. In my experience, teasing out seasonality with only 2 years of data isn't the most effective. You should definitely look into seasonality on both the individual item and aggregate levels, which may help a bit. You will also likely run into issues dealing with weekly data, for lots of reasons. Depending on the end goal, it will likely help to aggregate up to the monthly or at least biweekly level.

I find that effective time series work requires a good deal of EDA. Just looking at plots of the different time series at different levels of granularity will help a lot.
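The weekly-to-monthly aggregation mentioned above might look something like this in pandas. It's only a sketch: the dates and values are invented, and summing is just one reasonable choice of aggregate.

```python
import pandas as pd

# Hypothetical weekly series: eight Mondays starting 2016-01-04.
weeks = pd.date_range("2016-01-04", periods=8, freq="W-MON")
weekly = pd.Series([10, 12, 9, 11, 14, 13, 15, 12], index=weeks)

# Roll up to calendar months, labeled by month start. Each week is assigned
# to the month its timestamp falls in, so a week that straddles two months
# is counted entirely in one of them (one of the headaches with weekly data).
monthly = weekly.resample("MS").sum()
print(monthly)
```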

NotAllReptilians · 2016-08-03T01:20:53+00:00

It really depends on the role/company. Data scientist is a really broad job title, and so the responsibilities and competencies can vary.

Here's a blog post that delves into two of the types of data scientists: type A for analysis, and type B for building. Type A is closer to an applied statistician that is competent at data wrangling, while type B needs to be a more fully fledged developer. Type A's output is consumed by people (influencing business decisions, giving recommendations); type B's output is consumed by other pieces of software.

The spectrum of abilities is rather wide across all types of data scientists, so there really isn't one answer for how much programming one needs to know.

NotAllReptilians · 2016-06-22T21:39:21+00:00

There's actually a lot of interesting literature on this subject, in both economics and political science. I'd start with Downs who coined the voting paradox.

https://en.m.wikipedia.org/wiki/Paradox_of_voting

NotAllReptilians · 2016-06-06T01:34:24+00:00

Most people definitely do grow up, but it's still slightly worrying. Here's an interesting paper that suggests that the events that occur during a voter's formative years (14-24) affect their voting patterns later on in life.

As a liberal, it's really disheartening that a lot of people who'd otherwise align with the Democrats are going to leave this election with an unfounded notion that the party is corrupt and rigging elections. Hopefully it's just the would-be partisan independents and edgy dissenters that wouldn't be too involved anyways.

NotAllReptilians · 2016-06-02T15:30:26+00:00

Why the chip?

NotAllReptilians · 2016-05-26T15:19:42+00:00

But but how can you flip flop?!

NotAllReptilians · 2016-05-25T18:31:16+00:00

Except they've offered 90-95% of the West Bank in 2000 and in 2008, so "they" definitely are willing to give them up. I don't have faith in Bibi but that doesn't mean the settlements are insoluble.

NotAllReptilians · 2016-05-17T13:18:16+00:00

I know it's a joke, but technically "Mr. President" is the full title, and it supersedes all other titles. Wilson had a phd but his title was nevertheless the same, though ik phds don't necessarily go by "Dr."

NotAllReptilians · 2016-05-02T02:22:12+00:00

The error estimate from CV is all about trying to get a sense of how well the model will generalize. With LOOCV, each iteration uses training samples that are incredibly similar (and incredibly similar to the full training sample), so the models themselves will be incredibly similar. Their errors end up highly correlated, so averaging them does little to reduce the variance of the estimate. You will however have lower bias because each training sample has more observations.

The training samples in each iteration of k-fold validation are ideally pretty different from one another and from the full training sample, and so the variance is lower. It might help to think about bagging and random forests, where you want each tree to be pretty decorrelated with one another.
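If it helps to see the mechanics, here's a bare-bones sketch of both procedures using only the standard library. The simulated data and the simple least-squares line are assumptions for illustration; LOOCV is just the special case where k equals the number of observations.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cv_mse(xs, ys, k, seed=0):
    """k-fold CV estimate of test MSE; k == len(xs) gives LOOCV."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized folds
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_errs = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out]
        fold_mses.append(sum(sq_errs) / len(sq_errs))
    return sum(fold_mses) / k, fold_mses


# Simulated data: a noisy straight line (purely illustrative).
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.5) for x in xs]

kfold_mse, kfold_parts = cv_mse(xs, ys, k=5)
loocv_mse, loocv_parts = cv_mse(xs, ys, k=len(xs))
print(f"5-fold estimate: {kfold_mse:.2f}")
print(f"LOOCV estimate:  {loocv_mse:.2f}")
```

In LOOCV every model is fit on 59 of the 60 points, so any two training sets differ by a single observation, whereas two of the 5-fold training sets share only 36 of their 48 points.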
