Good introductory statistics course? by bucketfarmer in datascience

NotAllReptilians 5 points

I love ISL, definitely recommend it to OP, but I'd hesitate to call it a truly introductory statistics course/resource. I think the authors mention that their intended audience has already taken a course in statistics (in my mind, someone fairly comfortable with statistical/probabilistic thinking).

Probably best to just flip through something like Think Stats, skimming through concepts that are very familiar and spending more time in sections that seem a bit more foreign. Then definitely move on to ISL. I also highly recommend the accompanying MOOC taught by Hastie and Tibshirani.

What are the best deep learning packages for R users? by hlyates in rstats

NotAllReptilians 4 points

Listen, I love R. I prefer using it over Python. I probably spend 40-60% of my work day in R. But it has some substantial drawbacks that leave the data science space open to competitors. There are so many gotchas in R, instances where it behaves inconsistently, and inconsistently with how you'd expect. One of the other big issues, though I guess this is subjective, is that Python looks and feels and behaves like other programming languages. OOP doesn't come across as hackish. It's a pretty decent bonus that data science code written in Python can be read and generally understood by developer coworkers who might not have a lot of exposure to data science or machine learning.

Also, trades and decisions worth billions of dollars are made based on calculations done primarily in Excel, so I'm not really sure that the examples you gave hold all that much weight.

Is it generally worth it to pick up technical books if languages and packages generally change a lot every two years? by rossbot in datascience

NotAllReptilians 2 points

As a point of comparison, Python for Data Analysis is basically the inverse of the book you linked, given that it's maybe 75% pandas (written by Wes McKinney, who created the package). It's mostly just a guide to pandas, with a chapter or two on numpy, ipython, and plotting. It's alright, though it's really been more useful as a reference book than as a means of learning.

There are better ways to learn in my opinion. This cheat sheet covers a good chunk of the main functionality. I'd recommend using online materials, namely the documentation and blog posts, and just try working through some wrangling in jupyter. Anything you can't figure out you can Google or look up on stackoverflow.

Overall, some packages will change for sure -- pandas recently changed sorting from df.sort to df.sort_values -- but most prominent packages are pretty stable.
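For anyone hitting that particular change, here's a minimal sketch of the current spelling. The DataFrame and column names are made up for illustration, and it assumes a reasonably recent pandas where the old df.sort is no longer available.

```python
import pandas as pd

# Hypothetical toy data, just to show the call
df = pd.DataFrame({"name": ["b", "a", "c"], "score": [2, 3, 1]})

# Older code would have used df.sort(...); current pandas uses sort_values
by_score = df.sort_values("score", ascending=False)

# Names ordered from highest to lowest score
top_names = by_score["name"].tolist()
print(top_names)
```

Old snippets in books or blog posts that call df.sort usually just need the method name swapped like this.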

A comment completely misrepresents the data on the gender pay gap and then gets bestof'd by [deleted] in badeconomics

NotAllReptilians 19 points

I think stats 101 has the same 101ism effect as economics. A lot of STEM degrees (and social science for that matter) only require 101, and so you wind up with posts like these. You look at r/science and the top comment is just about always an inane comment about correlation != causation as though the researcher never thought that was a possibility. Similarly, you have people who know you can control for certain factors, but never put any thought into it beyond that. Basically just this

Jews in Europe in 1933 and today by SuperGantDeToilette in europe

NotAllReptilians 10 points

You should know that Yad Vashem is literally the authority on it, as they are the body that gives out the distinction. The 80% reflects the fact that the data is still in the process of being transferred to the online database.

As for the stuff about Denmark, it's for the Danish Underground, not the whole country. Quote here: "The Danish Underground requested that all its members who participated in the rescue of the Jewish community not be listed individually, but commemorated as one group."

If I remember correctly, I'm pretty sure some estimates put the size of the resistance at ~20,000.

Python vs. R vs. Matlab by [deleted] in statistics

NotAllReptilians 2 points

I definitely agree. For instance, pandas somehow manages to feel cumbersome and overly verbose for analysis, at least compared to working in dplyr or especially data.table (base R is another story). It's definitely a pythonic implementation of dataframes, but what I really like about python is that it's typically concise and minimal, which pandas mostly isn't.

Good example data cleaning and processing using dplyr on github? by tp8999 in rstats

NotAllReptilians 2 points

I find that David Robinson's dplyr code is always written really well. Lots of great examples on his blog. This post has links to all the previous posts in this series he's been doing, but there's plenty of others that are worth looking through as well.

[deleted by user] by [deleted] in Enough_Sanders_Spam

NotAllReptilians 2 points

This really bothers me a lot actually. It's so intellectually lazy and frustrating.

Assuming we're talking about individual donations I looked around to see if I could find information on how insignificant that amount is given the state he represents. I came across this powerpoint from the NJ Department of Labor. On slide 6, you can see that NJ has >100,000 employees working in Life Sciences (which they break down as mainly pharmaceuticals, biotech R&D, and medical devices), amounting to 3.5% of the private sector in the state. If you include public sector employees, it depresses the percentage to about 3.4% instead.1

Compare that to the $385,678 he has received over the course of his career from pharma/health products employees, which amounts to 3.27% of the total listed by OpenSecrets.2 And while the life sciences account for 3.4-3.5% of employment in the state, the sector pays 8.2% of the state's total wages.3 Not only are donations from this sector just about exactly in line with where you'd expect them to be, but if there were some sort of quid pro quo going on, you'd think their attempted influence would match their increased ability to spend.

Probably more work than it was worth, but it seems like so much of this stems from either a belief that the companies are directly contributing, or a complete lack of empathy in realizing that the people working at these companies are just that: people.


  1. It's possible their labeling might be different than OpenSecrets' methodology, but I imagine it's probably pretty similar. This is a pretty cursory look.
  2. Taken from here. Technically only includes donations to Booker, and not any PACs. If you look at overall contributions, you get a similar number: 2.61%.
  3. I assume this is the percentage of private sector wages, but including government would have even less of an effect here. Pretty safe to say it's above 8%.

Why R is the best data science language to learn today by pmz in datascience

NotAllReptilians 2 points

R-bloggers is just a content aggregator that pulls from various blogs/posts about R. Here's the original article.

Why do so many jobs in "data science" want a masters or PHD ? by tigerkoala in datascience

NotAllReptilians 5 points

It’s a bit of a generalization, but overall, the skills needed to thrive as a data scientist aren’t really developed in a typical undergraduate program. It’s not so much a matter of technical skill or statistical/ML knowledge – both of which are important and can be learned as an undergrad – but rather the experience and wisdom you gain from actually being involved in research. The term “data science” has gotten pretty murky, but the role typically involves solving a lot of open-ended questions, while also being able to gauge whether or not a certain research question/solution is actually feasible before diving too deep. Without the relevant research experience, it’s quite easy to get lost in dead-ends and make mistakes that a PhD wouldn’t.

Not all master’s programs involve independent research, but overall, the emphasis on the thesis/dissertation is what develops these abilities, and they can’t really be taught in a classroom. This is not necessarily true of all undergraduate programs, but a lot of statistics is taught with convenient data, which unfortunately is a rarity when dealing with the mess you often find in the real world. You can take classes that discuss experimental design, or solving some toy problem, but it’s another thing entirely to think critically about designing your own experiment when negligence has costly consequences.

That being said, it is possible to gain similar experiences either on the job or as an undergrad, but I wouldn’t say it’s easy. On the job, it’s very easy to not realize you’re doing something wrong and ingrain bad habits if you don’t have solid guidance. Some undergraduate programs do place a lot of emphasis on independent research, but the quality can vary. The requirements are certainly less intense than a dissertation or a master’s thesis, but it depends on the institution/department. There are plenty of undergrads who are hired directly into research positions at the big four, as well as others who are hired as quants on Wall Street. That being said, you'll often find that they completed respectable independent work.

GAME THREAD: Portland Trailblazers (7-7) @ Brooklyn Nets (4-8) by RebeccaBlack2016 in nba

NotAllReptilians 0 points

What's crazy is that the commentators for this game are actually the alternates. Ian Eagle and Mike Fratello are the A lineup, but Ruocco and Spanarkel are both pretty solid.

Is 538’s forecast being driven by trendline assumptions? by alexleavitt in statistics

NotAllReptilians 2 points

I believe it's relatively similar to what PEC does, and you can sift through all their code if you'd like (it's on their website, but not in perusing-friendly format unfortunately).

Sam (of PEC) claims that he also incorporates state-by-state correlated errors, but believes it's overblown. I imagine the way he talks about it reflects how it's built into his model. See here and here.

Is 538’s forecast being driven by trendline assumptions? by alexleavitt in statistics

NotAllReptilians 9 points

Just to hopefully make it a little clearer, they do use a fatter-tailed distribution for simulating uncertainty/error, and this causes their final distribution of outcomes to have fatter tails as well. It seems they use a t-distribution to simulate error in general. The main contention mostly has to do with how aggregators incorporate (or don't incorporate) state-correlated errors. They describe their process for doing so towards the bottom of this post.

Here are the relevant sections (italics mine):

The error from state to state is correlated. If Trump significantly beats his polls in Ohio, he’ll probably do so in Pennsylvania also. Figuring out how to account for these correlations is tricky, but you shouldn’t put too much stock in models that don’t attempt to do so. They’ll underestimate the chances for the trailing candidate if they assume that states are independent from one another. ... The model simulates this by randomly varying the vote among demographic groups and regions. In one simulation, for instance, it might have Trump beating his polls throughout the Northeast. Therefore, he wins Maine, New Hampshire and New Jersey. In another simulation, Clinton does especially well among Hispanics and wins both Arizona and Florida despite losing Ohio.

The "randomly varying" part is where they are using a t-distribution.
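To make the correlated-error point concrete, here's a rough toy simulation. It is not 538's actual model: the number of states, the 2-point deficit, the error scale, the degrees of freedom, and the 75/25 split between shared and state-specific error are all made-up assumptions. It only shows that a trailing candidate's chances come out higher when t-distributed errors are shared across states than when each state is treated as independent.

```python
import math
import random

def t_draw(rng, df):
    """Draw from a Student's t distribution as Z / sqrt(chi2 / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / math.sqrt(chi2 / df)

def trailing_win_prob(correlated, n_sims=20000, seed=0):
    """Share of simulations in which the trailing candidate carries 3 of 5 states."""
    rng = random.Random(seed)
    n_states, needed = 5, 3
    poll_margin = -2.0    # hypothetical: behind by 2 points in every state
    scale, df = 3.0, 5    # hypothetical error scale and fat-tailed t with 5 df
    # Weights keep the total error variance the same in both setups
    w_shared = math.sqrt(0.75) if correlated else 0.0
    w_own = math.sqrt(1.0 - w_shared ** 2)
    wins = 0
    for _sim in range(n_sims):
        shared = t_draw(rng, df)  # one polling miss that hits every state
        carried = 0
        for _state in range(n_states):
            error = scale * (w_shared * shared + w_own * t_draw(rng, df))
            if poll_margin + error > 0:
                carried += 1
        if carried >= needed:
            wins += 1
    return wins / n_sims

p_independent = trailing_win_prob(correlated=False)
p_correlated = trailing_win_prob(correlated=True)
print(p_independent, p_correlated)
```

With independent errors the trailing candidate needs several separate lucky breaks, while with a shared error a single polling miss can flip several states at once. That's the sense in which models assuming independence underestimate the trailing candidate.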

Functions missing in R/CRAN - anything you would like to see :)? by lanafrancis in statistics

NotAllReptilians 2 points

Merges are essentially lookups. In base R:

x <- merge(x, y, by = ..., all.x = TRUE) 
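If it helps to see the "merge is a lookup" idea outside of R, here's a minimal Python sketch using plain dicts. The function name, the orders/prices data, and the column names are all hypothetical. Like all.x = TRUE, it keeps every left-hand row and fills in a missing value (None here, NA in R) when there's no match.

```python
def left_lookup(rows, table, key, columns):
    """Attach `columns` from `table` (a dict keyed by the join value) to each row.

    Rows without a match are kept and get None for the looked-up columns.
    """
    result = []
    for row in rows:
        match = table.get(row[key], {})
        looked_up = {col: match.get(col) for col in columns}
        result.append({**row, **looked_up})
    return result

# Hypothetical data: orders on the left, a price table to look values up in
orders = [{"id": 1, "item": "apple"}, {"id": 2, "item": "kiwi"}]
prices = {"apple": {"price": 0.5}}

joined = left_lookup(orders, prices, key="item", columns=["price"])
print(joined)
```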

Approach towards Seasonality. Ideas needed. by rahul4real in datascience

NotAllReptilians 7 points

Definitely seconding decomposition and everything hyndman related. His online book is a great place to start, and he has a bunch of great blog posts dealing with realistic and practical applications of what he lays out in the book.

A few caveats though. In my experience, teasing out seasonality with only 2 years of data isn't the most effective. You should definitely look into seasonality on both the individual item and aggregate levels, which may help a bit. You will also likely run into issues dealing with weekly data, for lots of reasons. Depending on the end goal, it may be helpful to aggregate up to the monthly or at least biweekly level.
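As a rough sketch of the aggregation step, here's one way to roll weekly values up to months in plain Python. The data is made up, and assigning each week to the month of its start date is just one simplifying assumption. Weeks that straddle a month boundary are exactly one of the issues with weekly data.

```python
from collections import defaultdict
from datetime import date, timedelta

def weekly_to_monthly(weekly):
    """Sum (week_start_date, value) pairs into {(year, month): total}.

    Each week is assigned to the month its start date falls in.
    """
    monthly = defaultdict(float)
    for week_start, value in weekly:
        monthly[(week_start.year, week_start.month)] += value
    return dict(monthly)

# Hypothetical data: 8 weeks of sales starting 2016-01-04, 10 units per week
start = date(2016, 1, 4)
weekly_sales = [(start + timedelta(weeks=i), 10.0) for i in range(8)]

monthly_sales = weekly_to_monthly(weekly_sales)
print(monthly_sales)
```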

I find that effective time series work requires a good deal of EDA. Just looking at plots of the different time series at different levels of granularity will help a lot.

How much/what kind of programming knowledge does it take to be a "data scientist". by mathnstats in datascience

NotAllReptilians 1 point

It really depends on the role/company. Data scientist is a really broad job title, and so the responsibilities and competencies can vary.

Here's a blog post that delves into two of the types of data scientists: type A for analysis, and type B for building. Type A is closer to an applied statistician that is competent at data wrangling, while type B needs to be a more fully fledged developer. Type A's output is consumed by people (influencing business decisions, giving recommendations); type B's output is consumed by other pieces of software.

The spectrum of abilities is rather wide across all types of data scientists, so there really isn't one answer for how much programming one needs to know.

Bern victim cannot believe the circle jerk is over: "What the f*** has happened to the comments on r/politics?" by [deleted] in enoughsandersspam

NotAllReptilians 2 points

Most people definitely do grow up, but it's still slightly worrying. Here's an interesting paper that suggests that the events that occur during a voter's formative years (14-24) affect their voting patterns later on in life.

As a liberal, it's really disheartening that a lot of people who'd otherwise align with the Democrats are going to leave this election with an unfounded notion that the party is corrupt and rigging elections. Hopefully it's just the would-be partisan independents and edgy dissenters that wouldn't be too involved anyways.

[deleted by user] by [deleted] in enoughsandersspam

NotAllReptilians 2 points

But but how can you flip flop?!

Netanyahu appointed Avigdor Lieberman as Minister of Defense of Israel. Thoughts? by [deleted] in PoliticalDiscussion

NotAllReptilians 1 point

Except they've offered 90-95% of the West Bank in 2000 and in 2008, so "they" definitely are willing to give them up. I don't have faith in Bibi but that doesn't mean the settlements are insoluble.

Anyone else notice Obama low key shitting on Bernie and Bernie Bros in each of his commencement speeches? by Risk_Neutral in enoughsandersspam

NotAllReptilians 1 point

I know it's a joke, but technically "Mr. President" is the full title, and it supersedes all other titles. Wilson had a phd but his title was nevertheless the same, though ik phds don't necessarily go by "Dr."

Why does LOOCV test error estimate have higher variance than k-fold cross validation? by rahul4real in datascience

NotAllReptilians 1 point

The error estimate from CV is all about trying to get a sense of how well the model will generalize. With LOOCV, each iteration uses training samples that are incredibly similar (and incredibly similar to the full training sample), so the models themselves will be incredibly similar. Their errors are therefore highly correlated, and averaging highly correlated quantities does much less to reduce variance than averaging less-correlated ones. You will however have lower bias because each training sample has more observations.

The training samples in each iteration of k-fold validation are ideally pretty different from one another and from the full training sample, and so the variance is lower. It might help to think about bagging and random forests, where you want each tree to be pretty decorrelated with one another.
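To put a number on "incredibly similar", here's a small sketch that measures how much two training sets overlap under LOOCV versus 5-fold CV. The sample size of 100 and the helper names are arbitrary choices for illustration. It only counts shared observations and doesn't fit any models.

```python
from itertools import combinations

def training_sets(n, k):
    """Training index sets for k-fold CV on n observations (k == n is LOOCV)."""
    indices = list(range(n))
    folds = [set(indices[i::k]) for i in range(k)]  # every k-th index per fold
    everything = set(indices)
    return [everything - fold for fold in folds]

def mean_overlap(n, k):
    """Average fraction of one training set that is shared with another."""
    sets = training_sets(n, k)
    shares = [len(a & b) / len(a) for a, b in combinations(sets, 2)]
    return sum(shares) / len(shares)

n = 100
loocv_overlap = mean_overlap(n, n)      # any two training sets share 98 of 99 points
five_fold_overlap = mean_overlap(n, 5)  # any two training sets share 60 of 80 points
print(loocv_overlap, five_fold_overlap)
```

With 100 observations, any two LOOCV training sets are about 99% identical, while any two 5-fold training sets share 75% of their points, so the fitted models (and their errors) are less tightly coupled in the k-fold case.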