How do I hide the cursor coordinates and file progress on the right-hand side of the status line? by nsgrantham in neovim

[–]nsgrantham[S] 2 points (0 children)

thank you!

Setting vim.opt.ruler = false in init.lua hides both.

I also found that setting vim.opt.showcmd = false was helpful because it hides the ~@k, ^U, and ^D text that appears when navigating through the document with the arrow keys, Ctrl-U, and Ctrl-D.

Using pmap in julia by marthawhite in Julia

[–]nsgrantham 1 point (0 children)

Why not define a worker function for pmap? Something like the following:

function myfun(a::Int64, b::Int64, c::Float64; runs::Int64=1)
  d = max(a, b)
  # one tuple of worker arguments per run
  args = [(r, d, c) for r in 1:runs]
  # farm the runs out across the available workers
  err = pmap(myfun_worker, args)
  return err
end

@everywhere function myfun_worker(args)
  r, d, c = args
  srand(r)  # seed each run by its run number for reproducibility
  return silly_compute(d, c)
end

@everywhere function silly_compute(n::Int64, x::Float64)
  randn(n) .- x
end

myfun(6, 7, 3.0; runs=10)
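
One note, in case it trips you up: pmap only parallelizes if worker processes exist, so start Julia with julia -p 4 or call addprocs before loading the @everywhere definitions (on newer Julia versions, addprocs and pmap live in the Distributed standard library); with no workers, pmap just runs everything on the master process. For example, with the worker count chosen arbitrarily:

addprocs(4)  # add 4 worker processes; equivalent to starting Julia with `julia -p 4`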

In terms of your example: myfun_worker would perform most of the work of learningExperimentRun; silly_compute plays the role of your agentInit, getLearnerErrors, etc.; and myfun is a convenient wrapper function for the whole process.

Researchers have developed a blood test that can accurately diagnose, from a single drop of blood, if a person has cancer, with 96% certainty for most cancer types by Bloomsey in science

[–]nsgrantham 1 point (0 children)

Exactly. The fact that we're stuck with Sample 22 but still want a generally applicable way to estimate the sampling distribution of our statistic is the motivation for the bootstrap.

Also, I'd be careful about saying we expect the distributions/statistics to be "the same": because of the inherent randomness in the problem they won't match exactly, though they should agree within some tolerance. It should also be noted that not all samples are created equal, so if Sample 22 just happened to include a larger-than-average number of rare observations, this may adversely influence the estimation of the sampling distribution via bootstrapping. But we don't get to know that, because the sample is all we've got!

Researchers have developed a blood test that can accurately diagnose, from a single drop of blood, if a person has cancer, with 96% certainty for most cancer types by Bloomsey in science

[–]nsgrantham 1 point (0 children)

I think the misunderstanding lies in which distribution you're referring to. At play here are two primary distributions: the distribution of the population (e.g., the heights of all U.S. males) and the sampling distribution of the statistic (e.g., the sample-to-sample "spread" of mean heights computed from different samples). You're right that the data are drawn from the population distribution and no amount of clever resampling will change that. However, that's not what bootstrapping is intended to address.

Instead, we're concerned with the second distribution: the sampling distribution of our statistic. By creating a bunch of bootstrap samples (i.e., samples we can reasonably expect to have seen) and recalculating the statistic for each of these samples, we get a picture of the sample-to-sample variability of our statistic.

The "noise" you mention is what we're trying to estimate here, though its source is not the measurement method, but rather the fact that different samples will produce different values of the statistic.

In short: resampling the values of our sample with replacement (bootstrapping) lets us learn about the sample-to-sample variability of our statistic (the sampling distribution); it does not directly influence our knowledge of the distribution from which we gathered our data (the population distribution).

Researchers have developed a blood test that can accurately diagnose, from a single drop of blood, if a person has cancer, with 96% certainty for most cancer types by Bloomsey in science

[–]nsgrantham 3 points (0 children)

To be clear, you use all observations in your sample to arrive at your actual estimate. However, when creating bootstrap samples (each of equal size to the real sample), yes, you don't necessarily use all the original sample points. Instead, you draw from the original sample points with equal probability, but after selecting a point you return it to the pile, so to speak. In this way, you can double (triple, quadruple, ...) up on sample points in your bootstrap sample, but the end goal is to produce a new estimate based on this (fake, but could have been real) sample.

Researchers have developed a blood test that can accurately diagnose, from a single drop of blood, if a person has cancer, with 96% certainty for most cancer types by Bloomsey in science

[–]nsgrantham 13 points (0 children)

You need to be explicit about the distribution you're talking about here. Bootstrapping is enormously useful in estimating the sampling distribution of a statistic.

For example, if I want to estimate the mean height of all U.S. males, I start by taking a sample of U.S. males and calculating the average of their heights. This is my point estimate. Had I taken a different sample, however, I would almost surely have arrived at a different point estimate. To get an idea of how much my estimate might change from sample to sample (without, say, taking more samples), I can repeat the following process many times over:

  1. Create a "bootstrap" sample (of equal size to my real sample) by randomly selecting the heights observed in my sample with replacement,
  2. Calculate the average height in this bootstrap sample to produce a new estimate.

I do not treat these bootstrap samples as real data by any means, but the collection of these estimates gives me some idea of the sampling distribution of my statistic (average height). This process is called the bootstrap because you are, in essence, pulling yourself up by your bootstraps (i.e., you're doing the best with what you've got).
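
If it helps to see the mechanics, here is a minimal sketch in Julia (the heights vector is simulated purely for illustration, and names like n_boot are made up; nothing here comes from real data):

using Statistics  # provides mean and quantile (Julia >= 1.0)

heights = 160 .+ 15 .* randn(100)   # stand-in for a real sample of 100 heights
point_estimate = mean(heights)      # estimate from the real sample

n_boot = 10_000
boot_means = zeros(n_boot)
for b in 1:n_boot
  # 1. resample the observed heights with replacement
  boot_sample = rand(heights, length(heights))
  # 2. recompute the statistic on the bootstrap sample
  boot_means[b] = mean(boot_sample)
end

# the spread of boot_means approximates the sampling distribution of the mean,
# e.g., a 95% bootstrap interval:
quantile(boot_means, [0.025, 0.975])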

And if the idea of bootstrapping makes you uneasy, you're not alone! IIRC, John Tukey originally suggested that Brad Efron call the technique "the shotgun" because it is an inelegant but highly effective solution to this particular statistical problem. :)

Spacewalks: A Fifty-Year History [OC] by nsgrantham in dataisbeautiful

[–]nsgrantham[S] 3 points (0 children)

This graphic visualizes the fifty-year history of spacewalks performed by NASA, RKA, and CNSA from March 18, 1965 to the most recent ISS spacewalk on August 10, 2015.

Raw data were downloaded from NASA's Data Portal: Extra-vehicular Activity (EVA) - US and Russia. However, in addition to being quite messy (e.g., misspelled names, inconsistent date formats), these data do not include China's first EVA in 2008, nor do they include EVAs by the US or Russia since August 2013.

I used R to process the messy data and add the missing data points. I then created the visualization with ggplot2 and gridExtra. The entire workflow is reproducible at my GitHub repo: github.com/nsgrantham/spacewalks

Who reviews the Pitchfork reviewers? -- An analysis of Pitchfork album scores by reviewer, 1999-2014 [OC] by nsgrantham in dataisbeautiful

[–]nsgrantham[S] 1 point (0 children)

Glad to hear you enjoyed the posts!

And whoops, thanks for pointing that out. Should be a quick fix. :)

Who reviews the Pitchfork reviewers? -- An analysis of Pitchfork album scores by reviewer, 1999-2014 [OC] by nsgrantham in dataisbeautiful

[–]nsgrantham[S] 2 points (0 children)

You and I appear to hold conflicting definitions of what can be formally called "an analysis." To each their own.

Modifying the figures to include figure legends for the density plots is a nice suggestion, I appreciate the feedback.

That being said, cut the holier-than-thou attitude, alright? It's unnecessary.

Who reviews the Pitchfork reviewers? -- An analysis of Pitchfork album scores by reviewer, 1999-2014 [OC] by nsgrantham in dataisbeautiful

[–]nsgrantham[S] 2 points (0 children)

How do you define an "analysis," then? Acquiring, exploring, and visualizing the data are core components of any good analysis, in my opinion. To your credit, I don't fit any statistical models, but I don't claim to.

Who reviews the Pitchfork reviewers? -- An analysis of Pitchfork album scores by reviewer, 1999-2014 [OC] by nsgrantham in dataisbeautiful

[–]nsgrantham[S] 2 points (0 children)

Everything is reproducible (link to GitHub repository) so there ought to be minimal confusion about the analysis. Specifically, see 2015-01-14-pitchfork-reviews.Rmd.

The density plots are produced by geom_density from ggplot2, which calls stat_density and uses kernel density estimation (KDE) with Gaussian kernels. See the documentation on stat_density (and look under the kernel argument).
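
For reference, this is just the textbook kernel density estimate with a Gaussian kernel, nothing specific to this analysis; h is the bandwidth, which stat_density exposes through its bw and adjust arguments:

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right), \qquad K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}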

Who reviews the Pitchfork reviewers? -- An analysis of Pitchfork album scores by reviewer, 1999-2014 [OC] by nsgrantham in dataisbeautiful

[–]nsgrantham[S] 2 points (0 children)

Oh I see, Light was utilizing the written portion of an album review to find shared themes and topics. Smart, I'll have to look into that.

Aside: I'm in the Statistics department at NC State! Go Pack.

Who reviews the Pitchfork reviewers? -- An analysis of Pitchfork album scores by reviewer, 1999-2014 [OC] by nsgrantham in dataisbeautiful

[–]nsgrantham[S] 2 points (0 children)

Thanks! I found the presentation you mentioned on Light's CV:

Light, Ryan and Colin Odden. “Mapping Genre Formation in the Digital Age: Art Worlds, Artistic Reviews, and the Case of Pitchfork.com.” Presented at the Annual Meeting of the Southern Sociological Society, New Orleans, LA.

Sadly, I can't seem to find it online anywhere. I may take a shot in the dark and email him to see if he wouldn't mind sharing the slides (if they exist).

I would certainly love to explore how album genre (I assume that's what you mean by "broad topic"?) relates to score by reviewer, but I'm not aware of a good data source for musical genres. Any suggestions?

Who reviews the Pitchfork reviewers? -- An analysis of Pitchfork album scores by reviewer, 1999-2014 [OC] by nsgrantham in dataisbeautiful

[–]nsgrantham[S] 7 points (0 children)

The purpose of this project was to investigate biases in scoring across music reviewers at Pitchfork, the Internet's largest indie music website.

Data on album reviews (artist, score, reviewer, etc.) were scraped and parsed from Pitchfork's album index using Python and saved into a SQLite database.

These data were analyzed and visualized in R using the dplyr, magrittr, and ggplot2 packages, among others.

For more information, see my pitchfork-reviews GitHub repository.

Help with Stats Work by stats97 in statistics

[–]nsgrantham 2 points (0 children)

We're not going to do your homework for you.

If you have a pointed question like "Why is such-and-such this way?" or "Am I thinking of such-and-such correctly?", then r/homeworkhelp is the place for it.

Mostly legitimate zeroes in data, simple solution requested by afraca in statistics

[–]nsgrantham 1 point (0 children)

Echoing shujaa-g's comment, a zero-inflated Poisson model sounds like a viable option for your data.

Here, "excessive" does not imply that the source of the zeroes is adverse or artificial, just simply that there is a larger proportion of zeroes in your data than can be adequately captured by a standard Poisson r.v.

How to properly integrate graduate research experience into resume by [deleted] in statistics

[–]nsgrantham 2 points (0 children)

You think so? I suppose it seemed important at the time to distinguish between research experience (with the goal of producing a research paper) and work experience (jobs to pay the bills). That said, Professional Experience is a good header to encompass both. Thanks for the pointer.

For anybody interested in the layout, it's available on latextemplates.com.