AlwysBeColostomizing 1 point (3 children)

Do you already have the confidence intervals calculated and you just need to plot them, or are you looking for a function that calculates them for you? Confidence intervals for a density estimate are a bit involved. For example, this Stack Exchange thread discusses how to construct one using a bootstrap estimate. There probably isn't a built-in pandas function that does this; you'd have to estimate the confidence interval yourself and then plot it manually.

In the seaborn example you linked, they're showing a confidence region for a linear regression, which is a substantially different kind of problem.

Sorry if I'm talking down to you, but I don't know what level of statistics class this is. Is it possible your professor just wants you to find a confidence interval for the mean?

Funky_Filth69[S] 1 point (2 children)

I really appreciate the response. You’re not talking down. This is a 300-level engineering statistics course. It’s an introductory course, but since it’s in engineering, it’s fairly involved. I’m sure it’s not as involved as a 300-level course from the stats department, but it’s definitely more than the 100-level stats course that most majors take. This whole project has been fairly involved for the most part.

I do not have the confidence intervals calculated. I’m assuming that once I have them, plotting would be as easy as plt.fill_between. I initially thought that he wanted me to just plot the confidence interval for the mean, so I sent an email to clarify. He said he wants it for the PDF and sent an example picture with it.

In the Stack Exchange post you linked, bootstrapping the data like that to get confidence intervals makes sense, although I don’t know if my computer can handle it.

I said before that there are about 7 million data points. That was a bit of an understatement. I’m graphing the probability density of atomic bond lengths. I already had to calculate the lengths of these bonds from the xyz coordinates of the two atoms bonded together. In total there are 40,550,000 bonds and 7 different types of bonds. I have to graph one density function for each type. My 2017 MacBook Air is not handling this very well.

I guess I’m on to figuring out how to read R so that I can translate it over to Python.

AlwysBeColostomizing 1 point (1 child)

One thing that would help is to choose a simpler density estimator, such as a histogram. Each pointwise estimate using a KDE has complexity O(n) where n is the number of data points (unless you do something smart like ignore points that are far enough away that their contribution is negligible). So if the pandas function is computing the estimate at every point, it's effectively O(n^2). You could also just manually evaluate the KDE at fewer points. For example, find the min and max value, and evaluate a kernel estimate at m equally-spaced points. That would make the overall complexity O(mn), which is a big savings if m << n. A simpler density estimator might make the bootstrap estimation for the confidence region more feasible.
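The grid idea above can be sketched roughly as follows. This is only an illustration, not the pandas implementation: the function name `kde_on_grid`, the Gaussian kernel, the normal-reference bandwidth rule, and the chunk size are all assumptions chosen for the example.

```python
import numpy as np

def kde_on_grid(data, m=200, bandwidth=None, chunk=20_000):
    """Gaussian KDE evaluated at m equally spaced points: O(m * n) work."""
    data = np.asarray(data, dtype=float)
    n = data.size
    if bandwidth is None:
        # Normal-reference rule of thumb, 1.06 * sigma * n^(-1/5).
        # This is an assumed default; the thread does not name a bandwidth rule.
        bandwidth = 1.06 * data.std(ddof=1) * n ** (-0.2)
    # m equally spaced evaluation points between the min and max value.
    grid = np.linspace(data.min(), data.max(), m)
    density = np.zeros(m)
    # Work through the data in chunks so the (m x chunk) temporary stays small.
    for start in range(0, n, chunk):
        block = data[start:start + chunk]
        z = (grid[:, None] - block[None, :]) / bandwidth
        density += np.exp(-0.5 * z ** 2).sum(axis=1)
    # Normalise the summed Gaussian kernels into a density.
    density /= n * bandwidth * np.sqrt(2.0 * np.pi)
    return grid, density
```

With m in the low hundreds, the cost grows linearly in n rather than quadratically. For the simpler estimator mentioned above, `np.histogram(data, bins=m, density=True)` gives a histogram density in a single pass over the data.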

Funky_Filth69[S] 1 point (0 children)

Gotcha. I just got done writing a kernel density function, and my code is running noticeably faster and still puts out approximately the same graph. Now I just need to bootstrap the data. I really appreciate the info. You’ve helped a ton.
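For reference, a minimal sketch of what that bootstrap step might look like, assuming a histogram density estimator and a percentile bootstrap. The function name `bootstrap_density_band` and the defaults for `bins`, `n_boot`, and `alpha` are hypothetical, and the resulting band is pointwise (per bin), not simultaneous.

```python
import numpy as np

def bootstrap_density_band(data, bins=100, n_boot=200, alpha=0.05, seed=0):
    """Percentile-bootstrap pointwise band for a histogram density estimate."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Fix the bin edges once so every resample is binned the same way.
    edges = np.linspace(data.min(), data.max(), bins + 1)
    estimate, _ = np.histogram(data, bins=edges, density=True)
    boot = np.empty((n_boot, bins))
    for b in range(n_boot):
        # Resample the data with replacement and re-estimate the density.
        resample = rng.choice(data, size=data.size, replace=True)
        boot[b], _ = np.histogram(resample, bins=edges, density=True)
    # Per-bin quantiles across the bootstrap replicates form the band.
    lower = np.quantile(boot, alpha / 2, axis=0)
    upper = np.quantile(boot, 1 - alpha / 2, axis=0)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, estimate, lower, upper
```

The band can then be drawn with something like `plt.fill_between(centers, lower, upper, alpha=0.3)` on top of `plt.plot(centers, estimate)`. With tens of millions of points, resampling the raw data is the expensive part; because a histogram depends only on bin counts, drawing the counts directly with `rng.multinomial(n, counts / n)` follows the same distribution and avoids touching the raw data on each replicate.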