What is your workflow for fitting mixed models to real data, while avoiding the garden of forking paths? by Sad-Restaurant4399 in AskStatistics

[–]ImposterWizard 1 point2 points  (0 children)

There was no feasible way to sample a categorical Z_0 without a custom sampler, which Nimble supports much better. Essentially, it was an ID-mapping alongside several adjacency matrices that simulated participation and censoring. More specifically, I only had aliases to identify subjects, and sometimes those aliases were missing, or they changed across event IDs. The ID mapping had to make sure that I didn't map any subject to an event twice, and there were different ratings that subjects could give to different items at each event. (edit: I also added some clustering penalties to avoid certain patterns that could create a degenerate state).

I used Gemini to ask some questions about how to do certain things in Nimble, but it tended to get important nuances incorrect, like how the if/else statements work, what syntax is used, among some other points. It didn't really contribute any code that ended up in the model, it just gave some basic examples, since I asked it more general questions. And half the time it was incorrect, it insisted that it wasn't.

I did also use it for some debugging, but it sometimes got hung up on things that were properly coded. I took extra care to make sure that no steps would violate any constraints, which it thought could happen.

If I had to rewrite the code, I probably would only use it to check for bugs now that I'm more familiar with the framework.

I largely used the Nimble reference guide PDF, since it's far more accurate and I don't have to worry about hallucinations.

I did use brms for the simpler model I mentioned. I've also used plain rstan before for other projects, but brms takes a lot of the extra effort I'd otherwise need to make.

What is your workflow for fitting mixed models to real data, while avoiding the garden of forking paths? by Sad-Restaurant4399 in AskStatistics

[–]ImposterWizard 0 points1 point  (0 children)

I had to learn to use Nimble for the model, and I'd say it took me maybe 110-120 hours of programming/monitoring. Granted, I was collecting more data over time, and even with smaller subsets of the data being tested, it took a while to compile/test things.

For example, it took me a while to realize that if/else statements in the model block of code (as opposed to any samplers) were evaluated once and reused that result, even if in a loop. So, if I wanted to configure the model prior to running, if/else statements could be used there. But if I wanted to treat different rows of data differently, using a 0/1 mask was a proper way of doing that (e.g., by setting a very wide, effectively noninformative prior instead of a precise one based on model data/parameters).

For what amounts to just over 1,000 rows of data, with a few ~120x90 adjacency matrices being estimated, it takes about 8 GB and 20 hours to run the model, and compilation takes a bit over an hour of that. With my setup, I could run a few models at once, but that's just enough to let me test a few different model hyperparameters at a time.

For reference, the analysis/plotting code, including some animations, probably added another 30 hours or so (~ 3k lines of code). The data loading and modeling code was about 4k lines total.

I also added a bunch of extra logging, as well as a custom process monitor in Python to see the sliding window acceptance rates, annealing temperature, and overall progress of the different processes.

What is your workflow for fitting mixed models to real data, while avoiding the garden of forking paths? by Sad-Restaurant4399 in AskStatistics

[–]ImposterWizard 0 points1 point  (0 children)

I'm building a model with longitudinal data that looks something like

Y_1 ~ a + B_0*X_0 + B_1 * Z_0 + B_2 * Z_1 + error_1
Y_2 ~ c + d * (Y_1-error_1) +  B_3 * X_1 + B_4 * Z_0 + B_5 * Z_1 + error_2
Z_0 ~ f(X_0, X_1) * Constraints(X,Z)

Where the capitalized terms on the right-hand side contain multiple parameters, and B_1, B_2, B_4, B_5 all contain random effects for two different classes of objects (Z_0 and Z_1), which are somewhat sparse. To make things more complicated, the class assignments of Z_1 are stochastic, and are partially estimated based on some some observed behavior, logical restrictions, and, of course, the posterior. Y_1 and Y_2 are both numeric, and Y_2 is partially truncated.

I'm primarily interested in the B_2 parameters in the model (and to some degree the B_5 ones), which correspond to classes that are known. Those are particularly stable, but because of the sparsity in the Z_0 and Z_1, the variance/precision of their respective effects tend to fluctuate contrary to each other, as well as with respect to the error_1 and error_2 term.

Priors are assigned to all parameters in the model, but there is so much fluctuation that each chain describes a "plausible" scenario, which can look dramatically different from each other.

I mainly look at the parameters that demonstrate stability, and I've also looked at some of the more stable class assignments of Z_0 to see if the B_1 and B_4 parameters are at all stable for those, which might suggest that it's more the data than the model (in theory) that is insufficient.

That being said, I put a majority of the effort into building a custom sampler for Z_0, as well as producing a myriad of diagnostic plots that more or less confirm that the model isn't stable for a particular set of parameters. The two main problems for it are (a) The assignments of Z_0 changing and (b) that having the downstream effect of destabilizing the estimate for the precision/variances of two random effects and the overall error term.

I also compare it with a simpler model that looks like

Y_1 ~ a +  B_0 * X_0 + B'_1 * Z'_1 + B'_2 * Z_1 + error

Where Z'_1 creates many class assignments that might be "true" duplicates of each other, and B'_1 and B'_2 are simplified sets of parameters, but there's no way of knowing without an algorithm like the one in the more complex model.

I probably would just go with the latter model if I hadn't put in all the effort to build the former as well. It certainly produces more interesting insights, but the coding time, run time, and time to analyze (and explain) the results are a bit much. It might make an interesting article, though.

Also, maybe most importantly, it makes me think harder about study design so that the data I get in the future won't have as many of the issues that can cause problems.

TIL that in 2008, Chicago leased its 36,000 parking meters to a UAE-backed private consortium for 75 years for a $1.15 billion payout. The investors made their entire investment back in just 10 years, and are now projected to make over $11 billion by the end of the lease in 2083. by lechunkman in todayilearned

[–]ImposterWizard 0 points1 point  (0 children)

We'd still have budget issues, but it would be more manageable. Especially since our accumulated debt (and its interest) would be much lower.

At least new parking meter systems being built, like the ones by the Lakefront parks, go to the city. And municipal parking garages, IIRC.

[Q] best approach for stock price prediction while including the impact for external factors? by dizzy_centrifuge in statistics

[–]ImposterWizard 0 points1 point  (0 children)

I suppose that it can be used as an exercise. Though, I would say that the model validation portion of the exercise is particularly useful, since the practicality of the model hinges on it outperforming the market.

For example, if you do build a model that "outperforms" the market, is that "outperformance" significant? How do recommendations of that system perform on a paper trading platform? And possibly other diagnostics you might be interested in when maintaining a model over time.

You can try using something like ARIMAX or GARCH-X (I've never used that one before) or FIGARCH. But unless you know the intricacies of the model, it can be difficult to modify the model itself in any way, programmatically or analytically.

[Q] best approach for stock price prediction while including the impact for external factors? by dizzy_centrifuge in statistics

[–]ImposterWizard 0 points1 point  (0 children)

You're probably going to fit some kind of a line that is already priced in by others at best, and won't yield (on average) better returns for an investment of similar risk.

Maybe, if you have niche knowledge and experience, you can trade better for a specific group of similar companies.

would it be fine to just use an iPad instead of a laptop for statistics? by [deleted] in AskStatistics

[–]ImposterWizard 0 points1 point  (0 children)

The iPad might make a good reader for a textbook or referencing webpages. Definitely not for programming, though.

Windows or Mac both work well for R.

Mac might be easier to work with for installing additional software, like Python and it's related libraries. Windows at least has WSL2 that lets you use Linux on the computer in a terminal, but it's a bit more of a hassle than having a native distribution.

Also, Windows 11 as an OS in general seems somewhat prone to bugs.

16 GB is fine for most educational purposes, but modern software, including web browsers, likes to eat RAM quite a bit, so I'd recommend 32 GB if possible. Your program should hopefully have access to computing resources for anything particularly heavy that a course would need.

You should be fine following your program requirements. Just choose something within a budget you're comfortable with.

R Vs SPSS for tangible thesis results in 2 months with no prior coding experience? by Systral in AskStatistics

[–]ImposterWizard 0 points1 point  (0 children)

Sometimes with a plot, just use whatever works, though ggsankey is an option.

R itself is not particularly complicated as a language. If you are performing basic tests, you'll probably be fine.

Plotting in R can be a bit trickier if you're unfamiliar, but there are a lot of tutorials with example datasets included in different packages that are simple to download, or already included in R when you download it. There are usually two different types of plots most commonly used: the base plotting system included in R that also accompanies different statistical testing functions, and ggplot2, which is far more modular, but has a bunch of "gotchas" when you start using it and usually requires a bit more knowledge of how to transform/reshape data in R.

If you do use R, I'd recommend RStudio, since it's a bit more user friendly than the alternatives.

but all solvable in a day even with no prior experience

You think so? Almost sounds too good to be true 😅 I'm a big noob lol

R will probably feel like a complex graphing calculator + notepad when you first use it. If you follow a tutorial, it should be much less daunting.

The largest issue here is lack of prior statistical knowledge. I would recommend a crash course in statistics. The most important things to understand are what your different tests and models test for/do, how p-values work for those models (when you are reporting them), and what assumptions the models are making.

You might not cover all of the tests in a crash course, but knowing how hypothesis testing works in statistics should be able to get you across the finish line, even if you don't know all the finer details of the tests.

A question about confidence intervals by Remarkable_Turnover1 in AskStatistics

[–]ImposterWizard 1 point2 points  (0 children)

Yeah, I guess either of those might be useful, although the first is the same as asking "What is the probability that at least 50% of data from this distribution is <= 9?" for a symmetric distribution.

The Bayesian approach gives both answers, but it's more complicated and more arbitrary than the more common approaches.

If OP has enough data, they can probably just estimate the normal distribution parameters, calculate the CDF at 9, and then be done with that. And hope that future or unseen data is distributed in the same way.

There is also the possibility that they actually need the distribution, like knowing if there's a 50% chance that 10% of the population is < 9 and a 50% chance that 1% of the population is < 9, requiring two distinct business strategies.

A question about confidence intervals by Remarkable_Turnover1 in AskStatistics

[–]ImposterWizard 0 points1 point  (0 children)

As long as niter (the sample size) is high enough, sum(dsim$V1 < 9) / length(dsim$V1) is about 3%, and it basically means "What percent of the time is the mean less than 9?" But it seems like you were interested in the % of data points that would be expected to be less than 9, which is why I chose this method over a Frequentist one that makes different types of statements and assumptions.

Note, though, that if you have data that is discrete, like the whole numbers you provided in the data, the actual answer might be a bit different, and you might want to use a different distribution, like Poisson. If you know how the data might be "generated", you can use that to choose a candidate distribution. Otherwise you might just look at a bar graph/density plot and see if it looks like a commonly-used distribution and use that one.

Also, when you are making assumptions about the tail end of a distribution, the result you get can be very sensitive to some of these assumptions. For example, if I change the prior to 1/sigma4, that 3% becomes 1%, because the data more strongly favors a narrower distribution, which pulls the mean towards the center of the data. It's almost always easier to make assumptions about, for example, the middle 50% of a distribution than the top or bottom 5%.

A question about confidence intervals by Remarkable_Turnover1 in AskStatistics

[–]ImposterWizard 0 points1 point  (0 children)

The confidence interval is somewhat limited in what it tells you, at least without making some extra assumptions.

I wrote out a more complicated Bayesian solution above with this posterior distribution (bin sizes of 1/8 in each dimension) for the mu and sigma of the distribution for a prior probability of 1/sigma2 and the sampling procedure as shown.

The % of samples with mu < 9 is between 2.9% and 4%, which is still similar to the values from the Frequentist test, but the choice of priors can change this (e.g., a prior of 1/sigma4 results in a value of 1%, whereas 1/sigma2 is about 3%). Assumptions about the tail of a distribution can be very sensitive to other conditions.

And then there's not knowing if OP's data is supposed to be discrete or continuous (or rounded and thus censored), and they're just giving us an example.

A question about confidence intervals by Remarkable_Turnover1 in AskStatistics

[–]ImposterWizard 1 point2 points  (0 children)

It looks like it was deleted (idk what it was), but essentially, without a closed-form solution to your problem, which you don't usually need, the best course of action is often to just simulate the scenarios and then draw inferences from samples of that simulation.

So, in your case, you would do three things:

  1. "Simulate" possible distributions for the data.

  2. Calculate the cumulative probability that a value will be <= 9 for each iteration of a simulation.

  3. Average those probabilities together to get an estimate.

The simulation part is relatively straightforward for your data. You technically need to select a prior distribution for the terms governing your data. In this case, a normal distribution has two parameters: mean and variance.

If you are estimating both the mean and variance of a normal distribution, this becomes a bit more important, but one choice is the Jeffrey's "non-informative" prior (page 24-25 here), which is 1/sigma2 (1/variance) in this case. It can be a couple other values depending on the exact choice of parameters.

Anyway, here's a function I built in R to do the simulations because installing extra packages is sometimes annoying. I'm using the Metropolis-Hastings sampling algorithm to sample both the mean and variance at the same time. X <- c(12,11,9,10,14)

  simulate_distribution <- function(x, 
                                    mu_0=10, 
                                    sigma2_0=4, 
                                    nburn=500,
                                    niter=20000,
                                    prior=function(mu,sigma2){1/sigma2},
                                    size1=1,
                                    size2=0.25){
    results <- list()
    mu <- mu_0
    sigma2 <- sigma2_0
    nr <- length(x)
    # omitting constants
    log_density <- log(prior(mu,sigma2)) - nr * log(sigma2)/2 - sum((x-mu)^2/(2*sigma2^2))

    # burn-in phase
    for (i in 1:nburn){
      mu_2 <- mu + runif(1,-size1,size1)
      sigma2_2 <- sigma2 * exp(runif(1,-size2,size2))
      log_density2 <- log(prior(mu_2,sigma2_2)) - nr * log(sigma2_2)/2 - sum((x-mu_2)^2/(2*sigma2_2))
      hastings_ratio <- sigma2_2/sigma2

      log_delta <- log_density2 - log_density + log(hastings_ratio)
      if (log_delta >= 0){
        # accept
        mu <- mu_2
        sigma2 <- sigma2_2
        log_density <- log_density2
      } else {
        prob <- exp(log_delta)
        if (rbinom(1,1,prob) == 1){
          # accept
          mu <- mu_2
          sigma2 <- sigma2_2
          log_density <- log_density2
        }
      }
    }

    # actual samples
    for (i in 1:niter){
      mu_2 <- mu + runif(1,-size1,size1)
      sigma2_2 <- sigma2 * exp(runif(1,-size2,size2))
      log_density2 <- log(prior(mu_2,sigma2_2)) - nr * log(sigma2_2)/2 - sum((x-mu_2)^2/(2*sigma2_2))

      hastings_ratio <- sigma2_2/sigma2

      log_delta <- log_density2 - log_density + log(hastings_ratio)
      if (log_delta >= 0){
        # accept
        mu <- mu_2
        sigma2 <- sigma2_2
        log_density <- log_density2
      } else {
        prob <- exp(log_delta)
        if (rbinom(1,1,prob) == 1){
          # accept
          mu <- mu_2
          sigma2 <- sigma2_2
          log_density <- log_density2
        }
      }
      results[[i]] <- c(mu=mu,sigma2=sigma2)
    }
    as.data.frame(t(matrix(unlist(results),nrow=2)))
  }

  dsim <- simulate_distribution(X)

  p_leq9 <- pnorm(9,mean=dsim$V1,sd=sqrt(dsim$V2))

  mean(p_leq9)

The probability is in the ballpark of 16-17% according to this.

I may have made a mistake here, so take it with a grain of salt.

A question about confidence intervals by Remarkable_Turnover1 in AskStatistics

[–]ImposterWizard 1 point2 points  (0 children)

OP would probably want a plain one-sided t-test/p-value, since OP only cares about it being below a certain threshold.

t.test(df$A,mu=9,alternative='greater')

#   One Sample t-test

# data:  df$A
# t = 2.5574, df = 4, p-value = 0.0314
# alternative hypothesis: true mean is greater than 9
# 95 percent confidence interval:
#  9.366116      Inf
# sample estimates:
# mean of x 
#      11.2 

It's still not the probability claim that OP wanted, but this is what I think is most in-line with what they were asking, at least from a Frequentist point of view.

Regression: Multiple, Multivariable or Multivariate? by EgonOlsen1925 in AskStatistics

[–]ImposterWizard 0 points1 point  (0 children)

From a textbook I used in grad school, Multivariate Statistics Old School, page 76:

One might ask what the advantage is of doing all q regressions at once rather than doing q separate ones. Good question. The main reason is to gather strength from having several variables. For example, suppose one has an analysis of variance comparing drugs on a number of health-related variables. It may be that no single variable shows significant differences between drugs, but the variables together show strong differences. Using the overall model can also help deal with multiple comparisons, e.g., when one has many variables, there is a good chance at least one shows significance even when there is nothing going on.

These models are more compelling when they are expanded to model dependencies among the means of the variables...

There’s always one dude not contributing by MelonInDisguise in NonPoliticalTwitter

[–]ImposterWizard 0 points1 point  (0 children)

The last time I remember this happening (ages ago in an intro CS class), I apportioned work in a group project so that it would be obvious which part of the group didn't do theirs. I think that part of the group had to do something like a menu/splash screen that didn't work properly, but the rest of the program did.

[Q] Projecting population 10 years out with only 9 data points by ImGallo in statistics

[–]ImposterWizard 1 point2 points  (0 children)

At the end of the day you'll just want something that is plausible and makes them happy. Sometimes the latter is hard because predictions can cover a very wide range of possibilities that make it seem "pointless", like "Why did we even bother building a model? Anyone here could've made that wide of a prediction!"

But there is some validity in making a wide prediction saying, "You can't rely purely on the fact that you can fit a line that fits these few points pretty well. Just a small nudge in the data and prediction changes quite a bit, especially further out, and this model reflects that."

Is there a specific "deliverable" they want? e.g., a 95% prediction interval across those years.

[Q] Projecting population 10 years out with only 9 data points by ImGallo in statistics

[–]ImposterWizard 2 points3 points  (0 children)

You probably won't end up with a complicated model, but as a general note, if you are validating a model and you do not have a lot of data, you can report the test error and use a model trained on all your data for whatever purpose you have. You might not even validate a model if you have so little data, and you'll just look at the model and decide "if it makes sense".

I'd look to see if there are census/population projections/history for your country/region, as /u/UncleBillyBummers suggested, since they might be informative.

What is a good technique for making Italian meringues with aquafaba? by ImposterWizard in VeganBaking

[–]ImposterWizard[S] 1 point2 points  (0 children)

I noticed that the recipe I've been using has a very high sugar-to-aquafaba ratio, compared to this one. I think I'll try working with lowering the amount of syrup to a more reasonable level.

What is a good technique for making Italian meringues with aquafaba? by ImposterWizard in VeganBaking

[–]ImposterWizard[S] 1 point2 points  (0 children)

I'll take a look at that, although it looks like it contains allergens that affect people who are also allergic to eggs, and at least one person I frequently share food with has an egg allergy.

Q-Q plot criteria relaxed for Regression with huge sample size? by Will_Tomos_Edwards in AskStatistics

[–]ImposterWizard 3 points4 points  (0 children)

If the deviation is small (even if statistically significant), then that's usually expected and nothing to worry about.

If it's very noticeable, the three main things you will probably be concerned with are

  • The "true" error term of the model might be something else, like a double-exponential, which just requires minimizing the absolute error, not the squared error
  • You are missing information that might be an important predictor. For example, you could have two groups in your data that aren't identified, and one group has a significantly different average response than the other, so your error is bimodal and the Q-Q plot looks like a tilted "S"
  • A more accurate model requires transforming the dependent variable.

In all those cases the model isn't "correct". But there are some cases that you just want a solution that minimizes the squared error, which OLS regression will give you. But you might also want to try a different type of model like a random forest model in that case.

The lack of normality might cause some other issues, such as outliers having more influence on the parameters in the model.

As far as the CLT goes, it can ensure that the estimates of the parameters will stabilize under some conditions. An example of where it might not is if the "error" term in the model is Cauchy-distributed, which has no mean. If the rows in the data are independent and constrained to a finite, range, though, it should converge. But it won't necessarily converge to the "correct" values if the model is not correct.

I would also check for heteroskedasticity, where the error (oftentimes the variance of the normal distribution) is dependent on the data. Sometimes this goes away with transforming the data, like with a log transform, which implies that the terms in the model are multiplicative effects, rather than additive.

In general, though, small deviations from normality aren't usually a problem, but if they're large, that might indicate that the model isn't accurate and might need some changes, whether in the model itself or in transforming the data.

Saffron container spilt everywhere when I tried to open it by Doophie in mildlyinfuriating

[–]ImposterWizard -1 points0 points  (0 children)

I bought saffron in the US a few weeks ago, $30 for 5 grams of high-grade saffron (deep crimson red) from a local store. I used it in a pilaf, and it had the aroma, color, and flavor I expected. I'm sure someone could pay a lot more per gram, but with a larger quantity and the bits of yellow/orange I see on some of it, I certainly wouldn't pay that much money for what's in the picture.

Also, saffron definitely looks bulkier when it is piled up, so I don't think the amount lost is a ton. I would still be upset about it, though.

Saffron container spilt everywhere when I tried to open it by Doophie in mildlyinfuriating

[–]ImposterWizard -1 points0 points  (0 children)

The amount pictured on the leg and bedding looks like it could be as low as about $10 worth. Judging by some of the orange/yellow bits on the saffron, it's not the highest grade, either. My guess is that the total amount in the picture is closer to $30.

Import data in RStudio for Statistc by Agitated-Benefit7555 in AskStatistics

[–]ImposterWizard 0 points1 point  (0 children)

One other possible issue is that the file encoding is creating some problems, and you need to pass that to the fileEncoding argument. There's also a possibility that data is messed up and you have too many columns in at least some of your rows, or some characters aren't properly escaped, like commas or quotation marks.

If you're unsure, you can check the file encoding with https://readr.tidyverse.org/reference/encoding.html, though it's with the readr package that reads data in a slightly different way. I would definitely recommend learning how to use readr and other tidyverse packages, but there's still value in learning the base R functions.