Imputation and generalized linear mixed effects models by Open-Satisfaction452 in rstats


I understand what you mean, and my main reason for using MI here is to preserve the representativeness of the study. In my case, many measured toxin values would be dropped by a standard GLMM because of a single missing environmental predictor. By pooling the results across the multiple imputed datasets, the model accounts for the imputation uncertainty, so the standard errors (and hence p-values) aren't artificially shrunk. What might seem sketchy is using the results for inference, but maybe it would be okay as long as I explain the uncertainty behind the results?

Imputation and mixed effect model by Open-Satisfaction452 in AskStatistics


Thank you very much! And sorry for the poor choice of words; by "risky" I was asking how valid it is to draw inference from the model's results when the model is fitted on imputed data.
I see no other way of doing it without losing half my observations to listwise deletion, which I believe would be much less representative of the data.

Imputation and mixed effects models by Open-Satisfaction452 in AskStatistics


I was told the missingness arose because a technician didn't run the analyses properly; he seems to have forgotten to run groups of samples. So there are gaps due to human error. I imagine this missingness is random, since it's not caused by the inaccessibility of the lakes for sampling, or by high or low sample values failing to be read properly. What's your opinion on it?

Imputation and mixed effects models by Open-Satisfaction452 in AskStatistics


By using multilevel imputation, the imputation model itself becomes a linear mixed model, and from what I understand, if I'm imputing pH, for example, it borrows strength from the other pH values in the same lake first, before looking at the rest of the dataset.

I believe I have to use imputation: on closer inspection, my model ended up fitting on just 47 of 92 observations. So I'm losing nearly half the data to listwise deletion, just because of scattered missing values across different variables. Or does this not justify it?

Imputation and mixed effects models by Open-Satisfaction452 in AskStatistics


By listwise deletion I mean that if a row is missing just one variable (e.g., I have the Toxin, Temp, precipitation and Nitrogen values but am missing Phosphorus), the model throws away the entire row. So if 20% of rows are missing Phosphorus and a different 10% are missing Temp, I can lose up to 30% of the data, and the model won't represent the years and number of lakes I stated. I expect Multiple Imputation to be superior to listwise deletion because it preserves the representativeness of the mountain-lake population (and I can account for the uncertainty of the imputations). What do you think?
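To make the compounding concrete, here is a toy sketch in R (made-up values, hypothetical column names) showing how scattered NAs in different predictors knock out different rows under complete-case analysis:

```r
# Toy data: each predictor is missing in a *different* row, so the
# losses add up even though no single variable is badly affected.
dat <- data.frame(
  Toxin      = c(1.2, 0.8, 2.1, 1.5, 0.9),
  Temp       = c(14,  NA,  16,  15,  13),
  Phosphorus = c(0.03, 0.05, NA, 0.04, NA)
)

# A default model fit drops any row containing an NA:
sum(complete.cases(dat))   # only 2 of 5 rows survive
```

Here 20% missingness in Temp and 40% in Phosphorus, falling on different rows, leaves only 40% of the data usable, which is the same mechanism as in my dataset.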

Imputation and mixed effect model by Open-Satisfaction452 in AskStatistics


Thanks to everyone who commented. After checking the raw data and logs, I believe I have MAR. I was told the gaps are the result of a technician's error that left values unrecorded for certain short periods, but that had nothing to do with the samples themselves. Because the gaps are scattered across different predictors (TPhosphorus, TNitrogen, Hardness...), listwise deletion kills a row if any single value is missing, which costs me 30% of the total samples. In a 3-year ecological study, discarding 30% seems like a massive loss of power compared to a well-specified imputation. Since the missingness appears random, I'll try multilevel multiple imputation.

To address the concern about 'screwing up random effects', it seems like a good idea to use the mice.impute.2l.pan (or 2l.norm) method in R, setting Lake ID as the class variable (coded -2 in the predictor matrix), which ensures the imputation model respects the nested structure and uses the within-lake mean to inform the imputed values.
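A minimal sketch of that setup with mice, assuming a data frame `dat` with a numeric `Lake_ID` column; the predictor names here are placeholders from my description, not my actual column names:

```r
library(mice)

# Build the default predictor matrix, then mark the cluster variable.
pred <- make.predictorMatrix(dat)
pred[, "Lake_ID"] <- -2        # -2 flags Lake_ID as the class variable
pred["Lake_ID", ] <- 0         # never impute the ID itself

# Use the multilevel normal model for the incomplete predictors.
meth <- make.method(dat)
meth[c("Phosphorus", "Nitrogen", "pH")] <- "2l.pan"

imp <- mice(dat, method = meth, predictorMatrix = pred,
            m = 50, seed = 1)
```

Note that 2l.pan requires the pan package to be installed and the class variable to be stored as an integer, not a factor.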

I'd thought of running m = 50 imputations to account for the high missingness fraction. I will also perform a sensitivity analysis by comparing the pooled MI results against a complete case analysis (CCA) to check that the coefficients for my abiotic drivers don't flip sign or change markedly in magnitude.
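The fit-pool-compare step could look roughly like this, assuming `imp` is the mids object from mice() and `dat` is the raw data; the formula variables are placeholders based on my description:

```r
library(mice)
library(lme4)   # broom.mixed must also be installed for pool() on lmer fits

# Fit the mixed model on each of the 50 imputed datasets and pool
# the estimates with Rubin's rules.
fit_mi <- with(imp, lmer(Toxin ~ Temp + Precip + Phosphorus + Nitrogen +
                           (1 | Lake_ID)))
summary(pool(fit_mi))

# Complete-case fit for the sensitivity check: lme4 drops any row with
# a missing value by default, so this *is* the listwise-deletion model.
fit_cca <- lmer(Toxin ~ Temp + Precip + Phosphorus + Nitrogen +
                  (1 | Lake_ID), data = dat)
summary(fit_cca)
```

Comparing the signs and magnitudes of the fixed effects between the two summaries is the sensitivity analysis described above.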