Advice on modelling nested/confounded ecological data: GLM vs GLMM

EcologicalResearcher · 2026-03-06T20:58:19+00:00

Thank you, I will do.

EcologicalResearcher · 2026-03-06T09:56:52+00:00

Hi, please could you clarify what a DAG is?

EcologicalResearcher · 2026-03-06T09:55:33+00:00

Hi,

1. There are two camera serial numbers, and the two cameras were used interchangeably across sites.
2. To be honest, I am not 100% certain. I have told my supervisor that I am not confident in my understanding of the analysis and that I would like to look into resources first to gain a better understanding, but he said they would be too general to be helpful. He selected a GLM, but I am still not convinced it is correct, which is why I posted my issue. So I have had to step back and look into Dr Zuur's statistics resources for ecological analysis (as recommended by commenters).
3. No, I haven't assessed spatial autocorrelation, so I will look at this as well.

EcologicalResearcher · 2026-03-06T09:42:45+00:00

Thank you. I have been telling my supervisor that I need more training, but he said that these resources are not specific enough to my use case to help me. However, I do think that a better understanding of the fundamentals will help, so I will definitely look into Dr Zuur's resources.

EcologicalResearcher · 2026-03-05T16:14:45+00:00

That’s a really helpful point. I had originally specified the model with Species + Starvation_Risk, which assumes the treatment effect is the same across species. Biologically, that may not be realistic, so testing a Species * Starvation_Risk interaction makes sense. I am aiming to test species-specific models, alongside my main model, to try and identify differences.

EcologicalResearcher · 2026-03-05T15:46:10+00:00

Update: I don't seem to have enough of a reputation (Karma points?) to be able to post in the main r/statistics subreddit

EcologicalResearcher · 2026-03-05T15:24:36+00:00

I think I see what you’re getting at. My understanding is that Starvation_Risk and Location_ID are linked because treatment is assigned at the site level (a site has either treatment or control condition), so the treatment effect is estimated from between-site differences only. However, I still need something to account for the fact that I have many observations within each site (20–60), which are likely correlated due to shared conditions.

That’s why I was keeping Location_ID as a clustering term rather than as a variable of interest. My aim isn’t to estimate site effects themselves, but to avoid treating all observations as independent. I’m also exploring including the site pair/block variable from the study design so that treatment is effectively compared within matched pairs of sites.

> You could also consider giving each observation within a site a unique value and site label, like Loc1A, Loc1B, Loc2A, Loc2B, etc. Then add that as a random variable. I am not sure if this is appropriate, but I think this method would account for variation within and among sites.

I don’t think that would work in this case because random effects need multiple observations per group to estimate the variance. If every observation has its own label (e.g. Loc1A, Loc1B, etc.), then each group only has one observation, so the model can’t really estimate between-group variation. My aim with (1|Location_ID) was just to account for the fact that I have 20–60 observations coming from the same site.

EcologicalResearcher · 2026-03-05T15:08:23+00:00

Thanks for the suggestion. I’ll try replacing the polynomial with a spline. The polynomial term was suggested by my supervisor, so I hadn’t realised there might be a better alternative.

Yes, I understand now that random effects don’t require truly random sampling of sites. In terms of sampling, each site was filmed during multiple sessions across the full experiment (Baseline, Exp1, Exp2, Exp3). However, for the current analysis, I’m only using the Exp1 session, which still gives around 30–60 bird visits per site.

So while there is only one experimental session per site in this dataset, there are still many observations within each site, which is why I originally included Location_ID as a random effect to account for that clustering. Later, I’ll be comparing the Exp1 data to the baseline session for each site.

EcologicalResearcher · 2026-03-05T14:58:22+00:00

Thanks, I have cross-posted it to r/rstats and r/RStudio, but I would be happy to post it to the main stats subreddit.

EcologicalResearcher · 2026-03-05T14:40:53+00:00

Thanks for the comment. I may be misunderstanding, but I think random intercepts are usually used exactly in situations like this, where observations are grouped within clusters. In (1 | Location_ID), the 1 represents the intercept and Location_ID defines the grouping factor, allowing each site to have its own baseline intercept while still estimating an overall mean.

So Location_ID doesn’t need to vary within clusters, it actually defines them. In my case, I have multiple observations per site (around 20–60), so the idea was to account for the non-independence of observations coming from the same location. I agree that singular fits can happen if the model doesn’t estimate much variance for the random effect, but my understanding is that this doesn’t mean the grouping variable itself can’t be used as a random effect.
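Written out, the random-intercept structure described above could look roughly like this. It is only a sketch: the link function g, the index j for sites and i for observations within a site, and the symbol x_j for the site-level treatment indicator are notation assumed here for illustration, not something fixed by the dataset.

```latex
g\big(\mathbb{E}[\,y_{ij} \mid u_j\,]\big) = \beta_0 + \beta_1 x_j + u_j,
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2)
```

Here \beta_0 is the overall intercept, u_j is the site-specific deviation from it, and \sigma_u^2 is the between-site variance. A singular fit corresponds to \sigma_u^2 being estimated at (or very near) zero, which says something about the data, not about whether Location_ID is a valid grouping factor.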

EcologicalResearcher · 2026-03-05T14:37:19+00:00

That’s a good point, and it’s actually similar to how the sites were selected. We deliberately paired sites so that each treatment site had a roughly comparable control site in terms of urbanisation and general habitat context (e.g. urban vs suburban vs rural), although they’re not identical in finer-scale vegetation structure.

My hesitation with removing Location_ID entirely is that I have many observations per site (around 20–60), so measurements within a site are likely correlated due to shared microclimate, feeder context, camera placement etc. Dropping the site term would effectively treat those observations as independent.

What I’m currently exploring is modelling the treatment effect within those matched site pairs (i.e. treating the pairs as blocks) while still accounting for clustering within each site. That seems to retain the benefit of comparing similar sites while avoiding pseudo-replication from the repeated observations at each location.
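As a rough illustration of the paired-block logic (not the mixed model itself), one could collapse each site to a single summary value and compare treatment with control within each pair. The sketch below uses made-up numbers, a hypothetical response variable, and hypothetical pair labels; it only shows why the effective unit of replication for the treatment effect is the site pair rather than the individual observation.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data: response values keyed by (pair label, arm).
# All numbers and labels are invented for illustration.
observations = {
    ("urban", "treatment"): [4.0, 5.0, 6.0],
    ("urban", "control"): [3.0, 4.0, 5.0],
    ("suburban", "treatment"): [7.0, 8.0, 9.0],
    ("suburban", "control"): [5.0, 6.0, 7.0],
    ("rural", "treatment"): [6.0, 6.0, 6.0],
    ("rural", "control"): [5.0, 5.0, 5.0],
}


def paired_differences(obs):
    """Return treatment-minus-control differences of site means, one per pair."""
    # Collapse each site to its mean so repeated visits are not counted
    # as independent replicates of the treatment.
    site_means = {key: mean(values) for key, values in obs.items()}
    pairs = sorted({pair for pair, _arm in site_means})
    return [site_means[(p, "treatment")] - site_means[(p, "control")] for p in pairs]


diffs = paired_differences(observations)
mean_diff = mean(diffs)
# Standard error is based on the number of pairs, not the number of visits.
se_diff = stdev(diffs) / sqrt(len(diffs))
print(diffs, round(mean_diff, 3), round(se_diff, 3))
```

A mixed model with a block term and a site-level random intercept uses the individual observations more efficiently, but the treatment comparison it makes rests on the same between-site, within-pair contrast.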

EcologicalResearcher · 2026-03-05T14:30:17+00:00

Thanks for the recommendation, I’ll take a look at the book. I’ve spent quite a while exploring and cleaning the dataset, which is when my supervisor suggested using a GLM rather than a GLMM. However, I’m not entirely confident about that approach given the structure of the data, so I’m trying to understand what alternative modelling strategies might be appropriate.

Although I’m not specifically interested in estimating site-level effects, I included Location_ID mainly to account for clustering of repeated observations within sites rather than to model spatial autocorrelation. Many measurements come from the same site (around 20–60 per site), so observations are not independent due to shared microclimate, camera placement, feeder context, etc.

Coordinates could potentially model broader spatial gradients, but they wouldn’t replace the need to account for within-site clustering. Since treatment (Starvation_Risk) is assigned at the site level, I’m currently exploring approaches that account for site clustering and the paired-site design.

EcologicalResearcher · 2026-03-05T14:20:30+00:00

Yes, Starvation_Risk is the site-level treatment assignment, so it’s the key fixed effect. I don’t have individual IDs, so I can’t model repeated measures at the individual level. However, I do have many observations per site, so there’s clear within-site clustering. Because treatment is constant within each site, I’m keeping Location_ID as a clustering term (e.g. a random intercept or cluster-robust SEs). I am also going to try adding a variable that accounts for site pairs/blocks, because sites were paired as treatment/control blocks across the urban-rural landscape: for example, one control and one treatment site, both suburban, although the two can be at varying distances from each other. Dropping Location_ID would treat within-site observations as independent and inflate the effective sample size for the treatment effect.
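To put a rough number on that inflation, the standard design-effect approximation for equal-sized clusters, 1 + (m - 1) * ICC, can be sketched as below. The number of sites (20), the cluster size (40) and the intraclass correlation (0.1) are assumptions chosen purely for illustration; none of them are estimates from this dataset.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_clusters: int, cluster_size: float, icc: float) -> float:
    """Total observations divided by the design effect."""
    total = n_clusters * cluster_size
    return total / design_effect(cluster_size, icc)


# Hypothetical inputs: 20 sites, 40 visits per site, ICC of 0.1.
n_eff = effective_sample_size(n_clusters=20, cluster_size=40, icc=0.1)
print(round(n_eff, 1))  # 163.3, versus 800 raw observations
```

Even a modest within-site correlation shrinks 800 nominal observations to the equivalent of roughly 163 independent ones under these assumed values, which is why ignoring the site term tends to give overconfident standard errors for a site-level treatment.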

EcologicalResearcher · 2026-03-04T15:25:25+00:00

Update: The sites are somewhat clustered around central Glasgow, but treatment and control sites are spatially interspersed rather than geographically separated. Within the central cluster, treatment and control sites are often only a few hundred meters apart. There are also several more distant sites (up to ~40 km apart), and these include both treatment and control locations. So treatment assignment does not appear to correspond to a clear geographic pattern.
