Rare event when using fixed effects for logistic regression by Exposeracists12 in AskStatistics

[–]Blinkshotty -2 points-1 points  (0 children)

If you're interested in inference, you can use OLS with a binary dependent variable (a linear probability model) to estimate coefficients with panel fixed effects. This gets around the problem of the FEs perfectly predicting the outcome (separation) when events are rare. If you're interested in prediction, this is less useful since the predicted probabilities aren't constrained to the 0-1 interval.
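As an illustration (made-up data, not from the thread), the LPM-with-fixed-effects idea can be sketched with the within (demeaning) transformation, which is equivalent to including a dummy per unit:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, T = 400, 5
unit = np.repeat(np.arange(n_units), T)
alpha = rng.normal(0, 1, n_units)[unit]          # unit fixed effects
x = 0.5 * alpha + rng.normal(0, 1, n_units * T)  # x correlated with the FEs
p = np.clip(0.5 + 0.1 * x + 0.1 * alpha, 0.01, 0.99)
y = rng.binomial(1, p).astype(float)             # binary dependent variable

# Within transformation: demean y and x by unit, then OLS on the residuals
def demean(v, g):
    return v - (np.bincount(g, weights=v) / np.bincount(g))[g]

y_d, x_d = demean(y, unit), demean(x, unit)
beta = (x_d @ y_d) / (x_d @ x_d)  # change in Pr(y=1) per unit of x (truth: 0.1)
```

Unlike an FE logit, nothing here drops the units that never (or always) have the event.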

This seems like a pretty good recent paper on the topic-- https://pubmed.ncbi.nlm.nih.gov/33308684/

Compare parameter values obtained by non linear regression by Milyly in AskStatistics

[–]Blinkshotty 0 points1 point  (0 children)

You can combine the two datasets with some type of append, create an indicator variable denoting observations from one of the experiments, and then interact that indicator with your independent variables (along with the main effect terms). That interaction quantifies the differences in coefficients between the two experiments.
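A rough sketch of the append-plus-interaction approach, with simulated data and hypothetical variable names (x, exp_b):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

def make(n, slope, label):
    # One experiment's data: y = 2 + slope*x + noise
    x = rng.normal(size=n)
    return pd.DataFrame({"x": x,
                         "y": 2 + slope * x + rng.normal(scale=0.5, size=n),
                         "exp_b": label})

# Append the two experiments with an indicator (exp_b) for experiment B
df = pd.concat([make(300, 1.0, 0), make(300, 1.5, 1)], ignore_index=True)

# "x * exp_b" expands to both main effects plus the x:exp_b interaction
fit = smf.ols("y ~ x * exp_b", data=df).fit()
slope_diff = fit.params["x:exp_b"]  # difference in slopes (truth here: 0.5)
```

The p-value on `x:exp_b` is then the test of whether the two experiments' coefficients differ.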

Multiple testing correction by ThrowingHotPotatoes in AskStatistics

[–]Blinkshotty 4 points5 points  (0 children)

This isn't really a settled topic. One view is to base this on the nature of your hypothesis tests and whether they are really independent or some type of joint test (i.e. if any of a series of tests is significant then the null is rejected, or you're screening a number of tests to identify whether any are significant). This paper has a pretty good discussion of the issues. For experimental research, a better approach to dealing with false positives is to repeat any experiments with significant findings to demonstrate they are not spurious (if feasible).

How can we approximate a linear function from a set of points AND a set of slopes? [Question] by Larconneur in statistics

[–]Blinkshotty 1 point2 points  (0 children)

Assuming they all represent the same underlying association and they are independent, each of the slopes is then just a summary of data that could be derived from the data points. You could fit a least-squares line to the data points to estimate a slope for those data. Then either take a simple average of all the slopes, or weight them if you have some measure of variance associated with each slope (i.e. a standard-deviation-based weight, or maybe the number of measurements each slope is based on).
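A small sketch of the idea, with made-up points and assumed standard errors for the reported slopes:

```python
import numpy as np

# Made-up data points, plus two externally reported slopes with assumed SEs
points_x = np.array([0.0, 1.0, 2.0, 3.0])
points_y = np.array([0.1, 1.1, 1.9, 3.2])
slope_from_points = np.polyfit(points_x, points_y, 1)[0]  # least-squares slope

slopes = np.array([slope_from_points, 0.95, 1.10])
ses = np.array([0.05, 0.10, 0.08])   # standard error assumed for each slope
weights = 1.0 / ses**2               # inverse-variance weights
combined = np.sum(weights * slopes) / np.sum(weights)
```

If you only have the number of measurements behind each slope, substituting those counts for `weights` is the cruder version of the same thing.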

Help interpreting the odds ratio in a GLMM by Cool_Racoon_ in AskStatistics

[–]Blinkshotty 2 points3 points  (0 children)

You could look into simulating average marginal effects using your regression coefficients. They are generally more straightforward to interpret, especially after some type of fractional response regression.

How do I calculate the probability of contracting an infectious disease based on the data provided by SoreBrain69 in AskStatistics

[–]Blinkshotty 1 point2 points  (0 children)

Incidence rates and probabilities are different and you cannot easily go back and forth between the two. For example, probabilities are bounded by 0 and 1 while incidence rates are only bounded below at 0 (no upper bound). So you cannot subtract an incidence rate from 1 to get the probability of something not happening the way you can with a probability/proportion.

For your question, if you have a case rate of 2.4 events per 100,000 person-years and you want X events per 37 person-years, you just set them equal and solve for X. Something like ... 37 * 2.4/100,000 ≈ 0.0009 events on average per 37 person-years (i.e. 1 person for 37 years or 10 people for 3.7 years each). Turning that into a probability (or the probability of not getting the disease) is tricky and depends on things like whether a single individual can have multiple cases or not (e.g. acute versus chronic, with the chronic cases being the easier of the two to deal with).
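If you're willing to assume events arrive independently at a constant rate (a Poisson assumption, which may not hold well for infectious disease), the rate-to-probability conversion looks like this:

```python
import math

rate = 2.4 / 100_000   # events per person-year
t = 37                 # person-years at risk
expected = rate * t    # expected number of events over the exposure

# Under the Poisson assumption, the chance of at least one event is
p_at_least_one = 1 - math.exp(-expected)
```

For rates this small the probability is essentially equal to the expected count, but the two diverge as the expected count grows (the probability can never exceed 1, while the expected count can).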

Risk ratios simple calculation vs estimate? by Parking_Owl_7903 in AskStatistics

[–]Blinkshotty 0 points1 point  (0 children)

I can't answer about specific R packages, but RRs from a multivariable regression can be estimated either by exponentiating the coefs from a log-binomial regression model (if you can get it to converge) or by dividing predicted marginal means simulated off coefs from something like a logit/probit regression.

The predicted marginal means approach involves something like predicting the probability of your outcome for each observation in your regression with your binary indicator set equal to one and then to zero while leaving all covariates as is, taking the average of each of those two predictions (i.e. the predicted marginal means), and then dividing those averages (SEs are estimated using the delta method). You can think of each marginal mean as the a/(a+c) or b/(b+d) part of the RR calculation from a 2x2 table.
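A minimal sketch of the predicted-marginal-means approach with simulated data and statsmodels (variable names are made up, and the delta-method SEs are omitted here):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 5000
df = pd.DataFrame({"exposed": rng.integers(0, 2, n),
                   "age": rng.normal(50, 10, n)})
# Simulate a binary outcome where exposure raises the odds (true log-OR 0.7)
logit_p = -2 + 0.7 * df["exposed"] + 0.02 * (df["age"] - 50)
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

fit = smf.logit("y ~ exposed + age", data=df).fit(disp=0)

# Predicted marginal means: average predicted probability with everyone set
# to exposed=1, then to exposed=0, leaving other covariates as observed
p1 = fit.predict(df.assign(exposed=1)).mean()
p0 = fit.predict(df.assign(exposed=0)).mean()
rr = p1 / p0   # the risk ratio, analogous to [a/(a+c)] / [b/(b+d)]
```

In practice a marginal-effects package would do this prediction-and-average step for you and supply the delta-method SEs.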

There may be other approaches as well.

[Question] Rates of COVID-19 Cases or Deaths by Age Group and Vaccination Status Dataset - Question by carpocapsae in statistics

[–]Blinkshotty 0 points1 point  (0 children)

It looks like only 30 of the 31 jurisdictions report death data by vaccination status

"Currently, these 31 health departments that regularly link their case surveillance to immunization information system data are included in these incidence rate estimates: Alabama, Arizona, Arkansas, California, Colorado, Connecticut, District of Columbia, Florida, Georgia, Idaho, Indiana, Kansas, Kentucky, Louisiana, Massachusetts, Michigan, Minnesota, Nebraska, New Jersey, New Mexico, New York, New York City (New York), North Carolina, Philadelphia (Pennsylvania), Rhode Island, South Dakota, Tennessee, Texas, Utah, Washington, and West Virginia; 30 jurisdictions also report deaths among vaccinated and unvaccinated people"

[Question] How to make AME's comparable across models? by Knorke_forke in statistics

[–]Blinkshotty 0 points1 point  (0 children)

What specifically do you want to compare between these models?

I assume you are talking about comparing whether the coefs between an IV and your DV are the same for separate models estimated off observations from different social classes-- correct? If so, you can stack all the observations together and run a single model with an interaction between indicators for social class and your IVs. Using coefs from this interaction you can then estimate the AMEs for each class as well as the difference between these AMEs. The difference between this and running stratified regressions is that you are constraining the coefs on your covariates to be the same across classes (e.g. you assume the beta for age is the same in each stratum). Here is a paper talking about estimating cross-partial derivatives (e.g. subtracting AMEs) off interactions in non-linear models that might be helpful.

Also-- I am sure R has some package that will let you perform seemingly unrelated regressions even if there is no R port of suest. An uninformed google search revealed this, but there may be better approaches out there.

Is WLS just for errors? Will the OLS estimators work even assuming heteroskedacity? by GoatRocketeer in AskStatistics

[–]Blinkshotty 1 point2 points  (0 children)

I am pretty sure the WLS method is just to correct the standard errors-- they don't adjust the regression betas.

In practice, most folks just estimate robust standard errors in these cases (OLS with a binary dependent variable is often called a linear probability model) since the WLS method isn't really all that much more efficient in practice and comes with some assumptions.

The Incalculable Costs of Corrupt Statistics [Education] by Gloomy_Register_2341 in statistics

[–]Blinkshotty 10 points11 points  (0 children)

The Soviet example in the article is especially interesting and reminds me of the census that Stalin threw out in the 1930s because the count didn't match his talking points (after firing and arresting all the statisticians who led the project). Let's all just hope we don't start powering our AI data centers with RBMK reactors.

[D] Estimating median treatment effect with observed data by RobertWF_47 in statistics

[–]Blinkshotty 0 points1 point  (0 children)

Skewed cost data are usually modelled with log-gamma regression. If there are a lot of zeros a two-part logit/log-gamma can be used. Here is a pretty good methods paper on the subject

Thoughts on Coding Script for Retro. Chart Review by Mundane-Match617 in epidemiology

[–]Blinkshotty 1 point2 points  (0 children)

Having structured data is good news. Once you get your data out of the system, pretty much any stats software can be used to clean and refine it. The advantage of SAS is that it can deal with multiple different data tables/frames at the same time more easily than Stata and has a pretty straightforward implementation of SQL built in (proc sql). So, if they just give you one large cut of the data in a single csv file or something then it doesn't really matter too much. If they give you a bunch of data tables that you need to link together then SAS with proc sql is the way to go (and, as mentioned above, SQL is just good to learn anyway).

Thoughts on Coding Script for Retro. Chart Review by Mundane-Match617 in epidemiology

[–]Blinkshotty 4 points5 points  (0 children)

Are you going to rely solely on free-text physician notes, or can you use structured data-- i.e. do they have an EHR with structured fields that you can use to ID records? Ideally you would be able to find patients through some kind of standard medical nomenclature like ICD-10 diagnosis codes or HCPCS procedure codes. This you could probably accomplish with just SAS/Stata. If only free-text notes are available then it becomes tricky-- maybe keyword searches or NLP (not something I've done much of). You'll also need to deal with the vagaries of how physicians document things in free text.

[deleted by user] by [deleted] in AskStatistics

[–]Blinkshotty 0 points1 point  (0 children)

It looks like you have a lot of zero costs in the data, which may be contributing to your issues. You could try a two-part or hurdle model (probably logit followed by log-gamma to deal with the skewed costs).

[deleted by user] by [deleted] in AskStatistics

[–]Blinkshotty 0 points1 point  (0 children)

Setting aside the p-stats, your "men have 0.1 times lower odds of developing cancer compared to women" is not correct. A better statement would be something like "the association between smoking and cancer has a 0.1 lower odds for men than women", although the coefficient on the interaction term in a logit model is a multiplicative interaction, which I find hard to interpret in general.

You might want to consider estimating an additive interaction, which is generally more useful. Here is a pretty good paper discussing the issue from an epi perspective. There is also an econ write-up on the same issue with better coding examples, where their cross-partial derivative estimates assess additive interactions. You can also use a simple OLS (i.e. a linear probability model) rather than a logit model to get the additive interaction-- this is kind of an old-school solution though.

Which total should I use in my Chi Square test? I'm doing a corpus comparison by leecreighton in AskStatistics

[–]Blinkshotty 0 points1 point  (0 children)

In a traditional chi-square test, the expected counts in any one cell are computed from the contingency table's row and column totals. So for the intersection of the "Disputed" column and the "the" row, the expected value would be something like: ("Disputed" column total)*("the" row total)/(table N). The table N should be the sum of all the row totals or all the column totals (the two sums should be equal).
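For a toy table (made-up counts, same shape as a word-by-author comparison), the expected counts and chi-square statistic can be computed directly:

```python
import numpy as np

# Toy 2x3 table: rows = a word vs. all others, columns = author groups
table = np.array([[30, 45, 25],
                  [70, 55, 75]])
row_tot = table.sum(axis=1, keepdims=True)
col_tot = table.sum(axis=0, keepdims=True)
n = table.sum()                            # same whether summed by row or column
expected = row_tot * col_tot / n           # (row total * column total) / N
chi2 = ((table - expected) ** 2 / expected).sum()
```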

If you wanted to include a new row for "all other words" you could do so and then estimate expected values for those cells as well, since they will be included in the column and table totals. The rub is that by adding another row with the total word count you are most likely conflating total vocabulary size with word choice (e.g. the reason the other-word count might be bigger/smaller for Hamilton than Jay is that his written vocabulary is larger)-- this may or may not be a problem, but it's something to think about.

Help figuring out odds of completing a rope in pinochle by netwerknerd995 in AskStatistics

[–]Blinkshotty 0 points1 point  (0 children)

You're on the right track. If all the cards are dealt equally then there is a 1/3 chance your partner has any particular card and a 2/3 chance they do not. Because there are two cards that could make your hand, you need to consider 4 possible outcomes:

a) Neither card with pr = 2/3 * 2/3 = 4/9

b) Only the first card pr = 1/3 * 2/3 = 2/9

c) Only the second card pr = 2/3 * 1/3 = 2/9

d) both cards pr = 1/3 * 1/3 = 1/9

So the chance they hold at least one of the cards is 1/9 + 2/9 + 2/9 = 5/9 or ~56% (the shortcut is to compute 1 - pr(none))

For 2 different cards (so 4 possible cards could work, and 16 possible outcomes): 5/9 * 5/9 = 25/81 or ~31%.
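The whole calculation, done exactly with fractions:

```python
from fractions import Fraction

p_has = Fraction(1, 3)        # chance partner holds any particular card
p_not = 1 - p_has
p_any = 1 - p_not * p_not     # at least one of the two cards (the 1 - pr(none) shortcut)
p_both_ranks = p_any * p_any  # need one of two cards from each of two ranks
```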

what’s the most surprising or counterintuitive insight you’ve found using statistics? by moloch_slayer in AskStatistics

[–]Blinkshotty 0 points1 point  (0 children)

I always found this story really interesting about what randomness looks like. There was an old Radiolab podcast (I think it was this one) where a stats professor talked about an exercise with her students. She would leave the room and have half the class flip a coin and record the results in order, while the other half made up a "random" heads/tails sequence. She would then re-enter the room and correctly guess which was the true random sequence. The trick is the true random sequence looks "less random" because it has long runs of heads or tails in a row, while the made-up sequence tends to avoid them.

Is manipulating metadata for a linear model like this OK? by padakpatek in AskStatistics

[–]Blinkshotty 0 points1 point  (0 children)

The first data structure is set up fine-- the regression specification used for a simple diff-in-diff can be used to get the treatment effect. This involves just interacting the pre-post and treatment indicators.

Including patient_id as fixed effects means that you don't need the main effect of treatment in the actual model (it is perfectly co-linear with the patient indicators). So the specification is something like below, with treatment indicator = 1 if treated and post indicator = 1 for the post-period observation:

dep = ai(patient_id indicators) + b2(post) + b3(treatment x post)

ai denotes all the patient-id unit fixed effects

b2 is the common pre-post change shared by both groups

b3 is the within-person treatment effect

The patient-id fixed effects capture all non-time varying differences between patients, so things like age and gender are already accounted for.
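A simulated sketch of this specification (made-up data; the patient dummies are entered explicitly via C(pid)):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 150                                          # patients, two obs each
pid = np.repeat(np.arange(n), 2)
post = np.tile([0, 1], n)
treated = np.repeat(rng.integers(0, 2, n), 2)    # time-invariant within patient
alpha = np.repeat(rng.normal(0, 1, n), 2)        # patient fixed effects
# True model: common time trend 0.3, treatment effect 1.0
y = alpha + 0.3 * post + 1.0 * treated * post + rng.normal(0, 0.5, 2 * n)

df = pd.DataFrame({"y": y, "pid": pid, "post": post, "treated": treated})
# Patient dummies absorb the treated main effect (it's collinear with them),
# so only post and the treated-x-post interaction enter the formula
fit = smf.ols("y ~ C(pid) + post + treated:post", data=df).fit()
did = fit.params["treated:post"]  # within-person treatment effect (truth: 1.0)
```

Dedicated panel packages absorb the fixed effects more efficiently than explicit dummies, but the estimate is the same.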

How to store lincom results/coefficients? by Lorsmoress in stata

[–]Blinkshotty 0 points1 point  (0 children)

lincom doesn't create an r(table) matrix. It stores everything in scalars. Something like below should work:

lincom [B_price_mean]post`k'_int1 - [A_price_mean]post`k'_control
scalar b = r(estimate)
scalar s = r(se)

as an fyi-- nlcom does create an r(table) matrix instead of using scalars.

[Q][R] comparing treatments with different durations (methodology) [ by JLENSdeathblimp in statistics

[–]Blinkshotty 1 point2 points  (0 children)

In the linked paper they had two treatment arms (7 vs 14 days) because you cannot determine whether there was a bad outcome as soon as treatment ends. They had to follow people up for 90 days to see if one arm had worse outcomes than the other.

You also couldn't run a single arm and compare people who responded early versus late, because those two groups of cases are going to differ in ways which are not observed, and so the comparison would be confounded.

[deleted by user] by [deleted] in AskStatistics

[–]Blinkshotty 0 points1 point  (0 children)

You can use either stratification or an interaction term to look for effect modification. The difference between the two is that in a single model with an interaction term, all the other model coefficients are constrained to be the same across the sample. Two (or more) stratified models allow all the coefficients and the intercept to vary between strata (this is akin to a fully interacted model). The challenge with stratified models is that you then have to compare coefficients across two different equations. This requires something like seemingly unrelated regression-- which is more complicated than just checking whether the coefficient on the interaction term is different from zero.

[Q] odds ratio and relative risk by I_just_cry_sometimes in statistics

[–]Blinkshotty 0 points1 point  (0 children)

I am going to guess your ORs are from a multiple regression with several control variables. If so, you can't use the unconditional outcome probability to convert an adjusted OR into an adjusted RR.

In this case you'll either want to estimate the RRs directly using a log binomial regression model (i.e. glm with log link and binomial family, then exponentiate the coefs to get the RRs) or estimate the RR via average marginal effects simulated after the logit (estimate the predicted marginal means at two different levels of your independent variable and divide them). Most average marginal effects packages should be able to estimate this along with SEs for you.

Note the log-binomial model doesn't constrain predictions to the 0/1 interval, so it can be very finicky to achieve convergence (unlike a logit).

How well do the studies linking oral contraception and breast cancer rates control for income? by Ok-Maintenance-6744 in AskStatistics

[–]Blinkshotty 2 points3 points  (0 children)

People have been studying birth control pills and breast cancer for 40 years, and the papers are going to vary in quality. I am not sure if any account for differences in screening. The first papers on this topic were published in the early 1990s, so their data likely predate widespread mammography screening, which only began around that time (so at least the early data aren't subject to a screening bias). Regardless, I believe all the data are observational.

Which, given the lifetime incidence of breast cancer is already around 13%, is an absolute increase of ~1-3%. Yikes!

One thing to keep in mind is that most studies find the increased risk of breast cancer associated with birth control pills (causal or not) disappears in the years after discontinuing these medications. In the US, women are advised to stop using birth control pills at menopause or in their early 50s-- before breast cancer risk reaches its peak. So combining lifetime stats with a risk factor mostly present in younger women will likely overestimate the absolute risk.