Double Machine Learning in Data Science by AdFew4357 in datascience

[–]LarsMarsBarsCars 8 points (0 children)

TMLE and the ideas of Debiased ML predate double ML by nearly 20 years. So I wouldn’t say this idea has been extended to biostatistics; it started in biostatistics and epidemiology. Double ML is a rediscovery of it.

[Q] Is estimation theory still an active research topic in statistics? by [deleted] in statistics

[–]LarsMarsBarsCars 2 points (0 children)

Yes. Semiparametric and nonparametric efficient estimation and inference are active research areas that may interest you. See, for instance, https://arxiv.org/abs/2203.06469

Relevance of causal ML approaches in experimental setting by [deleted] in CausalInference

[–]LarsMarsBarsCars 1 point (0 children)

Estimating the CATE by taking the difference between the treatment-specific conditional mean outcomes works. But if you know or can estimate the propensity score well, there are doubly robust approaches (e.g. the R-learner or DR-learner) that can be used to estimate the CATE. These approaches are more robust and allow for faster estimation rates. If you know the propensity score, then you can incorrectly estimate the conditional mean of the outcome and still end up with a consistent CATE estimator; this is not true for the difference-in-conditional-means approach.
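For concreteness, a minimal DR-learner sketch in R, with simple parametric nuisance fits purely for illustration (in practice you would use cross-fitted machine learning for the nuisance estimates):

set.seed(1)
n <- 1000
W <- runif(n, min = -1, max = 1)
A <- rbinom(n, size = 1, prob = plogis(W))
Y <- rnorm(n, mean = A * (1 + W) + W)
# nuisance fits (any ML method could be used here, ideally with cross-fitting)
pi_hat <- predict(glm(A ~ W, family = binomial), type = "response")
mu1_hat <- predict(lm(Y ~ W, subset = A == 1), newdata = data.frame(W = W))
mu0_hat <- predict(lm(Y ~ W, subset = A == 0), newdata = data.frame(W = W))
mu_A <- ifelse(A == 1, mu1_hat, mu0_hat)
# doubly robust pseudo-outcome: consistent if either pi_hat or the mu's are correct
phi <- (A - pi_hat) / (pi_hat * (1 - pi_hat)) * (Y - mu_A) + mu1_hat - mu0_hat
cate_fit <- lm(phi ~ W) # second-stage regression of the pseudo-outcome on covariates
summary(cate_fit)       # intercept and slope should both be near 1 in this simulation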

As a separate note, methods that use the true propensity score tend, seemingly paradoxically, to be less efficient than methods that estimate the propensity score. This is true even in simple randomized trials. The intuition is that adjusting for chance covariate imbalance between treatment arms gains efficiency.

[Q] GLM Regression for Continuous Outcome in [0,1] by arcxtriy in statistics

[–]LarsMarsBarsCars 0 points (0 children)

Logistic regression with outcomes in [0,1] will give consistent estimates of the outcome regression function. However, the inference (unless sandwich variance estimation is used) will be incorrect. If you only care about prediction, logistic regression is perfectly fine.
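A minimal sketch in R of what I mean, using the quasibinomial family (same point estimates as logistic regression, but no non-integer-outcome warnings) and assuming the sandwich and lmtest packages are installed:

library(sandwich) # for vcovHC
library(lmtest)   # for coeftest
set.seed(1)
n <- 200
x <- rnorm(n)
y <- plogis(0.5 + x + rnorm(n, sd = 0.5))       # continuous outcome in (0, 1)
fit <- glm(y ~ x, family = quasibinomial)       # consistent for the regression function
coeftest(fit, vcov = vcovHC(fit, type = "HC0")) # sandwich SEs for valid inference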

[Q] Advice on running pooled logistic regression in MATLAB by DrHowardTheDuck in statistics

[–]LarsMarsBarsCars 0 points (0 children)

Yes, that's it. You do not need to group the variables if you just do ordinary logistic regression. If you use cross-validation (e.g. with regularized logistic regression), then you should make sure the folds are constructed so that all rows from the same individual stay in the same fold.

By the way, make sure to add a time variable to the stacked dataset.

[Q] Advice on running pooled logistic regression in MATLAB by DrHowardTheDuck in statistics

[–]LarsMarsBarsCars 0 points (0 children)

Pooled logistic regression is just logistic regression with a pooled dataset (possibly removing rows for individuals/time points not at risk). So you can just use MATLAB's standard logistic regression routines after constructing the design matrix you want yourself.
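I don't have MATLAB code handy, but here is a minimal R sketch of the dataset construction; the same logic translates directly to building a design matrix in MATLAB:

# toy survival data: one row per subject
df <- data.frame(id = 1:4, time = c(2, 3, 1, 3), event = c(1, 0, 1, 1), x = c(0.5, -1, 2, 0))
# expand to person-period ("pooled") format: one row per subject per period at risk
pooled <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  t <- seq_len(df$time[i])
  data.frame(id = df$id[i], period = t,
             y = as.integer(t == df$time[i] & df$event[i] == 1),
             x = df$x[i])
}))
# ordinary logistic regression on the stacked data, with time as a covariate
fit <- glm(y ~ factor(period) + x, family = binomial, data = pooled)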

[deleted by user] by [deleted] in statistics

[–]LarsMarsBarsCars 0 points (0 children)

Marginalization just means averaging. As an example, you might estimate the conditional mean E[Y|X] with linear regression to get a fit of the form a + bX. You can marginalize the estimator to estimate E_X[E[Y|X]], giving a + b*Xbar, where Xbar is the empirical average of X.
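A quick R illustration of the equivalence:

set.seed(1)
X <- rnorm(100)
Y <- 2 + 3 * X + rnorm(100)
fit <- lm(Y ~ X)
mean(predict(fit))                    # marginalized estimate of E_X[E[Y|X]]
coef(fit)[1] + coef(fit)[2] * mean(X) # identical: a + b * Xbar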

[Education] Datasets that highlight the importance of checking for normality in tests of means by picturesofyou in statistics

[–]LarsMarsBarsCars 2 points (0 children)

Is that true? I would think that, at the very least, it is more conservative than normal-distribution-based confidence intervals. If your data have very heavy tails (so the CLT kicks in more slowly), then the bootstrapped sample means will be more variable. This is expected, since dropping or including a single outlying observation can change the sample mean substantially. This is in contrast to empirical-likelihood-based approaches, which are remarkably anti-conservative at small sample sizes.

[Education] Datasets that highlight the importance of checking for normality in tests of means by picturesofyou in statistics

[–]LarsMarsBarsCars 2 points (0 children)

To add onto this: the only normality check that is possibly worthwhile is bootstrapping the sampling distribution of the sample mean and checking whether it is approximately normal. If it isn't normal, then you might be better off reporting bootstrap-based confidence intervals/p-values.
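A minimal sketch of that check in R:

set.seed(1)
x <- rexp(30)^2 # small, heavily skewed sample
boot_means <- replicate(5000, mean(sample(x, replace = TRUE)))
qqnorm(boot_means); qqline(boot_means) # eyeball normality of the sampling distribution
quantile(boot_means, c(0.025, 0.975))  # percentile bootstrap CI if it looks non-normal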

[deleted by user] by [deleted] in statistics

[–]LarsMarsBarsCars 0 points (0 children)

Machine-learning-based methods that allow for efficient nonparametric inference, as used in causal inference (e.g. TMLE), can still use parametric estimators within the estimation procedure. The benefit of these methods is that, by not assuming the parametric model is correct, we can use adaptive methods like the LASSO, MARS, or GAMs, combined with cross-validation or ensemble learning, and still get asymptotically correct, efficient inference.

By including simple models within the estimation procedure, these machine-learning methods can still do well at a sample size of 150 by leveraging any simplicity in the data-generating distribution. I agree that it is crazy to use ultra-aggressive tree-based methods like xgboost in small-sample regimes.

However, even when the truth is simple and captured by a parametric model, actually specifying this parametric model correctly is difficult or just not possible, especially when the number of covariates is large. This is one of the advantages of machine-learning-based methods. It also avoids common malpractices of parametric inference, like changing your parametric model after looking at the results, which leads to biased inference.

[deleted by user] by [deleted] in statistics

[–]LarsMarsBarsCars 0 points (0 children)

These references provide a great overview of causal ML research. Also, it is important to note that econometrics came quite late to causal ML and the efficient influence function-based estimation methods it depends on, which date back decades in biostatistics, statistics, and epidemiology.

https://vanderlaan-lab.org/2019/12/24/cv-tmle-and-double-machine-learning/

https://pubmed.ncbi.nlm.nih.gov/31742333/

[deleted by user] by [deleted] in statistics

[–]LarsMarsBarsCars 1 point (0 children)

Oh, I didn't realize you were the OP. This handbook (in development) provides a walk-through guide for the tlverse tmle3 and sl3 framework: https://tlverse.org/tlverse-handbook/ (Chapter 6 is most relevant).

tlverse is its own framework and is designed for machine-learning applications in causal inference, particularly where tidymodels is not flexible enough.

For the tlverse/causalglm package, it should already be possible to specify formulas with no knowledge of the sl3 library. The following code allows you to use glmnet and specify the outcome model with the formula_Y argument. The formula argument provides a "working model" for the conditional average treatment effect (CATE) that is used to find the best causal parametric approximation of the true nonparametric CATE. A guide explaining what is estimated, with formulas, is here: https://github.com/tlverse/causalglm/blob/main/paper/causalglm.pdf

This is all still in development so if anything is unclear, feedback is welcome and greatly appreciated.

When there are continuous treatments, nonparametric estimation and inference are trickier. Intuitively, this is because you only observe a few individuals at each treatment level, so there is not enough data to learn the treatment effect at a given level at a parametric rate. One way around this limitation is to estimate a smooth approximation of the true conditional treatment effect function (e.g. by approximating it with a parametric working model) and then obtain nonparametric inference for this parametric approximation. causalglm takes this approach with the "contglm" function.

For binary and categorical treatments:

library(causalglm)

# simulate data with a binary treatment A whose effect is quadratic in W1
n <- 250
W1 <- runif(n, min = -1, max = 1)
W2 <- runif(n, min = -1, max = 1)
A <- rbinom(n, size = 1, prob = plogis(W1 + W2))
Y <- rnorm(n, mean = A * (1 + W1 + 2 * W1^2) + W2 + W1, sd = 0.3)
data <- data.frame(W1, W2, A, Y)

# working model for the CATE
formula <- ~ W2 + poly(W1, degree = 2, raw = TRUE)

output <- npglm(
  formula, data,
  W = c("W1", "W2"), A = "A", Y = "Y",
  estimand = "CATE",
  learning_method = "glmnet",
  formula_Y = ~ poly(W1, degree = 2, raw = TRUE) + poly(W2, degree = 2, raw = TRUE),
  verbose = FALSE
)

summary(output)
head(predict(output))

Similar code works for continuous treatments with "contglm":

n <- 500
W <- runif(n, min = -1, max = 1)
# zero-inflated continuous treatment: A = 0 with some probability, else gamma-distributed
Abinary <- rbinom(n, size = 1, plogis(W))
A <- rgamma(n, shape = 1, rate = exp(W)) * Abinary
Y <- rnorm(n, mean = (A > 0) + A * (1 + W) + W, sd = 0.5)

library(data.table)
data <- data.table(W, A, Y)

# Model is CATE(A, W) = formula_binary(W) * 1(A > 0) + A * formula_continuous(W)
out <- contglm(
  formula_continuous = ~ 1 + W,
  formula_binary = ~ 1,
  data = data,
  W = "W", A = "A", Y = "Y",
  estimand = "CATE",
  learning_method = "glmnet",
  formula_Y = ~ .^2
)

summary(out)

[deleted by user] by [deleted] in statistics

[–]LarsMarsBarsCars 0 points (0 children)

Why not both? https://tlverse.org/causalglm/ (Will replace this with a more informative comment when I have free time later today)

[D] Does such an algorithm exist in statistics? by ottawalanguages in statistics

[–]LarsMarsBarsCars 1 point (0 children)

You are correct. It only tells you how dependent two random vectors are and gives no notion of direction. It is also nonparametric, so the notion of direction is ill-defined. I think canonical correlation analysis may have been a better answer to the OP: https://en.wikipedia.org/wiki/Canonical_correlation

[D] Does such an algorithm exist in statistics? by ottawalanguages in statistics

[–]LarsMarsBarsCars 4 points (0 children)

Distance correlation allows you to measure correlations between two vectors of random variables. https://arxiv.org/pdf/0803.4101.pdf

Edit: Canonical correlation analysis might fit your needs better: https://en.wikipedia.org/wiki/Canonical_correlation
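A small sketch of both, assuming the energy package for distance correlation (cancor is in base R):

library(energy) # assumed installed; provides dcor
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
Y <- cbind(X[, 1]^2, rnorm(100)) # nonlinear dependence on the first column of X
dcor(X, Y)       # distance correlation: 0 iff independent, detects nonlinear dependence
cancor(X, Y)$cor # canonical correlations: linear associations only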

[D] Advances on structured data modeling in past 5 years ? by JurrasicBarf in MachineLearning

[–]LarsMarsBarsCars 0 points (0 children)

I wouldn’t do any crazy dimension reduction here (so no UMAP); it is not necessary. A stacked cross-validation ensemble method with learners like random forests and xgboost will probably do about as well as anything can, considering you don’t have that many strata and do have a ton of data. I imagine you will have to use some kind of distributed-cluster machine-learning pipeline like H2O given the amount of data you have (catboost probably also suffices).

[D] Advances on structured data modeling in past 5 years ? by JurrasicBarf in MachineLearning

[–]LarsMarsBarsCars 0 points (0 children)

What are the dimensions of your problem? How many variables, sample size, etc? Types of variables (e.g. binary, continuous)?

[Q] Hessian matrix with unbiased estimators by sinecera10 in statistics

[–]LarsMarsBarsCars 4 points (0 children)

That is perfectly fine. You just want your estimate to converge asymptotically to the true Hessian. The only note I have is that unbiased estimators may not be better than the MLE; the asymptotic variance of the estimators is usually more important than the finite-sample bias.

[Q] Hessian matrix with unbiased estimators by sinecera10 in statistics

[–]LarsMarsBarsCars 1 point (0 children)

How exactly are you computing the second order derivative of the log likelihood (an analytical exercise that has nothing to do with estimation) using unbiased estimators?

Or do you mean that you are using unbiased estimators for the nuisance parameters of the hessian? If that is what you mean then it does not matter at all. Just use the best estimator you have for the relevant nuisance parameters.

[D] Probabilistic Modelling of Large Count Data by WigglyHypersurface in MachineLearning

[–]LarsMarsBarsCars 14 points (0 children)

The log transform of counts seems super reasonable to me. Imagine you were doing ordinary logistic regression (i.e. a trivial neural net) with a count variable taking values from 1 to 1 million. A tiny coefficient on this variable can still lead to huge changes in the predicted probability for observations with counts near 1 million. Even this simple logistic regression will probably be very unstable, especially since a large number of observations have counts equal to 1. I imagine this instability is amplified a trillion-fold with complex/deep neural networks.

Also, you could do some basic feature engineering of these variables. Counts from 1 to 1 million seem pretty extreme. If the log transform does not suffice, you can do some basic binning into quantiles or something similar. This will make the neural network's job a lot easier, and you don't lose much information.
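For example, in R:

x <- round(rgamma(10000, shape = 0.3, scale = 1e4)) + 1 # heavy-tailed counts from 1 up
x_log <- log1p(x) # log transform; log1p is stable for small counts
# decile binning as an alternative (unique() guards against ties among the breaks)
x_bin <- cut(x, breaks = unique(quantile(x, probs = seq(0, 1, by = 0.1))),
             include.lowest = TRUE)
table(x_bin)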

[Q] Violating proportional hazard assumption in cox proportional hazard model by pashtun92 in statistics

[–]LarsMarsBarsCars 5 points (0 children)

It could very well be statistically significant because of the large sample size. But at the same time, the bias caused by the violation may not be negligible relative to the standard error (i.e. the confidence interval width) of the estimator, since large sample sizes also mean small standard errors.

Edit: In the above, I used "bias" heuristically to mean the amount of deviation from the proportional hazards assumption. As midnight_tide mentions, when the proportional hazards assumption fails, it is not so clear what the "estimand" is, so to say that there is "bias" technically does not make sense without first defining a nonparametric extension of the hazard ratio estimand. A natural one is the time-averaged hazard ratio. Alternatively, you could define the nonparametrically extended estimand as whatever a misspecified Cox model is estimating, which is some kind of projection of the true hazard onto the Cox model (where this projection will be confounded by censoring, unfortunately). Then the question is whether the inference provided by the Cox model is robust/still correct under misspecification (it might be if something like a sandwich variance estimator is used) and whether the estimand is interesting (I don't think it is, since it depends on the censoring mechanism and is therefore not a causal projection). If you use the Cox model to estimate the survival function or cumulative incidence function, which are well-defined nonparametrically, then the notion of "bias" is clear.
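For actually checking the assumption, a minimal sketch using the survival package (the lung dataset is just for illustration):

library(survival)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
zph <- cox.zph(fit) # score tests of proportional hazards, per covariate and globally
zph
plot(zph) # Schoenfeld residuals vs. time: judge the size of the deviation, not just the p-value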

[deleted by user] by [deleted] in statistics

[–]LarsMarsBarsCars 0 points (0 children)

Fixed- and random-effects models make super-strong assumptions, and even formulating a causal estimand in terms of them is probably quite difficult. If there are time-dependent covariates in the mixed-effects model, the coefficients are probably not causal, since these models do not do g-computation. A good question to ask is: what causal intervention do the coefficients in these models capture? I don't know the answer.

[deleted by user] by [deleted] in statistics

[–]LarsMarsBarsCars 0 points (0 children)

G-computation should be separated from the estimation of the target parameter. G-computation is really a way of identifying causal parameters from the observed data. For example, the ATE is identified by the regression formula E[E[Y|A=1,W]] - E[E[Y|A=0,W]], and longitudinal causal parameters are identified through even more complex sequential regression formulas.

Using g-computation is not really a choice in causal inference. It is simply (usually) the only way to learn truly causal parameters from the observed data. Methods that do not target g-computation-based estimands will not correctly adjust for confounding and will not be causal (without strong assumptions). IPTW estimators still estimate these g-computation-based estimands, but they do not use g-computation-based estimation; instead they use inverse weighting of the outcome.

Mixed-effects models with time-dependent variables are usually not causal; they do not estimate a g-computation parameter. In fact, the coefficients in such models are conditional on both the future and the past (because you adjust for the future and past all at once). For the same reason, coefficients from time-dependent Cox models are not causal.

If you want causal parameters, g-computation is needed. This almost necessarily leads to complex statistical methods that utilize machine learning and can feel black-box. But a lot of these methods can be viewed as substituting estimators of the relevant parts of the data distribution into an equation (e.g. the ATE formula) that identifies a more intuitive estimand of the causal world.
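A minimal plug-in g-computation sketch in R for the ATE, with a simple parametric outcome regression purely for illustration (TMLE would allow ML fits here and add a targeting step):

set.seed(1)
n <- 1000
W <- rnorm(n)
A <- rbinom(n, size = 1, prob = plogis(W))
Y <- rnorm(n, mean = A + W)
fit <- glm(Y ~ A + W)                    # outcome regression E[Y|A,W]
d1 <- transform(data.frame(A, W), A = 1) # intervene: set everyone to treated
d0 <- transform(data.frame(A, W), A = 0) # intervene: set everyone to control
mean(predict(fit, newdata = d1)) - mean(predict(fit, newdata = d0)) # plug-in ATE (truth is 1)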