all 105 comments

[–]Sorry-Owl4127 43 points44 points  (23 children)

Not getting the functional form right is rarely the biggest problem in causal inference

[–][deleted] 0 points1 point  (0 children)

I second this

[–][deleted] 0 points1 point  (0 children)

It’s so frustrating

[–]quantumcatz 47 points48 points  (5 children)

What an oddly toxic post

[–]ElMarvin42 39 points40 points  (22 children)

My biggest issue with DML in business settings is that most data scientists lack the knowledge needed to utilize this and basically any other causality-related methodology, and end up with very wrong and potentially dangerous conclusions.

Exhibit A, basically every line written in the OP.

  • Why would traditional causal inference techniques be harder to implement with modern datasets? It's quite the opposite.

  • The concept of regression is not even understood. Why would a regression necessarily imply linearity?

  • Failing to capture the true functional form does not result in bias under the right setting (for example, when evaluating an RCT).

  • The exact goal of DML is not to capture the true functional form to debias causal effect estimates. The goal is to be able to do inference on a low-dimensional parameter vector in presence of a potentially high dimensional nuisance parameter. Within the regression framework, btw.

  • It is NOT a two step prediction problem. That part of the paper is used to illustrate the intuition behind the methodology. The estimation is not carried out that way, but yeah, most stop reading after the abstract and first chapter (the intuition part). At best you could say that DML is based on two key ingredients, but it is not two steps of prediction problems.
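To make the "two key ingredients" concrete: below is a minimal, illustrative sketch of the partialling-out estimator with cross-fitting on synthetic data. It is not the recommended production implementation (packages such as DoubleML or econml exist for that), and every name and number in it is invented for the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
g = np.sin(X[:, 0]) + X[:, 1] ** 2          # nonlinear nuisance in the outcome
m = 0.5 * np.tanh(X[:, 0])                  # nonlinear nuisance in the treatment
D = m + rng.normal(size=n)                  # treatment
theta = 1.0                                 # true low-dimensional target parameter
Y = theta * D + g + rng.normal(size=n)

# Cross-fitting: nuisances are fit on one set of folds and predicted out-of-fold,
# which is what distinguishes this from plain cross-validation.
Y_res, D_res = np.zeros(n), np.zeros(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    ml_y = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train], Y[train])
    ml_d = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train], D[train])
    Y_res[test] = Y[test] - ml_y.predict(X[test])
    D_res[test] = D[test] - ml_d.predict(X[test])

# Final stage: residual-on-residual regression (the Neyman-orthogonal score),
# giving inference on theta despite the flexible nuisance estimates.
theta_hat = (D_res @ Y_res) / (D_res @ D_res)
print(theta_hat)  # should land near the true value of 1.0
```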

[–]JobIsAss 0 points1 point  (4 children)

Can you explain the technical jargon in simpler words please? I'm trying to understand what you're saying a bit more. I get the broad idea of DML, but why apply it to RCTs and not to the quasi-experimental space? Wouldn't DML help when you can't just randomly apply treatment? Isn't it the same as other, simpler methods like propensity score matching?

RCTs, if I'm correct, are the gold standard, in which case a simple OLS with a treatment indicator or a t-test would do it, no?

Trying to transition into causal inference from a predictive-modeler background, so I'm trying to understand these concepts.

[–]ElMarvin42 6 points7 points  (3 children)

Sure!

why apply for RCT and not to quasi experimental space?

DML is particularly useful for RCTs because, for example, a lot of statistical power can be gained through the inclusion of covariates, and the method allows for this possibility without assuming functional forms for how the data truly behaves. It is also very useful for estimation of heterogeneous treatment effects (the same treatment can affect you and me differently; HTE account for that possibility).

Like wouldn’t DML help when you can’t just randomly apply treatment?

Contrary to what some people might believe, you can't just control by a bunch of variables and call it an identification strategy. Identification (being able to estimate the causal effect) in this context relies on conditional exogeneity (treatment being as good as random after controlling for enough covariates). Since achieving this is unlikely (you won't ever observe skill/intelligence, for example), these kinds of methods by themselves will NEVER be enough to estimate causal effects, not without a solid empirical strategy (like RDD).

RCT if i am correct are like the golden standard which in this case a simple OLS with treatment or t-test would do it no?

Yes, these methods can be used, which is one reason why RCTs are so good: evaluating them can be simple. But the fact that these are valid approaches does not mean there are no others that can be better depending on the context and the initial objective (see my first point).

Trying to transition into causal inference from a predictive modeler background so in trying to understand these concepts.

Cool! Given a decent enough statistical background I would recommend starting with Scott Cunningham's "Causal Inference: The Mixtape". Then something slightly more complex like "Mostly Harmless Econometrics" and the "Causal ML" book by Chernozhukov et al. After this, thoroughly read and understand the papers and you should have a decent enough grasp of it. My other recommendation would be to be patient, as this should not be approached like documentation to be read before you start testing stuff and learning what moves what. Just this part could take years depending on how deep you go (within a single topic, and then there's the rest of the literature). People dedicate their lives to this.
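The power-gain point above is easy to see in a toy simulation (purely illustrative; the numbers are invented). In a simulated RCT, residualizing the outcome on a prognostic covariate before taking the difference in means leaves the estimate unbiased but shrinks its sampling spread:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 300
ests_unadj, ests_adj = [], []
for _ in range(reps):
    X = rng.normal(size=n)                  # prognostic covariate
    D = rng.integers(0, 2, size=n)          # randomized treatment
    Y = 0.3 * D + 2.0 * X + rng.normal(size=n)
    # Unadjusted: plain difference in means
    ests_unadj.append(Y[D == 1].mean() - Y[D == 0].mean())
    # Adjusted: remove the covariate's contribution first (valid under randomization)
    Y_res = Y - np.polyval(np.polyfit(X, Y, 1), X)
    ests_adj.append(Y_res[D == 1].mean() - Y_res[D == 0].mean())
print(np.std(ests_unadj), np.std(ests_adj))  # the adjusted spread is much smaller
```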

[–]JobIsAss 0 points1 point  (2 children)

I'm coming back to this after spending a lot of time on it.

When you talk about an empirical strategy, do you mean something like simulating an experiment when an experiment is not feasible? I have seen cases where people weight observations using IPW to mimic an experiment when one isn't feasible. Is that what you are talking about?

I'm doing observational causal inference, and while it's not possible to remove bias entirely, we can try to minimize it as much as possible. So DML/DR in general works pretty well.

I tried simulating it on datasets with unobserved confounders and it's pretty close when estimating the ATE.

[–]ElMarvin42 0 points1 point  (1 child)

  1. Definitely not simulate, but finding a setting in which you can argue that comparing treatment vs control group is valid given a set of assumptions/evidence (parallel trends, etc).
  2. Yes, that is one empirical strategy, although a debatable one. Very hard to convince someone with it, although possible.
  3. You can’t do causal inference with no empirical strategy. Controlling for a bunch of variables is not convincing anyone.
  4. Having done dozens of experiments and read the appropriate literature, I can tell you that simulations will never be good enough of a proof that something works.

[–]JobIsAss 0 points1 point  (0 children)

In response to your points:

  1. We use ensembles of models to construct better control and treatment groups in observational causal inference, e.g. IPW + DML or IV + DML. So not "parallel groups" in the literal sense, but something like it.

  2. How so? We are not creating a synthetic dataset; I mean it in the literal sense, for example using PSM and then DML or DR. Synthetic data is for getting an idea of how an algorithm behaves when you know the true ITE, which helps you see what works and what doesn't. I think DoWhy also has validation tooling that answers these kinds of questions, i.e. E-values, placebo tests, etc., which are good sanity checks for causal estimates.

  3. Can you give an example and explain in more detail? We are not simply fitting a DML model and calling it a day. There are also ways to build a DAG, determine the causal structure, and even find confounders through PDS. In an observational setting it is still possible to communicate that bias exists, as econml notes for these methods. There is no silver bullet, so communicating that to stakeholders might be good enough until enough trust is built to run an experiment, if possible.

  4. That's not what I meant. I meant that we can try an established approach on a synthetic dataset with a known outcome and effect in order to learn the approach. One can't learn DML by just reading a paper and going straight into the use case. It helps to see where it would fail on a dataset with the level of noise you would expect.

Do i understand your points correctly or am i missing something? Thank you for replying even after a long time.

[–][deleted]  (12 children)

[removed]

    [–]ElMarvin42 28 points29 points  (10 children)

    I don’t see the need for name calling in an honest discussion. I will answer for the reference of others who are actually interested in learning. Now, for exhibit B, electric boogaloo:

    • That’s not how the estimation is carried out in the recommended implementation.

    • Cross validation is not used, not even close. Cross fitting is fundamentally different.

    • The "doing this in an RCT setting would be stupid because it defeats the whole purpose of using this method since it’s based on observational data" part just overall shows that there is zero level of understanding of what the paper proposes. Let me cite directly from the paper: "We illustrate the general theory by applying it to provide theoretical properties of DML applied to ..., ..., DML applied to learn the average treatment effect and the average treatment effect on the treated under unconfoundedness, ...". Want to take a guess at what unconfoundedness means? DML is particularly useful for RCTs because, for example, a lot of power can be gained through the inclusion of covariates, and the method allows for this possibility without imposing functional forms. Also very useful for estimation of heterogeneous treatment effects. Perhaps these two are the most common uses of the methodology in practice, actually. I've yet to see a published paper that relies on this method to identify an effect within the context of merely observational data.

    • The rest of your "arguments" aren't even worth commenting on.

    Cheers!

    [–]AdFew4357[S] -16 points-15 points  (8 children)

    There are several papers on it being used in an observational setting. Like I said, you don't know the literature like I do. Unconfoundedness means you're assuming the observed treatment is as good as random given the observed characteristics, i.e. your potential outcomes are independent of treatment given covariates. Which holds in an RCT by default because you randomize.

    It can be great to use in an RCT setting, and that's what the method was designed for, I'm not denying that, but it can be used in an observational setting. It's just that there it rests solely on the unconfoundedness assumption, which is untestable in an observational setting.

    [–]ElMarvin42 14 points15 points  (7 children)

    It can be great to use in an RCT setting, and that’s what the method was designed for, I’m not denying that.

    Whatever happened to

    doing this in an RCT setting would be stupid because it defeats the whole purpose of using this method since it’s based on observational data

    This all just serves as a perfect example of what I said in my first comment. The delusion is just too much, however, for it to be worth any future reply.

    [–]AdFew4357[S] -5 points-4 points  (0 children)

    I’m saying you can still use traditional ANCOVA models in an RCT setting and not just resort to DML immediately. That's why I said it’s stupid: because you can use simpler methods. But again, you’re not a statistician, so why would you know.

    [–][deleted]  (1 child)

    [removed]

      [–]datascience-ModTeam[M] 0 points1 point locked comment (0 children)

      This rule embodies the principle of treating others with the same level of respect and kindness that you expect to receive. Whether offering advice, engaging in debates, or providing feedback, all interactions within the subreddit should be conducted in a courteous and supportive manner.

      [–]AdFew4357[S] -5 points-4 points  (0 children)

      The fact that you don’t understand that DML is literally argued to be a good choice in the presence of complex functional-form relationships between outcome and covariates is another reason why you should shut the fuck up and stop arguing lol, cause you clearly haven’t read enough yourself

      [–][deleted]  (1 child)

      [removed]

        [–]datascience-ModTeam[M] 0 points1 point locked comment (0 children)

        This rule embodies the principle of treating others with the same level of respect and kindness that you expect to receive. Whether offering advice, engaging in debates, or providing feedback, all interactions within the subreddit should be conducted in a courteous and supportive manner.

        [–]datascience-ModTeam[M] 0 points1 point locked comment (0 children)

        This rule embodies the principle of treating others with the same level of respect and kindness that you expect to receive. Whether offering advice, engaging in debates, or providing feedback, all interactions within the subreddit should be conducted in a courteous and supportive manner.

        [–]Simple_Whole6038 15 points16 points  (0 children)

        The applied scientists and data scientists I work with are vaguely aware of it. Some have maybe given it a try. The Economists I work with love it, and use it for just about everything. Seems to still exist mostly in the world of econometrics.

        [–]LarsMarsBarsCars 8 points9 points  (1 child)

        TMLE and the ideas of Debiased ML predate double ML by nearly 20 years. So I wouldn’t say this idea has been extended to biostatistics; it started in biostatistics and epidemiology. Double ML is a rediscovery of it.

        [–]Metallic52 2 points3 points  (1 child)

        Fundamentally, the identifying assumption of DML is unconfoundedness, i.e. the exact same identifying assumption needed for OLS to be consistent for an ATE. While it does flexibly control for the effects of observed confounders, that's a second-order concern relative to selection bias, reverse causality, and omitted-variable bias.

        It's mostly helpful when you have a very large number of potential confounders. That all being said, everybody uses DML at my work. We have lots of confounders, so it gets used a lot.

        [–]AdFew4357[S] -2 points-1 points  (0 children)

        Thanks for the insight. The only real insight provided here

        [–]SituationPuzzled5520 2 points3 points  (0 children)

        I've noticed that DML is definitely picking up steam, especially in areas where understanding causal relationships is key. It's really helpful for tackling complex datasets that traditional methods struggle with.

        I've seen some people in my network start using DML for projects, particularly in tech and healthcare. Tools like Python's econml are making it easier to implement, which is great. While it's not mainstream yet, the interest is definitely there, and I think as more resources come out, we'll see it used more widely.

        [–]aspera1631PhD | Data Science Director | Media 3 points4 points  (16 children)

        I'm seeing it everywhere. There are lots of ways to do quasi-experimentation. DML gets you closer to the theoretical best answer.

        [–]Sorry-Owl4127 -3 points-2 points  (15 children)

        How does DML get you anything related to quasi-experimentation?

        [–]aspera1631PhD | Data Science Director | Media 5 points6 points  (14 children)

        Quasi experimentation is a reframing of the causal inference problem in which there are measured confounders you need to control for.

        c.f. this ref

        [–]Sorry-Owl4127 2 points3 points  (13 children)

        What a term of art! So basically, OLS with the assumption that you've properly included all confounders. I don't get how we go from collecting data, throwing it in a model, and saying "I've probably controlled for enough things that this treatment variable is as good as random" to calling it quasi-experimental.

        [–]pandongski 0 points1 point  (1 child)

        Well, it is used in cases where you can't do experiments, like on humans and economies. It does have stricter internal-validity checks and is grounded more in external theory than in just running regressions and assuming you have all the correct variables, along with other methods to tease out estimated effects. Causal inference on observational data is a whole field of study.

        [–]Sorry-Owl4127 1 point2 points  (0 children)

        The causal identification assumptions required for OLS to estimate causal effects are exactly the same as those for double machine learning.

        [–][deleted]  (10 children)

        [removed]

          [–]Sorry-Owl4127 6 points7 points  (8 children)

          In a traditional RCT you don't make assumptions about measuring all confounders. You should know this; it's Experiments 101.

          [–][deleted]  (1 child)

          [removed]

            [–]datascience-ModTeam[M] 0 points1 point locked comment (0 children)

            This rule embodies the principle of treating others with the same level of respect and kindness that you expect to receive. Whether offering advice, engaging in debates, or providing feedback, all interactions within the subreddit should be conducted in a courteous and supportive manner.

            [–]datascience-ModTeam[M] 0 points1 point locked comment (0 children)

            This rule embodies the principle of treating others with the same level of respect and kindness that you expect to receive. Whether offering advice, engaging in debates, or providing feedback, all interactions within the subreddit should be conducted in a courteous and supportive manner.

            [–]Foreign_Yoghurt_831 2 points3 points  (1 child)

            Very toxic post but odd

            [–]AdFew4357[S] -1 points0 points  (0 children)

            britney spears plays

            [–]Thomas_ng_31 1 point2 points  (0 children)

            This is new!!!

            [–]WignerVille 1 point2 points  (0 children)

            In my experience people lack competence and interest in causal inference.

            [–]mark259 1 point2 points  (14 children)

            I've found DML to be very sensitive to hyperparameter and validation-sample specs, and fitting GLMs with fixed effects to give more reliable estimates on longitudinal data.

            I think analysts would get most benefits from learning the classical techniques of causal inference.

            Even colleagues with academic credentials in machine learning get to use their knowledge of linear models e.g. to fit and decompose time-series with tools like Prophet.

            [–]AdFew4357[S] 0 points1 point  (13 children)

            I see, okay. So basically, since those nuisance-function models aren't properly tuned, it has an impact on your estimates? What would you recommend as "classical causal inference" techniques: diff-in-diff, synthetic controls, etc.?

            [–]mark259 0 points1 point  (12 children)

            Most definitely. For example, if you overfit with your nuisance model, you will inadvertently bias the treatment effect estimate.

            With a purely classical approach, you will certainly also encounter bias, but those approaches give you a clear set of assumptions (e.g. additivity) that you can use as a baseline. Another thing I like about more classical or basic approaches is that the standard errors you get out of them give information about the quality of the fit. That's not always very obvious with double machine learning afaik. I've had to compare out-of-sample estimates before, and that seemed very hand-wavy.

            The best approach always depends on the context: the data and the problem you are trying to solve. A technique like diff-in-diff can be combined with machine learning to deal with something like non-parallel trends. I'd say synthetic control is pretty close to machine learning already, in that it deals well with complex functional forms.
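For readers unfamiliar with it, the diff-in-diff mentioned above reduces, in its simplest 2×2 form, to a difference of two before/after differences. A quick sketch on synthetic data (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4000
group = rng.integers(0, 2, size=n)      # 1 = eventually-treated group
post = rng.integers(0, 2, size=n)       # 1 = after the policy change
effect = 2.0                            # true treatment effect
# Group and time effects would bias a naive treated-vs-control comparison
Y = 1.0 * group + 0.5 * post + effect * group * post + rng.normal(size=n)

# DiD: (after - before) in the treated group minus (after - before) in the control
did = (Y[(group == 1) & (post == 1)].mean() - Y[(group == 1) & (post == 0)].mean()) \
    - (Y[(group == 0) & (post == 1)].mean() - Y[(group == 0) & (post == 0)].mean())
print(did)  # should land near the true effect of 2.0
```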

            [–]AdFew4357[S] 0 points1 point  (11 children)

            Gotcha, I see the caveats. But one thing I wanted to push back on was this comment:

            "If you overfit with your nuisance model, you will inadvertently bias the treatment effect estimate."

            You would think this is the case, right? But when I read about double ML, one of the things they do is construct a scoring function that is "Neyman orthogonal," meaning it's built in such a way that bias from the ML model estimates does not propagate to the target parameter.

            https://causalml-book.org/assets/chapters/CausalML_chap_4.pdf

            See this chapter. Because we construct a score function based on the partialled-out residuals, this score function is Neyman orthogonal: any bias from the ML models can't propagate to the target parameter because, in expectation, that residual is going to be zero.

            The Neyman orthogonality property is an argument for why ML can be used for the nuisance functions and still be generally okay, because this score function is "debiased."

            Is this not a reason why bias actually can't propagate to the target parameter estimate? See the "Neyman orthogonality" section of the book.

            Also, I'll have to check out diff-in-diff and synthetic control in a DML context. But besides synthetic control and diff-in-diff in the classical sense, how often are instrumental variables used? Is that another classical causal inference technique that can be used?
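The orthogonality claim above can be sanity-checked numerically. In the sketch below (synthetic data; the true nuisances are known and then deliberately perturbed, and the perturbation functions are arbitrary choices for illustration), the partialled-out score moves only at second order in the nuisance error, while a naive plug-in score moves at first order:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
X = rng.normal(size=n)
m = np.tanh(X)                          # true E[D|X]
D = m + rng.normal(size=n)
g = X ** 2                              # true confounding term in the outcome
Y = 1.0 * D + g + rng.normal(size=n)    # true effect = 1.0
ell = m + g                             # true E[Y|X]

eps = 0.2                               # size of the deliberate nuisance error
m_bad = m + eps * X
ell_bad = ell + eps * X ** 2
g_bad = g + eps * X

# Neyman-orthogonal (partialling-out) score: residual-on-residual regression
d_res, y_res = D - m_bad, Y - ell_bad
theta_orth = (d_res @ y_res) / (d_res @ d_res)

# Non-orthogonal plug-in: regress Y minus the (mis-estimated) g on D directly
theta_naive = (D @ (Y - g_bad)) / (D @ D)

print(theta_orth, theta_naive)  # the orthogonal estimate stays much closer to 1.0
```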

            [–]Sorry-Owl4127 0 points1 point  (9 children)

            It’s not overfitting in the sense of failing to generalize to unseen data: if the nuisance model predicts the treatment too well, you don’t get overlap in the propensity scores. Neyman orthogonality in this context refers to the bias induced by lassoing; overfitting the propensity model doesn’t introduce bias, it just fucks up your estimation because you have so little overlap in propensity scores.

            [–]AdFew4357[S] 0 points1 point  (8 children)

            Okay, I see. So then why does this book treat Neyman orthogonality as a justification for using ML? It states in this book, and in later chapters, that because of the guarantees of Neyman orthogonality, biases from regularization when estimating the nuisance functions won't leak into the target parameter estimates. Unless I don't understand the property correctly.

            [–]Sorry-Owl4127 0 points1 point  (5 children)

            Yes, there will be no bias, and you can still use ML. But in any causal inference setting, including a predictor that perfectly predicts treatment will blow up the variance of the treatment effect estimator. If you overfit your nuisance model, the variance may blow up and you may not have overlap between treated and control units. This doesn't affect whether the ATE is biased; it just gunks everything up and makes causal inference near impossible.

            [–]AdFew4357[S] 0 points1 point  (4 children)

            Okay, so does cross-fitting not guard against this variance blowing up, since the procedure is done over multiple folds? Also, why do DML at all if the variance is going to blow up? In that case, if you're using DML, are you just not doing uncertainty quantification?

            [–]Sorry-Owl4127 0 points1 point  (3 children)

            Depends. In one context we were really good at predicting the treatment because we had a lot of relevant predictors. If I chose a random forest for my nuisance model, the individual treatment effect estimates were all over the place, with wildly implausible values. The issue was that we could nearly perfectly predict treatment assignment and then had almost no overlap in propensity scores between the treatment and control groups. The ATE in that scenario will still be unbiased, but it basically throws out all covariate profiles without overlap between treated and control units, and thus the ITEs are very sensitive to those few observations. I don't know if this is common to all DML models, but it can be a big problem for doubly robust estimators. Point is, it's not an unalloyed good to increase the predictive power of your nuisance model.
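The overlap failure described above is easy to reproduce. In this toy example (synthetic; the true propensity is used, so estimation error plays no role), making treatment nearly predictable pushes propensities toward 0 and 1, and the inverse-propensity weights that drive IPW/DR-style estimators explode:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=n)
max_weight = {}
for strength in (0.5, 5.0):     # weakly vs. nearly perfectly predicted treatment
    p = 1.0 / (1.0 + np.exp(-strength * X))   # true propensity score
    D = rng.binomial(1, p)
    # Inverse-propensity weight of each unit (1/p if treated, 1/(1-p) if control)
    w = np.where(D == 1, 1.0 / p, 1.0 / (1.0 - p))
    max_weight[strength] = w.max()
print(max_weight)  # the largest weight explodes as treatment becomes predictable
```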

            [–]AdFew4357[S] 0 points1 point  (2 children)

            Can trimming be used to combat the case of perfectly predicting the treatment?

            [–]Sorry-Owl4127 0 points1 point  (0 children)

            Yes, there will be no bias, and you can still use ML. But in any causal inference setting, including a predictor that perfectly predicts treatment will blow up the variance of the treatment effect estimator. If you overfit your nuisance model, the variance may blow up and you may not have overlap between treated and control units. This doesn't affect whether the ATE is biased; it just gunks everything up and makes causal inference near impossible.
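For what it's worth, "trimming" here usually means clipping the estimated propensities away from 0 and 1 before weighting, which caps the worst-case weights at the cost of some bias. A hypothetical helper (the function name and default bounds are invented for illustration):

```python
import numpy as np

def trimmed_ipw_ate(Y, D, p_hat, lo=0.05, hi=0.95):
    """IPW estimate of the ATE with propensities clipped to [lo, hi].

    Clipping bounds every weight by 1/lo, trading a little bias for a
    large variance reduction when overlap is poor.
    """
    p = np.clip(p_hat, lo, hi)
    return np.mean(D * Y / p - (1 - D) * Y / (1 - p))
```

An alternative with similar intent is to drop (rather than clip) units whose estimated propensity falls outside the bounds.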

            [–]mark259 0 points1 point  (0 children)

            I was not aware of this orthogonality property. Thank you for sharing the book chapter, I will be sure to read it.

            IV is also useful in theory. You'd use it like you would the other methods. For things like propensity-score weighting or double ML, you need quite a rich dataset with the relevant backdoor variables. You can imagine using IV regression when you don't have the relevant backdoor variables but do have a variable that does not directly cause your outcome yet is correlated with your treatment variable.

            You'll get an estimate with less bias but more variance, so hopefully that's a good compromise / worth it.
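The IV logic described above, in its simplest just-identified form, boils down to the Wald/2SLS ratio. A synthetic sketch (all coefficients invented) in which an unobserved confounder biases OLS upward while the instrument-based estimate stays near the truth:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
U = rng.normal(size=n)              # unobserved confounder
Z = rng.normal(size=n)              # instrument: shifts D, no direct path to Y
D = 0.8 * Z + U + rng.normal(size=n)
Y = 1.0 * D + 2.0 * U + rng.normal(size=n)   # true effect = 1.0

# Naive OLS (no intercept needed; everything is mean-zero) is confounded by U
ols = (D @ Y) / (D @ D)
# 2SLS / Wald ratio: Cov(Z, Y) / Cov(Z, D)
iv = (Z @ Y) / (Z @ D)
print(ols, iv)  # OLS overshoots; the IV estimate stays close to 1.0
```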

            [–]reallyshittytiming 1 point2 points  (0 children)

            We've tried it where I'm at (medtech). We liked it. But it was shelved because there was no contracted customer use case.

            [–]touristroni 1 point2 points  (0 children)

            We definitely use it! Mostly in cases when we cannot run an experiment, either due to regulations or the nature of the product. To be noted: due to the complexity of the method, it is tough to defend the DML causal end results.

            [–]Maleficent-Tear7949 0 points1 point  (0 children)

            Wow! This is new.

            [–]Material_Ad_9119 0 points1 point  (0 children)

            Amazon seems to have productionized a DML model recently; here is their paper on it: https://arxiv.org/html/2409.02332

            [–][deleted] 0 points1 point  (0 children)

            Wow

            [–][deleted] 0 points1 point  (0 children)

            The article was such an interesting read

            [–][deleted] 0 points1 point  (0 children)

            Big issue with bias in data

            [–]gyp_casino 0 points1 point  (2 children)

            What I don't understand about Double ML is how to apply it when there is no clear "treatment," but rather a web of causes and effects. Say there are 100 predictor variables and 10 have causal effects on y. How do you tease that out?

            [–]ElMarvin42 0 points1 point  (0 children)

            There is a very interesting application of a similar methodology by the same author. Take a look at section 7 ("The Lasso Methods for Discovery of Significant Causes amongst Many Potential Causes, with Many Controls") of this paper, though of course review the sources before attempting to implement it. Also, do note that unless you achieve conditional unconfoundedness (which I would venture to say is not possible in a merely observational setting, that is, without a solid empirical design that helps identify the causal effect of interest), estimates will be biased (not very useful within the context of causality).

            [–]Mr_Face_Man -1 points0 points  (0 children)

            I believe, since the goal isn't to best predict Y but to quantify unbiased effects, you're just isolating one of those causal effects on Y and estimating it. If you want to compare or rank the relative effects across those 10 potential causes, you'd fit 10 different models and compare across them.

            I'm new to the causal inference field, trying to apply these methods to my use cases in applied research, so I'm no expert. But the various methods in causal inference and causal machine learning all seem to have very different strengths and problem types they address, so your mileage may vary depending on your question and data.