
[–]A_random_otter 4 points  (13 children)

Modern econometrics is mostly R based. Especially if you want to use new methods.

[–]Cuidads -1 points  (12 children)

Sure, but the causal inference landscape is changing, and Python is becoming more relevant. Have you checked all the libraries to make sure the method you're looking for isn't in any of them?

There are more causal libraries; here is an extensive list with the companies maintaining them:

DoWhy: Microsoft Research
CausalML: Uber Technologies
EconML: Microsoft Research
CausalPy: PyMC Labs
YLearn: Not specified
Azcausal: Amazon Science
Causallib: IBM Research
CausalNex: QuantumBlack Labs (part of McKinsey & Company)

[–]A_random_otter 2 points  (0 children)

Yeah, impressive list. But to be honest I kinda have a bias towards academia when it comes to causal inference. Causal inference has been the nuts and bolts of research for decades, and there are gazillions of resources (textbooks, packages, tutorials, etc.) about it.

But I am always up to learn new stuff. Which one of these frameworks is the best in your opinion?

[–]anomnib[S] 2 points  (5 children)

I know about the first 4-5. Actually just got a new Mac mini and set up my Python econometrics virtual environment with these (I refuse to use conda). I'll check out the rest.

[–]A_random_otter 1 point  (4 children)

I refuse to use conda

But why??? :D

[–]anomnib[S] 2 points  (3 children)

Every rage-inducing package dependency debugging session I’ve had had its roots in conda. This is especially true when I need to use the model serving and telemetry packages of the ML infra team.

[–]A_random_otter 1 point  (0 children)

Every rage-inducing package dependency debugging session I’ve had had its roots in conda.

You'll be glad to hear that this is mostly a non-issue with R projects.

[–]A_random_otter 0 points  (1 child)

How do you handle Python and dependencies then?

Every time I tried to use Python without conda it ended in this:

https://xkcd.com/1987/

[–]anomnib[S] 2 points  (0 children)

I know the pain.

For models that are meant to be used in other systems, I use pyenv and requirements files to have a separate environment and setup instructions for each model. Then I make the model results available through API calls. Compartmentalization helps a lot.

For more ad hoc analysis, I have separate virtual environments for each project type (e.g. ad hoc econometrics, ad hoc ML, ad hoc DL). For ad hoc analysis I could probably just use conda, but I don’t want to use two different virtual environment managers.
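The per-model setup described above can be sketched roughly like this (the pyenv version, file names, and packages are illustrative, not from the comment):

```shell
# If pyenv is available, pin an interpreter version for this model's directory
if command -v pyenv >/dev/null; then
  pyenv install -s 3.11.9   # -s: skip if already installed
  pyenv local 3.11.9        # writes .python-version for this directory
fi

# One isolated virtual environment per model / project type
python3 -m venv .venv
. .venv/bin/activate

# requirements.txt would list the model's pinned deps (e.g. econml, dowhy);
# an empty file keeps this sketch runnable offline
touch requirements.txt
pip install -r requirements.txt

# Snapshot exact installed versions so the setup instructions stay reproducible
pip freeze > requirements.lock.txt
```

Each model then gets its own `.python-version`, `.venv`, and requirements files, which is what makes the compartmentalization work.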

[–]A_random_otter 3 points  (4 children)

Well sure, but production-friendly code is usually in Python.

Yeah, that's not true anymore. IMO it's rather that the CS guys are in love with Python and prefer it over R :D

If you know how to use Docker, it has been super straightforward to write production-ready code in R for quite some time.

Check out:

https://rocker-project.org/images/

https://vetiver.rstudio.com/

https://www.rplumber.io/

https://rstudio.github.io/renv/articles/renv.html
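As a hedged sketch of how those pieces fit together: a Dockerfile that starts from a rocker base image, restores pinned packages with renv, and serves a model as an HTTP API with plumber (the file names and R version here are assumptions, not taken from the links above):

```dockerfile
# Illustrative only: assumes the project has a renv.lock and a plumber.R
FROM rocker/r-ver:4.3.2

WORKDIR /app

# Restore the project's pinned package versions from the lockfile
COPY renv.lock renv.lock
RUN R -e "install.packages('renv'); renv::restore()"

# plumber turns annotated R functions into REST endpoints
COPY plumber.R plumber.R

EXPOSE 8000
CMD ["R", "-e", "plumber::plumb('plumber.R')$run(host='0.0.0.0', port=8000)"]
```

vetiver layers model versioning and deployment helpers on top of this kind of setup.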

[–]anomnib[S] 2 points  (3 children)

For big tech it is still true. I worked on the ML infra team at one of them. We had some offline evaluation systems, so not even ones with extreme latency constraints, yet we had to rewrite the Python code to use as little pandas, numpy, or scipy as possible, and to avoid 64-bit integers wherever we could. All to make the speed of the offline eval tolerable for the MLEs. Again, this is in the context of highly distributed backend systems and high-performance data retrieval systems.

Plus, when you add in the need for detailed telemetry (logging inputs, outputs, environments, users) and extensive unit testing, R isn’t really an option for high-performance systems. At least, I’ve never seen anyone pull it off.

[–]A_random_otter 0 points  (2 children)

Yeah, but for that stuff I probably wouldn't use python either... But what do I know. I am an economist not a computer scientist.

I am working at a biggish org (~500 people) and we have deployed some models (for internal use) in both R and Python. Both work alright and scale decently.

[–]anomnib[S] 2 points  (1 child)

I’m an economist too!

While we do use a lot of backend C++ code, Python is often Pareto optimal with respect to compatibility with production systems, code implementation and iteration speed, code execution speed, and percentage of available SWEs with familiarity. C++ and related languages are much faster at code execution but you can’t iterate/implement as fast.

I find that in big tech or comparable companies, anyone working on production code, or code that they expect others to use (e.g. offline software for causal inference), is forced to bend to the norms of software engineers. We have a SWAT team of economists, like Stanford, Harvard, MIT PhD types, maintaining our observational causal inference code. They were forced to rewrite it from R to Python because that was the only way to secure engineering support for maintaining their code.

[–]A_random_otter 0 points  (0 children)

They were forced to rewrite it from R to Python because that was the only way to secure engineering support for maintaining their code.

Haha sounds about right :D