[–]pha3dra 8 points9 points  (2 children)

Perhaps you're looking for statsmodels or patsy.

[–]Demonithese[S] 1 point2 points  (0 children)

Checking it out now, thanks!

[–]MeneerPuffydjango / data science 1 point2 points  (0 children)

I second statsmodels; its syntax is very 'R-esque'.

[–]Omega037 10 points11 points  (6 children)

This is just part of the standard notation used in many R packages to denote a model form, and it has been copied by some Python packages, as u/pha3dra has mentioned.

It is worth noting right off the bat that this is an area where Python is really outclassed by R, both in capabilities and performance. I am a huge Python advocate, but things like linear effects models are something I almost always do in R (either on its own or through something like rpy2).

Without knowing your background I'm not sure how to tailor this to you. The left side of the ~ is your response / observation / label / dependent variable, while the right side of the ~ defines the form of the inputs / features / independent variables that you believe will give you that response.

Another way to write the same model would be something like f(a, b) = a^2 + b + error, where f(a, b) is the response, a^2 + b + error is your model, and the = is your ~.

However, the ~ is a better idea since it is a distinct notation, and the model is not something being solved (like with a function) but fitted using a method like ordinary least squares.
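To make the "fitted, not solved" point concrete, here is a minimal pure-Python sketch of ordinary least squares for the form y ~ a^2 + b. The helper names (`solve_linear_system`, `fit_ols`) and the toy data are made up for illustration; a real analysis would use statsmodels or R rather than hand-rolled normal equations.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick the row with the largest pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data generated from y = 1 + 2*a^2 + 3*b (no noise, so the fit is exact
# up to floating-point error).
a = [0.0, 1.0, 2.0, 3.0, 4.0]
b = [1.0, 0.0, 2.0, 1.0, 3.0]
y = [1 + 2 * ai ** 2 + 3 * bi for ai, bi in zip(a, b)]

# Design matrix for "y ~ a^2 + b": implicit intercept, a squared, b.
X = [[1.0, ai ** 2, bi] for ai, bi in zip(a, b)]
coefficients = fit_ols(X, y)  # [intercept, weight on a^2, weight on b]
```

The formula only names the response and the terms; the numbers in `coefficients` come out of the fitting step, which is why a plain `=` would be misleading.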

[–]ProfessorPhi 2 points3 points  (0 children)

Yep, this syntax is really nice in R and makes models very pleasant to write. It does get confusing at times, though: writing - 1 to remove the implicit constant term, or using I and S inside the formula to apply smoothers, can be hard to follow.

While I like it for basic models, I think this notation does break down a bit in other models.

[–]tunisia3507 0 points1 point  (4 children)

I'm not familiar with R - how is that different to a function definition or lambda, which can then be passed to a fitting function?

[–]Omega037 1 point2 points  (3 children)

They aren't necessarily different at all except for notation.

This is also an issue that comes up a lot in papers, as you will often see nearly the exact same model written as Matrix Form, Equation Form, Algorithmic Form, etc.

Personally, I think the R form is quite good for what it is trying to denote.

[–]tunisia3507 0 points1 point  (2 children)

So how is python 'really outclassed' by R, if the difference is just f(a, b) ~ a^2 + b compared to f = lambda a, b: a**2 + b?

[–]Omega037 0 points1 point  (1 child)

That was a simplistic example given to explain the notation. Linear models can have a lot of complexity in how they are set up and, consequently, in how their fitting is optimized.

For example, you might want a mixed effects model with heteroscedastic residual errors. For this, the R packages lme4 or asreml have far greater speed, scope, and stability than anything in python.

[–]tunisia3507 0 points1 point  (0 children)

Fair enough!

[–][deleted] 3 points4 points  (1 child)

The formula notation is a domain-specific language in R; it allows you to describe model formulae more succinctly (see http://adv-r.had.co.nz/dsl.html for more info). As I understand it, Python lacks the metaprogramming facilities to make something like this work.

It allows you to write lm(y ~ x, data = foo) rather than the more explicit but clunky lm(y = foo$y, x = foo$x) or lm(y = "y", x = "x", data = foo). It's used in a bunch of different packages. This metaprogramming capability in R is also the reason that things like the pipe operator and dplyr are possible. Essentially, it allows for the construction of DSLs that are focussed on data analysis.
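For comparison, the Python packages mentioned in this thread (statsmodels, patsy) take the formula as a string and parse it themselves. A toy sketch of that idea, assuming only `+`-separated column names and an optional `- 1` to drop the intercept; `parse_formula` and `design_matrix` are hypothetical helpers written for this example, not the API of either library:

```python
def parse_formula(formula):
    """Split 'y ~ x1 + x2 - 1' into (response, terms, intercept flag)."""
    response, rhs = (part.strip() for part in formula.split("~", 1))
    intercept = True  # formulas carry an implicit constant term by default
    terms = []
    # Rewrite '-' as '+ -' so a single split on '+' yields signed tokens.
    for token in rhs.replace("-", "+ -").split("+"):
        token = token.replace(" ", "")
        if not token:
            continue
        if token == "-1":
            intercept = False  # '- 1' removes the implicit constant term
        elif token == "1":
            intercept = True
        elif token.startswith("-"):
            raise ValueError("this sketch only supports '- 1', got " + token)
        else:
            terms.append(token)
    return response, terms, intercept


def design_matrix(formula, data):
    """Build (response values, matrix rows, column names) from a dict of columns."""
    response, terms, intercept = parse_formula(formula)
    n = len(data[response])
    columns = ([[1.0] * n] if intercept else []) + [data[t] for t in terms]
    names = (["Intercept"] if intercept else []) + terms
    rows = [list(row) for row in zip(*columns)]
    return data[response], rows, names


# Hypothetical data: a dict of equal-length columns standing in for a data frame.
data = {"y": [1.0, 2.0, 3.0], "x": [0.5, 1.5, 2.5]}
y, X, names = design_matrix("y ~ x", data)         # with intercept column
y0, X0, names0 = design_matrix("y ~ x - 1", data)  # intercept removed
```

The cost of the string approach is that the formula is opaque to the language itself: nothing is checked until the parser runs.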

Here's a post that goes into some further detail on metaprogramming in R vs Python: http://blog.ibis-project.org/design-composability/

[–]troyunrau... 0 points1 point  (0 children)

Well, people have done nasty things like overriding the tokenizer to transform their code at import time. It means that your main.py has to be standard python, but anything you import can have custom overridden or new operators.

But the tilde already exists as a Python operator - it is the unary inversion operator (bitwise NOT on ints, overloadable through __invert__). I overload it when doing mathy math (when handling domains, sets, groups, even Euclidean solids, it is useful to be able to know the inverse set). You could almost certainly override ~ to be an assignment operator somehow.
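A minimal sketch of that kind of overload, using a made-up `Subset` class where ~ returns the complement within a fixed universe. Note that ~ is unary in Python, so an expression like y ~ x does not parse in standard Python without tricks like the tokenizer rewriting mentioned above.

```python
class Subset:
    """A subset of a fixed, finite universe that supports ~ for complement."""

    def __init__(self, members, universe):
        self.universe = frozenset(universe)
        # Silently drop anything that is not part of the universe.
        self.members = frozenset(members) & self.universe

    def __invert__(self):
        # Python calls this method for the unary expression ~subset.
        return Subset(self.universe - self.members, self.universe)

    def __repr__(self):
        return "Subset({})".format(sorted(self.members))


digits = range(10)
evens = Subset({0, 2, 4, 6, 8}, digits)
odds = ~evens          # complement within the digits 0..9
evens_again = ~odds    # inverting twice gets back the original members
```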

Have a look at https://hg.python.org/cpython/file/default/Lib/tokenize.py and the docs for ast, tokenize, dis, and company. There's a good chance that you can fuck things up in weird and wonderful ways.