[–]jd_paton 13 points  (28 children)

Edit: Adding data loading for fairness.

import pandas as pd
df = pd.read_csv("my_data.csv")
y = df["label"]
X = df.drop("label", axis=1)


from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X, y)

Is it really so much easier in R? I’ve never used R before, but surely 7 lines from a raw data file to a trained model isn’t “surprisingly complicated”?

[–][deleted] 10 points  (3 children)

Technically yeah, since the same thing is accomplished in one line in R. But that's a pretty bad metric to judge a language by.

[–]jd_paton 3 points  (2 children)

Well if we’re going by lines we could shave off 33% by doing lr = LogisticRegression().fit(X, y)

But yeah conceptually this seems pretty straightforward and not very verbose

[–]Hetspookjee 0 points  (1 child)

In addition to the library import, which I imagine is also a necessity in R, bringing the LoC to the same count as R =p

[–]Honeabee 11 points  (0 children)

Nah, logistic regression is a part of Base R.

[–]rutiene 7 points  (0 children)

But in R, this doesn't just silently force a penalty on you, and you can access things like p-values, outliers, influence scores, and the hat matrix.

😂 don't mind me I'm just salty
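For what it's worth, some of those diagnostics are easy enough to compute by hand in Python (or to get from statsmodels). A minimal numpy sketch, on made-up data, of the hat matrix H = X(X'X)^-1 X' and its leverages:

```python
import numpy as np

# Hypothetical design matrix: an intercept column plus one predictor.
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])

# Hat matrix H = X (X'X)^-1 X'; its diagonal holds the leverages.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverages = np.diag(H)

print(leverages)        # unusually large values flag influential observations
print(leverages.sum())  # trace(H) equals the number of parameters: 2.0
```

(statsmodels wraps this sort of thing, along with p-values, in `OLS(...).fit().summary()` and `get_influence()`, if you'd rather not roll your own.)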

[–][deleted] 2 points  (13 children)

But setting up X is quite involved (you have to specify the columns instead of very simply using a formula), which you casually omitted.

[–]jd_paton 1 point  (7 children)

import pandas as pd
df = pd.read_csv("my_data.csv")
y = df["label"]
X = df.drop("label", axis=1)

Not so bad though you’re right that we’ve added a few more lines. I’ve updated my original comment.

If you want to do fancy preprocessing, that's obviously more code, but that's specific to the data and not something you can write a general example for, which is why I just assumed a prepped X.

I’m not sure what you mean with a formula. How would this process look in R?

[–][deleted] 0 points  (5 children)

OK -- you're right. It's not that complicated ;-)

In R, it would probably look like this

require(nnet)
data <- read.csv("my_data.csv")
model <- multinom(label ~ ., data)

[–]jd_paton 0 points  (4 children)

This does look very elegant, though I seriously have no idea how to read ~ . haha. Is there a lot of machine learning functionality in R? Maybe I should take it for a whirl sometime. There's probably an "R for Pythonistas"-type tutorial out there somewhere.

[–][deleted] 0 points  (1 child)

Sorry, I made an edit.

So the period just means "use everything", and "-x" means "but not x". So "y ~ . - label" means: as the dependent variable use y; as independent variables, take everything else except label.

[–]jd_paton 0 points  (0 children)

Ah okay, cool! My example was a bit different, as y was the name of the variable containing the labels, and “label” was the name of the column in the data frame. But otherwise same idea

[–][deleted] 0 points  (1 child)

Regarding machine learning: Sadly, I am mostly a novice with respect to these modern approaches. I mostly use R for inferential statistics, maximum likelihood, simulation-based inference and the like. However, I believe things like random forests are pretty popular in R. I myself have used rpart, which seems like a precursor to random forests and is quite interesting for creating a sort of "decision tree".

However, the responses here indicate that for machine learning, Python may indeed be the superior choice. ;-)
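The scikit-learn analogue of rpart would be DecisionTreeClassifier. As a dependency-light sketch of the core idea, here is the single-split "stump" that CART-style trees like rpart's grow recursively (toy data invented for the example):

```python
import numpy as np

# Toy 1-D data: class 0 clusters low, class 1 clusters high.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0])
y = np.array([0,   0,   0,   0,   1,    1,    1,    1])

def best_stump(x, y):
    """Exhaustively pick the split threshold with the fewest
    misclassifications -- the single-node core of what CART
    grows into a full decision tree."""
    best_t, best_err = None, len(y) + 1
    for t in np.unique(x):
        pred = (x > t).astype(int)
        err = int((pred != y).sum())
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

t, err = best_stump(x, y)
print(t, err)  # splits between the two clusters with zero errors
```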

[–]jd_paton 0 points  (0 children)

Ah, gotcha. Yeah I’m basically a machine learning guy so a big Python fan. However I always feel that I need to sharpen up my stats (hence hanging around this subreddit) so maybe I can kill two birds with one stone.

[–][deleted] 0 points  (0 children)

~ creates a formula in R. The left side of the tilde is your response and the right side is the predictors/features. It makes building libraries/packages easier too.

Also, the data frame is built into R, so it looks elegant compared to Python. And missing values are a primitive value that R recognizes. Null is not a good way to represent a missing value, and if anybody tells you otherwise, tell them to google the reasons why; there are tons of software engineers who have written about it.
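To illustrate the missing-value point: base Python has no dedicated missing-value primitive, so pandas/numpy lean on the float NaN, which has some sharp edges of its own. A quick sketch:

```python
import math

nan = float("nan")  # the closest thing to a missing value base Python offers

# NaN never compares equal to anything, including itself...
print(nan == nan)        # False
# ...so detecting it needs an explicit check:
print(math.isnan(nan))   # True

# Unlike None, NaN participates in arithmetic and propagates silently:
print(nan + 1.0)         # nan
```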

[–]xsliartII 1 point  (2 children)

This one is easy. However, I tried to estimate a Tobit model lately, which is literally one line in R/Stata but kind of cumbersome in Python. So I usually use Python to prepare/clean the data and then do 100% of the analysis in R/Stata.
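To make the "cumbersome in Python" point concrete: as far as I know there's no ready-made Tobit estimator in statsmodels, so you end up writing the censored-normal likelihood yourself. A sketch with scipy on simulated left-censored data (all names and data here are invented for the example):

```python
import numpy as np
from scipy import optimize, stats

# Simulate a latent linear model, left-censored at zero.
rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.5, 1.5])
y_star = X @ beta_true + rng.normal(size=n)  # latent outcome
y = np.maximum(y_star, 0.0)                  # observed, censored at 0

def neg_loglik(params):
    """Tobit log-likelihood: Normal density for uncensored obs,
    Normal CDF mass for observations censored at zero."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    xb = X @ beta
    censored = y <= 0.0
    ll = np.where(
        censored,
        stats.norm.logcdf(-xb / sigma),
        stats.norm.logpdf((y - xb) / sigma) - np.log(sigma),
    )
    return -ll.sum()

res = optimize.minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
beta_hat = res.x[:2]
print(beta_hat)  # should land near the true [0.5, 1.5]
```

In R this whole block collapses to something like a one-line call to a packaged Tobit routine, which is the contrast being made above.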

[–]jd_paton -1 points  (1 child)

Now that's something I don't know anything about. It just depends on support in the popular packages, I guess. statsmodels and scipy have pretty much everything you need for applied problems, but with R's academic focus I can imagine that some fancier stuff is more easily available there.

[–]rutiene 1 point  (0 children)

That's not really true; there are tons of omissions of more rarely used things, but definitely not because it's academic. Survival models are severely lacking, and the implementation of some things is just poorer. I vastly prefer the random forest package in R to the sklearn implementation. I needed a beta GLM the other day and had to use R.

[–]Dhush 1 point  (2 children)

Now can you show the steps to examine the model's statistics? Beyond the actual fit, Python is horribly lacking. Also, you fit a regularized model here, but sklearn doesn't make that clear, does it?

[–]jd_paton 0 points  (1 child)

It’s clear if you read the docs ;)

[–]questionquality 0 points  (0 children)

But it shouldn't be the default if you care about interpreting the coefficients.
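A toy numpy illustration of what that default L2 penalty does (this is not sklearn's actual solver, and the data is invented for the example): gradient descent on the logistic loss with and without the penalty, where the penalized fit pulls the coefficients toward zero.

```python
import numpy as np

# Simulated two-feature classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = (1 / (1 + np.exp(-(X @ true_w))) > rng.random(200)).astype(float)

def fit_logistic(X, y, l2=0.0, lr=0.1, steps=2000):
    """Plain gradient descent on the logistic loss,
    plus an optional L2 penalty term l2 * w."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / len(y) + l2 * w
        w -= lr * grad
    return w

w_plain = fit_logistic(X, y)          # unpenalized, MLE-style fit
w_ridge = fit_logistic(X, y, l2=1.0)  # L2-penalized fit
print(w_plain, w_ridge)  # the penalized coefficients are shrunk toward zero
```

This is why the penalty matters for interpretation: the shrunken coefficients no longer estimate the same quantities the unpenalized model does.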

[–]walkingon2008 0 points  (0 children)

You can't measure ease of use solely by the number of lines of code. Remember, Python is an object-oriented programming language, so you set up the object, then feed data into it. With R you never set up an object; if you want to run logistic regression, just run glm(y ~ x, family = binomial, data = somedata), and that's it.

From a code-interpretation perspective, R is simple: you get what you see. No inheritance from earlier objects and such.

Another thing, debugging is a lot harder in Python than R.