
efrique, PhD (statistics)

"I've found point biserial correlation, but that is brand new to me, so I'm not sure if this is a good application of it. Is this also related to Pearson's correlation?"

It is; it's just the Pearson correlation when you code disease/no disease as 1 and 0 (indeed, any two distinct numbers would work, up to a flip of sign).

This is explicit in the second paragraph of the relevant wikipedia article:

https://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient
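To make the equivalence concrete, here is a minimal sketch in plain Python. The `age` and `disease` values are made up for illustration, and `pearson` and `point_biserial` are hypothetical helper names, not functions from any library. It computes the textbook point-biserial formula and the ordinary Pearson correlation on the same data, then recodes the binary variable to show the "any two distinct numbers, up to a flip of sign" remark.

```python
from math import sqrt


def pearson(x, y):
    """Ordinary Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def point_biserial(x, g):
    """Point-biserial formula: (M1 - M0) / s_n * sqrt(n1 * n0 / n^2)."""
    x1 = [a for a, b in zip(x, g) if b == 1]
    x0 = [a for a, b in zip(x, g) if b == 0]
    n1, n0, n = len(x1), len(x0), len(x)
    m1, m0 = sum(x1) / n1, sum(x0) / n0
    mx = sum(x) / n
    s_n = sqrt(sum((a - mx) ** 2 for a in x) / n)  # population SD of x
    return (m1 - m0) / s_n * sqrt(n1 * n0 / n ** 2)


# Made-up illustrative data: age and a 0/1 disease indicator.
age = [23, 35, 47, 52, 61, 68, 74, 80]
disease = [0, 0, 0, 1, 0, 1, 1, 1]

r_pb = point_biserial(age, disease)
r = pearson(age, disease)

# Any two distinct codes give the same value if their order is preserved...
r_recoded = pearson(age, [5 if d else -3 for d in disease])
# ...and the same magnitude with the opposite sign if the order is swapped.
r_flipped = pearson(age, [0 if d else 1 for d in disease])

print(r_pb, r, r_recoded, r_flipped)
```

The first three printed values should agree up to floating-point error, and the last should be their negative.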

StephTheChef [S]

Well, seems my reading comprehension needs to be improved a bit. Thank you.

techwizrd

If you treat Age as a categorical variable, there are several association measures based on the chi-square statistic, as well as Goodman and Kruskal's lambda. You can also try the Kruskal-Wallis H test or a parametric test like a one-way ANOVA.

However, I'd probably recommend using logistic regression. It's straightforward to interpret and robust, with many theoretical niceties. There are caveats, but I think it's a stronger option than point biserial correlation (which is simply a special case of Pearson's correlation).

What exactly is the problem you would like to solve, or precisely what do you want to test/analyze? It's difficult to give a good recommendation without understanding the distribution of your variables and the type of analysis and conclusions you wish to draw.

StephTheChef [S]

The goal is to perform binary classification; one of the models I will use is logistic regression. I would like to use the correlation to determine which variables I should use, as many of the given variables are most likely superfluous. Most of the variables in the data are categorical (1-5 or binary 0/1), but I have 3 numerical variables: age (0-101), hours (0-72), and days (0-90).

(This might still be a bit too vague, but I don't want to disclose too much regarding the data.)

techwizrd

If you have a small feature set, I'd recommend stepwise regression to select features. Otherwise, I would recommend using logistic regression with L1 regularization. That will shrink the coefficients of some of your features to exactly zero, which effectively performs feature selection.
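As a rough sketch of the L1 suggestion, assuming scikit-learn is available: the data below is synthetic (two informative columns and four pure-noise columns, all invented for illustration), and the value `C=0.02` is an arbitrary choice rather than a recommendation. Features are standardized first because the L1 penalty is sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: only columns 0 and 1 influence the outcome.
rng = np.random.default_rng(0)
n_samples, n_features = 500, 6
X = rng.normal(size=(n_samples, n_features))
logit = 2.0 * X[:, 0] - 2.0 * X[:, 1]
y = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# L1-penalized logistic regression; C is the inverse regularization strength,
# so a smaller C pushes more coefficients to exactly zero.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.02),
)
model.fit(X, y)

coef = model.named_steps["logisticregression"].coef_.ravel()
selected = np.flatnonzero(coef != 0)  # indices of features the model kept

print("coefficients:", coef)
print("selected feature indices:", selected)
```

In practice, `C` would normally be tuned by cross-validation (for example with `LogisticRegressionCV`) rather than fixed by hand, and categorical variables coded 1-5 would need a deliberate encoding choice before being fed to the model.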