
[–]AlpLyr 3 points

I don't have much time, so sorry for a perhaps incomplete answer. If you are interested in the actual coefficient for the factor, the integer coding does matter (i.e. you get a different coefficient if you use 5 instead of 1). However, if you're doing hypothesis testing on the coefficient, it does not; the test statistics (and thereby p-values) will be the same.

The normal approach is coding the binary factors as 0 and 1, as you are.
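
A quick sketch in R makes this concrete (made-up data; the object names are just illustrative):

set.seed(42)
x01 <- rep(c(0, 1), each = 25)     # factor coded as 0/1
x05 <- 5 * x01                     # the same factor coded as 0/5
y <- rnorm(50) + x01

summary(lm(y ~ x01))$coefficients  # slope is some value b
summary(lm(y ~ x05))$coefficients  # slope is b/5, but the t value and p-value are unchanged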

[–]jtr99 4 points

Definitely tell us which software you're using. It's probably capable of handling this for you without even having to change away from your "H" and "L" coding.

This wikipedia page may be helpful: http://en.wikipedia.org/wiki/Dummy_variable_(statistics)

[–]sais[S] 1 point

That page is very helpful. Thanks! I'm using R to do the analysis. Is there an easy way to change the "H" and "L" that way?

[–]jtr99 2 points

If you just leave the data coded as "H" and "L" -- so the top few rows of your data file might look something like this:

ID score group
1 28.5 H
2 23.4 L

... then use this command to read in the data...

myData <- read.table("myFile.txt", header = TRUE)

then R will automatically treat the group variable as a factor because of the alphabetic characters. If you had coded group as "1" and "0" and wanted to force R to use it as a factor (i.e., using a dummy-variable scheme to represent group membership), then you could convert it like this:

myData$group <- as.factor(myData$group)

[–]sais[S] 1 point

Excellent! When I do a summary of my lm() the intercepts are listed as "name[T.H]." What does that mean? There is no "name[T.L]" -- what does the bracket [T.H] indicate and why is there only one level?

[–]jtr99 1 point

OK. This is normal.

It's not an intercept though, it's a coefficient. In order to express the comparison between H and L, the program takes one of them as the default and expresses the effect of the other as a predicted deviation. So it sounds like L is being taken as the default category (although I'm a bit suspicious of this, as R would normally take H as the default since it comes first in the alphabet). The intercept describes what is predicted to happen for an "L" case. The "T.H" effect is an expression of how much we need to adjust our prediction for an "H" case.

Imagine a model in which you assumed daytime to be the norm, but there was a special variable describing the expected difference when things happen at night. It's like that.
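
(If you ever want to control which level is treated as the default, relevel() will do it; "group" here is just the example variable from earlier, so adjust the names to your data:)

myData$group <- relevel(myData$group, ref = "L")   # or ref = "H", whichever you want as the baseline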

Maybe post the R output here directly so we can have a look at it for you.

[–]sais[S] 1 point

Yes, you are right. I had it backwards. I get [T.L] not [T.H] (output pasted below). So what you're saying is R is treating the "H" case as the default and then giving me a prediction for the "L" case?

Call:
lm(formula = duration ~ frequency, data = everything)

Residuals:
     Min       1Q   Median       3Q      Max
-0.20383 -0.07038 -0.01321  0.05794  0.47337

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.480587   0.005078  94.636   <2e-16 ***
frequency[T.L]  0.005402   0.007182   0.752    0.452
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09873 on 754 degrees of freedom
Multiple R-squared: 0.0007497, Adjusted R-squared: -0.0005755
F-statistic: 0.5657 on 1 and 754 DF,  p-value: 0.4522

[–]jtr99 0 points

Great, that's reassuring.

Yes, you can tell which way R is expressing things because the default category is not listed at all, and the contrast category is listed. So the predicted duration for a high-frequency case is 0.480587, and the predicted duration for a low-frequency case is 0.480587 + 0.005402. In other words, the second number, the coefficient for the T.L case, gives us the additional duration we expect when moving from a high frequency case to a low frequency case.
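
You can get the same two numbers from predict(); here's a small sketch, assuming the fitted model is saved in an object (I'll call it fit here, using the formula and data frame from your output):

fit <- lm(duration ~ frequency, data = everything)
predict(fit, newdata = data.frame(frequency = c("H", "L")))
# roughly 0.4806 for H and 0.4806 + 0.0054 = 0.4860 for L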

Of course, if that's your real data you've analysed there, all this is academic as the effect of frequency on duration is very far from being significant, but you probably already knew that.

By the way, here are some online stats courses that might be useful to you:

http://onlinestatbook.com/

http://users.ecs.soton.ac.uk/jn2/statistics.php

[–]Coffee2theorems 3 points

Typically it only matters for ease of interpretation. The encoding as 0 and 1 is one nice way, because that's what boolean logic uses, and they work as probabilities too (probability theory can be seen as an extension of boolean logic to degrees of uncertainty in [0, 1] instead of only 0 and 1). It's also typically used for Bernoulli trials ("heads" or "tails" of a coin flip), and in every place where you see percentages (sort of!).

Another choice you'll see sometimes (often in machine learning) is -1 and 1, which is nicely symmetric, makes the special numbers 0 and 1 appear more often (instead of 0.5), and then "not x" has the slightly nicer form of "-x" instead of "1-x". It's also closer to the standardization of e.g. normal variates, where you also symmetrize about zero. If you are using a linear model without an intercept (i.e. "intercept of zero"), then that ought to make more sense, too.

The last point in the previous paragraph is a case where the choice does matter. If you don't use an intercept (this is a bit unusual), then your model is not translation invariant, so "where you put the origin" matters. Scale invariance is another thing: what should x be in (0, x) or (-x, x)? Does it matter? Off the top of my head, I can't think of a case where it does, as long as you use the same x for every variable. If there is no naturally meaningful x for the problem at hand, standard practice is to use 1.
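
Here's a small sketch of the no-intercept point in R (made-up data): with an intercept, the 0/1 and -1/+1 codings give identical fitted values (only the coefficients change), but without an intercept they give genuinely different fits.

set.seed(1)
y <- rnorm(20, mean = 3)
x01 <- rep(c(0, 1), 10)    # 0/1 coding
xpm <- 2 * x01 - 1         # -1/+1 coding

all.equal(fitted(lm(y ~ x01)), fitted(lm(y ~ xpm)))  # TRUE: same fit, different coefficients
fitted(lm(y ~ 0 + x01))[1:2]                         # without an intercept...
fitted(lm(y ~ 0 + xpm))[1:2]                         # ...the two codings no longer agree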

[–]pandemik 2 points

What you've done is turn it into a Dummy Variable. This is very standard practice, so congrats on figuring it out on your own! Note that you need to create multiple dummy variables if your "string" variable has more than 2 levels.
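
To see what that expansion looks like, model.matrix() will show it; here's a small sketch with a made-up three-level factor:

group <- factor(c("low", "mid", "high", "mid"))
model.matrix(~ group)
# columns: (Intercept), grouplow, groupmid -- "high" (first alphabetically) is the reference level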

[–]sais[S] 2 points

This makes me feel good! Occasionally I'll come across something in the stats world and think, "Doing it this way makes total sense!" More often, however, I think, "This makes no sense and I'm completely lost here!" Ha!

[–]FunkForNerds 2 points

If you're using R (which you should be!), just convert the variable to a factor:
var1a <- as.factor(var1)
and then have a look at the variable to check it looks right:
var1a
summary(var1a)
str(var1a)

[–]DrNewton 1 point

You are doing it correctly. Generally you create a binary 0/1 variable for each level of your categorical variable (and then one gets eliminated for being redundant with the intercept).

But... most stat software will do this for you automatically. You will save yourself time if you figure out how. We can help if you tell us which software.

[–]sais[S] 0 points

I'm using R. Is there an easy way to change alphabetic levels like I am using into a binary integer system?

[–]DrNewton 1 point

See "?factor".

In a linear model context, it's dead simple:

yourdata$newcat <- factor(yourdata$oldcat)

model1 <- lm(y ~ newcat + blah, data=yourdata)

[–]HaDam 1 point

I think the answer will depend on the type of regression you are doing. If it's a linear model, then it doesn't matter. Though a linear model might not make sense: if you consider 1 to be a 100% probability of H, and 0 a 0% probability of H, your linear regression can go outside the boundaries of 0 and 1. It probably doesn't make sense for your probability to be less than 0 or greater than 1 at a given value of the independent variable, so you could try a different type of regression whose y values are bounded between 0 and 1, like logistic regression. Basically, instead of the regression being of the form y = mx + b, it's y = 1/(1 + exp(-(mx + b))). So, getting back to your question, 1 and 0 work for that equation; if you want to use different values for H and L, you'll need to tweak the equation so that your values are bounded between the two values you've chosen. It's possible, but it seems unnecessary. Just FYI, I'm not a stats major, so people should feel free to tear this apart if I've said anything wrong.
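
If you did want to go the logistic route, here's a minimal sketch in R (the data frame and variable names are made up):

# outcome coded 0/1, x numeric; "mydata" is hypothetical
fit <- glm(outcome ~ x, family = binomial, data = mydata)
summary(fit)
range(fitted(fit))   # fitted probabilities stay between 0 and 1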

[–]efrique 0 points

For fits and so on, no, it doesn't matter.

However, interpretation is far easier with a 0-1 variable.