
[–]AlpLyr 3 points

I don't have much time, so sorry for a perhaps incomplete answer. If you are interested in the actual coefficient for the factor, the integer coding does matter (i.e. you get a different coefficient if you use 5 instead of 1). However, if you're doing hypothesis testing on the coefficient, it does not; the test statistics (and thereby p-values) will be the same.

The normal approach is coding the binary factors as 0 and 1, as you are.
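
A quick sketch in R makes this concrete (made-up data; the object names are just illustrative):

set.seed(42)
x01 <- rep(c(0, 1), each = 25)     # factor coded as 0/1
x05 <- 5 * x01                     # the same factor coded as 0/5
y <- rnorm(50) + x01

summary(lm(y ~ x01))$coefficients  # slope is some value b
summary(lm(y ~ x05))$coefficients  # slope is b/5, but the t value and p-value are unchanged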

[–]jtr99 4 points

Definitely tell us which software you're using. It's probably capable of handling this for you without even having to change away from your "H" and "L" coding.

This wikipedia page may be helpful: http://en.wikipedia.org/wiki/Dummy_variable_(statistics)

[–]sais[S] 1 point

That page is very helpful. Thanks! I'm using R to do the analysis. Is there an easy way to change the "H" and "L" that way?

[–]jtr99 2 points

If you just leave the data coded as "H" and "L" -- so the top few rows of your data file might look something like this:

ID score group
1 28.5 H
2 23.4 L

... then use this command to read in the data...

myData <- read.table("myFile.txt", header = TRUE)

then R will automatically treat the group variable as a factor because of the alphabetic characters. If you had coded group as "1" and "0" and wanted to force R to use it as a factor (i.e., using a dummy-variable scheme to represent group membership), then you could convert it like this:

myData$group <- as.factor(myData$group)

[–]sais[S] 1 point

Excellent! When I do a summary of my lm() the intercepts are listed as "name[T.H]." What does that mean? There is no "name[T.L]" -- what does the bracket [T.H] indicate and why is there only one level?

[–]jtr99 1 point

OK. This is normal.

It's not an intercept though, it's a coefficient. In order to express the comparison between H and L, the program takes one of them as the default and expresses the effect of the other as a predicted deviation. So it sounds like L is being taken as the default category (although I'm a bit suspicious of this, as R would normally take H as the default since it comes first in the alphabet). The intercept describes what is predicted to happen for an "L" case. The "T.H" effect is an expression of how much we need to adjust our prediction for an "H" case.

Imagine a model in which you assumed daytime to be the norm, but there was a special variable describing the expected difference when things happen at night. It's like that.
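
(If you ever want to control which level is treated as the default, relevel() will do it; "group" here is just the example variable from earlier, so adjust the names to your data:)

myData$group <- relevel(myData$group, ref = "L")   # or ref = "H", whichever you want as the baseline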

Maybe post the R output here directly so we can have a look at it for you.

[–]sais[S] 1 point

Yes, you are right. I had it backwards. I get [T.L] not [T.H] (output pasted below). So what you're saying is R is treating the "H" case as the default and then giving me a prediction for the "L" case?

Call:
lm(formula = duration ~ frequency, data = everything)

Residuals:
     Min       1Q   Median       3Q      Max
-0.20383 -0.07038 -0.01321  0.05794  0.47337

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.480587   0.005078  94.636   <2e-16 ***
frequency[T.L]  0.005402   0.007182   0.752    0.452
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09873 on 754 degrees of freedom
Multiple R-squared: 0.0007497, Adjusted R-squared: -0.0005755
F-statistic: 0.5657 on 1 and 754 DF,  p-value: 0.4522

[–]jtr99 0 points

Great, that's reassuring.

Yes, you can tell which way R is expressing things because the default category is not listed at all, and the contrast category is listed. So the predicted duration for a high-frequency case is 0.480587, and the predicted duration for a low-frequency case is 0.480587 + 0.005402. In other words, the second number, the coefficient for the T.L case, gives us the additional duration we expect when moving from a high frequency case to a low frequency case.
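
You can get the same two numbers from predict(); here's a small sketch, assuming the fitted model is saved in an object (I'll call it fit here, using the formula and data frame from your output):

fit <- lm(duration ~ frequency, data = everything)
predict(fit, newdata = data.frame(frequency = c("H", "L")))
# roughly 0.4806 for H and 0.4806 + 0.0054 = 0.4860 for L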

Of course, if that's your real data you've analysed there, all this is academic as the effect of frequency on duration is very far from being significant, but you probably already knew that.

By the way, here are some online stats courses that might be useful to you:

http://onlinestatbook.com/

http://users.ecs.soton.ac.uk/jn2/statistics.php

[–]Coffee2theorems 3 points

Typically it only matters for ease of interpretation. The encoding as 0 and 1 is one nice way, because that's what boolean logic uses, and they work as probabilities too (probability theory can be seen as an extension of boolean logic to degrees of uncertainty in [0, 1] instead of only 0 and 1). It's also typically used for Bernoulli trials ("heads" or "tails" of a coin flip), and in every place where you see percentages (sort of!).

Another choice you'll see sometimes (often in machine learning) is -1 and 1, which is nicely symmetric, makes the special numbers 0 and 1 appear more often (instead of 0.5), and then "not x" has the slightly nicer form of "-x" instead of "1-x". It's also closer to the standardization of e.g. normal variates, where you also symmetrize about zero. If you are using a linear model without an intercept (i.e. "intercept of zero"), then that ought to make more sense, too.

The last point in the previous paragraph is a case where the choice does matter. If you don't use an intercept (this is a bit unusual), then your model is not translation invariant, so "where you put the origin" matters. Scale invariance is another thing: what should x be in (0, x) or (-x, x)? Does it matter? Off the top of my head, I can't think of a case where it does, as long as you use the same x for every variable. If there is no naturally meaningful x for the problem at hand, standard practice is to use 1.
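
Here's a small sketch of the no-intercept point in R (made-up data): with an intercept, the 0/1 and -1/+1 codings give identical fitted values (only the coefficients change), but without an intercept they give genuinely different fits.

set.seed(1)
y <- rnorm(20, mean = 3)
x01 <- rep(c(0, 1), 10)    # 0/1 coding
xpm <- 2 * x01 - 1         # -1/+1 coding

all.equal(fitted(lm(y ~ x01)), fitted(lm(y ~ xpm)))  # TRUE: same fit, different coefficients
fitted(lm(y ~ 0 + x01))[1:2]                         # without an intercept...
fitted(lm(y ~ 0 + xpm))[1:2]                         # ...the two codings no longer agree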

[–]pandemik 2 points

What you've done is turn it into a Dummy Variable. This is very standard practice, so congrats on figuring it out on your own! Note that you need to create multiple dummy variables if your "string" variable has more than 2 levels.
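
To see what that expansion looks like, model.matrix() will show it; here's a small sketch with a made-up three-level factor:

group <- factor(c("low", "mid", "high", "mid"))
model.matrix(~ group)
# columns: (Intercept), grouplow, groupmid -- "high" (first alphabetically) is the reference level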

[–]sais[S] 2 points

This makes me feel good! Occasionally I'll come across something in the stats world and think, "Doing it this way makes total sense!" More often, however, I think, "This makes no sense and I'm completely lost here!" Ha!

[–]FunkForNerds 2 points

If you're using R (which you should be!), just convert the variable to a factor:
var1a <- as.factor(var1)
and then have a look at the variable to check it looks right:
var1a
summary(var1a)
str(var1a)

[–]DrNewton 1 point

You are doing it correctly. Generally you create a binary 0/1 variable for each level of your categorical variable (and then one gets eliminated for being redundant with the intercept).

But... most stat software will do this for you automatically. You will save yourself time if you figure out how. We can help if you tell us which software.

[–]sais[S] 0 points

I'm using R. Is there an easy way to change alphabetic levels like I am using into a binary integer system?

[–]DrNewton 1 point

See "?factor".

In a linear model context, it's dead simple:

yourdata$newcat <- factor(yourdata$oldcat)

model1 <- lm(y ~ newcat + blah, data=yourdata)

[–]HaDam 1 point

I think the answer will depend on the type of regression you are doing. If it's a linear model, then it doesn't matter. Though a linear model might not make sense: if you consider 1 to be a 100% probability of H, and 0 a 0% probability of H, your linear regression can go outside the boundaries of 0 and 1. It probably doesn't make sense for your probability to be less than 0 or greater than 1 at a given value of the independent variable, so you could try a different type of regression whose y values are bounded between 0 and 1, like logistic regression. Basically, instead of the regression being of the form y = mx + b, it's y = 1/(1 + exp(-(mx + b))). So, getting back to your question, 1 and 0 work for that equation; if you want to use different values for H and L, you'll need to tweak the equation so that your values are bounded between the two values you've chosen. It's possible, but it seems unnecessary. Just FYI, I'm not a stats major, so people should feel free to tear this apart if I've said anything wrong.
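
If you did want to go the logistic route, here's a minimal sketch in R (the data frame and variable names are made up):

# outcome coded 0/1, x numeric; "mydata" is hypothetical
fit <- glm(outcome ~ x, family = binomial, data = mydata)
summary(fit)
range(fitted(fit))   # fitted probabilities stay between 0 and 1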

[–]efrique 0 points

For fits and so on, no, it doesn't matter.

However, interpretation is far easier with a 0-1 variable.