I have a note that says R^2 becomes high when there is multicollinearity, but I don’t quite see why. Multicollinearity means some of the X’s are highly correlated, but they could be negatively or positively correlated, so R^2, which is a function of the correlation between the dependent variable and the independent variables, could get larger or smaller, no? Can you think of an example showing that, whether X1 and X2 are positively or negatively correlated, R^2 still gets larger? Also, why do the standard errors get larger if there is multicollinearity? I don’t even know if these things are true, I just had them jotted down from some reading.
A couple of points:
* Multicollinearity is usually a question of how much is present, not if it is present.
* The high R^2 comes from the independent variables explaining the dependent variable relatively well as a group, although none of the independents does a particularly good job alone.
* Because the independents alone are statistically insignificant, they must have large standard errors. Recall that the t-stat = coefficient/std err. So if the t-stat is small, it’s because the std err is large.
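A minimal sketch of that last bullet (my own, in Python with numpy and statsmodels, so treat it as an illustration rather than anything official): the t-stat really is just the coefficient divided by its standard error.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)          # true slope 0.5

X = sm.add_constant(x)                          # adds the intercept column
fit = sm.OLS(y, X).fit()

# t-stat is literally the coefficient divided by its standard error
print(fit.params[1], fit.bse[1], fit.tvalues[1])
print(np.isclose(fit.tvalues[1], fit.params[1] / fit.bse[1]))   # True

So a blown-up standard error mechanically drags the t-stat toward zero even if the coefficient estimate itself is reasonable.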
I’m not sure it is necessarily true that R^2 increases when multicollinearity is present; it could happen, but I’m just not sure that it MUST happen. I think the key with multicollinearity is that the individual coefficients can falsely look unimportant while the explanatory power of the model as a whole is still strong. In other words, your t and F stats are in conflict. I don’t know the math behind why the standard errors get larger, I just know that they do… Anyone else? (Looking at you wyantjs!)
seerwright Wrote:
> * Because the independents alone are statistically insignificant, they must have
> large standard errors. Recall that the t-stat = coefficient/std err.

I think you are on the right track, but missing an essential point here. When multicollinearity is present in a high degree and the model shows overall significance, we have a problem of the model not being able to distinguish between the individual effects of each regressor. It may be the case that one or more are truly insignificant, but it may also be the case that the t-stats are biased downward due to the high std. errors. The problem arises because we have an independent variable that is very close to being a linear combination of another ind. variable. The model cannot determine whether the information in this variable significantly explains the variation in the dependent, but it can determine whether the model as a whole has explanatory power. Therefore, you will see a significant F-stat and a high R^2, but insignificant individual t-stats for the coefficients.
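To put numbers on that, here is a hedged simulation (my own Python sketch, statsmodels assumed, parameter choices arbitrary): two regressors that are nearly linear combinations of each other, a model that clearly works as a whole, and individual t-stats that typically look useless.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)        # x2 is almost exactly x1
y = 1.0 + 3.0 * x1 + 3.0 * x2 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(fit.rsquared)        # high: the pair explains y well as a group
print(fit.f_pvalue)        # tiny: the model as a whole is clearly significant
print(fit.pvalues[1:])     # typically both large: neither slope looks significant alone
print(fit.bse[1:])         # inflated coefficient standard errors

The fitted values barely care how the total slope of about 6 gets split between x1 and x2, which is exactly why the data cannot pin the individual coefficients down.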
Chi Paul Wrote:
> I’m not sure it is necessarily true that R^2 increases when multicollinearity is
> present; it could happen, but I’m just not sure that it MUST happen.

Good point. High R^2 and insignificant t-stats are just a common indicator of a problem, but not a necessary, nor sufficient, condition for the presence of multicollinearity.
> Good point. High R^2 and insignificant t-stats are just a common indicator of a
> problem, but not a necessary, nor sufficient, condition for the presence of
> multicollinearity.

If you have a high R^2 _and_ insignificant t-stats, what else could possibly explain those effects other than highly correlated independent variables? I think you have to accept that there is some problem with the model that needs to be explained before you can rely on it. So, aside from multicollinearity, what else should be considered as a possible cause of these two effects occurring in tandem?
It is not a necessary condition because multicollinearity may be present, and the model sucks anyway, giving a low R^2. It is not a sufficient condition because R^2 will always increase (or possibly remain constant) when a new ind. variable is added. Thus, we may have an equation with 20 independent variables that are not close to being linear combinations of each other, none of which explain anything, and R^2 could be high. This is a main reason why adjusted R^2 is used instead.
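For reference, the adjusted R^2 penalty being alluded to here is just 1 - (1 - R^2)(n - 1)/(n - k - 1). A small check (my own Python sketch, statsmodels assumed; the "junk" variable is made up for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
junk = rng.normal(size=n)                       # unrelated to y by construction

small = sm.OLS(y, sm.add_constant(x1)).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x1, junk]))).fit()

print(small.rsquared, big.rsquared)             # R^2 never goes down
print(small.rsquared_adj, big.rsquared_adj)     # adjusted R^2 typically does when the addition is junk

# adjusted R^2 by hand for the 2-regressor fit (k = 2)
k = 2
adj = 1 - (1 - big.rsquared) * (n - 1) / (n - k - 1)
print(np.isclose(adj, big.rsquared_adj))        # True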
Sorry to be a stickler, but here’s what you said two messages ago:

> High R^2 and insignificant t-stats are just a common indicator of a problem, but not a necessary…

Your most recent response only addressed high R^2, not the combination of high R^2 and insignificant t-stats. There’s a big difference between evaluating R^2 alone and evaluating R^2 in conjunction with insignificant t-stats. The exam is going to be specific about causes, consequences, detection, and correction of statistical problems. I’m only asking for clarification because if there’s something else that you know about the combination of high R^2 and insignificant t-stats, it could be helpful. To my knowledge the presence of those two conditions only results from a high level of multicollinearity.
For the purpose of the exam, and in a large number of applications, you are correct. The presence of both factors will most likely be due to multicollinearity. Let me give an example of why it might not be. Since R^2 is a measure of the percent of dependent variable variation that is explained by all included independent variables, inclusion of each additional independent variable will almost always increase R^2, regardless of its significance. The increase may be very small, but it is an increase nonetheless, and in the case of an insignificant variable, it will be due to a spurious relationship. This is just a mathematical result. It is possible for R^2 to remain constant, but it will not decline.

Now, assume we are regressing the length of my morning turd on some variables. We may include what I ate last night, but we could also include the temperature. The temp obviously will be insignificant but, as I said above, will increase R^2 (not by much). Continue this with 20 more totally irrelevant and uncorrelated variables, and you get a useless regression but a misleadingly high R^2. This example is obviously a little extreme, but you may very well encounter a model in real life that has too many independent variables. You may see your typical measures of R^2 and such that give the appearance that you have done something special, but each variable’s t-stat is around 0.00006. My point isn’t that this problem occurs regularly, but that you should be aware that it may be a case of overspecification rather than multicollinearity. Multicollinearity still provides meaningful forecasts. Overspecification using irrelevant variables doesn’t.
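The overspecification point is easy to reproduce (same caveat: my own Python sketch, arbitrary numbers): pile on irrelevant, mutually uncorrelated regressors and watch R^2 creep up while adjusted R^2 and the t-stats tell the real story.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 40
y = rng.normal(size=n)                          # the dependent variable is pure noise

X = np.ones((n, 1))                             # start with just an intercept
for i in range(20):
    X = np.column_stack([X, rng.normal(size=n)])    # add another irrelevant regressor
    fit = sm.OLS(y, X).fit()
    print(i + 1, round(fit.rsquared, 3), round(fit.rsquared_adj, 3))

# R^2 ratchets upward every step; adjusted R^2 hovers around (or below) zero,
# and the individual t-stats stay unimpressive throughout.

By the 20th junk variable R^2 can look respectable purely by accident. That is the "misleadingly high R^2 with nothing behind it" case, as opposed to multicollinearity, where the group genuinely explains the dependent variable.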
> It is possible for R^2 to remain constant, but it will not decline.

This is my earlier question: why does R^2 necessarily go up when you add another X? I know this is true, and it’s why you need adjusted R^2 to adjust for the effect of the increased number of independent variables. Isn’t it possible that the added X variable has a negative correlation with the dependent variable (Y), thus reducing R^2 when added?
wyantjs Wrote:
> Now, assume we are regressing the length of my morning turd on some variables…
> My point isn’t that this problem occurs regularly, but that you should be aware
> that it may be a case of overspecification rather than multicollinearity.

Haha, sadly enough, your example helped me understand this a bit more.
Dreary Wrote:
> Isn’t it possible that the added X variable has a negative correlation with the
> dependent variable (Y), thus reducing R^2 when added?

Sorry…I went to the Coca-Cola 600 yesterday… First, think of the single-variable case. In that case R^2 is the square of the correlation. A correlation of -1 would result in an R^2 of 1, so there is a problem with your argument right there. Now, to extend to the multivariate case, I think you are viewing R^2 incorrectly. It is not a measure of a directional relationship between variables. Since our dependent variable is a random variable, it has a distribution. That distribution has a variance. We regress the dependent variable on the X’s, and R^2 tells us how much of the variation in Y is explained by the variation in the X’s. The directional effect (i.e., whether the correlation is positive or negative) is irrelevant; it is captured by the sign on the coefficients. The only thing that matters is the presence of explanatory power. So, if we have a 3-variable regression with an R^2 of, say, .30, then the variation of the three variables explains 30% of the variation in the dependent variable. Adding another variable to the equation will not take away from that percentage. Those variables still explain the variation. The addition of another variable may increase R^2, but it cannot cause a change in the relationship among the other variables.
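The single-variable point is easy to verify (again, my own Python sketch with numpy/statsmodels, numbers made up): R^2 is the square of the correlation, so a correlation of -0.9 and one of +0.9 give exactly the same R^2.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y_neg = -2.0 * x + rng.normal(scale=0.5, size=200)   # strongly *negatively* related to x

r = np.corrcoef(x, y_neg)[0, 1]
fit = sm.OLS(y_neg, sm.add_constant(x)).fit()

print(r)                                   # close to -1
print(r ** 2, fit.rsquared)                # the two numbers agree
print(np.isclose(r ** 2, fit.rsquared))    # True: the sign of the relationship is gone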
That’s it… squaring to get to R^2 means it will always increase (or at least not change, if the added variable has zero correlation with Y). Final question: why does the standard error of the residuals increase when you add another X? If currently your predicted Y values are giving you a certain amount of residual error, why would adding another X which is strongly correlated with an existing X necessarily increase the residual error? In fact, I’m inclined to think that the residual error would go down, as the regression is getting better, i.e., more RSS. Thoughts?
What makes you think that the std. error of residuals necessarily rises?
Adding more variables almost always increases R^2, and that is why we need adjusted R^2 as a measure… R^2 is a measure of how much is explained (whether it is a positive explanation or a negative explanation)… In the end, if you keep adding variables that are correlated with each other, R^2 will increase (due to adding a lot of variables), but this will be an artificial increase owing to the multicollinearity issue…
I think “somewhere” it says that t-scores become small (i.e., large standard errors), which leads you to conclude that you have insignificant regression coefficients.
This condition distorts the standard error of estimate and leads to wrong conclusions… and then the next paragraph talks about a higher probability of Type II errors, though not always…
Dreary Wrote:
> I think “somewhere” it says that t-scores become small (i.e., large standard errors),
> which leads you to conclude that you have insignificant regression coefficients.

That is true, but there is a difference between the std. error of the residuals and the std. errors of the coefficients.
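That distinction is the answer to Dreary’s question above, and a quick sketch makes it visible (mine, in Python, statsmodels assumed): add a near-duplicate regressor and the standard error of the residuals barely moves, while the coefficient standard errors explode.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 80
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)        # near-duplicate of x1
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

one = sm.OLS(y, sm.add_constant(x1)).fit()
two = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(np.sqrt(one.mse_resid), np.sqrt(two.mse_resid))   # residual std error: nearly identical
print(one.bse[1], two.bse[1:])                           # coefficient std errors: tiny vs huge

So the fit to the data (and the forecasts) stay roughly as good, but the individual coefficient estimates become very imprecise, which is what drags the t-stats down.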
Oh I see, so you are saying that the std. errors of all the coefficients (?) do get large as you add more correlated X’s? I guess they do, but I need to think about that post June 6.
No Dreary, no one needs to think about anything post June 6. I will be a walking zombie… EBITDA, WACC, and PRE2 will be erased from my mind.