One of the assumptions of linear regression is that the variance of the error terms across all the observations is constant.
When this assumption is violated, we speak of heteroscedasticity.
But why exactly do we want constant variance of the error terms across all the observations? I get that it's a problem if the variances of the error terms change, but wouldn't it be more representative to give more weight to the error terms with larger variance, rather than giving all of them the same weight?
You’re essentially answering your own question. The standard derivation of the variance of the coefficient estimates assumes constant error variance and uses a certain formula. When this doesn’t hold, we need a different formula to calculate that variance accurately, in a way that recognizes the mean-variance relationship (which is what heteroskedasticity typically amounts to).
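To make the "different formula" concrete, here is a rough sketch for a one-regressor model that computes the slope's standard error two ways: the classical formula that assumes one common error variance, and White's heteroskedasticity-robust (HC0) formula. The simulated data, the function name `slope_with_standard_errors`, and every number in it are my own assumptions for illustration, not something from this thread.

```python
import random

def slope_with_standard_errors(x, y):
    """Fit y = a + b*x by OLS; return (slope, classical SE, robust HC0 SE)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical formula: assumes one common error variance s^2 for all observations.
    s2 = sum(e * e for e in resid) / (n - 2)
    se_classical = (s2 / sxx) ** 0.5

    # White (HC0) formula: each observation contributes its own squared residual.
    meat = sum((xi - x_bar) ** 2 * e * e for xi, e in zip(x, resid))
    se_robust = (meat / sxx ** 2) ** 0.5
    return slope, se_classical, se_robust

# Hypothetical data: the error standard deviation grows with x (heteroskedastic).
random.seed(0)
x = [random.uniform(1.0, 10.0) for _ in range(5000)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, 0.05 * xi ** 2) for xi in x]

slope, se_classical, se_robust = slope_with_standard_errors(x, y)
print(f"slope={slope:.3f}  classical SE={se_classical:.4f}  robust SE={se_robust:.4f}")
```

In this setup the noisiest points are also the ones far from the mean of x, so the classical formula should understate the uncertainty in the slope. The slope estimate itself is the same either way; only the standard error changes.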
An OLS regression model calculates the best mean line through the middle of a dispersed group of data. The assumption of homoscedasticity is inherent in the mathematical procedure of applying a “mean”. This mean is meant to be the best simplified representation of the dispersed data, so if the dispersion is not constant across the data set, the mean line fails to be the best fit: it is still unbiased, but it is no longer the most efficient linear estimate.
The math behind the OLS model can be modified to correct for heteroskedasticity in the data: a MOLS regression model.
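One common correction of this kind is weighted least squares (WLS). I am not claiming it is what "MOLS" refers to above; it is just a standard example of modifying the OLS math. The sketch below assumes the error standard deviation of each observation is known (a big assumption in practice) and weights each point by the inverse of its error variance. The helper `weighted_slope` and all numbers are made up for illustration.

```python
import random
import statistics

def weighted_slope(x, y, w):
    """Slope of y = a + b*x that minimizes sum(w_i * residual_i^2)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx

random.seed(1)
n, reps, true_slope = 200, 500, 0.5
x = [1.0 + 9.0 * i / (n - 1) for i in range(n)]  # fixed design on [1, 10]
sd = [0.3 * xi for xi in x]                      # error sd grows with x (assumed known)
ols_w = [1.0] * n                                # equal weights reproduce plain OLS
wls_w = [1.0 / s ** 2 for s in sd]               # inverse-variance weights

ols_slopes, wls_slopes = [], []
for _ in range(reps):
    # Draw a fresh heteroskedastic sample and fit it both ways.
    y = [2.0 + true_slope * xi + random.gauss(0.0, s) for xi, s in zip(x, sd)]
    ols_slopes.append(weighted_slope(x, y, ols_w))
    wls_slopes.append(weighted_slope(x, y, wls_w))

var_ols = statistics.variance(ols_slopes)
var_wls = statistics.variance(wls_slopes)
print(f"mean OLS slope={statistics.mean(ols_slopes):.3f}  variance={var_ols:.5f}")
print(f"mean WLS slope={statistics.mean(wls_slopes):.3f}  variance={var_wls:.5f}")
```

Both estimators should center on the true slope, but the WLS slopes should scatter less across repeated samples. Note which way the weights go: the noisier observations get less weight, not more, because they carry less information about where the line is.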
Yes, it did; you just failed to understand it. Under the usual derivation, the assumption is made for ease of calculation and inference. When you drop it from the derivation, you can see that calculating the correct variance, which is what you need to build confidence intervals and test statistics, becomes more complex. The key to the assumption is that the mean of the distribution is unrelated to its variance. When this isn’t true, you need different modeling to account for it. So we don’t need homoscedasticity; it just makes things a lot easier, and you should work through the derivation of OLS to see where it comes into play. Without homoscedasticity you lose efficiency as well.
Your question is basically, “why do we wear seatbelts in cars?” to which I said, “So you are less likely to die in an accident.” You then came back and said “That doesn’t answer my question. Why do we need seatbelts in cars?”
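For anyone who wants to see where the assumption enters the derivation mentioned above, here is the textbook result in matrix form. The notation (X for the design matrix, Ω for the error covariance matrix) is my own choice, not something used earlier in this thread, and the sketch assumes uncorrelated errors with conditional mean zero.

```latex
\begin{align*}
% Model and OLS estimator, with E[\varepsilon \mid X] = 0
y &= X\beta + \varepsilon, &
\hat{\beta} &= (X^\top X)^{-1} X^\top y = \beta + (X^\top X)^{-1} X^\top \varepsilon \\
% General case: \Omega = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2)
\operatorname{Var}(\hat{\beta} \mid X) &= (X^\top X)^{-1} X^\top \Omega X \, (X^\top X)^{-1} \\
% Homoscedastic case: \Omega = \sigma^2 I, so the middle term cancels
\operatorname{Var}(\hat{\beta} \mid X) &= \sigma^2 (X^\top X)^{-1}
\end{align*}
```

The last line is the formula behind the usual standard errors. It is only valid when Ω = σ²I; otherwise the second line is the one to estimate.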
Let me see if I can explain this the way I understand it. I could be wrong and misleading, in which case hopefully someone will correct me.
In a linear regression, you try to model the relationship between two variables: you are trying to see how an independent variable (IV) explains a dependent variable (DV). In order to use linear regression, there should be a linear relationship.
A good example I found online: say you are trying to model the relationship between income and age group. You find that people under the age of 20 often earn minimum wage, so there is little to no variability in their earnings. If you were to model this relationship, you would decide that a linear regression is fitting.
Now you have to model this relationship for people aged 20 to 30. Here you start to see that there’s no clear picture: a heck of a lot of variability. The IV is no longer much of a predictor of the DV. When you plot your points on a graph and draw a line, you do not see a relationship.
Approach it this way: if I am to use a linear regression, the variance of the error terms must be constant. If I find that is not the case, then my results may mean there’s a problem with my data, or that I must use a different model. Think of this as a real-world scenario where you have the data and are trying to figure out what model to use.
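One way to check the constant-variance condition on actual data is a Breusch-Pagan-style test. The sketch below uses Koenker's n·R² form with a single regressor: fit the line, regress the squared residuals on the same IV, and compare n·R² to a chi-square distribution with one degree of freedom. The age and income numbers are simulated and entirely made up to mimic the example above, and the function names are hypothetical.

```python
import math
import random

def fit_line(x, y):
    """Return (intercept, slope, r_squared) for an OLS fit of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy ** 2 / (sxx * syy)

def breusch_pagan(x, y):
    """Koenker's n*R^2 version of the Breusch-Pagan test with one regressor."""
    a, b, _ = fit_line(x, y)
    sq_resid = [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression: do the squared residuals vary with x?
    _, _, r2 = fit_line(x, sq_resid)
    lm = len(x) * r2
    # Upper tail of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

random.seed(2)
age = [random.uniform(16.0, 30.0) for _ in range(1000)]
# Made-up incomes: the spread widens with age in the first series only.
income_hetero = [10.0 + 1.5 * a + random.gauss(0.0, 0.5 * (a - 15.0)) for a in age]
income_homo = [10.0 + 1.5 * a + random.gauss(0.0, 4.0) for a in age]

lm_hetero, p_hetero = breusch_pagan(age, income_hetero)
lm_homo, p_homo = breusch_pagan(age, income_homo)
print(f"spread grows with age: LM={lm_hetero:.1f}, p={p_hetero:.3g}")
print(f"constant spread:       LM={lm_homo:.1f}, p={p_homo:.3g}")
```

A small p-value is evidence against constant error variance. It does not by itself say the linear model is useless; it mainly tells you that the usual standard errors should not be trusted as they are.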
Hope this is what you are asking…
I hope someone corrects me here if this is a flawed example, as I am also lowkey struggling through quant.