I’m not a statistician, and I am hoping someone out there can tell me where I’m wrong in the assertion represented by the above title. Or, if you know someone expert in statistics, please forward this post to them.
In regression analysis we use statistics to estimate the strength of the relationship between two variables, say X and Y.
Standard least-squares linear regression estimates the strength of the relationship (regression slope “m”) in the equation:
Y = mX + b, where b is the Y-intercept.
In the simplest case of Y = X, we can put in a set of normally distributed random numbers for X in Excel, and the relationship looks like this:
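For anyone who wants to reproduce this outside of Excel, here is a minimal Python/NumPy sketch of the noise-free case (the sample size and random seed are arbitrary choices, not taken from the figure):

```python
import numpy as np

rng = np.random.default_rng(42)

# Noise-free case: Y = X, with normally distributed random X
x = rng.normal(0.0, 1.0, size=1000)
y = x.copy()

# Ordinary least-squares slope: cov(X, Y) / var(X)
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(slope)  # exactly 1.0, up to floating-point rounding
```

With no noise in either variable, the fitted slope equals the true slope of 1 exactly.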
Now, in the real world, our measurements are typically noisy, with a variety of measurement errors, or variations not due, directly or indirectly, to any real relationship between X and Y. Importantly, standard least-squares regression assumes all of these errors are in Y, and none are in X. This issue is seldom addressed by people doing regression analysis.
If we next add an error component to the Y variations, we get this:
In this case, a fairly accurate regression coefficient is obtained (1.003 vs. the true value of 1.000), and if you do many simulations with different noise seeds, you will find the diagnosed slope averages out to 1.000.
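The claim that the slope averages out to 1.000 over many noise seeds is easy to check with a quick Monte Carlo sketch (NumPy assumed; the Y-noise level and repetition count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fit the slope many times with noise added to Y only;
# the average fitted slope should converge on the true slope of 1.
slopes = []
for _ in range(2000):
    x = rng.normal(0.0, 1.0, size=200)
    y = x + rng.normal(0.0, 0.5, size=200)   # noise in Y only
    slopes.append(np.cov(x, y)[0, 1] / np.var(x, ddof=1))

print(np.mean(slopes))  # close to 1.0: Y noise adds scatter but no bias
```

Individual fits scatter around 1, but the estimator is unbiased when the noise is confined to Y.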
But, if there is also noise in the X variable, a low bias in the regression coefficient appears, and this is called “regression attenuation” or “regression dilution”:
This becomes a problem in practical applications because it means that the strength of a relationship diagnosed through regression will be underestimated to the extent that there are errors (or noise) in the X variable. This issue has been described (and “errors-in-variables” methods for treating it have been advanced) most widely in the medical literature, say in quantifying the relationship between human sodium levels and high blood pressure or heart disease. But the problem will exist in any field of research to the extent that the X measurements are noisy.
One can vary the relative amounts of noise in X and in Y to see just how much the regression slope is reduced. When this is done, the following relationship emerges, where the vertical axis is the regression attenuation coefficient (the ratio of the diagnosed slope to the true slope) and the horizontal axis is how much relative noise is in the X variations:
What you see here is that if you know how much of the X variations are due to noise/errors, then you know how much of a low bias you have in the diagnosed regression coefficient. For example, if noise in X is 20% the size of the signals in X, the underestimate of the regression coefficient is only 4%. But if the noise is the same size as the signal, then the regression slope is underestimated by about 50%.
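These numbers follow from the classical attenuation factor, lambda = var(X signal) / (var(X signal) + var(X noise)). A short sketch (noise levels and sample size are arbitrary choices) computes this factor and checks it against a simulated regression:

```python
import numpy as np

def attenuation(noise_to_signal):
    """Expected OLS slope bias factor: var(signal) / (var(signal) + var(noise))."""
    return 1.0 / (1.0 + noise_to_signal**2)

print(attenuation(0.2))  # ~0.962, i.e. about a 4% underestimate
print(attenuation(1.0))  # 0.5, slope underestimated by about 50%

# Check against a simulation: true slope 1, X noise equal in size to the X signal
rng = np.random.default_rng(1)
x_true = rng.normal(0.0, 1.0, size=100_000)
x_obs = x_true + rng.normal(0.0, 1.0, size=100_000)  # noisy measured X
y = x_true                                           # noise-free Y
slope = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)
print(slope)  # near 0.5, matching the predicted attenuation
```

The simulated slope matches the formula: with equal signal and noise in X, half the slope is lost.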
Noise in Y Doesn’t Matter
But what the three different colored curves show is that for Y noise levels ranging from 1% of the Y signal to 10 times the Y signal (a factor-of-1,000 range in Y noise), there is no effect on the regression slope (except to make its estimate noisier when the Y noise is very large).
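The insensitivity to Y noise can be checked directly: hold the X noise fixed and vary the Y noise over three orders of magnitude (a sketch with arbitrary parameter choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x_true = rng.normal(0.0, 1.0, size=n)
x_obs = x_true + rng.normal(0.0, 1.0, size=n)  # X noise equal in size to the X signal

# Same X noise, wildly different Y noise: the attenuation is unchanged
slopes = {}
for sigma_y in (0.01, 1.0, 10.0):
    y = x_true + rng.normal(0.0, sigma_y, size=n)
    slopes[sigma_y] = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)
    print(sigma_y, slopes[sigma_y])  # all near 0.5; only the scatter grows
```

All three fitted slopes sit near the attenuated value of 0.5 set by the X noise alone; the Y noise only inflates the sampling variability of the estimate.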
There is a commonly used technique for estimating the regression slope called Deming regression, which assumes a known ratio between the error variance in Y and the error variance in X. But I don’t see how the noise in Y has any impact on regression attenuation. All one needs is an estimate of the relative amount of noise in X, and then the regression attenuation follows the above curve(s).
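For reference, here is a sketch of the textbook Deming slope formula, where delta is the assumed ratio of the Y-error variance to the X-error variance (the simulation parameters are arbitrary). With the correct delta it recovers the true slope in a case where ordinary least squares is badly attenuated:

```python
import numpy as np

def deming_slope(x, y, delta):
    """Deming regression slope; delta = var(Y errors) / var(X errors)."""
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y)[0, 1]
    d = syy - delta * sxx
    return (d + np.sqrt(d * d + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)

rng = np.random.default_rng(3)
n = 100_000
x_true = rng.normal(0.0, 1.0, size=n)
x_obs = x_true + rng.normal(0.0, 1.0, size=n)   # X error s.d. = 1
y_obs = x_true + rng.normal(0.0, 1.0, size=n)   # Y error s.d. = 1, so delta = 1

print(deming_slope(x_obs, y_obs, delta=1.0))  # near the true slope of 1.0
ols = np.cov(x_obs, y_obs)[0, 1] / np.var(x_obs, ddof=1)
print(ols)  # near 0.5: attenuated
```

Note that delta only enters as a ratio, which is consistent with the observation above: what drives the attenuation of the OLS slope is the X noise, and the Y-error variance enters Deming regression only through how it is traded off against the X-error variance.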
Anyway, I hope someone can point out errors in what I have described, and why Deming regression should be used even though my analysis suggests regression attenuation has no dependence on errors in Y.
Why Am I Asking?
This impacts our analysis of the urban heat island (UHI) effect, in which we relate the temperature difference between hundreds of thousands of station pairs to their difference in population density. At very low population densities, the correlation coefficients become very small (less than 0.1, so R² less than 0.01), yet the regression coefficients are quite large and apparently virtually unaffected by attenuation, because virtually all of the noise is in the temperature differences (Y) and not in the population difference data (X).