## Regression attenuation only depends upon the relative noise in “X”

October 11th, 2023 by Roy W. Spencer, Ph. D.

I’m not a statistician, and I am hoping someone out there can tell me where I’m wrong in the assertion represented by the above title. Or, if you know someone expert in statistics, please forward this post to them.

In regression analysis we use statistics to estimate the strength of the relationship between two variables, say X and Y.

Standard least-squares linear regression estimates the strength of the relationship (regression slope “m”) in the equation:

Y = mX + b, where b is the Y-intercept.

In the simplest case of Y = X, we can put in a set of normally distributed random numbers for X in Excel, and the relationship looks like this:

Now, in the real world, our measurements are typically noisy, with a variety of errors in measurement, or variations not due, directly or indirectly, to correlated behavior between X and Y. Importantly, standard least squares regression estimation assumes all of these errors are in Y, and not in X. This issue is seldom addressed by people doing regression analysis.

If we next add an error component to the Y variations, we get this:

In this case, a fairly accurate regression coefficient is obtained (1.003 vs. the true value of 1.000), and if you do many simulations with different noise seeds, you will find the diagnosed slope averages out to 1.000.

But, if there is also noise in the X variable, a low bias in the regression coefficient appears, and this is called “regression attenuation” or “regression dilution”:

This becomes a problem in practical applications because it means that the strength of a relationship diagnosed through regression will be underestimated to the extent that there are errors (or noise) in the X variable. This issue has been described (and “errors in variables” methods for treatment have been advanced) most widely in the medical literature, say in quantifying the relationship between human sodium levels and high blood pressure or heart disease. But the problem will exist in any field of research to the extent that the X measurements are noisy.

One can vary the relative amounts of noise in X and in Y to see just how much the regression slope is reduced. When this is done, the following relationship emerges, where the vertical axis is the regression attenuation coefficient (the ratio of the diagnosed slope to the true slope) and the horizontal axis is how much relative noise is in the X variations:

What you see here is that if you know how much of the X variations are due to noise/errors, then you know how much of a low bias you have in the diagnosed regression coefficient. For example, if noise in X is 20% the size of the signals in X, the underestimate of the regression coefficient is only 4%. But if the noise is the same size as the signal, then the regression slope is underestimated by about 50%.
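The curve described above can be reproduced with a few lines of simulation. This is only a sketch, not the spreadsheet used for the post: the sample size, the seed, the noise ratios, and the helper name `ols_slope` are all assumptions. It compares the fitted slope against 1/(1 + r²), where r is the ratio of the noise standard deviation in X to the signal standard deviation in X.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, assumed only for reproducibility
n = 200_000                      # assumed sample size, large to suppress sampling scatter
true_slope = 1.0


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x: cov(x, y) / var(x)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


x_signal = rng.standard_normal(n)
# Noise in Y is set equal in size to the Y signal (an arbitrary choice).
y = true_slope * x_signal + rng.standard_normal(n)

results = {}
for ratio in (0.0, 0.2, 0.5, 1.0, 2.0):      # noise sd in X / signal sd in X
    x_obs = x_signal + ratio * rng.standard_normal(n)
    fitted = ols_slope(x_obs, y) / true_slope  # attenuation coefficient
    expected = 1.0 / (1.0 + ratio**2)          # textbook attenuation factor
    results[ratio] = (fitted, expected)
    print(f"noise/signal = {ratio:3.1f}  fitted = {fitted:.3f}  expected = {expected:.3f}")
```

For a ratio of 0.2 the expected factor is 1/1.04, about 0.96, and for a ratio of 1.0 it is 0.5, which matches the 4% and 50% figures quoted above.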

Noise in Y Doesn’t Matter

But what the 3 different colored curves show is that for Y noise levels ranging from 1% of the Y signal, to 10 times the Y signal (a factor of 1,000 range in the Y noise), there is no effect on the regression slope (except to make its estimate more noisy when the Y noise is very large).
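A rough way to check this claim is to hold the X noise fixed and sweep the Y noise over the same factor-of-1,000 range, repeating the fit over many independent samples. The sketch below assumes an X noise-to-signal ratio of 0.5 (so the expected attenuation is 0.8); the number of trials, the sample size, and the helper `batch_slopes` are illustrative choices, not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(7)          # assumed seed
trials, n = 300, 5_000                  # assumed number of repeats and sample size
x_noise_ratio = 0.5                     # assumed X noise sd / X signal sd
expected = 1.0 / (1.0 + x_noise_ratio**2)   # 0.8 for this ratio


def batch_slopes(x, y):
    """OLS slope of y on x for every row of two (trials, n) arrays."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return (xc * yc).sum(axis=1) / (xc**2).sum(axis=1)


summary = {}
for y_noise_ratio in (0.01, 1.0, 10.0):     # Y noise sd / Y signal sd
    x_signal = rng.standard_normal((trials, n))
    y = x_signal + y_noise_ratio * rng.standard_normal((trials, n))   # true slope 1
    x_obs = x_signal + x_noise_ratio * rng.standard_normal((trials, n))
    slopes = batch_slopes(x_obs, y)
    summary[y_noise_ratio] = (slopes.mean(), slopes.std())
    print(f"Y noise ratio {y_noise_ratio:5.2f}: mean slope {slopes.mean():.3f}, "
          f"spread {slopes.std():.3f} (expected mean {expected:.3f})")
```

In this setup the mean slope should stay near 0.8 for all three Y noise levels, while the trial-to-trial spread grows with the Y noise.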

There is a commonly used technique for estimating the regression slope called Deming regression, and it assumes a known ratio between noise in Y versus noise in X. But I don’t see how the noise in Y has any impact on regression attenuation. All one needs is an estimate of the relative amount of noise in X, and then the regression attenuation follows the above curve(s).

Anyway, I hope someone can point out errors in what I have described, and why Deming regression should be used even though my analysis suggests regression attenuation has no dependence on errors in Y.

This impacts our analysis of the urban heat island (UHI) where we have hundreds of thousands of station pairs where we are relating their temperature difference to their difference in population density. At very low population densities, the correlation coefficients become very small (less than 0.1, so R² less than 0.01), yet the regression coefficients are quite large and, apparently, virtually unaffected by attenuation, because virtually all of the noise is in the temperature differences (Y) and not the population difference data (X).

1. Ross McKitrick has responded to my email to him on this subject, and it turns out he has one paper published, and another soon to be published, on this subject. Pretty technical. He claims that climate researchers using climate model output are actually getting over-inflated regression relationships by using “errors-in-variables” regression models that make improper assumptions regarding the source of “noise” in climate model data.

Deming regression uses paired measurements (Xi, Yi) and their errors (εi, δi).

It assumes that the errors in both X and Y are normally distributed, and the ratio of their variances is constant (λ = V(ε)/V(δ)). It also assumes a linear relationship between X and Y.

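While Ordinary Least Squares (OLS) regression minimizes the vertical distances between observed values of Y and the regression line, Deming regression minimizes distances measured at an angle set by the error-variance ratio; when the two error variances are equal (λ = 1), these are the perpendicular distances between data points and the regression line.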
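For readers who want to try the comparison, here is a minimal sketch of the standard closed-form Deming slope. The function name `deming_slope` and the synthetic data are assumptions for illustration. Note the convention: `delta` below is the variance of the Y errors divided by the variance of the X errors, which is the reciprocal of the λ defined above if ε denotes the X error. With `delta = 1` the fit is orthogonal regression.

```python
import numpy as np


def deming_slope(x, y, delta):
    """Closed-form Deming slope.

    delta = var(errors in Y) / var(errors in X), assumed known in advance.
    """
    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y)[0, 1]
    d = syy - delta * sxx
    return (d + np.sqrt(d**2 + 4.0 * delta * sxy**2)) / (2.0 * sxy)


# Synthetic test case (all values assumed): true slope 2, equal noise in X and Y.
rng = np.random.default_rng(3)
n = 200_000
x_true = rng.standard_normal(n)
x_obs = x_true + 0.5 * rng.standard_normal(n)
y_obs = 2.0 * x_true + 0.5 * rng.standard_normal(n)

ols = np.cov(x_obs, y_obs)[0, 1] / np.var(x_obs, ddof=1)   # attenuated slope
deming = deming_slope(x_obs, y_obs, delta=1.0)             # uses the known variance ratio
print(f"OLS slope {ols:.3f}, Deming slope {deming:.3f}, true slope 2.000")
```

The Deming estimate only recovers the true slope here because the variance ratio fed to it is the correct one; with a wrong ratio it is biased too.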

OLS vs Deming Regression Model

Deming’s method was first published in 1948. I suspect that climate researchers have heard of it by now, no?

• Ed Reid says:

I just read through most of the responses to this issue and I can’t help thinking that the problem is in choosing the X variable. In this case you are relating some measured quantity (Y) to some variable measuring urban effect (X), here population size, so the problem is how well population size measures that urban effect. Not very well: population size is a rather poor correlate of the urban effect, simply because cities of equal size do not necessarily share the characteristics that contribute to the urban effect. Hence the variance in X is large, leading to a potential attenuation effect. Furthermore, it also seems unlikely that the variance in X is constant over population size, which adds further variance in X. What you need to do is find a better measurement of the urban effect. Perhaps a measurement of some sort of albedo effect, although I suspect there may be many ways to find a better X variable.

2. Nate says:

The Excel fit is minimizing y residuals, not x residuals.

So it is confused when x has big errors and gives an erroneous fit.

3. Tim S says:

It seems from your final paragraph that the problem is the way temperature is being measured and not necessarily the urban density. It seems to me it could be the physical location, such as the difference between a park setting or store front, or it could be a problem with the shielding of the instrument to avoid the effects of direct sunlight, etc.

4. Steven M Mosher says:

This impacts our analysis of the urban heat island (UHI) where we have hundreds of thousands of station pairs where we are relating their temperature difference to their difference in population density.

you probably need to revisit your thinking about density as Oke did in later years, realizing it was dimensionally wrong. and could never be used in a proper regression.

further, consider how density estimates are made in EVERY population dataset.

compare accurate counts like us census with estimates. bottom line population is useless because human bodies dont cause uhi.

consider airports which typically are assigned 0 population, and industrial sites like mining operations in australia which have 0 population.

• Anon for a reason says:

Steven, population of what? Just people, or farm animals too, as they give off enough heat?

If urban heat islands include any man-made source of heat, including indirect sources such as farms, then it will be even more difficult to resolve.

• Swenson says:

Thermal imaging won’t tell you much, in all probability. An example of possible error might be geologic hot-spots, apparently chaotic both as to location and intensity, as well as size.

Another might be biomass accumulation, either as a result of human activities – cattle etc., or more subtle – subterranean termites or other insects. The termite biomass is estimated to be at least equal to that of humans. Is there a known relationship between humans and termites?

All a bit tricky, I would suggest.

• Anon for a reason says:

Totally agree that it is tricky, but surely with thousands of sites around the world it should be possible to find some ideal scenarios. Of course it will take time but would be worth while. One of the issues that makes it easier in North America & Europe is that population density quickly reduces away from towns & cities, often over a short distance. Can’t say the same about the uk.

Dr Roy Spencer & Co have done marvellous research that helps refine one of the more misleading areas, but of course more research is needed.

• RLH says:

“population is useless because human bodies dont cause uhi”

Are you saying that ‘human’ caused warmth does not leak into the atmosphere?

• Entropic man says:

I doubt that our body heat is contributing significantly to the global energy budget. We are only releasing by respiration energy earlier trapped by photosynthesis.

• Swenson says:

EM,

Doubts are fine, but experimental results either confirm or contradict assumptions – which are what doubts are!

It’s a fact that humans give off heat. It’s a fact that thermometers are designed to respond to heat. Generally, a thermometer will show an internal human temperature of around 37 C, as will an appropriately calibrated IR thermometer pointed at, say, your forehead.

You may claim that the heat produced by eight billion humans through converting hydrocarbons to water and carbon dioxide is insignificant, but in freezing temperatures at night, human heat production can prevent death, as in a snow cave, igloo, or sleeping bag. Quite significant.

Heat from the sun is absent, so comparisons with solar input are irrelevant.

Measurements will settle your doubts, but I doubt anyone has actually performed such measurements. I doubt that you can describe the GHE.

Ah, many doubts, few resolutions.

• Tim Folkerts says:

“It’s a fact that humans give off heat…
Ah, many doubts, few resolutions.”

So why not provide some resolutions? As a Fermi problem this is easy. As an order of magnitude estimate …
A person eats ~2E3 kcal/day.
There are ~4E3 J/kcal
That’s ~1E7 J/day

There are ~ 1E5 s/day
That’s ~100 W per person.

There are ~20 people/km^2 (averaged over the whole earth)
That’s ~2000 W/km^2 from people
That’s ~0.002 W/m^2 from people.

So yes, even 8,000,000,000 people is insignificant in terms of earth’s overall energy flows. (And if people weren’t consuming the plants, other animals or bacteria would, so it is a net zero contribution.)

For individual cities, population densities can be ~1000x higher, or ~2 W/m^2. Then it is a significant local energy source. Basically, the plants (and animals) we eat collect solar energy from farmlands and deliver it to cities.

Of course, all of these are approximations, but they do give order of magnitude estimates, that anyone here should be able to do.
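The same Fermi estimate can be written out as a few lines of arithmetic. This is a sketch using the rounded inputs from the comment, with the unit conversions (4184 J/kcal, 86,400 s/day) written exactly; the variable names are made up for illustration.

```python
# Inputs taken from the estimate above (order-of-magnitude values).
KCAL_PER_DAY = 2.0e3            # food energy per person per day
PEOPLE_PER_KM2_GLOBAL = 20.0    # average over the whole Earth
CITY_DENSITY_FACTOR = 1000.0    # cities assumed ~1000x denser

# Exact unit conversions.
J_PER_KCAL = 4184.0
SECONDS_PER_DAY = 86_400.0
M2_PER_KM2 = 1.0e6

watts_per_person = KCAL_PER_DAY * J_PER_KCAL / SECONDS_PER_DAY
global_flux_w_m2 = watts_per_person * PEOPLE_PER_KM2_GLOBAL / M2_PER_KM2
city_flux_w_m2 = global_flux_w_m2 * CITY_DENSITY_FACTOR

print(f"per person:   {watts_per_person:.0f} W")
print(f"global mean:  {global_flux_w_m2:.4f} W/m^2")
print(f"dense city:   {city_flux_w_m2:.1f} W/m^2")
```

With these inputs the result is roughly 100 W per person, a few thousandths of a W/m^2 as a global mean, and a few W/m^2 for a dense city, the same orders of magnitude as in the comment.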

• Anon for a reason says:

Entropic Man, the heat added to the environment by sedentary people in the UK is 6.8GW and much higher than that when any movement is involved.

Now add in pets, farm animals etc and heat added will only increase. So your assumption is not based on any evidence that can be found by heating/service engineers who need to rightsize heating & cooling equipment for the built environment.

• Bill Hunter says:

Add to that climate science hasn’t established that the GHE is greater than 8 to 10C.

they have a whole lot of explaining to do.

• Bill Hunter says:

Entropic man says:

I doubt that our body heat is contributing significantly to the global energy budget. We are only releasing by respiration energy earlier trapped by photosynthesis.

——————————-
gee now all you need to do is realize that all co2 does is rerelease energy previously trapped by another molecule too.

i sense we are making some progress.

• RLH says:

Who said that body heat was the main component?

• Swenson says:

I don’t know. Who did?

• Anon for a reason says:

Quick bit of googling and an amusing fact that seems to have escaped you. The proportion of cattle to people seems to be increasing as more people aspire to become middle-class meat eaters. Heat energy given off by cows is 1400 watts so cattle farms near weather stations could also add to the global warming data.

1:10 people:cattle spread out across the world may not be a lot but in condensed areas might be noticeable. So rather than urban heat index there might need to be cattle heat index.

5. Mark S. says:

If you look at the closed form equation for the slope in the linear regression model, it is easy to see the difference between noise in x and noise in y.

The denominator contains just two summations depending only on x: Sum(x_i^2) and (Sum(x_i))^2. If you add symmetric noise e_i to the x_i, the second sum will not change on average, but the first will pick up a term like Sum(e_i^2).

Adding symmetric noise terms to the y_i in the numerator will cancel in the limit.

Your plot is “in spirit” 1/(1 + e^2)
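The argument in this comment can be written out as a short derivation. It is a sketch under the usual assumptions, which are not stated in the comment itself: the observed x is a true value x* plus noise e, the observed y carries noise u, and e and u have zero mean and are uncorrelated with x* and with each other. The symbols σ_e and σ_{x*} (the standard deviations of the X noise and the X signal) are introduced here for illustration.

```latex
\begin{aligned}
x_i &= x_i^{*} + e_i, \qquad y_i = m\,x_i^{*} + b + u_i,\\[4pt]
\hat{m} &= \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}
\;\longrightarrow\;
\frac{\operatorname{Cov}(x,y)}{\operatorname{Var}(x)}
= \frac{m\,\sigma_{x^{*}}^{2}}{\sigma_{x^{*}}^{2}+\sigma_{e}^{2}}
= \frac{m}{1+\sigma_{e}^{2}/\sigma_{x^{*}}^{2}}.
\end{aligned}
```

The noise u in y drops out of the covariance in the large-sample limit, so only the ratio of X noise to X signal survives. That is the 1/(1 + e^2) shape, with e read as that ratio.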

• Bindidon says:

Thank you.

• Mark S. says:

You’re welcome

6. Peter Langfelder says:

To answer your question, the attenuation can be expressed in several different equivalent forms, one of which only depends on the noise in the x variable. But most of the time you don’t know the noise of a variable because you don’t know the underlying “true” (latent, i.e., unobserved) variable. You may, e.g., have an estimate of the ratio of the noise in x vs. noise in y and one can also express the attenuation in terms of that ratio. It is true that if the noise in x is much smaller than the noise in y, the attenuation is weaker (i.e., the regression coefficient is closer to the true proportionality constant), but the attenuation, when expressed in terms of the ratio, also depends on the correlation of the two variables. When the correlation is very small, the attenuation can be large _and_ tends to be difficult to estimate from data, i.e., the uncertainty in the estimated true proportionality coefficient is large. Feel free to contact me for more details.

• Bindidon says:

2u2 thx for the explanation which I had overlooked.

7. Anon for a reason says:

Years ago I had to measure the flow of a fluid over an obstacle using a simple instrument that had to be held parallel to the flow. Of course I got distracted and took all measurements aligned in the same plane. As the great philosopher Homer said “doh!!”

I then tried to calibrate the instrument to see if I could work back to the velocity reading I should have made. Interestingly, I found that not only was it impossible but also other readings from more competent people were also likely to be wrong.

In a chaotic system I doubt any removal of noise/bad data will be reliable.

8. Nate says:

If fitting y = mx + b, and most of the noise is in x, then you should be fitting x = (1/m)y + c.

A LS fit to this will minimize the x residuals, and gives a correct slope 1/m.
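A quick synthetic check of this suggestion is sketched below. The true slope of 2, the noise levels, and the helper `ols_slope` are all assumptions chosen for illustration. It applies to the case described in the comment, where nearly all the noise is in x; if y is also noisy, the inverse fit errs in the other direction and overestimates m.

```python
import numpy as np

rng = np.random.default_rng(1)   # assumed seed
n = 100_000                      # assumed sample size
m_true = 2.0                     # assumed true slope

x_true = rng.standard_normal(n)
y = m_true * x_true + 0.1 * rng.standard_normal(n)   # very little noise in y
x_obs = x_true + 1.0 * rng.standard_normal(n)        # noise in x as large as the signal


def ols_slope(u, v):
    """Least-squares slope of v regressed on u."""
    return np.cov(u, v)[0, 1] / np.var(u, ddof=1)


slope_y_on_x = ols_slope(x_obs, y)      # ordinary fit, attenuated to about m/2 here
slope_x_on_y = ols_slope(y, x_obs)      # inverse fit, estimates 1/m
m_from_inverse = 1.0 / slope_x_on_y

print(f"y on x: {slope_y_on_x:.3f}   from x on y: {m_from_inverse:.3f}   true: {m_true:.3f}")
```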

• RLH says:

Least squares is sensitive to outliers.

• Gordon Robertson says:

nate…the question is, how is the line fitted to the data?

Think about it. You have a set of x,y data points plotted on an x – y axis. How do you fit a line to the data? You could eyeball it then determine the slope of the line drawn, which would be in error. But you can’t arbitrarily declare the line without some manipulations behind the scenes.

As far as I understand, regression techniques do the same, by averaging the plotted points in an algorithm. Of course, you could start with a guessed line slope then refine it to get a more accurate slope.

What if you have so many data points that it is impractical to average them?

9. Bindidon says:

The magnificent outlier sensitivity of linear estimates based on Ordinary Least Squares

Months ago I compared, for the UAH 6.0 LT anomaly Globe time series, the linear estimates of

– the source itself
– its simple 13 month running mean (SRM)
– its cascaded triple (C3RM) resp. quintuple (C5RM) 13 month running means (with window sizes according to Vaughan Pratt’s numbers)

for the months Dec 1978 – May 2023.

*
The Pratt numbers and associated window sizes:

C3RM: 1.2067, 1.5478 -> 13 / 11 / 8
C5RM: 1.0832, 1.2343, 1.4352, 1.6757 -> 13 / 12 / 11 / 9 / 8
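The window sizes listed above are consistent with dividing the 13-month base window by each ratio and rounding. The sketch below assumes that reading, along with uncentered "valid" convolution for each stage; the function names are hypothetical, and the exact centering and edge handling used for the numbers in this comment are not known.

```python
import numpy as np


def cascade_windows(base, ratios):
    """Window lengths for a cascade: the base window, then base / ratio, rounded."""
    return [base] + [round(base / r) for r in ratios]


def running_mean(x, window):
    """Simple running mean; only positions with a full window are returned."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")


def cascaded_running_mean(x, windows):
    """Apply running means one after another with the given window lengths."""
    out = np.asarray(x, dtype=float)
    for w in windows:
        out = running_mean(out, w)
    return out


c3_windows = cascade_windows(13, [1.2067, 1.5478])
c5_windows = cascade_windows(13, [1.0832, 1.2343, 1.4352, 1.6757])
print(c3_windows, c5_windows)

# A pure linear ramp keeps its slope under any running mean.
ramp = np.arange(100, dtype=float)
smoothed = cascaded_running_mean(ramp, c3_windows)
```

Each stage shortens the series by its window length minus one, which is why the cascaded series cover a shorter active window than the source.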

The linear estimates of all series were computed wrt the active window of the C5RM, in C/decade, as it is the smallest one:

Source: 0.143 +- 0.007
SRM: 0.144 +- 0.005
C3RM: 0.144 +- 0.004
C5RM: 0.143 +- 0.004

These estimates show that despite increasingly eliminating outliers present in the source, the simple running mean and the two cascaded running means nevertheless show nearly the same trend as this source.

Eliminating outliers (wherever strong deviations of a mean are viewed as such) may be a wishful task; but it does not necessarily have an influence on the computation of linear estimates based on Ordinary Least Squares.

*
When now computing, instead of linear estimates, the coefficients of e.g. a third order polynomial mean, you obtain indeed numbers differing a little bit more.

An evaluation of the coefficients in ‘y = ax^3 + bx^2 +cx + d’ for the source and the running means over the 484 months of C5RM’s active window (Mar 1981 – Jun 2021) gives the following temperature increases during this period:

Source: 0.745 (C)
SRM: 0.731
C3RM: 0.723
C5RM: 0.720

*
Finally, when comparing these numbers to those obtained from linear estimates for the same period

Source: 0.578 (C)
SRM: 0.581
C3RM: 0.579
C5RM: 0.577

you suddenly discover that there must be some nice little acceleration in UAH’s data.

*
I would enjoy commenters like Mark B or bdgwx doing the same job and confirming or refuting the numbers above.

• RLH says:

I just quote what the literature says. Least squares is sensitive to outliers.

• RLH says:

“Because of the extreme sensitivity of least square, a single outlier in a large sample is sufficient to deviate the regression fit totally as its breakdown point is 1/n which tends to zero with the increase in sample size n”

• RLH says:

“Note Least squares regression is not resistant to the presence of outliers.”

• RLH says:

Blinny still thinks that SRM are a valid statistical technique.

• Bindidon says:

It was 100% predictable: ‘RLH’ alias Blindsley H00d is unable to technically contradict my results:

Source: 0.143 +- 0.007
SRM: 0.144 +- 0.005
C3RM: 0.144 +- 0.004
C5RM: 0.143 +- 0.004

Results which clearly show that – contrary to what Blindsley H00d always claims: not just the SRMs, the simple running means but the cascaded ones as well – have almost exactly the same linear estimates as these poor sources, all thoroughly contaminated by terrible outliers.

*
Instead of a valid, convincing contradiction, Blindsley H00d, as always, resorts to unnecessary appeals to authority as well as insinuations and lies like

” Blinny still thinks that SRM are a valid statistical technique. ”

I never claimed such a nonsense but have always said and repeat that simple running means are a very good tool to show, within a community of laypersons, the essence of sometimes very cryptic, disparate time series – without however scraping away details relevant to their comparison, like e.g. here:

*
Only overly opinionated persons like Blindsley H00d would argue that these many, ever-fading bumps – which perfectly signal an escape from a large glacial isostatic rebound zone – are nothing more than troubling ripples and distortions that need to be entirely eliminated.

*
And since Blindsley H00d says he is a highly qualified IT specialist with a master’s degree, it will certainly not be difficult for him at all to create for us, out of this trivial data source

https://www.psmsl.org/data/obtaining/rlr.monthly.data/rlr_monthly.zip

the same graphic as above – but of course with impressing cascaded running means, instead of these simple-minded SRMs, which he always disapproves of and even pursues mercilessly for purely ideological reasons.

He very probably never has ever used any cascaded running mean as input data, and merely proudly, endlessly shows them on his blog.

*
No need for any further comment: Blindlsey H00d is always right and never has ever admitted being wrong (some small lexical mistakes excepted).

• RLH says:

“simple running means are a very good tool” which contain various distortions such that someone claimed that he wouldn’t wish them on his worst enemy.

• RLH says:

Care to predict what a C3RunningMedian trend is for 12 months periods of the UAH Data?

• Bindidon says:

As always, Blindsley H00d:

Stop talking, arguing, asking others for what you yourself should do.

Start working on your ‘C3RunningMedian trend for 13 months periods of the UAH Data’, and present the results on this blog.

I repeat: 13 months, Blindsley H00d.

As do the scientists you discredit because they use simple 13 month running means, like Roy Spencer or… the Belgian SILSO team.

• RLH says:

Why 13 months when a year has only 12 months?

• Mark S says:

You can think of doing linear regression on running averages as linear regression on the original data plus a term equal to (running average minus original). Then see my comment above. The “noise” term here is not explicitly generated from a symmetric PDF, but for the UAH data will to a great extent cancel in the Sum(y_i) terms (plot it and see) since the dominant signal is oscillating.

It makes no sense to me to fit a cubic polynomial to the data. If you want to show “acceleration” use a quadratic and show the statistical significance of “b”. Or take the log and fit with linear regression for an exponential growth model.
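As a sketch of the quadratic test suggested here, the snippet below fits a second-order polynomial with `numpy.polyfit` and reports the quadratic coefficient with its standard error. The data are synthetic: the trend, the curvature, and the white-noise level are made-up values, not UAH numbers. The standard error assumes independent residuals, so for an autocorrelated series like monthly temperature anomalies it would be too optimistic.

```python
import numpy as np

rng = np.random.default_rng(42)          # assumed seed
months = np.arange(534)                  # length of a Dec 1978 to May 2023 monthly series
t = months / 120.0                       # time in decades

# Made-up "truth" for the synthetic series: linear trend plus a small curvature.
true_linear, true_quad = 0.13, 0.02
y = true_linear * t + true_quad * t**2 + 0.2 * rng.standard_normal(t.size)

# polyfit returns coefficients from highest power down, plus their covariance matrix.
coefs, cov = np.polyfit(t, y, deg=2, cov=True)
quad_coef = coefs[0]
quad_se = float(np.sqrt(cov[0, 0]))
t_stat = quad_coef / quad_se

print(f"quadratic coefficient: {quad_coef:.4f} +- {quad_se:.4f} (t = {t_stat:.1f})")
```

A ratio well above 2 in absolute value would suggest the curvature is distinguishable from zero under the white-noise assumption.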

• Bindidon says:

Mark S

” It makes no sense to me to fit a cubic polynomial to the data. ”

You are of course right.

But often enough, cubic splines follow their data source better than quadratic ones, and that was the reason for me, as a layman, to use them.

I see that for example when looking at SILSO’s monthly Sun Spot Number data and looking for a projection into the near future.

Neither second let alone fourth order polynomials look good to me, but third order gives a good result.

This has nothing to do with math. Just layman’s gut feeling :–)

• Mark S says:

Higher-order polynomial interpolation is more accurate if the data is smooth (in the mathematical sense).

Higher-order polynomial extrapolation is a very dangerous game. The higher order terms will always dominate once the extrapolation window is long enough and then quickly head toward + or – infinity.

I had students do a project to predict the Dow Jones index one day in the future using extrapolation and interpolation (the latter under the fantasy scenario that they knew what the index would be the day after tomorrow, but not tomorrow). It is a useful exercise in futility.

• Bindidon says:

Mark S

” It is a useful exercise in futility. ”

At least 100% agree, but…

I never intentionally use polynomial fits (beginning with the first degree) to predict anything.

I use them only for the comparison of existing data, e.g.

to understand how much for example cascaded running means really differ from the simpler ones, by evaluating the polynomial equation coefficients provided by the spreadsheet calculator, and comparing the amazingly similar results.

*
It’s nice however to look at what these polynomials do when you extend their scope beyond existing data, like here:

The red 3rd order poly is not so very far from McIntosh/Leamon’s prediction based on complex evaluations of the Hale cycles :–)

Just for fun!

• skeptikal says:

Bindidon says… “you suddenly discover that there must be some nice little acceleration in UAHs data.”

Now break down the UAH data and you’ll find out why.

• Bindidon says:

‘… break down… ‘

Sounds a bit cryptic to me. Could you explain?

• skeptikal says:

Break the data down into segments. Compare the segments and you’ll see why you detected an acceleration.

• Bindidon says:

Nothing against sound skepticism!

But ‘Break the data down into segments’ is typical pseudoskepticism, reminding me how people broke global surface time series into segments and computed their local trend.

That’s the cheap escalator trick.

It’s also ‘reasoning’ like the Robertson guy who (indirectly, without naming him of course) discredits Roy Spencer’s trends and claims the trend is only an artificial calculation. What a load of nonsense.

You must analyze time series as a whole, and compare for example the (currently tiniest) difference between linear and quadratic fit.

• skeptikal says:

Anything that can be used for good, can also be used for evil.

If you don’t like segmenting data, then that’s alright. Do what works for you.

10. Chas says:

Not a mathematician, but it seems that the sd(Y) terms cancel out when the equation for the correlation coefficient ‘r’ is put into the equation for the OLS slope:

The OLS slope is r * (sd(Y)/sd(x))
And r is COV(X,Y) / (sd(X) * sd(Y))
Where r is the correlation coefficient.
?
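Written out, the cancellation looks like this. The notation is assumed here: s_X and s_Y are the sample standard deviations of the observed X and Y, and u is noise added to Y that is uncorrelated with X.

```latex
\hat{m} \;=\; r\,\frac{s_Y}{s_X}
\;=\; \frac{\operatorname{Cov}(X,Y)}{s_X\,s_Y}\cdot\frac{s_Y}{s_X}
\;=\; \frac{\operatorname{Cov}(X,Y)}{s_X^{2}},
\qquad
\operatorname{Cov}(X,\,Y+u) \;=\; \operatorname{Cov}(X,Y) + \operatorname{Cov}(X,u) \;\approx\; \operatorname{Cov}(X,Y).
```

So Y enters the slope only through its covariance with X. Noise in Y that is unrelated to X lowers r and raises s_Y by compensating amounts, leaving the slope unchanged on average, while noise in X inflates the denominator.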

11. Gordon Robertson says:

I would like to see statisticians stop using the word noise in statistics since it is misleading. The word noise, used more correctly in electronics circuits, is a reference to an unwanted signal. Where in statistics is a data signal unwanted? It has to mean the data is in error to some degree.

Of course, the various filters and estimations used can introduce noise due to the inaccuracy they can introduce. Remember, when you do a least squares fit, you are estimating, and the error produced could be regarded as noise. However, it used to be called an error, not noise.

In Roy’s straight-line X-Y example there can be no noise because X = Y and produces a straight line. However, in the other example with a multitude of data points through which a trend line has been drawn, there is no longer an accurate x – y relationship. It is an estimate, no matter how well the trend line represents the average of the data.

As I understand regression, it is simply a numerical method of finding an average for a set of data points as opposed to eyeballing a line through the plotted points. The thing I don’t like about it is number-crunching blindly without understanding the meaning of the results.

• RLH says:

“Where in statistics is a data signal unwanted?”

When you mix 2 signals together.

• Gordon Robertson says:

Do you mean averaging the signals statistically? If so, any error produced is human error and not an error in the data.

Is that what noise means, human error due to the statistical methods employed?

• RLH says:

You asked “Where in statistics is a data signal unwanted?”. I replied with an example.

• Tim Folkerts says:

“The word noise, used more correctly in electronics circuits, is a reference to an unwanted signal. ”

Which is exactly what we have here. There is a ‘wanted signal’ for how population impacts temperature. Then there are MYRIAD ‘unwanted signals’ that ALSO impact temperature (El Nino, volcanoes, CO2 …)

• Bindidon says:

… and please don’t forget the absolutely unwanted signal that results from processing the data :–(

• skeptikal says:

Ha ha, you’re closer to the truth than you realize.

• Bindidon says:

Ho ho, here is someone who may be underestimating that anyone who processes data every day inevitably learns to understand that :–))

• Gordon Robertson says:

Tim…the difference is that we are adding the noise with our analytical tools then claiming it is noise in the data. We need to make it clear that such noise is human-induced error and not an error in the measurements that produced the data.

The noise in electrical circuits is incidental and very natural. Shot noise is produced by collisions between free electrons and atomic and other particles like electrons in the conductor. EMI noise could be noise introduced by fluorescent light EM radiation or by sparking brushes in the commutator of an electric motor.

• Tim Folkerts says:

“adding the noise with our analytical tools then claiming it is noise in the data.”

I would say there is plenty of natural ‘noise’ in climate data, too. ‘Our analytic tools’ introduce small amounts of noise relative to day to day weather changes.

• Gordon Robertson says:

tim f…”I would say there is plenty of natural noise in climate data…”

***

If you’re talking about data from NOAA, GISS, Had-crut, or BoM, that’s true. It’s full of fudged data. But UAH data is clean, straight from the sat telemetry. UAH does not need to fudge the data because the sat scanners cover 95% of the entire planetary surface whereas thermometers cover, on average, about 1 thermometer per 100,000 square kilometres.

If you take a reading from a thermometer in the field, say in a Stevenson screen, there is always a built-in error to the reading taken by human eyes, but it can be stated with an error margin with confidence. However, if you take two readings per day and average them as a high and a low, then you introduce error that can vary day to day depending on the range of high and low. In other words, how accurate is the average claimed?

Now take that kind of error and spread it over NOAA’s 1500 thermometers to measure the planet’s solid surface and you get an essentially unknown global average.

I don’t think we are talking about such errors with regression, we are talking about errors built into the regression algorithm that creates an even deeper error. That’s a basic problem with statistical methods, understanding the context in which the data was produced and how the statistical method employed applies to it.

Another problem with such methods is the bs factor. Gallup polls claim, using a sample that is tiny in comparison to a huge population, that they can predict an outcome, and they claim a confidence level of 90% or higher that the results are correct. They may have that degree of accuracy based on the math applied but whether that outcome represents the huge population is another matter.

Sample size is everything in statistics and claiming a 90% confidence level based on a sample size of 1000 from a population size of 40 million is not only absurd, it is hysterically laughable.

• Tim Folkerts says:

Gordon, the “natural noise” is “weather”.

Things like data analysis by NOAA, GISS, Had-crut, or BoM would actually tend to REDUCE noise. The very word “homogenize” tells you the data is smoothed using other nearby stations. This would mean less noise in the resulting temperature records.

“claiming a 90% confidence level based on a sample size of 1000 from a population size of 40 million is not only absurd, it is hysterically laughable.”
Sample size is NOT the problem. An accurate, random sample of 1000 from 40 thousand or 40 million or 40 billion is sufficient to draw statistically significant conclusions.

The issue is not sample size, but “accurate” and “random”. If you don’t get a true random sample, that can badly skew the results (eg “Dewey Beats Truman”). If people lie or hang up on Gallup so that they get incorrect data, that can badly skew the results. THOSE are the potential problems, NOT sample size per se.

• Gordon Robertson says:

tim…”Things like data analysis by NOAA, GISS, Had-crut, or BoM would actually tend to REDUCE noise. The very word homogenize tells you the data is smoothed using other nearby stations. This would mean less noise in the resulting temperature records.”

***

I’ll give you this, you are very loyal to the cheaters at NOAA, GISS, Had-crut, and BoM. Before they homogenize, in a climate model, they interpolate. That means they synthesize temperatures that were not measured, based on measured temperatures at locations up to 1200 km away. That is cheating right there. Then they homogenize manufactured temperatures with real temperatures to make temperatures look even, not to hide errors.

“Sample size is NOT the problem. An accurate, random sample of 1000 from 40 thousand or 40 million or 40 billion is sufficient to draw statistically significant conclusions”.

***

If you truly believe that, your understanding of statistical methods is seriously flawed.

After a class in probability and statistics I cornered the prof and asked him about Gallup polls, whether they were accurate with such a small sample size. His reply was “Oh, no you don’t, you explain to me first why you think they are inaccurate”. I told him the sample size was way too small for the population, and he agreed.

Unlike others, and to his credit, he did not try to defend the nefarious methods and insinuations offered by polling companies, that their results have a 95% confidence level, based on a sample size of 1000 for a population of 38 million.

Gallup polls are a perfect example of Mark Twain’s claim that there are three kinds of lies: Lies, damned lies, and statistics.

• Tim Folkerts says:

Oh, I agree that the sample size is pretty small. They have to balance better statistics vs higher costs (and slower turn-around time). But they explain that there are statistical uncertainties in the results, and these can be large. It’s not their fault per se that people don’t pay attention and/or don’t understand statistics. (But there might be fault with the media for not being up front about it).

The pure statistics are pretty cut and dried. You need to understand the binomial distribution (and maybe the hypergeometric distribution if you REALLY want to dig into details about sampling 1000 out of 10,000 instead of 1000 out of 10,000,000).
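For reference, the two distributions can be compared directly. The notation here is assumed for illustration: $k$ is the number of "yes" answers in a sample of size $n$, $p$ is the true share in the population, and $N$ is the population size.

```latex
\operatorname{Var}_{\text{binomial}}(k) = n\,p\,(1-p),
\qquad
\operatorname{Var}_{\text{hypergeometric}}(k) = n\,p\,(1-p)\,\frac{N-n}{N-1}
```

With $n = 1000$, the extra factor $(N-n)/(N-1)$ is about 0.90 for $N = 10{,}000$ (the standard deviation shrinks by roughly 5%) and about 0.9999 for $N = 10{,}000{,}000$, where the binomial result is essentially exact. So a larger population never makes the statistical uncertainty worse than the binomial value.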

The far bigger problem — as I pointed out before — is getting an accurate and random sample. If you do a phone poll, you only get people with phones. And further, you only get people with phones who are willing to answer a pollster. If you do a poll on the street, you only get people who are on that street in that city at that time. And maybe those people lie. THESE sorts of errors are far more troublesome than the purely statistical errors.

• Bindidon says:

Exactly, Tim Folkerts

And we see once more how clueless Robertson is with regard to a difficult engineering discipline like data processing, for example extracting such unwanted signals out of the data of tens of thousands of weather stations.

” Multiple linear regression? What’s that? A new CO2 effect accelerator? ”

No wonder when we recall his stubborn, perverse claim that NOAA for example would still use only 1500 stations worldwide these days (which was true in 2010 for a few weeks).

In my native tongue we say in such incurable cases ‘Plus bête tu meurs.’

• Gordon Robertson says:

binny…”for example extracting such unwanted signals out of the data of tens of thousands of weather stations…”

***

There would be no unwanted signal if scientists used a slide rule or an abacus to compute the averages. The errors come into it when they use newfangled algorithms over which they have no control, which they cannot verify as they go along as a good scientist would, and which add noise due to their averaging methods.

The output would improve dramatically if they measured temps every hour, or more frequently, rather than taking two temps a day and averaging them.

12. Chas says:

“At very low population densities, the correlation coefficients become very small…. yet the regression coefficients are quite large, and apparently virtually unaffected by attenuation”

Biologists use(d?) Reduced Major Axis regression (RMA) when there is error in the X variable, and a quick stab at coming up with the RMA slope was to divide the OLS slope by the correlation coefficient (r).
So it might just be that the low population communities have, for example, more of the things that actually cause this warming (maybe more outbuildings per person or perhaps more hard standing per person).

RMA minimises the area of triangles between the datapoints and the regression line (whereas Deming type regressions work on minimising the length of right angle lines between the regression line and the datapoints).

RMA has some nice features… the slope of the regression of X on Y is the inverse of Y on X. It was also known as geometric mean regression.

If you want to give it a whirl with a bit of your dataset there is a nice free standalone .exe program called PAST (that does lots of other stats). Under Model->Linear-> Bivariate regression there are radio buttons that allow you to switch between various regression methods and compare their slopes and plots. It has spreadsheet data entry too. Just highlight the two columns (using shift) and click through to Bivariate regression.

It has an online help file, so click on the ‘help’ tab.

If there is a ‘defect’ with RMA it is that totally uncorrelated variables have a regression slope of 1.
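The RMA properties mentioned above can be checked with a few lines of Python. This is only a sketch with made-up data; the helper names `ols_slope` and `rma_slope` are hypothetical, and the test case (true slope 1, equal noise in X and Y) is an assumption chosen because it is the situation where RMA is expected to recover the slope. It also shows that the slope for uncorrelated variables is really the ratio of their standard deviations, which is 1 only when both have the same spread.

```python
import numpy as np

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def rma_slope(x, y):
    """Reduced major axis slope: sign(r) * sd(y) / sd(x)."""
    r = np.corrcoef(x, y)[0, 1]
    return np.sign(r) * np.std(y, ddof=1) / np.std(x, ddof=1)

rng = np.random.default_rng(1)
x_true = rng.normal(size=5000)
x = x_true + rng.normal(size=5000)  # noisy X
y = x_true + rng.normal(size=5000)  # noisy Y, true slope is 1

r = np.corrcoef(x, y)[0, 1]
print("OLS slope:       ", ols_slope(x, y))       # attenuated by the noise in X
print("OLS slope / r:   ", ols_slope(x, y) / r)   # the quick route to the RMA slope
print("RMA slope:       ", rma_slope(x, y))
print("RMA(y|x)*RMA(x|y):", rma_slope(x, y) * rma_slope(y, x))

# Two unrelated variables with the same spread: OLS says "no relationship",
# while the magnitude of the RMA slope stays near sd(v) / sd(u) = 1.
u = rng.normal(size=5000)
v = rng.normal(size=5000)
print("uncorrelated OLS:", ols_slope(u, v), " |RMA|:", abs(rma_slope(u, v)))
```

If the noise in Y is not matched to the noise in X in this way, RMA would not be expected to return the true slope either, so the result is best compared against the other methods PAST offers.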

• Bindidon says:

Thank you too.

13. George J Kamburoff says:

Instead of arguing over this stuff, let’s discuss the actual science. We can start with Ocean Acidification and move on to changes in the Atlantic Meridional Overturning Circulation.

How many of you can do that?

14. Swenson says:

Without meaning to sound like a doomsayer, I suppose I will.

A previous commenter, “Anon for a reason”, mentioned a difficulty with data collection where fluid dynamics is involved.

There are published studies which show that separating “errors” from normal chaotic processes is not easy – if possible at all.

A deterministic chaotic system does not require external influences to exhibit chaotic behaviour involving strange attractors. Curve fitting merely imposes human desires upon data, writing off anything that does not fit with preconceived notions as “errors”. Bad move. Ignoring what appears to be the naturally chaotic nature of universal physical laws will not bend nature to your will.

Nothing wrong with observing, though. Tycho Brahe was an incredible observer, capable of unprecedented accuracy, for his time, even though his ideas about what he observed turned out to be erroneous, to say the least. Kepler used Tycho’s records to develop Kepler’s Laws, etc.

Maybe Dr Spencer will show something that hasn’t been noticed before. Nothing wrong with that.

15. Tim Folkerts says:

I was curious about this result, and after a little investigation, I have one insight (that I think is valuable). There is an important distinction between uncertainty in the measurements themselves, and variations intrinsic in the system being measured.

Let me use an analogy. Suppose you want statistical information about the density of several objects of varying sizes, so you measure volume (x) and mass (y) and do a regression fit.
* if you can accurately measure the volume but not the mass, you don’t need “regression attenuation” because the error is in y
* if you can accurately measure the mass but not the volume, you DO need “regression attenuation” because the error is in x.
* if you can accurately measure the volume and the mass, but you have objects of varying composition (and varying actual density), then you again do not need “regression attenuation”.
All of these will lead to scatter in the plotted results, but they would require different corrections for “regression attenuation”. (And of course, all three could be contributing at once.)

So much of the issue would come down to what you think is causing the variations you see.
* If you think the variations are due to how well you know the temperatures, you do NOT need “regression attenuation”.
* If you think the variations are due to how well you know populations, you DO need “regression attenuation”.
* If you think the variations are due to the system itself (which frankly seems like it would be a large part of the variation), again you do NOT need “regression attenuation”.
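The three cases in the density analogy can be tried out numerically. The sketch below uses invented numbers throughout (a true density of 2.0, the noise levels, the sample size), so it only illustrates the distinction and says nothing about the temperature and population data themselves.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000
density = 2.0                       # hypothetical true slope (mass per unit volume)
volume = rng.normal(10.0, 2.0, n)   # true volumes (x)
mass = density * volume             # true masses (y)

def slope(x, y):
    """Least-squares slope of y regressed on x."""
    return np.polyfit(x, y, 1)[0]

slopes = {
    # Case 1: volume measured well, mass measured badly (error in y).
    "noise in mass (y)": slope(volume, mass + rng.normal(0.0, 4.0, n)),
    # Case 2: mass measured well, volume measured badly (error in x).
    "noise in volume (x)": slope(volume + rng.normal(0.0, 2.0, n), mass),
    # Case 3: both measured well, but each object has its own density.
    "varying density": slope(volume, rng.normal(density, 0.3, n) * volume),
}
for name, value in slopes.items():
    print(f"{name:>20}: fitted slope = {value:.3f}")
```

With these settings only the second case should come out low, near 1.0 instead of 2.0, because the noise variance in the measured volume equals the variance of the true volumes. The other two cases scatter the points but should leave the fitted slope close to 2.0.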

I think you misunderstand regression attenuation.

Regression attenuation is a statistical phenomenon that occurs when there is measurement error in one or both of the variables used in a regression analysis. This measurement error can lead to underestimation or “attenuation” of the true relationship between the variables.

Attenuation bias is never a good thing.

• Tim Folkerts says:

I think you misunderstand what I said.

“Regression attenuation is a statistical phenomenon that occurs when there is measurement error in one or both of the variables used in a regression analysis.”
Actually, it just occurs with the ‘x’ variable. When there is measurement uncertainty in x (e.g., volume or population), then the slope is ‘diluted’. When there is measurement uncertainty in y (e.g., mass or temperature), then the slope is not ‘diluted’.

“Attenuation bias is never a good thing.”
Right. I don’t think I implied it was ‘good’. Just that it can be present.

You did say “…you DO need ‘regression attenuation’.”

P.s.: when the dependent variable (Y) is measured with error, it can also result in attenuation bias. In this case, the relationship between the variables may appear weaker than it truly is.

As you were.

• Tim Folkerts says:

More specifically, I meant you do need TO TAKE INTO ACCOUNT regression attenuation when there is noise/uncertainty in the x data.

Also, simply “being weaker” (smaller R^2) is not “regression attenuation”.

Noise in the y signal reduces R^2 but leaves the slope the same.
Noise in the x signal reduces R^2 AND reduces the slope.
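These two statements follow from the standard errors-in-variables setup. As a sketch, assume (none of this notation is from the post) that the true relation is $Y = mX^{*} + b$, that we observe $x = X^{*} + u$ and $y = Y + e$, and that the noise terms $u$ and $e$ are independent of $X^{*}$ and of each other, with variances $\sigma_u^2$ and $\sigma_e^2$. Then for large samples:

```latex
\hat{m} \;\to\; m\,\frac{\sigma_{X^{*}}^{2}}{\sigma_{X^{*}}^{2} + \sigma_{u}^{2}},
\qquad
R^{2} \;\to\; \frac{m^{2}\,\sigma_{X^{*}}^{4}}
{\left(\sigma_{X^{*}}^{2} + \sigma_{u}^{2}\right)\left(m^{2}\sigma_{X^{*}}^{2} + \sigma_{e}^{2}\right)}
```

The noise in y ($\sigma_e^2$) appears only in $R^2$, while the noise in x ($\sigma_u^2$) appears in both, and the slope is reduced by a factor that depends only on the noise in x relative to the spread of the true x values.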

Do not confuse the correlation coefficient with the regression coefficient. They serve slightly different purposes and have distinct characteristics.

The strength or weakness of the regression coefficient refers to how much change in the dependent variable is caused by a one-unit change in the independent variable.

The correlation coefficient indicates the strength and direction of the relationship between the two variables but does not provide specific information about the size of the effect or causation.

That is about all there is to say on this topic, no?

Charlatan, there are two lines of evidence saying you are wrong:

1/ Measured UV-C above the TOA and at sea level.

2/ The stratospheric temperature inversion.

Unless you can scientifically disprove those two sets of observations, you are full of sh!t.

Your opinion that “4 ppm cannot possibly absorb UV-C” is just that, an opinion.

Clown!

“…the amount of UV-C is declining rapidly in that EM frequency range because the Sun is simply not out-putting as much of it as UV-B and -C.”

Hahahahhahahahahahahaha!

Clown

• Gordon Robertson says:

ark…don’t call me a clown when you obviously lack the scientific background to understand what I said. Look at Planck’s curve you ijit and what I claimed is obvious.

Thanks for the good laughs!

• Swenson says:

A,

“Lines of evidence” do not “say” anything.

You have no clue, and are just trying to appear clever – but failing.

Next thing, you will be claiming that you could describe the GHE if you felt like doing so!

What a strange SkyDragon cultist you are!

37. David Brewer says:

This discussion is mostly going over my head, but why are the trend lines at the wrong angle? In the “noise added to y” graph the correct trend line would appear to be about y=1.5x instead of 1.03x and in the “noise added to y and x” graph the correct trend line would appear to be about y=x, not y=0.499x. How were the x-values generated?