Bivariate data: population or sample?
Some bivariate data sets are complete populations — there is no larger underlying population of which the data are representative. The 'individuals' in such data sets commonly have names or other labels that are an inherent part of the data.
More often, we have no interest in the specific individuals from which the data are collected. The individuals are 'representative' of a larger population or process, and our main interest is in this underlying population.
Salaries of human resources managers
The scatterplot below shows the average salaries of human resources managers in each of the mainland states of the USA (Washington DC is excluded), plotted against the states' population densities.
There is a tendency for states with high population densities to have relatively high salaries. However our main interest is in the names of the states with high or low values. Click on the crosses to identify the states.
Bank branches and minorities
In order to investigate whether banks serve all communities equally, a New Jersey newspaper compiled data from each of New Jersey's 21 counties. The scatterplot below shows the number of people per bank branch in each county and the percentage of minority groups in the county.
Local residents might be interested in the specific counties, but most outsiders would want to generalise from the data to describe the relationship in a way that might describe other similar areas in the Eastern USA. How strong is the evidence that banks tend to have fewer branches in areas with large minority groups?
Response distribution at each X
In experiments, the values of the explanatory variable, X, are controlled by the experimenter. Several response measurements are often made at each distinct value of X. Experimental data are rare in business, but similar data arise when X is discrete — there are several response values corresponding to each distinct x-value.
At any single value of X, the repeated response measurements can be considered as a univariate data set and can be modelled as a random sample from some distribution — commonly a normal distribution. The characteristics of the distribution will often depend on the value of X.
The collection of distributions of Y at the different values of X forms a model for the complete bivariate data set, called a regression model.
House prices and bathrooms
The sale prices of all houses sold in an area were collected. How does the sale price relate to the number of bathrooms in the houses?
The diagram below shows the resulting data. The crosses have been jittered a little (randomly moved) to separate them in the scatterplot.
This diagram is 3-dimensional. Position the mouse in the middle of the diagram and drag towards the top left of the screen to rotate the plot (or click the 3D rotation button). The histogram at each x-value describes the distribution of house prices with that number of bathrooms.
Possible model for house prices
The next diagram shows a possible model for the data above — a normal distribution for each number of bathrooms (X).
You may use the mouse (or the buttons at the top right) to rotate the 3-dimensional diagram. Click Take sample to show a random sample of values from each of these normal distributions. Our model claims that the observed data are a data set of this form.
Normal linear model for the response
A regression model for the response in a bivariate data set describes how the response distribution depends on X. The most commonly used regression model is a normal linear model. This model involves:
- a normal distribution for the response, Y, at each value of X,
- a mean response that changes linearly with X,
- a standard deviation of the response that is the same at all values of X.
The last two properties of the normal linear model can be expressed as

µy = β0 + β1x    and    σy = σ
Note: only the response is modelled
A normal linear model does not try to explain the distribution of x-values.
Example of a normal linear model
A typical normal linear model is shown below.
Drag the slider to see how the distribution of Y depends on the value of X. Observe that...
The centre of the response's distribution is the green line on the diagram, called the regression line. In this example, its equation is shown on the diagram.
The spread of the response distribution, σ, is the same for all X.
Click Take sample a few times to observe typical data from this model when 5 response measurements are made at each of X = 1, 2, 3 and 4.
The model can also be used in situations where the values of X are not repeated. Select the option Regular X then take a few more samples to see typical data if the values of X are chosen to be 0.6, 0.8, 1.0, ..., 4.4.
Select the option Random X and take a few more samples to see typical data if the values of X are irregularly spaced.
Description of the model in terms of a response distribution
The normal linear model describes the distribution of Y for any value of X. It can be expressed in the form

y ~ normal(µy, σ)

where

µy = β0 + β1x
Description of the model in terms of 'errors'
An equivalent way to write the same model is

y = β0 + β1x + ε

where the model error, ε, has a normal(0, σ) distribution.

It is helpful to observe that the error, ε, for a data point is

ε = y − (β0 + β1x)

which is the vertical distance between the cross on a scatterplot and the regression line.
It is worth stressing here that in practical situations,
The slope and intercept of the regression line, β0 and β1, are unknown parameters. The errors, ε, therefore cannot be determined exactly.
In the next section, we will see how these quantities can be estimated.
Band containing about 95% of values
The 70-95-100 rule states that approximately 95% of values in any sample are within 2 standard deviations of the mean. In the context of a normal linear model, approximately 95% of the errors will therefore be within 2 standard deviations of zero — i.e. between −2σ and +2σ.
Since the errors are the vertical distances of points from the regression line, this means that...
Approximately 95% of the crosses will be within ±2σ of the regression line (vertically).
There is therefore a band extending 2σ on each side of the regression line that contains approximately 95% of the crosses on a scatterplot of the data.
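As an informal check of this claim, the following Python sketch simulates a large sample from a normal linear model and counts the fraction of points inside the ±2σ band. The parameter values (β0 = 2, β1 = 0.5, σ = 0.8) are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameter values, chosen only for illustration
beta0, beta1, sigma = 2.0, 0.5, 0.8

x = rng.uniform(0, 10, size=10_000)
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)

# Fraction of simulated points inside the band extending
# 2*sigma on each side of the regression line
mu_y = beta0 + beta1 * x
inside = np.abs(y - mu_y) <= 2 * sigma
print(f"fraction inside the band: {inside.mean():.3f}")  # roughly 0.95
```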
Example
The diagram below shows a normal linear model whose error standard deviation is σ = 0.8.
The blue regions in the tails of the normal probability density function are more than 2σ (i.e. 1.6 for this model) on each side of µy. They contain approximately 5% of the normal distribution's area, so about 95% of y-values sampled from this distribution will be within these bounds. This is true for each X — drag the slider to verify — so approximately 95% of sampled values will lie in the grey band on the x-y plane.
Click Take sample a few times to verify that approximately 95% of values are within the grey band.
Finally, click the button at the top right of the diagram to look down on the x-y plane. In later pages, we will represent a normal linear model with a 2-dimensional diagram of this form.
Three parameters of the normal linear model
A normal linear model,

y = β0 + β1x + ε,    where ε ~ normal(0, σ),

involves 3 parameters, β0, β1 and σ. These parameters give the model considerable flexibility.
Drag the three red arrows to adjust the parameters of the normal linear model.
Click Take sample a few times to verify that approximately 95% of values are within the grey band.
(Note that the values of X are not fixed in this example — they vary from sample to sample. The normal linear model does not attempt to describe variability in X, though a standard univariate distribution such as a normal distribution might fit the distribution of X in this example.)
Interpreting the model's slope and intercept
The most important parameters of a linear model are its slope, β1, and intercept, β0. These can be interpreted in a similar way to the slope and intercept of the least squares lines that were fitted to data in an earlier chapter.
| Context | Interpretation of β1 | Interpretation of β0 |
|---|---|---|
| Y = sales of music CD ($), X = money spent on advertising ($) | Increase in mean sales for each extra dollar spent on advertising | Mean sales if there was no advertising |
| Y = exam mark, X = hours of study by student before exam | Increase in expected mark for each additional hour of study | Expected mark if there is no study |
| Y = hospital stay (days), X = age of patient | Average extra days in hospital per extra year of age | Average days in hospital at age 0 (not particularly meaningful here) |
Estimating the parameters by eye
The regression line (i.e. the straight line showing how the mean depends on x) and the band that is 2σ above and below it are a good way to understand the normal linear model. Indeed they can be used as an informal way to estimate the parameters of the model 'by eye'. (We will give better methods in the next section.)
Artificial data
A normal linear model might be used to describe how the response depends on the explanatory variable.
In the next section, we will explain how to objectively estimate the parameters to match a data set. Click Best values to see these 'best' parameter values.
If there are fewer values or if the relationship is weaker, it is harder to position the band by eye.
Auction price for grandfather clocks
The data set below shows the auction prices of 32 grandfather clocks and their ages (years). The data might be used to predict the sale price of another clock from its age.
A normal linear model might be used to describe how auction price depends on age.
In the next section, we will explain how to objectively estimate the parameters to match a data set.
Least squares
In practical situations, the three parameters of the normal linear model, β0, β1 and σ, are unknown values — all that we have available is a single data set that we believe comes from a model of this form. Although we cannot hope to determine the values of these unknown parameters exactly, we can obtain estimates of them from the data.
We previously examined bivariate data of this form and fitted a line by least squares. The slope and intercept of the least squares line are estimates of the slope and intercept of the regression line.
The best estimates of β0 and β1 are the intercept and slope of the least squares line, b0 and b1
Since b0 and b1 are functions of a data set that we assume to be a random sample from the normal linear model, b0 and b1 are themselves random quantities — they would be different if a different data set was collected.
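In software, the least squares estimates are a one-line computation. The sketch below uses numpy.polyfit on a small made-up data set; with degree 1, polyfit returns the slope and intercept of the least squares line.

```python
import numpy as np

# A small made-up bivariate sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])

# With degree 1, numpy.polyfit returns the least squares
# slope (b1) first, then the intercept (b0)
b1, b0 = np.polyfit(x, y, 1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```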
Auction price for grandfather clocks
In practice, only a single data set is available. The scatterplot below shows the auction prices for a sample of grandfather clocks and their ages.
Our 'best guesses' for β0 and β1 are the least squares estimates shown in the blue equation.
Variability of the least squares slope and intercept
The diagram below represents a normal linear model. (The band extends 2σ above and below the regression line, which shows how µy depends on X.)
Click Take sample a few times to generate different data from the model. Observe the variability of the least squares lines fitted to these data sets.
The two parameter estimates (the values in the blue equation) are usually close to the model values (in the top equation), but they vary from sample to sample.
The sample-to-sample variability of the least squares estimates means that the least squares slope and intercept in the grandfather clock data are unlikely to be exactly equal to the underlying β0 and β1.
Errors and residuals
We observed earlier that the error, ε, for any data point is its vertical distance from the regression line.
In practice, the slope and intercept of the regression line are unknown, so the errors are also unknown values. However just as the least squares line gives estimates of β0 and β1, the least squares residuals provide estimates of the unknown errors.
The residuals are therefore estimates of the unknown errors.
Estimating the error standard deviation
The third unknown parameter of the normal linear model, σ, is the standard deviation of the errors, ε.

A sensible estimate of σ is therefore the sample standard deviation of the residuals,

√( Σ (ei − ē)² / (n − 1) )

It can be proved mathematically that the least squares residuals always have mean zero (ē = 0), so this formula is equivalent to

√( Σ ei² / (n − 1) )

Unfortunately, this estimate tends to be a little too low, and a better estimate is

s = √( Σ ei² / (n − 2) )
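The sketch below computes this estimate on a made-up data set (the numbers are assumptions, not from the example that follows). Note the divisor n − 2.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)          # least squares residuals (these sum to zero)

n = len(x)
s = np.sqrt(np.sum(e ** 2) / (n - 2))   # divisor n - 2, not n - 1
print(f"estimate of sigma: s = {s:.3f}")
```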
Volume of wood from trees
The value of a hardwood tree depends on the volume of timber that can be harvested from it. However the volume of timber cannot be measured easily before a tree is cut down, so forestry managers must estimate it from other measurements that are easier to make. A common measurement is the diameter of the tree at breast height, 4.5 feet above ground level. Data were obtained from 31 black cherry trees that were harvested in the Allegheny National Forest in Pennsylvania. The volume of timber (cubic feet) is plotted against the area at breast height (square inches, determined from the diameter).
The residuals from the grey line on the scatterplot are shown in a jittered dot plot on the right. Drag the line (by moving the red arrows) to make the residuals small.
Click Least squares to show the least squares line (and hence the best estimates of β0 and β1). The best estimate of σ is found from the least squares residuals and is shown on the bottom right.
Distribution of the least squares slope and intercept
The least squares estimates b0 and b1 of the two linear model parameters β0 and β1 vary from sample to sample. Each has a distribution that can be described by a probability density function.
The least squares estimates, b0 and b1, have normal distributions that are centered on β0 and β1 respectively.
Sampling variability of the least squares line
The diagram below shows a normal linear model and a data set that is sampled from this model. The least squares line is shown in blue.
Click Take sample a few times to observe the variability in the least squares lines.
Click the checkbox Accumulate then click Take sample about 10 times. The variability of the least squares lines is shown on the right. (Click on any line to show the sample to which it belongs.)
Sampling distributions of b0 and b1
We have not yet seen ways to describe the variability of complex summaries such as least squares lines. It is much easier to describe the separate distributions of b0 and b1.
Click the checkbox Accumulate then click Take sample several times. The variability in each parameter estimate is shown in a stacked dot plot. (Click on any cross on the right to show the sample to which it belongs.)
Each parameter estimate has a univariate distribution. Click the checkbox above to superimpose its theoretical normal distribution on each stacked dot plot.
How accurate is the least squares estimate of the slope?
In the previous page, we explained that the least squares slope, b1, has a normal distribution with mean β1. When b1 is used as an estimate of β1, the estimation error, b1 − β1, therefore has a normal distribution centred on zero.
The standard deviation of this distribution describes the likely size of the estimation error.
It can be shown mathematically that

sd(b1) = σ / (√n × sx)

where sx is the standard deviation of the explanatory variable, X.
This formula is not important enough to warrant remembering, but we will use it later to explain some properties of the estimate.
Since σ is unknown, the above formula for the standard deviation of b1 cannot be evaluated. However we can approximate it by replacing σ with its estimate, s, from the residuals:

se(b1) = s / (√n × sx)
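Here is a minimal sketch of this calculation on made-up data. One detail worth flagging: np.std uses divisor n, which is what makes √n × sx equal to √Σ(x − x̄)², the quantity the exact formula requires.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)
s = np.sqrt(np.sum(e ** 2) / (n - 2))   # estimate of sigma

# np.std uses divisor n, so sqrt(n) * sx equals sqrt(sum((x - xbar)**2))
sx = np.std(x)
se_b1 = s / (np.sqrt(n) * sx)
print(f"se(b1) = {se_b1:.4f}")
```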
Examples
For each data set below, the least squares estimate of the slope is shown. The distribution of the error in this estimate, (b1 − β1), is evaluated on the right.

The normal distribution in the bottom right describes how far the least squares estimate is likely to be from β1.
Confidence interval for the slope
The slope of the least squares line, b1, is a good estimate of the normal linear model's slope, β1, and the error in this estimate has a normal distribution,

b1 − β1 ~ normal(0, sd(b1))

The estimate b1 has probability 0.95 of being within 1.96 standard deviations of β1, suggesting a 95% confidence interval of the form

b1 ± 1.96 × sd(b1)
Unfortunately the standard deviation of b1 depends on σ and therefore cannot be determined exactly. However we can obtain an approximation by replacing σ with its estimate, s, from the residuals.
If this approximation is used, the constant 1.96 must be replaced by a larger value, tn-2, which is obtained by looking up t-tables with (n - 2) degrees of freedom.
A 95% confidence interval for the slope is

b1 ± tn−2 × se(b1)
Most statistical software will evaluate b1 and its standard error for you when you fit a normal linear model, so it is fairly easy to evaluate the confidence interval in practice — you will not need to use any of the formulae above!
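For readers who want to check the arithmetic themselves, here is a minimal sketch using scipy.stats.linregress (which reports the slope and its standard error) together with the t quantile. The data are made up.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9])
n = len(x)

res = stats.linregress(x, y)             # b1 = res.slope, se(b1) = res.stderr
t_crit = stats.t.ppf(0.975, df=n - 2)    # replaces 1.96 when sigma is estimated

lo = res.slope - t_crit * res.stderr
hi = res.slope + t_crit * res.stderr
print(f"95% CI for the slope: {lo:.3f} to {hi:.3f}")
```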
Tourist arrivals in Hawaii
Consider again the tourist arrival data for Hawaii between 1990 and 2002.
Since there are n = 13 data points, we look up t tables with 11 degrees of freedom to get the value 2.201. A 95% confidence interval for the slope is therefore

b1 ± 2.201 × se(b1)

which evaluates to the interval from 0.232 to 0.704 (million arrivals per year).

In words, we are 95% confident that tourism is increasing at a rate of between 232,000 and 704,000 arrivals per year.
Warning: It would be dangerous to extrapolate this trend many years into the future — a linear trend may not continue.
Properties of 95% confidence interval
Confidence intervals for a linear model's slope have the same properties as the confidence intervals that we examined earlier for population means and proportions.
Since the interval is evaluated from random sample data, it will vary from sample to sample. In 95% of such samples, the 95% confidence interval will include the true population slope, but in 5% of samples it will not.
We cannot tell whether or not our single data set is one of the 'lucky' ones.
Simulation
The diagram below shows a sample from a normal linear model in which the true value of β1 is 0.75. (In real data sets, β1 is an unknown value but, by simulating data from a situation where it is known, we can examine the accuracy of our estimates.)
On the right, the 95% confidence interval for β1 based on this data set is displayed. Click Take sample a few times to observe the variability in the confidence intervals.
Click Accumulate then take about 100 samples. You should observe that approximately 95% of the resulting confidence intervals include the true value of β1, 0.75.
The confidence intervals that do not include β1 are drawn in red. You may click on any interval to display the data set that produced it.
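The simulation described above is easy to reproduce in code. In the sketch below, β1 = 0.75 matches the simulation; the other parameter values and the sample size are assumptions. Roughly 95% of the intervals should cover the true slope.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# beta1 = 0.75 as in the simulation; the other values are assumptions
beta0, beta1, sigma = 2.0, 0.75, 1.0
n, n_sims = 20, 1000
x = rng.uniform(0, 10, size=n)
t_crit = stats.t.ppf(0.975, df=n - 2)

covered = 0
for _ in range(n_sims):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    res = stats.linregress(x, y)
    lo = res.slope - t_crit * res.stderr
    hi = res.slope + t_crit * res.stderr
    covered += lo <= beta1 <= hi

print(f"proportion of intervals covering beta1: {covered / n_sims:.3f}")
```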
What affects the accuracy of the least squares slope?
We gave a formula for the standard deviation of b1 earlier in this section. It can be rewritten as

sd(b1) = σ / (√n × sx)

where σ is the error standard deviation, n is the number of data points, and sx is the standard deviation of the x-values.
It is interesting to observe how these three quantities influence the accuracy of the least squares slope as an estimate of β1.
The standard error of the least squares slope, b1, is lowest when:
- the sample size, n, is large,
- the error standard deviation, σ, is small,
- the spread of x-values, sx, is large.
The first two influences on accuracy are not surprising but the third needs a little more thought.
Demonstration
The diagram below shows the distribution of the least squares slope for samples from a normal linear model.
Use the pull-down menu to alter the sample size. Observe that the spread of the distribution of b1 is lowest when the sample size is large.
Change the sample size back to 20, then adjust the response standard deviation. Observe that the spread of the distribution of b1 is lowest when the response standard deviation is small.
Change the response standard deviation back to a medium value, then adjust the spread of X. Observe that the spread of the distribution of b1 is lowest when the spread of X is high.
(Click Accumulate then take a few samples at any combination of the three characteristics to verify that the blue normal distributions are indeed correct!)
Implications for experimental design
There are important consequences when designing experiments that will generate regression data. In order to increase the accuracy of the estimate of the slope, the experiment should use as many runs as practical, and the x-values should be chosen with as large a spread as possible.

There is however a major problem when the spread of x-values is increased too much.
Beware nonlinearity
Although many relationships are acceptably linear over a limited range of x-values, at extreme x-values the relationship often becomes nonlinear. Although a good spread of x-values is desirable, the normal linear model is not appropriate if there is curvature. A compromise is needed.
Even when you have decided on a range of x-values that will be used in the experiment, it is important to avoid using only values at the two ends of this range, even though this maximises sx. Without intermediate values, it is impossible to assess whether the data are linear or not.
Does the response depend on X?
In a normal linear model, the response has a distribution whose mean, µy, depends linearly on the explanatory variable,

µy = β0 + β1x
If the slope parameter, β1, is zero, then the response has a normal distribution that does not depend on X.
If the slope is zero, there is no association between Y and X.
In experimental data where lurking variables have been avoided, we can further say that X does not affect Y.
Hypothesis test
This can be tested formally with a hypothesis test for whether β1 is zero. The methodology is similar to that for tests about a population mean or proportion and will be described in the rest of this section.
It is important to remember that a single data set can provide evidence about whether β1 = 0, but it usually does not allow a definite conclusion to be reached.
Model for the effect of price on sales of a New Zealand wine
We consider linear models for how the price of a popular New Zealand cabernet sauvignon red wine affects its sales in a supermarket chain, measured as a proportion of total red wine sales in a week. The relationship between price and sales will be nonlinear at high prices, but is expected to be reasonably linear within a price range of $12 to $20 per bottle.
Testing whether β1 is zero therefore tests whether price has any effect on sales.
The diagram below shows the same range of models, but allows us to see typical data from the models. These are data that might be observed if each of the 4 prices were tried in the supermarket chain for 4 separate weeks, randomised over a 16-week period.
The slider again allows the model's slope to be altered. Change the slope to zero (so that price has no effect on sales).
Click Take sample a few times to see typical experimental data from the model.
The least squares line usually has non-zero slope, so a single data set cannot immediately tell you whether β1 is zero.
Testing for zero slope
To assess whether the explanatory variable affects the response, we test the hypotheses

H0: β1 = 0
HA: β1 ≠ 0
The least squares slope from a sample, b1, is the obvious statistic to throw light on the value of β1, but b1 varies from sample to sample. We must therefore take account of its standard deviation to assess its distance from zero.
If we knew the error standard deviation (we don't!)
If we knew the value of σ, we could evaluate the standard deviation of b1,

sd(b1) = σ / (√n × sx)

This could be used to standardise b1,

standardised value, z = b1 / sd(b1)
and z would have a standard normal distribution (mean 0 and sd 1) if β1 is really zero (H0). The p-value for the test would therefore be the probability of getting a value from the standard normal distribution that is as far from zero as the z-value that you evaluated from your data.

Test statistic in practice
Unfortunately σ is usually unknown and the standard deviation of b1 must be estimated from the sample data. We therefore use a test statistic of the form

t ratio, t = b1 / se(b1)
The p-value
The only change to the method when using the test statistic t rather than z is that a t distribution with n - 2 degrees of freedom must be used to obtain the p-value instead of the standard normal distribution.
The p-value is interpreted in the same way as for other hypothesis tests.
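As a sketch of how this test works in practice, scipy.stats.linregress reports both the slope's standard error and the two-tailed p-value for H0: β1 = 0. The data below are made up.

```python
import numpy as np
from scipy import stats

# Made-up data; a real analysis would read these from a file
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.1, 3.4, 3.0, 4.1, 3.9])

res = stats.linregress(x, y)
t_ratio = res.slope / res.stderr
print(f"t = {t_ratio:.3f}")
# res.pvalue is the two-tailed p-value for H0: beta1 = 0, based on
# a t distribution with n - 2 degrees of freedom
print(f"p-value = {res.pvalue:.4f}")
```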
Interpretation of p-value
Consider a data set with least squares slope b1 and a corresponding p-value of 0.0023. The p-value tells us that the probability of getting a least squares slope as far from zero as b1 would be only 0.0023 if H0 was true (i.e. if Y and X were not related). Since this is very unlikely, the data give strong evidence that the linear model slope is not zero and therefore that the response is related to the explanatory variable.
Similarly, if we calculate that the p-value for b1 is 0.4, this tells us that a least squares slope as far from zero as b1 would occur with probability 0.4, even if Y and X were not related. Since this is fairly high, our conclusion should be that there is no reason to doubt the null hypothesis — there is no evidence of a relationship between the response and explanatory variables.
Examples
The following examples show hypothesis tests for a few data sets and the conclusions that are reached.

Strength of a relationship
The strength of the relationship between two variables, X and Y, is usually summarised by their correlation coefficient, r.
When the data are sampled from some population, there is a corresponding underlying population correlation coefficient, ρ, that r approximates. (ρ is the Greek letter r, pronounced 'rho'.) The sample correlation coefficient r is an estimate of the unknown population parameter, ρ.
As the sample size increases, r becomes a more accurate estimate of ρ, but its distribution is always centred near ρ.
The size of the correlation coefficient is therefore not dependent on the sample size.
Strength of evidence for a relationship
It is important to distinguish between the correlation coefficient, r, and the p-value for testing whether there is a relationship between X and Y.
- The correlation coefficient, r, describes the strength of the relationship in the data.
- The p-value describes the strength of the evidence that the two variables are related at all.
It is important not to confuse these two values when interpreting the p-value for a test.
The interpretation of the p-value is helped by giving an alternative formula for the test statistic,

t = b1 / se(b1)

This can be rewritten in terms of the correlation coefficient, r, and the sample size, n, as

t = r √(n − 2) / √(1 − r²)
Since the p-value describes how far t is from zero,
The p-value depends on both r and the sample size, n.
If r stays the same, the test statistic becomes further from zero (and the p-value for the test therefore becomes smaller) as the sample size increases.
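The sketch below evaluates this formula directly, reusing the correlation r = 0.24 from the examples that follow. The sample sizes are assumptions chosen to show the effect.

```python
import numpy as np
from scipy import stats

def p_value_from_r(r, n):
    # t statistic expressed in terms of the correlation coefficient
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed p-value

# Same correlation, increasing sample size: the p-value shrinks
for n in (30, 100, 1000):
    print(f"r = 0.24, n = {n:4d}: p = {p_value_from_r(0.24, n):.4f}")
```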
The following examples illustrate how both the sample size and the correlation coefficient affect the p-value.
In the data set at the top left, there is no evidence of a relationship between the variables — a correlation coefficient of 0.24 could easily have arisen by chance even if the variables were not related at all.

The data set on the top right has the same correlation coefficient, 0.24. However its sample size is much higher and a correlation coefficient this far from zero is now very unlikely, so the p-value is small. There is almost certainly a relationship between X and Y, even though the relationship is weak.
With a sample size of 30, the relationship would need to be stronger for us to detect it. The data set on the bottom left shows that r = 0.63 gives strong evidence of a relationship with a data set of this size.
Properties of p-value
P-values for testing whether a linear model's slope is zero have the same properties as p-values for other hypothesis tests. In particular:
- if β1 = 0, the p-value is equally likely to be anywhere between 0 and 1,
- if β1 ≠ 0, the p-value tends to be closer to zero.
Simulation of distribution of p-values
The diagram below builds up the distribution of p-values for testing whether the slope is zero.
With β1 = 0, click Take sample several times and verify that the p-values are rectangularly distributed between 0 and 1.
(Click on any p-value to see the data set that gave rise to it.)
Change the linear model slope to β1 = 0.5, then take several more samples. The p-values tend to be closer to zero.
Repeat with β1 = 1.0.
As shown by the above simulation, when Y and X are not related (β1 = 0), it is still possible to get small p-values, suggesting that β1 is not zero. However there is only probability 0.01 of getting a p-value as low as 0.01 — it is unlikely but possible. Such a p-value is more likely if the variables are related, so we interpret it as giving strong evidence of a relationship.
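This mirrors the simulation described above and is easy to reproduce: with β1 = 0, the p-values should be roughly uniform, so about 5% fall below 0.05 and about 1% below 0.01. All parameter values below are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, n_sims = 20, 2000
x = rng.uniform(0, 10, size=n)

# beta1 = 0, so Y is unrelated to X
pvals = np.array([
    stats.linregress(x, 2.0 + rng.normal(0.0, 1.0, size=n)).pvalue
    for _ in range(n_sims)
])

print(f"P(p < 0.05) = {(pvals < 0.05).mean():.3f}")   # about 0.05
print(f"P(p < 0.01) = {(pvals < 0.01).mean():.3f}")   # about 0.01
```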
Model allows us to estimate response distribution at any X
Bivariate data sets contain response measurements corresponding to a few specific values of X, whereas a normal linear model provides a response distribution for all X. By fitting a normal linear model to the data, it is therefore possible to estimate the response distribution at x-values for which we do not have data.
Tree diameter and timber volume
The value of hardwood trees depends on the volume of timber that can be obtained when the trees are harvested. However the volume of timber cannot be easily measured when the tree is standing, so volume is usually estimated from measurements that are easier to make, such as the tree diameter 4.5 feet above ground level. The diagram below plots the cross-sectional area at this height against the volume of timber for 31 black cherry trees that were harvested in the Allegheny National Forest in Pennsylvania.
The relationship seems reasonably linear, so we will try to fit a normal linear model to the data. The least squares line is shown in the diagram below with the grey band representing ± twice the estimate of σ.
Drag the slider to display the estimated normal distribution of the volume of timber obtained from a tree of any cross-sectional area.
The mean of this estimated distribution (i.e. the least squares line) provides a prediction of the volume of timber for a different black cherry tree of any cross-sectional area.
What affects the accuracy of a prediction?
Since the predicted response at X,

b0 + b1x,

depends on the least squares estimates, b0 and b1, it also varies from sample to sample. The prediction has a normal distribution whose mean is the value on the true regression line,

µy = β0 + β1x

The standard deviation of the prediction describes its likely distance from this underlying population value. It depends on:
- the sample size, n,
- the error standard deviation, σ,
- how far x is from the mean of the x-values in the data.
Predictions are least variable (most accurate) when predicting at an x-value near the mean of the 'training' data.
The diagram below shows a sample from a normal linear model and the least squares line that is fitted to these data.
Click Accumulate, then take approximately 20 further samples. The variability of the least squares lines is shown on the right.
Now drag the slider on the right to expand the scales in the diagrams. Observe that the least squares lines (and hence the predictions that are made from them) are least variable near the centre of the data, but become increasingly variable as you extrapolate from the data.
The next diagram concentrates on the errors that result from using the estimate, b0 + b1x, of the true mean response, µy = β0 + β1x.
Click Accumulate, then take about 50 further samples. The jittered dot plot on the right shows the distribution of the errors that are obtained when using a least squares line to estimate the mean response at X.
(Click on any cross in this plot to see the data set that gave rise to it.)
Drag the slider to observe the distribution of the errors at other x-values. Observe that the errors are least variable when predicting near x = 2.5.
Finally, click the checkbox below to display the theoretical distribution of the errors and again drag the slider to adjust the value of X.
Estimating mean volume of timber
In the timber volume example, the timber volume obtained from harvested trees was related to the cross-sectional area at breast height. The manager of a forest would be interested in estimating the mean timber volume that could be obtained from trees with any particular cross-sectional area, µy = β0 + β1x, using the least squares estimate, b0 + b1x.
Since both b0 and b1 become less variable (and hence more accurate estimates of β0 and β1) as the sample size increases,
The estimate of the mean timber volume also becomes increasingly accurate as the sample size increases.
Predicting timber volume from a single tree
In contrast, the manager might want to predict the timber volume that would be obtained from a single tree with cross-sectional area x ft². The same value as above, b0 + b1x, would be used for this prediction.
However, no matter how accurately we estimate the mean volume from trees with this cross-sectional area, the single tree will also have a distribution with standard deviation σ around this mean. As a result, the errors in predicting the volume from a single tree will be greater.
The distribution of the prediction error cannot have a standard deviation that is less than σ.
Difference between estimating a mean and predicting a new value
We will perform a simulation from a normal linear model with β0 = 3.3 and β1 = 0.75. Data from the model will be used to estimate the mean response when X = 5.5 and also to predict a new individual's response value at this x-value. The same value, b0 + b1x, is used both for estimation and prediction, but the error is different in the two situations.
The true mean response is 7.43. (We can evaluate this since we know the values of β0 and β1 in the simulation — in practice we would not be able to determine the mean response.) The top half of the diagram shows the error in estimating this from the least squares line.
The bottom half of the diagram shows the error from predicting a new response value at X = 5.5.
Click Accumulate then take several samples from this linear model. Observe that the prediction error has greater spread than the estimation error at the top.
Use the pop-up menu to increase the sample size to 210. Observe that the error in estimating the mean becomes very small, but the prediction error is still quite large. Although we can estimate the mean response accurately, we have no information about how far the new value will be from this.
(In practice, it would be unwise to estimate or predict at X = 5.5 since the highest x-values in the data are about 4 — we are not sure that the relationship will remain linear at high X. However it makes the diagram above clearer.)
Interval estimates
In the previous page, we showed that the same value, b0 + b1x, is used both to estimate the mean response at x and to predict a new individual's response at x. However the errors are different in the two situations — they tend to be larger when predicting a new value.
In both situations, it is more informative to give an interval of 'likely' values rather than a single value.
Estimating mean response
A 95% confidence interval for the mean response at x takes the form

(b0 + b1x) ± tn−2 × se

We do not give a formula for the standard error, se — the details are not important and statistical software will tell you the value.
Predicting a new individual's response
For prediction, a similar interval can be found,

(b0 + b1x) ± tn−2 × k

where the value k is greater than the corresponding standard error for the confidence interval. (Again we do not provide a full formula. Statistical computer software will perform the calculations for you.)
A 95% prediction interval is wider than a 95% confidence interval for the distribution's mean.
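Although the section leaves the formulas to software, the textbook-standard expressions are straightforward to code. The sketch below computes both intervals on made-up data; the only difference between them is the extra "1 +" inside the square root for prediction, which accounts for the new individual's own variability.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.1, 3.4, 3.0, 4.1, 3.9])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)
s = np.sqrt(np.sum(e ** 2) / (n - 2))
sxx = np.sum((x - x.mean()) ** 2)
t_crit = stats.t.ppf(0.975, df=n - 2)

x_new = 3.5
y_hat = b0 + b1 * x_new
se_mean = s * np.sqrt(1 / n + (x_new - x.mean()) ** 2 / sxx)   # mean response
k = s * np.sqrt(1 + 1 / n + (x_new - x.mean()) ** 2 / sxx)     # new individual

print(f"95% confidence interval: {y_hat - t_crit * se_mean:.2f}"
      f" to {y_hat + t_crit * se_mean:.2f}")
print(f"95% prediction interval: {y_hat - t_crit * k:.2f}"
      f" to {y_hat + t_crit * k:.2f}")
```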
Extrapolation
These 95% confidence intervals and 95% prediction intervals are valid within the range of x-values about which we have collected data, but they should not be relied on for extrapolation. Both intervals assume that the normal linear model describes the process, but we have no information about linearity beyond the x-values that have been collected.
From a scatterplot, we can check that there is approximate linearity within the observed range of x-values but there is no way to check linearity beyond the observed data.
Inertia welding experiment
Manufacturers use inertia welding to join different metals that cannot easily be joined by other means (e.g. aluminium to steel). One part of a workpiece is attached to a flywheel that is rotated at speed and forced into contact with another piece that is restrained from rotating. The heat generated by friction at the interface produces a hot-pressure weld.
The diagram below shows data from an inertia welding experiment. The two variables in the experiment were the velocity (ft per minute) of the rotating workpiece and the breaking strength of the weld.
Drag the slider to display the 95% confidence interval and 95% prediction interval at different velocities.
Because of uncertainty about linearity of the relationship, the confidence intervals and prediction intervals are unreliable if X is greater than 3.0 or less than 2.0.
Do the normal linear model assumptions hold?
Although a normal linear model is often used to describe how an explanatory variable, X, affects the distribution of a response, Y, it is not a suitable model for all bivariate data.
In particular, the following four requirements are implicit in the model but may be violated.
- Linearity. In some data sets, the response mean does not change linearly with X; the relationship is then called nonlinear. In the diagram on the right, the response levels off as X increases, so a normal linear model is not appropriate.
- Constant standard deviation. Sometimes the response standard deviation is different at different values of X. In the diagram on the right, the variability of the response is higher at large values of X.
- Normal distribution for errors. Sometimes the distribution of the response (at any value of X) is skew, or differs in shape from a normal distribution in other ways. In the diagram on the right, the response has a skew distribution with occasional very large values.
- Independent errors. All observations (and hence all errors) are assumed to be independently obtained. When the observations are ordered in time, successive errors may be correlated, with big values tending to be followed by other big values. This is most commonly seen when the explanatory variable is time — i.e. when using a linear model to fit the trend in a time series. In the diagram on the right, crosses on one side of the least squares line are often followed by other crosses on the same side.
Residual plots
The above problems may be evident in a scatterplot of the raw data, but a residual plot often highlights any problems.
Examples
The first example below shows a data set that satisfies the assumptions for a normal linear model.
Observe that the plot of residuals against X is a horizontal band of constant width. (Click on any point to see how the residual relates to the plot of the raw data on the left.)
Select other data sets from the pop-up menu at the top. These are data sets for which different linear model assumptions are violated. Observe how the problems are reflected in the residual plots.
In the remainder of this section, we will look in more detail at the four assumptions underlying the normal linear model.
Linearising the relationship between Y and X
Even when two variables, X and Y, are nonlinearly related, applying a nonlinear transformation to one or other of the variables can sometimes linearise the relationship. In other words, some transformation of X may be linearly related to some transformation of Y.
For example, it may happen that y² is linearly related to log(x), satisfying the model

y² = β0 + β1 log(x) + ε,    where ε ~ normal(0, σ)
The parameters of this model could again be estimated by least squares, based on the transformed values of the two variables, and confidence intervals and hypothesis tests would be valid.
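A minimal sketch of this idea, using the hypothetical y² against log(x) relationship mentioned above (the data values are made up): transform both variables, then fit by ordinary least squares exactly as before.

```python
import numpy as np

# Made-up data in which y**2 is roughly linear in log(x)
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
y = np.array([1.1, 1.7, 2.1, 2.4, 2.8, 3.0])

# Transform both variables, then fit by ordinary least squares as usual
b1, b0 = np.polyfit(np.log(x), y ** 2, 1)
print(f"y^2 is approximately {b0:.2f} + {b1:.2f} * log(x)")
```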
Transformation of X
In the following example, only transformations of the explanatory variable, X, will be considered.
Vitamin B and weight gain of rats
The following data set was obtained from an experiment in which 18 rats were given diets containing different quantities of riboflavin (vitamin B2). The doses used in the experiment were 2.5, 5, 10 and 20 µg per day and the weight gains of the rats (grams) were recorded over a period of 4 weeks. The relationship between weight gain and dose is nonlinear — weight gains seem to be less affected by increasing the dose once it is over 10 µg per day.
Drag the red line on the horizontal axis towards the right to apply a power transformation to the dose. Observe that a log transformation (between a power of 0.01 and -0.01) linearises the relationship reasonably well. (Use the arrow keys on the keyboard to make fine adjustments to the power.)
A normal linear model explaining weight gain in terms of log(dose) is therefore reasonable. Note that a linear model between weight gain and log(dose) implies a nonlinear model between weight gain and dose.
Again drag the red line to apply a log transformation to the dose of vitamin B2. The least squares line is drawn on the diagram on the left and its equation is shown below. The diagram on the right shows this equation on the original untransformed axes; observe that it is curved.
In the next page, we will examine how transformations of the response, Y, may also be used when there is curvature.
Transformations and the error standard deviation
In a scatterplot of Y against X, transforming X moves the crosses horizontally, but does not affect the spread of response values at each value of X.
If the error standard deviation is the same for each x in a plot of Y against X, it will also be constant in a plot of Y against any transformation of X.
Transformation of X therefore does not affect whether or not the linear model's assumption of constant error standard deviation holds.
However,
Transformation of the response, Y, not only affects linearity of the relationship, but also affects whether or not the error standard deviation is constant.
This is more easily explained in an example than with words.
Prices of second-hand Mazda cars
The scatterplot below shows the retail prices of 124 Mazda cars, obtained from the newspaper The Melbourne Age on 8 February 1992. The grey line is the least squares line fitted to the data.
The residual plot on the right highlights two problems. Firstly there is clearly nonlinearity — the prices level off at high ages. Also, the standard deviation of the price is much lower for cars over 10 years old — there is non-constant error standard deviation.
Drag the red line on the vertical axis upwards to apply a power transformation to the price. Observe that a log transformation (between a power of 0.01 and -0.01) both linearises the relationship reasonably well and also gives residuals with fairly constant spread. (Use the arrow keys on the keyboard to make fine adjustments to the power.)
Fortunately, the same transformation of the response that linearises the relationship often also results in fairly constant error standard deviation.
Point prediction using transformed variables
If a transformation of Y follows a normal linear model with an explanatory variable that is a transformation of X, a least squares line that is fitted to the transformed data is used for predictions.
To obtain a prediction of Y at any value x, first use the least squares line to predict the transformed response at the (transformed) x-value, then apply the inverse transformation to the result.
For example, if the square root of Y is linearly related to X, we use the least squares line to obtain a prediction of sqrt(Y), then square this to get a prediction of Y itself.
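The square-root example just described looks like this in code (the data are made up so that √y is roughly linear in x):

```python
import numpy as np

# Made-up data in which sqrt(y) is roughly linear in x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 3.8, 9.5, 15.7, 24.1])

b1, b0 = np.polyfit(x, np.sqrt(y), 1)

x_new = 3.5
sqrt_y_pred = b0 + b1 * x_new   # prediction on the transformed scale
y_pred = sqrt_y_pred ** 2       # back-transform: square it to predict y itself
print(f"predicted y at x = {x_new}: {y_pred:.1f}")
```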
House prices in Palmerston North
The scatterplot below shows the sale prices ($thousand) and floor areas (square metres) from a sample of 143 houses that were sold in two suburbs of Palmerston North, New Zealand in 1999. All houses had been built in the previous 30 years.
The prices of large houses are more variable than those of small houses, suggesting that a transformed sale price might satisfy a normal linear model better.
After a log transformation of the sale prices, the error variance seems reasonably constant, but there is an indication of nonlinearity at high floor areas. A log transformation of the floor areas fixes this problem too.
A normal linear model for the transformed variables implies that

log10(price) = β0 + β1 log10(area) + ε

and this can be re-expressed in the form

price = a × (area)^b,    where a = 10^β0 and b = β1

so the model implies that prices are proportional to area raised to a power. (If b = 1, there is a constant price per m², on average — a hypothesis test can assess this.)
The transformed variables are shown on the diagram below with the least squares line, both on the transformed scatterplot and on a scatterplot of the original data.
Click on the scatterplot at any house area (or log-area) to see how log(price) is predicted using the least squares slope and intercept. From this, the price is predicted by raising 10 to this power — the inverse transformation to log10.
Prediction intervals
Prediction intervals can be obtained in a similar way.
For example, if the square root of Y is linearly related to X, we find a prediction interval for sqrt(Y), then square both ends of this interval to get a prediction interval for Y itself.
Although the prediction interval for the transformed Y has a similar width over the range of x-values in the data, the resulting prediction interval for Y itself may vary much more in width.
House prices in Palmerston North
The red band on the scatterplot of log(price) against log(area) on the left below shows 95% prediction intervals. Use the slider under the diagram to display the prediction interval for the price of a house of any area.
The scatterplot on the right shows the corresponding data and prediction intervals on a plot of the untransformed variables. Observe that the prediction interval for the price of a large house is much wider than that for a small house.
Outliers and errors
An outlier is a measurement that does not fit in with the pattern exhibited by the rest of the data. By definition, an outlier does not satisfy the normal linear model that fits the rest of the data, so it should be omitted from the analysis.
In a regression situation, an outlier corresponds to a large error, ε.
In a scatterplot, the point is unusually far above or below the regression line.
Standardised residuals
Unfortunately, in a real data set, the errors are unknown, so we must use the residuals from the least squares line as estimates of the errors. The residuals can be used in a similar way to give information about whether there is an outlier.
Hopefully the large error will correspond to an unusually large residual that will stand out from the distribution of the other residuals.
To help assess the residuals, it is common to standardise them — dividing by an estimate of the standard deviation of each. (The details of the standardisation are not important, but it is worth noting that the errors, ε, all have standard deviation σ. The residuals have a standard deviation that is a bit smaller than this.)
The standardised residuals are each approximately normally distributed with mean 0 and standard deviation 1 if the normal linear model fits. From the properties of the standard normal distribution, only about 5% of the standardised residuals will be outside the range ±2, and hardly any outside the range ±3. Most statistical software will evaluate standardised residuals for you when you fit a line by least squares and automatically report any outside these ranges.
Standardised residuals greater than 3 or less than −3 are often taken to indicate possible outliers.
It is worth remembering however that there is still a probability 0.003 that a value from the standard normal distribution will be outside ±3. In a data set of 1,000 values, it would therefore be expected that 3 values would be labelled as 'outliers' by this rule.
In large data sets, do not assume that standardised residuals outside ±3 must be outliers — values a little outside can also occur by chance.
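The sketch below shows one common way to standardise residuals for simple regression, dividing each residual by s√(1 − h), where h is the point's leverage. The data are made up, with a planted outlier; the leverage formula used here is the standard one for a straight-line fit.

```python
import numpy as np

# Made-up data with a planted outlier at x = 4
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.2, 2.8, 3.1, 9.0, 4.4, 4.8, 5.3])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)
s = np.sqrt(np.sum(e ** 2) / (n - 2))

# Leverage of each point; points with extreme x-values have high leverage
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# Standardised residuals: each residual divided by its estimated sd
std_resid = e / (s * np.sqrt(1 - h))
for xi, r in zip(x, std_resid):
    flag = "  <-- unusually large" if abs(r) > 2 else ""
    print(f"x = {xi}: standardised residual = {r:+.2f}{flag}")
```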
Standardised residuals when there are no outliers
The scatterplot below shows a data set that is sampled from a normal linear model. A plot of the standardised residuals is shown on the right.
Click Another data set a few times and observe that standardised residuals are occasionally outside ±2. When the sample size is increased, there are often some standardised residuals outside this range.
It is unusual for standardised residuals to be outside ±3 when the sample size is small, but even this is not uncommon when the sample size is large.
Standardised residuals would need to be outside ±3.5 or ±4 for us to be really confident that they are outliers.
Problems with residuals as indicators of outliers
All data points pull the least squares line towards themselves — the line is positioned to

minimise Σ ei² = Σ (yi − b0 − b1xi)²
Imagine the blue residuals below as rubber bands, all pulling the least squares line. The further a cross is from the line, the stronger its pull on the line.
Large residuals pull very strongly on the line since they are squared in the least squares criterion (the rubber band is extremely tight). As a result,
Outliers will strongly pull the least squares line towards themselves, making their residuals smaller than you might otherwise expect.
Leverage
This effect is strongest when the x-value of a point is very large or small. Using the analogy of rubber bands pulling the least squares line, points with extreme x-values have more leverage on the position of the least squares line.
If an outlier corresponds to a high-leverage point, its residual may therefore still be small.
Illustration
The scatterplot below shows a data set and the corresponding residuals.
The cross on the far right can be dragged with the mouse. Initially, the diagram shows what we would ideally have hoped to see in the residuals — the other points are close to a straight line, so if the final cross is dragged away from this line, we would have hoped that it would result in a large residual.
This is not what actually happens. Choose What you actually get... from the pop-up menu at the top and drag the point again. The least squares line is pulled towards the point, so when it is dragged away from the line followed by the other points, its residual is smaller than might be expected and the residuals for the other points are larger.
This is especially evident when the point being dragged has an x-value of around 4 — i.e. when it is a high leverage point. Drag it down to a y-value of about 40 and observe that its residual is no more extreme than those of the other points.
Do not rely on an extreme residual to tell you whether a high-leverage point is an outlier.
Normal errors
Another assumption in the normal linear model is that the model errors are normally distributed.
If the model holds, the least squares residuals will also be normally distributed, so a histogram of the residuals can be examined for normality.
Normal errors are the least important of the model assumptions. If the other assumptions hold, it is reasonable to continue with the analysis, even if the errors have a skew distribution.
Normal probability plot
A better way to graphically examine a data set for normality is with a normal probability plot of the residuals. As with other probability plots, if the residuals are from a normal distribution, the crosses in the normal probability plot should lie close to a straight line.
How much curvature is needed to suggest non-normality?
In some data sets, linearity or nonlinearity in the probability plot is clear. In practice however, the randomness of real data means that the probability plot will not be exactly straight even for values that are sampled from a normal population.
How much curvature is needed to conclude that the underlying distribution is not normal?
This is a difficult question to answer. There are formal tests of normality that can be used in conjunction with a probability plot. (We discussed one in an earlier chapter about hypothesis tests.) We however take a less formal approach in the example below.
Share prices and volume traded
The scatterplot below shows the volume of British Airways shares traded in each of the first 57 trading days of 2002 — between 2nd January and 21st March — and the closing share price. A probability plot of the residuals from the least squares line is also shown on the right.
There is curvature in the probability plot, suggesting that the error distribution is skew with a long tail towards the high values.
Could this amount of curvature have occurred by chance? Select Random Normal Data from the pop-up menu to generate random data from a normal linear model whose parameters β0, β1 and σ are the same as the least squares estimates from our data. Click Take Sample several times to see the variability in the probability plot when a normal linear model does hold.
The probability plot from the data seems more curved than the random ones, suggesting a problem with the model assumptions.
Warning
If the assumptions of linearity and constant variance are violated, or if there are outliers, the probability plot of residuals will often be curved, irrespective of the error distribution.
Only draw a probability plot if you are sure that the data are linear, have constant variance and have no outliers.
Share price and volume traded
In the British Airways example above, there appears to be greater variability in the volumes traded when the share price is high. This suggests that a transformation of the response might improve the fit of the model. The scatterplot below shows that a normal linear model would be a better description of the relationship between the logarithms of the volume traded and the share price — the distribution of points around the line is more symmetrical and there are no obvious problems with the other assumptions.

Independence of the errors
Although we have not stressed it earlier, an important assumption in the normal linear model is that the different errors are uncorrelated with each other. Occasionally some individuals are 'close' in a way that means an 'unusually' high response measurement will be associated with unusually high measurements in nearby individuals — their errors are correlated.
For example, in an experiment in a greenhouse, adjacent plants will be grown in similar conditions (light, moisture, air flow) so an unusually high growth rate for one plant may be associated with environmental conditions that also cause unusually high growth rates in adjacent plants.
Good experimental design tries to ensure that all experimental units are similar, but correlation between 'adjacent' errors is sometimes unavoidable.
Correlated errors and time series
Correlated errors are most common when the observations are made sequentially in time. There are often influences on the response that are not explicitly recorded or modelled but that change gradually over time resulting in successive errors being correlated. This is called serial correlation.
Serial correlation is especially common when modelling time series — i.e. when we are using a normal linear model with time as the explanatory variable.
Assessing serial correlation
Strong serial correlation may be visible in a plot of residuals against time, but it is often difficult to assess whether the pattern could have arisen by chance. A test statistic called the Durbin-Watson statistic is often used to assess whether there is serial correlation. Writing the successive residuals as e1, e2, ..., en, this statistic is defined as:

d = ( (e2 − e1)² + (e3 − e2)² + ... + (en − en−1)² ) / ( e1² + e2² + ... + en² )
When the serial correlation is high, successive residuals will be similar, their differences will be small and the test statistic will be close to zero.
The p-value for the test based on the Durbin-Watson statistic is the probability of getting such a low value of d when there is no serial correlation. An approximate p-value can be obtained from special statistical tables, but it can also be determined with a simulation, as described in the example below.
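The statistic itself is a two-line computation. The sketch below applies it to simulated residuals (both series are made up): for independent errors d is typically near 2, while strongly correlated errors, such as a random walk, give a value near zero.

```python
import numpy as np

def durbin_watson(e):
    # d = sum of squared successive differences / sum of squared residuals
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(4)

# Independent errors: d is typically near 2
e_independent = rng.normal(size=100)
print(f"independent errors: d = {durbin_watson(e_independent):.2f}")

# Strongly serially correlated errors (a random walk): d is near 0
e_correlated = np.cumsum(rng.normal(size=100))
print(f"correlated errors:  d = {durbin_watson(e_correlated):.2f}")
```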
World rice production
The time series below shows the total world rice production (million tonnes) between 1961 and 2001.
The scatterplot on the left shows the data and a least squares line. The residuals on the right do not indicate any problems with curvature or non-constant variance.
To assess whether there is serial correlation in the errors, the Durbin-Watson statistic has been evaluated. Is a value of 1.291 indicative of serial correlation?
Select Random Normal Data from the pop-up menu. This shows a randomly generated data set from a linear model with β0, β1 and σ the same as the least squares estimates from the actual time series.
Click Accumulate, then click Simulate about 100 times to build up the sampling distribution of the Durbin-Watson statistic. You should see that a value as low as 1.291 is very unlikely for data like this from a normal linear model, so there is strong evidence that successive errors are correlated.
There is strong evidence that years with unusually high (or low) yield tend to follow each other.
Warning
If a linear model is used for a time series, but the relationship is actually nonlinear, successive residuals tend to be similar and the Durbin-Watson statistic will again be small.
An unusually small Durbin-Watson statistic can be caused by either serial correlation or nonlinearity.
The test only suggests serial correlation if you are sure that the data are linear.