
Chapter 10   Testing Hypotheses

10.1   Introduction to hypothesis tests

  1. Inference
  2. Soccer league simulation
  3. Simulation to test a proportion
  4. Test for a mean
  5. Randomisation tests
  6. Randomisation test for correlation
  7. Common patterns in tests

10.1.1   Inference

Statistical inference

The term statistical inference describes statistical techniques that obtain information about a population parameter (or parameters) based on a single random sample from that population. There are two different but related types of question about the population parameter (or parameters) that we might ask:

What parameter values would be consistent with the sample data?

This branch of inference is called estimation and its main tool is a confidence interval. We described confidence intervals in the previous chapter.

A manufacturer of muesli bars needs to describe the average fat content of the bars (the mean of the hypothetical population of fat contents that would be produced using the recipe). Several bars are analysed and their fat contents are measured.

The sample mean is a point estimate of the population mean, and a 95% confidence interval can also be found.

Are the sample data consistent with some statement about the parameters?

This branch of inference is called hypothesis testing and is the focus of this chapter.

A particular brand of muesli bar is claimed by the manufacturer to have a fat content of 3.4g per bar. A consumer group suspects that the manufacturer is understating the fat content, so a random sample of bars is analysed.

The consumer group must assess whether the data are consistent with the statement (hypothesis) that the underlying population mean is 3.4g.

Errors and strength of evidence

When we studied parameter estimation, we saw that a population parameter cannot be determined exactly from a single random sample — there is a 5% chance that a 95% confidence interval will not include the true population parameter.

In a similar way, a single random sample can rarely provide enough information about a population parameter to allow us to be sure whether or not any hypothesis about that parameter will be true. The best we can hope for is an indication of the strength of the evidence against the hypothesis.

The remainder of this chapter explains how this evidence is obtained and reported.

10.1.2   Soccer league simulation

Randomness in sports results

Although we like to think that the 'best' team wins in sports competitions, there is actually considerable variability in the results. Much of this variability can be considered to be random — if the same teams play again, the results are often different. The most obvious examples of this randomness occur when a series of matches is played between the same two teams.

Since the teams are virtually unchanged in any series, the variability in results can only be explained through randomness.

Randomness or skill?

When we look at sports results, can we tell whether all teams are equally matched with the same probability of winning? Or do some teams have a higher probability of winning than others?

There are different ways to examine this question, depending on the type of data that is available. The following example assesses an end-of-year league table.

English Premier Soccer League, 2008/09

In the English Premier Soccer league, each of the 20 teams plays every other team twice (home and away) during the season. Three points are awarded for a win and one point for a draw. The table below shows the wins, draws, losses and total points for all teams at the end of the 2008/09 season.

 
      Team                 Wins   Draws   Losses   Points
  1.  Manchester_U          28      6       4        90
  2.  Liverpool             25     11       2        86
  3.  Chelsea               25      8       5        83
  4.  Arsenal               20     12       6        72
  5.  Everton               17     12       9        63
  6.  Aston_Villa           17     11      10        62
  7.  Fulham                14     11      13        53
  8.  Tottenham             14      9      15        51
  9.  West_Ham              14      9      15        51
 10.  Manchester_C          15      5      18        50
 11.  Wigan                 12      9      17        45
 12.  Stoke_City            12      9      17        45
 13.  Bolton                11      8      19        41
 14.  Portsmouth            10     11      17        41
 15.  Blackburn             10     11      17        41
 16.  Sunderland             9      9      20        36
 17.  Hull_City              8     11      19        35
 18.  Newcastle              7     13      18        34
 19.  Middlesbrough          7     11      20        32
 20.  West_Brom_Albion       8      8      22        32

We observed in an earlier simulation that there is considerable variability in the points, even when all teams are evenly matched. However, ...

If some teams are more likely to win their matches than others, the spread of final points is likely to be greater — the top and bottom teams are likely to be more extreme.

A simulation

To assess whether there is any difference in skill levels, we can therefore run a simulation of the league, assuming evenly matched teams and generating random results with probabilities 0.372, 0.372 and 0.255 for wins, losses and draws. (A proportion 0.255 of games in the actual league resulted in draws.)

Click Simulate to simulate the 380 games in a season. The standard deviation of the final points is shown below the table. Click Accumulate then run the simulation about 100 times. (Hold down the Simulate button to speed up the process.)

The standard deviation of the points in the actual league table was 18.2. Since most simulated standard deviations are between 5 and 12, we conclude that such a high spread would be extremely unlikely if the teams were evenly matched.

There is strong evidence that the top teams are 'better' than the bottom teams.
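The same simulation is easy to reproduce in code. Below is a minimal Python sketch under the assumptions in the text (evenly matched teams, draw probability 0.255, the remaining probability split equally between home and away wins); the function name simulate_season and the 1,000 repetitions are our own choices.

    import random
    from itertools import combinations

    def simulate_season(n_teams=20, p_draw=0.255):
        # One season: each pair of teams plays home and away (380 games).
        points = [0] * n_teams
        for a, b in combinations(range(n_teams), 2):
            for home, away in ((a, b), (b, a)):
                u = random.random()
                if u < p_draw:                        # draw: 1 point each
                    points[home] += 1
                    points[away] += 1
                elif u < p_draw + (1 - p_draw) / 2:   # home team wins: 3 points
                    points[home] += 3
                else:                                 # away team wins: 3 points
                    points[away] += 3
        mean = sum(points) / n_teams
        return (sum((p - mean) ** 2 for p in points) / (n_teams - 1)) ** 0.5

    sds = [simulate_season() for _ in range(1000)]
    # Proportion of simulated seasons whose spread of points reaches the
    # actual 18.2; this is almost always 0.
    print(sum(sd >= 18.2 for sd in sds) / len(sds))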


10.1.3   Simulation to test a proportion

Other uses of simulation

Simulations can help us to answer questions about a variety of other models (or populations). The following example was suggested by Kay Lipson from Swinburne University of Technology in Australia.

Does Australia Post deliver on time?

The Herald-Sun newspaper published the following article on 25 November 1992.

Doubt has been cast over Australia Post's claim of delivering 96 per cent of standard letters on time.

A survey conducted by the Herald-Sun in Melbourne revealed that less than 90 per cent of letters were delivered according to the schedule.

Herald-Sun staff posted 59 letters before the advertised...

Campbell Fuller, Herald-Sun, 25 November 1992.

Is the author justified in disputing Australia Post's claim that 96% of letters are delivered on time?

A simulation

If Australia Post's claim is correct, and every letter independently has probability 0.96 of being delivered on time, we know that the number delivered on time out of 59 letters will be a random quantity. From the information in the article, we can deduce that 52 out of the Herald-Sun's 59 letters arrived on time (a proportion 52/59 = 0.881).

How unlikely is it to get as few as 52 out of 59 letters arriving on time if Australia Post's claim that the probability of letters arriving on time is 0.96 is correct?

A simulation helps to answer this question.

Click Simulate to randomly 'deliver' 59 letters, with each independently having probability 0.96 of arriving on time. Click Accumulate then run the simulation between 100 and 200 times. (Hold down the Simulate button to speed up the process.)

Observe the distribution of the number of letters arriving on time. The proportion of simulations with 52 or fewer letters arriving on time is shown to the right of the dot plot. Observe that this rarely happens.

We therefore conclude that the article is justified — only 52 letters being delivered on time is most unlikely if Australia Post's claim is correct.
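For readers who prefer code to the interactive diagram, here is a minimal Python sketch of the same simulation (the 10,000 repetitions are an arbitrary choice).

    import random

    def on_time_count(n=59, p=0.96):
        # Deliver n letters; each independently arrives on time with probability p.
        return sum(random.random() < p for _ in range(n))

    n_sims = 10_000
    extreme = sum(on_time_count() <= 52 for _ in range(n_sims))
    print(extreme / n_sims)   # rarely happens (around 1% of simulations)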

We will return to this example later.

10.1.4   Test for a mean

Assessing a claim about a mean

In this example, we ask whether a sample mean is consistent with the underlying population mean having a target value.

Quality control for cornflake packets

In a factory producing packets of cornflakes, the weight of cornflakes that a filling machine places in each packet varies from packet to packet. From extensive previous monitoring of the operation of the machine, it is known that the net weight of '500 gm' packets is approximately normal with standard deviation σ = 10 gm.

The mean net weight of cornflakes in the packets is controlled by a single knob. The target is for a mean of µ = 520 gm to ensure that few packets will contain less than 500 gm. Samples are regularly taken to assess whether the machine needs to be adjusted. A sample of 10 packets was weighed and contained an average of 529 gm. Does this indicate that the underlying mean has drifted from µ = 520 and that the machine needs to be adjusted?

A simulation

If the filling machine is working to specifications, each packet would contain a weight that is sampled from a normal distribution with µ = 520 and σ = 10.

How unlikely is it to get the mean of a sample of size 10 that is as far from 520 as 529 if the machine is working correctly?

A simulation helps to answer this question.

Click Simulate to randomly generate the weights of 10 packets from a normal (µ = 520, σ = 10) distribution. Click Accumulate then run the simulation between 100 and 200 times. (Hold down the Simulate button to speed up the process.)

Observe that although many of the individual cornflake packets weigh more than 529 gm, it is rare for the mean weight to be as far from the target as 529 gm (i.e. either ≥529 gm or ≤511 gm).

There is therefore strong evidence that the machine is no longer filling packets with a mean weight of 520 gm and needs adjusting — a sample mean of 529 gm would be unlikely if the machine was filling packets to specifications.
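A Python sketch of this simulation, using the sample size, mean and standard deviation from the text (the 10,000 repetitions are arbitrary):

    import random

    def sample_mean(n=10, mu=520, sigma=10):
        return sum(random.gauss(mu, sigma) for _ in range(n)) / n

    n_sims = 10_000
    # Count sample means at least as far from 520 as the observed 529.
    extreme = sum(abs(sample_mean() - 520) >= 9 for _ in range(n_sims))
    print(extreme / n_sims)   # roughly 0.004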

We will return to this example later.

10.1.5   Randomisation tests

Simulation and randomisation

Simulation and randomisation are closely related techniques. Both are based on assumptions about the model underlying the data and involve randomly generated data sets.

Simulation
New data sets are generated directly from the model.
Randomisation
Modifications to the actual data are identified that would have the same probability of arising if the model held. New data sets are randomly picked from these.

Randomisation is understood most easily through an example.

Comparing two groups

If random samples are taken from two populations, we are often interested in whether the populations have the same means.

If the two populations were identical, any allocation of the sample values to the two groups would have been as likely as the observed sample data. By observing the distribution of the difference in means from such randomised allocations of values to groups, we can get an idea of whether the actual difference in sample means is unusually large.

An example helps to explain this method.

Characteristics of failed companies

A study in Greece compared characteristics of 68 healthy companies with those of another 33 that had recently failed. The jittered dot plots on the left below show the ratio of current assets to current liabilities for each of the 101 companies.

The mean asset-to-liabilities ratio for the sample of failed companies is 0.902 lower than that for the healthy companies, but the distributions overlap. Might this difference be simply a result of randomness, or can we conclude that there is a difference in the underlying populations?

Click Randomise to randomly pick 33 of the 101 values for the failed group. If the underlying distribution of asset-to-liabilities ratios was the same for healthy and failed companies, each such randomised allocation would be as likely as the observed data.

Click Accumulate and repeat the randomisation several more times. Observe that the difference in means would rarely be as far from zero as -0.902 when we assume the same distribution for both groups. This strongly suggests that the distributions must be different.

Since the actual difference is so unusually large, ...

We can conclude that there is strong evidence that the mean asset-to-liabilities ratio is lower for failed companies than healthy ones.
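The randomisation itself takes only a few lines of code. The Python sketch below shows the mechanics; since the 101 company ratios are not listed here, the two lists hold small hypothetical values purely for illustration.

    import random

    healthy = [2.3, 1.9, 2.8, 1.6, 2.1, 2.5]   # hypothetical ratios
    failed = [1.2, 0.9, 1.7, 1.1]              # hypothetical ratios

    observed = sum(failed) / len(failed) - sum(healthy) / len(healthy)
    pooled = healthy + failed

    n_rand = 10_000
    count = 0
    for _ in range(n_rand):
        shuffled = random.sample(pooled, len(pooled))   # random re-allocation
        f, h = shuffled[:len(failed)], shuffled[len(failed):]
        diff = sum(f) / len(f) - sum(h) / len(h)
        if abs(diff) >= abs(observed):
            count += 1

    print(count / n_rand)   # randomisation p-value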


10.1.6   Randomisation test for correlation

On this page, another example of randomisation is described, to assess whether teams in a soccer league are evenly matched.

English Premier Soccer League, 2007/08 and 2008/09

We saw earlier that the distribution of points in the 2008/09 English Premier Soccer League Table was not consistent with all teams being evenly matched — the spread of points was too high. We will now investigate this further.

If some teams are better than others, the positions of teams in the league in successive years will tend to be similar. The table below shows the points for the teams in two seasons. (Note that the bottom three teams are relegated each year and three teams are promoted from the lower league, so we cannot compare the positions of six of the teams.)

                     Points
  Team           2007/08   2008/09
  ManchesterU       87        90
  Chelsea           85        83
  Arsenal           83        72
  Liverpool         76        86
  Everton           65        63
  AstonVilla        60        62
  Blackburn         58        41
  Portsmouth        57        41
  ManchesterC       55        50
  WestHam           49        51
  Tottenham         46        51
  Newcastle         43        34
  Middlesbro        42        32
  Wigan             40        45
  Sunderland        39        36
  Bolton            37        41
  Fulham            36        53
  Reading           36         -
  Birmingham        35         -
  DerbyCounty       11         -
  StokeCity          -        45
  HullCity           -        35
  WestBromA          -        32
Manchester United, Chelsea, Arsenal and Liverpool were the top four teams in both years. However, ...

Excluding Manchester United, Chelsea, Arsenal and Liverpool, do there seem to be any differences in ability between the other teams?

Randomisation

If all other teams have equal probabilities of winning against any opponent, the 2008/09 points of 45 (which was actually obtained by Wigan) would have been equally likely to have been obtained by any of the teams in that year. Indeed, any allocation of the points (63, 62, 41, ..., 53) to the teams (Everton, Aston Villa, Blackburn, ..., Fulham) would be equally likely.

The diagram below performs this randomisation of the results in 2008/09.

Click Randomise to shuffle the 2008/09 points between the teams (excluding the top four teams and those that were only in the league for one of the seasons). If the teams were of equal ability, these points would have been as likely as the actual ones.

The correlation coefficient between the points in the two seasons gives an indication of how closely they are related. Click Accumulate and repeat the randomisation several more times. Observe that the correlation for the randomised values is only as far from zero as the actual correlation (r = 0.537) in about 5% of randomisations. Since a correlation as high as 0.537 is fairly unusual for equally-matched teams, ...

There is moderately strong evidence of a difference in skill between teams, even when the top four have been excluded.
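In code, the randomisation test for correlation looks like the Python sketch below, which uses the points of the 13 teams that appear (outside the top four) in both seasons of the table above; the 10,000 shuffles are an arbitrary choice.

    import random

    pts_0708 = [65, 60, 58, 57, 55, 49, 46, 43, 42, 40, 39, 37, 36]
    pts_0809 = [63, 62, 41, 41, 50, 51, 51, 34, 32, 45, 36, 41, 53]

    def correlation(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / (sxx * syy) ** 0.5

    observed = correlation(pts_0708, pts_0809)   # r = 0.537

    n_rand = 10_000
    count = 0
    for _ in range(n_rand):
        shuffled = random.sample(pts_0809, len(pts_0809))
        if abs(correlation(pts_0708, shuffled)) >= abs(observed):
            count += 1
    print(count / n_rand)   # around 0.05, as quoted in the text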


10.1.7   Common patterns in tests

A general framework

The examples in earlier pages of this section involved different types of data and different analyses. Indeed, you may find it difficult to spot their common theme!

All analyses were examples of hypothesis testing. We now describe the general framework of hypothesis testing within which all of these examples fit. This general framework is the basis for important applications in later sections of CAST.

The concepts in this page are extremely important — make sure that you understand them well before moving on.

Data, model and question

Data (and model)
Each example dealt with a data set that was assumed to arise from some random mechanism. We may be able to specify some aspects of this random mechanism (model), but it also has unknown characteristics.
Null hypothesis
All models had unknown characteristics, and we want to know whether the model has particular properties — the null hypothesis.
Alternative hypothesis
If the null hypothesis is not true, we say that the alternative hypothesis holds. (You can understand most of hypothesis testing without paying much attention to the alternative hypothesis however!)

Either the null hypothesis or the alternative hypothesis must be true.

Approach

We assess whether the null hypothesis is true by asking ...

Are the data consistent with the null hypothesis?

It is extremely important that you understand that hypothesis testing addresses this question — make sure that you remember it well!!

Answering the question

Test statistic
This is some function of the data that throws light on whether the null or alternative hypothesis holds.
P-value
Testing whether the data are consistent with the null hypothesis is based on the probability of obtaining a test statistic value as 'extreme' as the one recorded if the null hypothesis holds. This is called the p-value for the test.
Interpreting the p-value
Although it may be regarded as an over-simplification, the table below can be used as a guide to interpreting p-values.
p-value Interpretation
over 0.1 no evidence that the null hypothesis does not hold
between 0.05 and 0.1 very weak evidence that the null hypothesis does not hold
between 0.01 and 0.05 moderately strong evidence that the null hypothesis does not hold
under 0.01 strong evidence that the null hypothesis does not hold

Use the pop-up menu below to check how the earlier examples in this section fit into the hypothesis testing framework.

Soccer league in one season

Data (and model)
Some random mechanism underlies the actual results in the matches during a season. The probabilities of winning may vary from team to team and there may be a home-team advantage, so there are a lot of unknowns about this model! Our data are a single set of results — the league table at the end of the season.
Null hypothesis
The null hypothesis is that all teams are equally matched — i.e. that they all have the same probability of winning each match.
Alternative hypothesis
The alternative hypothesis is that all teams do not have the same probabilities of winning.
Test statistic
The standard deviation of final points is used. It will be low if the teams have the same abilities (null hypothesis) and higher otherwise (alternative hypothesis).
P-value
We simulated the soccer league, assuming that all teams had the same probability of winning. The p-value was the probability of getting a standard deviation of final points as high as 18.2 (the actual data).
Interpreting the p-value
The p-value was 0.000 (or close). Since there is virtually no chance of getting a standard deviation of points as high as that in the actual league from equally matched teams, we conclude that the teams are not equally matched — the null hypothesis is false.

10.2   Tests about proportions

  1. Inference about parameters
  2. P-value for testing proportion
  3. Another example
  4. One- and two-tailed tests
  5. Normal approximation
  6. Statistical distance
  7. Tests based on statistical distance

10.2.1   Inference about parameters

Inference and random samples

The examples in the previous section involved a range of different types of model for the observed data. In the remainder of this chapter, we concentrate on one particular type of model — random sampling from a population.

We assume now that the observed data are a random sample from some population.

When the observed data are a random sample, inference asks questions about characteristics of the underlying population distribution — unknown population parameters.

For random samples, the null and alternative hypotheses specify values for the unknown population parameters.

Inference about categorical populations

When the population distribution is categorical, the unknowns are the population probabilities for the different categories. To simplify, we consider populations for which one category is of particular interest ('success') and we denote the unknown probability of success by π.

The null and alternative hypotheses are therefore specified in terms of π.

Australia Post example

A journalist trying to assess Australia Post's assertion that 96 percent of letters arrive 'on time' posted 59 letters and observed that only 52 arrived on time.

We model delivery of these letters as a random sample of 59 categorical values from a population with probability π of success (arrival on time). The null hypothesis of interest is therefore...

H0:   π = 0.96

The alternative hypothesis is

HA:   π < 0.96

Design of CD case

A company intends to manufacture a case that will hold 20 CDs. The design team are particularly keen on a design that is more expensive to manufacture than two competing designs. The manager wants to be sure that customers will prefer the more expensive case before starting production — the price is determined by competitors' CD cases so the more expensive one will have a reduced profit margin and can only be justified if sales are considerably higher.

To assess whether customers prefer the more expensive case, a limited number of each of the three designs is manufactured and placed for sale at the same price in a CD store. Out of the first 90 cases sold, the more expensive case was bought 36 times.

This situation can be modelled as random sampling of 90 values (the three case designs) from a categorical population in which the probability of picking the expensive case is π. The null hypothesis of interest is therefore...

H0:   π = 1/3       (no preference)

The alternative hypothesis is

HA:   π > 1/3       (preference for the expensive case)

Tests about parameters of other populations

Other data sets arise as random samples from different kinds of population. For example, numerical data sets are often modelled as random samples from a normal distribution. Again, the hypotheses of interest are usually expressed in terms of the parameters of this distribution.

For example, to test whether the mean of a normal distribution is zero, the hypotheses would be...

H0:   µ = 0

HA:   µ ≠ 0

In the remainder of this section, we show how to test a population probability, and in the next section we will describe tests about a population mean.

10.2.2   P-value for testing proportion

Test statistic

When testing the value of a probability, π, the obvious statistic to use from our random sample is the corresponding sample proportion, p.

It is however more convenient to use the number of successes, x, rather than p since we know that X has a binomial distribution with parameters n (the sample size) and π.

When we know the distribution of the test statistic (at least after the null hypothesis has fixed the value of the parameters of interest), it becomes much easier to obtain the p-value for the test.

P-value

As in all other tests, the p-value is the probability of getting such an 'extreme' set of data if the null hypothesis is true. Depending on the null and alternative hypotheses, the p-value is therefore the probability that X is as big (or sometimes as small) as the recorded value.

Since we know the binomial distribution of X when the null hypothesis holds, the p-value can therefore be obtained by adding binomial probabilities.

The p-value is a sum of binomial probabilities

Note that the p-value can be obtained exactly, without the need for simulations or randomisation.

Australia Post example

A journalist trying to assess Australia Post's assertion that 96 percent of letters arrive 'on time' posted 59 letters and observed that only 52 arrived on time.

H0:   π = 0.96

HA:   π < 0.96

In the diagram below, click Accumulate then hold down Simulate until about 100 samples of 59 letters have been generated. The proportion of these simulated samples in which 52 or fewer letters arrived on time is an approximation to the p-value for the test.

Since we know that the number arriving on time has a binomial (n = 59, π = 0.96) distribution when the null hypothesis holds, the simulation is unnecessary. Select Binomial distribution from the pop-up menu. This binomial distribution is displayed, and the probability of 52 or fewer letters being delivered on time is shown to be 0.009 — the p-value for the test.

Since the p-value is so small, there would have been very little chance of the observed data arising if Australia Post's assertion had been correct. We can therefore conclude that there is strong evidence against their assertion. Note that this can be done without any simulations.
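The exact calculation is straightforward in code; a minimal Python sketch that sums the lower tail of the binomial distribution:

    from math import comb

    n, pi0, x = 59, 0.96, 52
    p_value = sum(comb(n, k) * pi0**k * (1 - pi0)**(n - k) for k in range(x + 1))
    print(round(p_value, 3))   # 0.009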

10.2.3   Another example

Another example

The following example shows again how the binomial distribution can be used to obtain the p-value for a test about a population probability.

Design of CD case

In the trial of three CD cases that was described at the start of this section, all three were offered for sale at the same price in a CD store. Out of the first 90 cases that were purchased, 36 were the design that was more expensive to manufacture than the other two. Since more than a third of the customers chose this case design, is there strong evidence that customers prefer it?

The null and alternative hypotheses are...

H0:  π = 1/3       (no preference)

HA:  π > 1/3       (preference for the expensive case)

The p-value is the probability of 36 or more expensive cases being purchased when π = 1/3. This can be obtained directly from a binomial distribution with π = 1/3 and n = 90.

Use the slider below to obtain the p-value for this test.

The p-value for the test is 0.1103, meaning that there is a probability of 0.1103 of the expensive case being purchased 36 or more times even if there is no real preference for it. We therefore conclude that there is no evidence of any preference from the data.

Interpretation of p-values

If the p-value for a test is very small, the data are 'inconsistent' with the null hypothesis. (The observed data may still be possible, but are at least extremely unlikely.)

From a very small p-value, we can conclude that the null hypothesis is probably wrong.

However a high p-value cannot allow us to conclude that the null hypothesis is correct — only that the observed data are consistent with it. For example, if exactly 30 expensive cases (a third) were purchased in the CD-case example above, it would be wrong to conclude that there was no preference for it. The data are also consistent with other values of π near 1/3, so we cannot conclude that π is not 0.32 or 0.34.

A hypothesis test can never conclude that the null hypothesis is correct.

The correct interpretation of p-values for the CD-case experiment would be...

p-value Interpretation Conclusion
p >  0.1 x is not unusually high. It would be as high in more than 10% of samples if π = 1/3. There is no evidence against π = 1/3.
0.05 < p < 0.1 We would find x as high in only 5% to 10% of samples if π = 1/3. There is only slight evidence against π = 1/3.
0.01 < p < 0.05 We would find x this high in only 1% to 5% of samples if π = 1/3. There is moderately strong evidence against π = 1/3.
p < 0.01 We would find x this high in under 1% of samples if π = 1/3. There is strong evidence against π = 1/3.

10.2.4   One- and two-tailed tests

Finding the p-value for a one-tailed test

The Australia Post hypothesis test involved a random sample of size n from a population with probability π of success (delivery on time). The data collected were x successes, and we tested the hypotheses

H0:   π = π0
HA:   π < π0

where π0 was the constant of interest (0.96 in this example). The following steps were followed to obtain the p-value for the test.

  1. The sample proportion of successes, p, was identified as the most informative summary statistic about π.
  2. The number of successes, x = np, has a binomial distribution with known parameters (n and π0) when H0 holds, so it is a better test statistic.
  3. The p-value is a sum of tail probabilities for this binomial distribution.

The diagram below illustrates these steps.

[diagram: binomial distribution of the test statistic, with the p-value shown as a tail probability]

The CD-case example was similar, but the alternative hypothesis involved high values of π and the p-value was found by adding upper tail probabilities.

Finding the p-value for a two-tailed test

The appropriate tail probability to use depends on the alternative hypothesis. If the alternative hypothesis allows either high or low values of x, the test is called a two-tailed test.

The p-value is then double the smaller tail probability since values of x in both tails of the binomial distribution would provide evidence for HA.

Ethics codes in companies

In 1999, The Conference Board surveyed 124 companies and found that 97 had their own ethics codes ("Business Bulletin", Wall Street Journal, Aug 19, 1999). In 1997, it was believed that 72% of companies had ethics codes, so is there any evidence that the proportion has changed?

This question is equivalent to asking whether a sample proportion of 97 out of 124 is consistent with sampling from a population with π = 0.72. This can be expressed as the hypotheses

H0:   π = 0.72
HA:   π ≠ 0.72

We would expect about (0.72 x 124) = 89 of the companies to have ethics codes. A sample count that is either much greater than 89 or much less than 89 would suggest that the probability had changed. Use the slider below to obtain the p-value.

The probability of getting as many as 97 is 0.0718. Since this is a two-tailed test, we must also take account of the probability of getting a count that is as unusually low, so the p-value is twice this, 0.1436. Getting 97 companies with ethics codes is therefore not unlikely, so we conclude that there is no evidence from these data of a change in the proportion of companies with ethics codes since 1997.
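A Python sketch of the two-tailed calculation, doubling the smaller tail probability as described above:

    from math import comb

    n, pi0, x = 124, 0.72, 97

    def binom_pmf(k):
        return comb(n, k) * pi0**k * (1 - pi0)**(n - k)

    upper = sum(binom_pmf(k) for k in range(x, n + 1))   # P(X >= 97) = 0.0718
    lower = sum(binom_pmf(k) for k in range(x + 1))      # P(X <= 97)
    print(2 * min(upper, lower))                         # p-value = 0.1436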

10.2.5   Normal approximation

Computational problem

To find the p-value for a hypothesis test about a proportion, tail probabilities for a binomial distribution must be summed.

If the sample size n is large, there may be a huge number of probabilities to add together; this is both tedious and prone to numerical error.

Home-based businesses owned by women

A recent study that was reported in the Wall Street Journal sampled 899 home-based businesses and found that 369 were owned by women.

Are home-based businesses less likely to be owned by females than by males? This question can be expressed as a hypothesis test. If the population proportion of home-based businesses owned by females is denoted by π, the hypotheses can be written as

H0:   π = 0.5
HA:   π < 0.5

If the null hypothesis is true, the sample number owned by females will have a binomial distribution with parameters n = 899 and π = 0.5. The p-value for the test is therefore the sum of binomial probabilities,

p-value  =  P(X ≤ 369)  =  P(X = 0) + P(X = 1) + ... + P(X = 369)

A lot of probabilities must be evaluated and summed! And all are close to zero.

Normal approximation

We saw earlier that the normal distribution may be used as an approximation to the binomial when n is large. Both the sample proportion of successes, p, and the number of successes, x = np, are approximately normal when n is large:

mean(x) = nπ,   sd(x) = root( nπ(1 - π) )
mean(p) = π,   sd(p) = root( π(1 - π) / n )

The best-fitting normal distribution can be used to obtain an approximation to any binomial tail probability. In particular, it can be used to find an approximate p-value for a hypothesis test.

Approximate p-value

A large random sample of size n is selected from a population with probability π of success and x successes are observed. We will again test the hypotheses

H0:   π = π0
HA:   π < π0

The normal approximation to the distribution of x can be used to find the tail probability: the p-value is approximately the probability of a value this low from a normal distribution with mean nπ0 and standard deviation root( nπ0(1 - π0) ).

Home-based businesses owned by women

In this example, the sample size, n = 899, is large, so we can use a normal approximation to obtain the probability of 369 or fewer businesses owned by females if the underlying population probability was 0.5 (the null hypothesis).

Click Accumulate then simulate sampling of 899 businesses about 300 times. (Hold down the button Simulate.) From the simulation, it is clear that the probability of obtaining 369 or fewer businesses owned by females is extremely small — there is strong evidence against the null hypothesis.


The same conclusion can be reached without a simulation.

Select Bar chart from the pop-up menu, then select Normal approximation. From the normal approximation, we can determine that the p-value for the test (the tail area below 369) is extremely close to zero.

Continuity correction (advanced)

The approximate p-value could be found by comparing the z-score for x,

z  =  (x - nπ0) / root( nπ0(1 - π0) )

with a standard normal distribution. Since x is discrete,

P(X ≤ 369)   =  P(X ≤ 369.5)   =   P(X ≤ 369.9)   =   ...

To find this tail probability, any value of x between 369 and 370 might have been used when evaluating the z-score. The p-value can be more accurately estimated by using 369.5. This is called a continuity correction.

The continuity correction involves either adding or subtracting 0.5 from the observed count, x, before finding the z-score.

Be careful about whether to add or subtract — the probability statement should be unchanged. For example, P(X ≥ 410) = P(X ≥ 409.5), so 0.5 should be subtracted from x = 410 as a continuity correction in order to find this probability using a normal approximation and z-score.

The continuity correction is most important when the observed count is near either 0 or n.
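A Python sketch of the normal approximation with the continuity correction, for the home-based business example (the helper phi is just the standard normal CDF):

    from math import erf, sqrt

    n, pi0, x = 899, 0.5, 369
    mean = n * pi0
    sd = sqrt(n * pi0 * (1 - pi0))

    def phi(z):                      # standard normal CDF
        return 0.5 * (1 + erf(z / sqrt(2)))

    z = (x + 0.5 - mean) / sd        # add 0.5 because we want P(X <= 369)
    print(phi(z))                    # approximate p-value, essentially zero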

10.2.6   Statistical distance

Difference between parameter and estimate

Many hypothesis tests are about a single parameter of the model, such as a population probability, π, or a population mean, µ.

It is natural to base a test about such a parameter on the corresponding sample statistic: the sample proportion, p, or the sample mean, xBar.

If the value of the sample statistic is close to the hypothesised value of the parameter, there is no reason to doubt the null hypothesis. However if they are far apart, the data are not consistent with the null hypothesis and we should conclude that the alternative hypothesis holds.

A large distance between the estimate and hypothesised value is evidence against the null hypothesis.

Statistical distance

How do we tell what is a large distance between, say, p and a hypothesised value for the population proportion, π0? The empirical rule says that we expect p to be within two standard errors of π0 (about 95% of the time). If we measure the distance in standard errors, we know that 2 (standard errors) is a large distance, 3 is a very large distance, and 1 is not much.

The number of standard errors is

z  =  (p - π0) / se(p)

In general, the statistical distance from an estimate to a hypothesised value of the underlying parameter is

z  =  (estimate - hypothesised value) / se(estimate)

If this comes to more than 2, or less than -2, it suggests that the hypothesised value is wrong: the estimate is not consistent with the hypothesised parameter value. If, on the other hand, z is close to zero, the data are giving a result reasonably close to what we expected based on the hypothesis.

10.2.7   Tests based on statistical distance

Test statistic and p-value

The statistical distance from an estimate to a hypothesised value of the underlying parameter is

z  =  (estimate - hypothesised value) / se(estimate)

This can be used as a test statistic. If the null hypothesis holds, it approximately has a standard normal distribution — a normal distribution with zero mean and unit standard deviation.

The p-value for a test can be determined from the tail areas of this standard normal distribution.

[diagram: standard normal distribution of z, with the two-tailed p-value shown as the shaded tail areas]

In the above diagram, the null hypothesis is consistent with estimates close to the hypothesised value, and the alternative hypothesis is suggested by estimates that are either much bigger or much smaller than this value (called a two-tailed test). For a two-tailed test, the p-value is the red tail area and can be looked up using normal tables or software such as Excel.

Refinements

If the standard error of the estimate must itself be estimated from the sample data, the above test statistic is only approximately normally distributed. In some tests that we will describe in later sections, the test statistic has a t distribution (which has slightly longer tails than the standard normal distribution). This refinement will be described fully in the next section.

Home-based businesses owned by women

The diagram below repeats the simulation that we used earlier to test whether the proportion of home-based businesses owned by women was less than 0.5:

The proportion owned by women in a sample of n = 899 businesses was 369/899 = 0.410.

Again click Accumulate and hold down the Simulate button until about 100 samples of 899 businesses have been generated with a population probability of being owned by women of 0.5.

Select Statistical distance from 0.5 from the top pop-up menu to translate the proportions of female owners in the simulated samples into z-scores. Observe that most of these 'statistical distances from 0.5' are between -1 and +1.

The observed proportion owned by females was 0.410, corresponding to a statistical distance of z = -5.37, an unlikely value if the population proportion was 0.5.

Select Normal distribution from the lower pop-up menu to show the theoretical distribution of the z-scores. The p-value for the test is the tail area of this normal(0, 1) distribution below -5.37 and is virtually zero, so we again conclude that:

It is almost certain that π is less than 0.5.
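The statistical distance and its p-value can be computed directly; a Python sketch for this example:

    from math import erf, sqrt

    n, pi0 = 899, 0.5
    p = 369 / n
    se = sqrt(pi0 * (1 - pi0) / n)           # se of p when H0 holds
    z = (p - pi0) / se                       # about -5.37
    p_value = 0.5 * (1 + erf(z / sqrt(2)))   # lower tail of normal(0, 1)
    print(z, p_value)                        # p-value is virtually zero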


Relation to previous test

The p-value obtained in this way using a 'statistical distance' as the test statistic is identical to the p-value that was found from a normal approximation to the number of successes without a continuity correction. (The p-value is slightly different if a continuity correction is used.)

The use of 'statistical distances' does not add anything when testing a sample proportion, but it is a general method that will be used to obtain test statistics in many other situations later in this e-book.

10.3   Tests about means

  1. Introduction
  2. Test for mean (known σ)
  3. P-value from statistical distance
  4. The t distribution
  5. The t test for a mean

10.3.1   Introduction

Tests about numerical populations

The most important characteristic of a numerical population is usually its mean, µ. Hypothesis tests therefore usually question the value of this parameter.

Blood pressure of executives

The medical director of a large company looks at the medical records of 72 male executives aged between 35 and 44 and observes that their mean blood pressure is xBar = 126.07. We model these 72 blood pressures as a random sample from an underlying population with mean µ (blood pressures of similar executives).

Published national health statistics report that in the general population for males aged 35-44, blood pressures have mean 128 and standard deviation 15. Do the executives conform to this population? Focusing on the mean of the blood pressure distribution, this can be expressed as the hypotheses,

H0:   µ = 128
HA:   µ ≠ 128

Filling milk containers

In a bottling plant, plastic containers are filled with a nominal 2 litres of milk. However the containers are filled so quickly that it is impossible to ensure that each contains exactly 2 litres. The volume of milk in a container is approximately normally distributed with standard deviation 0.005 litres, and the machinery is adjusted to give a mean volume of 2.012 litres. (Using the normal distribution, you can check that only 1% of containers should contain less than the nominal 2 litres of milk.)

At regular intervals, twelve containers are sampled and the volume of milk in each is measured accurately to assess whether the machinery needs adjustment. (Overfilling wastes milk, but underfilling is illegal.) One sample is shown below.

Volume of milk (litres)
2.024   2.015   2.022   2.025   2.008   2.024
2.021   2.018   2.020   2.023   2.005   2.016

Are the data consistent with the target mean volume of 2.012 litres? This can be expressed as a hypothesis test comparing...

H0:   µ = 2.012
HA:   µ ≠ 2.012

Null and alternative hypotheses

Both of the above examples involve tests of the hypotheses

H0:   µ = µ0
HA:   µ ≠ µ0

where µ0 is the constant that we think may be the true mean. These are called two-tailed tests. In other situations, the alternative hypothesis may involve only high (or low) values of µ (one-tailed tests), such as

H0:   µ = µ0
HA:   µ > µ0

10.3.2   Test for mean (known σ)

Model and hypotheses

In both examples in the first page of this section, there was knowledge of the population standard deviation σ (at least when H0 was true). This greatly simplifies the problem of finding a p-value for the test.

Blood pressure of executives
From published information, the national distribution of blood pressure in males aged 35-44 is known to have a standard deviation σ = 15.
Filling milk containers
The variation caused by the filling machinery is well understood and the standard deviation of the volume of milk is known to be σ = 0.005 litres.

In both examples, the hypotheses were of the form

H0:   µ = µ0
HA:   µ ≠ µ0

Summary Statistic

The first step in finding a p-value for the test is to identify a summary statistic that throws light on whether H0 or HA is true. When testing the population mean, µ, the obvious summary statistic is the sample mean, xBar, and the hypothesis tests that will be described here are based on this.

We saw earlier that the sample mean has a distribution with mean and standard deviation

mean(xBar) = mu

sd(xBar) = sigma/root(n)

Furthermore, the Central Limit Theorem states that the distribution of the sample mean is approximately normal, provided the sample size is not small. (The result holds even for small samples if the population distribution is also normal.)

P-value

The p-value for the test is the probability of getting a sample mean as 'extreme' as the one that was recorded when H0 is true. It can be found directly from the distribution of the sample mean.

Note that we can assume knowledge of both µ and σ in this calculation — the values of both are fixed by H0.

Since we know the distribution of the sample mean (when H0 is true), the p-value can be evaluated as the tail area of this distribution.

One-tailed test
If the alternative hypothesis HA specifies large values of µ, the p-value is the upper tail area (shown in green below). If HA is for small values of µ, the opposite tail of the distribution is used.

[diagram: normal distribution of xBar, with the upper tail area shaded]

Two-tailed test
If the alternative hypothesis HA allows for large or small values of µ, the p-value is the sum of the two tail areas below.

[diagram: normal distribution of xBar, with both tail areas shaded]

10.3.3   P-value from statistical distance

Statistical distance and test statistic

The p-value for testing a hypothesis about the mean, µ, when σ is known, is a tail area from the normal distribution of the sample mean and can be evaluated in the usual way using a z-score. This calculation can be expressed in terms of the statistical distance between the parameter and its estimate,

z  =  (estimate - hypothesised value) / se(estimate)

In the context of a test about means,

z  =  (xBar - µ0) / (sigma / root(n))

Since z has a standard normal distribution (zero mean and unit standard deviation) when the null hypothesis holds, it can be used as a test statistic.

P-value

The p-value for the test can be determined from the tail areas of the standard normal distribution.

[diagram: standard normal distribution of z, with tail areas giving the p-value]

For a two-tailed test, the p-value is the red tail area.

Quality control for cornflake packets

The diagram below repeats the simulation that we used earlier to test whether a sample mean weight of 10 cornflake packets of 529 gm is consistent with a packing machine that is set to give normally distributed weights with µ = 520 gm and σ = 10 gm.

Again click Accumulate and hold down the Simulate button until about 100 samples of 10 packets have been selected and weighed. The p-value is the probability of getting a sample mean further from 520 gm than 529 gm (either below 511 gm or above 529 gm), and the simulation provides an estimate of it. However a simulation is unnecessary since we can evaluate the p-value exactly.

Select Normal distribution from the pop-up menu on the bottom right to replace the simulation with the normal distribution of the mean,

xBar  ~  normal ( mean = 520,  sd = sigma/root(n) = 10/root(10) = 3.162 )

From its tail area, we can calculate (without a simulation) that the probability of getting a sample mean as far from 520 as 529 is exactly 0.0044. This is the exact p-value for the test.

P-value from statistical distance

Finally, consider the statistical distance of our estimate of µ, 529 gm, from the hypothesised value, 520 gm.

z  =  (529 - 520) / (10 / root(10))  =  2.85

Select Statistical distance from 520 from the middle pop-up menu to show how the p-value is found using this z-score.

Since the p-value is so small (0.0044), we conclude that there is strong evidence that the population mean, µ, is not 520.
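The whole calculation in a Python sketch:

    from math import erf, sqrt

    xbar, mu0, sigma, n = 529, 520, 10, 10
    z = (xbar - mu0) / (sigma / sqrt(n))   # 2.85
    phi = 0.5 * (1 + erf(z / sqrt(2)))     # standard normal CDF at z
    print(2 * (1 - phi))                   # two-tailed p-value, 0.0044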

Weights of courier packages

A courier company suspected that the weight of recently shipped packages had dropped. From past records, the mean weight of packages was 18.3 kg and their standard deviation was 7.1 kg. These figures were based on a very large number of packages and can be treated as exact.

Thirty packages were sampled from the previous week and their mean weight was found to be 16.8 kg. The data are displayed in the jittered dot plot below.

If the null hypothesis was true, the sample mean would have the normal distribution shown in pale blue. Although the sample mean weight is lower than 18.3 kg, it is not particularly unusual for this distribution, so we conclude that there is no evidence that the mean weight has reduced.

The right of the diagram shows how the p-value is calculated from a statistical distance (z-score).


Choose Modified Data from the pop-up menu. The slider allows you to investigate how low the sample mean must become in order to give strong evidence that µ is less than 18.3.

10.3.4   The t distribution

Unknown standard deviation

In the examples on the previous page, the population standard deviation, σ, was a known value. Unfortunately this is rarely the case in practice, so the previous test cannot be used.

Returns from Mutual Funds

Investing in the share market can be risky for small investors since the value of individual companies can fluctuate greatly, especially over short periods of time. These risks can be reduced by buying shares in a mutual fund that spreads the investment among a wide portfolio of companies.

Different mutual funds invest in companies of different types and with different inherent risks of losing and (hopefully) gaining value. Some funds have been categorised as 'high-risk' funds and a sample of 25 of these is shown in the table below. The percentage return paid by these funds over a 3-year period (April 1997 to March 2000) is also shown. (The stock market did particularly well over this period!)

The corresponding annualised return from Federal Constant Maturity Rate Bonds over this period was 5.64%. Did the high-risk funds do any better on average than this 'safe' investment?

High-risk mutual fund            Annualised 3-year return (1997-2000)
Alliance Quasar                    8.76%
Alliance Tech                     58.71%
Amer Cent Gl Gold                -22.82%
Berger Sm Co Gr                   49.02%
Blackrock Sm Cp Gr                43.97%
CGM Cap Devel                     13.91%
Dreyfus Aggressive Growth         -2.89%
Evergreen Aggressive growth A     39.64%
Federated Small cap Strat A       17.91%
Fidelity emerging markets        -10.55%
Fidelity Selects Comp             68.58%
Franklin Value A                  -0.33%
Goldman Sachs small cap val A      4.00%
Hotchkiss and Wiley Small Cap      0.14%
JP Morgan Sm Co                   23.87%
J Hancock Small cap Growth B      38.23%
Kemper Small cap equity A         26.60%
MFS Emerg Gr                      36.02%
Montgomery Small cap R            29.51%
Oakmark Sm Cap                     1.62%
O'Shaughnessy Crn Gr              28.91%
PBHG Emerging Growth              29.32%
Putnam OTC Emerg Gr               54.43%
State St. Res Emer Gr A           30.76%
USAA Aggressive Gr                49.67%

The hypotheses of interest are similar to those in the initial pages of this section,

H0:   µ = 5.64
HA:   µ > 5.64

However we no longer know the population standard deviation, σ. The only information we have about σ comes from our sample.

Test statistic and its distribution

When the population standard deviation, σ, was a known value, we used a test statistic

z=(xBar-mu)/(sigma/root(n))

which has a standard normal distribution when H0 is true.

When σ is unknown, we use a closely related test statistic,

t=(xBar-mu)/(s/root(n))

where s is the sample standard deviation. This test statistic has greater spread than the standard normal distribution, due to the extra variability that results from estimating σ with s, especially when the sample size n is small.

The diagram below generates random samples from a normal distribution. Click Take sample a few times to see the variability in the samples.

Click Accumulate then take about 50 random samples. Observe that the stacked dot plot of the t statistic conforms reasonably with a standard normal distribution.

Now use the pop-up menu to reduce the sample size to 5 and take a further 50-100 samples. You will probably notice that there are more 'extreme' t-values (less than -3 or more than +3) than would be expected from a standard normal distribution.

Reduce the sample size to 3 and repeat. It should now be clearer that the distribution of the t-statistic has greater spread than a standard normal distribution. Click on the crosses for the most extreme t-values and observe that they correspond to samples in which the 3 data values happen to be close together, resulting in a small sample standard deviation, s.
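A Python sketch of this simulation, counting how often |t| exceeds 3 for several sample sizes (10,000 samples per size is an arbitrary choice):

    import random
    from math import sqrt

    def t_statistic(n, mu=0, sigma=1):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = sqrt(sum((v - xbar) ** 2 for v in sample) / (n - 1))
        return (xbar - mu) / (s / sqrt(n))

    for n in (30, 5, 3):
        ts = [t_statistic(n) for _ in range(10_000)]
        print(n, sum(abs(t) > 3 for t in ts) / len(ts))
    # The proportion of |t| values beyond 3 grows as n shrinks.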

The t distribution

We have seen that the t statistic does not have a standard normal distribution, but it does have another standard distribution called a t distribution with (n - 1) degrees of freedom. On the next page, we will use this distribution to obtain the p-value for hypothesis tests.

The diagram below shows the shape of the t distribution for various different values of the degrees of freedom.

Drag the slider to see how the shape of the t distribution depends on the degrees of freedom. Note that


A standard normal distribution can be used as an approximation to a t distribution if the degrees of freedom are large (say 30 or more) but the t distribution must be used for smaller degrees of freedom.


10.3.5   The t test for a mean

Finding a p-value from the t distribution

The p-value for any test is the probability of getting such an 'extreme' test statistic when H0 is true. To test the value of a population mean, µ, when σ is unknown, the appropriate test statistic is

t=(xBar-mu)/(s/root(n))

Since this has a t distribution (with n - 1 degrees of freedom) when H0 is true, the p-value is found from a tail area of this distribution. The relevant tail depends on the alternative hypothesis. For example, if the alternative hypothesis is for low values of µ, the p-value is the low tail area of the t distribution since low values of xBar (and hence t) would support HA over H0.

H0:   µ = µ0
HA:   µ < µ0

The steps in performing the test are shown in the diagram below.

[diagram: steps in a one-tailed t test]

Computer software should be used to obtain the p-value from the t distribution.

Returns from Mutual Funds

The example on the previous page asked whether the average annualised return on high-risk mutual funds was higher than that from Federal Bonds (5.64%) over the period April 1997 to March 2000. The population standard deviation was unknown and the hypotheses of interest were,

H0:   µ = 5.64
HA:   µ > 5.64

The diagram below shows the calculations for obtaining the p-value for this test from the t distribution with (n - 1) = 24 degrees of freedom.

Since the probability of obtaining such a high mean return from 25 funds is 0.000 (to 3 decimal places) if the underlying population mean is 5.64, we conclude that there is extremely strong evidence that the mean return on high-risk funds was over 5.64 percent.
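The p-value can be verified with standard software. A sketch using scipy (the one-sided alternative argument needs scipy 1.6 or later), with the returns from the table above:

    from scipy import stats

    returns = [8.76, 58.71, -22.82, 49.02, 43.97, 13.91, -2.89, 39.64,
               17.91, -10.55, 68.58, -0.33, 4.00, 0.14, 23.87, 38.23,
               26.60, 36.02, 29.51, 1.62, 28.91, 29.32, 54.43, 30.76, 49.67]

    # H0: mu = 5.64 against HA: mu > 5.64
    result = stats.ttest_1samp(returns, popmean=5.64, alternative='greater')
    print(result.statistic, result.pvalue)   # p-value is 0.000 to 3 dp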


Select Modified Data from the pop-up menu and use the slider to investigate the relationship between the sample mean and the p-value for the test.

Two-tailed test

In some hypothesis tests, the alternative hypothesis allows both low and high values of µ.

H0:   µ = µ0
HA:   µ ≠ µ0

In this type of two-tailed test, the p-value is the sum of the two tail areas, as illustrated below.

[diagram: t test statistic and two-tailed p-value]

10.4   Decisions and significance

  1. Hypothesis tests and decisions
  2. Decision rules
  3. Significance level and p-values
  4. Sample size and power

10.4.1   Hypothesis tests and decisions

Strength of evidence against H0

We have explained how p-values describe the strength of evidence against the null hypothesis.

Saturated fat content of cooking oil

It has been claimed that the saturated fat content of soybean cooking oil is no more than 15%. A clinician believes that the saturated fat content is greater than 15% and randomly samples 13 bottles of soybean cooking oil for testing.

Percentage saturated fat in soybean cooking oil
15.2   12.4   15.4   13.5   15.9   17.1   16.9
14.3   19.1   18.2   15.5   16.3   20.0

The clinician is interested in the following hypotheses.

H0:   µ = 15
HA:   µ > 15

The p-value of 0.04 means that there is moderately strong evidence against H0 — i.e. moderately strong evidence that the mean saturated fat content is greater than 15%.
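The quoted p-value can be checked in the same way as the mutual-fund example earlier; again this assumes scipy 1.6 or later.

    from scipy import stats

    fat = [15.2, 12.4, 15.4, 13.5, 15.9, 17.1, 16.9,
           14.3, 19.1, 18.2, 15.5, 16.3, 20.0]

    # H0: mu = 15 against HA: mu > 15
    result = stats.ttest_1samp(fat, popmean=15, alternative='greater')
    print(round(result.pvalue, 2))   # 0.04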

Decisions from tests

We now take a different (but related) approach to hypothesis testing.

Many hypothesis tests are followed by some action that depends on whether we conclude from the test results that H0 or HA is true. This decision depends on the data.

Decision    Action
accept H0    some action (often the status quo)   
reject H0    a different action (often a change to a process)   

However the decision that is made could be wrong. There are two ways in which an error might be made — wrongly rejecting H0 when it is true (called a Type I error), and wrongly accepting H0 when it is false (called a Type II error). These are represented by the red cells in the table below:

                                       Decision
                                accept H0        reject H0
True state   H0 is true         correct          Type I error
of nature    HA (H0 is false)   Type II error    correct

A good decision rule about whether to accept or reject H0 (and perform the corresponding action) will have small probabilities for both kinds of error.

Saturated fat content of cooking oil

The clinician who tested the saturated fat content of soybean cooking oil was interested in the hypotheses.

H0:   µ = 15
HA:   µ > 15

If H0 is rejected, the clinician intends to report the high saturated fat content to the media. The two possible errors that could be made are described below.

                                       Decision
                                accept H0            reject H0
                                (do nothing)         (contact media)
Truth   H0: µ is really         correct              wrongly accuses
        15% (or less)                                manufacturers
        HA: µ is really         fails to detect      correct
        over 15%                high saturated fat

Ideally the decision should be made in a way that keeps both probabilities low.

10.4.2   Decision rules

Using a sample mean to make decisions

We now introduce the idea of decision rules with a test about whether a population mean is a particular value, µ0, or greater. We assume initially that the population is normally distributed and that its standard deviation, σ, is known.

H0:   µ = µ0
HA:   µ > µ0

The decision about whether to accept or reject H0 should depend on the value of the sample mean, xBar. Large values throw doubt on H0.

Data                    Decision
xBar < k                accept H0
xBar is k or higher     reject H0

We want to choose the value k to make the probability of errors low. This is however complicated because of the two different types of error.

[diagram: the four truth/decision combinations, with the probabilities of Type I and Type II errors shown as shaded tail areas of the distribution of xBar]

Increasing the value of k to make the Type I error probability small (top right) also increases the Type II error probability (bottom left) so the choice of k for the decision rule is a trade-off between the acceptable sizes of the two types of error.

Illustration

The diagram below relates to a normal population whose standard deviation is known to be σ = 4. We will test the hypotheses

H0:   µ = 10
HA:   µ > 10

The test is based on the sample mean of n = 16 values from this distribution. The sample mean has a normal distribution,

xBar  ~  normal ( mean = µ,  sd = sigma/root(n) = 4/root(16) = 1 )

This normal distribution can be used to calculate the probabilities of the two types of error. The diagram below illustrates how these probabilities depend on the critical value for the test, k.

Drag the slider at the top of the diagram to adjust k. Observe that making k large reduces the probability of a Type I error, but makes a Type II error more likely. It is impossible to simultaneously make both probabilities small with only n = 16 observations.


Note also that there is not a single value for the probability of a Type II error — the probability depends on how far above 10 the mean µ lies. Drag the slider on the row for the alternative hypothesis to observe that:

The probability of a Type II error is always high if µ is close to 10, but is lower if µ is far above 10.

This is as should be expected — the further above 10 the population mean, the more likely we are to detect that it is higher than 10 from the sample mean.
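Both error probabilities can be computed directly from the normal distribution of xBar. A Python sketch for this illustration, where the critical value k plays the role of the slider and the alternative means tried are arbitrary:

    from math import erf, sqrt

    def phi(z):                          # standard normal CDF
        return 0.5 * (1 + erf(z / sqrt(2)))

    sd_xbar = 4 / sqrt(16)               # sd of the mean: sigma/root(n) = 1
    k = 11.5                             # rule: reject H0 when xbar >= k

    print(1 - phi((k - 10) / sd_xbar))   # P(Type I error) when mu = 10
    for mu in (10.5, 11, 12, 13):        # some values of mu under HA
        print(mu, phi((k - mu) / sd_xbar))   # P(Type II error) falls as mu rises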

10.4.3   Significance level and p-values

Significance level

The decision rule affects the probabilities of Type I and Type II errors and there is always a trade-off between these two probabilities. Selecting a critical value to reduce one error probability will increase the other.

In practice, we usually concentrate on the probability of a Type I error. The decision rule is chosen to make the probability of a Type I error equal to a pre-chosen value, often 5% or 1%. This probability is called the significance level of the test.

If the significance level of the test is set to 5% and we decide to reject H0 then we say that H0 is rejected at the 5% significance level.

Reducing the significance level of the test increases the probability of a Type II error.

Illustration

The diagram below is identical to the one on the previous page.

With the top slider, adjust k to make the probability of a Type I error as close as possible to 5%. This is the decision rule for a test with significance level 5%.

From the normal distribution, the appropriate value of k for a test with 5% significance level is 11.64.

Drag the top slider to reduce the significance level to 1% and note that the critical value for the test increases to about k = 12.3.
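
These quantiles are easy to verify with software. A minimal sketch, again assuming Python with scipy:

    # Critical values for testing H0: mu = 10 vs HA: mu > 10 (sigma = 4, n = 16)
    from scipy.stats import norm

    se = 4 / 16 ** 0.5                       # sigma / sqrt(n) = 1.0
    print(norm.ppf(0.95, loc=10, scale=se))  # 11.645 -> k for a 5% level test
    print(norm.ppf(0.99, loc=10, scale=se))  # 12.326 -> k for a 1% level test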

P-values and decisions

The critical value for a hypothesis test about a population mean (known standard deviation) with any significance level (e.g. 5% or 1%) can be obtained from the quantiles of normal distributions. For other hypothesis tests, similar critical values can be found from quantiles of the distribution of the relevant test statistic.

For example, when testing the mean of a normal population when the population standard deviation is unknown, the test statistic is a t-value and its critical values are quantiles of a t distribution.

It would seem that different methodology is needed to find decision rules for different types of hypothesis test, but this is only partially true. Although some of the underlying theory depends on the type of test, the decision rule for any test can be based on its p-value. For example, for a test with significance level 5%, the decision rule is always:

Decision
p-value ≥ 0.05     accept H0
p-value < 0.05     reject H0

For a test with significance level 1%, the null hypothesis, H0, should be rejected if the p-value is less than 0.01.

If computer software provides the p-value for a hypothesis test, it is therefore easy to translate it into a decision to accept or reject the null hypothesis at the 5% or 1% significance level.
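
As a minimal sketch of this translation (the function decision below is our own illustration, not part of any statistical package):

    def decision(p_value, significance_level=0.05):
        """Decision rule based on a test's p-value."""
        return "reject H0" if p_value < significance_level else "accept H0"

    print(decision(0.032))         # reject H0 at the 5% significance level
    print(decision(0.032, 0.01))   # accept H0 at the 1% significance level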


Illustration

The following diagram again investigates decision rules for testing the hypotheses

H0: µ = 10,    HA: µ > 10

based on a sample of n = 16 values from a normal population with known standard deviation σ = 4.

In the diagram, the decision rule is based on the p-value for the test. Use the slider to adjust the critical p-value and observe that the significance level (probability of Type I error) is always equal to the p-value used in the decision rule. Adjust the critical p-value to 0.01.

Although the probability of a Type II error (the bottom row of the earlier table) varies depending on the type of test, the top row, the probability of a Type I error, is the same for all kinds of hypothesis test.


10.4.4   Sample size and power

Power of a test

A decision rule about whether to accept or reject H0 can result in one of two types of error. The probabilities of making these errors describe the risks involved in the decision.

Prob(Type I error)
This is the significance level of the test. The decision rule is usually defined to make the significance level 5% or 1%.
Prob(Type II error)
When the alternative hypothesis includes a range of possible parameter values (e.g. µ ≠ 0), this probability is not a single value but depends on the parameter.

Instead of the probability of a Type II error, it is common to use the power of the test, defined as one minus the probability of a Type II error,

The power of a test is the probability of correctly rejecting H0 when it is false.

When the alternative hypothesis includes a range of possible parameter values (e.g. µ ≠ 0), the power depends on the actual parameter value.

                              Decision
                      accept H0            reject H0
Truth  H0 is true     correct              Significance level = P(Type I error)
       HA (H0 false)  P(Type II error)     Power = 1 − P(Type II error)

Increasing the power of a test

It is clearly desirable to use a test whose power is as close to 1.0 as possible. There are three different ways to increase the power.

Increase the significance level
If the critical value for the test is adjusted, increasing the probability of a Type I error decreases the probability of a Type II error and therefore increases the power.
Use a different decision rule
For example, in a test about the mean of a normal population, a decision rule based on the sample median has lower power than a decision rule based on the sample mean.

In CAST, we only describe the most powerful type of decision rule for each test, so you will not be able to increase the power by changing the decision rule.

Increase the sample size
By increasing the amount of data on which we base our decision about whether to accept or reject H0, the probabilities of making errors can be reduced.

When the significance level is fixed, increasing the sample size is therefore usually the only way to improve the power.

Illustration

The following diagram again investigates decision rules for testing the hypotheses

H0: µ = 10,    HA: µ > 10

based on samples from a normal population with known standard deviation σ = 4. We will fix the significance level of the test at 5%.

The top half of the diagram shows the normal distribution of the sample mean for a sample of size n = 16. Use the slider to increase the sample size and observe that the distribution of the sample mean becomes narrower, the critical value k moves closer to 10, and the power of the test increases for any fixed mean in HA.
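
The same effect can be sketched numerically (Python with scipy; evaluating the power at µ = 12 is an arbitrary illustrative choice):

    # Power of the 5%-level test of H0: mu = 10 vs HA: mu > 10 (sigma = 4 known)
    from scipy.stats import norm

    def power(n, mu_alt=12, mu0=10, sigma=4, alpha=0.05):
        se = sigma / n ** 0.5
        k = norm.ppf(1 - alpha, loc=mu0, scale=se)   # critical value for x-bar
        return 1 - norm.cdf(k, loc=mu_alt, scale=se)

    for n in (16, 25, 50, 100):
        print(n, round(power(n), 3))
    # The power rises towards 1.0 as n increases, while the significance
    # level stays fixed at 5%.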


10.5   Properties of p-values

  1. Null and alternative hypotheses
  2. Consistency with null hypothesis
  3. Distribution of p-values
  4. Interpretation of a p-value
  5. P-values for other tests

10.5.1   Null and alternative hypotheses

Symmetric hypotheses

In some situations there is a kind of symmetry between the two competing hypotheses. The sample data provide information about which of the two hypotheses is true.

Shareholder meeting vote

Two candidates, Mike Smith and Sarah Brown, stand for election as chairperson of the board of directors of a large company. Just before the shareholders' meeting at which the election will be held, 56 randomly selected shareholders are asked about their voting intentions. If the proportion intending to vote for Mike Smith is denoted by π, the hypotheses of interest are

π > 0.5 (Mike Smith will win)    versus    π < 0.5 (Sarah Brown will win)

The diagram below illustrates how the poll results might weigh the evidence for each candidate winning.

Drag the slider to see how different sample numbers choosing Mike Smith affect the evidence. Unless either candidate receives (say) three quarters of the sample vote, we should admit that there is some doubt about who will win — the sample may not accurately reflect the population proportions.
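
To see why a modest sample majority is weak evidence, consider how much a poll of 56 shareholders could vary by chance. A sketch under a simple binomial model (Python with scipy; the 33-23 split is an invented example):

    # How likely are various sample majorities if the voters are evenly split?
    from scipy.stats import binom

    n = 56
    print(1 - binom.cdf(32, n, 0.5))   # P(33 or more for Smith): about 0.11
    print(1 - binom.cdf(41, n, 0.5))   # P(42 or more, three quarters): about 0.0002
    # A 33-23 split could easily arise by chance; a 42-14 split could not.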

Null and alternative hypotheses

In statistical hypothesis testing, the two hypotheses are not treated symmetrically in this way. We must distinguish between them in a much more fundamental way.

In statistical hypothesis testing, we do not ask which of the two competing hypotheses is true.

Instead, we ask whether the sample data are consistent with one particular hypothesis (the null hypothesis, denoted by H0). If the data are not consistent with the null hypothesis, then we can conclude that the competing hypothesis (the alternative hypothesis, denoted by HA) must be true.

This distinction between the hypotheses is important. Depending on the sample data, it may be possible to conclude that HA is true. However, regardless of the data, the strongest statement we can make in support of H0 is that the data are consistent with it.

We can never conclude that H0 is likely to be true.


Market share estimation through audits

The traditional retail store audit is a widely used marketing research tool among consumer packaged goods companies. The retail store audit involves periodic audits of a sample of retail stores to monitor inventory and purchases of a particular product. Another auditing procedure, the weekend selldown audit, has been proposed as a less expensive alternative.

The market shares of 10 brands of fruit juice were estimated using both of the store audit methods. Do the two methods result in the same estimates, on average? The data are paired, so we analyse the difference in estimates for each product (traditional minus weekend selldown) and test whether the underlying population mean of these values is zero.

H0: µ = 0,    HA: µ ≠ 0

The diagram below illustrates the evidence obtained from a set of sample data.

Drag the slider to see the conclusions that might be reached for data sets with different means. The further the sample mean is from zero (on either side), the stronger the evidence that µ is not zero. We can get very strong evidence that H0 does not hold if the sample mean is far from zero.

However, even x̄ = 0 does not provide strong evidence that µ = 0.

If x̄ = 0, µ could just as easily be 0.0001 or −0.0002 (values which correspond to HA). We cannot distinguish between these possibilities, so the best we can say is that the data are consistent with the null hypothesis — the data provide no evidence against µ being zero.

In the context of this example, the conclusion from a sample mean of zero would be that the experiment gave no evidence of any difference between the mean estimates from the two auditing methods. The mean estimates might be different, but the data did not detect the effect.
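
A paired test like this is easy to run in software. The sketch below assumes Python with scipy; the ten differences are invented for illustration, since the actual market-share estimates are not listed here.

    # One-sample t-test of H0: mu = 0 vs HA: mu != 0 applied to paired differences
    from scipy.stats import ttest_1samp

    # hypothetical differences (traditional minus weekend selldown), one per brand
    differences = [0.3, -1.2, 0.8, 0.4, -0.6, 1.1, -0.2, 0.5, -0.9, 0.7]
    result = ttest_1samp(differences, popmean=0)
    print(result.statistic, result.pvalue)
    # A large p-value means the data are consistent with mu = 0; it does
    # not show that mu = 0 is true.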

The distinction between the null and alternative hypotheses is so important that we repeat it below.

We never try to 'prove' that H0 holds, though we may be able to 'prove' that HA holds.

10.5.2   Consistency with null hypothesis

Describing the credibility of the null hypothesis

On the previous page, a diagram with scales illustrated how the evidence against H0 was 'weighed' for different data sets. A p-value is a numerical description of this evidence and gives a scale to that diagram.

A p-value is a numerical summary statistic that describes the evidence against H0.


Market share estimation through audits

In the example on the previous page, two different auditing methods were used to estimate market share. Is there any difference between the mean estimate of market share for the two methods?

The diagram below weighs the evidence using the p-value from a t-test of whether the mean difference, µ, is zero.


The p-value is an index of credibility for the null hypothesis, µ = 0.


P-values have the same interpretation for all hypothesis tests.

10.5.3   Distribution of p-values

Interpretation of p-values

Many different types of hypothesis test are commonly used in advanced statistics, but all share common features.

A p-value is a statistic that is evaluated from a random sample, so it has a distribution in the same way that a sample mean has a distribution. This distribution also has features that are common to all hypothesis tests. Understanding the distribution of p-values is the key to understanding how they are interpreted.

Distribution of p-values

In any hypothesis test,

  1. if the null hypothesis holds, the p-value has a rectangular distribution between 0 and 1, and
  2. if the alternative hypothesis holds, p-values are more likely to be near 0 than near 1.

The diagram below shows typical distributions that might be obtained.

To illustrate these properties, we use a test for whether a population mean is zero.

H0: µ = 0,    HA: µ ≠ 0

In the diagram below, you will take random samples from a normal population for which H0 is true and, separately, from populations for which HA is true.

When H0 holds

Initially the population mean is zero, so H0 holds. A single sample from this population is shown on the left and the p-value for testing whether the population mean is zero is shown as a cross on the jittered dot plot on the bottom right.

Click the button Take sample a few times to take other samples from this population and add their p-values to the display on the bottom right. After taking 50 or more samples, you should observe that the p-values are spread evenly between 0 and 1. This supports our assertion that the p-values have a rectangular distribution between 0 and 1 when H0 holds.

When HA holds

Now use the slider to change the true population mean to 2.0. We are still testing whether the mean is zero, so HA now holds. Take 40 or 50 samples and observe that the p-values are usually closer to 0 than to 1.

Click on some of the larger p-values on the jittered dot plot to display the samples that gave rise to them. The sample means vary and, by chance, some samples have means that are near 0.0, even when the population mean is 2.0; these samples result in larger p-values.

Repeat this exercise with different population means (try at least 1.0, 2.0, 3.0 and −2.0). The further the population mean is from 0.0, the value specified by H0, the more tightly the p-values cluster around 0.0.

Although it is possible to obtain a low p-value when H0 holds and a high p-value when HA holds, low p-values are more likely under HA than under H0.
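
The same simulation can be sketched without the interactive diagram (Python with numpy and scipy; the sample size n = 20 and σ = 4 are arbitrary choices):

    # Distribution of p-values from t-tests of H0: mu = 0, under H0 and under HA
    import numpy as np
    from scipy.stats import ttest_1samp

    rng = np.random.default_rng(1)

    def simulate_pvalues(mu, n=20, reps=1000):
        samples = rng.normal(loc=mu, scale=4, size=(reps, n))
        return ttest_1samp(samples, popmean=0, axis=1).pvalue

    bins = dict(bins=10, range=(0, 1))
    print(np.histogram(simulate_pvalues(0), **bins)[0])   # roughly flat counts
    print(np.histogram(simulate_pvalues(2), **bins)[0])   # counts piled near 0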

10.5.4   Interpretation of a p-value

P-values and probability

We saw on the previous page that p-values have a rectangular distribution between 0 and 1 when H0 holds. A consequence of this is that the probability of obtaining a p-value of 0.1 or lower is exactly 0.1 (when H0 holds). This is illustrated on the left of the diagram below.

Similarly, the probability of obtaining a p-value of 0.01 or lower is exactly 0.01, etc. (when H0 holds).

P-values are most likely to be near 0 if the alternative hypothesis holds

Again, we use the specific hypothesis test for

H0: µ = 0,    HA: µ ≠ 0

in order to demonstrate these general results.

Click the button Take sample 50 or more times to take samples from a population for which H0 holds and add their p-values to the display on the right. From the diagram on the top right, we can read off the proportion of p-values that are less than any value: approximately 50% of p-values are less than 0.5, 20% are less than 0.2, etc. when the null hypothesis is true.

Use the slider to change the true population mean to 1.5 and repeat. From the diagram on the top right, you should observe that more than 50% of p-values are less than 0.5, more than 20% are less than 0.2, etc. when the alternative hypothesis holds.
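
These proportions can be checked with the same kind of simulation as before (again Python with numpy and scipy; n = 20 and σ = 4 are arbitrary):

    # Proportion of p-values below various cut-offs, under H0 and under HA
    import numpy as np
    from scipy.stats import ttest_1samp

    rng = np.random.default_rng(2)

    def simulate_pvalues(mu, n=20, reps=2000):
        samples = rng.normal(loc=mu, scale=4, size=(reps, n))
        return ttest_1samp(samples, popmean=0, axis=1).pvalue

    p_null, p_alt = simulate_pvalues(0), simulate_pvalues(1.5)
    for cut in (0.5, 0.2, 0.05):
        print(cut, (p_null < cut).mean(), (p_alt < cut).mean())
    # Under H0 each proportion is close to the cut-off itself; under HA
    # (mu = 1.5) every proportion is larger.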

Interpretation of p-value

Remembering that low p-values favour HA more than H0, we can give the following interpretation to a p-value.

If a data set gives rise to a p-value of say 0.0023, we can state that the probability of getting a data set with such a low p-value is only 0.0023 if H0 is true. Since such a low p-value is so unlikely, the data give strong evidence that H0 does not hold.

Of course, we may be wrong. A p-value of 0.0023 could arise when either H0 or HA holds. However it is unlikely when H0 is true and more likely when HA is true.

Similarly, a p-value as low as 0.4 occurs with probability 0.4 when the null hypothesis holds. Since this is fairly high, we conclude from a data set that gives rise to a p-value of 0.4 that there is no evidence that the null hypothesis does not hold.

Although it may be regarded as an over-simplification, the table below may be used as a guide to interpreting p-values.

p-value                 Interpretation
over 0.1                no evidence that the null hypothesis does not hold
between 0.05 and 0.1    very weak evidence that the null hypothesis does not hold
between 0.01 and 0.05   moderately strong evidence that the null hypothesis does not hold
under 0.01              strong evidence that the null hypothesis does not hold
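
The guide translates directly into a small helper function (a sketch; the wording and function name are ours):

    def interpret(p_value):
        """Verbal interpretation of a p-value, following the table above."""
        if p_value > 0.1:
            return "no evidence against H0"
        elif p_value > 0.05:
            return "very weak evidence against H0"
        elif p_value > 0.01:
            return "moderately strong evidence against H0"
        return "strong evidence against H0"

    print(interpret(0.0023))   # strong evidence against H0
    print(interpret(0.4))      # no evidence against H0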

10.5.5   P-values for other tests

Applying the general properties of p-values to different tests

The properties of p-values (and hence their interpretation) have been demonstrated in the context of a hypothesis test about whether a population mean was zero.

P-values for all hypothesis tests share these properties. As a result, we can interpret any p-value if we know the null and alternative hypotheses that it tests, even if we do not know the formula that underlies it. (In practice, a statistical computer program is generally used to perform hypothesis tests, so knowledge of the formula is of little importance.)

In particular, for any test where the null hypothesis restricts a parameter to a single value, the p-value has a rectangular distribution between 0 and 1 when H0 holds and tends to be closer to 0 when HA holds, so the same guidelines can be used for its interpretation.

p-value                 Interpretation
over 0.1                no evidence that the null hypothesis does not hold
between 0.05 and 0.1    very weak evidence that the null hypothesis does not hold
between 0.01 and 0.05   moderately strong evidence that the null hypothesis does not hold
under 0.01              strong evidence that the null hypothesis does not hold

Another type of test

The normal distribution is often used as a hypothetical population from which a set of data are assumed to be sampled. But are the data consistent with an underlying normal population, or does the population distribution have a different shape?

One popular test for assessing whether a random sample comes from a normal population is the Shapiro-Wilk W test. The theory behind the test is advanced and the formula for its p-value cannot readily be evaluated by hand, but most statistical programs will perform the test.

A random sample of 40 values from a normal population is displayed in a jittered dot plot on the left of the diagram. The p-value for the Shapiro-Wilk W test is shown under the dot plot and also graphically on the right.

Click Take sample a few times to take more samples and build the distribution of the p-values for the test. You should observe that the p-values have a rectangular distribution between 0 and 1 when the null hypothesis is true (i.e. if the samples are from a normal distribution).

Drag the slider on the top left of the diagram to change the shape of the population distribution. Repeat the exercise above and observe that when the null hypothesis does not hold, the p-values tend to be closer to 0.

Click on crosses on the display of p-values in the bottom right to display the sample that produced that p-value. P-values near zero usually correspond to samples that have very long tails to one or both sides, or have very short tails to one or both sides.
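
The exercise can also be reproduced non-interactively (Python with numpy and scipy; the exponential population is our stand-in for a skewed alternative):

    # Shapiro-Wilk p-values for samples of 40 from normal and skewed populations
    import numpy as np
    from scipy.stats import shapiro

    rng = np.random.default_rng(3)
    p_normal = [shapiro(rng.normal(size=40))[1] for _ in range(500)]
    p_skewed = [shapiro(rng.exponential(size=40))[1] for _ in range(500)]

    print(np.mean(np.array(p_normal) < 0.05))   # near 0.05: rectangular distribution
    print(np.mean(np.array(p_skewed) < 0.05))   # much larger: p-values pile up near 0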

Returns from Mutual Funds

As a numerical example, the table below gives the annual returns (%) for a sample of 137 mutual funds in 1999 (a period of rapid growth in the US economy).

 41.9   90.6   29.9   10.2   33.7   26.9   88.5    6.5   16.6   19.2
 12.6   32.0    3.6    8.1   68.1   57.9   -3.0   42.2   14.5   25.7
 28.1   78.4  126.2   42.0   66.6   20.6   54.6   31.7    2.3   45.5
 55.5   37.2   51.6   97.1   80.3   41.1    7.3   31.0   30.2    1.7
 27.0   38.0  144.9   27.8  121.9   26.0  -11.5   15.5   16.9   27.3
 23.9   61.1   68.2   10.0   37.8   77.1   24.3   63.2   -0.6    1.0
 12.1  134.5   53.8   60.4    9.0   -6.4   31.0   -2.8  114.6   19.8
 11.5   39.6   59.0   20.7   37.3   23.1   32.7   13.0   70.6   87.3
 -3.2  -20.8  119.1   -0.1  104.4   -4.6   72.5    7.7   31.4   36.9
 47.2   74.7   29.1   70.5   77.7   81.0  191.8    1.6   -0.8   59.4
 -2.2  -12.5   81.6   44.0   63.6  114.3   33.6   83.0   70.8   50.1
 55.8   28.3   -7.9   51.3   37.7   48.3   88.9   59.4  126.9   35.0
 51.0   91.1   -2.7   79.2    0.1   12.9   16.2   23.0   22.4   64.4
 10.2    7.6   27.7    8.0   23.5   25.3   22.5

A histogram of the data is shown below.

The best-fitting normal distribution (with mean and standard deviation equal to those of the data) has been superimposed on the histogram. There is a suggestion of skewness in the distribution of returns. Are the data really skewed, or might this amount of skewness arise in random samples from a normal distribution?

Applying the Shapiro-Wilk W test to the data using the statistical program Minitab gives a p-value of "under 0.01", so there is strong evidence that the distribution is not normal. Even after deleting the 'outlier' — the First American Technology fund, with a return of 191.8% — there is still strong evidence of skewness in the distribution of returns.
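
For readers with Python rather than Minitab, scipy provides the same test as scipy.stats.shapiro(); applied to the 137 returns above, it should likewise report a very small p-value.

    # Shapiro-Wilk W test applied to the 1999 mutual fund returns (%)
    import numpy as np
    from scipy.stats import shapiro

    returns = np.array([
         41.9,  90.6,  29.9, 10.2,  33.7,  26.9,  88.5,  6.5,  16.6, 19.2,
         12.6,  32.0,   3.6,  8.1,  68.1,  57.9,  -3.0, 42.2,  14.5, 25.7,
         28.1,  78.4, 126.2, 42.0,  66.6,  20.6,  54.6, 31.7,   2.3, 45.5,
         55.5,  37.2,  51.6, 97.1,  80.3,  41.1,   7.3, 31.0,  30.2,  1.7,
         27.0,  38.0, 144.9, 27.8, 121.9,  26.0, -11.5, 15.5,  16.9, 27.3,
         23.9,  61.1,  68.2, 10.0,  37.8,  77.1,  24.3, 63.2,  -0.6,  1.0,
         12.1, 134.5,  53.8, 60.4,   9.0,  -6.4,  31.0, -2.8, 114.6, 19.8,
         11.5,  39.6,  59.0, 20.7,  37.3,  23.1,  32.7, 13.0,  70.6, 87.3,
         -3.2, -20.8, 119.1, -0.1, 104.4,  -4.6,  72.5,  7.7,  31.4, 36.9,
         47.2,  74.7,  29.1, 70.5,  77.7,  81.0, 191.8,  1.6,  -0.8, 59.4,
         -2.2, -12.5,  81.6, 44.0,  63.6, 114.3,  33.6, 83.0,  70.8, 50.1,
         55.8,  28.3,  -7.9, 51.3,  37.7,  48.3,  88.9, 59.4, 126.9, 35.0,
         51.0,  91.1,  -2.7, 79.2,   0.1,  12.9,  16.2, 23.0,  22.4, 64.4,
         10.2,   7.6,  27.7,  8.0,  23.5,  25.3,  22.5,
    ])

    print(shapiro(returns))                      # p-value well under 0.01
    print(shapiro(returns[returns != 191.8]))    # still small with the outlier removed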

You should be able to interpret p-values that computer software provides for a wide variety of hypothesis tests using the properties that we have described in this section.