
Chapter 10   Testing Hypotheses

10.1   Introduction to hypothesis tests

  1. Inference
  2. Soccer league simulation
  3. Simulation to test a proportion
  4. Test for a mean
  5. Randomisation tests
  6. Randomisation test for correlation
  7. Common patterns in tests

10.1.1   Inference

Statistical inference

The term statistical inference describes statistical techniques that obtain information about a population parameter (or parameters) based on a single random sample from that population. There are two different but related types of question about the population parameter (or parameters) that we might ask:

What parameter values would be consistent with the sample data?

This branch of inference is called estimation and its main tool is a confidence interval. We described confidence intervals in the previous chapter.

A manufacturer of muesli bars needs to describe the average fat content of the bars (the mean of the hypothetical population of fat contents that would be produced using the recipe). Several bars are analysed and their fat contents are measured.

The sample mean is a point estimate of the population mean, and a 95% confidence interval can also be found.

Are the sample data consistent with some statement about the parameters?

This branch of inference is called hypothesis testing and is the focus of this chapter.

A particular brand of muesli bar is claimed by the manufacturer to have a fat content of 3.4g per bar. A consumer group suspects that the manufacturer is understating the fat content, so a random sample of bars is analysed.

The consumer group must assess whether the data are consistent with the statement (hypothesis) that the underlying population mean is 3.4g.

Errors and strength of evidence

When we studied parameter estimation, we saw that a population parameter cannot be determined exactly from a single random sample — there is a 5% chance that a 95% confidence interval will not include the true population parameter.

In a similar way, a single random sample can rarely provide enough information about a population parameter to allow us to be sure whether or not any hypothesis about that parameter will be true. The best we can hope for is an indication of the strength of the evidence against the hypothesis.

The remainder of this chapter explains how this evidence is obtained and reported.

10.1.2   Soccer league simulation

Randomness in sports results

Although we like to think that the 'best' team wins in sports competitions, there is actually considerable variability in the results. Much of this variability can be considered to be random — if the same teams play again, the results are often different. The most obvious examples of this randomness occur when a series of matches is played between the same two teams.

Since the teams are virtually unchanged in any series, the variability in results can only be explained through randomness.

Randomness or skill?

When we look at sports results, can we tell whether all teams are equally matched with the same probability of winning? Or do some teams have a higher probability of winning than others?

There are different ways to examine this question, depending on the type of data that is available. The following example assesses an end-of-year league table.

English Premier Soccer League, 2008/09

In the English Premier Soccer league, each of the 20 teams plays every other team twice (home and away) during the season. Three points are awarded for a win and one point for a draw. The table below shows the wins, draws, losses and total points for all teams at the end of the 2008/09 season.

 
      Team                 Wins   Draws   Losses   Points
  1.  Manchester_U          28      6       4        90
  2.  Liverpool             25     11       2        86
  3.  Chelsea               25      8       5        83
  4.  Arsenal               20     12       6        72
  5.  Everton               17     12       9        63
  6.  Aston_Villa           17     11      10        62
  7.  Fulham                14     11      13        53
  8.  Tottenham             14      9      15        51
  9.  West_Ham              14      9      15        51
 10.  Manchester_C          15      5      18        50
 11.  Wigan                 12      9      17        45
 12.  Stoke_City            12      9      17        45
 13.  Bolton                11      8      19        41
 14.  Portsmouth            10     11      17        41
 15.  Blackburn             10     11      17        41
 16.  Sunderland             9      9      20        36
 17.  Hull_City              8     11      19        35
 18.  Newcastle              7     13      18        34
 19.  Middlesbrough          7     11      20        32
 20.  West_Brom_Albion       8      8      22        32

We observed in an earlier simulation that there is considerable variability in the points, even when all teams are evenly matched. However, ...

If some teams are more likely to win their matches than others, the spread of final points is likely to be greater — the top and bottom teams are likely to be more extreme.

A simulation

To assess whether there is any difference in skill levels, we can therefore run a simulation of the league, assuming evenly matched teams and generating random results with probabilities 0.372, 0.372 and 0.255 for wins, losses and draws. (A proportion 0.255 of games in the actual league resulted in draws.)

Click Simulate to simulate the 380 games in a season. The standard deviation of the final points is shown below the table. Click Accumulate then run the simulation about 100 times. (Hold down the Simulate button to speed up the process.)

The standard deviation of the points in the actual league table was 18.2. Since most simulated standard deviations are between 5 and 12, we conclude that such a high spread would be extremely unlikely if the teams were evenly matched.

There is strong evidence that the top teams are 'better' than the bottom teams.
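The same simulation is easy to reproduce in code. Below is a minimal Python sketch under the assumptions in the text (evenly matched teams, draw probability 0.255, the remaining probability split equally between home and away wins); the function name simulate_season and the 1,000 repetitions are our own choices.

    import random
    from itertools import combinations

    def simulate_season(n_teams=20, p_draw=0.255):
        # One season: each pair of teams plays home and away (380 games).
        points = [0] * n_teams
        for a, b in combinations(range(n_teams), 2):
            for home, away in ((a, b), (b, a)):
                u = random.random()
                if u < p_draw:                        # draw: 1 point each
                    points[home] += 1
                    points[away] += 1
                elif u < p_draw + (1 - p_draw) / 2:   # home team wins: 3 points
                    points[home] += 3
                else:                                 # away team wins: 3 points
                    points[away] += 3
        mean = sum(points) / n_teams
        return (sum((p - mean) ** 2 for p in points) / (n_teams - 1)) ** 0.5

    sds = [simulate_season() for _ in range(1000)]
    # Proportion of simulated seasons whose spread of points reaches the
    # actual 18.2; this is almost always 0.
    print(sum(sd >= 18.2 for sd in sds) / len(sds))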


10.1.3   Simulation to test a proportion

Other uses of simulation

Simulations can help us to answer questions about a variety of other models (or populations). The following example was suggested by Kay Lipson from Swinburne University of Technology in Australia.

Does Australia Post deliver on time?

The Herald-Sun newspaper published the following article on 25 November 1992.

Doubt has been cast over Australia Post's claim of delivering 96 per cent of standard letters on time.

A survey conducted by the Herald-Sun in Melbourne revealed that less than 90 per cent of letters were delivered according to the schedule.

Herald-Sun staff posted 59 letters before the advertised...

Campbell Fuller, Herald-Sun, 25 November 1992.

Is the author justified in disputing Australia Post's claim that 96% of letters are delivered on time?

A simulation

If Australia Post's claim is correct, and every letter independently has probability 0.96 of being delivered on time, we know that the number delivered on time out of 59 letters will be a random quantity. From the information in the article, we can deduce that 52 out of the Herald-Sun's 59 letters arrived on time (a proportion 52/59 = 0.881).

How unlikely is it to get as few as 52 out of 59 letters arriving on time if Australia Post's claim that the probability of letters arriving on time is 0.96 is correct?

A simulation helps to answer this question.

Click Simulate to randomly 'deliver' 59 letters, with each independently having probability 0.96 of arriving on time. Click Accumulate then run the simulation between 100 and 200 times. (Hold down the Simulate button to speed up the process.)

Observe the distribution of the number of letters arriving on time. The proportion of simulations with 52 or fewer letters arriving on time is shown to the right of the dot plot. Observe that this rarely happens.

We therefore conclude that the article is justified — only 52 letters being delivered on time is most unlikely if Australia Post's claim is correct.
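For readers who prefer code to the interactive diagram, here is a minimal Python sketch of the same simulation (the 10,000 repetitions are an arbitrary choice).

    import random

    def on_time_count(n=59, p=0.96):
        # Deliver n letters; each independently arrives on time with probability p.
        return sum(random.random() < p for _ in range(n))

    n_sims = 10_000
    extreme = sum(on_time_count() <= 52 for _ in range(n_sims))
    print(extreme / n_sims)   # rarely happens (around 1% of simulations)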

We will return to this example later.

10.1.4   Test for a mean

Assessing a claim about a mean

In this example, we ask whether a sample mean is consistent with the underlying population mean having a target value.

Quality control for cornflake packets

In a factory producing packets of cornflakes, the weight of cornflakes that a filling machine places in each packet varies from packet to packet. From extensive previous monitoring of the operation of the machine, it is known that the net weight of '500 gm' packets is approximately normal with standard deviation σ = 10 gm.

The mean net weight of cornflakes in the packets is controlled by a single knob. The target is for a mean of µ = 520 gm to ensure that few packets will contain less than 500 gm. Samples are regularly taken to assess whether the machine needs to be adjusted. A sample of 10 packets was weighed and contained an average of 529 gm. Does this indicate that the underlying mean has drifted from µ = 520 and that the machine needs to be adjusted?

A simulation

If the filling machine is working to specifications, each packet would contain a weight that is sampled from a normal distribution with µ = 520 and σ = 10.

How unlikely is it to get the mean of a sample of size 10 that is as far from 520 as 529 if the machine is working correctly?

A simulation helps to answer this question.

Click Simulate to randomly generate the weights of 10 packets from a normal (µ = 520, σ = 10) distribution. Click Accumulate then run the simulation between 100 and 200 times. (Hold down the Simulate button to speed up the process.)

Observe that although many of the individual cornflake packets weigh more than 529 gm, it is rare for the mean weight to be as far from the target as 529 gm (i.e. either ≥529 gm or ≤511 gm).

There is therefore strong evidence that the machine is no longer filling packets with a mean weight of 520 gm and needs adjusting — a sample mean of 529 gm would be unlikely if the machine was filling packets to specifications.
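A Python sketch of this simulation, using the sample size, mean and standard deviation from the text (the 10,000 repetitions are arbitrary):

    import random

    def sample_mean(n=10, mu=520, sigma=10):
        return sum(random.gauss(mu, sigma) for _ in range(n)) / n

    n_sims = 10_000
    # Count sample means at least as far from 520 as the observed 529.
    extreme = sum(abs(sample_mean() - 520) >= 9 for _ in range(n_sims))
    print(extreme / n_sims)   # roughly 0.004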

We will return to this example later.

10.1.5   Randomisation tests

Simulation and randomisation

Simulation and randomisation are closely related techniques. Both are based on assumptions about the model underlying the data and involve randomly generated data sets.

Simulation
New data sets are generated directly from the model.
Randomisation
Modifications to the actual data are identified that would have the same probability of arising if the model held. New data sets are randomly picked from these.

Randomisation is understood most easily through an example.

Comparing two groups

If random samples are taken from two populations, we are often interested in whether the populations have the same means.

If the two populations were identical, any allocation of the sample values to the two groups would have been as likely as the observed sample data. By observing the distribution of the difference in means from such randomised allocations of values to groups, we can get an idea of whether the actual difference in sample means is unusually large.

An example helps to explain this method.

Characteristics of failed companies

A study in Greece compared characteristics of 68 healthy companies with those of another 33 that had recently failed. The jittered dot plots on the left below show the ratio of current assets to current liabilities for each of the 101 companies.

The mean asset-to-liabilities ratio for the sample of failed companies is 0.902 lower than that for the healthy companies, but the distributions overlap. Might this difference be simply a result of randomness, or can we conclude that there is a difference in the underlying populations?

Click Randomise to randomly pick 33 of the 101 values for the failed group. If the underlying distribution of asset-to-liabilities ratios was the same for healthy and failed companies, each such randomised allocation would be as likely as the observed data.

Click Accumulate and repeat the randomisation several more times. Observe that the difference in means would rarely be as far from zero as -0.902 when we assume the same distribution for both groups. This strongly suggests that the distributions must be different.

Since the actual difference is so unusually large, ...

We can conclude that there is strong evidence that the mean asset-to-liabilities ratio is lower for failed companies than healthy ones.
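The randomisation itself takes only a few lines of code. The Python sketch below shows the mechanics; since the 101 company ratios are not listed here, the two lists hold small hypothetical values purely for illustration.

    import random

    healthy = [2.3, 1.9, 2.8, 1.6, 2.1, 2.5]   # hypothetical ratios
    failed = [1.2, 0.9, 1.7, 1.1]              # hypothetical ratios

    observed = sum(failed) / len(failed) - sum(healthy) / len(healthy)
    pooled = healthy + failed

    n_rand = 10_000
    count = 0
    for _ in range(n_rand):
        shuffled = random.sample(pooled, len(pooled))   # random re-allocation
        f, h = shuffled[:len(failed)], shuffled[len(failed):]
        diff = sum(f) / len(f) - sum(h) / len(h)
        if abs(diff) >= abs(observed):
            count += 1

    print(count / n_rand)   # randomisation p-value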


10.1.6   Randomisation test for correlation

On this page, another example of randomisation is described, to assess whether teams in a soccer league are evenly matched.

English Premier Soccer League, 2007/08 and 2008/09

We saw earlier that the distribution of points in the 2008/09 English Premier Soccer League Table was not consistent with all teams being evenly matched — the spread of points was too high. We will now investigate this further.

If some teams are better than others, the positions of teams in the league in successive years will tend to be similar. The table below shows the points for the teams in two seasons. (Note that the bottom three teams are relegated each year and three teams are promoted from the lower league, so we cannot compare the positions of six of the teams.)

                     Points
  Team           2007/08   2008/09
  ManchesterU       87        90
  Chelsea           85        83
  Arsenal           83        72
  Liverpool         76        86
  Everton           65        63
  AstonVilla        60        62
  Blackburn         58        41
  Portsmouth        57        41
  ManchesterC       55        50
  WestHam           49        51
  Tottenham         46        51
  Newcastle         43        34
  Middlesbro        42        32
  Wigan             40        45
  Sunderland        39        36
  Bolton            37        41
  Fulham            36        53
  Reading           36         -
  Birmingham        35         -
  DerbyCounty       11         -
  StokeCity          -        45
  HullCity           -        35
  WestBromA          -        32
Manchester United, Chelsea, Arsenal and Liverpool were the top four teams in both years. However, ...

Excluding Manchester United, Chelsea, Arsenal and Liverpool, do there seem to be any differences in ability between the other teams?

Randomisation

If all other teams have equal probabilities of winning against any opponent, the 2008/09 points of 45 (which was actually obtained by Wigan) would have been equally likely to have been obtained by any of the teams in that year. Indeed, any allocation of the points (63, 62, 41, ..., 53) to the teams (Everton, Aston Villa, Blackburn, ..., Fulham) would be equally likely.

The diagram below performs this randomisation of the results in 2008/09.

Click Randomise to shuffle the 2008/09 points between the teams (excluding the top four teams and those that were only in the league for one of the seasons). If the teams were of equal ability, these points would have been as likely as the actual ones.

The correlation coefficient between the points in the two seasons gives an indication of how closely they are related. Click Accumulate and repeat the randomisation several more times. Observe that the correlation for the randomised values is only as far from zero as the actual correlation (r = 0.537) in about 5% of randomisations. Since a correlation as high as 0.537 is fairly unusual for equally-matched teams, ...

There is moderately strong evidence of a difference in skill between teams, even when the top four have been excluded.
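In code, the randomisation test for correlation looks like the Python sketch below, which uses the points of the 13 teams that appear (outside the top four) in both seasons of the table above; the 10,000 shuffles are an arbitrary choice.

    import random

    pts_0708 = [65, 60, 58, 57, 55, 49, 46, 43, 42, 40, 39, 37, 36]
    pts_0809 = [63, 62, 41, 41, 50, 51, 51, 34, 32, 45, 36, 41, 53]

    def correlation(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / (sxx * syy) ** 0.5

    observed = correlation(pts_0708, pts_0809)   # r = 0.537

    n_rand = 10_000
    count = 0
    for _ in range(n_rand):
        shuffled = random.sample(pts_0809, len(pts_0809))
        if abs(correlation(pts_0708, shuffled)) >= abs(observed):
            count += 1
    print(count / n_rand)   # around 0.05, as quoted in the text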


10.1.7   Common patterns in tests

A general framework

The examples in earlier pages of this section involved different types of data and different analyses. Indeed, you may find it difficult to spot their common theme!

All analyses were examples of hypothesis testing. We now describe the general framework of hypothesis testing within which all of these examples fit. This general framework is the basis for important applications in later sections of CAST.

The concepts in this page are extremely important — make sure that you understand them well before moving on.

Data, model and question

Data (and model)
Each example dealt with a data set that was assumed to arise from some random mechanism. We may be able to specify some aspects of this random mechanism (model), but it also has unknown characteristics.
Null hypothesis
All models had unknown characteristics, and we want to know whether the model has particular properties — the null hypothesis.
Alternative hypothesis
If the null hypothesis is not true, we say that the alternative hypothesis holds. (You can understand most of hypothesis testing without paying much attention to the alternative hypothesis however!)

Either the null hypothesis or the alternative hypothesis must be true.

Approach

We assess whether the null hypothesis is true by asking ...

Are the data consistent with the null hypothesis?

It is extremely important that you understand that hypothesis testing addresses this question — make sure that you remember it well!!

Answering the question

Test statistic
This is some function of the data that throws light on whether the null or alternative hypothesis holds.
P-value
Testing whether the data are consistent with the null hypothesis is based on the probability of obtaining a test statistic value as 'extreme' as the one recorded if the null hypothesis holds. This is called the p-value for the test.
Interpreting the p-value
Although it may be regarded as an over-simplification, the table below can be used as a guide to interpreting p-values.
p-value Interpretation
over 0.1 no evidence that the null hypothesis does not hold
between 0.05 and 0.1 very weak evidence that the null hypothesis does not hold
between 0.01 and 0.05 moderately strong evidence that the null hypothesis does not hold
under 0.01 strong evidence that the null hypothesis does not hold

Use the pop-up menu below to check how the earlier examples in this section fit into the hypothesis testing framework.

Soccer league in one season

Data (and model)
Some random mechanism underlies the actual results in the matches during a season. The probabilities of winning may vary from team to team and there may be a home-team advantage, so there are a lot of unknowns about this model! Our data are a single set of results — the league table at the end of the season.
Null hypothesis
The null hypothesis is that all teams are equally matched — i.e. that they all have the same probability of winning each match.
Alternative hypothesis
The alternative hypothesis is that all teams do not have the same probabilities of winning.
Test statistic
The standard deviation of final points is used. It will be low if the teams have the same abilities (null hypothesis) and higher otherwise (alternative hypothesis).
P-value
We simulated the soccer league, assuming that all teams had the same probability of winning. The p-value was the probability of getting a standard deviation of final points as high as 18.2 (the actual data).
Interpreting the p-value
The p-value was 0.000 (or close). Since there is virtually no chance of getting a standard deviation of points as high as that in the actual league from equally matched teams, we conclude that the teams are not equally matched — the null hypothesis is false.

10.2   Tests about proportions

  1. Inference about parameters
  2. P-value for testing proportion
  3. Another example
  4. One- and two-tailed tests
  5. Normal approximation
  6. Statistical distance
  7. Tests based on statistical distance

10.2.1   Inference about parameters

Inference and random samples

The examples in the previous section involved a range of different types of model for the observed data. In the remainder of this chapter, we concentrate on one particular type of model — random sampling from a population.

We assume now that the observed data are a random sample from some population.

When the observed data are a random sample, inference asks questions about characteristics of the underlying population distribution — unknown population parameters.

For random samples, the null and alternative hypotheses specify values for the unknown population parameters.

Inference about categorical populations

When the population distribution is categorical, the unknowns are the population probabilities for the different categories. To simplify, we consider populations for which one category is of particular interest ('success') and we denote the unknown probability of success by π.

The null and alternative hypotheses are therefore specified in terms of π.

Australia Post example

A journalist trying to assess Australia Post's assertion that 96 percent of letters arrive 'on time' posted 59 letters and observed that only 52 arrived on time.

We model delivery of these letters as a random sample of 59 categorical values from a population with probability π of success (arrival on time). The null hypothesis of interest is therefore...

H0:   π = 0.96

The alternative hypothesis is

HA:   π < 0.96

Design of CD case

A company intends to manufacture a case that will hold 20 CDs. The design team are particularly keen on a design that is more expensive to manufacture than two competing designs. The manager wants to be sure that customers will prefer the more expensive case before starting production — the price is determined by competitors' CD cases so the more expensive one will have a reduced profit margin and can only be justified if sales are considerably higher.

To assess whether customers prefer the more expensive case, a limited number of each of the three designs is manufactured and placed for sale at the same price in a CD store. Out of the first 90 cases sold, the more expensive case was bought 36 times.

This situation can be modelled as random sampling of 90 values (the three case designs) from a categorical population in which the probability of picking the expensive case is π. The null hypothesis of interest is therefore...

H0:   π = 1/3       (no preference)

The alternative hypothesis is

HA:   π > 1/3       (preference for the expensive case)

Tests about parameters of other populations

Other data sets arise as random samples from different kinds of population. For example, numerical data sets are often modelled as random samples from a normal distribution. Again, the hypotheses of interest are usually expressed in terms of the parameters of this distribution.

For example, to test whether the mean of a normal distribution is zero, the hypotheses would be...

H0:   µ = 0

HA:   µ ≠ 0

In the remainder of this section, we show how to test a population probability, and in the next section we will describe tests about a population mean.

10.2.2   P-value for testing proportion

Test statistic

When testing the value of a probability, π, the obvious statistic to use from our random sample is the corresponding sample proportion, p.

It is however more convenient to use the number of successes, x, rather than p since we know that X has a binomial distribution with parameters n (the sample size) and π.

When we know the distribution of the test statistic (at least after the null hypothesis has fixed the value of the parameters of interest), it becomes much easier to obtain the p-value for the test.

P-value

As in all other tests, the p-value is the probability of getting such an 'extreme' set of data if the null hypothesis is true. Depending on the null and alternative hypotheses, the p-value is therefore the probability that X is as big (or sometimes as small) as the recorded value.

Since we know the binomial distribution of X when the null hypothesis holds, the p-value can therefore be obtained by adding binomial probabilities.

The p-value is a sum of binomial probabilities

Note that the p-value can be obtained exactly, without the need for simulations or randomisation.

Australia Post example

A journalist trying to assess Australia Post's assertion that 96 percent of letters arrive 'on time' posted 59 letters and observed that only 52 arrived on time.

H0:   π = 0.96

HA:   π < 0.96

In the diagram below, click Accumulate then hold down Simulate until about 100 samples of 59 letters have been generated. The proportion of these simulated samples in which 52 or fewer letters arrived on time is an approximation to the p-value for the test.

Since we know that the number arriving on time has a binomial (n = 59, π = 0.96) distribution when the null hypothesis holds, the simulation is unnecessary. Select Binomial distribution from the pop-up menu. This binomial distribution is displayed, and the probability of 52 or fewer letters being delivered on time is shown to be 0.009 — the p-value for the test.

Since the p-value is so small, there would have been very little chance of the observed data arising if Australia Post's assertion had been correct. We can therefore conclude that there is strong evidence against their assertion. Note that this can be done without any simulations.
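The exact calculation is straightforward in code; a minimal Python sketch that sums the lower tail of the binomial distribution:

    from math import comb

    n, pi0, x = 59, 0.96, 52
    p_value = sum(comb(n, k) * pi0**k * (1 - pi0)**(n - k) for k in range(x + 1))
    print(round(p_value, 3))   # 0.009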

10.2.3   Another example

Another example

The following example shows again how the binomial distribution can be used to obtain the p-value for a test about a population probability.

Design of CD case

In the trial of three CD cases that was described at the start of this section, all three were offered for sale at the same price in a CD store. Out of the first 90 cases that were purchased, 36 were the design that was more expensive to manufacture than the other two. Since more than a third of the customers chose this case design, is there strong evidence that customers prefer it?

The null and alternative hypotheses are...

H0:  π = 1/3       (no preference)

HA:  π > 1/3       (preference for the expensive case)

The p-value is the probability of 36 or more expensive cases being purchased when π = 1/3. This can be obtained directly from a binomial distribution with π = 1/3 and n = 90.

Use the slider below to obtain the p-value for this test.

The p-value for the test is 0.1103, meaning that there is a probability of 0.1103 of the expensive case being purchased 36 or more times even if there is no real preference for it. We therefore conclude that there is no evidence of any preference from the data.

Interpretation of p-values

If the p-value for a test is very small, the data are 'inconsistent' with the null hypothesis. (The observed data may still be possible, but are at least extremely unlikely.)

From a very small p-value, we can conclude that the null hypothesis is probably wrong.

However a high p-value cannot allow us to conclude that the null hypothesis is correct — only that the observed data are consistent with it. For example, if exactly 30 expensive cases (a third) were purchased in the CD-case example above, it would be wrong to conclude that there was no preference for it. The data are also consistent with other values of π near 1/3, so we cannot conclude that π is not 0.32 or 0.34.

A hypothesis test can never conclude that the null hypothesis is correct.

The correct interpretation of p-values for the CD-case experiment would be...

p-value Interpretation Conclusion
p >  0.1 x is not unusually high. It would be as high in more than 10% of samples if π = 1/3. There is no evidence against π = 1/3.
0.05 < p < 0.1 We would find x as high in only 5% to 10% of samples if π = 1/3. There is only slight evidence against π = 1/3.
0.01 < p < 0.05 We would find x this high in only 1% to 5% of samples if π = 1/3. There is moderately strong evidence against π = 1/3.
p < 0.01 We would find x this high in under 1% of samples if π = 1/3. There is strong evidence against π = 1/3.

10.2.4   One- and two-tailed tests

Finding the p-value for a one-tailed test

The Australia Post hypothesis test involved a random sample of size n from a population with probability π of success (delivery on time). The data collected were x successes, and we tested the hypotheses

H0:   π = π0
HA:   π < π0

where π0 was the constant of interest (0.96 in this example). The following steps were followed to obtain the p-value for the test.

  1. The sample proportion of successes, p, was identified as the most informative summary statistic about π.
  2. The number of successes, x = np, has a binomial distribution with known parameters (n and π0) when H0 holds, so it is a better test statistic.
  3. The p-value is a sum of tail probabilities for this binomial distribution.

The diagram below illustrates these steps.

[diagram: binomial distribution of the test statistic, with the p-value shown as a tail probability]

The CD-case example was similar, but the alternative hypothesis involved high values of π and the p-value was found by adding upper tail probabilities.

Finding the p-value for a two-tailed test

The appropriate tail probability to use depends on the alternative hypothesis. If the alternative hypothesis allows either high or low values of x, the test is called a two-tailed test.

The p-value is then double the smaller tail probability since values of x in both tails of the binomial distribution would provide evidence for HA.

Ethics codes in companies

In 1999, The Conference Board surveyed 124 companies and found that 97 had their own ethics codes ("Business Bulletin", Wall Street Journal, Aug 19, 1999). In 1997, it was believed that 72% of companies had ethics codes, so is there any evidence that the proportion has changed?

This question is equivalent to asking whether a sample proportion of 97 out of 124 is consistent with sampling from a population with π = 0.72. This can be expressed as the hypotheses

H0:   π = 0.72
HA:   π ≠ 0.72

We would expect about (0.72 x 124) = 89 of the companies to have ethics codes. A sample count that is either much greater than 89 or much less than 89 would suggest that the probability had changed. Use the slider below to obtain the p-value.

The probability of getting as many as 97 is 0.0718. Since this is a two-tailed test, we must also take account of the probability of getting a count that is as unusually low, so the p-value is twice this, 0.1436. Getting 97 companies with ethics codes is therefore not unlikely, so we conclude that there is no evidence from these data of a change in the proportion of companies with ethics codes since 1997.
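A Python sketch of the two-tailed calculation, doubling the smaller tail probability as described above:

    from math import comb

    n, pi0, x = 124, 0.72, 97

    def binom_pmf(k):
        return comb(n, k) * pi0**k * (1 - pi0)**(n - k)

    upper = sum(binom_pmf(k) for k in range(x, n + 1))   # P(X >= 97) = 0.0718
    lower = sum(binom_pmf(k) for k in range(x + 1))      # P(X <= 97)
    print(2 * min(upper, lower))                         # p-value = 0.1436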

10.2.5   Normal approximation

Computational problem

To find the p-value for a hypothesis test about a proportion, tail probabilities for a binomial distribution must be summed.

If the sample size n is large, there may be a huge number of probabilities to add together; this is both tedious and prone to numerical error.

Home-based businesses owned by women

A recent study that was reported in the Wall Street Journal sampled 899 home-based businesses and found that 369 were owned by women.

Are home-based businesses less likely to be owned by females than by males? This question can be expressed as a hypothesis test. If the population proportion of home-based businesses owned by females is denoted by π, the hypotheses can be written as

H0:   π = 0.5
HA:   π < 0.5

If the null hypothesis is true, the sample number owned by females will have a binomial distribution with parameters n = 899 and π = 0.5. The p-value for the test is therefore the sum of binomial probabilities,

p-value  =  P(X ≤ 369)  =  P(X = 0) + P(X = 1) + ... + P(X = 369)

A lot of probabilities must be evaluated and summed! And all are close to zero.

Normal approximation

We saw earlier that the normal distribution may be used as an approximation to the binomial when n is large. Both the sample proportion of successes, p, and the number of successes, x = np, are approximately normal when n is large:

mean(x) = nπ,   sd(x) = root( nπ(1 - π) )
mean(p) = π,   sd(p) = root( π(1 - π) / n )

The best-fitting normal distribution can be used to obtain an approximation to any binomial tail probability. In particular, it can be used to find an approximate p-value for a hypothesis test.

Approximate p-value

A large random sample of size n is selected from a population with probability π of success and x successes are observed. We will again test the hypotheses

H0:   π = π0
HA:   π < π0

The normal approximation to the distribution of x can be used to find the tail probability: the p-value is approximately the probability of a value this low from a normal distribution with mean nπ0 and standard deviation root( nπ0(1 - π0) ).

Home-based businesses owned by women

In this example, the sample size, n = 899, is large, so we can use a normal approximation to obtain the probability of 369 or fewer businesses owned by females if the underlying population probability was 0.5 (the null hypothesis).

Click Accumulate then simulate sampling of 899 businesses about 300 times. (Hold down the button Simulate.) From the simulation, it is clear that the probability of obtaining 369 or fewer businesses owned by females is extremely small — there is strong evidence against the null hypothesis.


The same conclusion can be reached without a simulation.

Select Bar chart from the pop-up menu, then select Normal approximation. From the normal approximation, we can determine that the p-value for the test (the tail area below 369) is extremely close to zero.

Continuity correction (advanced)

The approximate p-value could be found by comparing the z-score for x,

z  =  (x - nπ0) / root( nπ0(1 - π0) )

with a standard normal distribution. Since x is discrete,

P(X ≤ 369)   =  P(X ≤ 369.5)   =   P(X ≤ 369.9)   =   ...

To find this tail probability, any value of x between 369 and 370 might have been used when evaluating the z-score. The p-value can be more accurately estimated by using 369.5. This is called a continuity correction.

The continuity correction involves either adding or subtracting 0.5 from the observed count, x, before finding the z-score.

Be careful about whether to add or subtract — the probability statement should be unchanged. For example, P(X ≥ 410) = P(X ≥ 409.5), so 0.5 should be subtracted from x = 410 as a continuity correction in order to find this probability using a normal approximation and z-score.

The continuity correction is most important when the observed count is near either 0 or n.
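A Python sketch of the normal approximation with the continuity correction, for the home-based business example (the helper phi is just the standard normal CDF):

    from math import erf, sqrt

    n, pi0, x = 899, 0.5, 369
    mean = n * pi0
    sd = sqrt(n * pi0 * (1 - pi0))

    def phi(z):                      # standard normal CDF
        return 0.5 * (1 + erf(z / sqrt(2)))

    z = (x + 0.5 - mean) / sd        # add 0.5 because we want P(X <= 369)
    print(phi(z))                    # approximate p-value, essentially zero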

10.2.6   Statistical distance

Difference between parameter and estimate

Many hypothesis tests are about a single parameter of the model, such as a population probability, π, or a population mean, µ.

It is natural to base a test about such a parameter on the corresponding sample statistic: the sample proportion, p, or the sample mean, xBar.

If the value of the sample statistic is close to the hypothesised value of the parameter, there is no reason to doubt the null hypothesis. However if they are far apart, the data are not consistent with the null hypothesis and we should conclude that the alternative hypothesis holds.

A large distance between the estimate and hypothesised value is evidence against the null hypothesis.

Statistical distance

How do we tell what is a large distance between, say, p and a hypothesised value for the population proportion, π0? The empirical rule says that we expect p to be within two standard errors of π0 (about 95% of the time). If we measure the distance in standard errors, we know that 2 (standard errors) is a large distance, 3 is a very large distance, and 1 is not much.

The number of standard errors is

z  =  (p - π0) / se(p)

In general, the statistical distance from an estimate to a hypothesised value of the underlying parameter is

z  =  (estimate - hypothesised value) / se(estimate)

If this comes to more than 2, or less than -2, it suggests that the hypothesised value is wrong: the estimate is not consistent with the hypothesised parameter value. If, on the other hand, z is close to zero, the data are giving a result reasonably close to what we expected based on the hypothesis.

10.2.7   Tests based on statistical distance

Test statistic and p-value

The statistical distance from an estimate to a hypothesised value of the underlying parameter is

z  =  (estimate - hypothesised value) / se(estimate)

This can be used as a test statistic. If the null hypothesis holds, it approximately has a standard normal distribution — a normal distribution with zero mean and unit standard deviation.

The p-value for a test can be determined from the tail areas of this standard normal distribution.

[diagram: standard normal distribution of z, with the two-tailed p-value shown as the shaded tail areas]

In the above diagram, the null hypothesis is consistent with estimates close to the hypothesised value, and the alternative hypothesis is suggested by estimates that are either much bigger or much smaller than this value (called a two-tailed test). For a two-tailed test, the p-value is the red tail area and can be looked up using normal tables or software such as Excel.

Refinements

If the standard error of the estimate must itself be estimated from the sample data, the above test statistic is only approximately normally distributed. In some tests that we will describe in later sections, the test statistic has a t distribution (which has slightly longer tails than the standard normal distribution). This refinement will be described fully in the next section.

Home-based businesses owned by women

The diagram below repeats the simulation that we used earlier to test whether the proportion of home-based businesses owned by women was less than 0.5:

The proportion owned by women in a sample of n = 899 businesses was 369/899 = 0.410.

Again click Accumulate and hold down the Simulate button until about 100 samples of 899 businesses have been generated with a population probability of being owned by women of 0.5.

Select Statistical distance from 0.5 from the top pop-up menu to translate the proportions of female owners in the simulated samples into z-scores. Observe that most of these 'statistical distances from 0.5' are between -1 and +1.

The observed proportion owned by females was 0.410, corresponding to a statistical distance of z = -5.37, an unlikely value if the population proportion was 0.5.

Select Normal distribution from the lower pop-up menu to show the theoretical distribution of the z-scores. The p-value for the test is the tail area of this normal(0, 1) distribution below -5.37 and is virtually zero, so we again conclude that:

It is almost certain that π is less than 0.5.
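The statistical distance and its p-value can be computed directly; a Python sketch for this example:

    from math import erf, sqrt

    n, pi0 = 899, 0.5
    p = 369 / n
    se = sqrt(pi0 * (1 - pi0) / n)           # se of p when H0 holds
    z = (p - pi0) / se                       # about -5.37
    p_value = 0.5 * (1 + erf(z / sqrt(2)))   # lower tail of normal(0, 1)
    print(z, p_value)                        # p-value is virtually zero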


Relation to previous test

The p-value obtained in this way using a 'statistical distance' as the test statistic is identical to the p-value that was found from a normal approximation to the number of successes without a continuity correction. (The p-value is slightly different if a continuity correction is used.)

The use of 'statistical distances' does not add anything when testing a sample proportion, but it is a general method that will be used to obtain test statistics in many other situations later in this e-book.

10.3   Tests about means

  1. Introduction
  2. Test for mean (known σ)
  3. P-value from statistical distance
  4. The t distribution
  5. The t test for a mean

10.3.1   Introduction

Tests about numerical populations

The most important characteristic of a numerical population is usually its mean, µ. Hypothesis tests therefore usually question the value of this parameter.

Blood pressure of executives

The medical director of a large company looks at the medical records of 72 male executives aged between 35 and 44 and observes that their mean blood pressure is xBar = 126.07. We model these 72 blood pressures as a random sample from an underlying population with mean µ (blood pressures of similar executives).

Published national health statistics report that in the general population for males aged 35-44, blood pressures have mean 128 and standard deviation 15. Do the executives conform to this population? Focusing on the mean of the blood pressure distribution, this can be expressed as the hypotheses,

H0:   µ = 128
HA:   µ ≠ 128

Filling milk containers

In a bottling plant, plastic containers are filled with a nominal 2 litres of milk. However the containers are filled so quickly that it is impossible to ensure that each contains exactly 2 litres. The volume of milk in a container is approximately normally distributed with standard deviation 0.005 litres, and the machinery is adjusted to give a mean volume of 2.012 litres. (Using the normal distribution, you can check that only 1% of containers should contain less than the nominal 2 litres of milk.)

At regular intervals, twelve containers are sampled and the volume of milk in each is measured accurately to assess whether the machinery needs adjustment. (Overfilling wastes milk, but underfilling is illegal.) One sample is shown below.

Volume of milk (litres)
2.024   2.015   2.022   2.025   2.008   2.024
2.021   2.018   2.020   2.023   2.005   2.016

Are the data consistent with the target mean volume of 2.012 litres? This can be expressed as a hypothesis test comparing...

H0:   µ = 2.012
HA:   µ ≠ 2.012

Null and alternative hypotheses

Both of the above examples involve tests of the hypotheses

H0:   µ = µ0
HA:   µ ≠ µ0

where µ0 is the constant that we think may be the true mean. These are called two-tailed tests. In other situations, the alternative hypothesis may involve only high (or low) values of µ (one-tailed tests), such as

H0:   µ = µ0
HA:   µ > µ0

10.3.2   Test for mean (known σ)

Model and hypotheses

In both examples in the first page of this section, there was knowledge of the population standard deviation σ (at least when H0 was true). This greatly simplifies the problem of finding a p-value for the test.

Blood pressure of executives
From published information, the national distribution of blood pressure in males aged 35-44 is known to have a standard deviation σ = 15.
Filling milk containers
The variation caused by the filling machinery is well understood and the standard deviation of the volume of milk is known to be σ = 0.005 litres.

In both examples, the hypotheses were of the form

H0:   µ = µ0
HA:   µ ≠ µ0

Summary Statistic

The first step in finding a p-value for the test is to identify a summary statistic that throws light on whether H0 or HA is true. When testing the population mean, µ, the obvious summary statistic is the sample mean, xBar, and the hypothesis tests that will be described here are based on this.

We saw earlier that the sample mean has a distribution with mean and standard deviation

mean(xBar) = mu

sd(xBar) = sigma/root(n)

Furthermore, the Central Limit Theorem states that the distribution of the sample mean is approximately normal, provided the sample size is not small. (The result holds even for small samples if the population distribution is also normal.)

P-value

The p-value for the test is the probability of getting a sample mean as 'extreme' as the one that was recorded when H0 is true. It can be found directly from the distribution of the sample mean.

Note that we can assume knowledge of both µ and σ in this calculation — the values of both are fixed by H0.

Since we know the distribution of the sample mean (when H0 is true), the p-value can be evaluated as the tail area of this distribution.

One-tailed test
If the alternative hypothesis HA specifies large values of µ, the p-value is the upper tail area (shown in green below). If HA is for small values of µ, the opposite tail of the distribution is used.

[diagram: normal distribution of xBar, with the upper tail area shaded]

Two-tailed test
If the alternative hypothesis HA allows for large or small values of µ, the p-value is the sum of the two tail areas below.

[diagram: normal distribution of xBar, with both tail areas shaded]

10.3.3   P-value from statistical distance

Statistical distance and test statistic

The p-value for testing a hypothesis about the mean, µ, when σ is known, is a tail area from the normal distribution of the sample mean and can be evaluated in the usual way using a z-score. This calculation can be expressed in terms of the statistical distance between the parameter and its estimate,

z  =  (estimate - hypothesised value) / se(estimate)

In the context of a test about means,

z  =  (xBar - µ0) / (sigma / root(n))

Since z has a standard normal distribution (zero mean and unit standard deviation) when the null hypothesis holds, it can be used as a test statistic.

P-value

The p-value for the test can be determined from the tail areas of the standard normal distribution.

[diagram: standard normal distribution of z, with tail areas giving the p-value]

For a two-tailed test, the p-value is the red tail area.

Quality control for cornflake packets

The diagram below repeats the simulation that we used earlier to test whether a sample mean weight of 10 cornflake packets of 529 gm is consistent with a packing machine that is set to give normally distributed weights with µ = 520 gm and σ = 10 gm.

Again click Accumulate and hold down the Simulate button until about 100 samples of 10 packets have been selected and weighed. The p-value is the probability of getting a sample mean further from 520 gm than 529 gm (either below 511 gm or above 529 gm), and the simulation provides an estimate of it. However a simulation is unnecessary since we can evaluate the p-value exactly.

Select Normal distribution from the pop-up menu on the bottom right to replace the simulation with the normal distribution of the mean,

xBar  ~  normal ( mean = 520,  sd = sigma/root(n) = 10/root(10) = 3.162 )

From its tail area, we can calculate (without a simulation) that the probability of getting a sample mean as far from 520 as 529 is exactly 0.0044. This is the exact p-value for the test.

P-value from statistical distance

Finally, consider the statistical distance of our estimate of µ, 529 gm, from the hypothesised value, 520 gm.

z  =  (529 - 520) / (10 / root(10))  =  2.85

Select Statistical distance from 520 from the middle pop-up menu to show how the p-value is found using this z-score.

Since the p-value is so small (0.0044), we conclude that there is strong evidence that the population mean, µ, is not 520.
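The whole calculation in a Python sketch:

    from math import erf, sqrt

    xbar, mu0, sigma, n = 529, 520, 10, 10
    z = (xbar - mu0) / (sigma / sqrt(n))   # 2.85
    phi = 0.5 * (1 + erf(z / sqrt(2)))     # standard normal CDF at z
    print(2 * (1 - phi))                   # two-tailed p-value, 0.0044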

Weights of courier packages

A courier company suspected that the weight of recently shipped packages had dropped. From past records, the mean weight of packages was 18.3 kg and their standard deviation was 7.1 kg. These figures were based on a very large number of packages and can be treated as exact.

Thirty packages were sampled from the previous week and their mean weight was found to be 16.8 kg. The data are displayed in the jittered dot plot below.

If the null hypothesis was true, the sample mean would have the normal distribution shown in pale blue. Although the sample mean weight is lower than 18.3 kg, it is not particularly unusual for this distribution, so we conclude that there is no evidence that the mean weight has reduced.

The right of the diagram shows how the p-value is calculated from a statistical distance (z-score).


Choose Modified Data from the pop-up menu. The slider allows you to investigate how low the sample mean must become in order to give strong evidence that µ is less than 18.3.

10.3.4   The t distribution

Unknown standard deviation

In the examples on the previous page, the population standard deviation, σ, was a known value. Unfortunately this is rarely the case in practice, so the previous test cannot be used.

Returns from Mutual Funds

Investing in the share market can be risky for small investors since the value of individual companies can fluctuate greatly, especially over short periods of time. These risks can be reduced by buying shares in a mutual fund that spreads the investment among a wide portfolio of companies.

Different mutual funds invest in companies of different types and with different inherent risks of losing and (hopefully) gaining value. Some funds have been categorised as 'high-risk' funds and a sample of 25 of these is shown in the table below. The percentage return paid by these funds over a 3-year period (April 1997 to March 2000) is also shown. (The stock market did particularly well over this period!)

The corresponding annualised return from Federal Constant Maturity Rate Bonds over this period was 5.64%. Did the high-risk funds do any better on average than this 'safe' investment?

High-risk mutual fund            Annualised 3-year return (1997-2000)
Alliance Quasar                    8.76%
Alliance Tech                     58.71%
Amer Cent Gl Gold                -22.82%
Berger Sm Co Gr                   49.02%
Blackrock Sm Cp Gr                43.97%
CGM Cap Devel                     13.91%
Dreyfus Aggressive Growth         -2.89%
Evergreen Aggressive growth A     39.64%
Federated Small cap Strat A       17.91%
Fidelity emerging markets        -10.55%
Fidelity Selects Comp             68.58%
Franklin Value A                  -0.33%
Goldman Sachs small cap val A      4.00%
Hotchkiss and Wiley Small Cap      0.14%
JP Morgan Sm Co                   23.87%
J Hancock Small cap Growth B      38.23%
Kemper Small cap equity A         26.60%
MFS Emerg Gr                      36.02%
Montgomery Small cap R            29.51%
Oakmark Sm Cap                     1.62%
O'Shaughnessy Crn Gr              28.91%
PBHG Emerging Growth              29.32%
Putnam OTC Emerg Gr               54.43%
State St. Res Emer Gr A           30.76%
USAA Aggressive Gr                49.67%

The hypotheses of interest are similar to those in the initial pages of this section,

H0:   µ = 5.64
HA:   µ > 5.64

However we no longer know the population standard deviation, σ. The only information we have about σ comes from our sample.

Test statistic and its distribution

When the population standard deviation, σ, was a known value, we used a test statistic

z=(xBar-mu)/(sigma/root(n))

which has a standard normal distribution when H0 is true.

When σ is unknown, we use a closely related test statistic,

t=(xBar-mu)/(s/root(n))

where s is the sample standard deviation. This test statistic has greater spread than the standard normal distribution, due to the extra variability that results from estimating σ with s, especially when the sample size n is small.

The diagram below generates random samples from a normal distribution. Click Take sample a few times to see the variability in the samples.

Click Accumulate then take about 50 random samples. Observe that the stacked dot plot of the t statistic conforms reasonably with a standard normal distribution.

Now use the pop-up menu to reduce the sample size to 5 and take a further 50-100 samples. You will probably notice that there are more 'extreme' t-values (less than -3 or more than +3) than would be expected from a standard normal distribution.

Reduce the sample size to 3 and repeat. It should now be clearer that the distribution of the t-statistic has greater spread than a standard normal distribution. Click on the crosses for the most extreme t-values and observe that they correspond to samples in which the 3 data values happen to be close together, resulting in a small sample standard deviation, s.
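A Python sketch of this simulation, counting how often |t| exceeds 3 for several sample sizes (10,000 samples per size is an arbitrary choice):

    import random
    from math import sqrt

    def t_statistic(n, mu=0, sigma=1):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = sqrt(sum((v - xbar) ** 2 for v in sample) / (n - 1))
        return (xbar - mu) / (s / sqrt(n))

    for n in (30, 5, 3):
        ts = [t_statistic(n) for _ in range(10_000)]
        print(n, sum(abs(t) > 3 for t in ts) / len(ts))
    # The proportion of |t| values beyond 3 grows as n shrinks.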

The t distribution

We have seen that the t statistic does not have a standard normal distribution, but it does have another standard distribution called a t distribution with (n - 1) degrees of freedom. On the next page, we will use this distribution to obtain the p-value for hypothesis tests.

The diagram below shows the shape of the t distribution for various different values of the degrees of freedom.

Drag the slider to see how the shape of the t distribution depends on the degrees of freedom. Note that


A standard normal distribution can be used as an approximation to a t distribution if the degrees of freedom are large (say 30 or more) but the t distribution must be used for smaller degrees of freedom.


10.3.5   The t test for a mean

Finding a p-value from the t distribution

The p-value for any test is the probability of getting such an 'extreme' test statistic when H0 is true. To test the value of a population mean, µ, when σ is unknown, the appropriate test statistic is

t=(xBar-mu)/(s/root(n))

Since this has a t distribution (with n - 1 degrees of freedom) when H0 is true, the p-value is found from a tail area of this distribution. The relevant tail depends on the alternative hypothesis. For example, if the alternative hypothesis is for low values of µ, the p-value is the low tail area of the t distribution since low values of xBar (and hence t) would support HA over H0.

H0:   µ = µ0
HA:   µ < µ0

The steps in performing the test are shown in the diagram below.

[diagram: steps in a one-tailed t test]

Computer software should be used to obtain the p-value from the t distribution.

Returns from Mutual Funds

The example on the previous page asked whether the average annualised return on high-risk mutual funds was higher than that from Federal Bonds (5.64%) over the period April 1997 to March 2000. The population standard deviation was unknown and the hypotheses of interest were,

H0:   µ = 5.64
HA:   µ > 5.64

The diagram below shows the calculations for obtaining the p-value for this test from the t distribution with (n - 1) = 24 degrees of freedom.

Since the probability of obtaining such a high mean return from 25 funds is 0.000 (to 3 decimal places) if the underlying population mean is 5.64, we conclude that there is extremely strong evidence that the mean return on high-risk funds was over 5.64 percent.
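The p-value can be verified with standard software. A sketch using scipy (the one-sided alternative argument needs scipy 1.6 or later), with the returns from the table above:

    from scipy import stats

    returns = [8.76, 58.71, -22.82, 49.02, 43.97, 13.91, -2.89, 39.64,
               17.91, -10.55, 68.58, -0.33, 4.00, 0.14, 23.87, 38.23,
               26.60, 36.02, 29.51, 1.62, 28.91, 29.32, 54.43, 30.76, 49.67]

    # H0: mu = 5.64 against HA: mu > 5.64
    result = stats.ttest_1samp(returns, popmean=5.64, alternative='greater')
    print(result.statistic, result.pvalue)   # p-value is 0.000 to 3 dp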


Select Modified Data from the pop-up menu and use the slider to investigate the relationship between the sample mean and the p-value for the test.

Two-tailed test

In some hypothesis tests, the alternative hypothesis allows both low and high values of µ.

H0:   µ = µ0
HA:   µ ≠ µ0

In this type of two-tailed test, the p-value is the sum of the two tail areas, as illustrated below.

[diagram: t test statistic and two-tailed p-value]

10.4   Decisions and significance

  1. Hypothesis tests and decisions
  2. Decision rules
  3. Significance level and p-values
  4. Sample size and power

10.4.1   Hypothesis tests and decisions

Strength of evidence against H0

We have explained how p-values describe the strength of evidence against the null hypothesis.

Saturated fat content of cooking oil

It has been claimed that the saturated fat content of soybean cooking oil is no more than 15%. A clinician believes that the saturated fat content is greater than 15% and randomly samples 13 bottles of soybean cooking oil for testing.

Percentage saturated fat in soybean cooking oil
15.2   12.4   15.4   13.5   15.9   17.1   16.9
14.3   19.1   18.2   15.5   16.3   20.0

The clinician is interested in the following hypotheses.

H0:   µ = 15
HA:   µ > 15

The p-value of 0.04 means that there is moderately strong evidence against H0 — i.e. moderately strong evidence that the mean saturated fat content is greater than 15%.
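The quoted p-value can be checked in the same way as the mutual-fund example earlier; again this assumes scipy 1.6 or later.

    from scipy import stats

    fat = [15.2, 12.4, 15.4, 13.5, 15.9, 17.1, 16.9,
           14.3, 19.1, 18.2, 15.5, 16.3, 20.0]

    # H0: mu = 15 against HA: mu > 15
    result = stats.ttest_1samp(fat, popmean=15, alternative='greater')
    print(round(result.pvalue, 2))   # 0.04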

Decisions from tests

We now take a different (but related) approach to hypothesis testing.

Many hypothesis tests are followed by some action that depends on whether we conclude from the test results that H0 or HA is true. This decision depends on the data.

Decision    Action
accept H0    some action (often the status quo)   
reject H0    a different action (often a change to a process)   

However the decision that is made could be wrong. There are two ways in which an error might be made — wrongly rejecting H0 when it is true (called a Type I error), and wrongly accepting H0 when it is false (called a Type II error). These are represented by the red cells in the table below:

                                       Decision
                                accept H0        reject H0
True state   H0 is true         correct          Type I error
of nature    HA (H0 is false)   Type II error    correct

A good decision rule about whether to accept or reject H0 (and perform the corresponding action) will have small probabilities for both kinds of error.

Saturated fat content of cooking oil

The clinician who tested the saturated fat content of soybean cooking oil was interested in the hypotheses.

H0:   µ = 15
HA:   µ > 15

If H0 is rejected, the clinician intends to report the high saturated fat content to the media. The two possible errors that could be made are described below.

                                       Decision
                                accept H0            reject H0
                                (do nothing)         (contact media)
Truth   H0: µ is really         correct              wrongly accuses
        15% (or less)                                manufacturers
        HA: µ is really         fails to detect      correct
        over 15%                high saturated fat

Ideally the decision should be made in a way that keeps both probabilities low.

10.4.2   Decision rules

Using a sample mean to make decisions

We now introduce the idea of decision rules with a test about whether a population mean is a particular value, µ0, or greater. We assume initially that the population is normally distributed and that its standard deviation, σ, is known.

H0:   µ = µ0
HA:   µ > µ0

The decision about whether to accept or reject H0 should depend on the value of the sample mean, xBar. Large values throw doubt on H0.

Data                    Decision
xBar < k                accept H0
xBar is k or higher     reject H0

We want to choose the value k to make the probability of errors low. This is however complicated because of the two different types of error.

[diagram: the four truth/decision combinations, with the probabilities of Type I and Type II errors shown as shaded tail areas of the distribution of xBar]

Increasing the value of k to make the Type I error probability small (top right) also increases the Type II error probability (bottom left) so the choice of k for the decision rule is a trade-off between the acceptable sizes of the two types of error.

Illustration

The diagram below relates to a normal population whose standard deviation is known to be σ = 4. We will test the hypotheses

H0:   µ = 10
HA:   µ > 10

The test is based on the sample mean of n = 16 values from this distribution. The sample mean has a normal distribution,

xBar  ~  normal ( mean = µ,  sd = sigma/root(n) = 4/root(16) = 1 )

This normal distribution can be used to calculate the probabilities of the two types of error. The diagram below illustrates how these probabilities depend on the critical value for the test, k.

Drag the slider at the top of the diagram to adjust k. Observe that making k large reduces the probability of a Type I error, but makes a Type II error more likely. It is impossible to simultaneously make both probabilities small with only n = 16 observations.


Note also that there is not a single value for the probability of a Type II error — the probability depends on how far above 10 the mean µ lies. Drag the slider on the row for the alternative hypothesis to observe that:

The probability of a Type II error is always high if µ is close to 10, but is lower if µ is far above 10.

This is as should be expected — the further above 10 the population mean, the more likely we are to detect that it is higher than 10 from the sample mean.
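Both error probabilities can be computed directly from the normal distribution of xBar. A Python sketch for this illustration, where the critical value k plays the role of the slider and the alternative means tried are arbitrary:

    from math import erf, sqrt

    def phi(z):                          # standard normal CDF
        return 0.5 * (1 + erf(z / sqrt(2)))

    sd_xbar = 4 / sqrt(16)               # sd of the mean: sigma/root(n) = 1
    k = 11.5                             # rule: reject H0 when xbar >= k

    print(1 - phi((k - 10) / sd_xbar))   # P(Type I error) when mu = 10
    for mu in (10.5, 11, 12, 13):        # some values of mu under HA
        print(mu, phi((k - mu) / sd_xbar))   # P(Type II error) falls as mu rises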

10.4.3   Significance level and p-values

Significance level

The decision rule affects the probabilities of Type I and Type II errors and there is always a trade-off between these two probabilities. Selecting a critical value to reduce one error probability will increase the other.

In practice, we usually concentrate on the probability of a Type I error. The decision rule is chosen to make the probability of a Type I error equal to a pre-chosen value, often 5% or 1%. This probability is called the significance level of the test.

If the significance level of the test is set to 5% and we decide to reject H0 then we say that H0 is rejected at the 5% significance level.

Reducing the significance level of the test increases the probability of a Type II error.

Illustration

The diagram below is identical to the one on the previous page.

With the top slider, adjust k to make the probability of a Type I error as close as possible to 5%. This is the decision rule for a test with significance level 5%.

From the normal distribution, the appropriate value of k for a test with 5% significance level is 11.64.

Drag the top slider to reduce the significance level to 1% and note that the critical value for the test increases to about k = 12.3.
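
These quantiles are easy to verify with software. A minimal sketch, again assuming Python with scipy:

    # Critical values for testing H0: mu = 10 vs HA: mu > 10 (sigma = 4, n = 16)
    from scipy.stats import norm

    se = 4 / 16 ** 0.5                       # sigma / sqrt(n) = 1.0
    print(norm.ppf(0.95, loc=10, scale=se))  # 11.645 -> k for a 5% level test
    print(norm.ppf(0.99, loc=10, scale=se))  # 12.326 -> k for a 1% level test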

P-values and decisions

The critical value for a hypothesis test about a population mean (known standard deviation) with any significance level (e.g. 5% or 1%) can be obtained from the quantiles of normal distributions. For other hypothesis tests, similar critical values can be found from quantiles of the distribution of the relevant test statistic.

For example, when testing the mean of a normal population when the population standard deviation is unknown, the test statistic is a t-value and its critical values are quantiles of a t distribution.

It would seem that different methodology is needed to find decision rules for different types of hypothesis test, but this is only partially true. Although some of the underlying theory depends on the type of test, the decision rule for any test can be based on its p-value. For example, for a test with significance level 5%, the decision rule is always:

Decision
p-value ≥ 0.05     accept H0
p-value < 0.05     reject H0

For a test with significance level 1%, the null hypothesis, H0, should be rejected if the p-value is less than 0.01.

If computer software provides the p-value for a hypothesis test, it is therefore easy to translate it into a decision to accept or reject the null hypothesis at the 5% or 1% significance level.
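
As a minimal sketch of this translation (the function decision below is our own illustration, not part of any statistical package):

    def decision(p_value, significance_level=0.05):
        """Decision rule based on a test's p-value."""
        return "reject H0" if p_value < significance_level else "accept H0"

    print(decision(0.032))         # reject H0 at the 5% significance level
    print(decision(0.032, 0.01))   # accept H0 at the 1% significance level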


Illustration

The following diagram again investigates decision rules for testing the hypotheses

H0: µ = 10,    HA: µ > 10

based on a sample of n = 16 values from a normal population with known standard deviation σ = 4.

In the diagram, the decision rule is based on the p-value for the test. Use the slider to adjust the critical p-value and observe that the significance level (probability of Type I error) is always equal to the p-value used in the decision rule. Adjust the critical p-value to 0.01.

Although the probability of a Type II error (the bottom row of the earlier table) varies depending on the type of test, the top row, the probability of a Type I error, is the same for all kinds of hypothesis test.


10.4.4   Sample size and power

Power of a test

A decision rule about whether to accept or reject H0 can result in one of two types of error. The probabilities of making these errors describe the risks involved in the decision.

Prob(Type I error)
This is the significance level of the test. The decision rule is usually defined to make the significance level 5% or 1%.
Prob(Type II error)
When the alternative hypothesis includes a range of possible parameter values (e.g. µ ≠ 0), this probability is not a single value but depends on the parameter.

Instead of the probability of a Type II error, it is common to use the power of the test, defined as one minus the probability of a Type II error,

The power of a test is the probability of correctly rejecting H0 when it is false.

When the alternative hypothesis includes a range of possible parameter values (e.g. µ ≠ 0), the power depends on the actual parameter value.

                              Decision
                      accept H0            reject H0
Truth  H0 is true     correct              Significance level = P(Type I error)
       HA (H0 false)  P(Type II error)     Power = 1 − P(Type II error)

Increasing the power of a test

It is clearly desirable to use a test whose power is as close to 1.0 as possible. There are three different ways to increase the power.

Increase the significance level
If the critical value for the test is adjusted, increasing the probability of a Type I error decreases the probability of a Type II error and therefore increases the power.
Use a different decision rule
For example, in a test about the mean of a normal population, a decision rule based on the sample median has lower power than a decision rule based on the sample mean.

In CAST, we only describe the most powerful type of decision rule for each test, so you will not be able to increase the power by changing the decision rule.

Increase the sample size
By increasing the amount of data on which we base our decision about whether to accept or reject H0, the probabilities of making errors can be reduced.

When the significance level is fixed, increasing the sample size is therefore usually the only way to improve the power.

Illustration

The following diagram again investigates decision rules for testing the hypotheses

H0: µ = 10,    HA: µ > 10

based on samples from a normal population with known standard deviation σ = 4. We will fix the significance level of the test at 5%.

The top half of the diagram shows the normal distribution of the sample mean for a sample of size n = 16. Use the slider to increase the sample size and observe that the distribution of the sample mean becomes narrower, the critical value k moves closer to 10, and the power of the test increases for any fixed mean in HA.
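
The same effect can be sketched numerically (Python with scipy; evaluating the power at µ = 12 is an arbitrary illustrative choice):

    # Power of the 5%-level test of H0: mu = 10 vs HA: mu > 10 (sigma = 4 known)
    from scipy.stats import norm

    def power(n, mu_alt=12, mu0=10, sigma=4, alpha=0.05):
        se = sigma / n ** 0.5
        k = norm.ppf(1 - alpha, loc=mu0, scale=se)   # critical value for x-bar
        return 1 - norm.cdf(k, loc=mu_alt, scale=se)

    for n in (16, 25, 50, 100):
        print(n, round(power(n), 3))
    # The power rises towards 1.0 as n increases, while the significance
    # level stays fixed at 5%.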


10.5   Properties of p-values

  1. Null and alternative hypotheses
  2. Consistency with null hypothesis
  3. Distribution of p-values
  4. Interpretation of a p-value
  5. P-values for other tests

10.5.1   Null and alternative hypotheses

Symmetric hypotheses

In some situations there is a kind of symmetry between the two competing hypotheses. The sample data provide information about which of the two hypotheses is true.

Shareholder meeting vote

Two candidates, Mike Smith and Sarah Brown, stand for election as chairperson of the board of directors of a large company. Just before the shareholders' meeting at which the election will be held, 56 randomly selected shareholders are asked about their voting intentions. If the proportion intending to vote for Mike Smith is denoted by π, the hypotheses of interest are

π > 0.5 (Mike Smith will win)    versus    π < 0.5 (Sarah Brown will win)

The diagram below illustrates how the poll results might weigh the evidence for each candidate winning.

Drag the slider to see how different sample numbers choosing Mike Smith affect the evidence. Unless either candidate receives (say) three quarters of the sample vote, we should admit that there is some doubt about who will win — the sample may not accurately reflect the population proportions.
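
To see why a modest sample majority is weak evidence, consider how much a poll of 56 shareholders could vary by chance. A sketch under a simple binomial model (Python with scipy; the 33-23 split is an invented example):

    # How likely are various sample majorities if the voters are evenly split?
    from scipy.stats import binom

    n = 56
    print(1 - binom.cdf(32, n, 0.5))   # P(33 or more for Smith): about 0.11
    print(1 - binom.cdf(41, n, 0.5))   # P(42 or more, three quarters): about 0.0002
    # A 33-23 split could easily arise by chance; a 42-14 split could not.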

Null and alternative hypotheses

In statistical hypothesis testing, the two hypotheses are not treated symmetrically in this way. We must distinguish between them in a much more fundamental way.

In statistical hypothesis testing, we do not ask which of the two competing hypotheses is true.

Instead, we ask whether the sample data are consistent with one particular hypothesis (the null hypothesis, denoted by H0). If the data are not consistent with the null hypothesis, then we can conclude that the competing hypothesis (the alternative hypothesis, denoted by HA) must be true.

This distinction between the hypotheses is important. Depending on the sample data, it may be possible to conclude that HA is true. However, regardless of the data, the strongest statement we can make in support of H0 is that the data are consistent with it.

We can never conclude that H0 is likely to be true.


Market share estimation through audits

The traditional retail store audit is a widely used marketing research tool among consumer packaged goods companies. The retail store audit involves periodic audits of a sample of retail stores to monitor inventory and purchases of a particular product. Another auditing procedure, the weekend selldown audit, has been proposed as a less expensive alternative.

The market shares of 10 brands of fruit juice were estimated using both of the store audit methods. Do the two methods result in the same estimates, on average? The data are paired, so we analyse the difference in estimates for each product (traditional minus weekend selldown) and test whether the underlying population mean of these values is zero.

H0: µ = 0,    HA: µ ≠ 0

The diagram below illustrates the evidence obtained from a set of sample data.

Drag the slider to see the conclusions that might be reached for data sets with different means. The further the sample mean is from zero (on either side), the stronger the evidence that µ is not zero. We can get very strong evidence that H0 does not hold if the sample mean is far from zero.

However, even x̄ = 0 does not provide strong evidence that µ = 0.

If x̄ = 0, µ could just as easily be 0.0001 or −0.0002 (values which correspond to HA). We cannot distinguish between these possibilities, so the best we can say is that the data are consistent with the null hypothesis — the data provide no evidence against µ being zero.

In the context of this example, the conclusion from a sample mean of zero would be that the experiment gave no evidence of any difference between the mean estimates from the two auditing methods. The mean estimates might be different, but the data did not detect the effect.
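
A paired test like this is easy to run in software. The sketch below assumes Python with scipy; the ten differences are invented for illustration, since the actual market-share estimates are not listed here.

    # One-sample t-test of H0: mu = 0 vs HA: mu != 0 applied to paired differences
    from scipy.stats import ttest_1samp

    # hypothetical differences (traditional minus weekend selldown), one per brand
    differences = [0.3, -1.2, 0.8, 0.4, -0.6, 1.1, -0.2, 0.5, -0.9, 0.7]
    result = ttest_1samp(differences, popmean=0)
    print(result.statistic, result.pvalue)
    # A large p-value means the data are consistent with mu = 0; it does
    # not show that mu = 0 is true.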

The distinction between the null and alternative hypotheses is so important that we repeat it below.

We never try to 'prove' that H0 holds, though we may be able to 'prove' that HA holds.

10.5.2   Consistency with null hypothesis

Describing the credibility of the null hypothesis

On the previous page, a diagram with scales illustrated how the evidence against H0 was 'weighed' for different data sets. A p-value is a numerical description of this evidence and gives a scale to that diagram.

A p-value is a numerical summary statistic that describes the evidence against H0.


Market share estimation through audits

In the example on the previous page, two different auditing methods were used to estimate market share. Is there any difference between the mean estimate of market share for the two methods?

The diagram below weighs the evidence using the p-value from a t-test of whether the mean difference, µ, is zero.


The p-value is an index of credibility for the null hypothesis, µ = 0.


P-values have the same interpretation for all hypothesis tests.

10.5.3   Distribution of p-values

Interpretation of p-values

Many different types of hypothesis test are commonly used in advanced statistics, but all share common features.

A p-value is a statistic that is evaluated from a random sample, so it has a distribution in the same way that a sample mean has a distribution. This distribution also has features that are common to all hypothesis tests. Understanding the distribution of p-values is the key to understanding how they are interpreted.

Distribution of p-values

In any hypothesis test,

  1. if the null hypothesis holds, the p-value has a rectangular distribution between 0 and 1, and
  2. if the alternative hypothesis holds, p-values are more likely to be near 0 than near 1.

The diagram below shows typical distributions that might be obtained.

To illustrate these properties, we use a test for whether a population mean is zero.

H0: µ = 0,    HA: µ ≠ 0

In the diagram below, you will take random samples from a normal population for which H0 is true and, separately, from populations for which HA is true.

When H0 holds

Initially the population mean is zero, so H0 holds. A single sample from this population is shown on the left and the p-value for testing whether the population mean is zero is shown as a cross on the jittered dot plot on the bottom right.

Click the button Take sample a few times to take other samples from this population and add their p-values to the display on the bottom right. After taking 50 or more samples, you should observe that the p-values are spread evenly between 0 and 1. This supports our assertion that the p-values have a rectangular distribution between 0 and 1 when H0 holds.

When HA holds

Now use the slider to change the true population mean to 2.0. We are still testing whether the mean is zero, so HA now holds. Take 40 or 50 samples and observe that the p-values are usually closer to 0 than to 1.

Click on some of the larger p-values on the jittered dot plot to display the samples that gave rise to them. The sample means vary and, by chance, some samples have means that are near 0.0, even when the population mean is 2.0; these samples result in larger p-values.

Repeat this exercise with different population means (try at least 1.0, 2.0, 3.0 and −2.0). The further the population mean is from 0.0, the value specified by H0, the more tightly the p-values cluster around 0.0.

Although it is possible to obtain a low p-value when H0 holds and a high p-value when HA holds, low p-values are more likely under HA than under H0.
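
The same simulation can be sketched without the interactive diagram (Python with numpy and scipy; the sample size n = 20 and σ = 4 are arbitrary choices):

    # Distribution of p-values from t-tests of H0: mu = 0, under H0 and under HA
    import numpy as np
    from scipy.stats import ttest_1samp

    rng = np.random.default_rng(1)

    def simulate_pvalues(mu, n=20, reps=1000):
        samples = rng.normal(loc=mu, scale=4, size=(reps, n))
        return ttest_1samp(samples, popmean=0, axis=1).pvalue

    bins = dict(bins=10, range=(0, 1))
    print(np.histogram(simulate_pvalues(0), **bins)[0])   # roughly flat counts
    print(np.histogram(simulate_pvalues(2), **bins)[0])   # counts piled near 0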

10.5.4   Interpretation of a p-value

P-values and probability

We saw on the previous page that p-values have a rectangular distribution between 0 and 1 when H0 holds. A consequence of this is that the probability of obtaining a p-value of 0.1 or lower is exactly 0.1 (when H0 holds). This is illustrated on the left of the diagram below.

Similarly, the probability of obtaining a p-value of 0.01 or lower is exactly 0.01, etc. (when H0 holds).

P-values are most likely to be near 0 if the alternative hypothesis holds

Again, we use the specific hypothesis test for

H0: µ = 0,    HA: µ ≠ 0

in order to demonstrate these general results.

Click the button Take sample 50 or more times to take samples from a population for which H0 holds and add their p-values to the display on the right. From the diagram on the top right, we can read off the proportion of p-values that are less than any value: approximately 50% of p-values are less than 0.5, 20% are less than 0.2, etc. when the null hypothesis is true.

Use the slider to change the true population mean to 1.5 and repeat. From the diagram on the top right, you should observe that more than 50% of p-values are less than 0.5, more than 20% are less than 0.2, etc. when the alternative hypothesis holds.
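
These proportions can be checked with the same kind of simulation as before (again Python with numpy and scipy; n = 20 and σ = 4 are arbitrary):

    # Proportion of p-values below various cut-offs, under H0 and under HA
    import numpy as np
    from scipy.stats import ttest_1samp

    rng = np.random.default_rng(2)

    def simulate_pvalues(mu, n=20, reps=2000):
        samples = rng.normal(loc=mu, scale=4, size=(reps, n))
        return ttest_1samp(samples, popmean=0, axis=1).pvalue

    p_null, p_alt = simulate_pvalues(0), simulate_pvalues(1.5)
    for cut in (0.5, 0.2, 0.05):
        print(cut, (p_null < cut).mean(), (p_alt < cut).mean())
    # Under H0 each proportion is close to the cut-off itself; under HA
    # (mu = 1.5) every proportion is larger.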

Interpretation of p-value

Remembering that low p-values favour HA more than H0, we can give the following interpretation to a p-value.

If a data set gives rise to a p-value of say 0.0023, we can state that the probability of getting a data set with such a low p-value is only 0.0023 if H0 is true. Since such a low p-value is so unlikely, the data give strong evidence that H0 does not hold.

Of course, we may be wrong. A p-value of 0.0023 could arise when either H0 or HA holds. However it is unlikely when H0 is true and more likely when HA is true.

Similarly, a p-value as low as 0.4 occurs with probability 0.4 when the null hypothesis holds. Since this is fairly high, we conclude from a data set that gives rise to a p-value of 0.4 that there is no evidence that the null hypothesis does not hold.

Although it may be regarded as an over-simplification, the table below may be used as a guide to interpreting p-values.

p-value                 Interpretation
over 0.1                no evidence that the null hypothesis does not hold
between 0.05 and 0.1    very weak evidence that the null hypothesis does not hold
between 0.01 and 0.05   moderately strong evidence that the null hypothesis does not hold
under 0.01              strong evidence that the null hypothesis does not hold
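
The guide translates directly into a small helper function (a sketch; the wording and function name are ours):

    def interpret(p_value):
        """Verbal interpretation of a p-value, following the table above."""
        if p_value > 0.1:
            return "no evidence against H0"
        elif p_value > 0.05:
            return "very weak evidence against H0"
        elif p_value > 0.01:
            return "moderately strong evidence against H0"
        return "strong evidence against H0"

    print(interpret(0.0023))   # strong evidence against H0
    print(interpret(0.4))      # no evidence against H0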

10.5.5   P-values for other tests

Applying the general properties of p-values to different tests

The properties of p-values (and hence their interpretation) have been demonstrated in the context of a hypothesis test about whether a population mean was zero.

P-values for all hypothesis tests share these properties. As a result, we can interpret any p-value if we know the null and alternative hypotheses that it tests, even if we do not know the formula that underlies it. (In practice, a statistical computer program is generally used to perform hypothesis tests, so knowledge of the formula is of little importance.)

In particular, for any test where the null hypothesis restricts a parameter to a single value, the p-value has a rectangular distribution between 0 and 1 when H0 holds and tends to be closer to 0 when HA holds, so the same guidelines can be used for its interpretation.

p-value                 Interpretation
over 0.1                no evidence that the null hypothesis does not hold
between 0.05 and 0.1    very weak evidence that the null hypothesis does not hold
between 0.01 and 0.05   moderately strong evidence that the null hypothesis does not hold
under 0.01              strong evidence that the null hypothesis does not hold

Another type of test

The normal distribution is often used as a hypothetical population from which a set of data are assumed to be sampled. But are the data consistent with an underlying normal population, or does the population distribution have a different shape?

One popular test for assessing whether a random sample comes from a normal population is the Shapiro-Wilk W test. The theory behind the test is advanced and the formula for its p-value cannot readily be evaluated by hand, but most statistical programs will perform the test.

A random sample of 40 values from a normal population is displayed in a jittered dot plot on the left of the diagram. The p-value for the Shapiro-Wilk W test is shown under the dot plot and also graphically on the right.

Click Take sample a few times to take more samples and build the distribution of the p-values for the test. You should observe that the p-values have a rectangular distribution between 0 and 1 when the null hypothesis is true (i.e. if the samples are from a normal distribution).

Drag the slider on the top left of the diagram to change the shape of the population distribution. Repeat the exercise above and observe that when the null hypothesis does not hold, the p-values tend to be closer to 0.

Click on crosses on the display of p-values in the bottom right to display the sample that produced that p-value. P-values near zero usually correspond to samples that have very long tails to one or both sides, or have very short tails to one or both sides.
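
The exercise can also be reproduced non-interactively (Python with numpy and scipy; the exponential population is our stand-in for a skewed alternative):

    # Shapiro-Wilk p-values for samples of 40 from normal and skewed populations
    import numpy as np
    from scipy.stats import shapiro

    rng = np.random.default_rng(3)
    p_normal = [shapiro(rng.normal(size=40))[1] for _ in range(500)]
    p_skewed = [shapiro(rng.exponential(size=40))[1] for _ in range(500)]

    print(np.mean(np.array(p_normal) < 0.05))   # near 0.05: rectangular distribution
    print(np.mean(np.array(p_skewed) < 0.05))   # much larger: p-values pile up near 0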

Returns from Mutual Funds

As a numerical example, the table below gives the annual returns (%) for a sample of 137 mutual funds in 1999 (a period of rapid growth in the US economy).

 41.9   90.6   29.9   10.2   33.7   26.9   88.5    6.5   16.6   19.2
 12.6   32.0    3.6    8.1   68.1   57.9   -3.0   42.2   14.5   25.7
 28.1   78.4  126.2   42.0   66.6   20.6   54.6   31.7    2.3   45.5
 55.5   37.2   51.6   97.1   80.3   41.1    7.3   31.0   30.2    1.7
 27.0   38.0  144.9   27.8  121.9   26.0  -11.5   15.5   16.9   27.3
 23.9   61.1   68.2   10.0   37.8   77.1   24.3   63.2   -0.6    1.0
 12.1  134.5   53.8   60.4    9.0   -6.4   31.0   -2.8  114.6   19.8
 11.5   39.6   59.0   20.7   37.3   23.1   32.7   13.0   70.6   87.3
 -3.2  -20.8  119.1   -0.1  104.4   -4.6   72.5    7.7   31.4   36.9
 47.2   74.7   29.1   70.5   77.7   81.0  191.8    1.6   -0.8   59.4
 -2.2  -12.5   81.6   44.0   63.6  114.3   33.6   83.0   70.8   50.1
 55.8   28.3   -7.9   51.3   37.7   48.3   88.9   59.4  126.9   35.0
 51.0   91.1   -2.7   79.2    0.1   12.9   16.2   23.0   22.4   64.4
 10.2    7.6   27.7    8.0   23.5   25.3   22.5

A histogram of the data is shown below.

The best-fitting normal distribution (with mean and standard deviation equal to those of the data) has been superimposed on the histogram. There is a suggestion of skewness in the distribution of returns. Are the data really skewed, or might this amount of skewness arise in random samples from a normal distribution?

Applying the Shapiro-Wilk W test to the data using the statistical program Minitab gives a p-value of "under 0.01", so there is strong evidence that the distribution is not normal. Even after deleting the 'outlier' — the First American Technology fund, with a return of 191.8% — there is still strong evidence of skewness in the distribution of returns.
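
For readers with Python rather than Minitab, scipy provides the same test as scipy.stats.shapiro(); applied to the 137 returns above, it should likewise report a very small p-value.

    # Shapiro-Wilk W test applied to the 1999 mutual fund returns (%)
    import numpy as np
    from scipy.stats import shapiro

    returns = np.array([
         41.9,  90.6,  29.9, 10.2,  33.7,  26.9,  88.5,  6.5,  16.6, 19.2,
         12.6,  32.0,   3.6,  8.1,  68.1,  57.9,  -3.0, 42.2,  14.5, 25.7,
         28.1,  78.4, 126.2, 42.0,  66.6,  20.6,  54.6, 31.7,   2.3, 45.5,
         55.5,  37.2,  51.6, 97.1,  80.3,  41.1,   7.3, 31.0,  30.2,  1.7,
         27.0,  38.0, 144.9, 27.8, 121.9,  26.0, -11.5, 15.5,  16.9, 27.3,
         23.9,  61.1,  68.2, 10.0,  37.8,  77.1,  24.3, 63.2,  -0.6,  1.0,
         12.1, 134.5,  53.8, 60.4,   9.0,  -6.4,  31.0, -2.8, 114.6, 19.8,
         11.5,  39.6,  59.0, 20.7,  37.3,  23.1,  32.7, 13.0,  70.6, 87.3,
         -3.2, -20.8, 119.1, -0.1, 104.4,  -4.6,  72.5,  7.7,  31.4, 36.9,
         47.2,  74.7,  29.1, 70.5,  77.7,  81.0, 191.8,  1.6,  -0.8, 59.4,
         -2.2, -12.5,  81.6, 44.0,  63.6, 114.3,  33.6, 83.0,  70.8, 50.1,
         55.8,  28.3,  -7.9, 51.3,  37.7,  48.3,  88.9, 59.4, 126.9, 35.0,
         51.0,  91.1,  -2.7, 79.2,   0.1,  12.9,  16.2, 23.0,  22.4, 64.4,
         10.2,   7.6,  27.7,  8.0,  23.5,  25.3,  22.5,
    ])

    print(shapiro(returns))                      # p-value well under 0.01
    print(shapiro(returns[returns != 191.8]))    # still small with the outlier removed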

You should be able to interpret p-values that computer software provides for a wide variety of hypothesis tests using the properties that we have described in this section.