
Chapter 10   Testing Hypotheses

10.1   Introduction to hypothesis tests

  1. Inference
  2. Soccer league simulation
  3. Simulation to test a proportion
  4. Test for a mean
  5. Randomisation tests
  6. Randomisation test for correlation
  7. Common patterns in tests

10.1.1   Inference

Statistical inference

The term statistical inference describes statistical techniques that obtain information about a population parameter (or parameters) based on a single random sample from that population. There are two different but related types of question about the population parameter (or parameters) that we might ask:

What parameter values would be consistent with the sample data?

This branch of inference is called estimation and its main tool is a confidence interval. We described confidence intervals in the previous chapter.

A manufacturer of muesli bars needs to describe the average fat content of the bars (the mean of the hypothetical population of fat contents that would be produced using the recipe). Several bars are analysed and their fat contents are measured.

The sample mean is a point estimate of the population mean, and a 95% confidence interval can also be found.

Are the sample data consistent with some statement about the parameters?

This branch of inference is called hypothesis testing and is the focus of this chapter.

A particular brand of muesli bar is claimed by the manufacturer to have a fat content of 3.4g per bar. A consumer group suspects that the manufacturer is understating the fat content, so a random sample of bars is analysed.

The consumer group must assess whether the data are consistent with the statement (hypothesis) that the underlying population mean is 3.4g.

Errors and strength of evidence

When we studied parameter estimation, we saw that a population parameter cannot be determined exactly from a single random sample — there is a 5% chance that a 95% confidence interval will not include the true population parameter.

In a similar way, a single random sample can rarely provide enough information about a population parameter to allow us to be sure whether or not any hypothesis about that parameter will be true. The best we can hope for is an indication of the strength of the evidence against the hypothesis.

The remainder of this chapter explains how this evidence is obtained and reported.

10.1.2   Soccer league simulation

Randomness in sports results

Although we like to think that the 'best' team wins in sports competitions, there is actually considerable variability in the results. Much of this variability can be considered to be random — if the same teams play again, the results are often different. The most obvious examples of this randomness occur when a series of matches is played between the same two teams.

Since the teams are virtually unchanged in any series, the variability in results can only be explained through randomness.

Randomness or skill?

When we look at sports results, can we tell whether all teams are equally matched with the same probability of winning? Or do some teams have a higher probability of winning than others?

There are different ways to examine this question, depending on the type of data that is available. The following example assesses an end-of-year league table.

English Premier Soccer League, 2008/09

In the English Premier Soccer league, each of the 20 teams plays every other team twice (home and away) during the season. Three points are awarded for a win and one point for a draw. The table below shows the wins, draws, losses and total points for all teams at the end of the 2008/09 season.

 
     Team                Wins   Draws   Losses   Points
 1.  Manchester_U         28      6       4        90
 2.  Liverpool            25     11       2        86
 3.  Chelsea              25      8       5        83
 4.  Arsenal              20     12       6        72
 5.  Everton              17     12       9        63
 6.  Aston_Villa          17     11      10        62
 7.  Fulham               14     11      13        53
 8.  Tottenham            14      9      15        51
 9.  West_Ham             14      9      15        51
10.  Manchester_C         15      5      18        50
11.  Wigan                12      9      17        45
12.  Stoke_City           12      9      17        45
13.  Bolton               11      8      19        41
14.  Portsmouth           10     11      17        41
15.  Blackburn            10     11      17        41
16.  Sunderland            9      9      20        36
17.  Hull_City             8     11      19        35
18.  Newcastle             7     13      18        34
19.  Middlesbrough         7     11      20        32
20.  West_Brom_Albion      8      8      22        32

We observed in an earlier simulation that there is considerable variability in the points, even when all teams are evenly matched. However, ...

If some teams are more likely to win their matches than others, the spread of final points is likely to be greater — the top and bottom teams are likely to be more extreme.

A simulation

To assess whether there is any difference in skill levels, we can therefore run a simulation of the league, assuming evenly matched teams and generating random results with probabilities 0.372, 0.372 and 0.255 for wins, losses and draws. (A proportion 0.255 of games in the actual league resulted in draws.)

Click Simulate to simulate the 380 games in a season. The standard deviation of the final points is shown below the table. Click Accumulate then run the simulation about 100 times. (Hold down the Simulate button to speed up the process.)

The standard deviation of the points in the actual league table was 18.2. Since most simulated standard deviations are between 5 and 12, we conclude that such a high spread would be extremely unlikely if the teams were evenly matched.

There is strong evidence that the top teams are 'better' than the bottom teams.
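For readers who want to reproduce this off-line, the following Python sketch simulates seasons under the evenly-matched model and accumulates the standard deviation of final points. It assumes numpy is available, and it rescales the quoted win/loss/draw probabilities (which sum to 0.999 after rounding) to sum exactly to one.

    import itertools
    import numpy as np

    def simulate_season(rng):
        """One season of 380 games between 20 evenly matched teams."""
        points = np.zeros(20)
        probs = np.array([0.372, 0.372, 0.255])
        probs = probs / probs.sum()              # quoted values round to 0.999
        for home, away in itertools.permutations(range(20), 2):
            outcome = rng.choice(3, p=probs)     # 0 = home win, 1 = away win, 2 = draw
            if outcome == 0:
                points[home] += 3
            elif outcome == 1:
                points[away] += 3
            else:                                # draw: one point each
                points[home] += 1
                points[away] += 1
        return points

    rng = np.random.default_rng(1)
    sds = [np.std(simulate_season(rng), ddof=1) for _ in range(100)]
    print(min(sds), max(sds))   # most values fall well below the observed 18.2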


10.1.3   Simulation to test a proportion

Other uses of simulation

Simulations can help us to answer questions about a variety of other models (or populations). The following example shows another simple simulation.

Is security at LA International Airport as good as elsewhere?

In 1987, the Federal Aviation Administration (FAA) investigated security at Los Angeles International Airport (LAX). In one test, it was found that only 72 out of 100 mock weapons that FAA inspectors tried to carry onto planes were detected by security guards (Gainesville Sun, Dec 11, 1987).

Is the FAA justified in claiming that this "detection rate was well below the national rate of 0.80"?

A simulation

If the detection rate at LAX was the same as elsewhere, and every weapon independently has probability 0.80 of being detected, we know that the number detected out of 100 weapons will be a random quantity.

How unlikely is it to get as few as 72 out of 100 weapons detected if the probability of detection at LAX is 0.80 — the same as elsewhere?

A simulation helps to answer this question.

Click Simulate to randomly 'try to get 100 weapons onto planes', with each independently having probability 0.80 of detection. Click Accumulate then run the simulation between 100 and 200 times. (Hold down the Simulate button to speed up the process.)

Observe the distribution of the number of weapons detected. The proportion of simulations with 72 or fewer weapons being detected is shown to the right of the dot plot. Observe that this rarely happens.

We therefore conclude that the FAA's claim that LAX has a poorer detection rate than elsewhere is justified — only 72 weapons being detected would be unlikely if the detection rate was really 0.80.
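A minimal Python sketch of this simulation (numpy assumed) generates many sets of 100 detection attempts and finds the proportion with 72 or fewer detections:

    import numpy as np

    rng = np.random.default_rng(1)
    detected = rng.binomial(n=100, p=0.80, size=10_000)   # 10,000 simulated tests
    print(np.mean(detected <= 72))   # approximate p-value, around 0.03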

We will return to this example later.

10.1.4   Test for a mean

Assessing a claim about a mean

In this example, we ask whether a sample mean is consistent with the underlying population mean having a target value.

Quality control for cornflake packets

In a factory producing packets of cornflakes, the weight of cornflakes that a filling machine places in each packet varies from packet to packet. From extensive previous monitoring of the operation of the machine, it is known that the net weight of '500 gm' packets is approximately normal with standard deviation σ = 10 gm.

The mean net weight of cornflakes in the packets is controlled by a single knob. The target is for a mean of µ = 520 gm to ensure that few packets will contain less than 500 gm. Samples are regularly taken to assess whether the machine needs to be adjusted. A sample of 10 packets was weighed and contained an average of 529 gm. Does this indicate that the underlying mean has drifted from µ = 520 and that the machine needs to be adjusted?

A simulation

If the filling machine is working to specifications, each packet would contain a weight that is sampled from a normal distribution with µ = 520 and σ = 10.

How unlikely is it for the mean of a sample of size 10 to be as far from 520 as 529 if the machine is working correctly?

A simulation helps to answer this question.

Click Simulate to randomly generate the weights of 10 packets from a normal (µ = 520, σ = 10) distribution. Click Accumulate then run the simulation between 100 and 200 times. (Hold down the Simulate button to speed up the process.)

Observe that although many of the individual cornflake packets weigh more than 529 gm, it is rare for the mean weight to be as far from the target as 529 gm (i.e. either ≥529 gm or ≤511 gm).

There is therefore strong evidence that the machine is no longer filling packets with a mean weight of 520 gm and needs adjusting — a sample mean of 529 gm would be unlikely if the machine was filling packets to specifications.
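The same logic can be sketched in Python (numpy assumed): generate many samples of 10 packet weights from the hypothesised distribution and count how often the sample mean lands at least as far from 520 gm as 529 gm did.

    import numpy as np

    rng = np.random.default_rng(1)
    means = rng.normal(loc=520, scale=10, size=(10_000, 10)).mean(axis=1)
    # two-sided: sample means at least 9 gm from the target of 520 gm
    print(np.mean(np.abs(means - 520) >= 9))   # roughly 0.004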

We will return to this example later.

10.1.5   Randomisation tests

Simulation and randomisation

Simulation and randomisation are closely related techniques. Both are based on assumptions about the model underlying the data and involve randomly generated data sets.

Simulation
New data sets are generated directly from the model.
Randomisation
Modifications to the actual data are identified that would have the same probability of arising if the model held. New data sets are randomly picked from these.

Randomisation is understood most easily through an example.

Comparing two groups

If random samples are taken from two populations, we are often interested in whether the populations have the same means.

If the two populations were identical, any allocation of the sample values to the two groups would have been as likely as the observed sample data. By observing the distribution of the difference in means from such randomised allocations of values to groups, we can get an idea of whether the actual difference in sample means is unusually large.

An example helps to explain this method.

Characteristics of failed companies

A study in Greece compared characteristics of 68 healthy companies with those of another 33 that had recently failed. The jittered dot plots on the left below show the ratio of current assets to current liabilities for each of the 101 companies.

The mean asset-to-liabilities ratio for the sample of failed companies is 0.902 lower than that for the healthy companies, but the distributions overlap. Might this difference be simply a result of randomness, or can we conclude that there is a difference in the underlying populations?

Click Randomise to randomly pick 33 of the 101 values for the failed group. If the underlying distribution of asset-to-liabilities ratios was the same for healthy and failed companies, each such randomised allocation would be as likely as the observed data.

Click Accumulate and repeat the randomisation several more times. Observe that the difference in means would rarely be as far from zero as -0.902 when we assume the same distribution for both groups. This strongly suggests that the distributions must be different.

Since the actual difference is so unusually large, ...

We can conclude that there is strong evidence that the mean asset-to-liability ratio is lower for failed companies than healthy ones.
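A randomisation test of this kind is easy to code. The sketch below (numpy assumed; healthy and failed stand for the two arrays of asset-to-liabilities ratios, which are not listed in the text) estimates how often a random re-allocation of values to groups gives a difference in means as extreme as the observed one:

    import numpy as np

    def randomisation_test(healthy, failed, n_rand=10_000, seed=1):
        """Approximate p-value for the observed difference in group means
        under the hypothesis that both groups share one distribution."""
        rng = np.random.default_rng(seed)
        pooled = np.concatenate([healthy, failed])
        observed = failed.mean() - healthy.mean()
        count = 0
        for _ in range(n_rand):
            rng.shuffle(pooled)
            diff = pooled[:len(failed)].mean() - pooled[len(failed):].mean()
            if abs(diff) >= abs(observed):       # as far from zero as observed
                count += 1
        return count / n_rand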


10.1.6   Randomisation test for correlation

On this page, another example of randomisation is described, assessing whether teams in a soccer league are evenly matched.

English Premier Soccer League, 2007/08 and 2008/09

We saw earlier that the distribution of points in the 2008/09 English Premier Soccer League Table was not consistent with all teams being evenly matched — the spread of points was too high. We will now investigate this further.

If some teams are better than others, the positions of teams in the league in successive years will tend to be similar. The table below shows the points for the teams in two seasons. (Note that the bottom three teams are relegated each year and three teams are promoted from the lower league, so we cannot compare the positions of six of the teams.)

                   Points
 Team          2007/08   2008/09
 ManchesterU      87        90
 Chelsea          85        83
 Arsenal          83        72
 Liverpool        76        86
 Everton          65        63
 AstonVilla       60        62
 Blackburn        58        41
 Portsmouth       57        41
 ManchesterC      55        50
 WestHam          49        51
 Tottenham        46        51
 Newcastle        43        34
 Middlesbro       42        32
 Wigan            40        45
 Sunderland       39        36
 Bolton           37        41
 Fulham           36        53
 Reading          36         -
 Birmingham       35         -
 DerbyCounty      11         -
 StokeCity         -        45
 HullCity          -        35
 WestBromA         -        32

Manchester United, Chelsea, Arsenal and Liverpool were the top four teams in both years. However, ...

Excluding Manchester United, Chelsea, Arsenal and Liverpool, do there seem to be any differences in ability between the other teams?

Randomisation

If all other teams have equal probabilities of winning against any opponent, the 2008/09 points of 45 (which was actually obtained by Wigan) would have been equally likely to have been obtained by any of the teams in that year. Indeed, any allocation of the points (63, 62, 41, ..., 53) to the teams (Everton, Aston Villa, Blackburn, ..., Fulham) would be equally likely.

The diagram below performs this randomisation of the results in 2008/09.

Click Randomise to shuffle the 2008/09 points between the teams (excluding the top four teams and those that were only in the league for one of the seasons). If the teams were of equal ability, these points would have been as likely as the actual ones.

The correlation coefficient between the points in the two seasons gives an indication of how closely they are related. Click Accumulate and repeat the randomisation several more times. Observe that in only about 5% of randomisations is the correlation as far from zero as the actual correlation (r = 0.537). Since a correlation as high as 0.537 is fairly unusual for equally-matched teams, ...

There is moderately strong evidence of a difference in skill between teams, even when the top four have been excluded.
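The correlation version of the randomisation test follows the same pattern: shuffle one season's points among the teams and record how often the correlation is as far from zero as the actual r = 0.537. A sketch (numpy assumed; x and y stand for the two points columns for the 13 comparable teams):

    import numpy as np

    def correlation_randomisation(x, y, n_rand=10_000, seed=1):
        """Shuffle y relative to x; return the proportion of shuffles whose
        correlation is at least as far from zero as the observed one."""
        rng = np.random.default_rng(seed)
        observed = np.corrcoef(x, y)[0, 1]
        y = np.array(y, dtype=float)             # work on a copy
        count = 0
        for _ in range(n_rand):
            rng.shuffle(y)
            if abs(np.corrcoef(x, y)[0, 1]) >= abs(observed):
                count += 1
        return count / n_rand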


10.1.7   Common patterns in tests

A general framework

The examples in earlier pages of this section involved different types of data and different analyses. Indeed, you may find it difficult to spot their common theme!

All analyses were examples of hypothesis testing. We now describe the general framework of hypothesis testing within which all of these examples fit. This general framework is the basis for important applications in later sections of CAST.

The concepts in this page are extremely important — make sure that you understand them well before moving on.

Data, model and question

Data (and model)
Each example dealt with a data set that was assumed to arise from some random mechanism. We may be able to specify some aspects of this random mechanism (model), but it also has unknown characteristics.
Null hypothesis
All models had unknown characteristics, and we want to know whether the model has particular properties — the null hypothesis.
Alternative hypothesis
If the null hypothesis is not true, we say that the alternative hypothesis holds. (You can, however, understand most of hypothesis testing without paying much attention to the alternative hypothesis!)

Either the null hypothesis or the alternative hypothesis must be true.

Approach

We assess whether the null hypothesis is true by asking ...

Are the data consistent with the null hypothesis?

It is extremely important that you understand that hypothesis testing addresses this question — make sure that you remember it well!!

Answering the question

Test statistic
This is some function of the data that throws light on whether the null or alternative hypothesis holds.
P-value
Testing whether the data are consistent with the null hypothesis is based on the probability of obtaining a test statistic value as 'extreme' as the one recorded if the null hypothesis holds. This is called the p-value for the test.
Interpreting the p-value
Although it may be regarded as an over-simplification, the table below can be used as a guide to interpreting p-values.
p-value                  Interpretation
over 0.1                 no evidence that the null hypothesis does not hold
between 0.05 and 0.1     very weak evidence that the null hypothesis does not hold
between 0.01 and 0.05    moderately strong evidence that the null hypothesis does not hold
under 0.01               strong evidence that the null hypothesis does not hold

Use the pop-up menu below to check how the earlier examples in this section fit into the hypothesis testing framework.

Soccer league in one season

Data (and model)
Some random mechanism underlies the actual results in the matches during a season. The probabilities of winning may vary from team to team and there may be a home-team advantage, so there are a lot of unknowns about this model! Our data are a single set of results — the league table at the end of the season.
Null hypothesis
The null hypothesis is that all teams are equally matched — i.e. that they all have the same probability of winning each match.
Alternative hypothesis
The alternative hypothesis is that the teams do not all have the same probability of winning.
Test statistic
The standard deviation of final points is used. It will be low if the teams have the same abilities (null hypothesis) and higher otherwise (alternative hypothesis).
P-value
We simulated the soccer league, assuming that all teams had the same probability of winning. The p-value was the probability of getting a standard deviation of final points as high as 18.2 (the value from the actual data).
Interpreting the p-value
The p-value was 0.000 (or close). Since there is virtually no chance of getting a standard deviation of points as high as that in the actual league from equally matched teams, we conclude that the teams are not equally matched — the null hypothesis is false.

10.2   Tests about proportions

  1. Inference about parameters
  2. P-value for testing proportion
  3. Another example
  4. One- and two-tailed tests
  5. Normal approximation
  6. Statistical distance
  7. Tests based on statistical distance

10.2.1   Inference about parameters

Inference and random samples

The examples in the previous section involved a range of different types of model for the observed data. In the remainder of this chapter, we concentrate on one particular type of model — random sampling from a population.

We assume now that the observed data are a random sample from some population.

When the observed data are a random sample, inference asks questions about characteristics of the underlying population distribution — unknown population parameters.

For random samples, the null and alternative hypotheses specify values for the unknown population parameters.

Inference about categorical populations

When the population distribution is categorical, the unknowns are the population probabilities for the different categories. To simplify, we consider populations for which one category is of particular interest ('success') and we denote the unknown probability of success by π.

The null and alternative hypotheses are therefore specified in terms of π.

Weapon detection at LAX

FAA agents tried to carry 100 weapons onto planes at LA International Airport. Of these, 72 were detected by security guards, and we are interested in whether this is consistent with the national probability of detection, 0.80.

We model detection of weapons as a random sample of 100 categorical values from a population with probability π of success (detection). The null hypothesis of interest is therefore...

H0:   π = 0.80

The alternative hypothesis is

HA:   π < 0.80

Telepathy experiment

An experiment is conducted to investigate whether one subject can telepathically pass shape information to another subject. A deck of cards containing equal numbers of cards with circles, squares and crosses is shuffled. One subject selects cards at random and attempts to 'send' the shape on the card to the other subject who is seated behind a screen; this second subject reports the shape imagined for the card. From 90 cards, the second subject correctly identifies 36.

This situation can be modelled as random sampling of 90 values (correct or wrong) from a categorical population in which the probability of correctly identifying the card is π. The null hypothesis of interest is therefore...

H0:   π = 1/3       (guessing)

The alternative hypothesis is

HA:   π > 1/3       (telepathy)

Tests about parameters of other populations

Other data sets arise as random samples from different kinds of population. For example, numerical data sets are often modelled as random samples from a normal distribution. Again, the hypotheses of interest are usually expressed in terms of the parameters of this distribution.

For example, to test whether the mean of a normal distribution is zero, the hypotheses would be...

H0:   µ = 0

HA:   µ ≠ 0

In the remainder of this section, we show how to test a population probability, and in the next section we will describe tests about a population mean.

10.2.2   P-value for testing proportion

Test statistic

When testing the value of a probability, π, the obvious statistic to use from our random sample is the corresponding sample proportion, p.

It is however more convenient to use the number of successes, x, rather than p since we know that X has a binomial distribution with parameters n (the sample size) and π.

When we know the distribution of the test statistic (at least after the null hypothesis has fixed the value of the parameters of interest), it becomes much easier to obtain the p-value for the test.

P-value

As in all other tests, the p-value is the probability of getting such an 'extreme' set of data if the null hypothesis is true. Depending on the null and alternative hypotheses, the p-value is therefore the probability that X is as big (or sometimes as small) as the recorded value.

Since we know the binomial distribution of X when the null hypothesis holds, the p-value can therefore be obtained by adding binomial probabilities.

The p-value is a sum of binomial probabilities

Note that the p-value can be obtained exactly without the need for simulations or randomisation.

Weapon detection at LAX

FAA agents tried to carry 100 weapons onto planes at LA International Airport, and 72 of these were detected by security guards. Is this consistent with the national probability of detection, 0.80?

H0:   π = 0.80

HA:   π < 0.80

In the diagram below, click Accumulate then hold down Simulate until about 100 samples of 100 values have been generated. The proportion of these simulated samples in which 72 or fewer weapons are detected is an approximation to the p-value for the test.

Since we know that the number detected has a binomial (100, 0.80) distribution when the null hypothesis holds, the simulation is unnecessary. Select Binomial distribution from the pop-up menu. This binomial distribution is displayed, and the probability of 72 or fewer detected weapons is shown to be 0.0342 — the p-value for the test.

Since the p-value is so small, there would have been very little chance of the observed data arising if LAX had probability 0.80 of detection. We can therefore conclude that there is strong evidence that the probability of detection is lower than this. Note that this can be done without any simulations.
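Since the p-value is an exact binomial tail probability, it can also be computed directly; for example, with scipy (assumed available):

    from scipy import stats

    # P(X <= 72) when X ~ binomial(n = 100, pi = 0.80)
    p_value = stats.binom.cdf(72, 100, 0.80)
    print(p_value)   # 0.0342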

10.2.3   Another example

Another example

The following example shows again how the binomial distribution can be used to obtain the p-value for a test about a population probability.

Telepathy experiment

In the telepathy experiment that was described at the start of this section, one subject selects cards with a random shape (circle, square or cross) and attempts to 'send' this shape to another subject who is seated behind a screen; this second subject reports the shape imagined for the card.

Out of 90 cards, 36 are correctly guessed. Since more than a third are correct, does this provide strong evidence that information is being telepathically transmitted?

The null and alternative hypotheses are...

H0:   π = 1/3       (guessing)

HA:   π > 1/3       (telepathy)

The p-value is the probability of getting 36 or more cards correct when π = 1/3. This can be obtained directly from a binomial distribution with π = 1/3 and n = 90.

Use the slider below to obtain the p-value for this test.

The p-value for the test is 0.1103, meaning that there is a probability of 0.1103 of getting 36 or more correct cards if there is no telepathy. We therefore conclude that there is no evidence of telepathy from the data.

Interpretation of p-values

If the p-value for a test is very small, the data are 'inconsistent' with the null hypothesis. (The observed data may still be possible, but are at least extremely unlikely.)

From a very small p-value, we can conclude that the null hypothesis is probably wrong.

However a high p-value cannot allow us to conclude that the null hypothesis is correct — only that the observed data are consistent with it. For example, if exactly 30 cards (a third) were correctly picked in the telepathy example above, it would be wrong to conclude that there was no telepathy. The data are also consistent with other values of π near 1/3, so we cannot conclude that π is not 0.32 or 0.34.

A hypothesis test can never conclude that the null hypothesis is correct.

The correct interpretation of p-values for the telepathy test would be...

p-value Interpretation Conclusion
p >  0.1 x is not unusually high. It would be as high in more than 10% of samples if π = 1/3. There is no evidence against π = 1/3.
0.05 < p < 0.1 We would find x as high in only 5% to 10% of samples if π = 1/3. There is only slight evidence against π = 1/3.
0.01 < p < 0.05 We would find x this high in only 1% to 5% of samples if π = 1/3. There is moderately strong evidence against π = 1/3.
p < 0.01 We would find x this high in under 1% of samples if π = 1/3. There is strong evidence against π = 1/3.

10.2.4   One- and two-tailed tests

Finding the p-value for a one-tailed test

The LAX weapon-detection hypothesis test involved a random sample of size n from a population with probability π of success (detection of a weapon). The data collected were x successes, and we tested the hypotheses

H0:   π = π0

HA:   π < π0

where π0 was the constant of interest (0.80 in this example). The following steps were followed to obtain the p-value for the test.

  1. The sample proportion of successes, p, was identified as the most informative summary statistic about π.
  2. The number of successes, x = np, has a binomial distribution with no unknown parameters when H0 holds, so it is a better test statistic.
  3. The p-value is a sum of tail probabilities for this binomial distribution.

The diagram below illustrates these steps.

(diagram: the binomial test statistic and its lower-tail p-value)

The telepathy example was similar, but the alternative hypothesis involved high values of π and the p-value was found by counting upper tail probabilities.

Finding the p-value for a two-tailed test

The appropriate tail probability to use depends on the alternative hypothesis. If the alternative hypothesis allows either high or low values of x, the test is called a two-tailed test.

The p-value is then double the smaller tail probability since values of x in both tails of the binomial distribution would provide evidence for HA.

Somali blood groups

In a study of sab bondsmen, a population sub-group in Northern Somalia, blood tests were conducted on a sample of 54, in order to investigate whether they differed genetically from the main population of 'free-born noble Somali'.

It is known that a proportion 0.574 of free-born noble Somali have blood group O. (Actually 574 had blood group O in a sample of 1000, but this sample size was large enough to provide a reasonably accurate estimate.) Is there any evidence that the sample proportion with blood group O in the sab bondsmen, 26 out of 54, does not come from a population with π  = 0.574? This can be expressed as the hypotheses

H0:   π = 0.574

HA:   π ≠ 0.574

We would expect (0.574 × 54) = 31 of the sab bondsmen to have blood group O. A sample count that is either much greater than 31 or much less than 31 would suggest a genetic difference between the sab bondsmen and the free-born noble Somali. Use the slider below to obtain the p-value.

The probability of getting as few as 26 is 0.1084. Since this is a 2-tailed test, we must also take account of the probability of getting a count that is as unusually high, so the p-value is twice this, 0.2169. Getting 26 sab bondsmen with blood group O is therefore not unlikely, so we conclude that there is no evidence from these data of a genetic difference between sab bondsmen and the free-born Somali.
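A quick check of this calculation with scipy (assumed available):

    from scipy import stats

    lower = stats.binom.cdf(26, 54, 0.574)   # P(X <= 26), about 0.108
    p_value = 2 * lower                      # two-tailed: double the smaller tail
    print(p_value)                           # about 0.217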

10.2.5   Normal approximation

Computational problem

To find the p-value for a hypothesis test about a proportion, tail probabilities for a binomial distribution must be summed.

If the sample size n is large, there may be a huge number of probabilities to add together and this is both tedious and may result in numerical errors.

Home-based businesses owned by women

A recent study that was reported in the Wall Street Journal sampled 899 home-based businesses and found that 369 were owned by women.

Are home-based businesses less likely to be owned by females than by males? This question can be expressed as a hypothesis test. If the population proportion of home-based businesses owned by females is denoted by π, the hypotheses can be written as...

H0:   π = 0.5

HA:   π < 0.5

If the null hypothesis is true, the sample number owned by females will have a binomial distribution with parameters n = 899 and π = 0.5. The p-value for the test is therefore the sum of binomial probabilities,

p-value  =  P(X ≤ 369)  =  P(X = 0) + P(X = 1) + ... + P(X = 369)

A lot of probabilities must be evaluated and summed! And all are close to zero.

Normal approximation

We saw earlier that the normal distribution may be used as an approximation to the binomial when n is large. Both the sample proportion of successes, p, and the number of successes, x = np, are approximately normal when n is large.

x  ≈  normal(mean = nπ, sd = √(nπ(1−π)))        p  ≈  normal(mean = π, sd = √(π(1−π)/n))

The best-fitting normal distribution can be used to obtain an approximation to any binomial tail probability. In particular, it can be used to find an approximate p-value for a hypothesis test.

Approximate p-value

A large random sample of size n is selected from a population with probability π of success and x successes are observed. We will again test the hypotheses

H0:   π = π0

HA:   π < π0

The normal approximation to the distribution of x can be used to find the tail probability,

p-value  =  P(X ≤ x),  evaluated from the normal(nπ0, √(nπ0(1−π0))) approximation

Home-based businesses owned by women

In this example, the sample size, n = 899, is large, so we can use a normal approximation to obtain the probability of 369 or fewer businesses owned by females if the underlying population probability was 0.5 (the null hypothesis).

Click Accumulate then simulate sampling of 899 businesses about 300 times. (Hold down the button Simulate.) From the simulation, it is clear that the probability of obtaining 369 or fewer businesses owned by females is extremely small — there is strong evidence against the null hypothesis.


The same conclusion can be reached without a simulation.

Select Bar chart from the pop-up menu, then select Normal approximation. From the normal approximation, we can determine that the p-value for the test (the tail area below 369) is extremely close to zero.

Continuity correction (advanced)

The approximate p-value could be found by comparing the z-score for x,

z  =  (x − nπ0) / √(nπ0(1−π0))

with a standard normal distribution. Since x is discrete,

P(X ≤ 369)   =  P(X ≤ 369.5)   =   P(X ≤ 369.9)   =   ...

To find this tail probability, any value of x between 369 and 370 might have been used when evaluating the z-score. The p-value can be more accurately estimated by using 369.5. This is called a continuity correction.

The continuity correction involves either adding or subtracting 0.5 from the observed count, x, before finding the z-score.

Be careful about whether to add or subtract — the probability statement should be unchanged. For example, P(X ≥ 410) = P(X ≥ 409.5), so 0.5 should be subtracted from x = 410 as a continuity correction in order to find this probability using a normal approximation and z-score.
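As an illustration, the continuity-corrected z-score and approximate p-value for the home-based business example can be computed as follows (scipy assumed):

    import math
    from scipy import stats

    n, pi0, x = 899, 0.5, 369
    se = math.sqrt(n * pi0 * (1 - pi0))      # sd of X under H0
    z = (x + 0.5 - n * pi0) / se             # +0.5 since P(X <= 369) = P(X <= 369.5)
    p_value = stats.norm.cdf(z)              # lower-tail p-value
    print(z, p_value)                        # z is about -5.3; p-value essentially zero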

The continuity correction is most important when the observed count is near either 0 or n.

10.2.6   Statistical distance

Difference between parameter and estimate

Many hypothesis tests are about a single parameter of the model, such as a population probability, π, or a population mean, µ.

It is natural to base a test about such a parameter on the corresponding sample statistic: the sample proportion, p, or the sample mean, x̄.

If the value of the sample statistic is close to the hypothesised value of the parameter, there is no reason to doubt the null hypothesis. However if they are far apart, the data are not consistent with the null hypothesis and we should conclude that the alternative hypothesis holds.

A large distance between the estimate and hypothesised value is evidence against the null hypothesis.

Statistical distance

How do we tell what is a large distance between, say, p and a hypothesised value for the population proportion, π0? The empirical rule says that we expect p to be within two standard errors of π0 (about 95% of the time). If we measure the distance in standard errors, we know that 2 (standard errors) is a large distance, 3 is a very large distance, and 1 is not much.

The number of standard errors is

z  =  (p − π0) / se(p)

In general, the statistical distance of an estimate to a hypothesised value of the underlying parameter is

z  =  (estimate − hypothesised value) / se(estimate)

If this comes to more than 2, or less than -2, it suggests that the hypothesised value is wrong: the estimate is not consistent with the hypothesised parameter value. If, on the other hand, z is close to zero, the data are giving a result reasonably close to what we would expect based on the hypothesis.
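In code, the statistical distance is a one-liner; the sketch below applies it to a sample proportion, using the numbers from the home-based business example later in this section (the helper name is ours, not part of any library):

    import math

    def statistical_distance(estimate, hypothesised, se):
        """Number of standard errors between estimate and hypothesised value."""
        return (estimate - hypothesised) / se

    n, pi0 = 899, 0.5
    se = math.sqrt(pi0 * (1 - pi0) / n)             # standard error of p under H0
    print(statistical_distance(369 / n, pi0, se))   # about -5.37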

10.2.7   Tests based on statistical distance

Test statistic and p-value

The statistical distance of an estimate to a hypothesised value of the underlying parameter is

z  =  (estimate − hypothesised value) / se(estimate)

This can be used as a test statistic. If the null hypothesis holds, it approximately has a standard normal distribution — a normal distribution with zero mean and unit standard deviation.

The p-value for a test can be determined from the tail areas of this standard normal distribution.

(diagram: standard normal distribution of the test statistic, with the two tail areas that give the p-value)

In the above diagram, the null hypothesis is consistent with estimates close to the hypothesised value, and the alternative hypothesis is suggested by estimates that are either much bigger or much smaller than this value (a two-tailed test). For a two-tailed test, the p-value is the red tail area and can be looked up using normal tables or Excel.

Refinements

If the standard error of the estimate must itself be estimated from the sample data, the above test statistic is only approximately normally distributed. In some tests that we will describe in later sections, the test statistic has a t distribution (which has slightly longer tails than the standard normal distribution). This refinement will be described fully in the next section.

Home-based businesses owned by women

The diagram below repeats the simulation that we used earlier to test whether the proportion of home-based businesses owned by women was less than 0.5:

The proportion owned by women in a sample of n = 899 businesses was 369/899 = 0.410.

Again click Accumulate and hold down the Simulate button until about 100 samples of 899 businesses have been generated with a population probability of being owned by women of 0.5.

Select Statistical distance from 0.5 from the top pop-up menu to translate the proportions of female owners in the simulated samples into z-scores. Observe that most of these 'statistical distances from 0.5' are between -1 and +1.

The observed proportion owned by females was 0.410, corresponding to a statistical distance of z = -5.37, an unlikely value if the population proportion was 0.5.

Select Normal distribution from the lower pop-up menu to show the theoretical distribution of the z-scores. The p-value for the test is the tail area of this normal(0, 1) distribution below -5.37 and is virtually zero, so we again conclude that:

It is almost certain that π is less than 0.5.


Relation to previous test

The p-value obtained in this way using a 'statistical distance' as the test statistic is identical to the p-value that was found from a normal approximation to the number of successes without a continuity correction. (The p-value is slightly different if a continuity correction is used.)

The use of 'statistical distances' does not add anything when testing a sample proportion, but it is a general method that will be used to obtain test statistics in many other situations later in this e-book.

10.3   Tests about means

  1. Introduction
  2. Test for mean (known σ)
  3. P-value from statistical distance
  4. The t distribution
  5. The t test for a mean

10.3.1   Introduction

Tests about numerical populations

The most important characteristic of a numerical population is usually its mean, µ. Hypothesis tests therefore usually question the value of this parameter.

Blood pressure of executives

The medical director of a large company looks at the medical records of 72 male executives aged between 35 and 44 and observes that their mean blood pressure is x̄ = 126.07. We model these 72 blood pressures as a random sample from an underlying population with mean µ (the blood pressures of similar executives).

Published national health statistics report that in the general population for males aged 35-44, blood pressures have mean 128 and standard deviation 15. Do the executives conform to this population? Focusing on the mean of the blood pressure distribution, this can be expressed as the hypotheses,

H0:   µ = 128

HA:   µ ≠ 128

Active ingredient in medicine

Pharmaceutical companies routinely test their products to ensure that the amount of active ingredient is within tight limits. However, the chemical analysis is not precise and repeated measurements of the same specimen differ slightly. One type of analysis has errors that are normally distributed with mean 0 and standard deviation 0.0068 grams per litre.

A product is tested three times with the following concentrations of the active ingredient:

0.8403, 0.8363 and 0.8447 grams per litre

Are the data consistent with the target concentration of 0.86 grams per litre? This can be expressed as a hypothesis test comparing...

H0:   µ = 0.86

HA:   µ ≠ 0.86

Null and alternative hypotheses

Both of the above examples involve tests of hypotheses

H0:   µ = µ0

HA:   µ ≠ µ0

where µ0 is the constant that we think may be the true mean. These are called two-tailed tests. In other situations, the alternative hypothesis may involve only high (or low) values of µ (one-tailed tests), such as

H0:   µ = µ0

HA:   µ > µ0

10.3.2   Test for mean (known σ)

Model and hypotheses

In both examples in the first page of this section, there was knowledge of the population standard deviation σ (at least when H0 was true). This greatly simplifies the problem of finding a p-value for the test.

Blood pressure of executives
From published information, the national distribution of blood pressure in males aged 35-44 is known to have a standard deviation σ = 15.
Active ingredient in medicine
The testing procedure is widely used and the errors are known to have a distribution with σ = 0.0068 grams per litre.

In both examples, the hypotheses were of the form,

H0:   µ = µ0

HA:   µ ≠ µ0

Summary Statistic

The first step in finding a p-value for the test is to identify a summary statistic that throws light on whether H0 or HA is true. When testing the population mean, µ, the obvious summary statistic is the sample mean, x̄, and the hypothesis tests that will be described here are based on it.

We saw earlier that the sample mean has a distribution with mean and standard deviation

mean(x̄)  =  µ

sd(x̄)  =  σ/√n

Furthermore, the Central Limit Theorem states that the distribution of the sample mean is approximately normal, provided the sample size is not small. (The result holds even for small samples if the population distribution is also normal.)

P-value

The p-value for the test is the probability of getting a sample mean as 'extreme' as the one that was recorded when H0 is true. It can be found directly from the distribution of the sample mean.

Note that we can assume knowledge of both µ and σ in this calculation — the values of both are fixed by H0.

Since we know the distribution of the sample mean (when H0 is true), the p-value can be evaluated as the tail area of this distribution.

One-tailed test
If the alternative hypothesis HA specifies large values of µ, the p-value is the upper tail area (shown in green below). If HA is for small values of µ, the opposite tail of the distribution is used.

(diagram: normal distribution of x̄, with the upper tail area shaded green)

Two-tailed test
If the alternative hypothesis HA allows for large or small values of µ, the p-value is the sum of the two tail areas below.

(diagram: normal distribution of x̄, with both tail areas shaded)

10.3.3   P-value from statistical distance

Statistical distance and test statistic

The p-value for testing a hypothesis about the mean, µ, when σ is known, is a tail area from the normal distribution of the sample mean and can be evaluated in the usual way using a z-score. This calculation can be expressed in terms of the statistical distance between the parameter and its estimate,

z  =  (estimate − hypothesised value) / se(estimate)

In the context of a test about means,

z  =  (x̄ − µ0) / (σ/√n)

Since z has a standard normal distribution (zero mean and unit standard deviation) when the null hypothesis holds, it can be used as a test statistic.

P-value

The p-value for the test can be determined from the tail areas of the standard normal distribution.

(diagram: standard normal distribution of z, with the tail areas that give the p-value)

For a two-tailed test, the p-value is the red tail area.

Quality control for cornflake packets

The diagram below repeats the simulation that we used earlier to test whether a sample mean weight of 10 cornflake packets of 529 gm is consistent with a packing machine that is set to give normally distributed weights with µ = 520 gm and σ = 10 gm.

Again click Accumulate and hold down the Simulate button until about 100 samples of 10 packets have been selected and weighed. The p-value is the probability of getting a sample mean further from 520 gm than 529 gm (either below 511 gm or above 529 gm), and the simulation provides an estimate of it. However, a simulation is unnecessary since we can evaluate the p-value exactly.

Select Normal distribution from the pop-up menu on the bottom right to replace the simulation with the normal distribution of the mean,

x̄  ~  normal(µ = 520, σ/√n = 10/√10 = 3.16)

From its tail area, we can calculate (without a simulation) that the probability of getting a sample mean as far from 520 as 529 is exactly 0.0044. This is the exact p-value for the test.
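This exact calculation is easily reproduced (scipy assumed):

    import math
    from scipy import stats

    xbar, mu0, sigma, n = 529, 520, 10, 10
    z = (xbar - mu0) / (sigma / math.sqrt(n))   # statistical distance, about 2.85
    p_value = 2 * stats.norm.sf(abs(z))         # two tail areas
    print(p_value)                              # about 0.0044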

P-value from statistical distance

Finally, consider the statistical distance of our estimate of µ, 529 gm, from the hypothesised value, 520 gm.

z  =  (529 − 520) / (10/√10)  =  2.85

Select Statistical distance from 520 from the middle pop-up menu to show how the p-value is found using this z-score.

Since the p-value is so small (0.0044), we conclude that there is strong evidence that the population mean, µ, is not 520.

Weights of courier packages

A courier company suspected that the weight of recently shipped packages had dropped. From past records, the mean weight of packages was 18.3 kg and their standard deviation was 7.1 kg. These figures were based on a very large number of packages and can be treated as exact.

Thirty packages were sampled from the previous week and their mean weight was found to be 16.8 kg. The data are displayed in the jittered dot plot below.

If the null hypothesis was true, the sample mean would have the normal distribution shown in pale blue. Although the sample mean weight is lower than 18.3 kg, it is not particularly unusual for this distribution, so we conclude that there is no evidence that the mean weight has reduced.

The right of the diagram shows how the p-value is calculated from a statistical distance (z-score).


Choose Modified Data from the pop-up menu. The slider allows you to investigate how low the sample mean must become in order to give strong evidence that µ is less than 18.3.

10.3.4   The t distribution

Unknown standard deviation

In the examples on the previous page, the population standard deviation, σ, was a known value. Unfortunately this is rarely the case in practice, so the previous test cannot be used.

Saturated fat content of cooking oil

Both cholesterol and saturated fats are often avoided by people who are trying to lose weight or reduce their blood cholesterol level. Cooking oil made from soybeans has little cholesterol and has been claimed to have only 15% saturated fat.

A clinician believes that the saturated fat content is greater than 15% and randomly samples 13 bottles of soybean cooking oil for testing.

Percentage saturated fat in soybean cooking oil
15.2   12.4   15.4   13.5   15.9   17.1   16.9   14.3   19.1   18.2   15.5   16.3   20.0

The hypotheses of interest are similar to those in the initial pages of this section,

H0:   µ = 15

HA:   µ > 15

However we no longer know the population standard deviation, σ. The only information we have about σ comes from our sample.

Test statistic and its distribution

When the population standard deviation, σ, was a known value, we used a test statistic

z  =  (x̄ − µ0) / (σ/√n)

which has a standard normal distribution when H0 was true.

When σ is unknown, we use a closely related test statistic that is also a 'statistical distance' between the sample mean and µ0,

t  =  (x̄ − µ0) / (s/√n)

where s is the sample standard deviation. This test statistic has greater spread than the standard normal distribution, due to the extra variability that results from estimating σ with s, especially when the sample size n is small.

The diagram below generates random samples from a normal distribution. Click Take sample a few times to see the variability in the samples.

Click Accumulate then take about 50 random samples. Observe that the stacked dot plot of the t statistic conforms reasonably with a standard normal distribution.

Now use the pop-up menu to reduce the sample size to 5 and take a further 50-100 samples. You will probably notice that there are more 'extreme' t-values (less than -3 or more than +3) than would be expected from a standard normal distribution.

Reduce the sample size to 3 and repeat. It should now be clearer that the distribution of the t-statistic has greater spread than a standard normal distribution. Click on the crosses for the most extreme t-values and observe that they correspond to samples in which the 3 data values happen to be close together, resulting in a small sample standard deviation, s.
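The same experiment can be run non-interactively. This sketch (numpy assumed) simulates many samples of size 3 from a standard normal population and shows that extreme t-values are far more common than a standard normal distribution predicts:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 3
    samples = rng.normal(size=(10_000, n))       # population mean is 0
    t = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))
    print(np.mean(np.abs(t) > 3))   # near 0.10 here, versus 0.0027 for a normal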

The t distribution

We have seen that the t statistic does not have a standard normal distribution, but it does have another standard distribution called a t distribution with (n - 1) degrees of freedom. In the next page, we will use this distribution to obtain the p-value for hypothesis tests.

The diagram below shows the shape of the t distribution for various different values of the degrees of freedom.

Drag the slider to see how the shape of the t distribution depends on the degrees of freedom. Note that


A standard normal distribution can be used as an approximation to a t distribution if the degrees of freedom are large (say 30 or more) but the t distribution must be used for smaller degrees of freedom.


10.3.5   The t test for a mean

Finding a p-value from the t distribution

The p-value for any test is the probability of getting such an 'extreme' test statistic when H0 is true. When testing the value of a population mean, µ, with σ unknown, the appropriate test statistic is

t  =  (x̄ − µ0) / (s/√n)

Since this has a t distribution (with n − 1 degrees of freedom) when H0 is true, the p-value is found from a tail area of this distribution. The relevant tail depends on the alternative hypothesis. For example, if the alternative hypothesis is for low values of µ, the p-value is the lower tail area of the t distribution since low values of x̄ (and hence t) would support HA over H0.

H0:   µ = µ0

HA:   µ < µ0

The steps in performing the test are shown in the diagram below.

(diagram: steps in obtaining the p-value for a one-tailed t test)

Computer software should be used to obtain the p-value from the t distribution.

Saturated fat content of cooking oil

The example on the previous page asked whether the saturated fat content of soybean cooking oil was greater than 15%, based on data from 13 bottles. The population standard deviation was unknown and the hypotheses of interest were,

H0:   µ = 15

HA:   µ > 15

The diagram below shows the calculations for obtaining the p-value for this test from the t distribution with (n - 1) = 12 degrees of freedom.

Since the probability of obtaining such a high sample mean if the underlying population mean was 15 (the p-value) is only 0.04, we conclude that there is moderately strong evidence that the mean saturated oil content is over 15 percent.
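The whole test is a single call in scipy (assumed to be version 1.6 or later, for the alternative argument):

    import numpy as np
    from scipy import stats

    fat = np.array([15.2, 12.4, 15.4, 13.5, 15.9, 17.1, 16.9, 14.3,
                    19.1, 18.2, 15.5, 16.3, 20.0])
    t, p = stats.ttest_1samp(fat, popmean=15, alternative='greater')
    print(t, p)   # t is about 1.9 on 12 degrees of freedom; p is about 0.04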


Select Modified Data from the pop-up menu and use the slider to investigate the relationship between the sample mean and the p-value for the test.

Two-tailed test

In some hypothesis tests, the alternative hypothesis allows both low and high values of µ.

H0:   µ = µ0

HA:   µ ≠ µ0

In this type of two-tailed test, the p-value is the sum of the two tail areas, as illustrated below.

(diagram: test statistic and two-tailed p-value for a t test)

10.4   Decisions and significance

  1. Hypothesis tests and decisions
  2. Decision rules
  3. Significance level and p-values
  4. Sample size and power

10.4.1   Hypothesis tests and decisions

Strength of evidence against H0

We have explained how p-values describe the strength of evidence against the null hypothesis.

Saturated fat content of cooking oil

It has been claimed that the saturated fat content of soybean cooking oil is no more than 15%. A clinician believes that the saturated fat content is greater than 15% and randomly samples 13 bottles of soybean cooking oil for testing.

Percentage saturated fat in soybean cooking oil
15.2   12.4   15.4   13.5   15.9   17.1   16.9   14.3   19.1   18.2   15.5   16.3   20.0

The clinician is interested in the following hypotheses.

H0:   µ = 15

HA:   µ > 15

The p-value of 0.04 means that there is moderately strong evidence against H0 — i.e. moderately strong evidence that the mean saturated fat content is greater than 15%.

Decisions from tests

We now take a different (but related) approach to hypothesis testing.

Many hypothesis tests are followed by some action that depends on whether we conclude from the test results that H0 or HA is true. This decision depends on the data.

Decision    Action
accept H0    some action (often the status quo)   
reject H0    a different action (often a change to a process)   

However the decision that is made could be wrong. There are two ways in which an error might be made — wrongly rejecting H0 when it is true (called a Type I error), and wrongly accepting H0 when it is false (called a Type II error). These are represented by the red cells in the table below:

                                         Decision
                                         accept H0        reject H0
True state of nature  H0 is true         correct          Type I error
                      HA (H0 is false)   Type II error    correct

A good decision rule about whether to accept or reject H0 (and perform the corresponding action) will have small probabilities for both kinds of error.

Saturated fat content of cooking oil

The clinician who tested the saturated fat content of soybean cooking oil was interested in the hypotheses.

H0:   µ = 15

HA:   µ > 15

If H0 is rejected, the clinician intends to report the high saturated fat content to the media. The two possible errors that could be made are described below.

                                         Decision
                                         accept H0 (do nothing)               reject H0 (contact media)
Truth  H0: µ is really 15% (or less)     correct                              wrongly accuses manufacturers
       HA: µ is really over 15%          fails to detect high saturated fat   correct

Ideally the decision should be made in a way that keeps both probabilities low.

10.4.2   Decision rules

Using a sample mean to make decisions

We now introduce the idea of decision rules with a test about whether a population mean is a particular value, µ0, or greater. We assume initially that the population is normally distributed and that its standard deviation, σ, is known.

H0:   µ = µ0

HA:   µ > µ0

The decision about whether to accept or reject H0 should depend on the value of the sample mean, x̄. Large values throw doubt on H0.

Data           Decision
x̄ < k         accept H0
x̄ ≥ k         reject H0

We want to choose the value k to make the probability of errors low. This is however complicated because of the two different types of error.

                              Decision
                              accept H0        reject H0
Truth   H0 is true            correct          Type I error
        HA (H0 is false)      Type II error    correct

Increasing the value of k to make the Type I error probability small (top right) also increases the Type II error probability (bottom left) so the choice of k for the decision rule is a trade-off between the acceptable sizes of the two types of error.

Illustration

The diagram below relates to a normal population whose standard deviation is known to be σ = 4. We will test the hypotheses

H0:   µ = 10

HA:   µ > 10

The test is based on the sample mean of n = 16 values from this distribution. The sample mean has a normal distribution,

x̄  ~  normal(µ, σ/√n = 4/√16 = 1)

This normal distribution can be used to calculate the probabilities of the two types of error. The diagram below illustrates how the probabilities of the two types of error depend on the critical value for the test, k.

Drag the slider at the top of the diagram to adjust k. Observe that making k large reduces the probability of a Type I error, but makes a Type II error more likely. It is impossible to simultaneously make both probabilities small with only n = 16 observations.


Note also that there is not a single value for the probability of a Type II error — the probability depends on how far above 10 the mean µ lies. Drag the slider on the row for the alternative hypothesis to observe that:

The probability of a Type II error is always high if µ is close to 10, but is lower if µ is far above 10.

This is as should be expected — the further above 10 the population mean, the more likely we are to detect that it is higher than 10 from the sample mean.
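The two error probabilities in this illustration are simple normal tail areas; a sketch (scipy assumed, with k = 11.64 as one possible critical value):

    from scipy import stats

    mu0, sigma, n = 10, 4, 16
    se = sigma / n ** 0.5                        # sd of the sample mean = 1
    k = 11.64                                    # one possible critical value

    print(stats.norm.sf(k, loc=mu0, scale=se))   # P(Type I error), about 0.05

    for mu in (10.5, 11.64, 13.0):
        type2 = stats.norm.cdf(k, loc=mu, scale=se)   # P(accept H0 | mu)
        print(mu, type2)     # high when mu is near 10, small when far above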

10.4.3   Significance level and p-values

Significance level

The decision rule affects the probabilities of Type I and Type II errors and there is always a trade-off between these two probabilities. Selecting a critical value to reduce one error probability will increase the other.

In practice, we usually concentrate on the probability of a Type I error. The decision rule is chosen to make the probability of a Type I error equal to a pre-chosen value, often 5% or 1%. This probability is called the significance level of the test and its choice should depend on the type of problem. The worse the consequence of incorrectly rejecting H0, the lower the significance level that should be used.

If the significance level of the test is set to 5% and we decide to reject H0 then we say that H0 is rejected at the 5% significance level.

Reducing the significance level of the test increases the probability of a Type II error.

The choice of significance level should depend on the type of problem.

The worse the consequence of incorrectly rejecting H0, the lower the significance level that should be used. In many applications the significance level is set at 5%.

Illustration

The diagram below is identical to the one on the previous page.

With the top slider, adjust k to make the probability of a Type I error as close as possible to 5%. This is the decision rule for a test with significance level 5%.

From the normal distribution, the appropriate value of k for a test with 5% significance level is 11.64.

Drag the top slider to reduce the significance level to 1% and note that the critical value for the test increases to about k = 12.3.

P-values and decisions

The critical value for a hypothesis test about a population mean (known standard deviation) with any significance level (e.g. 5% or 1%) can be obtained from the quantiles of normal distributions. For other hypothesis tests, it is possible to find similar critical values from quantiles of the relevant test statistic's distribution.

For example, when testing the mean of a normal population when the population standard deviation is unknown, the test statistic is a t-value and its critical values are quantiles of a t distribution.

It would seem that different methodology is needed to find decision rules for different types of hypothesis test, but this is only partially true. Although some of the underlying theory depends on the type of test, the decision rule for any test can be based on its p-value. For example, for a test with significance level 5%, the decision rule is always:

Decision
p-value ≥ 0.05     accept H0
p-value < 0.05     reject H0

For a test with significance level 1%, the null hypothesis, H0, should be rejected if the p-value is less than 0.01.

If computer software provides the p-value for a hypothesis test, it is therefore easy to translate it into a decision to accept or reject the null hypothesis at the 5% or 1% significance level.


Illustration

The following diagram again investigates decision rules for testing the hypotheses

H0: µ = 10        HA: µ > 10

based on a sample of n = 16 values from a normal population with known standard deviation σ = 4.

In the diagram, the decision rule is based on the p-value for the test. Use the slider to adjust the critical p-value and observe that the significance level (probability of Type I error) is always equal to the p-value used in the decision rule. Adjust the critical p-value to 0.01.

Although the probability of a Type II error (the bottom row of the display) varies with the type of test, the top row, the significance level, is the same for all kinds of hypothesis test.
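
That claim can be checked by simulation. The sketch below is our own illustration: it uses the same one-tailed z-test (H0: µ = 10, σ = 4, n = 16) and estimates the Type I error rate of the rule 'reject H0 if p-value < 0.01'.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    n, sigma, mu0 = 16, 4, 10
    xbars = rng.normal(mu0, sigma, (100_000, n)).mean(axis=1)  # H0 is true
    z = (xbars - mu0) / (sigma / np.sqrt(n))
    pvals = norm.sf(z)                   # one-sided p-values for HA: mu > 10
    print((pvals < 0.01).mean())         # close to 0.01, the critical p-value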


10.4.4   Sample size and power

Power of a test

A decision rule about whether to accept or reject H0 can result in one of two types of error. The probabilities of making these errors describe the risks involved in the decision.

Prob(Type I error)
This is the significance level of the test. The decision rule is usually defined to make the significance level 5% or 1%.
Prob(Type II error)
When the alternative hypothesis includes a range of possible parameter values (e.g. µ ≠ 0), this probability is not a single value but depends on the parameter.

Instead of the probability of a Type II error, it is common to use the power of the test, defined as one minus the probability of a Type II error:

power  =  1 - P(Type II error)

The power of a test is the probability of correctly rejecting H0 when it is false.

When the alternative hypothesis includes a range of possible parameter values (e.g. µ ≠ 0), the power depends on the actual parameter value.

                                 Decision
                       accept H0            reject H0
Truth  H0 is true      (correct)            Significance level = P(Type I error)
       HA (H0 false)   P(Type II error)     Power = 1 - P(Type II error)

Increasing the power of a test

It is clearly desirable to use a test whose power is as close to 1.0 as possible. There are three different ways to increase the power.

Increase the significance level
If the critical value for the test is adjusted, increasing the probability of a Type I error decreases the probability of a Type II error and therefore increases the power.
Use a different decision rule
For example, in a test about the mean of a normal population, a decision rule based on the sample median has lower power than a decision rule based on the sample mean.

In CAST, we only describe the most powerful type of decision rule to test any hypotheses, so you will not be able to increase the power by changing the decision rule.

Increase the sample size
By increasing the amount of data on which we base our decision about whether to accept or reject H0, the probabilities of making errors can be reduced.

When the significance level is fixed, increasing the sample size is therefore usually the only way to improve the power.
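
For the example used in this section, the power can be computed exactly from normal distributions. In the sketch below (the function power is our own), the significance level is fixed at 5% and the power is evaluated at an assumed true mean of µ = 12 for several sample sizes.

    from scipy.stats import norm

    def power(n, mu, mu0=10, sigma=4, alpha=0.05):
        """P(reject H0: mu = mu0) for the one-tailed test HA: mu > mu0."""
        se = sigma / n ** 0.5
        k = norm.ppf(1 - alpha, loc=mu0, scale=se)   # critical value
        return norm.sf(k, loc=mu, scale=se)

    for n in (16, 36, 64, 100):
        print(n, round(power(n, mu=12), 3))          # power increases with n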

Illustration

The following diagram again investigates decision rules for testing the hypotheses

H0: µ = 10        HA: µ > 10

based on samples from a normal population with known standard deviation σ = 4. We will fix the significance level of the test at 5%.

The top half of the diagram shows the normal distribution of the sample mean for a sample of size n = 16. Use the slider to increase the sample size and observe that the distribution of the sample mean becomes narrower, the critical value k moves closer to 10, and the power of the test increases.


10.5   Properties of p-values

  1. Null and alternative hypotheses
  2. Consistency with null hypothesis
  3. Distribution of p-values
  4. Interpretation of a p-value
  5. P-values for other tests

10.5.1   Null and alternative hypotheses

Symmetric hypotheses

In some situations there is a kind of symmetry between the two competing hypotheses. The sample data provide information about which of the two hypotheses is true.

Election poll

Two candidates, Mike Smith and Sarah Brown, stand for election as president of a student council. Four days before the election, the student newspaper asks 56 randomly selected students about their voting intentions. If the proportion intending to vote for Mike Smith is denoted by π, the hypotheses of interest are

π > 0.5 (Mike Smith will win)        π < 0.5 (Sarah Brown will win)

The diagram below illustrates how the poll results might weigh the evidence for each candidate winning.

Drag the slider to see how the number in the sample intending to vote for Mike Smith affects the evidence. Unless either candidate receives (say) three quarters of the sample vote, we should admit that there is some doubt about who will win — the sample may not accurately reflect the population proportions.

Null and alternative hypotheses

In statistical hypothesis testing, the two hypotheses are not treated symmetrically in this way. We must distinguish in a much more fundamental way between them.

In statistical hypothesis testing, we do not ask which of the two competing hypotheses is true.

Instead, we ask whether the sample data are consistent with one particular hypothesis (the null hypothesis, denoted by H0). If the data are not consistent with the null hypothesis, then we can conclude that the competing hypothesis (the alternative hypothesis, denoted by HA) must be true.


This distinction between the hypotheses is important. Depending on the sample data, it may be possible to conclude that HA is true. However, regardless of the data, the strongest statement we can make in support of H0 is that the data are consistent with it.

We can never conclude that H0 is likely to be true.


Memory test and exercise

Forty students in a psychology class are given a memory test. After a 30-minute session where the students undertake a variety of physical exercises, the students are given another similar memory test.

Has exercise affected memory? The data are paired, so we analyse the difference in test results for each student ('after exercise' minus 'before exercise') and test whether the underlying population mean of these differences is zero.

H0: µ = 0        HA: µ ≠ 0

The diagram below illustrates the evidence obtained from a set of sample data.

Drag the slider to see the conclusions that might be reached for data sets with different means. The further the sample mean is from zero (on either side), the stronger the evidence that µ is not zero. We can get very strong evidence that H0 does not hold if the sample mean is far from zero.

However, even a sample mean of exactly zero does not provide strong evidence that µ = 0.

If the sample mean is 0, µ could just as easily be 0.0001 or -0.0002 (values which correspond to HA). We cannot distinguish between these possibilities, so the best we can say is that the data are consistent with the null hypothesis — the data provide no evidence against µ being zero.

In the context of this example, the conclusion from a sample mean of zero would be that the experiment gave no evidence that exercise affected memory. Exercise might affect memory, but the experiment did not detect the effect.
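
In practice such data would be analysed in software. The following Python sketch shows how; the before and after scores are invented for illustration and are not the class's real results.

    import numpy as np
    from scipy import stats

    # Hypothetical scores for eight of the students (illustration only).
    before = np.array([12, 15, 9, 14, 11, 13, 10, 16])
    after  = np.array([13, 14, 10, 15, 11, 14, 9, 17])

    diff = after - before                  # 'after exercise' minus 'before'
    result = stats.ttest_1samp(diff, 0)    # tests H0: mu = 0 (two-sided)
    print(result.statistic, result.pvalue)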

The distinction between the null and alternative hypotheses is so important that we repeat it below.

We never try to 'prove' that H0 holds, though we may be able to 'prove' that HA holds.

10.5.2   Consistency with null hypothesis

Describing the credibility of the null hypothesis

On the previous page, a diagram with scales illustrated how the evidence against H0 was 'weighed' for different data sets. A p-value is a numerical description of this evidence that attaches a scale to the diagram.

A p-value is a numerical summary statistic that describes the evidence against H0.


Computer user-interface test

In an assessment of the user-interface of a computer program, sixteen users are shown a screen containing typical output for 10 seconds. Each user is then asked to indicate the position on the screen of a particular piece of information. The vertical distance between the indicated location and the actual location is recorded for each individual. (These 'errors' are negative if the user indicated too low a position.)

Do the users tend to pick the location of the item correctly, or is there a tendency to point too high or low? This question is equivalent to asking whether there is evidence that the underlying population mean of the 'errors' is different from zero.

The diagram below weighs the evidence using the p-value from a t-test of whether µ = 0.
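
For readers who want to see the arithmetic behind such a t-test, here is a Python sketch with invented 'error' values (the experiment's real data are not listed here).

    import numpy as np
    from scipy.stats import t as t_dist

    # Hypothetical vertical 'errors' for the sixteen users (units arbitrary).
    errors = np.array([5, -3, 12, 8, -1, 4, 7, -6, 9, 2, 11, -4, 3, 6, 1, 10])

    n = len(errors)
    t_stat = errors.mean() / (errors.std(ddof=1) / np.sqrt(n))
    p_value = 2 * t_dist.sf(abs(t_stat), df=n - 1)   # two-sided test of mu = 0
    print(t_stat, p_value)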


The p-value is an index of credibility for the null hypothesis, µ = 0.


P-values have a similar interpretation for all hypothesis tests.

10.5.3   Distribution of p-values

Interpretation of p-values

Many different types of hypothesis test are commonly used in advanced statistics, but all share common features.

A p-value is a statistic that is evaluated from a random sample, so it has a distribution in the same way that a sample mean has a distribution. This distribution also has features that are common to all hypothesis tests. Understanding the distribution of p-values is the key to understanding how they are interpreted.

Distribution of p-values

In any hypothesis test, the p-value has a rectangular distribution between 0 and 1 when the null hypothesis holds, whereas p-values tend to be closer to 0 when the alternative hypothesis holds.

The diagram below shows typical distributions that might be obtained.

[Figure: the distribution of p-values is rectangular between 0 and 1 when H0 holds, and is concentrated near 0 when HA holds]

To illustrate these properties, we use a test for whether a population mean is zero.

H0: µ = 0        HA: µ ≠ 0

In the diagram below, you will take random samples from a normal population for which H0 is true and, separately, from populations for which HA is true.

When H0 holds

Initially the population mean is zero, so H0 holds. A single sample from this population is shown on the left and the p-value for testing whether the population mean is zero is shown as a cross on the jittered dot plot on the bottom right.

Click the button Take sample a few times to take other samples from this population and add their p-values to the display on the bottom right. After taking 50 or more samples, you should observe that the p-values are spread evenly between 0 and 1. This supports our assertion that the p-values have a rectangular distribution between 0 and 1 when H0 holds.
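
This simulation is easy to reproduce in code. A sketch, assuming samples of n = 16 from a normal population with µ = 0 and σ = 4 (the sample size and standard deviation are our assumptions here):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    pvals = np.array([stats.ttest_1samp(rng.normal(0.0, 4.0, 16), 0).pvalue
                      for _ in range(1000)])              # H0 is true
    print(np.histogram(pvals, bins=10, range=(0, 1))[0])  # roughly equal counts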

When HA holds

Now use the slider to change the true population mean to 2.0. We are still testing whether the mean is zero, so HA now holds. Take 40 or 50 samples and observe that the p-values are usually closer to 0 than to 1.

Click on some of the larger p-values on the jittered dot plot to display the samples that gave rise to them. The sample means vary and, by chance, some samples have means that are near 0.0, even when the population mean is 2.0; these samples result in larger p-values.

Repeat this exercise with different population means (try at least 1.0, 2.0, 3.0 and -2.0). The further the population mean is from the value specified by H0 (0.0), the more tightly the p-values cluster around 0.0.

Although it is possible to obtain a low p-value when H0 holds and a high p-value when HA holds, low p-values are more likely under HA than under H0.
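
A companion sketch for the case where HA holds (true mean 2.0, still testing µ = 0, with the same assumed population otherwise):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    pvals = np.array([stats.ttest_1samp(rng.normal(2.0, 4.0, 16), 0).pvalue
                      for _ in range(1000)])   # HA is true
    print((pvals < 0.1).mean())    # far more than the 10% expected under H0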

10.5.4   Interpretation of a p-value

P-values and probability

We saw on the previous page that p-values have a rectangular distribution between 0 and 1 when H0 holds. A consequence of this is that the probability of obtaining a p-value of 0.1 or lower is exactly 0.1 (when H0 holds). This is illustrated on the left of the diagram below.

[Figure: when H0 holds, the rectangular distribution of p-values means that P(p-value ≤ 0.1) = 0.1]

Similarly, the probability of obtaining a p-value of 0.01 or lower is exactly 0.01, etc. (when H0 holds).

P-values are most likely to be near 0 if the alternative hypothesis holds

Again, we use the specific hypothesis test for

H0: µ = 0        HA: µ ≠ 0

in order to demonstrate these general results.

Click the button Take sample 50 or more times to take samples from this population and add their p-values to the display on the right. From the diagram on the top right, we can read off the proportion of p-values that are less than any value. Approximately 50% of p-values are less than 0.5, 20% are less than 0.2, etc. when the null hypothesis is true.

Use the slider to change the true population mean to 1.5 and repeat. From the diagram on the top right, you should observe that more than 50% of p-values are less than 0.5, more than 20% are less than 0.2, etc. when the alternative hypothesis holds.

Interpretation of p-value

Remembering that low p-values favour HA more than H0, we can give the following interpretation to a p-value.

If a data set gives rise to a p-value of, say, 0.0023, we can state that the probability of getting a data set with such a low p-value is only 0.0023 if H0 is true. Since such a low p-value is so unlikely under H0, the data give strong evidence that H0 does not hold.

Of course, we may be wrong. A p-value of 0.0023 could arise when either H0 or HA holds. However it is unlikely when H0 is true and more likely when HA is true.

Similarly, a p-value as low as 0.4 occurs with probability 0.4 when the null hypothesis holds. Since p-values this low are common when H0 is true, a data set with a p-value of 0.4 gives no evidence that the null hypothesis does not hold.

Although it may be regarded as an over-simplification, the table below may be used as a guide to interpreting p-values.

p-value                  Interpretation
over 0.1                 no evidence that the null hypothesis does not hold
between 0.05 and 0.1     very weak evidence that the null hypothesis does not hold
between 0.01 and 0.05    moderately strong evidence that the null hypothesis does not hold
under 0.01               strong evidence that the null hypothesis does not hold

10.5.5   P-values for other tests

Applying the general properties of p-values to different tests

The properties of p-values (and hence their interpretation) have been demonstrated in the context of a hypothesis test about whether a population mean was zero.

P-values for all hypothesis tests have the same properties. As a result, we can interpret any p-value if we know the null and alternative hypotheses that it tests, even if we do not know the formula that underlies it. (In practice, a statistical computer program is generally used to perform hypothesis tests, so knowledge of the formula is of little importance.)

In particular, the following interpretations hold for any test whose null hypothesis restricts a parameter to a single value:


p-value                  Interpretation
over 0.1                 no evidence that the null hypothesis does not hold
between 0.05 and 0.1     very weak evidence that the null hypothesis does not hold
between 0.01 and 0.05    moderately strong evidence that the null hypothesis does not hold
under 0.01               strong evidence that the null hypothesis does not hold

Another type of test

The normal distribution is often used as a hypothetical population from which a set of data are assumed to be sampled. But are the data consistent with an underlying normal population, or does the population distribution have a different shape?

One popular test for assessing whether a random sample comes from a normal population is the Shapiro-Wilk W test. The theory behind the test is advanced and the formula for its p-value cannot readily be evaluated by hand. However, most statistical programs will perform the test.

A random sample of 40 values from a normal population is displayed in a jittered dot plot on the left of the diagram. The p-value for the Shapiro-Wilk W test is shown under the dot plot and also graphically on the right.

Click Take sample a few times to take more samples and build the distribution of the p-values for the test. You should observe that the p-values have a rectangular distribution between 0 and 1 when the null hypothesis is true (i.e. if the samples are from a normal distribution).

Drag the slider on the top left of the diagram to change the shape of the population distribution. Repeat the exercise above and observe that when the null hypothesis does not hold, the p-values tend to be closer to 0.

Click on crosses on the display of p-values in the bottom right to display the sample that produced that p-value. P-values near zero usually correspond to samples that have very long tails to one or both sides, or have very short tails to one or both sides.
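
The same experiment can be reproduced with scipy's implementation of the test (scipy.stats.shapiro). In this sketch an exponential population is our choice of a convenient skew, non-normal alternative.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Null hypothesis true: samples of 40 really are normal.
    p_norm = [stats.shapiro(rng.normal(size=40)).pvalue for _ in range(1000)]
    print(np.histogram(p_norm, bins=10, range=(0, 1))[0])   # roughly uniform

    # Null hypothesis false: samples of 40 from a skew population.
    p_skew = [stats.shapiro(rng.exponential(size=40)).pvalue
              for _ in range(1000)]
    print(np.mean(np.array(p_skew) < 0.05))                 # far above 0.05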

Measuring the speed of light

As a numerical example, consider the following experimental measurements made by the scientist Simon Newcomb in 1882 to estimate the speed of light in air. The values are the times, in nanoseconds (0.000000001 seconds), for light to travel 7,442 metres. Since the measurements were all close to 24,800, they have been coded by subtracting 24,800.

Raw data (nanoseconds) Coded data
24,828 24,828 - 24,800 = 28
24,826 24,826 - 24,800 = 26
etc etc

The coded data and a histogram are shown below.

28  26  33  24  34 -44  27  16  40  -2  29  22
24  21  25  30  23  29  31  19  24  20  36  32
36  28  25  21  28  29  37  25  28  26  30  32
36  26  30  22  36  23  27  27  28  27  31  27
26  33  26  32  32  24  39  28  24  25  32  25
29  27  28  29  16  23
[Histogram of the coded data]

The best-fitting normal distribution (with mean and standard deviation equal to those of the data) has been superimposed on the histogram. Could the two 'outliers' in the data have occurred by chance from a normal population?

Applying the Shapiro-Wilk W test to the data using the statistical program JMP gives a p-value of '0.0000'. Since JMP rounds p-values to four decimal places, this really means that the p-value is less than 0.00005. We therefore conclude that the probability of obtaining such a non-normal-looking sample from a normal distribution is less than 0.00005, so there is extremely strong evidence that the data do not come from a normal population.

In contrast, if the two 'outliers' are omitted, JMP reports a p-value of 0.6167 for the test. Since a p-value as low as this would be found from 62% of samples from a normal population, there is no evidence that the data without the outliers are non-normal. The test therefore lends support to the assertion that the two outliers resulted from errors in Newcomb's experimental procedures.
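
The analysis is easy to replicate. The sketch below runs scipy's Shapiro-Wilk test on the coded data from the table above; scipy's p-values may differ slightly from JMP's, but they should lead to the same conclusions.

    import numpy as np
    from scipy import stats

    coded = np.array([
        28, 26, 33, 24, 34, -44, 27, 16, 40, -2, 29, 22,
        24, 21, 25, 30, 23, 29, 31, 19, 24, 20, 36, 32,
        36, 28, 25, 21, 28, 29, 37, 25, 28, 26, 30, 32,
        36, 26, 30, 22, 36, 23, 27, 27, 28, 27, 31, 27,
        26, 33, 26, 32, 32, 24, 39, 28, 24, 25, 32, 25,
        29, 27, 28, 29, 16, 23])

    print(stats.shapiro(coded).pvalue)             # essentially zero
    print(stats.shapiro(coded[coded > 0]).pvalue)  # dropping the two negative
                                                   # 'outliers': much larger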

You should be able to interpret p-values that computer software provides for a wide variety of hypothesis tests using the properties that we have described in this section.