8. Designed Experiments

In most data sets, we are interested in the relationship between the variables. This is most obviously the case when the data consist of two numerical variables (correlation and least squares give information about the relationship) or two categorical variables (a contingency table and conditional proportions help show the relationship).

Questions about differences between two or more groups can also be expressed in terms of relationships between variables if group membership is represented by a categorical variable. For example, if we have collected data about incomes from a sample of males and a sample of females, we would be interested in comparing these two groups — i.e. the relationship between income and gender.

In some situations, the relationship between two variables, such as the relationship evident in a scatterplot, may not describe a meaningful 'real' relationship.

8.1.2 Causal and non-causal relationships

As noted on the previous page, researchers are usually interested in relationships between variables. When two variables are related, we say that there is an association between them.

For example, consider the height, X, and weight, Y, of a sample of school children. Tall children tend to be heavier, so high values of X are associated with high values of Y. The correlation coefficient describes the amount of linear association between two such numerical variables.

In some data sets, it is possible to conclude that one variable has a direct influence on the other. This is called a causal relationship.

If two variables are causally related, it is possible to conclude that changes to the explanatory variable, X, will have a direct impact on Y.

Not all relationships are causal. In non-causal relationships, the relationship that is evident between the two variables is not completely the result of one variable directly affecting the other. In the most extreme case, ...

The two diagrams below illustrate mechanisms that result in non-causal relationships between X and Y.

If two variables are not causally related, it is impossible to tell whether changes to one variable, X, will result in changes to the other variable, Y.

For example, the scatterplot below shows data from a sample of towns in a region.

The positive correlation between the number of churches and the number of deaths from cancer is an example of a non-causal relationship — the size of the towns is a lurking variable since larger towns have more churches and also more deaths. Clearly decreasing the number of churches in a town will not reduce the number of deaths from cancer!
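The mechanism can be sketched with a small simulation. The town sizes, rates and noise levels below are invented for illustration and are not the data in the scatterplot; the only point is that two variables that both depend on town size end up correlated even though neither affects the other.

```python
import random

def pearson(xs, ys):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate_towns(n_towns=200, seed=1):
    """Hypothetical towns: population drives both churches and cancer deaths."""
    rng = random.Random(seed)
    churches, deaths = [], []
    for _ in range(n_towns):
        population = rng.uniform(1_000, 50_000)   # the lurking variable
        # Each variable depends on population only, never on the other one.
        churches.append(population / 2_000 + rng.gauss(0, 2))
        deaths.append(population / 500 + rng.gauss(0, 10))
    return churches, deaths

churches, deaths = simulate_towns()
print(f"correlation(churches, deaths) = {pearson(churches, deaths):.2f}")
```

Changing the church count of a simulated town by hand would leave its deaths untouched, because deaths are generated from population alone.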

8.1.3 Detecting causal relationships

Investigators usually hope to find causal relationships between the variables that are recorded. If one variable causally affects the other, then adjusting the value of that variable will cause the other to change. For example, if the milk yield of cows is causally affected by a dietary supplement, then yields can be increased by changing this supplement.

Non-causal relationships between two variables usually result from the effect of further variables called lurking variables that are related to the variables under investigation. Causal relationships can only be deduced if it can be reasoned that lurking variables are not present.

Smoking and attendance at college

The following contingency table comes from a survey of 400 males aged 19.

table

A smaller proportion of college students smoke than of males who do not attend college. Equivalently, a smaller proportion of smokers attend college than of non-smokers.

There are three possible interpretations of this relationship between smoking and college attendance.

Smoking directly affects education. Perhaps nicotine makes it harder to learn? If so, stopping smoking would be likely to improve a student's chances of going to college.
Education directly affects smoking. This would be true if going to college (and mixing with other college students) discouraged students from smoking.
Lurking variables affect both smoking and attendance at college. For example, genetic and environmental characteristics of students may determine both whether they smoke and whether they attend college. Changing one variable would not affect the other.

The data cannot help to resolve the issue of causation so it would be incorrect to report any causal relationship from these data.

8.1.4 Observational and experimental data

Most data sets consist of one or more values that are recorded from each of a set of individuals (or plants, plots of land, repetitions of an experiment or other 'units'). These individuals will vary in many ways other than the variables that are recorded.

Data are collected in an observational study if we passively record (observe) values from each unit.

Most observational studies are conducted by sampling units from some population.

Heights of fathers and sons

The scatterplot below describes the heights (inches) of a randomly selected group of 60 men at age 18 and their fathers' heights at the same age.

Click Take sample a few times to see the heights of other fathers and sons. Observe that both variables vary from data set to data set since the data are observational.

In an experiment, the researcher actively changes some characteristics of the units before the data are collected. The values of some variables are therefore under the control of the experimenter. In other words, the experimenter is able to choose each individual's values for some variables.

Type of experiment	Possible controlled variables
Agriculture	Fertiliser applied to plot of land Irrigation applied to plot Time of planting seeds
Psychology	Time allowed to memorise text Type of stimulus in reaction-time test
Industrial	Temperature of chemical reaction Quality of raw materials for a process

Lathe experiment

A mechanical engineer is investigating the surface finish of metal parts produced on a lathe and its relationship to the speed (in RPM) of the lathe. The engineer measures the finish of 4 parts produced at each of 8 lathe speeds, and the resulting data are displayed below.

This is experimental data since the engineer can control the lathe speed (the explanatory variable).

Click Take sample a few times to repeat the experiment and observe that the distribution of lathe speed remains the same — only the response (surface finish) changes in repetitions of the experiment.

8.1.5 Data collection and causality

The method of data collection has a major influence on whether a relationship can be interpreted as causal.

The most important characteristic of experiments is that they often do allow relationships to be interpreted as causal ones.

In a badly designed experiment however, lurking variables can still cause difficulties in interpreting relationships.

The examples below illustrate differences in interpreting observational and experimental data.

Irrigation and wheat growth: an observational study

How does soil moisture affect the yield of wheat? What are the likely benefits of irrigation? In a sample of farms, rain gauges are installed; rainfall (which includes irrigation from sprinklers) and crop yield are subsequently measured.

The scatterplot does not suggest that farms with higher rainfall or irrigation tend to have greater yields of wheat. However the data are observational and may not therefore show causal relationships.

The researcher should not conclude that irrigation has no effect on wheat yields.

The study was conducted over a large region in which the climate varied, so other characteristics of the sampled farms may also affect wheat yield. In particular, the mean summer temperature is a lurking variable that may affect the relationship — wetter farms also tend to be colder and the lower temperature may counter any benefits from higher rainfall.

Click the checkbox Show Temperature to discover which farms are hottest and coldest. Temperature is indeed a lurking variable here — there is a positive relationship between wheat yield and rainfall within each temperature group. If information about temperature was not available, the wrong conclusion about the likely effect of irrigation would be reached.
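A rough simulation shows how a lurking variable can hide a real effect. All numbers here are assumptions chosen for illustration (they are not the farm data): moisture truly adds 1 tonne per hectare per mm/day, hot farms are drier, and warmth adds a yield bonus of its own.

```python
import random

def ls_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def simulate_farms(n_per_group=100, seed=3):
    """Hypothetical farms: (temperature group, rainfall mm/day, yield t/ha)."""
    rng = random.Random(seed)
    farms = []
    for group, rain_low, warmth_bonus in (("hot", 1.0, 1.3), ("cold", 2.0, 0.0)):
        for _ in range(n_per_group):
            rain = rng.uniform(rain_low, rain_low + 1.0)
            # Assumed causal effect of moisture: +1 t/ha per extra mm/day.
            wheat = 2.0 + 1.0 * rain + warmth_bonus + rng.gauss(0, 0.2)
            farms.append((group, rain, wheat))
    return farms

farms = simulate_farms()
overall = ls_slope([r for _, r, _ in farms], [y for _, _, y in farms])
print(f"all farms:  slope = {overall:.2f}")
for group in ("hot", "cold"):
    rain = [r for g, r, _ in farms if g == group]
    wheat = [y for g, _, y in farms if g == group]
    print(f"{group} farms: slope = {ls_slope(rain, wheat):.2f}")
```

The slope over all farms is close to zero, while the slope inside each temperature group is close to the assumed effect of 1.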

Irrigation and wheat growth: an experiment

In an experimental study, the researcher controls moisture in each farm using supplementary irrigation.

In the experiment whose results are displayed below, farms from wetter areas are not used (since irrigation can only increase moisture, not reduce it). Three fields are used on each of 8 farms, chosen so that the fields in each group of 3 are as similar as possible. Irrigation is used to control moisture in each field, so that the three fields on each farm get the equivalent of 2, 2.5 and 3 mm of rainfall per day over the growing season.

Since differences in temperature, sunshine and other variables that may affect wheat yields are no longer related to moisture, any differences between yields can be causally attributed to moisture.

The jittered dot plot below shows the results of the experiment.

Since there are similar proportions of warm farms in all three groups, a similar picture of the effect of moisture is obtained whether or not we take account of temperature. From the results shown above, we would conclude that increasing moisture by 1 mm per day would increase wheat yields by approximately 0.4 tonnes per hectare.

8.2 Principles of experimental design

8.2.1 Experiments and treatments

An experiment is usually conducted in order to determine how some response is affected by one or more types of potential influences.

Experiments are generally conducted on a set of experimental units. Depending on the type of experiment, these units may be...

The definition of the experimental units is therefore closely associated with decisions about the response measurement that will be taken. For example, if a farmer is interested in the milk yield of a herd of cows, it may be decided that monthly measurements will be made from each cow. Each combination of a cow and a month would be considered to be a separate experimental unit.

The researcher has control over some aspect of each unit — perhaps a numerical characteristic such as the temperature at which a chemical reaction is conducted or a categorical characteristic such as the variety of wheat that is planted.

These controlled characteristics are the explanatory variables and are called factors in the context of an experiment. The different values of the controlled characteristics are called experimental treatments.

8.2.2 Variable experimental units

Weight gain of calves

An experiment is to be conducted to assess whether a feed supplement improves the weight gain of calves over a 2-month period. Eighteen calves are available for use in the experiment. These calves vary in their age and weight at the start of the experiment, as shown in the scatterplot below.

We initially consider the weight gains of the calves if none of them are given the feed supplement. Even without being given a supplement, the weight gains of the calves will vary and some of this variability is likely to be related to the initial ages and weights.

The diagram is 3-dimensional, so move the mouse to the centre (marked by either a pink circle or the pointer changing to a hand) and drag towards the top left to rotate. (Or click the y-x-z rotation button.) The third dimension shows weight gains for the calves.

Click the y-x and the y-z rotation buttons and observe that weight gain is associated with both age and initial weight. (The calves that are older and heavier at the start of the experiment tend to gain more weight.)

Click Repeat experiment to run the experiment with a different group of 18 calves that have the same initial ages and weights.

Remember that we are not interested in the effect of age and initial weight on weight gain.

We want to use these calves to assess the effect of a feed supplement on weight gain so the effect of age and initial weight only serves to complicate the experiment.

8.2.3 A badly designed experiment

We noted earlier that the experimental units often have considerable variability. If the treatments are allocated to experimental units in a way that is associated with their characteristics, these varying characteristics can distort the apparent relationship between the treatments and the response.

In a badly designed experiment, the characteristics of the experimental units act in the same way as lurking variables in observational studies.

Since variability in the experimental units is usually unavoidable, we cannot prevent their effect on the response. However, in an experiment it is possible to allocate treatments to the experimental units in a way that either eliminates, or at least reduces, the relationship between the treatment, X, and characteristics of the experimental units.

Weight gain in calves — a badly designed experiment

Eighteen calves were used in an experiment to assess whether a feed supplement improves their weight gain over a 2-month period. The calves were driven into a barn and the first nine to enter were separated and given the supplement. We will conduct a simulation of this experiment in which the supplement increases weight gain by exactly 5.

The circles on the left of the diagram below represent the 18 calves with their initial weights represented by the colours of the circles.

Click Allocate treatments to simulate the selection of nine of the calves (the first nine to enter the barn) to be given the feed supplement. Now click Run experiment to simulate the weight gains of the calves over two months.

Repeat the experiment a few times and observe that most runs of the experiment estimate the effect of the supplement to be an increased weight gain of between 6 and 11.

Why does the experiment consistently over-estimate the effect of the supplement?

The problem lies in the method of choosing the calves to get the supplement. Larger calves tend to push ahead and enter the barn first, so the calves getting the supplement tend to be larger. We saw on the previous page that larger calves tend to gain more weight even without being given a supplement.
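The bias can be reproduced in a few lines. This is only a sketch under assumed numbers: the weight range, the link between weight and weight gain, and the noise levels are all hypothetical, and only the true effect of 5 comes from the text.

```python
import random
from statistics import mean

TRUE_EFFECT = 5.0  # weight gain added by the supplement, as in the text

def run_biased_experiment(rng):
    """One run: the first nine calves into the barn get the supplement."""
    weights = [rng.uniform(80, 120) for _ in range(18)]   # hypothetical kg
    # Heavier calves tend to enter first; the noise makes the order imperfect.
    entry_order = sorted(range(18),
                         key=lambda i: weights[i] + rng.gauss(0, 10),
                         reverse=True)
    treated = set(entry_order[:9])
    with_supp, without = [], []
    for i, w in enumerate(weights):
        # Assumed gain without supplement: heavier calves gain more.
        gain = 20 + 0.2 * (w - 100) + rng.gauss(0, 2)
        if i in treated:
            with_supp.append(gain + TRUE_EFFECT)
        else:
            without.append(gain)
    return mean(with_supp) - mean(without)

rng = random.Random(42)
estimates = [run_biased_experiment(rng) for _ in range(2000)]
print(f"true effect {TRUE_EFFECT}, average estimate {mean(estimates):.1f}")
```

Averaged over many runs, the estimate sits well above 5 because the supplemented group starts out heavier.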

Weight gain with ineffective supplement

The reason for the misleading results is clearer in the following simulation in which the supplement has zero effect.

Repeat the simulation a few times and observe that the supplement is usually estimated to have a positive effect even though we know that it has no effect.

Observe that the calves receiving the supplement tend to have higher initial weight — their circles are bluer. The difference between the means of the two groups of calves is caused by the difference in their average weights, not the supplement being tested.

Good experimental design means ensuring that there are no major differences between the two groups of experimental units.

Later pages in this section describe some strategies to follow when designing experiments that avoid the problem of lurking variables.

8.2.4 Confounding

In a badly designed experiment, the characteristics of the experimental units can distort the apparent relationship between the controlled variable and response.

In the most extreme case, the design makes it impossible to disentangle the effects of the treatment and other characteristics of the experimental units. If the treatment is perfectly correlated with another variable, the effects of the two variables cannot be distinguished. The treatment and that variable are then said to be confounded.

It is particularly important that confounding is avoided when data are collected.

Experiment with a new variety of wheat

An agricultural researcher wants to compare the yield of a new variety of wheat with the standard variety. Information about the yields from the standard variety is available from 10 experimental plots that grew this variety in 2003. The new variety is grown in the same plots in 2004 and the yields are recorded.

Although its mean yield is higher, it should not be concluded that the new variety is better. Because the two varieties are planted in different years, they are grown with different rainfall and sunshine, so variety is confounded with year (and hence with its weather).

Perhaps the yield for the new variety is higher because the temperature and rainfall were higher in 2004.

The experiment was a waste of time and money — no conclusions can be reached about the new variety.

Trial of a new teaching method

An electronics lecturer writes a web-based tutorial resource to help teach digital logic. Students in a large class are told about the resource and allowed access through a login system that records usage. About half of the class use the tutorial.

To assess whether the tutorial helps students to learn the ideas of digital logic, the lecturer counts the number of correct answers from the 3 questions about digital logic in the final multiple-choice exam.

Students who used the resource got a higher average mark, but it is impossible to conclude that it was the tutorial resource that caused it.

Use of the resource is confounded with other characteristics of the students — only the more motivated students use it and they also study harder. It is impossible to distinguish between motivated students performing better and use of the resource improving marks.

The data do not allow you to reach any conclusions about whether the resource is effective.

8.2.5 Randomisation

The varying characteristics of the experimental units can only be lurking variables if they are associated with the allocation of treatments to the experimental units. To avoid this, ...

The method depends on whether these varying characteristics of the experimental units are understood and measured before the experiment is conducted.

When the differing characteristics of the experimental units are unmeasured, the best way to avoid association between them and the treatments is to randomly allocate treatments to the experimental units. This is called randomisation of the treatments and the experimental design is called a completely randomised design.

Randomisation does not guarantee that there will be no association between the treatments and characteristics of the experimental units — by chance, there may be some association. However...

There is no better way to allocate treatments if the varying characteristics of the experimental units are unmeasured before the experiment is conducted.

Effect of a feed supplement on weight gain of calves

Earlier in this section we described an experiment in which a feed supplement was given to the first nine calves from a group of 18 that entered a barn. This resulted in older and heavier calves getting the supplement, so the effect of the supplement was over-estimated.

Randomising the allocation of the supplement to 9 of the 18 calves reduces the chance of any association between the treatment and the age and weight of the calves. The simulation below increases the weight gain of calves getting the supplement by exactly 5.0.

Click Allocate treatments to randomly pick calves to get the supplement then click Run experiment to find their weight gains and estimate the effect of the supplement.

Repeat a few times and observe that the effect of the supplement is usually estimated to be between 2 and 8. The results are therefore consistent with the true effect that was built into the simulation, 5.0.

With randomisation, there is no tendency to over- or under-estimate the effect of the feed supplement.
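A minimal simulation of the randomised version, under the same kind of assumptions as any such sketch: the weight range, the dependence of gain on weight and the noise levels are hypothetical, and only the true effect of 5.0 is taken from the text.

```python
import random
from statistics import mean

TRUE_EFFECT = 5.0  # built-in effect of the supplement, as in the text

def run_randomised_experiment(rng):
    """One run: nine of the 18 calves are chosen at random for the supplement."""
    weights = [rng.uniform(80, 120) for _ in range(18)]   # hypothetical kg
    treated = set(rng.sample(range(18), 9))               # random allocation
    with_supp, without = [], []
    for i, w in enumerate(weights):
        # Assumed gain without supplement: heavier calves gain more.
        gain = 20 + 0.2 * (w - 100) + rng.gauss(0, 2)
        if i in treated:
            with_supp.append(gain + TRUE_EFFECT)
        else:
            without.append(gain)
    return mean(with_supp) - mean(without)

rng = random.Random(42)
estimates = [run_randomised_experiment(rng) for _ in range(2000)]
print(f"true effect {TRUE_EFFECT}, average estimate {mean(estimates):.2f}")
```

Individual runs still miss the true value in either direction, but the average over many runs is close to 5.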

There are several different ways to randomise allocation of treatments to the experimental units. These all use random numbers either from printed tables or, preferably, generated by a computer. The simplest method is to use a spreadsheet such as Microsoft Excel.

This creates a random permutation of the numbers 1 to n. If there are two treatments, allocate the first to the experimental units in the top half of the list.

The table below illustrates this method.

The first column shows twelve random numbers, each between 0 and 1, and the list of unit numbers. Click Sort random numbers to sort the rows of the table into ascending order of the random numbers. This randomly permutes the unit numbers.
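The same sort-by-random-number idea can be written outside a spreadsheet. The sketch below is one possible rendering in Python, not the spreadsheet procedure itself; the function name and the seed are arbitrary choices.

```python
import random

def random_permutation(n, rng):
    """Pair each unit number 1..n with a random number, then sort by it."""
    keyed = [(rng.random(), unit) for unit in range(1, n + 1)]
    keyed.sort()                      # ascending order of the random numbers
    return [unit for _, unit in keyed]

rng = random.Random(2024)
order = random_permutation(12, rng)
treatment_1 = sorted(order[:6])       # top half of the sorted list
treatment_2 = sorted(order[6:])       # bottom half
print("treatment 1:", treatment_1)
print("treatment 2:", treatment_2)
```

In practice `random.shuffle` or `random.sample` gives the same kind of permutation directly; the explicit sort is shown only because it mirrors the table.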

Although the above method of randomly allocating treatments to the experimental units using a spreadsheet is recommended, two alternative methods are now described. The first illustrates the randomisation better, whereas the second provides an interesting application of probability.

Randomisation of teaching methods to 48 children

The diagram below shows 48 children, numbered 0 to 47. The children will be split into three groups of 16 that will be taught a topic by three different teaching methods (A, B and C).

Click Random index to select a random number between 0 and 49. If either 48 or 49 is chosen, click the button again since these numbers do not correspond to any of the 48 children. (We could have allowed the first digit to be 0-9 instead of 0-4, but over half of the resulting numbers would have been rejected for being 48 or higher. By restricting the first digit in the diagram, the simulation runs faster.)

The selected child is allocated teaching method A. Repeatedly click Random index to allocate more children to treatment A. Once 16 children have been allocated to the group getting treatment A, children are randomly allocated to treatment B. When 16 children have received treatment B, the remaining children get treatment C.

This method can be rather slow since many random indices are rejected towards the end of the method because treatments have already been allocated to these experimental units.
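The procedure can be sketched as a rejection loop. The function name, the seed and the draw counter are illustrative additions; the range 0 to 49 and the group size of 16 follow the example.

```python
import random
from collections import Counter

def allocate_by_random_index(n_units=48, treatments=("A", "B", "C"), seed=0):
    """Draw indices 0..49, rejecting unused numbers and repeated children."""
    rng = random.Random(seed)
    group_size = n_units // len(treatments)
    allocation = {}
    draws = 0
    for treatment in treatments[:-1]:
        filled = 0
        while filled < group_size:
            index = rng.randrange(50)          # random number between 0 and 49
            draws += 1
            if index >= n_units or index in allocation:
                continue                       # rejected: draw again
            allocation[index] = treatment
            filled += 1
    for unit in range(n_units):                # everyone left gets the last treatment
        allocation.setdefault(unit, treatments[-1])
    return allocation, draws

allocation, draws = allocate_by_random_index()
print(Counter(allocation.values()), "after", draws, "draws")
```

The draw count exceeds the 32 children actually selected, which reflects the rejections described above.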

The final randomisation method works through all experimental units in order, picking a random treatment for each unit in turn. The only complication is that the probabilities used to generate the treatments must be adjusted after each unit gets a treatment. For example, if the first unit gets treatment A, then the probability of the second unit getting treatment A must be reduced a little. (If the same probabilities were used for each successive unit, then too many treatment A's might be allocated.)

Randomisation of teaching methods to 30 children

The diagram below illustrates this randomisation method for allocating equal numbers of a class of 30 children to 3 teaching methods. The first child has equal probabilities for all three teaching methods.

Click Generate Next to select a random value between 0 and 1, and hence a random treatment (i.e. teaching method) for the first child. (Generation of a random category in this way was described earlier.) If the first child gets treatment B, there will only be 10 A's, 9 B's and 10 C's left to allocate for the second child, so the probability of the second child getting treatment B should be reduced to ⁹/₂₉. Click Update Probs to adjust the probabilities, then click Generate Next again to allocate a treatment to the second child.

Repeatedly click Update Probs and Generate Next until all children have been allocated a treatment.
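A sketch of this sequential method follows. The function names and seed are hypothetical; exact fractions are used so that the updated probabilities, such as 9/29, can be read off directly.

```python
import random
from collections import Counter
from fractions import Fraction

def treatment_probabilities(remaining):
    """Probability of each treatment, given how many of each are left."""
    total = sum(remaining.values())
    return {t: Fraction(count, total) for t, count in remaining.items()}

def sequential_allocation(n_units=30, treatments=("A", "B", "C"), seed=0):
    """Give each unit in turn a random treatment, updating the probabilities."""
    if n_units % len(treatments):
        raise ValueError("n_units must be a multiple of the number of treatments")
    rng = random.Random(seed)
    remaining = {t: n_units // len(treatments) for t in treatments}
    allocation = []
    for _ in range(n_units):
        u = rng.random()                       # random value between 0 and 1
        cumulative = Fraction(0)
        for t, p in treatment_probabilities(remaining).items():
            cumulative += p
            if u < cumulative:                 # u falls in this treatment's interval
                allocation.append(t)
                remaining[t] -= 1              # one fewer of this treatment left
                break
    return allocation

# After the first child gets B: 10 A's, 9 B's and 10 C's remain.
print(treatment_probabilities({"A": 10, "B": 9, "C": 10})["B"])   # 9/29
print(Counter(sequential_allocation()))
```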

8.2.6 Replication

In a completely randomised experiment, there are two potential reasons why the response in an experiment varies between experimental units.

In a completely randomised experiment, all variation is caused by the treatments or is considered as random variation.

Experiments are conducted to determine the effect of different treatments. It is therefore essential that we can distinguish the treatment effects from random variation.

The easiest way to do this is with repeat measurements for each treatment — replication. Differences between these replicates are not due to treatment effects so they contain information about the amount of random variation.

Using knowledge about the amount of random variation in the experiment, we can better assess whether or not observed differences between the treatments are more than can be attributed to chance.

Crop experiment in a field

Researchers want to discover whether two varieties of wheat (A and B) have the same yield. One field is available for the experiment and it is known from previous experiments that the fertility and drainage of the soil are uniform over the whole field. The initial design for the experiment is shown below: the varieties were randomly allocated to either the left or right side of the field.

The diagram below shows the resulting yields from the experiment.

Since there was no replication, we cannot conclude that the observed difference in yields was due to the effects of the treatments.

Replication would mean growing each variety in two or more different areas. The simplest modification to the above experiment involves growing the wheat in the same way, but recording yields from smaller areas of the field.

We can now distinguish between the following two scenarios, each of which gives the same mean yields as before.

Since we can now assess the random plot-to-plot variation, we can also assess whether the difference in yields can be attributed to the varieties of wheat.
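One common way to make this comparison is a pooled two-sample t statistic, which divides the difference in means by an estimate of its standard error. The plot yields below are made up for illustration (they are not the yields in the diagrams); both scenarios have the same variety means but very different plot-to-plot variation.

```python
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic for mean(b) - mean(a), pooling the variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

# Hypothetical yields from six plots per variety; means are 5.0 and 6.0 in both.
low_noise_a = [4.9, 5.0, 5.1, 5.0, 4.9, 5.1]
low_noise_b = [5.9, 6.0, 6.1, 6.0, 5.9, 6.1]
high_noise_a = [3.0, 7.0, 4.0, 6.0, 5.5, 4.5]
high_noise_b = [8.0, 4.0, 7.0, 5.0, 6.5, 5.5]

print(f"little plot-to-plot variation: t = {pooled_t(low_noise_a, low_noise_b):.1f}")
print(f"large plot-to-plot variation:  t = {pooled_t(high_noise_a, high_noise_b):.1f}")
```

A large t value says the difference is big relative to the random variation; a value near 1 says the same difference could easily have arisen by chance.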

In the above example, we assumed that the fertility of the soil was uniform over the whole field. In practice, this assumption can rarely be made. A natural fertility gradient across the field (left to right) would confound the variety used with fertility, making it impossible to tell whether higher yields for variety B would be caused by the variety or better soil in the plots on the right.

The risk of confounding variety with fertility makes the above experimental design a poor one.

Good experimental design

The diagram below illustrates an experiment with random allocation of treatments to the 12 plots.

8.2.7 Blocking

When nothing is known about the differences between the experimental units before the experiment is conducted, we can do no better than to randomise allocation of treatments to the units. Randomisation avoids systematic over- or under-estimation of treatment effects. However...

The greater the variability between experimental units, the more the resulting variability within treatments (noise) tends to mask differences between the treatments (signal).

In practice, a researcher often has little influence on the choice of experimental units and they can be very variable. More accurate estimates are obtained if the units can be grouped, before the experiment is conducted, into sets of similar units called blocks. A separate experiment is conducted within each block with treatments randomly allocated to the experimental units in the block. Although all data are analysed together, the lower variability of experimental units within each block means that differences between the treatments can be more accurately estimated.

Since treatments are randomly allocated within blocks, this design is called a randomised block design.

The concept of a randomised block design is a general one that can be applied whatever the sizes of the blocks. In some situations, blocks consist of pairs of experimental units (e.g. twins). In others, the block sizes may be unequal (e.g. villages in a region). However the benefits are most evident in the following common special case.

Since there are equal replicates for all treatments in every block, if the experimental units within one block tend to have a higher mean response, all treatments are affected equally. As a result, our assessment of differences between the treatments is not affected by differences between the blocks, so the treatment effects are more accurately estimated.

Effect of irrigation on grass growth

A researcher wants to conduct an experiment to assess how irrigation affects grass growth. Three levels of irrigation will be used (none, a little or a lot) and they can only be applied to whole fields. As a result, the experimental units are fields and the available fields close to the research station differ in soil type and fertility.

We will initially examine a completely randomised design. The 36 pictures above represent the 36 available fields. Click Randomise treatments a few times to show how the 12 fields getting each level of irrigation are randomly selected from the 36 fields.

If the researcher has some prior knowledge of each of these fields, they can be grouped together in blocks. Select Randomised block from the pop-up menu to group similar fields together into three groups.

Click Randomise treatments a few times to randomly allocate each of the treatments to 4 of the 12 fields within each block. Note that the randomisation is applied within each block.
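Randomisation within blocks can be sketched as follows. The field names and the assignment of fields to blocks are placeholders; only the sizes (36 fields, three blocks of 12, three irrigation levels) follow the example.

```python
import random
from collections import Counter

def randomised_block_design(blocks, treatments, rng):
    """Return {unit: treatment} with equal replication inside every block."""
    allocation = {}
    for units in blocks.values():
        reps, leftover = divmod(len(units), len(treatments))
        if leftover:
            raise ValueError("block size must be a multiple of the number of treatments")
        labels = [t for t in treatments for _ in range(reps)]
        rng.shuffle(labels)                    # randomise within this block only
        allocation.update(zip(units, labels))
    return allocation

# Placeholder blocks: fields 1-12, 13-24 and 25-36.
blocks = {f"block {b + 1}": [f"field {12 * b + i + 1}" for i in range(12)]
          for b in range(3)}
treatments = ("none", "a little", "a lot")
allocation = randomised_block_design(blocks, treatments, random.Random(7))
for name, units in blocks.items():
    print(name, dict(Counter(allocation[u] for u in units)))
```

Every block ends up with four fields at each irrigation level, whereas a completely randomised design would only guarantee 12 fields per level overall.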

Uptake of amino acids by fish

In an investigation of the effect of sodium cyanide (NaCN) on the uptake in vitro of a particular amino acid by intestinal preparations of a certain species of fish, it was found that each fish would give only about six preparations.

Since there could be sizeable differences between individual fish, the fish were treated as blocks (of size 6). The table below shows uptake of the amino acid (µmol per g dry weight in a 20 minute period) for four fish. The two treatments were randomly allocated to the six intestinal preparations from each fish in a randomised block design.

	Block
Treatment	Fish 1	Fish 2	Fish 3	Fish 4	Mean
Without NaCN	1.54, 1.92, 2.26	1.52, 2.02, 1.91	1.00, 1.12, 1.13	1.58, 1.78, 1.52	1.608
With NaCN	1.10, 1.42, 1.04	1.31, 1.15, 1.51	0.79, 0.84, 0.86	1.24, 0.81, 1.32	1.116

Fish 3 has a mean uptake that is considerably lower than that of the other fish. However, since each treatment is used with the same number of samples from this fish, its lower uptake levels affect the two treatment means in the Mean column equally. The use of a fish with such a low amino acid uptake therefore does not affect the relative values of the two means.

Differences between the fish (blocks) do not therefore affect the accuracy of comparisons between the treatments — this gives us more confidence that the use of NaCN decreases amino acid uptake by around 0.5.
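The arithmetic behind this can be checked directly from the table. The sketch below recomputes the two treatment means and the within-fish differences; the data are the tabulated values, while the dictionary layout and function names are just one convenient arrangement.

```python
from statistics import mean

# Uptake (µmol per g dry weight in 20 minutes), three preparations per cell.
uptake = {
    "without NaCN": {"Fish 1": [1.54, 1.92, 2.26], "Fish 2": [1.52, 2.02, 1.91],
                     "Fish 3": [1.00, 1.12, 1.13], "Fish 4": [1.58, 1.78, 1.52]},
    "with NaCN":    {"Fish 1": [1.10, 1.42, 1.04], "Fish 2": [1.31, 1.15, 1.51],
                     "Fish 3": [0.79, 0.84, 0.86], "Fish 4": [1.24, 0.81, 1.32]},
}

def treatment_mean(treatment):
    """Mean over all 12 preparations that received the treatment."""
    return mean(v for values in uptake[treatment].values() for v in values)

def fish_difference(fish):
    """Mean uptake without NaCN minus mean uptake with NaCN, within one fish."""
    return mean(uptake["without NaCN"][fish]) - mean(uptake["with NaCN"][fish])

for treatment in uptake:
    print(f"{treatment}: mean = {treatment_mean(treatment):.3f}")
for fish in uptake["with NaCN"]:
    print(f"{fish}: difference = {fish_difference(fish):.3f}")
overall = treatment_mean("without NaCN") - treatment_mean("with NaCN")
print(f"overall difference = {overall:.4f}")
```

Because every fish contributes three preparations to each treatment, the overall difference equals the average of the four within-fish differences, so a fish with unusually low uptake shifts both means by the same amount.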

If we know anything about differences between the experimental units before the experiment is conducted, it is always worthwhile to group them into blocks and conduct a randomised block design — the effects of the treatments will be estimated more accurately.

The benefits are greatest when there are large differences between the mean response in different blocks — if all blocks are essentially the same, there is nothing to be gained from using a randomised block design.

Antibiotic and weight gain of calves

Consider an experiment that will be conducted on 24 calves, 8 of which are in each of three herds. (In practice, herd sizes would be much larger but this small-scale example illustrates the experimental designs more clearly.) The aim of the experiment is to estimate the effect on the weight gain of the calves of injection with an antibiotic.

The top half of the diagram conducts a completely randomised experiment and the bottom half conducts a randomised block experiment. Both experiments use the same 24 calves and, in each case, half of the animals get the antibiotic and the other half are in a control group — the ticks on calves represent those getting the antibiotic. Click Repeat several times for each of the experiments.

Observe that the estimated effect of the antibiotic (the difference between the mean weight gain for the two treatments) is much more variable for the completely randomised experiment. The randomised block experiment therefore provides a more consistent (and accurate) estimate of the effect of the antibiotic.

The three herds (blocks) are initially very different — there is a strong block effect on the weight gain of the calves, perhaps due to different pastures or herd management. Select Herds are essentially the same from the pop-up menu, then repeat the experiments. Observe that there is now no advantage in using a randomised block experiment — it no longer estimates the effect of the antibiotic any more accurately than the completely randomised design.
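The comparison can be approximated with a short simulation. The herd means, noise level and antibiotic effect are assumed values for illustration, not those used in the diagram; only the layout (three herds of 8 calves, half treated) follows the example.

```python
import random
from statistics import mean, stdev

TRUE_EFFECT = 5.0   # hypothetical effect of the antibiotic on weight gain

def estimate(herd_means, blocked, rng, per_herd=8, noise_sd=3.0):
    """One experiment: estimated effect = treated mean minus control mean."""
    calves = [(h, m) for h, m in enumerate(herd_means) for _ in range(per_herd)]
    n = len(calves)
    if blocked:
        treated = set()
        for h in range(len(herd_means)):       # randomise within each herd
            members = [i for i, (herd, _) in enumerate(calves) if herd == h]
            treated.update(rng.sample(members, per_herd // 2))
    else:
        treated = set(rng.sample(range(n), n // 2))   # completely randomised
    with_ab, control = [], []
    for i, (_, herd_mean) in enumerate(calves):
        gain = herd_mean + rng.gauss(0, noise_sd)
        if i in treated:
            with_ab.append(gain + TRUE_EFFECT)
        else:
            control.append(gain)
    return mean(with_ab) - mean(control)

def spread(herd_means, blocked, runs=1000, seed=5):
    """Standard deviation of the estimated effect over repeated experiments."""
    rng = random.Random(seed)
    return stdev([estimate(herd_means, blocked, rng) for _ in range(runs)])

for label, herds in (("different herds", (20, 40, 60)), ("similar herds", (40, 40, 40))):
    print(f"{label}: completely randomised sd = {spread(herds, False):.2f}, "
          f"randomised block sd = {spread(herds, True):.2f}")
```

With very different herds the blocked design gives a much smaller spread; with identical herd means the two designs behave about the same.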

8.3 Pairing and blocks

8.3.1 Aim of similar experimental units

In an experiment, we usually try to use experimental units that are as similar as possible. Variability between experimental units increases the variability in resulting response measurements and this 'noise' in the data makes it harder to detect differences caused by the experimental treatments.

For example, experiments involving mice often use strains that are bred to be genetically very similar so that differences in a response variable between mice will be mostly caused by differences between the experimental treatments and very little by differences between the mice themselves.

In a similar way, all plants in an experiment will be grown from identical seed under growing conditions that are as similar as possible (other than the factor being controlled).

Feed supplement and beef from cows

Consider an experiment that is to be conducted to estimate how much a particular feed supplement increases the amount of beef that is obtained when cows are eventually killed. A herd of calves is available for the experiment and half will be given a standard diet with the other half getting the extra supplement. The response measurement is the weight of meat obtained from each cow when it is eventually slaughtered.

Before the experiment starts, the calves in the herd are not identical and we will exaggerate their differences for the sake of this illustration. The calves have different body weights and body weight has a strong influence on the weight of meat in the adult cows.

The diagram simulates a completely randomised experiment in which the feed supplement is given to half of the calves (selected at random). The horizontal axis shows the initial calf weights and the vertical axis shows the resulting weight of beef when the adult cows are slaughtered. The effect of the supplement is estimated by the difference in the mean weight of beef from the calves who were given the supplement and the control group receiving no supplement.

Click Accumulate then click Conduct experiment several times and observe that the estimate of the supplement's effect (on the right) is very variable — one run of the experiment does not give an accurate estimate.

The main reason for the inaccuracy in the estimates is the variability in the experimental units and, in particular, the variability in the initial weights of the calves. If the calves in the herd all had similar weights, the experiment would have resulted in a more accurate estimate of the effect of the supplement. Drag the slider to make the calf weights similar then repeat the simulated experiment several times. Observe that the estimate of the feed supplement's effect is now much more consistent.

8.3.2 Experiment with matched pairs

Identical experimental units result in the most accurate estimates of the effect of a factor. In practice however, we usually have little choice and the available experimental units are often very variable.

Sometimes we have no prior understanding of the variability of the experimental units. For example, we may know that the seeds in a batch of maize seeds are variable but cannot tell anything about the differences from looking at them. In this case, a completely randomised experiment is the best possible design.

However if we have some knowledge about differences between the experimental units before the start of the experiment, we can use this to design the experiment better and therefore get a more accurate estimate of the effect of the experimental factor.

For example, consider an experiment involving a single factor with two levels. If the experimental units can be grouped into pairs that are similar, a better experimental design randomly allocates the two factor levels to the two experimental units in each pair. This is called a matched pairs design.

As before, half of the experimental units get each factor level and the estimate of the effect is the difference between the mean response at the two factor levels. This estimate is however more accurate than the corresponding estimate from a completely randomised experiment.

Feed supplement and beef from cows

The diagram below simulates experiments using the same herd of calves that was used on the previous page. We now assume that the initial variability in calf weights is unavoidable.

Initially click Accumulate then click Conduct experiment several times to see the variability in the estimate of the feed supplement's effect in a completely randomised experiment. Observe that the estimate is very inaccurate.

Now select Paired by calf weight from the pop-up menu. In this experimental design, the calves are grouped into matched pairs with similar weights before the experiment is started, illustrated by the vertical bands on the scatterplot. In each of these matched pairs of calves, exactly one is randomly chosen to get the feed supplement.

Repeat this experiment several times and observe that the estimate of the effect of the feed supplement is much more accurate than with the earlier completely randomised experimental design.

In the above example, the experimental units were grouped into pairs using a numerical measurement — ages of animals or the previous year's yield from fruit trees could be used in a similar way. However pairing is often done in a less formal way — it is acceptable to construct the matched pairs of experimental units by any subjective or objective method.
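The gain from pairing by calf weight can be sketched with a short simulation. Everything numerical here is an assumption made for illustration (20 calves, weights between 100 and 300, a supplement effect of 20, and beef weight depending linearly on calf weight); the function names are hypothetical.

```python
import random
import statistics


def estimate_effect(weights, paired, rng, effect=20.0, slope=0.5, noise_sd=5.0):
    """One simulated experiment; returns (supplement mean - control mean)."""
    n = len(weights)  # assumed to be even
    if paired:
        # Sort calves by weight, pair neighbours, and treat one calf per pair.
        order = sorted(range(n), key=lambda i: weights[i])
        treated = {rng.choice(order[k:k + 2]) for k in range(0, n, 2)}
    else:
        # Completely randomised: treat any half of the herd.
        treated = set(rng.sample(range(n), n // 2))
    supplement, control = [], []
    for i, weight in enumerate(weights):
        beef = slope * weight + rng.gauss(0.0, noise_sd)
        if i in treated:
            supplement.append(beef + effect)
        else:
            control.append(beef)
    return statistics.mean(supplement) - statistics.mean(control)


def spread_of_estimates(weights, paired, runs=2000, seed=3):
    """Standard deviation of the estimate over repeated experiments."""
    rng = random.Random(seed)
    return statistics.stdev([estimate_effect(weights, paired, rng) for _ in range(runs)])


herd_rng = random.Random(11)
calf_weights = [herd_rng.uniform(100.0, 300.0) for _ in range(20)]
sd_random = spread_of_estimates(calf_weights, paired=False)
sd_paired = spread_of_estimates(calf_weights, paired=True)
print(f"completely randomised: {sd_random:.1f}, matched pairs: {sd_paired:.1f}")
```

Because both calves in a pair have nearly the same weight, most of the weight-related variation cancels in the comparison, so the matched pairs estimate should vary much less between runs.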

8.3.3 Randomisation in paired experiments

If the factor levels are allocated in a subjective way within each pair, it is possible for the treatment allocation to be associated with some lurking variable that will bias the results.

Illustration of pairing

Consider an experiment that is conducted to assess whether a new exercise programme for broken legs improves recovery compared with the standard method. A group of 20 children with broken legs is used (the experimental units) and the response measurement will be the strength of the leg muscles three weeks after the break.

In a completely randomised experiment, 10 children are randomly picked from the 20 for the new exercise programme. Click Randomise to show this.

In a paired experiment, the 20 children might be grouped into pairs of the same gender who had similar weight and muscular strength at the start of the experiment. Select Paired from the pop-up menu. One child from each pair will be randomly selected for the new exercise programme. Click Randomise to see this randomisation.
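The randomisation step in a paired experiment can be written in a few lines. This is a minimal sketch: the child labels and the way pairs are formed here are placeholders, since in practice the pairs would come from the matching on gender, weight and strength described above.

```python
import random


def allocate_within_pairs(pairs, rng=None):
    """Randomly give the new programme to exactly one member of each pair."""
    rng = rng or random.Random()
    allocation = {}
    for first, second in pairs:
        chosen = rng.choice((first, second))
        other = second if chosen == first else first
        allocation[chosen] = "new programme"
        allocation[other] = "standard"
    return allocation


# Placeholder labels; consecutive children are treated as already matched.
children = [f"child {i:02d}" for i in range(1, 21)]
pairs = [(children[i], children[i + 1]) for i in range(0, 20, 2)]
allocation = allocate_within_pairs(pairs, random.Random(2024))
for a, b in pairs:
    print(a, allocation[a], "|", b, allocation[b])
```

Leaving the choice within each pair to a random number generator, rather than to the experimenter, is what prevents the allocation from being associated with a lurking variable.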

8.3.4 Matched groups of 3 or more units

The idea of using matched pairs of experimental units to give more accurate comparisons of two factor levels can be generalised to experiments with 3 or more factor levels.

The experimental units can be grouped into collections of similar units whose size equals the number of factor levels. The different factor levels are randomly allocated to the units within each such matched group.

Again, this type of experiment results in more accurate estimates of the differences between the factor levels than a completely randomised experiment.

Three diets and beef from cows

The diagram below simulates experiments using a herd of 21 calves of varying weights. An experiment is to be conducted to compare a standard diet and two new diets (3 different factor levels).

Initially click Accumulate then click Conduct experiment several times to see the variability in the three estimates of the differences between the three diets in a completely randomised experiment. Observe that the estimates are all very inaccurate.

Now select Grouped by calf weight from the pop-up menu. In this experimental design, the calves are grouped into matched groups of three with similar weights before the experiment is started, illustrated by the vertical bands on the scatterplot. In each of these matched groups of calves, exactly one is randomly chosen to get each of the diets.

Repeat this experiment several times and observe that the estimates of the differences between the diets are much more accurate than with the earlier completely randomised experimental design.

8.3.5 Randomised block experiments

Experiments with matched pairs or matched groups are special kinds of randomised block experiments. In many of these experiments, the grouping of experimental units is fairly subjective, perhaps based on similar weights, ages, etc.

In other situations, the experimental units naturally separate into groups. For example,

If the number of experimental units in each block is a multiple of the number of factor levels in the experiment, it is possible to randomly allocate each factor level the same number of times within each block. The resulting experiment is called a randomised block design.

Acupuncture and Codeine for dental pain relief

An anaesthetist conducted an experiment to assess the effects of codeine and acupuncture for relieving dental pain. The experiment used 32 subjects who were grouped into blocks of 4 according to an initial assessment of their tolerance to pain.

The four treatment combinations of (codeine or a sugar capsule) and (active or inactive acupuncture points) were randomly given to the four subjects in each block. Pain relief scores were recorded from each subject two hours after dental treatment. The experiment was double blind since neither the subjects nor the person assessing pain relief knew which treatment had been administered.

	Pain relief score
Tolerance group	Control	Codeine only	Acupuncture only	Codeine + Acupuncture
1	0.0	0.6	0.5	1.2
2	0.3	0.7	0.6	1.3
3	0.4	0.8	0.8	1.6
4	0.4	0.9	0.7	1.5
5	0.6	1.5	1.0	1.9
6	0.9	1.6	1.4	2.3
7	1.0	1.7	1.8	2.1
8	1.2	1.6	1.7	2.4
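The pain relief scores can be summarised by treatment and by block with a few lines of code. The sketch below uses only the values in the table; comparing codeine with the control inside each tolerance group is one simple way to see what blocking achieves, and is not necessarily the analysis the experimenters used.

```python
import statistics

# Pain relief scores from the table; list position = tolerance group 1 to 8.
scores = {
    "Control": [0.0, 0.3, 0.4, 0.4, 0.6, 0.9, 1.0, 1.2],
    "Codeine only": [0.6, 0.7, 0.8, 0.9, 1.5, 1.6, 1.7, 1.6],
    "Acupuncture only": [0.5, 0.6, 0.8, 0.7, 1.0, 1.4, 1.8, 1.7],
    "Codeine + Acupuncture": [1.2, 1.3, 1.6, 1.5, 1.9, 2.3, 2.1, 2.4],
}

treatment_means = {name: statistics.mean(values) for name, values in scores.items()}
block_means = [statistics.mean([values[b] for values in scores.values()])
               for b in range(8)]

# Differences taken within each block are free of block-to-block variation.
codeine_gain = [c - p for c, p in zip(scores["Codeine only"], scores["Control"])]

print({name: round(m, 3) for name, m in treatment_means.items()})
print([round(m, 3) for m in block_means])
print("SD of control scores:", round(statistics.stdev(scores["Control"]), 3))
print("SD of within-block codeine gain:", round(statistics.stdev(codeine_gain), 3))
```

The block means rise steadily with the tolerance group, which is the block effect that the design removes from the treatment comparisons.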

In the above experiment the tolerance groups are matched blocks of subjects. In the following examples, the blocks occur naturally.

Measuring sulphur in soils

Measurements were made on the amount of sulphur (in parts per million) in soil samples using four different solvents. The soil samples were collected from five different geographical locations in Florida, USA, and represented different soil types. The soil samples are the blocks in this experiment:

Troup, Jackson Co. (Paleudults soil)
Lakeland, Walton Co. (Quartzipsamments soil)
Leon, Duval Co. (Haplaquads soil)
Chipley, Jackson Co. (Quartzipsamments soil)
Norfolk, Alachua Co. (Paleudults soil)

The solvents used to analyse each of the five soils were:

Calcium Chloride, CaCl₂
Ammonium Acetate, NH₄OAc
Mono-Calcium Phosphate, Ca(H₂PO₄)₂
Water, H₂O

Each soil sample was split into four and the solvents were randomly allocated to them.

	Soil sample
Solvent	Troup	Lakeland	Leon	Chipley	Norfolk
CaCl₂	5.07	3.31	2.54	2.34	4.71
NH₄OAc	4.43	2.74	2.09	2.07	5.29
Ca(H₂PO₄)₂	7.09	2.32	1.09	4.38	5.70
H₂O	4.48	2.35	2.70	3.85	4.98

Effect of cultivar on plant height

An experiment was conducted to compare the differences in growth among four different cultivars of a house plant. The greenhouse had three benches in different locations which form natural blocks. Two pots of each cultivar were randomly assigned to each bench for a total of six pots per cultivar and eight pots per bench.

In this experiment, the benches are blocks and there are two replicates — each treatment (cultivar) was used twice on each bench.

	Cultivar
Bench
	19.3 17.2	20.1 19.4	17.4 16.6	16.6 15.7
	16.7 15.5	21.2 20.8	14.4 13.6	13.5 12.9
	17.7 19.8	21.0 21.9	15.8 17.4	12.8 14.7

8.4 Two or more factors

8.4.1 Experiments with 2 factors

Sometimes the response of interest is influenced by two or more factors, each of which can be controlled in an experiment.

The simplest way to study two factors is with two separate completely randomised experiments. In each of these experiments, one factor is kept constant (e.g. the colour of an artificial flower) and the other factor is varied (e.g. the flower's shape). However...

Consider an experiment to compare the three levels of a single factor with 6 replicates — each factor level is applied to a randomly selected 6 of the 18 experimental units. The table below illustrates the type of data that would arise from this experiment.

Factor X
X = A	X = B	X = C
x_A1 x_A2 x_A3 x_A4 x_A5 x_A6	x_B1 x_B2 x_B3 x_B4 x_B5 x_B6	x_C1 x_C2 x_C3 x_C4 x_C5 x_C6

Now consider a modification to this experiment that also varies a second factor, Y. The table below describes an experiment with 3 replicates for each combination of the levels of factors X and Y. This experiment uses the same number of experimental units as the earlier experiment.

Although it is not intuitively obvious, the effect of changing the levels of factor X is estimated equally accurately in both experiments.

In the factorial experiment however, we can also estimate the effect of changing factor Y, so the factorial design provides a 'free' estimate of the effect of Y.

It might initially seem that each factor will be estimated less accurately because the other factor is also varied, but this is not so.

	Factor X
Factor Y	X = A	X = B	X = C
Y = S	x_AS1 x_AS2 x_AS3	x_BS1 x_BS2 x_BS3	x_CS1 x_CS2 x_CS3
Y = T	x_AT1 x_AT2 x_AT3	x_BT1 x_BT2 x_BT3	x_CT1 x_CT2 x_CT3
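The claim that factor X is estimated equally accurately in both experiments can be checked with a short variance calculation. The derivation below is a sketch under assumptions that the text does not state explicitly: responses (written y here) follow an additive model, and the errors are independent with a common variance σ². The symbols μ, α, β and ε are introduced only for this illustration.

```latex
% Assumed model: y = \mu + \alpha_X + \beta_Y + \varepsilon,
% with independent errors and Var(\varepsilon) = \sigma^2.
\begin{align*}
\text{Single factor (6 replicates per level):}\quad
  \operatorname{Var}(\bar{y}_A - \bar{y}_B)
    &= \frac{\sigma^2}{6} + \frac{\sigma^2}{6} = \frac{\sigma^2}{3} \\[4pt]
\text{Factorial (3 replicates at each of } S \text{ and } T\text{):}\quad
  \bar{y}_A - \bar{y}_B
    &= (\alpha_A - \alpha_B)
     + \underbrace{\tfrac{1}{2}(\beta_S + \beta_T) - \tfrac{1}{2}(\beta_S + \beta_T)}_{=\,0}
     + (\bar{\varepsilon}_A - \bar{\varepsilon}_B) \\
  \operatorname{Var}(\bar{y}_A - \bar{y}_B)
    &= \frac{\sigma^2}{6} + \frac{\sigma^2}{6} = \frac{\sigma^2}{3}
\end{align*}
```

Each level of X is still averaged over 6 units, and because every level of X sees the same mix of Y levels, the effect of Y cancels from the difference.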

Blood pressure after an operation

The diagram below simulates a completely randomised experiment in which two surgical procedures (operations by keyhole surgery and a standard surgical method) are compared. Initially, all patients are given the same dose of a drug that is intended to reduce their blood pressure after the operation. The response variable is the systolic blood pressure of the patients two hours later.

Click Repeat experiment to randomise the patients, perform the surgery and record blood pressures. Click Accumulate then repeat the experiment several times to see the variability in the estimated difference between blood pressures using keyhole and standard operations.

Initially all patients receive the same drug dose, so the experiment is completely randomised with only a single factor (the type of operation). Drag the slider to vary the amount of drug, effectively turning the experiment into a factorial design with two factors — operation type and amount of drug.

Reducing the amount of drug for 3 patients getting each operation type increases their blood pressure and increasing the drug dose for 3 others decreases their blood pressure. However since this happens for the same number of patients getting each operation type,

The difference between the mean blood pressures for the two operation types remains the same.

With the Accumulate checkbox still selected, repeat the experiment several more times. (Unchecking the Animation checkbox speeds up the simulations.) Observe that the variability (accuracy) of the estimated difference between the operation types is the same as for the experiment using the same amount of drug for everyone.

The experiment can also vary the amount of drug and estimate its effect without affecting the accuracy of estimating the difference between the two operation types.

In a similar way, the effect of the drug on blood pressure would be estimated equally accurately whether or not two operation types were used.

8.4.2 Factorial design

An efficient design for experiments with two or more factors uses each possible combination of factor levels (called treatments) in the same number of experimental units. The repeat measurements for each treatment are called replicates and the design is called a factorial design.

The design on the previous page was an example of a factorial design for two factors but similar designs are also used for three or more factors.

Strength of asphaltic concrete

An experiment was conducted by a civil engineer to assess the effect of the compaction method on asphaltic concrete. Two types of aggregate were used in the experiment.

	Compaction method
Aggregate type	Static	Regular kneading	Low kneading	Very low kneading
Basalt	68 63 65	126 128 133	93 101 98	56 59 57
Silicious	71 66 66	107 110 116	63 60 59	40 41 44

In this experiment, three samples were tested at each combination of aggregate type and compaction type — i.e. there were 3 replicates — and the tensile strength (psi) was recorded.

Surface treatment and abrasion

Abrasion resistance in materials is often measured by rubbing specimens against a standard abrasive and recording either the decrease in thickness or the loss in weight. The table below describes results from a factorial experiment on coated fabrics to assess the effect of three factors.

Two types of filler were used in each of 3 proportions and half of the pieces of fabric were given a surface treatment before testing. The response measurement was the weight loss (mg) after 3000 revolutions of the testing machine.

	No surface treatment		Surface treatment
Percentage Filler	Filler A	Filler B	Filler A	Filler B
25%	527 561	456 377	475 466	296 325
50%	621 664	426 476	561 540	301 235
75%	724 743	460 426	626 682	322 304

8.4.3 Simple model for two factors

The simplest model for the effect of two factors on a response is an additive one of the form:

(mean response) = (base value) + (effect of factor A) + (effect of factor B)

One implication of this model is that the effect on the response of changing the level of factor A is the same, whatever the level of factor B. In a similar way, the model assumes that the effect of changing factor B is the same whatever the level of factor A.

It should be noted that this model does not always hold. For example the effect on the response of changing A may be lower when B is at a low level than when B is at a high level. This is called interaction between the effects of A and B and a different model must be used if interaction is present. Interaction will be discussed later in this section.

Strength of asphaltic concrete

An experiment was conducted by a civil engineer to assess the effect of the compaction method on asphaltic concrete. Two types of aggregate and four compaction types were used in the factorial experiment with three replicates.

The diagram below shows the tensile strengths (psi) of the samples and the best-fitting factor model with no interaction.

Rotate to y-z. The coloured lines show how the tensile strength differs between the two aggregate types for each compaction method. Observe that:

The fitted lines are all parallel since the no-interaction model assumes that the differences (Basalt - Silicious) are the same for all compaction methods.

Similarly, rotate to y-x. Again the two fitted lines (corresponding to the different aggregate types) are parallel.

The no-interaction model assumes that the effect of changing the compaction method is the same for both aggregate types.

With the assumption of no interaction, the differences between the effects of the aggregates and compaction methods can be well summarised by the two tables below:

	Aggregate type
	Basalt	Silicious
Mean strength	87.25	70.25

The model therefore estimates that Basalt aggregate is 17 psi stronger than Silicious, whatever the compaction method.

	Compaction method
	Static	Regular kneading	Low kneading	Very low kneading
Mean strength	66.5	120.0	79.0	49.5

This table summarises the differences between the compaction methods, whatever the aggregate type.
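The two summary tables can be reproduced from the raw tensile strengths. The sketch below also computes fitted values for the no-interaction model using the shortcut (aggregate mean + method mean - grand mean), which applies here because the design is balanced; the variable names are hypothetical.

```python
import statistics

# Tensile strengths (psi) from the asphaltic concrete data, 3 replicates per cell.
strength = {
    "Basalt": {"Static": [68, 63, 65], "Regular kneading": [126, 128, 133],
               "Low kneading": [93, 101, 98], "Very low kneading": [56, 59, 57]},
    "Silicious": {"Static": [71, 66, 66], "Regular kneading": [107, 110, 116],
                  "Low kneading": [63, 60, 59], "Very low kneading": [40, 41, 44]},
}
methods = list(strength["Basalt"])

grand_mean = statistics.mean(
    [y for cells in strength.values() for ys in cells.values() for y in ys])
aggregate_mean = {a: statistics.mean([y for ys in cells.values() for y in ys])
                  for a, cells in strength.items()}
method_mean = {m: statistics.mean([y for cells in strength.values() for y in cells[m]])
               for m in methods}

# Balanced design: no-interaction fit = aggregate mean + method mean - grand mean.
fitted = {(a, m): aggregate_mean[a] + method_mean[m] - grand_mean
          for a in strength for m in methods}

print(aggregate_mean)
print(method_mean)
for m in methods:
    # The fitted Basalt - Silicious difference is the same for every method.
    print(m, round(fitted[("Basalt", m)] - fitted[("Silicious", m)], 2))
```

The constant fitted difference between the aggregates is the numerical counterpart of the parallel lines in the diagram.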

8.4.4 Interaction between factors

The increased accuracy of the parameter estimates is an important reason for using a factorial experimental design.

However an equally important reason is that the effect of changing the level of one factor may be different for different levels of the second factor. If this occurs, there is said to be an interaction between the effects of the two factors.

If there are no runs of the experiment for some combinations of factor levels, we cannot assess whether there is interaction.

Summarising the results of experiments is usually easier if there is no interaction between the factors. However if interaction exists, it is important that it is discovered and described.

Texture of a dairy product

Consider an experiment in which the main aim is to assess how the texture of a dairy product deteriorates between 1 day and 2 weeks after manufacture. It is also known that the storage temperature affects deterioration, so the experiment will attempt to discover the effects of both factors on texture.

The diagram above initially shows typical results from an experiment that only keeps samples at low temperature for 2 weeks. From it, we might estimate that texture decreases by 2.96 between 1 day and 2 weeks from manufacture.

Now select Factorial experiment from the pop-up menu at the top of the diagram. This adds some extra observations when samples are stored for 2 weeks at 20°C to complete the factorial design. With these extra observations, it can be seen that the deterioration in texture from 1 day to 2 weeks is much greater when the temperature is 20°C than when it is 10°C. This is called an interaction between the effects of storage time and temperature.

In conclusion,

A single value is not enough to describe how texture decreases between storage of 1 day and 2 weeks.

Because of the interaction between temperature and storage time, we must separately provide the mean decreases at both 10°C and 20°C to fully describe how storage time affects texture.

8.4.5 Example with interaction

The example below is a larger data set that exhibits interaction between two factors.

Hay fever relief

A study was conducted to investigate the effect of a drug compound in providing relief for hay fever. In the experiment, two active ingredients (A and B) were each varied at 3 levels in a factorial design with 4 replicates. There were 36 hay fever sufferers available and they were randomly allocated to the 9 treatment combinations. The table below shows the hours of relief that the subjects reported.

	Ingredient A
Ingredient B	Low	Medium	High
Low	2.4 2.3 2.7 2.5	4.6 4.9 4.2 4.7	4.8 4.4 4.5 4.6
Medium	5.8 5.5 5.2 5.3	8.9 8.7 9.1 9.0	9.1 8.7 9.3 9.4
High	6.1 5.9 5.7 6.2	9.9 10.6 10.5 10.1	13.5 13.3 13.0 13.2

The diagram below initially shows the data, plus the overall mean hours of relief for the 36 subjects.

Click the checkbox Main effect for Ingredient A to show the mean hours of relief for this ingredient. Increasing the amount of this ingredient gives longer relief.

Click Main effect for Ingredient B to show the 'best' fit of a model with no interaction between the ingredients. The vertical red lines are 'residuals' for the model; they show whether the no-interaction model over- or under-estimates the hours of relief for each subject. In particular, the model under-estimates the relief when both ingredients are at the High level.

Click the checkbox Interaction. The best model with an interaction between the ingredients fits the data much more closely.

The combination of high levels of the two ingredients has a particularly good effect.
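The size of the interaction can be gauged by comparing each cell mean with the value predicted by a no-interaction fit. The sketch below does this for the hay fever data. The orientation (rows as Ingredient B levels, columns as Ingredient A levels) is an assumed reading of the table, and the additive fit uses the balanced-design shortcut (row mean + column mean - grand mean).

```python
import statistics

levels = ["Low", "Medium", "High"]
# Hours of relief, 4 replicates per cell: outer key = row level, inner key = column level.
relief = {
    "Low": {"Low": [2.4, 2.3, 2.7, 2.5], "Medium": [4.6, 4.9, 4.2, 4.7],
            "High": [4.8, 4.4, 4.5, 4.6]},
    "Medium": {"Low": [5.8, 5.5, 5.2, 5.3], "Medium": [8.9, 8.7, 9.1, 9.0],
               "High": [9.1, 8.7, 9.3, 9.4]},
    "High": {"Low": [6.1, 5.9, 5.7, 6.2], "Medium": [9.9, 10.6, 10.5, 10.1],
             "High": [13.5, 13.3, 13.0, 13.2]},
}

cell_mean = {(r, c): statistics.mean(relief[r][c]) for r in levels for c in levels}
grand = statistics.mean(list(cell_mean.values()))
row_mean = {r: statistics.mean([cell_mean[(r, c)] for c in levels]) for r in levels}
col_mean = {c: statistics.mean([cell_mean[(r, c)] for r in levels]) for c in levels}

# Interaction term = cell mean minus the additive (no-interaction) fit.
interaction = {(r, c): cell_mean[(r, c)] - (row_mean[r] + col_mean[c] - grand)
               for r in levels for c in levels}

for r in levels:
    print(r, [round(interaction[(r, c)], 2) for c in levels])
```

A positive entry means the no-interaction model under-estimates the mean relief for that combination; the High/High entry is symmetric in the two ingredients, so it does not depend on the assumed orientation.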

8.4.6 No-interaction model for three factors

The simplest model for data from a factorial experiment assumes that there is no interaction between the effects of any of the factors — each acts additively on the response. For an experiment with 3 factors, this implies that...

(mean response) = (base value) + (effect of factor A) + (effect of factor B) + (effect of factor C)

It must be stressed that the no-interaction assumption does not always hold. Indeed, one of the main attractions of factorial experiments is the ability to assess interactions between the factors.

As in all other experiments, it is important to remember that the treatments (factor combinations) should be randomly allocated to the experimental units — randomisation of the experiment.

Surface treatment and abrasion

This example was presented earlier in this section. For illustration, we will combine the two replicates and treat the cell means as a single replicate. (This has only been done to halve the number of crosses in the diagram.)

	No surface treatment		Surface treatment
Percentage Filler	Filler A	Filler B	Filler A	Filler B
25%	544	416.5	470.5	310.5
50%	642.5	451	550.5	268
75%	733.5	443	654	313

The diagram below allows models with the different main effects to be fitted.

Initially the diagram shows only the mean response. Use the checkboxes to fit models with different combinations of factors. Observe that using all three factors allows the mean responses to be close (but not identical) to the corresponding observed responses.

Note how the diagram displays the two levels of Surface treatment with a green and a purple grid when its main effect is in the model.
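A main-effects fit for the three factors can be computed directly from the cell means in the table above. The sketch below assumes the usual balanced-design estimates, where each effect is a level mean minus the grand mean; the variable and function names are hypothetical.

```python
import statistics
from itertools import product

percent = ["25%", "50%", "75%"]
surface = ["No surface treatment", "Surface treatment"]
filler = ["Filler A", "Filler B"]

# Cell means of weight loss (mg); key = (percentage filler, surface, filler type).
loss = {
    ("25%", "No surface treatment", "Filler A"): 544.0,
    ("25%", "No surface treatment", "Filler B"): 416.5,
    ("25%", "Surface treatment", "Filler A"): 470.5,
    ("25%", "Surface treatment", "Filler B"): 310.5,
    ("50%", "No surface treatment", "Filler A"): 642.5,
    ("50%", "No surface treatment", "Filler B"): 451.0,
    ("50%", "Surface treatment", "Filler A"): 550.5,
    ("50%", "Surface treatment", "Filler B"): 268.0,
    ("75%", "No surface treatment", "Filler A"): 733.5,
    ("75%", "No surface treatment", "Filler B"): 443.0,
    ("75%", "Surface treatment", "Filler A"): 654.0,
    ("75%", "Surface treatment", "Filler B"): 313.0,
}


def level_means(position, levels):
    """Mean response at each level of the factor stored at `position` in the key."""
    return {lvl: statistics.mean([y for key, y in loss.items() if key[position] == lvl])
            for lvl in levels}


grand = statistics.mean(list(loss.values()))
effects = [{lvl: m - grand for lvl, m in level_means(pos, lvls).items()}
           for pos, lvls in enumerate((percent, surface, filler))]

# No-interaction fit: grand mean plus one main effect per factor.
fitted = {key: grand + sum(effects[pos][key[pos]] for pos in range(3)) for key in loss}
residual = {key: loss[key] - fitted[key] for key in loss}

for key in product(percent, surface, filler):
    print(key, loss[key], round(fitted[key], 1), round(residual[key], 1))
```

The residuals show where the additive model misses the observed values, which is where an interaction term would be needed.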

8.4.7 Interaction in 3-factor designs

It must be stressed again that the no-interaction model does not always fit factorial data. Sometimes the effect of one factor is different for different levels of the others. Interactions in models with 3 or more factors are however difficult to understand, so we only briefly mention their existence here.

Shrimp culture

Commercial cultivation of the Californian brown shrimp was being planned, so scientists conducted a factorial experiment to assess how shrimp growth was affected by temperature, salinity and the density of shrimps in the tanks. The table below shows the average 4-week gain in weight per shrimp (mg) from the post-larval stage for each combination of factor levels.

	80 shrimps/litre		160 shrimps/litre
Salinity	25°C	35°C	25°C	35°C
10%	73	349	86	364
25%	482	330	208	316
40%	397	205	243	281

The data are displayed in the diagram below.

Click the top three checkboxes to fit the main effects for the factors. Weight gain appears to be highest at 25% salinity, density 80 shrimps per litre and 35°C.

This however does not tell the whole story. Several data points are far from the corresponding fitted value for the model — they have large residuals (the vertical red lines). In particular, at 10% salinity, the weight gains are under-estimated if the temperature is 35°C and over-estimated if the temperature is 25°C.

Click the checkbox for a Temperature-by-salinity interaction. This allows the effect of salinity to be different at the different temperatures. Observe that there is a very different pattern for the two temperatures.

When the temperature is 35°C, growth decreases steadily with increasing salinity.
When the temperature is 25°C, growth is extremely low for 10% salinity but peaks at 25% salinity — the overall best combination of temperature and salinity.
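The pattern behind the temperature-by-salinity interaction can be seen by averaging the table over the two densities. This is a sketch using only the tabulated values; these simple averages ignore the density effect, so they need not equal the fitted values shown in the diagram.

```python
import statistics

# Mean 4-week weight gain (mg); key = (salinity, temperature, shrimps per litre).
gain = {
    ("10%", "25°C", 80): 73, ("10%", "35°C", 80): 349,
    ("10%", "25°C", 160): 86, ("10%", "35°C", 160): 364,
    ("25%", "25°C", 80): 482, ("25%", "35°C", 80): 330,
    ("25%", "25°C", 160): 208, ("25%", "35°C", 160): 316,
    ("40%", "25°C", 80): 397, ("40%", "35°C", 80): 205,
    ("40%", "25°C", 160): 243, ("40%", "35°C", 160): 281,
}
salinities = ["10%", "25%", "40%"]
temperatures = ["25°C", "35°C"]

# Average over the two densities for each temperature-salinity combination.
by_temp_salinity = {
    (t, s): statistics.mean([y for (sal, temp, _), y in gain.items()
                             if sal == s and temp == t])
    for t in temperatures for s in salinities
}

for t in temperatures:
    print(t, [by_temp_salinity[(t, s)] for s in salinities])
```

If the two printed rows rose and fell together there would be no interaction; here the profiles across salinity have quite different shapes at the two temperatures.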

8.5 Practical issues in design

8.5.1 Purpose

The initial stage in any data-collection exercise is to clearly state the objectives of the study. With no clear idea of the objectives of the study, it is unlikely that the data collected will contain suitable information.

It is essential that the researcher has a clear idea of what an experiment is being conducted to achieve. In defining the goals of the experiment, it is important that people with intimate knowledge of the process or subject area are included in the team which is charged with designing and running the experiment.

Many experiments are intended to solve a problem. After recognising that there is a problem, it needs to be carefully defined. Stating the problem clearly and obtaining general agreement that this statement really does describe the problem is imperative. Quite frequently,...

8.5.2 Experimental units and measurements

It is desirable for experimental units to be as similar as possible, so every attempt should be made to make the experimental units homogeneous. We should therefore characterise the process in terms of 'nuisance' variables and endeavour to find ways of minimising their variability for the experiment.

Often however, the experimenter has little influence on the choice of experimental units and must contend with whatever variability exists. If possible, the experimental units should be grouped into blocks which can be used later in the design process to obtain more precise answers to the questions of interest.

In an experiment, there is sometimes a single obvious response measurement from an experimental unit (e.g. crop yield per square metre, concentration of impurities, exam mark), but often there are several variables which can be considered as response measurements.

For example, in a study of how a fertiliser affects growth of tomato plants, how do you measure growth?

There are many possibilities here and a biologist would need to decide on which was most important from a biological perspective.

Thought also needs to be given to which variables need to be controlled (the input variables) and what settings should be used for these variables in different experimental runs. In an agricultural experiment, do we only want to assess the difference in yields for three crop varieties, or do we simultaneously want to determine the effects of different levels of application of fertiliser? (And if so, what levels should be used?)

8.5.3 Experiments with human subjects

There are a few practicalities that complicate experimentation with human subjects. For ethical reasons, experiments involving potential danger to the subjects are not possible. Even if there is no known danger, the subjects should be aware of what is involved in the experiment and must give informed consent.

There are many instances where an experiment is intended to measure the effect of a particular treatment, such as the improvement of a medical condition caused by administration of a particular drug. A naive experimenter may record the value of some variable (e.g. concentration of some chemical in the blood) before medication is commenced and also after the medication has been used for a week. However any improvement in the condition may have resulted simply from the passage of time and it may not be related to the drug. Time is a lurking variable that is confounded with the treatment.

In order to assess the effect of the treatment, some (randomly selected) subjects who have not received the treatment should also be included in the study — differences between the improvements of the two groups can then be credited to the treatment. Subjects who receive no treatment are called controls.

Unfortunately, the act of administering a treatment to a human subject may itself affect the response, irrespective of the treatment effect. For example, if a drug is being assessed for its ability to reduce headaches, the knowledge that medication has been administered may make the subject feel better, even if the drug has no active ingredient.

To avoid the psychological effect of the treatment on the subject being confounded with the effect of the drug, an indistinguishable 'treatment' with no effect may be given to the control group of subjects; this is called a placebo. For example, two batches of pills of similar size and taste may be prepared, with only one batch containing the active ingredient being assessed. Any difference between the control group and the treatment group can therefore be attributed to the treatment.

A further complication may arise when the measured response from each subject may be affected by knowledge of the treatment applied. If the experimenter knows which treatment has been applied to each experimental unit, there may be a subconscious tendency to systematically over- or under-assess one treatment. To avoid this potential problem, the experimenter may be unaware of which experimental units received which treatment until after the experiment.

The experiment is called double-blind if neither the experimenter nor the subject knows which treatment has been applied. For example, a third party may randomly decide which of two drugs will be given to each subject, and package the appropriate pills for each subject in unlabelled containers. The experimenter would administer the treatments and record results without knowing which subjects were receiving which treatments. Again, the aim is to ensure that other factors do not act as lurking variables to confound comparisons of the treatments.
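The third-party allocation described above might be organised along the following lines. This is only a sketch: the drug names, the subject labels and the use of random four-digit container codes are assumptions made for illustration, not a prescribed procedure.

```python
import random


def blind_allocation(subjects, treatments=("drug A", "drug B"), seed=None):
    """Return (experimenter_view, sealed_key) for a double-blind allocation."""
    rng = random.Random(seed)
    # Balanced random allocation: equal numbers of subjects per treatment.
    labels = [treatments[i % len(treatments)] for i in range(len(subjects))]
    rng.shuffle(labels)
    # Container codes are drawn at random, so they reveal nothing about contents.
    codes = rng.sample(range(1000, 10000), len(subjects))
    experimenter_view = {s: f"container {c}" for s, c in zip(subjects, codes)}
    sealed_key = {f"container {c}": t for c, t in zip(codes, labels)}
    return experimenter_view, sealed_key


subjects = [f"subject {i}" for i in range(1, 13)]
view, key = blind_allocation(subjects, seed=7)
for subject in subjects:
    print(subject, view[subject])  # what the experimenter and the subject see
```

Only the third party holds `key`; it is used to decode the containers after all responses have been recorded.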

Chapter 8 Designed Experiments

8.1 Association & causal relationships

8.1.1 Interest in relationships