13. Independence

In this section, we explain how to model data sets that consist of pairs of categorical values — bivariate categorical data.

Examples where pairs of categorical variables are measured from 'individuals' are:

'Individuals'	Variable X	Variable Y
Employees in a large company	Sex (M or F)	Education (none, high school or tertiary)
Customers leaving supermarket	Checkout operator type (full- or part-time)	Rating of quality of service (poor, OK or good)
TVs leaving a production line	Assembler (A, B, C or D)	Status (defective or OK)

In each case, the data that are collected are pairs of categorical values that are measurements of the two variables. Data of this form are usually summarised in a contingency table.

Requests for promotional material by travelers

Travel agents provide 'destination-specific travel literature' about activities, facilities and prices to tourists free of charge on request. A study was made to investigate the differences between information seekers (who requested such literature) and nonseekers, with the aim of better targeting such material.

A sample of 686 tourists was selected and each was classified as an information seeker or non-seeker and in various other ways including educational level.

Tourist	Educational level	Information seeker?
1	High school degree	Yes
2	College degree	Yes
3	Some high school	No
...	...	...

The data from the 686 tourists are summarised in the contingency table below.

	Information seeker?
Education	Yes	No	Total
Some high school	13	27	40
High school degree	64	118	182
Some college	100	123	223
College degree	59	69	128
Graduate degree	67	46	113
Total	303	383	686

To model bivariate categorical data, we assume an underlying population of pairs of categorical values. The data are treated as a random sample of pairs of values from this population. A real finite population occasionally underlies the data, but we must usually hypothesise an infinite underlying population.

The proportion of times that the pair of categorical values (x, y) occurs in the population — its probability — is denoted by p_xy. The probabilities p_xy are called the joint probabilities for the two variables.

Gambling simulation

A gambler draws a card from a shuffled deck and also tosses a coin, resulting in a pair of categorical values,

Variable	Possible values
Coin side, X	Head or Tail
Card suit, Y	Heart, Club, Diamond or Spade

Each of the eight possible combinations of coin side and card suit is equally likely and would occur in the same proportions in the underlying population.

The probabilities for all pairs are therefore the same,

$$p_{xy} = \frac{1}{8}$$

These joint probabilities are shown in blue in the table below.

The two lower tables (in black) describe 100 pairs of values sampled from this population — 100 coin-card pairs. The first of these tables shows the sample as a contingency table of counts; the other table displays the counts as proportions.

Click Take sample a few times to see the variability in samples of 100 coin-card pairs. Increase the sample size to 500 and repeat. Observe that the sample proportions are less variable when the sample size is large.
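The sampling experiment above can be imitated in a few lines of code. The following is a minimal sketch using only Python's standard library. It assumes that each coin side and each card suit is equally likely on every draw; the function name `sample_proportions` and the fixed seed are illustrative choices and not part of the original exercise.

```python
import random
from collections import Counter

COINS = ["Head", "Tail"]
SUITS = ["Heart", "Club", "Diamond", "Spade"]

def sample_proportions(n, rng):
    """Draw n independent coin-card pairs and return the sample proportion of each pair."""
    counts = Counter((rng.choice(COINS), rng.choice(SUITS)) for _ in range(n))
    return {(c, s): counts[(c, s)] / n for c in COINS for s in SUITS}

rng = random.Random(1)  # fixed seed so that the run is repeatable
for n in (100, 500):
    props = sample_proportions(n, rng)
    spread = max(props.values()) - min(props.values())
    print(n, "largest minus smallest sample proportion:", round(spread, 3))
```

Each of the eight sample proportions estimates the same joint probability, 1/8, so the spread between the largest and smallest proportion gives a rough picture of the sampling variability.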

We are usually interested in the joint probabilities in the underlying model, rather than the corresponding proportions from the sample data that have been collected (the contingency table). However the joint probabilities are unknown parameters in most practical situations and must be estimated from the sample data.

Requests for promotional material by travelers

In situations of practical importance, the underlying probabilities are unknown. In the data set that was collected to examine which travelers request promotional material, the 686 travelers in the study were not the focus of attention — the researcher wanted to generalise to all other similar travelers.

The population proportions are unknown, but the sample proportions provide estimates of them.
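As a small illustration, the sample proportions can be computed directly from the contingency table of the 686 tourists. This is only a sketch; the variable names are illustrative.

```python
# Counts from the contingency table of 686 tourists: (seekers, non-seekers).
counts = {
    "Some high school": (13, 27),
    "High school degree": (64, 118),
    "Some college": (100, 123),
    "College degree": (59, 69),
    "Graduate degree": (67, 46),
}

n = sum(yes + no for yes, no in counts.values())  # total number of tourists

# Sample joint proportions, used as estimates of the unknown joint probabilities.
estimates = {
    (education, answer): count / n
    for education, (yes, no) in counts.items()
    for answer, count in (("Yes", yes), ("No", no))
}

for (education, answer), p in estimates.items():
    print(f"{education:20s} {answer:3s} {p:.4f}")
```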

13.1.2 Marginal probabilities

A model for two categorical variables is characterised by the joint probabilities p_xy. However we sometimes want to restrict attention to one of these variables on its own. The marginal probabilities for the variable X are defined and interpreted in a similar way to the marginal proportions that were defined earlier for bivariate categorical data.

The marginal probability, p_x, for the variable X is the proportion of (x, y) pairs in the population for which the value of X is x. For example, consider a situation where we are interested in the hair colour and eye colour of teenagers. The number of blue-eyed teenagers is the sum of those with either (blue eyes and blonde hair) or (blue eyes and brunette hair) or ... or (blue eyes and red hair),

$$p_x = \sum_y p_{xy}$$

where the right of the equation denotes summing the joint probabilities over all possible values of y. There is a similar formula for the marginal probabilities of the other variable,

$$p_y = \sum_x p_{xy}$$

Eye strain for office workers

It is difficult to find illustrative examples since population probabilities are unknown in most 'interesting' applications. The following example is based on a real data set which classifies 295 office workers by their type of work and whether they have symptoms of eye strain.

**Data from 295 office workers**
Type of work	No eye strain	Eye strain
VDU data entry	42	11
General VDU use	79	30
Full-time typing	64	14
Standard clerical work	52	3

We do not know the underlying population joint probabilities for workers in this type of office in general. However, to provide an illustrative example, we will pretend that the population probabilities are equal to the proportions in this data set. For example, we will pretend that the joint probability for a worker doing VDU data entry and not having eye strain is 42 / 295 = 0.1424.

**Probabilities for office workers in general**
Type of work	No eye strain	Eye strain	Total
VDU data entry	0.1424	0.0373	0.1797
General VDU use	0.2678	0.1017	0.3695
Full-time typing	0.2169	0.0475	0.2644
Standard clerical work	0.1763	0.0102	0.1864
Total	0.8034	0.1966	1.0000

The two marginal totals (red and orange) of the table give the marginal probabilities for the two variables. For example,

the probability that a worker has symptoms of eye strain is 0.1966
the probability that a worker is doing full-time typing is 0.2644
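The row and column sums can be checked with a short calculation. The sketch below starts from the counts for the 295 office workers and, as in the text, pretends that the sample proportions are the population joint probabilities. The variable names are illustrative.

```python
# Counts for the 295 office workers: (no eye strain, eye strain).
counts = {
    "VDU data entry": (42, 11),
    "General VDU use": (79, 30),
    "Full-time typing": (64, 14),
    "Standard clerical work": (52, 3),
}
n = sum(a + b for a, b in counts.values())

# Joint probabilities (pretending the sample proportions are the population values).
joint = {work: (a / n, b / n) for work, (a, b) in counts.items()}

# Marginal probabilities: sum each row over y, and each column over x.
p_work = {work: sum(row) for work, row in joint.items()}
p_strain = [sum(row[j] for row in joint.values()) for j in range(2)]

print(round(p_work["Full-time typing"], 4))  # 0.2644
print(round(p_strain[1], 4))                 # 0.1966
```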

The diagram below illustrates the summing of joint probabilities to give marginal ones with a 3-dimensional barchart of the joint probabilities.

Click the formula for the marginal probabilities of 'X' (the type of work) on the right. The bars stack to show the marginal probabilities for type of work.

Similarly, clicking the formula for the marginal probabilities of 'Y' stacks the bars to show the overall probability that a worker has eye strain.

13.1.3 Conditional probabilities

The concept of a conditional probability is similar to that of a conditional proportion that was described earlier for bivariate categorical data sets.

Consider again hair colour (Y) and eye colour (X) in a population of teenagers. The probability of a teenager being blonde, conditional on blue eyes, is the proportion of blondes within the sub-population with blue eyes. The conditional probability is most easily understood as the ratio of the number in the population with both blonde hair and blue eyes to the number with blue eyes.

However if the population is infinite, it is better to express it in terms of probabilities as the ratio of a joint and marginal probability (an equivalent definition for finite populations).

The general definition of the conditional probabilities for Y given that the value of X is x is

$$P(Y = y \mid X = x) = \frac{p_{xy}}{p_x}$$

The conditional probabilities for Y, given X = x, can therefore be found by rescaling that row of the table of joint probabilities (dividing by p_x) so that the row sums to 1.0, as shown in the diagram below.

Note that there is an equivalent formula for conditional probabilities for X given the value of Y that corresponds to using the other variable to define the sub-population. When we restrict attention to population values for which Y has the value y, the conditional probabilities for X are

$$P(X = x \mid Y = y) = \frac{p_{xy}}{p_y}$$
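The row rescaling can be sketched in code. The example below reuses the illustrative joint probabilities for office workers and eye strain from the earlier page; the function name `conditional_given_row` is a hypothetical helper, not something defined in the text.

```python
def conditional_given_row(joint):
    """Rescale each row of a table of joint probabilities so that it sums to 1."""
    result = {}
    for x, row in joint.items():
        p_x = sum(row.values())  # marginal probability for this row
        result[x] = {y: p_xy / p_x for y, p_xy in row.items()}
    return result

joint = {
    "VDU data entry": {"No eye strain": 0.1424, "Eye strain": 0.0373},
    "General VDU use": {"No eye strain": 0.2678, "Eye strain": 0.1017},
    "Full-time typing": {"No eye strain": 0.2169, "Eye strain": 0.0475},
    "Standard clerical work": {"No eye strain": 0.1763, "Eye strain": 0.0102},
}

cond = conditional_given_row(joint)
for work, row in cond.items():
    print(work, {y: round(p, 3) for y, p in row.items()})
```

Applying the same function to the transposed table would give the conditional probabilities for type of work given eye strain status.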

Support and grief state after neonatal death

The diagram below again shows the joint probabilities in a 3-dimensional barchart.

Click the formula for the conditional probabilities of 'Y' (grief state) given 'X' (the level of support). The bars for each level of support are separately scaled up to add to 1.0. Observe that

the probability of being in grief state I is highest for those getting good support
the probability of being in grief state IV is highest for those getting moderate support

Click the formula for joint probabilities, then the formula for conditional probabilities of 'X' given 'Y'. This time the joint probabilities are separately scaled for mothers in different grief states. These conditional probabilities are less useful for understanding this example.

13.1.4 Graphical display of probabilities

Marginal and conditional probabilities are meaningful and useful summaries of the relationship between two categorical variables. Proportional Venn diagrams were used earlier to graphically display marginal and conditional proportions for bivariate categorical data sets. They can also be used in the same way to display marginal and conditional probabilities for a bivariate categorical model.

The proportional Venn diagram is drawn in a unit square (with both sides of length 1.0).

Since this is the product of the height and width of the rectangle representing categories x and y ,

A similar diagram can be based on the marginal probabilities of Y and the conditional probabilities of X given Y, splitting the unit square first horizontally and then vertically. The areas of the resulting rectangles are again equal to the joint probabilities, so the two diagrams are just rearrangements of the same areas (the joint probabilities, p_xy).

Apple bruising

Before showing the relationship between joint, conditional and marginal probabilities, we illustrate the formulae for joint, conditional and marginal proportions.

The contingency table below describes bruising of 96 apples in a packing plant. The apples were classified by the variety of apple (Granny Smith or Fuji) and whether or not they were bruised. (The data are not real.)

	Bruised	Not bruised
Granny Smith	40	8
Fuji	24	24

The diagram below shows a Proportional Venn diagram for the data. Note that the four areas are proportional to the numbers of apples for each combination of apple type and bruising.

Click on any rectangle in the diagram to observe how the joint proportion of apples with any combination of apple type and bruising equals the product of a marginal and conditional proportion.
Clicking on the other formulae under the diagram shows how the joint proportions can be obtained from the other marginal and conditional proportions.
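The relationship between the three kinds of proportion can be verified numerically for the apple data. This is a minimal sketch with illustrative variable names.

```python
counts = {
    "Granny Smith": {"Bruised": 40, "Not bruised": 8},
    "Fuji": {"Bruised": 24, "Not bruised": 24},
}
n = sum(sum(row.values()) for row in counts.values())  # 96 apples

for variety, row in counts.items():
    marginal = sum(row.values()) / n              # proportion of apples of this variety
    for status, count in row.items():
        conditional = count / sum(row.values())   # proportion with this status within the variety
        joint = count / n
        # Width times height of each rectangle equals the joint proportion.
        assert abs(marginal * conditional - joint) < 1e-12
        print(f"{variety}, {status}: {marginal:.3f} x {conditional:.3f} = {joint:.3f}")
```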

World population by age and region

The table below shows the world population in 2002, categorised by region and by age group.

**World population (millions)**
	Age
	0-19	20-64	65+
Africa and Near East	526.6	455.6	34.9
Asia	1,340.0	1,964.2	216.7
America, Europe and Oceania	522.6	981.8	188.7

Consider randomly selecting one person in the world. The joint probabilities for this person being in each age/region are obtained by dividing the above values by the total world population.

**Joint probabilities**
	Age
	0-19	20-64	65+
Africa and Near East	0.085	0.073	0.006
Asia	0.215	0.315	0.035
America, Europe and Oceania	0.084	0.158	0.030

Marginal and conditional probabilities can be obtained using formulae from the previous pages. The proportional Venn diagram below displays them graphically.

The diagram initially splits the unit square horizontally using the marginal probabilities of Y, the probabilities of a random person being from each of the three regions. Each row is split according to the conditional probabilities for age group within that region. From the diagram, we can easily see that:

The probability that the person is from Asia is higher than the other regions.
Conditionally on the person being from Africa (or the Near East), the person is likely to be young. In particular, the conditional probability of being aged 65+ is small.
Conditional on the person being from America, Europe or Oceania, there is a much higher probability of the person being aged 65+ than in the other regions.

Click on any rectangle in the diagram to observe how its area equals the product of a marginal and conditional probability and therefore is the joint probability for the corresponding categories.

Click the rightmost formula under the diagram. The rectangles change in shape but retain the same areas to rearrange into vertical columns corresponding to the marginal probabilities for age group. Each column is split in proportion to the conditional probabilities of region given age group. From this version of the diagram, observe that

Conditional on the person being aged 0-19, the probabilities of being in America/Europe/Oceania and Africa are similar.
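The observations above can be checked from the population table. The following sketch computes the marginal probabilities for region, the conditional probabilities for age group given region, and the conditional probabilities for region given age 0-19. Variable names are illustrative.

```python
population = {  # millions, 2002
    "Africa and Near East": {"0-19": 526.6, "20-64": 455.6, "65+": 34.9},
    "Asia": {"0-19": 1340.0, "20-64": 1964.2, "65+": 216.7},
    "America, Europe and Oceania": {"0-19": 522.6, "20-64": 981.8, "65+": 188.7},
}
total = sum(sum(row.values()) for row in population.values())

# Marginal probabilities for region and conditional probabilities for age given region.
for region, row in population.items():
    region_total = sum(row.values())
    p_region = region_total / total
    age_given_region = {age: round(v / region_total, 3) for age, v in row.items()}
    print(f"{region}: P(region) = {p_region:.3f}, age given region = {age_given_region}")

# Conditional probabilities for region given that the person is aged 0-19.
young_total = sum(row["0-19"] for row in population.values())
region_given_young = {region: row["0-19"] / young_total for region, row in population.items()}
print({region: round(p, 3) for region, p in region_given_young.items()})
```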

13.1.5 Calculations with probabilities

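We have used three types of probability to describe a model for two categorical variables: the joint probabilities, the marginal probabilities for the two variables and the conditional probabilities for each variable given the value of the other variable. These sets of probabilities are closely related. Indeed, the model can be equivalently described by any of the following.

The joint probabilities, p_xy
The marginal probabilities for X and the conditional probabilities for Y given X
The marginal probabilities for Y and the conditional probabilities for X given Y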

The diagram below shows how to find each set of probabilities from the others, using the formulae described in the earlier pages of this section.

In particular, note that it is possible to obtain the conditional probabilities for X given Y, P(X = x | Y = y), from the marginal probabilities of X, p_x, and the conditional probabilities for Y given X, P(Y = y | X = x). This can be expressed in a single formula that is called Bayes' Theorem, but it is easier in practice to do the calculations in two steps, obtaining the joint probabilities, p_xy, in the first step. There are several important applications of Bayes' Theorem.

Fraudulent tax claims

Tax inspectors investigate some of the tax returns that are submitted by individuals if they think that some claims for expenses are too high or are unjustified.

An investigation of a tax return does not always conclude that the claims were fraudulent, since the inspectors' suspicions are rarely 100% accurate. There are two types of error:

The tax return of someone who has submitted correct claims is investigated.
The tax return of someone who has submitted a bad claim is not investigated.

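There are commonly non-zero probabilities for each of these types of error. Consider tax inspectors who have probability 0.1 of investigating a correct claim and 0.2 of not investigating a bad claim. These are conditional probabilities and can be written formally as:

$$P(\text{investigated} \mid \text{correct claim}) = 0.1 \qquad P(\text{not investigated} \mid \text{bad claim}) = 0.2$$

Since the probability (proportion) of investigating a bad claim is one minus the conditional probability of not investigating it (and a similar result holds for correct claims), the remaining conditional probabilities are

$$P(\text{investigated} \mid \text{bad claim}) = 0.8 \qquad P(\text{not investigated} \mid \text{correct claim}) = 0.9$$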

We will also assume that 10% of tax returns are bad claims. This corresponds to a marginal probability, P(bad claim) = 0.10.

The diagram below shows how these marginal probabilities for Y (claim type) and conditional probabilities for X (investigation) given Y can be used to obtain the conditional probabilities for Y (claim type) given X (investigation).

The initial information is shown in blue at the top of the diagram. The joint probabilities (green) are first found from them. Click on any value in the table of joint probabilities to see how it is related to the initial information.

Marginal probabilities for whether a tax return is investigated are next obtained by adding the columns of joint probabilities. Click on any of the black marginal probabilities to see how they are obtained from the joint probabilities.

Finally the conditional probabilities for claim type (given whether the tax return has been investigated) are obtained from the joint probabilities and the marginal probabilities for investigation. Click on the conditional probabilities on the bottom right of the diagram to see the formula.
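The two-step calculation can be written out as a short function. This is a sketch under the assumptions stated in the example: probability 0.1 of investigating a correct claim, probability 0.2 of not investigating a bad claim, and 10% bad claims. The function and parameter names are hypothetical.

```python
def claim_given_investigation(p_bad, p_inv_given_bad=0.8, p_inv_given_correct=0.1):
    """Two-step calculation: joint probabilities first, then conditionals for claim type."""
    prior = {"bad": p_bad, "correct": 1.0 - p_bad}
    p_inv = {"bad": p_inv_given_bad, "correct": p_inv_given_correct}

    # Step 1: joint probability = marginal for claim type x conditional for investigation.
    joint = {}
    for claim in prior:
        joint[(claim, "investigated")] = prior[claim] * p_inv[claim]
        joint[(claim, "not investigated")] = prior[claim] * (1.0 - p_inv[claim])

    # Step 2: marginal probability for each investigation outcome, then divide.
    result = {}
    for outcome in ("investigated", "not investigated"):
        p_outcome = sum(joint[(claim, outcome)] for claim in prior)
        for claim in prior:
            result[(claim, outcome)] = joint[(claim, outcome)] / p_outcome
    return result

posterior = claim_given_investigation(0.10)
print(round(posterior[("bad", "investigated")], 3))      # 0.471
print(round(posterior[("bad", "not investigated")], 3))  # 0.024
```

Under these assumptions, fewer than half of the investigated returns are bad claims, even though 80% of bad claims are investigated. Calling the function with a larger value of `p_bad` shows how the result depends on the proportion of bad claims in the population.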

Initially there might seem to be a contradiction between the two conditional probabilities,

However the two probabilities are consistent since they have very different interpretations. The proportional Venn diagrams below help to explain the difference. The diagram on the left shows the marginal and conditional probabilities given in the question. The corresponding diagram on the right shows the marginal probabilities for whether claims are investigated and the conditional probabilities for good/bad claims.

Remember that the areas of the rectangles equal the joint probabilities and are therefore the same in both diagrams.

Drag the slider to alter the proportion of people who make bad tax claims in the population. (We assume that the conditional probabilities of investigating the claims remain the same.) Observe that:

When a large proportion in the population make bad tax claims, P(bad claim | investigated) is high. This is because there are very few people in the population who make good claims and are investigated.
When P(bad claim) is small, P(bad claim | investigated) is also small since the small number who make bad claims and are investigated is outweighed by those who make good claims and are investigated.

13.2 Independence

13.2.1 Association

When two or more measurements are made from each individual in a population, we are usually interested in whether these variables are related to each other. When both variables are numerical, the strength of the relationship can be described with a correlation coefficient and regression models allow us to test whether two variables are related on the basis of sample data.

As with numerical variables, we may be able to conclude that any relationship between categorical variables is causal if it results from an experiment (e.g. a randomised experiment in which some pea seeds are coated and others are uncoated). From observational data however, we usually cannot deduce a causal relationship — all we can say is that the variables may be associated.

We say that two variables are associated if knowledge of the value of one tells you something about the likely value of the other.

For example, if the conditional distribution of the Job satisfaction of new employees given Job type = secretary is different from the conditional distribution of Job satisfaction given Job type = manager, then we say that Job satisfaction and Job type are associated.

On the next page, we will characterise two variables that are not associated, but first we give an example of variables that are related.

Absenteeism and weight

To illustrate the idea of association, we use a table of joint probabilities that constitute a possible model for absenteeism of employees in a supermarket chain and their weight.

Note that the joint probabilities in this model do not accurately represent the effect of weight on absenteeism — they are only used to illustrate the concepts.

**Joint Probabilities**
	Attendance record
	Poor	Satisfactory	Above average	Marginal
Underweight	0.0450	0.0900	0.0150	0.1500
Normal	0.0825	0.3025	0.1650	0.5500
Overweight	0.0500	0.1200	0.0300	0.2000
Obese	0.0300	0.0650	0.0050	0.1000
Marginal	0.2075	0.5775	0.2150	1.0000

The implications of this model are best explained from conditional probabilities for attendance record, given weight:

**Conditional Probabilities**
	Attendance record
	Poor	Satisfactory	Above average	Total
Underweight	0.30	0.60	0.10	1.0
Normal	0.15	0.55	0.30	1.0
Overweight	0.25	0.60	0.15	1.0
Obese	0.30	0.65	0.05	1.0

A proportional Venn diagram displays these conditional probabilities graphically.

If this model is correct, the conditional probability of poor attendance is lowest for staff with 'normal' weight, increasing as weight gets further from 'normal'. Similarly, the probability of above average attendance is highest for those with 'normal' weight.

13.2.2 Independence

If the conditional probabilities for Y are the same for all values of X, then Y is said to be independent of X.

Independence implies that the sub-populations corresponding to different values of X all contain values of Y in the same proportions.

Work performance and weight

As an example of independence, we continue with the (artificial) example on the previous page. We now show the relationship between weight and work performance (as assessed by a supervisor). In this model, weight and performance are independent — knowing someone's weight gives no clues as to that person's ability to do their job.

**Joint Probabilities**
	Work performance
	Poor	Satisfactory	Above average	Marginal
Underweight	0.0225	0.1125	0.0150	0.1500
Normal	0.0825	0.4125	0.0550	0.5500
Overweight	0.0300	0.1500	0.0200	0.2000
Obese	0.0150	0.0750	0.0100	0.1000
Marginal	0.1500	0.7500	0.1000	1.0000

For this model, the conditional probabilities for work performance, given weight, are:

**Conditional Probabilities**
	Work performance
	Poor	Satisfactory	Above average	Total
Underweight	0.15	0.75	0.10	1.0
Normal	0.15	0.75	0.10	1.0
Overweight	0.15	0.75	0.10	1.0
Obese	0.15	0.75	0.10	1.0

The conditional probabilities are the same for each weight, so knowing that an employee is, say, obese does not affect the probability of being rated as an above-average worker. The proportional Venn diagram has the form shown below.

Note that the Proportional Venn Diagram now consists of a grid of horizontal and vertical lines.

Since the conditional and marginal probabilities are equal if Y and X are independent, an equivalent definition of independence is:

$$p_{xy} = p_x \times p_y \quad \text{for all } x \text{ and } y$$
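This product form of independence is easy to check numerically. The sketch below applies it to the two artificial models for weight used in this section, testing whether every joint probability equals the product of its row and column marginal probabilities. The helper name `is_independent` and the tolerance are illustrative choices.

```python
def is_independent(joint, tol=1e-9):
    """Return True if every joint probability equals the product of its two marginals."""
    n_cols = len(next(iter(joint.values())))
    row_marg = {x: sum(row) for x, row in joint.items()}
    col_marg = [sum(row[j] for row in joint.values()) for j in range(n_cols)]
    return all(
        abs(row[j] - row_marg[x] * col_marg[j]) <= tol
        for x, row in joint.items()
        for j in range(n_cols)
    )

performance = {  # weight by work performance (poor, satisfactory, above average)
    "Underweight": [0.0225, 0.1125, 0.0150],
    "Normal": [0.0825, 0.4125, 0.0550],
    "Overweight": [0.0300, 0.1500, 0.0200],
    "Obese": [0.0150, 0.0750, 0.0100],
}
attendance = {  # weight by attendance record (poor, satisfactory, above average)
    "Underweight": [0.0450, 0.0900, 0.0150],
    "Normal": [0.0825, 0.3025, 0.1650],
    "Overweight": [0.0500, 0.1200, 0.0300],
    "Obese": [0.0300, 0.0650, 0.0050],
}

print(is_independent(performance))  # True
print(is_independent(attendance))   # False
```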

3-dimensional illustration of independence

The diagram below shows the joint probabilities in the model of independence above.

Click the formula for the conditional probability of Y given X. (This separately scales the bars for each X to have the same total, 1.0.) Observe that the distribution of performance is the same in each weight group.

Click the formula for the joint probabilities, then the formula for the conditional probabilities of X given Y. Observe that the distribution of weights is the same in each performance group.

13.2.3 Independence from samples

Independence is an important special case of models for bivariate data. However it is a property of the joint population probabilities and in most practical situations these are unknown.

Recruiting source and success

A sample of 1,400 store clerks hired during 1979 by a large US retailing chain was selected by researchers who wanted to determine whether the recruiting source for employees is related to whether they perform satisfactorily in their job (determined from supervisor evaluations). Four recruiting sources were defined.

**Sample Data**
	Unsatisfactory	Satisfactory	Total
Employee referral	167	85	252
In-store notice	383	261	644
Employment agency	33	17	50
Media announcement	250	204	454
Total	833	567	1400

Independence would be an important characteristic of employment since it would imply that employees recruited from all sources have the same probability of satisfactory performance.

Are these sample data consistent with a model of independence?

The marginal counts in a contingency table describe the univariate distributions of the two variables on their own, but do not tell you anything about their relationship. For example, the two contingency tables below have the same margins.

However the table on the left supports an extremely strong relationship — if the row category is known, we can accurately predict the column category. On the other hand, there is no evidence of association in the table on the right — each row of the table contains the column categories in the same proportions.

In practice, the pattern of counts in a contingency table is rarely so easily interpreted. A first step is to determine the pattern that is most consistent with independence of the rows and columns, based on the observed margins.

If the rows and columns are independent, the conditional probabilities are the same for each row, so we distribute each marginal row total between the column categories in the same proportions — determined by the marginal proportions for the column categories.

	C1	C2	C3	Total
R1	?	?	?	30
R2	?	?	?	40
R3	?	?	?	30
Total	30	40	30	100

This pattern gives the estimated cell counts, and the following formula can be used to evaluate them,

$$e_{xy} = \frac{n_x n_y}{n}$$

where n denotes the total for the whole table and n_x and n_y denote the marginal totals for row x and column y.
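As a concrete illustration, the estimated count for row x and column y is n_x × n_y / n. The sketch below fills in the table with margins 30, 40 and 30 shown above; the function name `estimated_counts` is a hypothetical helper.

```python
def estimated_counts(row_totals, col_totals):
    """Cell counts most consistent with independence: n_x * n_y / n for each cell."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must add to the same n")
    return [[n_x * n_y / n for n_y in col_totals] for n_x in row_totals]

table = estimated_counts([30, 40, 30], [30, 40, 30])
for row in table:
    print(row)
```

Each row of the result is split between the columns in the proportions 0.3, 0.4 and 0.3, so the first row becomes 9, 12 and 9.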

Recruiting source and success

We now find the pattern of estimated cell counts for the recruitment data that is most consistent with independence of recruiting source and success, based only on the margins of the observed contingency table.

**Sample Data**
	Unsatisfactory	Satisfactory	Total
Employee referral	?	?	252
In-store notice	?	?	644
Employment agency	?	?	50
Media announcement	?	?	454
Total	833	567	1400

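If success is indeed independent of recruitment, then we estimate that the proportion of the 252 recruited from 'Employee referral' who are unsatisfactory would be the same as the marginal proportion who are unsatisfactory. Since 833 out of the total 1400 in the study are unsatisfactory, we therefore expect that the number recruited from 'Employee referral' who are unsatisfactory would be

$$\frac{252 \times 833}{1400} = 149.9$$

This is an example of the general formula that was presented earlier,

$$e_{xy} = \frac{n_x n_y}{n}$$

The complete table of estimated cell counts is:

	Unsatisfactory	Satisfactory	Total
Employee referral	149.9	102.1	252
In-store notice	383.2	260.8	644
Employment agency	29.8	20.2	50
Media announcement	270.1	183.9	454
Total	833	567	1400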

If recruitment and success are indeed independent, then the observed cell counts in the sample data should be similar to these estimated cell counts.

**Observed and estimated cell counts**
	Unsatisfactory	Satisfactory	Total
Employee referral	167 (149.9)	85 (102.1)	252
In-store notice	383 (383.2)	261 (260.8)	644
Employment agency	33 (29.8)	17 (20.2)	50
Media announcement	250 (270.1)	204 (183.9)	454
Total	833	567	1400

The hypothesis of independence is assessed by asking whether the observed and estimated cell counts are 'sufficiently close' — are the observed counts consistent with the counts estimated under independence? We address this formally in the following pages.

13.2.4 Testing for independence

Recruiting source and success

If the recruitment source and work performance are indeed independent, then the observed cell counts in the sample data should be similar to the estimated cell counts found on the previous page.

**Observed and estimated cell counts**
	Unsatisfactory	Satisfactory	Total
Employee referral	167 (149.9)	85 (102.1)	252
In-store notice	383 (383.2)	261 (260.8)	644
Employment agency	33 (29.8)	17 (20.2)	50
Media announcement	250 (270.1)	204 (183.9)	454
Total	833	567	1400

Did a sample contingency table come from a population in which the categorical row and column variables, X and Y, are independent? This question can be formalised as the hypothesis test,

H₀: X and Y are independent
H_A: X and Y are dependent

In order to assess whether the data are consistent with the null hypothesis, we ask whether the observed cell counts in the contingency table, n_xy, are similar to the estimated cell counts based on independence, e_xy. The simplest measure of their match is the sum of squares of the differences,

$$\sum_{x,y} (n_{xy} - e_{xy})^2$$

Small values of this statistic are expected when there is independence in the underlying population. However it does not behave entirely as desired. To be useful, a test statistic must have a known distribution when H₀ is true and, ideally, this distribution should not depend too much on specific characteristics of the problem.

It would be very unusual for a cell in a contingency table with estimated cell count e_xy = 1 to have observed cell count n_xy = 11. However if the estimated cell count is e_xy = 1001 then sampling variability would mean that an observed cell count of n_xy = 1011 would not be unusual. Yet the difference is the same in both cases.

Distribution of sum of squares

The blue values in the contingency table on the left below have been sampled from a population in which each of the row categories is equally likely (with marginal probability ¹/₃), each column category is equally likely (marginal probability ¹/₃) and the row and column categories are independent. All joint probabilities are therefore known to be ¹/₉.

Click Sample a few times to observe the variability of the blue observed counts, n_xy.

The estimated counts, e_xy, obtained from the margins of the table, are also shown in red. Observe the variability in the differences and their sum of squares.

Increase the sample size from 100 to 1000 and repeat. Observe that the differences are usually higher. Increase the sample size to 10000 and observe that the statistic is usually higher still.

The distribution of the sum of squares depends on the sample size, so it is not an easily interpreted measure of independence.
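The dependence on sample size can be seen in a small simulation. The sketch below, using only the standard library, samples from a 3 × 3 model with equally likely, independent categories and averages the raw sum of squared differences over repeated samples. The function name, the seed and the number of repetitions are illustrative choices.

```python
import random

def sum_of_squares(n, rng, k=3):
    """Sample n pairs from k x k independent, equally likely categories and
    return the sum over cells of (observed - estimated) squared."""
    counts = [[0] * k for _ in range(k)]
    for _ in range(n):
        counts[rng.randrange(k)][rng.randrange(k)] += 1
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    return sum(
        (counts[i][j] - row_tot[i] * col_tot[j] / n) ** 2
        for i in range(k) for j in range(k)
    )

rng = random.Random(2)  # fixed seed so that the run is repeatable
for n in (100, 1000):
    mean_ss = sum(sum_of_squares(n, rng) for _ in range(200)) / 200
    print(n, "average sum of squares:", round(mean_ss, 1))
```

The average for samples of 1000 should be much larger than that for samples of 100, even though both sets of samples come from the same model with independence.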

13.2.5 Chi-squared test statistic

The raw sum of squares on the previous page is a poor way to assess whether a contingency table has been sampled from a population with independence. A better statistic is χ² (pronounced chi-squared), defined by

$$\chi^2 = \sum_{x,y} \frac{(n_{xy} - e_{xy})^2}{e_{xy}}$$

This more fairly assesses differences between n_xy and e_xy when the e_xy vary in magnitude. Its distribution still depends on the number of rows and columns in the contingency table, but is no longer affected by either the number of individuals (the total count for the table) or the margins of the table.
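A minimal sketch of the calculation follows, applied to the recruiting source data from earlier in this section. Each squared difference between observed and estimated counts is divided by the estimated count before summing. The function name `chi_squared` is illustrative, and the code assumes that no row or column total is zero.

```python
def chi_squared(observed):
    """Chi-squared statistic for a contingency table given as a list of rows of counts."""
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, n_xy in enumerate(row):
            e_xy = row_tot[i] * col_tot[j] / n   # estimated count under independence
            stat += (n_xy - e_xy) ** 2 / e_xy
    return stat

recruitment = [
    [167, 85],    # Employee referral
    [383, 261],   # In-store notice
    [33, 17],     # Employment agency
    [250, 204],   # Media announcement
]
print(round(chi_squared(recruitment), 2))
```

A table whose observed counts exactly match the estimated counts, such as `[[10, 20], [20, 40]]`, gives a statistic of zero.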

Simulation

The diagram below again samples from two independent categorical variables.

Click Accumulate then take several samples to build up the distribution of the χ² statistic.

Now increase the sample size and repeat. Observe that χ² is approximately the same magnitude (usually between 0.5 and 15.0) regardless of the sample size.

Finally, use the pop-up menu labelled Model to change the model to one where the marginal probabilities for the two categorical variables are unequal (but there is still independence). Observe that the distribution of χ² remains approximately the same.

When there is independence, the χ² statistic has approximately a standard distribution called a chi-squared distribution whose shape only depends on the number of rows and columns in the table but not the sample size or the underlying joint probabilities.

If a contingency table with r rows and c columns is sampled from a population with independence, χ² has a chi-squared distribution with (r - 1)(c - 1) degrees of freedom.
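The claim that the distribution does not depend on the sample size can be explored by simulation. The sketch below samples 3 × 3 tables from a model with independence, so that the degrees of freedom are (3 - 1)(3 - 1) = 4, and compares the average χ² for two sample sizes. It relies on the fact that the mean of a chi-squared distribution equals its degrees of freedom. The function name, seed and repetition count are illustrative.

```python
import random

def simulate_chi_squared(n, rng, k=3):
    """Chi-squared statistic for one sample of n pairs from k x k independent categories."""
    counts = [[0] * k for _ in range(k)]
    for _ in range(n):
        counts[rng.randrange(k)][rng.randrange(k)] += 1
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    stat = 0.0
    for i in range(k):
        for j in range(k):
            e = row_tot[i] * col_tot[j] / n
            if e > 0:  # skip cells in an empty row or column
                stat += (counts[i][j] - e) ** 2 / e
    return stat

rng = random.Random(3)  # fixed seed so that the run is repeatable
for n in (100, 1000):
    values = [simulate_chi_squared(n, rng) for _ in range(300)]
    print(n, "mean chi-squared:", round(sum(values) / len(values), 2))
```

Both averages should be close to 4, in contrast to the raw sum of squares, which grows with the sample size.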

Shape of the chi-squared distribution

The diagram below shows the probability density function for the chi-squared distribution.

Use the pop-up menus to change the number of rows and columns in the table. Observe that:

The chi-squared distribution is skew, but becomes closer to symmetric when the degrees of freedom are large (i.e. when the number of rows and columns in the table is large).
The mean of the distribution equals its degrees of freedom.

13.2.6 P-value for chi-squared test

We now formally describe a hypothesis test for whether two categorical variables are independent.

The test statistic

$$\chi^2 = \sum_{x,y} \frac{(n_{xy} - e_{xy})^2}{e_{xy}}$$

describes whether the observed counts in a contingency table, n_xy, are close to those expected for independent variables.

In a similar way to other hypothesis tests, we evaluate a p-value — the probability of getting such an extreme χ² when the two variables are independent (H₀).

If the p-value is close to zero, we conclude that the observed table would be unlikely for independent variables, so there is evidence that the variables are associated.

p-value	Interpretation
over 0.1	no evidence against the null hypothesis (independence)
between 0.05 and 0.1	very weak evidence of dependence between the row and column variables
between 0.01 and 0.05	moderately strong evidence of dependence between the row and column variables
under 0.01	strong evidence of dependence between the row and column variables
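To make the p-value concrete, the sketch below evaluates the upper tail probability of a chi-squared distribution using only the standard library, then applies it to the recruiting source data from earlier in this section. The helper `chi2_upper_tail` is a hypothetical function written for this illustration; it uses a standard recurrence for the incomplete gamma function and assumes a positive integer number of degrees of freedom. Statistical software would normally be used instead.

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-squared with df degrees of freedom >= x), for a positive integer df.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1)
    for the upper incomplete gamma ratio, with z = x / 2.
    """
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # starting value for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # starting value for df = 1
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

observed = [[167, 85], [383, 261], [33, 17], [250, 204]]  # recruiting source data
row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
n = sum(row_tot)
stat = sum(
    (observed[i][j] - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
    for i in range(len(row_tot)) for j in range(len(col_tot))
)
df = (len(row_tot) - 1) * (len(col_tot) - 1)
p_value = chi2_upper_tail(stat, df)
print("chi-squared:", round(stat, 2), "df:", df, "p-value:", round(p_value, 3))
```

For these data the p-value should fall between 0.01 and 0.05, which the table above interprets as moderately strong evidence that recruiting source and performance are associated.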

The p-value for the test can be found because the χ² test statistic has approximately a chi-squared distribution. This approximation is close for most data sets that are encountered, but is less so when the sample size, n, is small. The guidelines that are often given suggest that the p-value can be relied on if:

If the cell counts are small enough that these conditions do not hold, the p-value is less reliable. (But advanced statistical methods are required to do better!)

Simulation: Independent variables

When random samples are repeatedly taken from a model in which the row and column variables are independent, the p-value is usually quite large (since H₀ is true), but

About 1 in 10 samples result in a p-value under 0.1
About 1 in 20 samples result in a p-value under 0.05
About 1 in 100 samples result in a p-value under 0.01

An 'unlucky sample' might mislead you into erroneously concluding that the variables are dependent.
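The proportions listed above can be reproduced approximately with a small simulation. The sketch below makes several assumptions that are not in the original: a 2 × 3 table, hypothetical marginal probabilities, samples of 200 individuals, and 2000 repetitions. With 2 degrees of freedom the upper tail probability of the chi-squared distribution is exactly exp(-χ²/2), which keeps the code free of external libraries.

```python
import math
import random

# Hypothetical marginal probabilities for an independent model.
ROW_PROBS = [0.4, 0.6]
COL_PROBS = [0.3, 0.3, 0.4]


def sample_table(n, rng):
    """Draw n (row, column) pairs independently and tabulate the counts."""
    rows = rng.choices(range(len(ROW_PROBS)), weights=ROW_PROBS, k=n)
    cols = rng.choices(range(len(COL_PROBS)), weights=COL_PROBS, k=n)
    table = [[0] * len(COL_PROBS) for _ in ROW_PROBS]
    for i, j in zip(rows, cols):
        table[i][j] += 1
    return table


def p_value_2x3(table):
    """Chi-squared p-value for a 2 x 3 table (df = 2).

    Assumes every row and column total is positive, which is
    practically certain for the sample size used here.
    """
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for row, r_tot in zip(table, row_totals):
        for o, c_tot in zip(row, col_totals):
            e = r_tot * c_tot / n
            chi2 += (o - e) ** 2 / e
    return math.exp(-chi2 / 2)  # exact upper tail for 2 degrees of freedom


def rejection_rates(samples=2000, n=200, seed=0):
    """Proportion of simulated samples whose p-value falls under each level."""
    rng = random.Random(seed)
    p_values = [p_value_2x3(sample_table(n, rng)) for _ in range(samples)]
    return {level: sum(p < level for p in p_values) / samples
            for level in (0.1, 0.05, 0.01)}


for level, rate in rejection_rates().items():
    print(f"proportion of samples with p-value under {level}: {rate:.3f}")
```

Each printed proportion should be near its level, since H₀ is true in every simulated sample.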

13.2.7 Examples

If it is concluded that dependence is likely in the table, you should examine carefully the cells of the table where there are the biggest mismatches between the observed and estimated cell counts. This should help you to discover the nature of the dependence.
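One common way to locate the biggest mismatches is to compute, for each cell, (observed - estimated) / sqrt(estimated). These values are usually called Pearson residuals, a name that is not used in the original text, and their squares add up to χ². The sketch below applies this to the tourist table from the start of the section; the function names are hypothetical.

```python
import math

row_labels = ["Some high school", "High school degree", "Some college",
              "College degree", "Graduate degree"]
col_labels = ["Yes", "No"]  # information seeker?
observed = [
    [13, 27],
    [64, 118],
    [100, 123],
    [59, 69],
    [67, 46],
]


def pearson_residuals(table):
    """(observed - estimated) / sqrt(estimated) for every cell."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    residuals = []
    for row, r_tot in zip(table, row_totals):
        estimated = [r_tot * c_tot / n for c_tot in col_totals]
        residuals.append([(o - e) / math.sqrt(e) for o, e in zip(row, estimated)])
    return residuals


def ranked_cells(table):
    """Cells ordered from the largest to the smallest absolute residual."""
    residuals = pearson_residuals(table)
    cells = [(row_labels[i], col_labels[j], residuals[i][j])
             for i in range(len(table)) for j in range(len(table[0]))]
    return sorted(cells, key=lambda cell: abs(cell[2]), reverse=True)


for row_name, col_name, residual in ranked_cells(observed)[:4]:
    # A positive residual means more individuals than independence predicts.
    print(f"{row_name:20s} {col_name:4s} residual = {residual:+.2f}")
```

For this table the largest residuals are in the Graduate degree row: more graduates are information seekers than would be expected under independence.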

Examples

The chi-squared test for association is applied to a few real contingency tables below.

In some examples, the value of χ² is so far into the upper tail of the reference distribution that we are almost certain that the row and column variables are dependent. In others, the value of χ² is small enough that it could have arisen by chance even if the variables are independent in the underlying population.

13.2.8 Comparing groups

Contingency tables often arise from bivariate categorical data. However, they can also arise from univariate categorical data that are recorded separately from several groups.

Effect of false claims in adverts

In a study to assess how false or misleading adverts affect consumers, one group of 100 experimental subjects was exposed to a series of adverts falsely claiming that a new brand of coffee contained 'no bitterness'. These subjects and another control group of 100 people who had not seen the adverts were given a sample of coffee that had been prepared to be intentionally bitter. The contingency table below shows whether the subjects reported the coffee as 'having bitterness'.

	False advert	No advert	Total
Coffee described as bitter	68	89	157
Coffee described as not bitter	32	11	43
Total	100	100	200

In this example, two different groups of people were used in the experiment. The column variable (distinguishing between the two types of advert) is not a random variable, as it is controlled by the experiment. A single categorical measurement (whether or not the coffee sample was bitter) was made from each person.

Although the chi-squared test was motivated as a test of independence of two categorical variables, the same test can be used when each row (or column) of a contingency table corresponds to a separate group of individuals.

The χ² test statistic and p-value are identical to those given earlier for testing independence.

Examples

In the following examples, we test whether the 'response' proportions are the same in several groups.

Note again that a visual comparison of the observed counts and those estimated from the margins assuming independence helps to explain the nature of the relationship in examples where we conclude that there is some difference between the groups.

In the special case where there are two groups and the categorical measurement has two categories (that we will call 'success' and 'failure'), the chi-squared test is testing whether the probability of success is the same in both groups. For example, in the Bitter Coffee data set, we are testing whether the probability of reporting that the coffee was bitter is the same for the group that saw the false adverts and the group that did not.
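For the Bitter Coffee table, this comparison can be sketched numerically in two ways: with the chi-squared statistic, and with a two-sample z statistic for the difference between two proportions. The z statistic is not defined in this section, so the pooled form used below is an assumption (it is the standard version), and the function names are hypothetical. For a table with 1 degree of freedom, the chi-squared upper tail probability is erfc(sqrt(χ²/2)).

```python
import math

# Bitter Coffee data: number describing the coffee as bitter in each group.
bitter = [68, 89]         # false advert group, no advert group
group_sizes = [100, 100]


def chi_squared_2x2(successes, sizes):
    """Chi-squared statistic for a 2 x 2 table built from two groups."""
    table = [successes, [n - s for s, n in zip(successes, sizes)]]
    n = sum(sizes)
    row_totals = [sum(row) for row in table]
    return sum((o - r * c / n) ** 2 / (r * c / n)
               for row, r in zip(table, row_totals)
               for o, c in zip(row, sizes))


def pooled_z(successes, sizes):
    """z statistic for equal success probabilities, using the pooled proportion."""
    p1 = successes[0] / sizes[0]
    p2 = successes[1] / sizes[1]
    pooled = sum(successes) / sum(sizes)
    se = math.sqrt(pooled * (1 - pooled) * (1 / sizes[0] + 1 / sizes[1]))
    return (p1 - p2) / se


chi2 = chi_squared_2x2(bitter, group_sizes)
z = pooled_z(bitter, group_sizes)

p_chi2 = math.erfc(math.sqrt(chi2 / 2))  # upper tail, 1 degree of freedom
p_z = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability

print(f"chi-squared = {chi2:.3f}, p-value = {p_chi2:.5f}")
print(f"z = {z:.3f}, z squared = {z * z:.3f}, two-sided p-value = {p_z:.5f}")
```

With these definitions z² equals χ², so the two p-values agree.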

Fortunately, although the two tests have been motivated in a different way, it can be proved that:
