Chapter 13   Independence

13.1   Probability and applications

  1. Joint probabilities
  2. Marginal probabilities
  3. Conditional probabilities
  4. Graphical display of probabilities
  5. Calculations with probabilities

13.1.1   Joint probabilities

Data sets with two categorical variables

In this section, we explain how to model data sets that consist of pairs of categorical values — bivariate categorical data.

Examples where pairs of categorical variables are measured from 'individuals' are:

'Individuals'                    Variable X                                    Variable Y
Employees in a large company     Sex (M or F)                                  Education (none, high school or tertiary)
Customers leaving supermarket    Checkout operator type (full- or part-time)   Rating of quality of service (poor, OK or good)
TVs leaving a production line    Assembler (A, B, C or D)                      Status (defective or OK)

In each case, the data that are collected are pairs of categorical values that are measurements of the two variables. Data of this form are usually summarised in a contingency table.

Requests for promotional material by travelers

Travel agents provide 'destination-specific travel literature' about activities, facilities and prices to tourists free of charge on request. A study was made to investigate the differences between information seekers (who requested such literature) and nonseekers, with the aim of better targeting such material.

A sample of 686 tourists was selected and each was classified as an information seeker or non-seeker and in various other ways including educational level.

Tourist    Educational level        Information seeker?    
1 High school degree Yes
2 College degree Yes
3 Some high school No
... ... ...

The data from the 686 tourists are summarised in the contingency table below.

  Information seeker?  
Education     Yes         No        Total   
  Some high school 13 27 40
  High school degree    64 118 182
  Some college    100 123 223
  College degree    59 69 128
  Graduate degree    67 46 113
Total 303 383 686

Joint probabilities

To model bivariate categorical data, we assume an underlying population of pairs of categorical values. The data are treated as a random sample of pairs of values from this population. A real finite population occasionally underlies the data, but we must usually hypothesise an infinite underlying population.

The proportion of times that the pair of categorical values (x, y) occurs in the population — its probability — is denoted by pxy. The probabilities pxy are called the joint probabilities for the two variables.

Gambling simulation

A gambler draws a card from a shuffled deck and also tosses a coin, resulting in a pair of categorical values,

Variable Possible values
Coin side, X   Head or Tail
Card suit, Y   Heart, Club, Diamond or Spade

Each of the eight possible combinations of coin side and card suit is equally likely and would occur in the same proportions in the underlying population.

The probabilities for all pairs are therefore the same,

$$p_{xy} = \frac{1}{8}$$

These joint probabilities are shown in blue in the table below.

The two lower tables (in black) describe 100 pairs of values sampled from this population — 100 coin-card pairs. The first of these tables shows the sample as a contingency table of counts; the other table displays the counts as proportions.

Click Take sample a few times to see the variability in samples of 100 coin-card pairs. Increase the sample size to 500 and repeat. Observe that the sample proportions are less variable when the sample size is large.
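The sampling experiment can also be imitated in a few lines of code. The sketch below is only an illustration and is not the applet itself: the function name `sample_coin_card` and the fixed seed are assumptions made for this example. It draws coin-card pairs from a model in which all eight joint probabilities equal 1/8 and reports how widely the eight sample proportions are spread.

```python
import random
from collections import Counter

COINS = ["Head", "Tail"]
SUITS = ["Heart", "Club", "Diamond", "Spade"]

def sample_coin_card(n, seed=None):
    """Draw n independent (coin, suit) pairs and return their sample proportions."""
    rng = random.Random(seed)
    counts = Counter((rng.choice(COINS), rng.choice(SUITS)) for _ in range(n))
    # Include every combination, even one that never occurred in the sample.
    return {(c, s): counts[(c, s)] / n for c in COINS for s in SUITS}

for n in (100, 500):
    props = sample_coin_card(n, seed=1)
    spread = max(props.values()) - min(props.values())
    print(n, round(spread, 3))  # gap between largest and smallest proportion
```

Running the loop with different seeds gives a rough feel for how the proportions settle towards 1/8 as the sample size grows.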

Interest in the model

We are usually interested in the joint probabilities in the underlying model, rather than the corresponding proportions from the sample data that have been collected (the contingency table). However the joint probabilities are unknown parameters in most practical situations and must be estimated from the sample data.

Requests for promotional material by travelers

In situations of practical importance, the underlying probabilities are unknown. In the data set that was collected to examine which travelers request promotional material, the 686 travelers in the study were not the focus of attention — the researcher wanted to generalise to all other similar travelers.

The population proportions are unknown, but the sample proportions provide estimates of them.
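As a minimal sketch of this estimation step, the code below converts the counts in the tourist contingency table into sample proportions. The variable names are chosen for this illustration only.

```python
# Counts from the tourist contingency table: (Yes, No) for each education level.
counts = {
    "Some high school": (13, 27),
    "High school degree": (64, 118),
    "Some college": (100, 123),
    "College degree": (59, 69),
    "Graduate degree": (67, 46),
}

n = sum(yes + no for yes, no in counts.values())  # total number of tourists

# Sample proportions, used as estimates of the unknown joint probabilities.
estimates = {
    (level, answer): count / n
    for level, (yes, no) in counts.items()
    for answer, count in (("Yes", yes), ("No", no))
}

for key, p in estimates.items():
    print(key, round(p, 4))
```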

13.1.2   Marginal probabilities

Probabilities for a single variable

A model for two categorical variables is characterised by the joint probabilities pxy. However we sometimes want to restrict attention to one of these variables on its own. The marginal probabilities for the variable X are defined and interpreted in a similar way to the marginal proportions that were defined earlier for bivariate categorical data.

The marginal probability, px, for the variable X is the proportion of (x, y) pairs in the population for which the value of X is x. For example, consider a situation where we are interested in the hair colour and eye colour of teenagers. The number of blue-eyed teenagers is the sum of those with either (blue eyes and blonde hair) or (blue eyes and brunette hair) or ... or (blue eyes and red hair),

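$$n(\text{blue eyes}) = n(\text{blue eyes and blonde hair}) + n(\text{blue eyes and brunette hair}) + \dots + n(\text{blue eyes and red hair})$$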

The same holds for the proportion with blue eyes — its marginal probability,

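$$p(\text{blue eyes}) = p(\text{blue eyes and blonde hair}) + p(\text{blue eyes and brunette hair}) + \dots + p(\text{blue eyes and red hair})$$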

This is generalised with the formula

$$p_x = \sum_y p_{xy}$$

where the right-hand side of the equation denotes summing the joint probabilities over all possible values of y. There is a similar formula for the marginal probabilities of the other variable,

$$p_y = \sum_x p_{xy}$$

Eye strain for office workers

It is difficult to find illustrative examples since population probabilities are unknown in most 'interesting' applications. The following example is based on a real data set which classifies 295 office workers by their type of work and whether they have symptoms of eye strain.

Data from 295 office workers
Type of work No eye strain Eye strain
VDU data entry 42 11
General VDU use 79 30
Full-time typing 64 14
Standard clerical work 52 3

We do not know the underlying population joint probabilities for workers in this type of office in general. However, to provide an illustrative example, we will pretend that the population probabilities are equal to the proportions in this data set. For example, we will pretend that the joint probability for a worker doing VDU data entry and not having eye strain is 42 / 295 = 0.1424.

Probabilities for office workers in general
Type of work No eye strain Eye strain Total
VDU data entry 0.1424 0.0373 0.1797
General VDU use 0.2678 0.1017 0.3695
Full-time typing 0.2169 0.0475 0.2644
Standard clerical work 0.1763 0.0102 0.1864
Total 0.8034 0.1966 1.0000

The row and column totals of the table give the marginal probabilities for the two variables. For example, the marginal probability that a worker does VDU data entry is 0.1424 + 0.0373 = 0.1797.

The diagram below illustrates the summing of joint probabilities to give marginal ones with a 3-dimensional barchart of the joint probabilities.

Click the formula for the marginal probabilities of 'X' (the type of work) on the right. The bars stack to show the marginal probabilities for type of work.

Similarly, clicking the formula for the marginal probabilities of 'Y' stacks the bars to show the overall probability that a worker has eye strain.
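The summation of joint probabilities into marginal ones can be sketched in code as follows. As in the text, the population probabilities are pretended to equal the proportions in the office-worker data; the variable names are assumptions made for this illustration.

```python
# Counts of office workers: (no eye strain, eye strain) for each type of work.
counts = {
    "VDU data entry": (42, 11),
    "General VDU use": (79, 30),
    "Full-time typing": (64, 14),
    "Standard clerical work": (52, 3),
}
n = sum(sum(row) for row in counts.values())

# Pretend that the joint probabilities equal the sample proportions.
joint = {work: [c / n for c in row] for work, row in counts.items()}

# Marginal probabilities for X (type of work): sum each row over y.
p_x = {work: sum(row) for work, row in joint.items()}

# Marginal probabilities for Y (eye strain): sum each column over x.
p_y = [sum(col) for col in zip(*joint.values())]

print({work: round(p, 4) for work, p in p_x.items()})
print([round(p, 4) for p in p_y])
```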

13.1.3   Conditional probabilities

Probabilities in a sub-population

Complete population
The joint probabilities pxy and the marginal probabilities px and py all describe proportions in the complete population of (xy) pairs.
Sub-population
In contrast, it is sometimes meaningful to restrict attention to a subset of the (xy) pairs. For example, we may be interested only in pairs for which the first variable, X , has some particular value. Probabilities that relate to a sub-population are called conditional probabilities.

The concept of a conditional probability is similar to that of a conditional proportion that was described earlier for bivariate categorical data sets.

Conditional probabilities for Y, given X = x

Consider again hair colour (Y) and eye colour (X) in a population of teenagers. The probability of a teenager being blonde, conditional on blue eyes, is the proportion of blondes within the sub-population with blue eyes. The conditional probability is most easily understood as the number in the population with both blonde hair and blue eyes, divided by the number with blue eyes.

$$p(\text{blonde} \mid \text{blue}) = \frac{n(\text{blonde hair and blue eyes})}{n(\text{blue eyes})}$$

However if the population is infinite, it is better to express it in terms of probabilities as the ratio of a joint and marginal probability (an equivalent definition for finite populations).

$$p(\text{blonde} \mid \text{blue}) = \frac{p(\text{blonde hair and blue eyes})}{p(\text{blue eyes})}$$

The general definition of the conditional probabilities for Y given that the value of X is x is

$$p_{y \mid x} = \frac{p_{xy}}{p_x}$$

Conditional probabilities as a rescaling of joint probabilities

The conditional probabilities for Y, given X  = x , can therefore be found by rescaling of that row of the table of joint probabilities (dividing by px) so that the row sums to 1.0, as shown in the diagram below.

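[Diagram: a row of the table of joint probabilities is divided by its marginal probability px to give conditional probabilities that sum to 1.0]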

Two sets of conditional probabilities

Note that there is an equivalent formula for conditional probabilities for X given the value of Y that corresponds to using the other variable to define the sub-population. When we restrict attention to population values for which Y  has the value y , the conditional probabilities for X are

$$p_{x \mid y} = \frac{p_{xy}}{p_y}$$

You should be careful to distinguish between px given y and py given x.

The probability of being pregnant, given that a randomly selected person is female, would be fairly small. The probability of being female, given that a person is pregnant, is 1.0.
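The difference between the two sets of conditional probabilities can be made concrete with a short calculation. The sketch below reuses the office-worker counts from the marginal probabilities page, again pretending that the sample proportions are the population probabilities; the variable names are assumptions made for this illustration.

```python
# Counts of office workers: (no eye strain, eye strain) for each type of work.
counts = {
    "VDU data entry": (42, 11),
    "General VDU use": (79, 30),
    "Full-time typing": (64, 14),
    "Standard clerical work": (52, 3),
}
n = sum(sum(row) for row in counts.values())
joint = {x: [c / n for c in row] for x, row in counts.items()}

p_x = {x: sum(row) for x, row in joint.items()}   # marginal for type of work
p_y = [sum(col) for col in zip(*joint.values())]  # marginal for eye strain

# p(y | x): divide each row by its marginal so that the row sums to 1.
y_given_x = {x: [p / p_x[x] for p in row] for x, row in joint.items()}

# p(x | y): divide each column by its marginal so that the column sums to 1.
x_given_y = {x: [p / p_y[j] for j, p in enumerate(row)] for x, row in joint.items()}

work = "Standard clerical work"
print("p(eye strain | clerical) =", round(y_given_x[work][1], 4))
print("p(clerical | eye strain) =", round(x_given_y[work][1], 4))
```

The two printed values share the same joint probability in the numerator but are divided by different marginal probabilities, which is why they differ.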

Support and grief state after neonatal death

The diagram below again shows the joint probabilities in a 3-dimensional barchart.

Click the formula for the conditional probabilities of 'Y' (grief state) given 'X' (the level of support). The bars for each level of support are separately scaled up to add to 1.0. Observe that

Click the formula for joint probabilities, then the formula for conditional probabilities of 'X' given 'Y'. This time the joint probabilities are separately scaled for mothers in different grief states. These conditional probabilities are less useful for understanding this example.

13.1.4   Graphical display of probabilities

Proportional Venn diagrams

Marginal and conditional probabilities are meaningful and useful summaries of the relationship between two categorical variables. Proportional Venn diagrams were used earlier to graphically display marginal and conditional proportions for bivariate categorical data sets. They can also be used in the same way to display marginal and conditional probabilities for a bivariate categorical model.

The proportional Venn diagram is drawn in a unit square (with both sides of length 1.0).

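[Diagram: a unit square split along one side by the marginal probabilities px, with each strip then split by the conditional probabilities py given x]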

Area = joint probability

The definition of the conditional probability py given x is

$$p_{y \mid x} = \frac{p_{xy}}{p_x}$$

and the relationship can be rewritten in the form

$$p_{xy} = p_{y \mid x} \times p_x$$

Since this is the product of the height and width of the rectangle representing categories x and y ,

The area of any rectangle in the diagram equals the joint probability of the categories it represents.

A similar diagram can be based on the marginal probabilities of Y and the conditional probabilities of X given Y, splitting the unit square first horizontally and then vertically. The areas of the resulting rectangles are again equal to the joint probabilities, so the two diagrams are just rearrangements of the same areas (the joint probabilities, pxy).

The use of the diagrams is best explained in an example.

Apple bruising

Before showing the relationship between joint, conditional and marginal probabilities, we illustrate the formulae for joint, conditional and marginal proportions.

The contingency table below describes bruising of 96 apples in a packing plant. The apples were classified by the variety of apple (Granny Smith or Fuji) and whether or not they were bruised. (The data are not real.)

  Bruised Not bruised
Granny Smith 40 8
Fuji 24 24

The diagram below shows a Proportional Venn diagram for the data. Note that the four areas are proportional to the numbers of apples for each combination of apple type and bruising.
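The dimensions of the four rectangles can be worked out directly from the contingency table. The sketch below is an illustration under the assumption that the square is first split by apple variety and then by bruising within each variety; the dictionary layout and names are hypothetical.

```python
# Apple counts by variety and bruising status.
counts = {
    "Granny Smith": {"Bruised": 40, "Not bruised": 8},
    "Fuji": {"Bruised": 24, "Not bruised": 24},
}
n = sum(sum(row.values()) for row in counts.values())

rectangles = {}
for variety, row in counts.items():
    variety_total = sum(row.values())
    marginal = variety_total / n  # one side: marginal proportion for the variety
    for status, count in row.items():
        conditional = count / variety_total  # other side: proportion given the variety
        # Area = marginal x conditional, which equals the joint proportion count / n.
        rectangles[(variety, status)] = (marginal, conditional, marginal * conditional)

for key, (side_a, side_b, area) in rectangles.items():
    print(key, round(side_a, 3), round(side_b, 3), round(area, 3))
```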

World population by age and region

The table below shows the world population in 2002, categorised by region and by age group.

World population (millions)
                               Age
                               0-19      20-64     65+
Africa and Near East           526.6     455.6     34.9
Asia                           1,340.0   1,964.2   216.7
America, Europe and Oceania    522.6     981.8     188.7

Consider randomly selecting one person in the world. The joint probabilities for this person being in each age/region are obtained by dividing the above values by the total world population.

Joint probabilities
  Age
  0-19 20-64 65+
Africa and Near East 0.085 0.073 0.006
Asia 0.215 0.315 0.035
America, Europe and Oceania 0.084 0.158 0.030

Marginal and conditional probabilities can be obtained using formulae from the previous pages. The proportional Venn diagram below displays them graphically.
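As a hedged sketch of these calculations, the code below starts from the population figures (in millions), forms the joint probabilities and then derives the marginal probabilities for region and the conditional probabilities for age group given region. The variable names are assumptions made for this illustration.

```python
# World population in 2002 (millions) by region and age group.
ages = ["0-19", "20-64", "65+"]
population = {
    "Africa and Near East": [526.6, 455.6, 34.9],
    "Asia": [1340.0, 1964.2, 216.7],
    "America, Europe and Oceania": [522.6, 981.8, 188.7],
}
total = sum(sum(row) for row in population.values())

# Joint probabilities: each cell divided by the world total.
joint = {region: [v / total for v in row] for region, row in population.items()}

# Marginal probabilities for region (row sums) and age group (column sums).
p_region = {region: sum(row) for region, row in joint.items()}
p_age = [sum(col) for col in zip(*joint.values())]

# Conditional probabilities for age group, given region.
age_given_region = {
    region: [p / p_region[region] for p in row] for region, row in joint.items()
}

for region in population:
    print(region, round(p_region[region], 3),
          [round(p, 3) for p in age_given_region[region]])
```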

The diagram initially splits the unit square horizontally using the marginal probabilities of Y (the probabilities of a random person being from each of the three regions). Each row is split according to the conditional probabilities for age group within that region. From the diagram, we can easily see that:

Click on any rectangle in the diagram to observe how its area equals the product of a marginal and conditional probability and therefore is the joint probability for the corresponding categories.

Click the rightmost formula under the diagram. The rectangles change in shape but retain the same areas to rearrange into vertical columns corresponding to the marginal probabilities for age group. Each column is split in proportion to the conditional probabilities of region given age group. From this version of the diagram, observe that

13.1.5   Calculations with probabilities

Marginal and conditional probabilities can be found from joint probabilities (and vice versa)

We have used three types of probability to describe a model for two categorical variables — the joint probabilities, the marginal probabilities for the two variables and the conditional probabilities for each variable given the value of the other variable. These sets of probabilities are closely related. Indeed, the model can be equivalently described by any of the following.

The diagram below shows how to find each set of probabilities from the others, using the formulae described in the earlier pages of this section.

[Diagram: relationships between joint, conditional and marginal probabilities]

Bayes theorem

In particular, note that it is possible to obtain the conditional probabilities for X given Y, px given y, from the marginal probabilities of X, px, and the conditional probabilities for Y given X, py given x. This can be expressed in a single formula that is called Bayes Theorem, but it is easier in practice to do the calculations in two steps, obtaining the joint probabilities, pxy, in the first step. There are several important applications of Bayes Theorem.

Fraudulent tax claims

Tax inspectors investigate some of the tax returns that are submitted by individuals if they think that some claims for expenses are too high or are unjustified.

An investigation of a tax return does not always conclude that the claims were fraudulent, since the inspectors' suspicions are rarely 100% accurate. There are two types of error:

There are commonly non-zero probabilities for each of these types of error. Consider tax inspectors who have probability 0.1 of investigating a correct claim and 0.2 of not investigating a bad claim. These are conditional probabilities and can be written formally as:

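$$P(\text{not investigated} \mid \text{bad claim}) = 0.2, \qquad P(\text{investigated} \mid \text{correct claim}) = 0.1$$

Since the probability (proportion) of investigating a bad claim is one minus the conditional probability of not investigating it (with a similar result for correct claims), the remaining conditional probabilities are

$$P(\text{investigated} \mid \text{bad claim}) = 0.8, \qquad P(\text{not investigated} \mid \text{correct claim}) = 0.9$$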

We will also assume that 10% of tax returns are bad claims. This corresponds to a marginal probability, P(bad claim) = 0.10.

The diagram below shows how these marginal probabilities for Y (claim type) and conditional probabilities for X (investigation) given Y can be used to obtain the conditional probabilities for Y (claim type) given X (investigation).

The initial information is shown in blue at the top of the diagram. The joint probabilities (green) are first found from them. Click on any value in the table of joint probabilities to see how it is related to the initial information.

Marginal probabilities for whether a claim is investigated are next obtained by adding the columns of joint probabilities. Click on any of the black marginal probabilities to see how they are obtained from the joint probabilities.

Finally the conditional probabilities for claim type (given whether the tax return has been investigated) are obtained from the joint probabilities and the marginal probabilities for investigation. Click on the conditional probabilities on the bottom right of the diagram to see the formula.
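The two-step calculation can be sketched in code. The function name `bayes_two_step` and the dictionary layout are assumptions made for this illustration; the inputs are the values stated in the text (probability 0.1 of investigating a correct claim, 0.2 of not investigating a bad claim, and 10% bad claims).

```python
def bayes_two_step(p_y, x_given_y):
    """Return p(y | x) from the marginals p(y) and the conditionals p(x | y).

    p_y:        dict mapping y to its marginal probability
    x_given_y:  dict mapping y to a dict of p(x | y) over the values of x
    """
    # Step 1: joint probabilities, p(x, y) = p(y) * p(x | y).
    joint = {
        (x, y): p_y[y] * p
        for y, cond in x_given_y.items()
        for x, p in cond.items()
    }
    # Step 2: marginal probabilities for x, then p(y | x) = p(x, y) / p(x).
    p_x = {}
    for (x, _), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return {x: {y: joint[(x, y)] / p_x[x] for y in p_y} for x in p_x}

p_claim = {"bad": 0.10, "correct": 0.90}
p_action_given_claim = {
    "bad": {"investigated": 0.8, "not investigated": 0.2},
    "correct": {"investigated": 0.1, "not investigated": 0.9},
}

posterior = bayes_two_step(p_claim, p_action_given_claim)
for action, probs in posterior.items():
    print(action, {claim: round(p, 3) for claim, p in probs.items()})
```

With these inputs the joint probability of a bad claim that is investigated is 0.10 x 0.8 = 0.08, and the marginal probability of investigation is 0.08 + 0.09 = 0.17, so fewer than half of the investigated claims are bad.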


Initially there might seem to be a contradiction between the two conditional probabilities,

$$P(\text{investigated} \mid \text{bad claim}) = 0.8$$

$$P(\text{bad claim} \mid \text{investigated}) = \frac{0.08}{0.17} \approx 0.471$$

However, the two probabilities are consistent since they have very different interpretations. The proportional Venn diagrams below help to explain the difference. The diagram on the left shows the marginal and conditional probabilities given in the question. The corresponding diagram on the right shows the marginal probabilities for whether claims are investigated and the conditional probabilities for good/bad claims.

Remember that the areas of the rectangles equal the joint probabilities and are therefore the same in both diagrams.

Drag the slider to alter the proportion of people who make bad tax claims in the population. (We assume that the conditional probabilities of investigating the claims remain the same.) Observe that:

13.2   Independence

  1. Association
  2. Independence
  3. Independence from samples
  4. Testing for independence
  5. Chi-squared test statistic
  6. P-value for chi-squared test
  7. Examples
  8. Comparing groups

13.2.1   Association

Relationships between numerical variables

When two or more measurements are made from each individual in a population, we are usually interested in whether these variables are related to each other. When both variables are numerical, the strength of the relationship can be described with a correlation coefficient and regression models allow us to test whether two variables are related on the basis of sample data.

Relationships between categorical variables

Two categorical measurements may also be related.

As with numerical variables, we may be able to conclude that any relationship between categorical variables is causal if it results from an experiment (e.g. a randomised experiment in which some pea seeds are coated and others are uncoated). From observational data however, we usually cannot deduce a causal relationship — all we can say is that the variables may be associated.

What does association mean?

We say that two variables are associated if knowledge of the value of one tells you something about the likely value of the other.

If the conditional distribution of Y given X = x depends on the value of x, we say that X and Y are associated.

For example, if the conditional distribution of the Job satisfaction of new employees given Job type = secretary is different from the conditional distribution of Job satisfaction given Job type = manager, then we say that Job satisfaction and Job type are associated.

On the next page, we will characterise two variables that are not associated, but first we give an example of variables that are related.

Absenteeism and weight

To illustrate the idea of association, we use a table of joint probabilities that constitute a possible model for absenteeism of employees in a supermarket chain and their weight.

Note that the joint probabilities in this model do not accurately represent the effect of weight on absenteeism — they are only used to illustrate the concepts.

Joint Probabilities
Attendance record
Poor Satisfactory Above average Marginal
Underweight 0.0450 0.0900 0.0150 0.1500
Normal 0.0825 0.3025 0.1650 0.5500
Overweight 0.0500 0.1200 0.0300 0.2000
Obese 0.0300 0.0650 0.0050 0.1000
Marginal 0.2075 0.5775 0.2150 1.0000

The implications of this model are best explained from conditional probabilities for attendance record, given weight:

Conditional Probabilities
Attendance record
Poor Satisfactory Above average Total
Underweight 0.30 0.60 0.10 1.0
Normal 0.15 0.55 0.30 1.0
Overweight 0.25 0.60 0.15 1.0
Obese 0.30 0.65 0.05 1.0

A proportional Venn diagram displays these conditional probabilities graphically.

If this model is correct, the conditional probability of poor attendance is lowest for staff with 'normal' weight, increasing as weight gets further from 'normal'. Similarly, the probability of above average attendance is highest for those with 'normal' weight.

13.2.2   Independence

Independence

If the conditional probabilities for Y are the same for all values of X, then Y is said to be independent of X.

If X and Y are independent, knowing the value of X does not give us any information about the likely value for Y.

Independence implies that the sub-populations corresponding to different values of X all contain values of Y in the same proportions.

Work performance and weight

As an example of independence, we continue with the (artificial) example on the previous page. We now show the relationship between weight and work performance (as assessed by a supervisor). In this model, weight and performance are independent — knowing someone's weight gives no clues as to that person's ability to do their job.

Joint Probabilities
Work performance
    Poor     Satisfactory Above average Marginal
Underweight 0.0225 0.1125 0.0150 0.1500
Normal 0.0825 0.4125 0.0550 0.5500
Overweight 0.0300 0.1500 0.0200 0.2000
Obese 0.0150 0.0750 0.0100 0.1000
Marginal 0.1500 0.7500 0.1000 1.0000

For this model, the conditional probabilities for work performance, given weight, are:

Conditional Probabilities
Work performance
    Poor     Satisfactory Above average Total
Underweight 0.15 0.75 0.10 1.0
Normal 0.15 0.75 0.10 1.0
Overweight 0.15 0.75 0.10 1.0
Obese 0.15 0.75 0.10 1.0

The conditional probabilities are the same for each weight, so knowing that an employee is, say, obese does not affect the probability of being rated as an above-average worker. The proportional Venn diagram has the form shown below.

Note that the Proportional Venn Diagram now consists of a grid of horizontal and vertical lines.

Mathematical definition of independence

If Y is independent of X, then:

$$p_{y \mid x} = p_y \quad \text{for all } x \text{ and } y$$

Also, if Y is independent of X, then X is also independent of Y.

Since the conditional and marginal probabilities are equal if Y and X are independent, an equivalent definition of independence is:

X and Y are independent if $p_{xy} = p_x \times p_y$ for all x and y.
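The product form of independence, in which every joint probability equals the product of its two marginal probabilities, can be checked numerically. The sketch below applies such a check to the two weight models from this section; the function name `is_independent` and the tolerance are assumptions made for this illustration.

```python
def is_independent(joint, tol=1e-9):
    """Check whether every joint probability equals the product of its marginals."""
    p_x = [sum(row) for row in joint]        # row marginals
    p_y = [sum(col) for col in zip(*joint)]  # column marginals
    return all(
        abs(p - px * py) <= tol
        for row, px in zip(joint, p_x)
        for p, py in zip(row, p_y)
    )

# Rows: underweight, normal, overweight, obese.
# Columns: poor, satisfactory, above average.
work_performance = [
    [0.0225, 0.1125, 0.0150],
    [0.0825, 0.4125, 0.0550],
    [0.0300, 0.1500, 0.0200],
    [0.0150, 0.0750, 0.0100],
]
attendance = [
    [0.0450, 0.0900, 0.0150],
    [0.0825, 0.3025, 0.1650],
    [0.0500, 0.1200, 0.0300],
    [0.0300, 0.0650, 0.0050],
]

print(is_independent(work_performance))
print(is_independent(attendance))
```

A tolerance is used rather than exact equality because sums and products of decimal fractions are not represented exactly in floating point.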


3-dimensional illustration of independence

The diagram below shows the joint probabilities in the model of independence above.

Click the formula for the conditional probability of Y given X. (This separately scales the bars for each X to have the same total, 1.0.) Observe that the distribution of performance is the same in each weight group.

Click the formula for the joint probabilities, then the formula for the conditional probabilities of X given Y. Observe that the distribution of weights is the same in each performance group.

13.2.3   Independence from samples

Assessing independence, based on a sample

Independence is an important special case of models for bivariate data. However it is a property of the joint population probabilities and in most practical situations these are unknown.

We must assess independence from a sample of individuals — a contingency table.


Recruiting source and success

A sample of 1,400 store clerks hired during 1979 by a large US retailing chain was selected by researchers who wanted to determine whether the recruiting source for employees is related to whether they perform satisfactorily in their job (determined from supervisor evaluations). Four recruiting sources were defined.

Sample Data
  Unsatisfactory Satisfactory Total
Employee referral 167 85 252
In-store notice 383 261 644
Employment agency 33 17 50
Media announcement 250 204 454
Total 833 567 1400

Independence would be an important characteristic of employment since it would imply that employees recruited from all sources have the same probability of satisfactory performance.

Are those sample data consistent with a model of independence?


Marginal distributions and independence

The marginal counts in a contingency table describe the univariate distributions of the two variables on their own, but do not tell you anything about their relationship. For example, the two contingency tables below have the same margins.

Strong relationship
  C1 C2 C3 Total
R1 30 0 0 30
R2 0 40 0 40
R3 0 0 30 30
Total 30 40 30 100
 
No relationship
  C1 C2 C3 Total
R1 9 12 9 30
R2 12 16 12 40
R3 9 12 9 30
Total 30 40 30 100

However, the first table supports an extremely strong relationship: if the row category is known, we can accurately predict the column category. On the other hand, there is no evidence of association in the second table, since each row of the table contains the column categories in the same proportions.

Estimated cell counts under independence

In practice, the pattern of counts in a contingency table is rarely so easily interpreted. A first step is to determine the pattern that is most consistent with independence of the rows and columns, based on the observed margins.

  C1 C2 C3 Total
R1 ? ? ? 30
R2 ? ? ? 40
R3 ? ? ? 30
Total 30 40 30 100

If the rows and columns are independent, the conditional probabilities are the same for each row, so we distribute each marginal row total between the column categories in the same proportions — determined by the marginal proportions for the column categories.

This pattern gives the estimated cell counts, and the following formula can be used to evaluate them.

$$e_{xy} = \frac{n_x \, n_y}{n}$$

where n denotes the total for the whole table and nx and ny denote the marginal totals for row x and column y.
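A minimal sketch of this formula in code is given below, applied to the margins 30, 40, 30 used in the tables above. The function name `expected_counts` is an assumption made for this illustration.

```python
def expected_counts(row_totals, col_totals):
    """Estimated cell counts under independence: e_xy = n_x * n_y / n."""
    n = sum(row_totals)
    if sum(col_totals) != n:
        raise ValueError("row and column totals must add to the same overall total")
    return [[r * c / n for c in col_totals] for r in row_totals]

for row in expected_counts([30, 40, 30], [30, 40, 30]):
    print(row)
```

For these margins the result reproduces the 'No relationship' table, in which every row contains the column categories in the same proportions.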

Recruiting source and success

We now find the pattern of estimated cell counts for the recruitment data that is most consistent with independence of recruiting source and success, based only on the margins of the observed contingency table.

Sample Data
  Unsatisfactory Satisfactory Total
Employee referral ? ? 252
In-store notice ? ? 644
Employment agency ? ? 50
Media announcement ? ? 454
Total 833 567 1400

If success is indeed independent of recruitment, then we estimate that the proportion of the 252 recruited from 'Employee referral' who are unsatisfactory would be the same as the marginal proportion who are unsatisfactory. Since 833 out of the total 1400 in the study are unsatisfactory, we therefore expect that the number recruited from 'Employee referral' who are unsatisfactory would be

$$e_{\text{referral, unsatisfactory}} = \frac{252 \times 833}{1400} = 149.9$$

This is an example of the general formula that was presented earlier,

$$e_{xy} = \frac{n_x \, n_y}{n}$$

The complete table of estimated cell counts is shown below, with each estimated count in parentheses alongside the observed count. If recruitment and success are indeed independent, then the observed cell counts in the sample data should be similar to these estimated cell counts.

Observed and estimated cell counts (estimated counts in parentheses)
                      Unsatisfactory   Satisfactory    Total
Employee referral     167 (149.9)      85 (102.1)      252
In-store notice       383 (383.2)      261 (260.8)     644
Employment agency     33 (29.8)        17 (20.2)       50
Media announcement    250 (270.1)      204 (183.9)     454
Total                 833              567             1400

Comparison of observed and estimated cell counts

The hypothesis of independence is assessed by asking whether the observed and estimated cell counts are 'sufficiently close' — are the observed counts consistent with the counts estimated under independence? We address this formally in the following pages.

13.2.4   Testing for independence

Comparison of observed and estimated cell counts

The hypothesis of independence is assessed by asking whether the observed and estimated cell counts are 'sufficiently close' — are the observed counts consistent with the counts estimated under independence?

Recruiting source and success

If the recruitment source and work performance are indeed independent, then the observed cell counts in the sample data should be similar to these estimated cell counts.

Observed and estimated cell counts (estimated counts in parentheses)
                      Unsatisfactory   Satisfactory    Total
Employee referral     167 (149.9)      85 (102.1)      252
In-store notice       383 (383.2)      261 (260.8)     644
Employment agency     33 (29.8)        17 (20.2)       50
Media announcement    250 (270.1)      204 (183.9)     454
Total                 833              567             1400

Hypotheses

Did a sample contingency table come from a population in which the categorical row and column variables, X and Y are independent? This question can be formalised as the hypothesis test,

H0: X and Y are independent
HA: X and Y are not independent

Possible test statistic?

In order to assess whether the data are consistent with the null hypothesis, we ask whether the observed cell counts in the contingency table, nxy, are similar to the estimated cell counts based on independence, exy. The simplest measure of their match is the sum of squares of the differences,

$$\sum_{x,y} (n_{xy} - e_{xy})^2$$

Small values of this statistic are expected when there is independence in the underlying population. However it does not behave entirely as desired. To be useful, a test statistic must have a known distribution when H0 is true and, ideally, this distribution should not depend too much on specific characteristics of the problem.

The raw sum of squares has a distribution that depends on the sample size and on the marginal probabilities.

It would be very unusual for a cell in a contingency table with estimated cell count exy = 1 to have observed cell count nxy = 11. However, if the estimated cell count is exy = 1001, then sampling variability would mean that an observed cell count of nxy = 1011 would not be unusual. Yet the difference is the same in both cases.

The raw sum of squares must be interpreted differently, depending on the size of the estimated cell counts, so it is a bad test statistic.


Distribution of sum of squares

The blue values in the contingency table on the left below have been sampled from a population in which each of the row categories is equally likely (with marginal probability 1/3), each column category is equally likely (marginal probability 1/3) and the row and column categories are independent. All joint probabilities are therefore known to be 1/9.

Click Sample a few times to observe the variability of the blue observed counts, nxy.

The estimated counts, exy, obtained from the margins of the table, are also shown in red. Observe the variability in the differences and their sum of squares.

Increase the sample size from 100 to 1000 and repeat. Observe that the differences are usually higher. Increase the sample size to 10000 and observe that the statistic is usually higher still.

The distribution of the sum of squares depends on the sample size, so it is not an easily interpreted measure of independence.
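The dependence on sample size can be seen in a small simulation. The sketch below is an illustration rather than a reproduction of the applet: the function names, the number of repetitions and the fixed seed are all assumptions. It samples 3 x 3 tables from the model in which all joint probabilities are 1/9 and averages the raw sum of squares.

```python
import random

def sample_table(n, rows=3, cols=3, rng=random):
    """Sample n pairs with independent, equally likely row and column categories."""
    table = [[0] * cols for _ in range(rows)]
    for _ in range(n):
        table[rng.randrange(rows)][rng.randrange(cols)] += 1
    return table

def raw_sum_of_squares(table):
    """Sum of squared differences between observed and estimated cell counts."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        for i in range(len(row_totals))
        for j in range(len(col_totals))
    )

def mean_raw_ss(n, reps=200, seed=0):
    """Average raw sum of squares over repeated samples of size n."""
    rng = random.Random(seed)
    return sum(raw_sum_of_squares(sample_table(n, rng=rng)) for _ in range(reps)) / reps

for n in (100, 1000):
    print(n, round(mean_raw_ss(n), 1))
```

The average for samples of 1000 should come out several times larger than that for samples of 100, even though both sets of tables are drawn from the same independent model.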


13.2.5   Chi-squared test statistic

A better test statistic

The raw sum of squares on the previous page is a poor way to assess whether a contingency table has been sampled from a population with independence. A better statistic is χ2 (pronounced chi-squared), defined by

$$\chi^2 = \sum_{x,y} \frac{(n_{xy} - e_{xy})^2}{e_{xy}}$$

This more fairly assesses differences between nxy and exy when the exy vary in magnitude. Its distribution still depends on the number of rows and columns in the contingency table, but is no longer affected by either the number of individuals (the total count for the table) or the margins of the table.
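As a hedged illustration, the statistic can be computed for the recruiting data from the earlier pages. The function name `chi_squared` is an assumption made for this sketch, and the code assumes that no row or column total is zero.

```python
def chi_squared(observed):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    n = sum(map(sum, observed))
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, n_xy in enumerate(row):
            e_xy = row_totals[i] * col_totals[j] / n  # estimated count under independence
            stat += (n_xy - e_xy) ** 2 / e_xy
    return stat

# Rows: employee referral, in-store notice, employment agency, media announcement.
# Columns: unsatisfactory, satisfactory.
recruiting = [[167, 85], [383, 261], [33, 17], [250, 204]]
print(round(chi_squared(recruiting), 2))
```

Dividing each squared difference by its estimated count means that a difference of given size contributes more when the estimated count is small than when it is large.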

Only the number of rows and number of columns in the table have much influence on the distribution of χ2.


Simulation

The diagram below again samples from two independent categorical variables.

Click Accumulate then take several samples to build up the distribution of the χ2 statistic.

Now increase the sample size and repeat. Observe that χ2 is approximately the same magnitude (usually between 0.5 and 15.0) regardless of the sample size.

Finally, use the pop-up menu labelled Model to change the model to one where the marginal probabilities for the two categorical variables are unequal (but there is still independence). Observe that the distribution of χ2 remains approximately the same.

Distribution of chi-squared statistic

When there is independence, the χ2 statistic has approximately a standard distribution called a chi-squared distribution whose shape only depends on the number of rows and columns in the table but not the sample size or the underlying joint probabilities.

If a contingency table with r rows and c columns is sampled from a population with independence, χ2 has a chi-squared distribution with (r - 1)(c - 1) degrees of freedom.

The chi-squared distribution is skewed to the right, and its mean equals its degrees of freedom.
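The link between the table dimensions and the mean can be checked by simulation. The sketch below assumes independent row and column variables with equally likely categories and a sample size of 300 per table; these settings and the function names are illustrative assumptions, not taken from the original diagrams.

```python
import random

def chi_squared(table):
    """Chi-squared statistic with e_xy = row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for x, r in enumerate(row_totals):
        for y, c in enumerate(col_totals):
            e_xy = r * c / n
            if e_xy > 0:  # skip cells in an empty row or column
                total += (table[x][y] - e_xy) ** 2 / e_xy
    return total

def simulate_mean_chi_squared(rows, cols, n=300, repetitions=1000, seed=1):
    """Average chi-squared over repeated samples from an independence model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repetitions):
        table = [[0] * cols for _ in range(rows)]
        for _ in range(n):
            table[rng.randrange(rows)][rng.randrange(cols)] += 1
        total += chi_squared(table)
    return total / repetitions

for rows, cols in [(2, 2), (3, 3)]:
    df = (rows - 1) * (cols - 1)
    print(rows, cols, "df =", df, "simulated mean =", round(simulate_mean_chi_squared(rows, cols), 2))
```

The simulated means should lie close to 1 and 4, the degrees of freedom for 2 by 2 and 3 by 3 tables.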


Shape of the chi-squared distribution

The diagram below shows the probability density function for the chi-squared distribution.

Use the pop-up menus to change the number of rows and columns in the table, and observe how the shape of the distribution changes with the degrees of freedom.

13.2.6   P-value for chi-squared test

Testing for independence

We now formally describe a hypothesis test for whether two categorical variables are independent.

H0: X and Y are independent
HA: X and Y are not independent

We have seen that the χ2 statistic

$$\chi^2 = \sum_{x,y} \frac{(n_{xy} - e_{xy})^2}{e_{xy}}$$

describes whether the observed counts in a contingency table, nxy, are close to those expected for independent variables.

X and Y are independent
χ2 has (approximately) a chi-squared distribution with no unknown parameters
X and Y are associated
The pattern of observed counts, nxy, is expected to be different from that of the exy, so χ2 is expected to be larger.

P-value

In a similar way to other hypothesis tests, we evaluate a p-value — the probability of getting such an extreme χ2 when the two variables are independent (H0).

If the p-value is close to zero, we conclude that the observed table would be unlikely for independent variables, so there is evidence that the variables are associated.

p-value                  Interpretation
over 0.1                 no evidence against the null hypothesis (independence)
between 0.05 and 0.1     very weak evidence of dependence between the row and column variables
between 0.01 and 0.05    moderately strong evidence of dependence between the row and column variables
under 0.01               strong evidence of dependence between the row and column variables
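The p-value is the upper tail probability of the chi-squared distribution beyond the observed statistic. The sketch below evaluates that tail using the standard closed-form series for integer degrees of freedom, so that only the Python standard library is needed, and then applies the interpretation table above. The function names are hypothetical; in practice a statistical package would normally be used instead.

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-squared with df degrees of freedom >= x), for integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: a normal tail term plus a finite series.
    tail = math.erfc(math.sqrt(x / 2))
    term, total = 1.0, 0.0
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= x / (2 * k - 1)
        total += term
    return tail + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total

def interpret(p_value):
    """Wording taken from the interpretation table for the test."""
    if p_value > 0.1:
        return "no evidence against independence"
    if p_value > 0.05:
        return "very weak evidence of dependence"
    if p_value > 0.01:
        return "moderately strong evidence of dependence"
    return "strong evidence of dependence"

# Illustrative value: chi-squared = 9.49 from a 3 x 3 table, so df = (3-1)(3-1) = 4.
p = chi2_upper_tail(9.49, 4)
print(round(p, 3), interpret(p))
```

For 4 degrees of freedom, a statistic of about 9.49 sits at the boundary where the p-value is close to 0.05.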

Warning about low estimated cell counts

The p-value for the test can be found because the χ2 test statistic has approximately a chi-squared distribution. This approximation is close for most data sets that are encountered, but is less so when the sample size, n, is small. The guidelines that are often given suggest that the p-value can be relied on if:

If the cell counts are small enough that these conditions do not hold, the p-value is less reliable. (But advanced statistical methods are required to do better!)

Simulation: Independent variables

The diagram below shows a random sample from a model in which the row and column variables are independent. It also illustrates how the p-value is evaluated.

Click Take sample a few times to generate other samples from the model. Observe that the p-value is usually quite large (since H0 is true), but small p-values still occur occasionally by chance.


An 'unlucky sample' might mislead you into erroneously concluding that the variables are dependent.
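How often an 'unlucky sample' occurs can be estimated by simulation. The sketch below assumes a 3 by 3 table with independent, equally likely categories and 300 individuals per sample. With 4 degrees of freedom the upper tail probability has the simple closed form exp(-x/2)(1 + x/2). The settings and function names are illustrative assumptions.

```python
import math
import random

def chi_squared(table):
    """Chi-squared statistic with e_xy = row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (table[x][y] - r * c / n) ** 2 / (r * c / n)
        for x, r in enumerate(row_totals)
        for y, c in enumerate(col_totals)
        if r > 0 and c > 0  # skip cells in an empty row or column
    )

def p_value_3x3(chi2):
    """Upper tail probability for 4 degrees of freedom."""
    return math.exp(-chi2 / 2) * (1 + chi2 / 2)

def proportion_below(alpha=0.05, n=300, repetitions=1000, seed=1):
    """Proportion of samples from an independence model with p-value below alpha."""
    rng = random.Random(seed)
    small = 0
    for _ in range(repetitions):
        table = [[0] * 3 for _ in range(3)]
        for _ in range(n):
            table[rng.randrange(3)][rng.randrange(3)] += 1
        if p_value_3x3(chi_squared(table)) < alpha:
            small += 1
    return small / repetitions

print(proportion_below())
```

Even though H0 is true for every sample, roughly 5% of the samples should give a p-value below 0.05.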

13.2.7   Examples

Analysing dependence

If it is concluded that dependence is likely in the table, you should examine carefully the cells of the table where there are the biggest mismatches between the observed and estimated cell counts. This should help you to discover the nature of the dependence.

Examples

The chi-squared test for association is applied to a few real contingency tables below.

In some examples, the value of χ2 is so far into the upper tail of the reference distribution that we are almost certain that the row and column variables are dependent. In others, the value of χ2 is small enough that it could have arisen by chance even if the variables are independent in the underlying population.

13.2.8   Comparing groups

Contingency tables from univariate data in several groups

Contingency tables often arise from bivariate categorical data. However, they can also arise from univariate categorical data that are recorded separately for several groups.

'Group membership' can be treated as a second categorical variable.

Effect of false claims in adverts

In a study to assess how false or misleading adverts affect consumers, one group of 100 experimental subjects was exposed to a series of adverts falsely claiming that a new brand of coffee contained 'no bitterness'. These subjects and another control group of 100 people who had not seen the adverts were given a sample of coffee that had been prepared to be intentionally bitter. The contingency table below shows whether the subjects reported the coffee as 'having bitterness'.

                                 False advert   No advert   Total
Coffee described as bitter            68            89        157
Coffee described as not bitter        32            11         43
Total                                100           100        200

In this example, two different groups of people were used in the experiment. The column variable (distinguishing between the two types of advert) is not a random variable, as it is controlled by the experiment. A single categorical measurement (whether or not the subject described the coffee as bitter) was made from each person.

Comparing groups

Although the chi-squared test was motivated as a test of independence of two categorical variables, the same test can be used when each row (or column) of a contingency table corresponds to a separate group of individuals.

Null hypothesis (corresponding to independence)
The category probabilities are the same within each group.
Alternative hypothesis (corresponding to association)
The different groups have different probabilities.

The χ2 test statistic and p-value are identical to those given earlier for testing independence.
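As an illustration, the test can be applied to the Bitter Coffee table above. The sketch assumes the estimated counts are row total times column total divided by the overall total. For a 2 by 2 table there is 1 degree of freedom, and the chi-squared upper tail probability can then be written with the complementary error function.

```python
import math

observed = [[68, 89],   # coffee described as bitter
            [32, 11]]   # coffee described as not bitter

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Estimated counts if the probability of 'bitter' is the same in both groups.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (observed[x][y] - expected[x][y]) ** 2 / expected[x][y]
    for x in range(2)
    for y in range(2)
)
df = (2 - 1) * (2 - 1)
p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail, valid for df = 1 only

print(expected)
print(round(chi2, 2), df, p_value)
```

The statistic is about 13.06 and the p-value is well under 0.01, so there is strong evidence that the two groups differ. Comparing observed and estimated counts shows that 32 people who saw the false adverts described the coffee as not bitter, against an estimated 21.5.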

Examples

In the following examples, we test whether the 'response' proportions are the same in several groups.

Note again that a visual comparison of the observed counts and those estimated from the margins assuming independence helps to explain the nature of the relationship in examples where we conclude that there is some difference between the groups.

Two groups and two categories

In the special case where there are two groups and the categorical measurement has two categories (that we will call 'success' and 'failure'), the chi-squared test is testing whether the probability of success is the same in both groups. For example, in the Bitter Coffee data set, we are testing whether the probability of reporting that the coffee was bitter is the same for the groups seeing the false adverts and those who did not.

This hypothesis can also be tested with a 2-sample test of equality of two proportions.

Fortunately, although the two tests have been motivated in different ways, it can be proved that:

The 2-sample test for equality of two proportions and the chi-squared test both result in the same p-value and conclusion.
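The equivalence can be checked numerically on the Bitter Coffee data. The sketch below assumes the 2-sample test is the two-sided z test that uses the pooled proportion in its standard error and no continuity correction; under that assumption the squared z statistic equals χ2. The function names are hypothetical.

```python
import math

def two_proportion_z(success1, n1, success2, n2):
    """Two-sided z test for equal proportions, using the pooled proportion."""
    p1, p2 = success1 / n1, success2 / n2
    pooled = (success1 + success2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, math.erfc(abs(z) / math.sqrt(2))

def chi_squared_2x2(success1, n1, success2, n2):
    """Chi-squared test for the same data arranged as a 2 x 2 table (df = 1)."""
    observed = [[success1, success2], [n1 - success1, n2 - success2]]
    row_totals = [sum(row) for row in observed]
    col_totals = [n1, n2]
    n = n1 + n2
    chi2 = sum(
        (observed[x][y] - row_totals[x] * col_totals[y] / n) ** 2
        / (row_totals[x] * col_totals[y] / n)
        for x in range(2)
        for y in range(2)
    )
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# 'Bitter' responses: 68 of 100 with the false adverts, 89 of 100 without.
z, p_z = two_proportion_z(68, 100, 89, 100)
chi2, p_chi2 = chi_squared_2x2(68, 100, 89, 100)
print(round(z ** 2, 4), round(chi2, 4))
print(p_z, p_chi2)
```

The squared z statistic and χ2 agree, and so do the two p-values, apart from floating-point rounding.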