Data sets with two categorical variables
In this section, we explain how to model data sets that consist of pairs of categorical values — bivariate categorical data.
Examples where pairs of categorical variables are measured from 'individuals' are:
| 'Individuals' | Variable X | Variable Y |
|---|---|---|
| Employees in a large company | Sex (M or F) | Education (none, high school or tertiary) |
| Customers leaving supermarket | Checkout operator type (full- or part-time) | Rating of quality of service (poor, OK or good) |
| TVs leaving a production line | Assembler (A, B, C or D) | Status (defective or OK) |
In each case, the data that are collected are pairs of categorical values that are measurements of the two variables. Data of this form are usually summarised in a contingency table.
Requests for promotional material by travelers
Travel agents provide 'destination-specific travel literature' about activities, facilities and prices to tourists free of charge on request. A study was made to investigate the differences between information seekers (who requested such literature) and non-seekers, with the aim of better targeting such material.
A sample of 686 tourists was selected and each was classified as an information seeker or non-seeker and in various other ways including educational level.
| Tourist | Educational level | Information seeker? |
|---|---|---|
| 1 | High school degree | Yes |
| 2 | College degree | Yes |
| 3 | Some high school | No |
| ... | ... | ... |
The data from the 686 tourists are summarised in the contingency table below.
| Information seeker? | |||
|---|---|---|---|
| Education | Yes | No | Total |
| Some high school | 13 | 27 | 40 |
| High school degree | 64 | 118 | 182 |
| Some college | 100 | 123 | 223 |
| College degree | 59 | 69 | 128 |
| Graduate degree | 67 | 46 | 113 |
| Total | 303 | 383 | 686 |
Joint probabilities
To model bivariate categorical data, we assume an underlying population of pairs of categorical values. The data are treated as a random sample of pairs of values from this population. A real finite population occasionally underlies the data, but we must usually hypothesise an infinite underlying population.
The proportion of times that the pair of categorical values (x, y) occurs in the population, which is its probability, is denoted by $p_{xy}$. The probabilities $p_{xy}$ are called the joint probabilities for the two variables.
Gambling simulation
A gambler draws a card from a shuffled deck and also tosses a coin, resulting in a pair of categorical values,
| Variable | Possible values |
|---|---|
| Coin side, X | Head or Tail |
| Card suit, Y | Heart, Club, Diamond or Spade |
Each of the eight possible combinations of coin side and card suit is equally likely and would occur in the same proportions in the underlying population.
The probabilities for all pairs are therefore the same,
$$p_{xy} = \frac{1}{8} \quad \text{for all } x \text{ and } y$$
These joint probabilities are shown in blue in the table below.
The two lower tables (in black) describe 100 pairs of values sampled from this population — 100 coin-card pairs. The first of these tables shows the sample as a contingency table of counts; the other table displays the counts as proportions.
Click Take sample a few times to see the variability in samples of 100 coin-card pairs. Increase the sample size to 500 and repeat. Observe that the sample proportions are less variable when the sample size is large.
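The sampling experiment described above can be imitated in a few lines of Python. This is only a sketch: the function names `sample_pairs` and `sample_proportions` and the seed are illustrative choices, and each pair is assumed to come from a fresh coin toss and a freshly shuffled deck, so that every combination has probability 1/8.

```python
import random
from collections import Counter

COIN_SIDES = ["Head", "Tail"]
CARD_SUITS = ["Heart", "Club", "Diamond", "Spade"]

def sample_pairs(n, rng):
    """Draw n independent (coin side, card suit) pairs, each with probability 1/8."""
    return [(rng.choice(COIN_SIDES), rng.choice(CARD_SUITS)) for _ in range(n)]

def sample_proportions(pairs):
    """Convert a list of pairs into sample proportions for all eight cells."""
    counts = Counter(pairs)
    n = len(pairs)
    return {(x, y): counts[(x, y)] / n for x in COIN_SIDES for y in CARD_SUITS}

rng = random.Random(1)  # arbitrary seed, chosen only for repeatability
for n in (100, 500):
    props = sample_proportions(sample_pairs(n, rng))
    # Largest distance between a sample proportion and the joint probability 1/8
    largest_gap = max(abs(p - 1 / 8) for p in props.values())
    print(n, round(largest_gap, 3))
```

Repeating this with other seeds should typically give a smaller largest gap for samples of 500 than for samples of 100, although any single run can differ.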
Interest in the model
We are usually interested in the joint probabilities in the underlying model, rather than the corresponding proportions from the sample data that have been collected (the contingency table). However the joint probabilities are unknown parameters in most practical situations and must be estimated from the sample data.
Requests for promotional material by travelers
In situations of practical importance, the underlying probabilities are unknown. In the data set that was collected to examine which travelers request promotional material, the 686 travelers in the study were not the focus of attention — the researcher wanted to generalise to all other similar travelers.
The population proportions are unknown, but the sample proportions provide estimates of them.
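As a small illustration of using sample proportions as estimates, the sketch below divides each count in the travelers' contingency table by the sample size. The variable names are illustrative, and the resulting values are estimates of the joint probabilities, not the unknown population values themselves.

```python
# Counts from the contingency table of 686 tourists
counts = {
    "Some high school":   {"Yes": 13,  "No": 27},
    "High school degree": {"Yes": 64,  "No": 118},
    "Some college":       {"Yes": 100, "No": 123},
    "College degree":     {"Yes": 59,  "No": 69},
    "Graduate degree":    {"Yes": 67,  "No": 46},
}

# Total sample size
n = sum(sum(row.values()) for row in counts.values())

# Sample proportion for each (education, seeker) pair estimates its joint probability
estimates = {
    (education, seeker): count / n
    for education, row in counts.items()
    for seeker, count in row.items()
}

for pair, estimate in estimates.items():
    print(pair, round(estimate, 3))
```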
Probabilities for a single variable
A model for two categorical variables is characterised by the joint probabilities $p_{xy}$. However we sometimes want to restrict attention to one of these variables on its own. The marginal probabilities for the variable X are defined and interpreted in a similar way to the marginal proportions that were defined earlier for bivariate categorical data.
The marginal probability, $p_x$, for the variable X is the proportion of (x, y) pairs in the population for which the value of X is x. For example, consider a situation where we are interested in the hair colour and eye colour of teenagers. The number of blue-eyed teenagers is the sum of those with either (blue eyes and blonde hair) or (blue eyes and brunette hair) or ... or (blue eyes and red hair).
The same holds for the proportion with blue eyes, which is its marginal probability,
$$p_{\text{blue}} = p_{\text{blue, blonde}} + p_{\text{blue, brunette}} + \dots + p_{\text{blue, red}}$$
This is generalised with the formula
$$p_x = \sum_y p_{xy}$$
where the right of the equation denotes summing the joint probabilities over all possible values of y. There is a similar formula for the marginal probabilities of the other variable,
$$p_y = \sum_x p_{xy}$$
Eye strain for office workers
It is difficult to find illustrative examples since population probabilities are unknown in most 'interesting' applications. The following example is based on a real data set which classifies 295 office workers by their type of work and whether they have symptoms of eye strain.
| Type of work | No eye strain | Eye strain |
|---|---|---|
| VDU data entry | 42 | 11 |
| General VDU use | 79 | 30 |
| Full-time typing | 64 | 14 |
| Standard clerical work | 52 | 3 |
We do not know the underlying population joint probabilities for workers in this type of office in general. However, to provide an illustrative example, we will pretend that the population probabilities are equal to the proportions in this data set. For example, we will pretend that the joint probability for a worker doing VDU data entry and not having eye strain is 42 / 295 = 0.1424.
| Type of work | No eye strain | Eye strain | Total |
|---|---|---|---|
| VDU data entry | 0.1424 | 0.0373 | 0.1797 |
| General VDU use | 0.2678 | 0.1017 | 0.3695 |
| Full-time typing | 0.2169 | 0.0475 | 0.2644 |
| Standard clerical work | 0.1763 | 0.0102 | 0.1864 |
| Total | 0.8034 | 0.1966 | 1.0000 |
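The two marginal totals of the table (its final column and final row) give the marginal probabilities for the two variables. For example, the marginal probability that a worker does VDU data entry is 0.1424 + 0.0373 = 0.1797.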
The diagram below illustrates the summing of joint probabilities to give marginal ones with a 3-dimensional barchart of the joint probabilities.
Click the formula for the marginal probabilities of 'X' (the type of work) on the right. The bars stack to show the marginal probabilities for type of work.
Similarly, clicking the formula for the marginal probabilities of 'Y' stacks the bars to show the overall probability that a worker has eye strain.
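The summing of joint probabilities into marginal ones can also be sketched in code. The example below uses the eye strain counts and, as in the text, pretends that the sample proportions are the population joint probabilities; the variable names are illustrative.

```python
# Eye strain counts for 295 office workers: (no eye strain, eye strain)
counts = {
    "VDU data entry":         {"No eye strain": 42, "Eye strain": 11},
    "General VDU use":        {"No eye strain": 79, "Eye strain": 30},
    "Full-time typing":       {"No eye strain": 64, "Eye strain": 14},
    "Standard clerical work": {"No eye strain": 52, "Eye strain": 3},
}
n = sum(sum(row.values()) for row in counts.values())

# Pretend joint probabilities: each count divided by the total
joint = {work: {s: c / n for s, c in row.items()} for work, row in counts.items()}

# Marginal probabilities for type of work: sum each row over eye strain status
p_work = {work: sum(row.values()) for work, row in joint.items()}

# Marginal probabilities for eye strain: sum each column over type of work
p_strain = {
    s: sum(joint[work][s] for work in joint)
    for s in ("No eye strain", "Eye strain")
}

print({work: round(p, 4) for work, p in p_work.items()})
print({s: round(p, 4) for s, p in p_strain.items()})
```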
Probabilities in a sub-population
The concept of a conditional probability is similar to that of a conditional proportion that was described earlier for bivariate categorical data sets.
Conditional probabilities for Y, given X = x
Consider again hair colour (Y) and eye colour (X) in a population of teenagers. The probability of a teenager being blonde, conditional on blue eyes, is the proportion of blondes within the sub-population with blue eyes. The conditional probability is most easily understood as the ratio of the population number with both blonde hair and blue eyes to the population number with blue eyes,
$$P(\text{blonde} \mid \text{blue eyes}) = \frac{\text{number with blonde hair and blue eyes}}{\text{number with blue eyes}}$$
However if the population is infinite, it is better to express it in terms of probabilities as the ratio of a joint and marginal probability (an equivalent definition for finite populations),
$$P(\text{blonde} \mid \text{blue eyes}) = \frac{p_{\text{blue, blonde}}}{p_{\text{blue}}}$$
The general definition of the conditional probabilities for Y given that the value of X is x is
$$p_{y \mid x} = \frac{p_{xy}}{p_x}$$
Conditional probabilities as a rescaling of joint probabilities
The conditional probabilities for Y, given X = x , can therefore be found by rescaling of that row of the table of joint probabilities (dividing by px) so that the row sums to 1.0, as shown in the diagram below.
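The rescaling of rows can be sketched as follows, using the rounded joint probabilities from the eye strain table as an example. The function name `conditional_given_row` is an illustrative choice, not part of the source.

```python
def conditional_given_row(joint):
    """Rescale each row of a table of joint probabilities so that it sums to 1.

    joint maps each x category to a dict {y category: joint probability}.
    Dividing a row by its total (the marginal probability of x) gives the
    conditional probabilities for Y given X = x.
    """
    conditional = {}
    for x, row in joint.items():
        p_x = sum(row.values())  # marginal probability for this row
        conditional[x] = {y: p_xy / p_x for y, p_xy in row.items()}
    return conditional

# Rounded joint probabilities from the eye strain example
joint = {
    "VDU data entry":         {"No eye strain": 0.1424, "Eye strain": 0.0373},
    "General VDU use":        {"No eye strain": 0.2678, "Eye strain": 0.1017},
    "Full-time typing":       {"No eye strain": 0.2169, "Eye strain": 0.0475},
    "Standard clerical work": {"No eye strain": 0.1763, "Eye strain": 0.0102},
}

conditional = conditional_given_row(joint)
for work, row in conditional.items():
    print(work, {y: round(p, 3) for y, p in row.items()})
```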
Two sets of conditional probabilities
Note that there is an equivalent formula for conditional probabilities for X given the value of Y that corresponds to using the other variable to define the sub-population. When we restrict attention to population values for which Y has the value y, the conditional probabilities for X are
$$p_{x \mid y} = \frac{p_{xy}}{p_y}$$
You should be careful to distinguish between $p_{x \mid y}$ and $p_{y \mid x}$.
The probability of being pregnant, given that a randomly selected person is female, would be fairly small. The probability of being female, given that a person is pregnant, is 1.0.
Support and grief state after neonatal death
The diagram below again shows the joint probabilities in a 3-dimensional barchart.
Click the formula for the conditional probabilities of 'Y' (grief state) given 'X' (the level of support). The bars for each level of support are separately scaled up to add to 1.0. Observe that
Click the formula for joint probabilities, then the formula for conditional probabilities of 'X' given 'Y'. This time the joint probabilities are separately scaled for mothers in different grief states. These conditional probabilities are less useful for understanding this example.
Proportional Venn diagrams
Marginal and conditional probabilities are meaningful and useful summaries of the relationship between two categorical variables. Proportional Venn diagrams were used earlier to graphically display marginal and conditional proportions for bivariate categorical data sets. They can also be used in the same way to display marginal and conditional probabilities for a bivariate categorical model.
The proportional Venn diagram is drawn in a unit square (with both sides of length 1.0).
Area = joint probability
The definition of the conditional probability $p_{y \mid x}$ is
$$p_{y \mid x} = \frac{p_{xy}}{p_x}$$
and the relationship can be rewritten in the form
$$p_{xy} = p_x \times p_{y \mid x}$$
Since this is the product of the height and width of the rectangle representing categories x and y,
The area of any rectangle in the diagram equals the joint probability of the categories it represents.
A similar diagram can be based on the marginal probabilities of Y and the conditional probabilities of X given Y, splitting the unit square first horizontally and then vertically. The areas of the resulting rectangles are again equal to the joint probabilities, so the two diagrams are just rearrangements of the same areas (the joint probabilities, $p_{xy}$).
The use of the diagrams is best explained in an example.
Apple bruising
Before showing the relationship between joint, conditional and marginal probabilities, we illustrate the formulae for joint, conditional and marginal proportions.
The contingency table below describes bruising of 96 apples in a packing plant. The apples were classified by the variety of apple (Granny Smith or Fuji) and whether or not they were bruised. (The data are not real.)
| | Bruised | Not bruised |
|---|---|---|
| Granny Smith | 40 | 8 |
| Fuji | 24 | 24 |
The diagram below shows a Proportional Venn diagram for the data. Note that the four areas are proportional to the numbers of apples for each combination of apple type and bruising.
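The dimensions of the rectangles in such a diagram can be computed directly from the counts. In the sketch below, one side of each rectangle is the marginal proportion for the apple variety and the other is the conditional proportion for bruising within that variety; which side is drawn horizontally is an assumption that does not affect the areas.

```python
# Apple bruising counts (artificial data from the text)
counts = {
    "Granny Smith": {"Bruised": 40, "Not bruised": 8},
    "Fuji":         {"Bruised": 24, "Not bruised": 24},
}
n = sum(sum(row.values()) for row in counts.values())

areas = {}
for variety, row in counts.items():
    variety_total = sum(row.values())
    marginal = variety_total / n              # first side of the rectangle
    for status, count in row.items():
        conditional = count / variety_total   # second side of the rectangle
        # Area = marginal x conditional = joint proportion, count / n
        areas[(variety, status)] = marginal * conditional

for cell, area in areas.items():
    print(cell, round(area, 4))
```

Each area equals the corresponding count divided by 96, so the four areas add to 1.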
World population by age and region
The table below shows the world population in 2002, categorised by region and by age group.
| | Age | | |
|---|---|---|---|
| Region (population in millions) | 0-19 | 20-64 | 65+ |
| Africa and Near East | 526.6 | 455.6 | 34.9 |
| Asia | 1,340.0 | 1,964.2 | 216.7 |
| America, Europe and Oceania | 522.6 | 981.8 | 188.7 |
Consider randomly selecting one person in the world. The joint probabilities for this person being in each age/region are obtained by dividing the above values by the total world population.
| | Age | | |
|---|---|---|---|
| Region | 0-19 | 20-64 | 65+ |
| Africa and Near East | 0.085 | 0.073 | 0.006 |
| Asia | 0.215 | 0.315 | 0.035 |
| America, Europe and Oceania | 0.084 | 0.158 | 0.030 |
Marginal and conditional probabilities can be obtained using formulae from the previous pages. The proportional Venn diagram below displays them graphically.
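As a rough sketch of those calculations, the code below starts from the population figures in the first table and finds the marginal probabilities for region and age group, and then the conditional probabilities of region given age group. The variable names are illustrative.

```python
AGE_GROUPS = ["0-19", "20-64", "65+"]

# World population in 2002 by region and age group, as given in the table
population = {
    "Africa and Near East":        [526.6, 455.6, 34.9],
    "Asia":                        [1340.0, 1964.2, 216.7],
    "America, Europe and Oceania": [522.6, 981.8, 188.7],
}

total = sum(sum(row) for row in population.values())
age_totals = [sum(column) for column in zip(*population.values())]

# Marginal probabilities for each variable
p_region = {region: sum(row) / total for region, row in population.items()}
p_age = {age: t / total for age, t in zip(AGE_GROUPS, age_totals)}

# Conditional probabilities of region given age group (each column rescaled to 1)
p_region_given_age = {
    region: {age: row[j] / age_totals[j] for j, age in enumerate(AGE_GROUPS)}
    for region, row in population.items()
}

print({r: round(p, 3) for r, p in p_region.items()})
print({a: round(p, 3) for a, p in p_age.items()})
print({r: round(c["65+"], 3) for r, c in p_region_given_age.items()})
```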
The diagram initially splits the unit square horizontally using the marginal probabilities of Y, which are the probabilities of a random person being from each of the three regions. Each row is split according to the conditional probabilities for age group within that region. From the diagram, we can easily see that:
Click on any rectangle in the diagram to observe how its area equals the product of a marginal and conditional probability and therefore is the joint probability for the corresponding categories.
Click the rightmost formula under the diagram. The rectangles change in shape but retain the same areas to rearrange into vertical columns corresponding to the marginal probabilities for age group. Each column is split in proportion to the conditional probabilities of region given age group. From this version of the diagram, observe that
Marginal and conditional probs can be found from joint probs (and vice versa)
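We have used three types of probability to describe a model for two categorical variables: the joint probabilities, the marginal probabilities for the two variables and the conditional probabilities for each variable given the value of the other variable. These sets of probabilities are closely related. Indeed, the model can be equivalently described by any of the following: the joint probabilities $p_{xy}$; the marginal probabilities $p_x$ together with the conditional probabilities $p_{y \mid x}$; or the marginal probabilities $p_y$ together with the conditional probabilities $p_{x \mid y}$.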
The diagram below shows how to find each set of probabilities from the others, using the formulae described in the earlier pages of this section.
Bayes theorem
In particular, note that it is possible to obtain the conditional probabilities for X given Y, $p_{x \mid y}$, from the marginal probabilities of X, $p_x$, and the conditional probabilities for Y given X, $p_{y \mid x}$. This can be expressed in a single formula that is called Bayes Theorem,
$$p_{x \mid y} = \frac{p_x \, p_{y \mid x}}{\sum_{x'} p_{x'} \, p_{y \mid x'}}$$
but it is easier in practice to do the calculations in two steps, obtaining the joint probabilities, $p_{xy}$, in the first step. There are several important applications of Bayes Theorem.
Fraudulent tax claims
Tax inspectors investigate some of the tax returns that are submitted by individuals if they think that some claims for expenses are too high or are unjustified.
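An investigation of a tax return does not always conclude that the claims were fraudulent, since the inspectors' suspicions are rarely 100% accurate. There are two types of error: a correct claim may be investigated, and a bad claim may not be investigated.
There are commonly non-zero probabilities for each of these types of error. Consider tax inspectors who have probability 0.1 of investigating a correct claim and 0.2 of not investigating a bad claim. These are conditional probabilities and can be written formally as:
$$P(\text{investigated} \mid \text{good claim}) = 0.1 \qquad P(\text{not investigated} \mid \text{bad claim}) = 0.2$$
Since the probability (proportion) of investigating a bad claim is one minus the conditional probability of not investigating it (and a similar result holds for correct claims), the remaining conditional probabilities are
$$P(\text{investigated} \mid \text{bad claim}) = 0.8 \qquad P(\text{not investigated} \mid \text{good claim}) = 0.9$$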
We will also assume that 10% of tax returns are bad claims. This corresponds to a marginal probability, P(bad claim) = 0.10.
The diagram below shows how these marginal probabilities for Y (claim type) and conditional probabilities for X (investigation) given Y can be used to obtain the conditional probabilities for Y (claim type) given X (investigation).
The initial information is shown in blue at the top of the diagram. The joint probabilities (green) are first found from them. Click on any value in the table of joint probabilities to see how it is related to the initial information.
Marginal probabilities for whether a return is investigated are next obtained by adding the columns of joint probabilities. Click on any of the black marginal probabilities to see how they are obtained from the joint probabilities.
Finally the conditional probabilities for claim type (given whether the tax return has been investigated) are obtained from the joint probabilities and the marginal probabilities for investigation. Click on the conditional probabilities on the bottom right of the diagram to see the formula.
Initially there might seem to be a contradiction between the two conditional probabilities,
$$P(\text{investigated} \mid \text{bad claim}) = 0.8 \qquad P(\text{bad claim} \mid \text{investigated}) = \frac{0.08}{0.17} \approx 0.47$$
However the two probabilities are consistent since they have very different interpretations. The proportional Venn diagrams below help to explain the difference. The diagram on the left shows the marginal and conditional probabilities given in the question. The corresponding diagram on the right shows the marginal probabilities for whether claims are investigated and the conditional probabilities for good/bad claims.
Remember that the areas of the rectangles equal the joint probabilities and are therefore the same in both diagrams.
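The two-step calculation for the tax example can be sketched as follows, using the values stated above: a marginal probability of 0.10 for a bad claim, probability 0.1 of investigating a correct claim and probability 0.2 of not investigating a bad claim. The variable names and the labels "good" and "bad" are illustrative choices.

```python
# Marginal probabilities for claim type
prior = {"bad": 0.10, "good": 0.90}

# Conditional probabilities of investigation given claim type
p_investigated_given = {"bad": 1 - 0.2, "good": 0.1}

# Step 1: joint probabilities = marginal x conditional
joint = {}
for claim, p_claim in prior.items():
    joint[(claim, "investigated")] = p_claim * p_investigated_given[claim]
    joint[(claim, "not investigated")] = p_claim * (1 - p_investigated_given[claim])

# Step 2: marginal probabilities for investigation, then conditionals for claim type
actions = ("investigated", "not investigated")
p_action = {a: sum(joint[(claim, a)] for claim in prior) for a in actions}
posterior = {(claim, a): joint[(claim, a)] / p_action[a] for (claim, a) in joint}

print(round(p_action["investigated"], 4))
print(round(posterior[("bad", "investigated")], 4))
print(round(posterior[("bad", "not investigated")], 4))
```

Changing `prior` corresponds to altering the proportion of bad claims in the population while keeping the inspectors' error probabilities fixed.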
Drag the slider to alter the proportion of people who make bad tax claims in the population. (We assume that the conditional probabilities of investigating the claims remain the same.) Observe that:
Relationships between numerical variables
When two or more measurements are made from each individual in a population, we are usually interested in whether these variables are related to each other. When both variables are numerical, the strength of the relationship can be described with a correlation coefficient and regression models allow us to test whether two variables are related on the basis of sample data.
Relationships between categorical variables
Two categorical measurements may also be related.
As with numerical variables, we may be able to conclude that any relationship between categorical variables is causal if it results from an experiment (e.g. a randomised experiment in which some pea seeds are coated and others are uncoated). From observational data however, we usually cannot deduce a causal relationship — all we can say is that the variables may be associated.
What does association mean?
We say that two variables are associated if knowledge of the value of one tells you something about the likely value of the other.
If the conditional distribution of Y given X = x depends on the value of x, we say that X and Y are associated.
For example, if the conditional distribution of the Job satisfaction of new employees given Job type = secretary is different from the conditional distribution of Job satisfaction given Job type = manager, then we say that Job satisfaction and Job type are associated.
In the next page, we will characterise two variables that are not associated, but first we give an example of variables that are related.
Absenteeism and weight
To illustrate the idea of association, we use a table of joint probabilities that constitute a possible model for absenteeism of employees in a supermarket chain and their weight.
Note that the joint probabilities in this model do not accurately represent the effect of weight on absenteeism — they are only used to illustrate the concepts.
| | Attendance record | | | |
|---|---|---|---|---|
| Weight | Poor | Satisfactory | Above average | Marginal |
| Underweight | 0.0450 | 0.0900 | 0.0150 | 0.1500 |
| Normal | 0.0825 | 0.3025 | 0.1650 | 0.5500 |
| Overweight | 0.0500 | 0.1200 | 0.0300 | 0.2000 |
| Obese | 0.0300 | 0.0650 | 0.0050 | 0.1000 |
| Marginal | 0.2075 | 0.5775 | 0.2150 | 1.0000 |
The implications of this model are best explained from conditional probabilities for attendance record, given weight:
| | Attendance record | | | |
|---|---|---|---|---|
| Weight | Poor | Satisfactory | Above average | Total |
| Underweight | 0.30 | 0.60 | 0.10 | 1.0 |
| Normal | 0.15 | 0.55 | 0.30 | 1.0 |
| Overweight | 0.25 | 0.60 | 0.15 | 1.0 |
| Obese | 0.30 | 0.65 | 0.05 | 1.0 |
A proportional Venn diagram displays these conditional probabilities graphically.
If this model is correct, the conditional probability of poor attendance is lowest for staff with 'normal' weight, increasing as weight gets further from 'normal'. Similarly, the probability of above average attendance is highest for those with 'normal' weight.
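One way to check for association in a model like this is to compute the conditional probabilities for each weight group and see whether they differ. The sketch below does this for the joint probabilities in the attendance table; the function name and the tolerance are illustrative assumptions.

```python
ATTENDANCE = ["Poor", "Satisfactory", "Above average"]

# Joint probabilities for weight (rows) and attendance record (columns)
joint = {
    "Underweight": [0.0450, 0.0900, 0.0150],
    "Normal":      [0.0825, 0.3025, 0.1650],
    "Overweight":  [0.0500, 0.1200, 0.0300],
    "Obese":       [0.0300, 0.0650, 0.0050],
}

def conditional_rows(joint):
    """Divide each row by its total to get conditional probabilities given the row."""
    return {x: [p / sum(row) for p in row] for x, row in joint.items()}

cond = conditional_rows(joint)

# For each attendance category, how much the conditional probability
# varies across the weight groups (0 would mean no variation)
spread = [max(column) - min(column) for column in zip(*cond.values())]

# The variables are associated if any conditional probability depends on weight
associated = any(s > 1e-9 for s in spread)

for weight, row in cond.items():
    print(weight, [round(p, 2) for p in row])
print(dict(zip(ATTENDANCE, [round(s, 2) for s in spread])), associated)
```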
Independence
If the conditional probabilities for Y are the same for all values of X, then Y is said to be independent of X.
If X and Y are independent, knowing the value of X does not give us any information about the likely value for Y.
Independence implies that the sub-populations corresponding to different values of X all contain values of Y in the same proportions.
Work performance and weight
As an example of independence, we continue with the (artificial) example on the previous page. We now show the relationship between weight and work performance (as assessed by a supervisor). In this model, weight and performance are independent — knowing someone's weight gives no clues as to that person's ability to do their job.
| | Work performance | | | |
|---|---|---|---|---|
| Weight | Poor | Satisfactory | Above average | Marginal |
| Underweight | 0.0225 | 0.1125 | 0.0150 | 0.1500 |
| Normal | 0.0825 | 0.4125 | 0.0550 | 0.5500 |
| Overweight | 0.0300 | 0.1500 | 0.0200 | 0.2000 |
| Obese | 0.0150 | 0.0750 | 0.0100 | 0.1000 |
| Marginal | 0.1500 | 0.7500 | 0.1000 | 1.0000 |
For this model, the conditional probabilities for work performance, given weight, are:
| | Work performance | | | |
|---|---|---|---|---|
| Weight | Poor | Satisfactory | Above average | Total |
| Underweight | 0.15 | 0.75 | 0.10 | 1.0 |
| Normal | 0.15 | 0.75 | 0.10 | 1.0 |
| Overweight | 0.15 | 0.75 | 0.10 | 1.0 |
| Obese | 0.15 | 0.75 | 0.10 | 1.0 |
The conditional probabilities are the same for each weight, so knowing that an employee is, say, obese does not affect the probability of being rated as an above-average worker. The proportional Venn diagram has the form shown below.
Note that the Proportional Venn Diagram now consists of a grid of horizontal and vertical lines.
Mathematical definition of independence
If Y is independent of X, then:
$$p_{y \mid x} = p_y \quad \text{for all } x$$
Also, if Y is independent of X, then X is also independent of Y.
Since the conditional and marginal probabilities are equal if Y and X are independent, an equivalent definition of independence is:
X and Y are independent if
$$p_{xy} = p_x \times p_y \quad \text{for all } x \text{ and } y$$
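The product form of independence (each joint probability equals the product of its two marginal probabilities) is easy to check numerically. The sketch below applies it to the work performance model; the function name `is_independent`, the tolerance and the small dependent table used for contrast are all illustrative assumptions.

```python
def is_independent(joint, tol=1e-9):
    """Check whether every joint probability equals the product of its marginals.

    joint maps each row category to a list of joint probabilities,
    one per column category.
    """
    p_row = {x: sum(row) for x, row in joint.items()}
    p_col = [sum(column) for column in zip(*joint.values())]
    return all(
        abs(p_xy - p_row[x] * p_col[j]) <= tol
        for x, row in joint.items()
        for j, p_xy in enumerate(row)
    )

# Joint probabilities for weight (rows) and work performance (columns)
performance = {
    "Underweight": [0.0225, 0.1125, 0.0150],
    "Normal":      [0.0825, 0.4125, 0.0550],
    "Overweight":  [0.0300, 0.1500, 0.0200],
    "Obese":       [0.0150, 0.0750, 0.0100],
}

# A hypothetical 2 x 2 model with perfect dependence, for contrast
dependent = {"A": [0.5, 0.0], "B": [0.0, 0.5]}

print(is_independent(performance))  # weight and performance
print(is_independent(dependent))    # hypothetical dependent model
```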
3-dimensional illustration of independence
The diagram below shows the joint probabilities in the model of independence above.
Click the formula for the conditional probability of Y given X. (This separately scales the bars for each X to have the same total, 1.0.) Observe that the distribution of performance is the same in each weight group.
Click the formula for the joint probabilities, then the formula for the conditional probabilities of X given Y. Observe that the distribution of weights is the same in each performance group.
Assessing independence, based on a sample
Independence is an important special case of models for bivariate data. However it is a property of the joint population probabilities and in most practical situations these are unknown.
We must assess independence from a sample of individuals — a contingency table.
Recruiting source and success
A sample of 1,400 store clerks hired during 1979 by a large US retailing chain was selected by researchers who wanted to determine whether the recruiting source for employees is related to whether they perform satisfactorily in their job (determined from supervisor evaluations). Four recruiting sources were defined.
| | Unsatisfactory | Satisfactory | Total |
|---|---|---|---|
| Employee referral | 167 | 85 | 252 |
| In-store notice | 383 | 261 | 644 |
| Employment agency | 33 | 17 | 50 |
| Media announcement | 250 | 204 | 454 |
| Total | 833 | 567 | 1400 |
Independence would be an important characteristic of employment since it would imply that employees recruited from all sources have the same probability of satisfactory performance.
Are those sample data consistent with a model of independence?
Marginal distributions and independence
The marginal counts in a contingency table describe the univariate distributions of the two variables on their own, but do not tell you anything about their relationship. For example, the two contingency tables below have the same margins.
However the table on the left supports an extremely strong relationship — if the row category is known, we can accurately predict the column category. On the other hand, there is no evidence of association in the table on the right — each row of the table contains the column categories in the same proportions.
Estimated cell counts under independence
In practice, the pattern of counts in a contingency table is rarely so easily interpreted. A first step is to determine the pattern that is most consistent with independence of the rows and columns, based on the observed margins.
| | C1 | C2 | C3 | Total |
|---|---|---|---|---|
| R1 | ? | ? | ? | 30 |
| R2 | ? | ? | ? | 40 |
| R3 | ? | ? | ? | 30 |
| Total | 30 | 40 | 30 | 100 |
If the rows and columns are independent, the conditional probabilities are the same for each row, so we distribute each marginal row total between the column categories in the same proportions — determined by the marginal proportions for the column categories.
This pattern gives the estimated cell counts, and the following formula can be used to evaluate them.
$$e_{xy} = \frac{n_x \, n_y}{n}$$
where n denotes the total for the whole table and $n_x$ and $n_y$ denote the marginal totals for row x and column y.
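The estimated cell counts depend only on the margins: each is the row total times the column total, divided by the overall total. A minimal sketch, with `expected_counts` as an illustrative function name, applied to the margins of the table with rows R1 to R3 and columns C1 to C3:

```python
def expected_counts(row_totals, col_totals):
    """Estimated cell counts under independence: row total x column total / n."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must add to the same overall total")
    return [[r * c / n for c in col_totals] for r in row_totals]

# Margins from the table with rows R1-R3 and columns C1-C3
estimated = expected_counts([30, 40, 30], [30, 40, 30])
for row in estimated:
    print(row)
```

Every row of the result contains the column categories in the same 30 : 40 : 30 proportions, which is the pattern that independence implies.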
Recruiting source and success
We now find the pattern of estimated cell counts for the recruitment data that is most consistent with independence of recruiting source and success, based only on the margins of the observed contingency table.
| | Unsatisfactory | Satisfactory | Total |
|---|---|---|---|
| Employee referral | ? | ? | 252 |
| In-store notice | ? | ? | 644 |
| Employment agency | ? | ? | 50 |
| Media announcement | ? | ? | 454 |
| Total | 833 | 567 | 1400 |
If success is indeed independent of recruitment, then we estimate that the proportion of the 252 recruited from 'Employee referral' who are unsatisfactory would be the same as the marginal proportion who are unsatisfactory. Since 833 out of the total 1400 in the study are unsatisfactory, we therefore expect that the number recruited from 'Employee referral' who are unsatisfactory would be
$$\frac{252 \times 833}{1400} = 149.9$$
This is an example of the general formula that was presented earlier,
$$e_{xy} = \frac{n_x \, n_y}{n}$$
If recruitment and success are indeed independent, then the observed cell counts in the sample data should be similar to these estimated cell counts.
| | Unsatisfactory | Satisfactory | Total |
|---|---|---|---|
| Employee referral | 167 (149.9) | 85 (102.1) | 252 |
| In-store notice | 383 (383.2) | 261 (260.8) | 644 |
| Employment agency | 33 (29.8) | 17 (20.2) | 50 |
| Media announcement | 250 (270.1) | 204 (183.9) | 454 |
| Total | 833 | 567 | 1400 |
Comparison of observed and estimated cell counts
The hypothesis of independence is assessed by asking whether the observed and estimated cell counts are 'sufficiently close' — are the observed counts consistent with the counts estimated under independence? We address this formally in the following pages.
Hypotheses
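Did a sample contingency table come from a population in which the categorical row and column variables, X and Y, are independent? This question can be formalised as the hypothesis test,
$$H_0: X \text{ and } Y \text{ are independent} \qquad H_A: X \text{ and } Y \text{ are dependent}$$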
Possible test statistic?
In order to assess whether the data are consistent with the null hypothesis, we ask whether the observed cell counts in the contingency table, $n_{xy}$, are similar to the estimated cell counts based on independence, $e_{xy}$. The simplest measure of their match is the sum of squares of the differences,
$$\sum_{x,y} (n_{xy} - e_{xy})^2$$
Small values of this statistic are expected when there is independence in the underlying population. However it does not behave entirely as desired. To be useful, a test statistic must have a known distribution when H0 is true and, ideally, this distribution should not depend too much on specific characteristics of the problem.
The raw sum of squares has a distribution that depends on the sample size and on the marginal probabilities.
It would be very unusual for a cell in a contingency table with estimated cell count $e_{xy} = 1$ to have observed cell count $n_{xy} = 11$. However if the estimated cell count is $e_{xy} = 1001$, then sampling variability would mean that an observed cell count of $n_{xy} = 1011$ would not be unusual. Yet the difference is the same in both cases.
The raw sum of squares must be interpreted differently, depending on the size of the estimated cell counts, so it is a bad test statistic.
Distribution of sum of squares
The blue values in the contingency table on the left below have been sampled from a population in which each of the row categories is equally likely (with marginal probability 1/3), each column category is equally likely (marginal probability 1/3) and the row and column categories are independent. All joint probabilities are therefore known to be 1/9.
Click Sample a few times to observe the variability of the blue observed counts, nxy.
The estimated counts, exy, obtained from the margins of the table, are also shown in red. Observe the variability in the differences and their sum of squares.
Increase the sample size from 100 to 1000 and repeat. Observe that the differences are usually higher. Increase the sample size to 10000 and observe that the statistic is usually higher still.
The distribution of the sum of squares depends on the sample size, so it is not an easily interpreted measure of independence.
A better test statistic
The raw sum of squares on the previous page is a poor way to assess whether a contingency table has been sampled from a population with independence. A better statistic is $\chi^2$ (pronounced chi-squared), defined by
$$\chi^2 = \sum_{x,y} \frac{(n_{xy} - e_{xy})^2}{e_{xy}}$$
This more fairly assesses differences between nxy and exy when the exy vary in magnitude. Its distribution still depends on the number of rows and columns in the contingency table, but is no longer affected by either the number of individuals (the total count for the table) or the margins of the table.
Only the number of rows and number of columns in the table have much influence on the distribution of χ2.
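As a worked illustration, the sketch below evaluates the chi-squared statistic (the sum over all cells of the squared difference between observed and estimated counts, divided by the estimated count) for the recruiting source data. The function name `chi_squared` is an illustrative choice, and the estimated counts are computed from the table margins as described earlier.

```python
def chi_squared(observed):
    """Return (statistic, estimated counts) for a table of observed counts."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(column) for column in zip(*observed)]
    # Estimated counts under independence: row total x column total / n
    estimated = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - est) ** 2 / est
        for obs_row, est_row in zip(observed, estimated)
        for obs, est in zip(obs_row, est_row)
    )
    return statistic, estimated

# Recruiting source data: columns are (unsatisfactory, satisfactory)
observed = [
    [167, 85],   # Employee referral
    [383, 261],  # In-store notice
    [33, 17],    # Employment agency
    [250, 204],  # Media announcement
]

chi2, estimated = chi_squared(observed)
print(round(chi2, 2))
print([[round(e, 1) for e in row] for row in estimated])
```

For these data the statistic should come out near 9.4, with most of the contribution coming from the 'Employee referral' and 'Media announcement' rows.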
Simulation
The diagram below again samples from two independent categorical variables.
Click Accumulate then take several samples to build up the distribution of the χ2 statistic.
Now increase the sample size and repeat. Observe that χ2 is approximately the same magnitude (usually between 0.5 and 15.0) regardless of the sample size.
Finally, use the pop-up menu labelled Model to change the model to one where the marginal probabilities for the two categorical variables are unequal (but there is still independence). Observe that the distribution of χ2 remains approximately the same.
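The behaviour described in these simulations can be imitated without the diagram. The sketch below repeatedly samples 3 × 3 tables from a model with independence and equal marginal probabilities, and compares the average raw sum of squares and the average chi-squared statistic for two sample sizes. The function names, the seed and the number of repetitions are arbitrary illustrative choices.

```python
import random

def sample_table(n, rng, rows=3, cols=3):
    """Sample n individuals with independent, equally likely row and column categories."""
    table = [[0] * cols for _ in range(rows)]
    for _ in range(n):
        table[rng.randrange(rows)][rng.randrange(cols)] += 1
    return table

def raw_and_chi2(table):
    """Return the raw sum of squares and the chi-squared statistic for a table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(column) for column in zip(*table)]
    raw = chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            estimated = row_totals[i] * col_totals[j] / n
            raw += (observed - estimated) ** 2
            chi2 += (observed - estimated) ** 2 / estimated
    return raw, chi2

def mean_statistics(n, reps, rng):
    """Average both statistics over reps simulated tables of size n."""
    results = [raw_and_chi2(sample_table(n, rng)) for _ in range(reps)]
    mean_raw = sum(raw for raw, _ in results) / reps
    mean_chi2 = sum(chi2 for _, chi2 in results) / reps
    return mean_raw, mean_chi2

rng = random.Random(2024)  # arbitrary seed
means = {n: mean_statistics(n, 300, rng) for n in (100, 1000)}
for n, (mean_raw, mean_chi2) in means.items():
    print(n, round(mean_raw, 1), round(mean_chi2, 2))
```

The average raw sum of squares should grow roughly in proportion to the sample size, whereas the average chi-squared value should stay close to 4 for both sample sizes.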
Distribution of chi-squared statistic
When there is independence, the χ2 statistic has approximately a standard distribution called a chi-squared distribution whose shape only depends on the number of rows and columns in the table but not the sample size or the underlying joint probabilities.
If a contingency table with r rows and c columns is sampled from a population with independence, χ2 has a chi-squared distribution with (r - 1)(c - 1) degrees of freedom.
The chi-squared distribution is skew, and its mean equals its degrees of freedom.
Shape of the chi-squared distribution
The diagram below shows the probability density function for the chi-squared distribution.
Use the pop-up menus to change the number of rows and columns in the table. Observe that:
Testing for independence
We now formally describe a hypothesis test for whether two categorical variables are independent.
We have seen that the χ2 statistic
$$\chi^2 = \sum_{x,y} \frac{(n_{xy} - e_{xy})^2}{e_{xy}}$$
describes whether the observed counts in a contingency table, nxy, are close to those expected for independent variables, exy.
P-value
In a similar way to other hypothesis tests, we evaluate a p-value: the probability of getting a χ2 value at least as large as the one observed when the two variables are independent (H0).
If the p-value is close to zero, we conclude that the observed table would be unlikely for independent variables, so there is evidence that the variables are associated.
| p-value | Interpretation |
|---|---|
| over 0.1 | no evidence against the null hypothesis (independence) |
| between 0.05 and 0.1 | very weak evidence of dependence between the row and column variables |
| between 0.01 and 0.05 | moderately strong evidence of dependence between the row and column variables |
| under 0.01 | strong evidence of dependence between the row and column variables |
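The p-value is the upper tail area of the chi-squared distribution beyond the observed statistic. The following sketch shows one way to evaluate it with only the standard library, using a series expansion of the regularised incomplete gamma function. The function names and the example statistic (9.49 with 4 degrees of freedom) are illustrative assumptions, not values from any table in this section; the wording returned by `interpret` follows the table above.

```python
import math


def chi2_upper_tail(x, df):
    """P(X >= x) when X has a chi-squared distribution with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a = df / 2.0
    half_x = x / 2.0
    # Series for the lower incomplete gamma function:
    # sum over k of half_x^k / (a (a+1) ... (a+k)).
    term = 1.0 / a
    total = term
    k = 1
    while term > total * 1e-15:
        term *= half_x / (a + k)
        total += term
        k += 1
    log_prefactor = a * math.log(half_x) - half_x - math.lgamma(a)
    lower = total * math.exp(log_prefactor)
    return max(0.0, 1.0 - lower)


def interpret(p_value):
    """Map a p-value to the wording used in the interpretation table."""
    if p_value > 0.1:
        return "no evidence against the null hypothesis (independence)"
    if p_value > 0.05:
        return "very weak evidence of dependence"
    if p_value > 0.01:
        return "moderately strong evidence of dependence"
    return "strong evidence of dependence"


# Hypothetical statistic, chosen only to illustrate the calculation.
p = chi2_upper_tail(9.49, 4)
print(f"p-value = {p:.3f}: {interpret(p)}")
```

The series form suits the moderate statistics typical of contingency tables; for a very large χ2 the subtraction from 1 loses precision, and a statistical library would be a safer choice.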
Warning about low estimated cell counts
The p-value for the test can be found because the χ2 test statistic has approximately a chi-squared distribution. This approximation is close for most data sets that are encountered, but is less so when the sample size, n, is small. The guidelines that are often given suggest that the p-value can be relied on if:
If the cell counts are small enough that these conditions do not hold, the p-value is less reliable. (Advanced statistical methods are required to do better.)
Simulation: Independent variables
The diagram below shows a random sample from a model in which the row and column variables are independent. It also illustrates how the p-value is evaluated.
Click Take sample a few times to generate other samples from the model. Observe that the p-value is usually quite large (since H0 is true), but an 'unlucky sample' might mislead you into erroneously concluding that the variables are dependent.
Analysing dependence
If it is concluded that dependence is likely in the table, you should examine carefully the cells of the table where there are the biggest mismatches between the observed and estimated cell counts. This should help you to discover the nature of the dependence.
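One way to locate the biggest mismatches is to scale each difference by the square root of its estimated count, giving what are often called Pearson residuals. That scaling is a choice made for this sketch rather than a method prescribed in the text, and the variable names are hypothetical. The example uses the tourist table shown earlier.

```python
import math

education = [
    "Some high school",
    "High school degree",
    "Some college",
    "College degree",
    "Graduate degree",
]
answers = ["Yes", "No"]
observed = [[13, 27], [64, 118], [100, 123], [59, 69], [67, 46]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson residual for each cell: (observed - expected) / sqrt(expected).
residuals = {}
for i, level in enumerate(education):
    for j, answer in enumerate(answers):
        expected = row_totals[i] * col_totals[j] / n
        residuals[(level, answer)] = (observed[i][j] - expected) / math.sqrt(expected)

# List cells from largest to smallest mismatch.
for cell in sorted(residuals, key=lambda key: abs(residuals[key]), reverse=True):
    print(f"{cell[0]:<20} {cell[1]:<4} residual = {residuals[cell]:+.2f}")

largest = max(residuals, key=lambda key: abs(residuals[key]))
```

A positive residual marks a cell with more individuals than independence would predict, and a negative residual marks one with fewer.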
Examples
The chi-squared test for association is applied to a few real contingency tables below.

In some examples, the value of χ2 is so far into the upper tail of the reference distribution that we are almost certain that the row and column variables are dependent. In others, the value of χ2 is small enough that it could have arisen by chance even if the variables are independent in the underlying population.
Contingency tables from univariate data in several groups
Contingency tables often arise from bivariate categorical data. However, they can also arise from univariate categorical data that are recorded separately from several groups.
'Group membership' can be treated as a second categorical variable.
Effect of false claims in adverts
In a study to assess how false or misleading adverts affect consumers, one group of 100 experimental subjects was exposed to a series of adverts falsely claiming that a new brand of coffee contained 'no bitterness'. These subjects and another control group of 100 people who had not seen the adverts were given a sample of coffee that had been prepared to be intentionally bitter. The contingency table below shows whether the subjects reported the coffee as 'having bitterness'.
| | False advert | No advert | Total |
|---|---|---|---|
| Coffee described as bitter | 68 | 89 | 157 |
| Coffee described as not bitter | 32 | 11 | 43 |
| Total | 100 | 100 | 200 |
In this example, two different groups of people were used in the experiment. The column variable (distinguishing between the two types of advert) is not a random variable, as it is controlled by the experiment. A single categorical measurement (whether or not the coffee sample was bitter) was made from each person.
Comparing groups
Although the chi-squared test was motivated as a test of independence of two categorical variables, the same test can be used when each row (or column) of a contingency table corresponds to a separate group of individuals.
The χ2 test statistic and p-value are identical to those given earlier for testing independence.
Examples
In the following examples, we test whether the 'response' proportions are the same in several groups.

Note again that a visual comparison of the observed counts and those estimated from the margins assuming independence helps to explain the nature of the relationship in examples where we conclude that there is some difference between the groups.
Two groups and two categories
In the special case where there are two groups and the categorical measurement has two categories (that we will call 'success' and 'failure'), the chi-squared test is testing whether the probability of success is the same in both groups. For example, in the Bitter Coffee data set, we are testing whether the probability of reporting that the coffee was bitter is the same for the group who saw the false adverts and the group who did not.
This hypothesis can also be tested with a 2-sample test of equality of two proportions.
Fortunately, although the two tests have been motivated in different ways, it can be proved that the 2-sample test for equality of two proportions and the chi-squared test both result in the same p-value and conclusion.
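The equivalence can be checked numerically on the Bitter Coffee data. The sketch below makes some assumptions that the text does not spell out: the 2-sample test is taken to be the two-sided z test with a pooled estimate of the common proportion, and neither test uses a continuity correction. Under those assumptions the squared z statistic equals χ2, so the two p-values agree.

```python
import math

# Bitter Coffee data: number describing the coffee as bitter, and group sizes.
bitter_false_advert, n_false_advert = 68, 100
bitter_no_advert, n_no_advert = 89, 100

# 2-sample z test with a pooled proportion (assumed form of the test).
p1 = bitter_false_advert / n_false_advert
p2 = bitter_no_advert / n_no_advert
pooled = (bitter_false_advert + bitter_no_advert) / (n_false_advert + n_no_advert)
std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_false_advert + 1 / n_no_advert))
z = (p1 - p2) / std_error
p_value_z = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area

# Chi-squared test on the same 2 x 2 table, without continuity correction.
observed = [
    [bitter_false_advert, bitter_no_advert],
    [n_false_advert - bitter_false_advert, n_no_advert - bitter_no_advert],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
chi2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)
# Upper tail area of the chi-squared distribution with 1 degree of freedom.
p_value_chi2 = math.erfc(math.sqrt(chi2 / 2))

print(f"z = {z:.3f}, z squared = {z * z:.3f}, chi-squared = {chi2:.3f}")
print(f"p-values: z test {p_value_z:.5f}, chi-squared test {p_value_chi2:.5f}")
```

A one-sided z test, or a chi-squared test with a continuity correction, would not match in this way.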