Tidy chi-squared stats in infer
This blog post expands on the excellent talk delivered by Andrew Bray at rstudio::conf 2018. The slides (PDF) from Andrew’s presentation can be found at this GitHub link.
The objective of this package is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework.
Let’s see how that’s done. First we install
infer and load it into our environment.
# install.packages("infer")
library(infer)
library(cowplot) # for nice plots
infer is supposed to play well within the tidyverse, so we load that too, along with some data from the General Social Survey.
library(tidyverse)
# load some data from the General Social Survey
load(url("http://bit.ly/2E65g15"))
names(gss)
## "id"       "year"     "age"      "class"    "degree"   "sex"
## "marital"  "race"     "region"   "partyid"  "happy"    "relig"
## "cappun"   "finalter" "natspac"  "natarms"  "conclerg" "confed"
## "conpress" "conjudge" "consci"   "conlegis" "zodiac"   "oversamp"
## "postlife" "party"    "space"    "NASA"
Question: Is funding for space exploration (NASA) a partisan issue?
Let’s start by looking at responses to the question: “Do you think funding for space exploration is too little, about right, or too much?”
gss %>% select(party, NASA)
## # A tibble: 149 x 2
##    party NASA
##    <fct> <fct>
##  1 Ind   TOO LITTLE
##  2 Ind   ABOUT RIGHT
##  3 Dem   ABOUT RIGHT
##  4 Ind   TOO LITTLE
##  5 Ind   TOO MUCH
##  6 Ind   TOO LITTLE
##  7 Ind   ABOUT RIGHT
##  8 Dem   ABOUT RIGHT
##  9 Dem   TOO LITTLE
## 10 Ind   TOO LITTLE
## # … with 139 more rows
We see that people can either select too little, about right, or too much. We can visualize this pretty easily:
gss %>%
  select(party, NASA) %>%
  ggplot(aes(x = party, fill = NASA)) +
  geom_bar()
Count data can be misleading when we’re looking for trends between categorical variables within groups, so let’s normalize to within-group percentages with the position = "fill" argument to geom_bar().
gss %>%
  select(party, NASA) %>%
  ggplot(aes(x = party, fill = NASA)) +
  geom_bar(position = "fill") +
  ylab("within group percentage")
It doesn’t look like there’s much of a difference in how Democrats, Independents, and Republicans support space exploration, but let’s now drill down into this with some hypothesis testing, comparing base R and infer approaches.
What we essentially have is a contingency table of party affiliation and attitude towards space exploration, and we want to see if there’s a relationship between these variables. The chi-squared test of independence is used to determine if a significant relationship exists between two categorical variables, so we will use this test.
- Null hypothesis: There is no relationship between party (Democrat, Independent, Republican) and attitude towards space exploration (too little, about right, too much).
- Alternative hypothesis: There is a relationship between party and attitude towards space exploration.
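Before testing, it can help to look at the contingency table itself. A quick sketch with base R’s table(), assuming the gss data loaded above:

```r
# Cross-tabulate party affiliation against attitude towards NASA funding;
# this is the contingency table the chi-squared test operates on.
table(gss$party, gss$NASA)
```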
Before we perform a chi-squared test, in base R and
infer, let’s quickly break down what a chi-squared test is all about.
Given our data (party and attitude towards space exploration), we assume:
- Assumption #1: Variables are categorical. There are 3 types of categorical data:
- Dichotomous: 2 groups (e.g. Male and Female)
- Nominal: 3 or more unordered groups (e.g. undergrad, professor, graduate student, postdoctoral scholar)
- Ordinal: ordered groups (e.g. Pain Level 1, Pain Level 2, Pain Level 3, …)
- Assumption #2: Observations are independent of one another (e.g. no relationship between any of the cases).
- Assumption #3: Categorical variable groupings must be mutually exclusive. In other words, we can’t have one participant as “Democrat” and “Independent”.
- Assumption #4: There must be at least 5 expected frequencies in each group of your categorical variable (only important for the analytical solution).
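Assumption #4 is easy to check directly: chisq.test() returns the expected cell counts under independence, so we can confirm each is at least 5. A quick sketch, assuming the gss data loaded above:

```r
# Expected cell frequencies under independence; assumption #4 asks
# that each of these is at least 5.
expected <- chisq.test(gss$party, gss$NASA)$expected
all(expected >= 5)
```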
We meet those assumptions, so let’s power ahead!
There are two main ways to solve this problem:
- Analytically (mathematically)
We assume the test statistic follows a Chi-squared distribution, with a probability density function that depends on the degrees of freedom. The plot above shows how the distribution varies with the degrees of freedom (k). On the x axis is the Chi-squared statistic, which we can calculate in R. We could then see where our observed statistic falls in the distribution, and read off the probability of arriving at that combination of variables, or a more extreme one. As our Chi-squared test statistic increases, we move further along the x axis to the right. There is less area under the curve to the right, so our p-value (the area under the curve to the right of the observed statistic) decreases.
Generally speaking, a larger Chi-squared statistic suggests stronger evidence for rejecting our null hypothesis. If we observe a p-value <= .05, we would reject our null hypothesis.
What would it mean to accept our alternative hypothesis?
In the case of our example, if we lived in a completely random universe, less than 5% of the time would we arrive at the particular combination of party and attitude towards space exploration we observe in our data. In other words, the relationship between party and attitude towards space exploration we see in our data is significant.
But is there actually a significant relationship between these variables? We will calculate a Chi-squared statistic for our data to find out.
Base R’s stats package has a function, chisq.test(), that takes two vectors of data and returns a chi-squared test statistic, degrees of freedom, and p-value. Let’s save this observed chi-squared statistic for later use.

chisq.test(gss$party, gss$NASA)
## 
##  Pearson's Chi-squared test
## 
## data:  gss$party and gss$NASA
## X-squared = 1.3261, df = 4, p-value = 0.8569
observed_stat <- chisq.test(gss$party, gss$NASA)$stat
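That p-value is just the upper tail of a chi-squared distribution with 4 degrees of freedom, which we can reproduce directly with pchisq():

```r
# Area under the chi-squared density (df = 4) to the right of the
# observed statistic; reproduces the p-value reported by chisq.test().
pchisq(1.3261, df = 4, lower.tail = FALSE)
# ≈ 0.8569
```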
We might be tempted to look at this and say, “There’s a high p-value; no significant relationship exists. Job done.” This is what we expected from the bar plots earlier. But let’s go a step further with infer.
Another way to test if there is a significant relationship in our data is to take a programmatic approach.
In his talk, Andrew Bray says:
If we live in world where variables are totally unrelated, the ties between variables are arbitrary, so they might just as well have been shuffled.
What would that random world look like? Let’s take one of the columns of our data frame and scramble it.
gss %>%
  select(party, NASA) %>%
  mutate(permutation_1 = sample(NASA),
         permutation_2 = sample(NASA))
## # A tibble: 149 x 4
##    party NASA        permutation_1 permutation_2
##    <fct> <fct>       <fct>         <fct>
##  1 Ind   TOO LITTLE  TOO MUCH      TOO MUCH
##  2 Ind   ABOUT RIGHT ABOUT RIGHT   TOO LITTLE
##  3 Dem   ABOUT RIGHT TOO LITTLE    TOO MUCH
##  4 Ind   TOO LITTLE  TOO MUCH      ABOUT RIGHT
##  5 Ind   TOO MUCH    ABOUT RIGHT   TOO MUCH
##  6 Ind   TOO LITTLE  TOO MUCH      ABOUT RIGHT
##  7 Ind   ABOUT RIGHT ABOUT RIGHT   TOO LITTLE
##  8 Dem   ABOUT RIGHT ABOUT RIGHT   TOO MUCH
##  9 Dem   TOO LITTLE  TOO MUCH      ABOUT RIGHT
## 10 Ind   TOO LITTLE  TOO LITTLE    TOO MUCH
## # … with 139 more rows
These permutations represent what we would expect to see if the relationship between the variables was completely random. We could generate many, many permutations, calculate a Chi-squared statistic for each, and we would expect their distribution to approach the density functions shown above. Then we could plot our observed statistic on that distribution and see where it fell. If the area under the curve to the right of that point was less than 5%, we could reject the null hypothesis.
infer makes this programmatic approach very simple.
- specify(): choose the variables from your data frame to test
- hypothesize(): select the null hypothesis
- generate(): create permuted values
- calculate(): choose which test statistic to calculate
- visualize(): automatically plot the permuted values with ggplot, making it easy to add geoms
- dataframe in dataframe out
- compose tests with pipes
- reading an inferential chain describes an inferential procedure
gss %>%
  specify(NASA ~ party) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq") %>%
  visualize() +
  # add a vertical line for gss data
  geom_vline(aes(xintercept = observed_stat), color = "red")
If we want to get a p-value from this programmatic approach, we can calculate the area under the curve to the right of the observed statistic.
gss %>%
  specify(NASA ~ party) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq") %>%
  summarise(p_val = mean(stat > observed_stat))
## # A tibble: 1 x 1
##   p_val
##   <dbl>
## 1 0.856
If we omit
visualize, we get a dataframe of the permuted values:
gss %>%
  specify(NASA ~ party) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq") # %>%
  # visualize() +
  # geom_vline(aes(xintercept = observed_stat), color = "red") # add a vertical line for gss data

## # A tibble: 1,000 x 2
##    replicate  stat
##        <int> <dbl>
##  1         1 2.85
##  2         2 1.36
##  3         3 0.342
##  4         4 3.17
##  5         5 1.82
##  6         6 6.10
##  7         7 2.73
##  8         8 2.11
##  9         9 0.846
## 10        10 2.24
## # … with 990 more rows
If we omit
hypothesize, we can bootstrap samples from our data.
gss %>%
  specify(NASA ~ party) %>%
  # hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "Chisq") %>%
  visualize() +
  geom_vline(aes(xintercept = observed_stat), color = "red")
One of the key features of
infer is that the functions are reusable. To illustrate, let’s consider the differences between men and women in their reported level of happiness in our dataset. We could observe this with a simple bar plot.
This is a sample from a population, and there’s going to be some uncertainty associated with that sample. We can bound the uncertainty by adding confidence intervals, and accomplish this with reusable infer functions.
gss %>%
  select(sex, happy) %>%
  ggplot() +
  geom_bar(aes(x = sex, fill = happy), position = "fill")
We could go on to measure the difference between male and female happiness by looking at the difference in the proportion of respondents claiming to be “Happy” or “Unhappy”. Let’s save these proportions for later.
gss %>%
  select(sex, happy) %>%
  group_by(sex, happy) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n))
## # A tibble: 4 x 4
## # Groups:   sex
##   sex    happy       n  prop
##   <fct>  <fct>   <int> <dbl>
## 1 MALE   HAPPY      37 0.638
## 2 MALE   UNHAPPY    21 0.362
## 3 FEMALE HAPPY      78 0.857
## 4 FEMALE UNHAPPY    13 0.143
This table suggests that, on average, females in our sample are about 22 percentage points more likely than males to report being happy. We can visualize that difference in a plot.
# p_female - p_male
obs_stat <- gss %>%
  group_by(sex) %>%
  summarize(p = mean(happy == "HAPPY")) %>%
  summarize(diff(p)) %>%
  pull()

x_pos <- "diff in happiness"
p <- data_frame(diff = obs_stat) %>%
  mutate(x = x_pos) %>%
  ggplot() +
  geom_point(aes(x = x, y = diff), stat = "identity", size = 4)
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
We’re not quite done.
infer helps us assess our uncertainty in this statistic by calculating a confidence interval, which extends two standard errors on either side of the point estimate.
The standard error is the variability we would expect in our statistic if we drew many new samples from the population (e.g. ran more surveys). Assuming normality in our population, we can estimate that variability by bootstrapping. The resulting distribution will be centered on our observed statistic, rather than on a null value.
boot <- gss %>%
  specify(happy ~ sex, success = "HAPPY") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "diff in props", order = c("FEMALE", "MALE"))

visualize(boot) +
  geom_vline(aes(xintercept = obs_stat), color = "red")
We can easily calculate the standard error (the standard deviation of the bootstrap distribution), and then formulate confidence intervals.
About 95.45% of a normal population lies within two standard deviations of the mean. If we take the more memorable ‘2’ standard errors (compared to the less memorable ‘1.96’) above and below our sample statistic, we enclose roughly 95% of the distribution.
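The ‘2’ versus ‘1.96’ trade-off is easy to verify with the normal quantile and distribution functions:

```r
# Exact multiplier enclosing 95% of a normal distribution
qnorm(0.975)          # 1.959964
# Coverage of the +/- 2 standard deviation rule
pnorm(2) - pnorm(-2)  # 0.9544997
```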
SE <- boot %>%
  summarize(sd(stat)) %>%
  pull()
SE
##  0.06946399
CI <- c(obs_stat - 2 * SE, obs_stat + 2 * SE)
CI
##  0.08028384 0.35813981
And now we can add these confidence intervals to our estimate of the measured difference in happiness between men and women in our population with geom_segment().
x_pos <- "diff in happiness"
data_frame(diff = obs_stat) %>%
  mutate(x = x_pos) %>%
  ggplot() +
  geom_point(aes(x = x, y = diff), stat = "identity", size = 4) +
  geom_segment(aes(x = x_pos, xend = x_pos, y = CI[1], yend = CI[2]))
Et voila. Confidence intervals made easy via bootstrapping in infer.
Two things stand out to me about
infer. One is that it makes me think through the steps of what I’m doing, and the verbs in
infer are reusable. Two, I save time because everything happens in a tidy way that plays well with
dplyr. This makes it easy to test hypotheses, mutate and summarise new variables, and pipe the results into a ggplot. The workflow and results are readable and memorable.
I look forward to seeing what’s down the line from
infer. It seems like a powerful and concise way to perform statistical inference and hypothesis testing. I’m very grateful to Andrew Bray for the excellent talk, and to the developers of
infer for their hard work on tidy and expressive statistical inference.