library(dplyr)
library(tidyverse)
library(infer)
11 STAT 201 R Cheatsheet
11.1 Set-up
To be able to run the functions below you need to load the following packages.
Throughout this cheatsheet we will use the Palmer penguin dataset to show examples of code. This dataset is found in the R package palmerpenguins
.
library(palmerpenguins)
We will focus mostly on the flipper length (mm) of the penguins species Chinstrap and Adelie, and thus will create new data frames called adelie
and chinstrap
.
<- penguins %>%
adelie filter(!is.na(flipper_length_mm) & species == "Adelie" & !is.na(sex)) %>%
droplevels() %>%
::select(species, flipper_length_mm, sex)
dplyr
<- penguins %>%
chinstrap filter(!is.na(flipper_length_mm) & species == "Chinstrap" & !is.na(sex)) %>%
droplevels() %>%
::select(species, flipper_length_mm, sex) dplyr
Using the penguins_raw
dataset, we will also looked at changes in body mass for Gentoo penguins that were captured on Biscoe Island in 2008 and recaptured on the same Island in 2009.
<- penguins_raw %>%
gentoo_mass_diff filter(Species == "Gentoo penguin (Pygoscelis papua)" & Island == "Biscoe" & !is.na("Body Mass (g)")) %>%
droplevels() %>%
rename( "ID" = `Individual ID`, "Body_mass_g" = `Body Mass (g)`) %>%
mutate(Year = as.numeric(paste("20", substring(studyName, 4, 5), sep = ""))) %>%
filter(Year == 2008 | Year == 2009) %>%
filter(duplicated(ID, fromLast = TRUE) | duplicated(ID) ) %>%
select(ID, Body_mass_g, Sex, Year) %>%
arrange(ID)
# Check that we have 2 rows (1 per year) per individual
%>% group_by(ID) %>% count() %>% pull(n) %>% unique() gentoo_mass_diff
[1] 2
11.2 Sampling
rep_sample_n(tbl = , size = , reps = , replace = FALSE)
rep_sample_n()
is in the infer package. You will need to run library(infer)
to access the function (see set-up section).
Takes a sample from the data frame provided by
tbl =
(which may be piped in). If you use%>%
, you don’t need to specifytbl
.Specify the sample size with
size =
.Specify the number of replicates to take with
reps =
.Specify if the sample should be taken with replacement (
replace = TRUE
) or without replacement (replace = FALSE
). The default is without replacement.
<- adelie %>% rep_sample_n(size = 20, reps = 1000, replace = FALSE)
resamples head(resamples)
# A tibble: 6 × 4
# Groups: replicate [1]
replicate species flipper_length_mm sex
<int> <fct> <int> <fct>
1 1 Adelie 200 male
2 1 Adelie 185 female
3 1 Adelie 193 female
4 1 Adelie 208 male
5 1 Adelie 197 male
6 1 Adelie 189 male
tail(resamples)
# A tibble: 6 × 4
# Groups: replicate [1]
replicate species flipper_length_mm sex
<int> <fct> <int> <fct>
1 1000 Adelie 186 female
2 1000 Adelie 190 female
3 1000 Adelie 195 female
4 1000 Adelie 184 male
5 1000 Adelie 181 female
6 1000 Adelie 184 female
The object returned, here resamples
, has a variable that contains which of the reps
replicate the row belongs to and the resampled values.
11.3 Grouping and ungrouping data
group_by(.data, ...)
Groups the data by the argument provided.
<- resamples %>% group_by(replicate)
res_group head(res_group)
# A tibble: 6 × 4
# Groups: replicate [1]
replicate species flipper_length_mm sex
<int> <fct> <int> <fct>
1 1 Adelie 200 male
2 1 Adelie 185 female
3 1 Adelie 193 female
4 1 Adelie 208 male
5 1 Adelie 197 male
6 1 Adelie 189 male
Note that resamples
was already grouped as rep_sample_n()
returns an object grouped by replicate.
ungroup(x, )
Ungroups the data.
<- resamples %>% ungroup()
res_ungroup head(res_ungroup)
# A tibble: 6 × 4
replicate species flipper_length_mm sex
<int> <fct> <int> <fct>
1 1 Adelie 200 male
2 1 Adelie 185 female
3 1 Adelie 193 female
4 1 Adelie 208 male
5 1 Adelie 197 male
6 1 Adelie 189 male
Notice the difference between the two.
11.4 Calulating statistics
summarize(.data, statistic_value = statistic_function(variable_of_interest))
Calculates statistics on a data frame that may or may not be grouped. It returns one row for each combination of grouping variables.
statistic_value is your name for the statistic
statistic_function is an R function
variable_of_interest is a column in the data frame
%>% summarize(sample_mean = mean(flipper_length_mm)) resamples
# A tibble: 1,000 × 2
replicate sample_mean
<int> <dbl>
1 1 188.
2 2 191.
3 3 189.
4 4 193.
5 5 194.
6 6 193.
7 7 190.
8 8 189.
9 9 190
10 10 191.
# ℹ 990 more rows
11.5 Bootstrap distributions with the infer package
specify(x, response = )
generate(x, type = "bootstrap", reps = )
calculate(x, stat = )
On the data frame x:
specify()
the variable you’re interested inresponse =
generate()
the bootstrap samples with the specified number ofreps =
for each sample, calculates the statistic specified in
stat =
which should be one of:- “mean”, “median”, “sum”, “sd”, “prop”, “count”, “diff in means”, “diff in medians”, “diff in props”
<- adelie %>%
median_bootstrap specify(response = flipper_length_mm) %>%
generate(type = "bootstrap", reps = 1000) %>%
calculate(stat = "median")
head(median_bootstrap)
Response: flipper_length_mm (numeric)
# A tibble: 6 × 2
replicate stat
<int> <dbl>
1 1 190
2 2 190
3 3 190
4 4 190
5 5 190
6 6 189
11.6 Confidence intervals with the infer package
get_ci(x, level = , type ="percentile")
Calculates a confidence interval of the specified level of confidence (level =
) on the distribution x.
<- median_bootstrap %>%
median_95ci get_ci(level = 0.95, type = "percentile")
median_95ci
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 190. 191
11.7 Data visualization: ggplot2
ggplot2
is a flexible and powerful visualization package in R that creates plots by combining different components. It can be used for many types of plots, such as histograms, scatter plots, and line plots. The base function in ggplot2
is ggplot(), which initializes the plot. Additional layers can then be added to specify the details of the plot by using geom_*()
functions. geom_*()
functions define the type of plot (e.g., geom_histogram()
for histograms, geom_point()
for scatterplot, geom_boxplot()
for boxplot).
ggplot2
uses general functions for titles and axis labels. They are used to customize the plot but don’t define the data visualization itself. You can use them to add titles, and axis labels, and customize other aspects of the plot:
ggtitle()
: Adds a main title to the plotxlab()
: Labels the x-axisylab()
: Labels the y-axisscale_fill_manual()
: Colors for categorical variables when plotting. In the context of boxplots, scatter plots, bar charts, or any other plot type, it allows you to control the color aesthetics (e.g., the fill color) for different groups or categories.
ggplot(data = , aes(x = )) + geom_histogram(binwidth = )
Plots a histogram of the data.
Specify the variable on the x-axis with
aes(x = )
In
geom_histogram()
, specify thebinwidth =
or thebins =
<- median_bootstrap %>%
median_dist_plot ggplot(aes(x = stat)) +
geom_histogram(bins = 10) +
ggtitle("Sampling distribution") +
xlab("Median flipper length (mm)")
median_dist_plot
annotate(geom, "rect", xmin = , xmax = , ymin = , ymax = , fill = , alpha = )
Adds a shaded rectangle to a geom with the specified boundaries (xmin =, xmax =, ymin =, ymax = ), fill colour (fill = ), and transparency level (alpha = ).
Template:
+
median_dist_plot annotate("rect",
xmin = median_95ci$lower_ci,
xmax = median_95ci$upper_ci,
ymin = 0, ymax = Inf, fill = "deepskyblue",
alpha = 0.3)
11.8 ggplot(data = , aes(x = )) + geom_boxplot()
Plots a boxplot of the data.
ggplot(data = penguins, aes(x = species, y = flipper_length_mm, fill = species)) +
geom_boxplot() +
ggtitle("Boxplot of Flipper Length by Species") + # Add a title
xlab("Species") + # Label for the x-axis
ylab("Flipper Length (mm)") + # Label for the y-axis
scale_fill_manual(values = c("Adelie" = "skyblue", "Chinstrap" = "orange", "Gentoo" = "#2e8b57")) # Custom colors for each species
11.9 Calculating quantiles
quantile(x, probs = )
Calculates the sample quantile on the data x
corresponding to the given probability probs
.
%>% pull(stat) %>% quantile(0.05) median_bootstrap
5%
190
11.10 Normal and t-distribution functions
qnorm(p, mean = , sd = )
Calculates a quantile on the normal distribution with specified mean (mean =
) and standard deviation (sd =
) where the proportion p
of the distribution is less than the quantile. The default is mean = 0 and sd = 1 (i.e. the standard normal).
qnorm(0.25, 57, 54.4)
[1] 20.30776
dnorm(x, mean=, sd =)
Calculates the probability density function (PDF) for the normal distribution with mean mean
and standard deviation sd
. It returns the density of the normal distribution at a x
. The default values for mu is 0 and sd is 1.
dnorm(0.25, mean = 1, sd=1)
[1] 0.3011374
pnorm(x, mean=, sd =)
Calculates the cumulative distribution function for the normal distribution with mean mean
and standard deviation sd
. The default is a standard normal distribution (mean is 0 and sd is 1). It returns the probability that a normally distributed random variable is less than or equal to x
.
pnorm(0.25, mean = 1, sd=1)
[1] 0.2266274
pnorm
and qnorm
take a lower.tail
argument, for which default value is TRUE
. It controls whether the function computes probabilities or quantiles for the lower (TRUE) or upper tail (FALSE) of the distribution.
pnorm(2, mean = 0, sd = 1) # P(X <= 2) for a standard normal distribution
[1] 0.9772499
pnorm(2, mean = 0, sd = 1, lower.tail = FALSE) # P(X > 2) for a standard normal distribution
[1] 0.02275013
qt(p, df = )
Calculates a quantile on the t-distribution with specified degrees of freedom (df =
) where the proportion p
of the distribution is less than the quantile.
qt(0.25, 23)
[1] -0.6853063
dt(x, df = )
Calculates the density of the t-distribution with specified degrees of freedom (df =
) at observation x
. It gives the likelihood of a value occurring at x
in a t-distribution, as opposed to the cumulative probability (see below).
dt(0.25, 23)
[1] 0.3819868
pt(x, df = )
Calculates the cumulative distribution function (CDF) of the t-distribution. It is used when we want to determine the probability that a value drawn from a t-distribution with df
degrees of freedom is less than or equal to x
.
pt(0.25, 23)
[1] 0.5975966
pt
and qt
take a lower.tail
argument, for which default value is TRUE
. It controls whether the function computes probabilities or quantiles for the lower (TRUE) or upper tail (FALSE) of the distribution.
pt(2, df = 23) # P(X <= 2) for a t-distribution with 2 degrees of freedom
[1] 0.9712777
pt(2, df = 23, lower.tail = FALSE) # P(X > 2) for a t-distribution with 2 degrees of freedom
[1] 0.02872227
11.11 Hypothesis testing
11.12 Method 1: Simulation
Hypothesis testing with the infer package:
specify()
specifies the variable that you are interested in, and sometimes (e.g. independent two-sample tests) a formula that describes how the variable is groupedhypothesise(null)
states the null hypothesis, which should be one of:hypothesise(null = "point", mu = mu_0)
hypothesise(null = "independence")
hypothesise(null = "paired independence")
generate(type = ... , rep = ...)
generates the resamplescalculate(stat = ...)
calculates the statistic you are interested in for each re-sampleget_p_value()
calculates the p-value of the test.
Bootstrap hypothesis testing (resampling with replacement)
When using bootstrap hypothesis testing, we set the type
argument of generate()
to "bootstrap"
.
Test whether a single parameter (i.e., proportion, mean, median, sigma) differs from a hypothesised value
Example: mean
hypothesise(null = "point", mu = mu_0)
tests if the population mean is equal to mu_0
.
= 190
mu_0 # Create null distribution
<- adelie %>%
mean_null specify(response = flipper_length_mm) %>% # Step 1: Specify model
hypothesise(null = "point", mu = mu_0) %>% # Step 2: Set null hypothesis
generate(type = "bootstrap", reps = 1000) %>% # Step 3: Generate null distribution
calculate(stat = "mean") # Step 4: Calculate test statistic
# Get p-value
get_p_value(mean_null, obs_stat = mean(adelie$flipper_length_mm), direction = "both") # Step 5: Calculate the p value
# A tibble: 1 × 1
p_value
<dbl>
1 0.86
Example: median
hypothesise(null = "point", med = med_0)
tests if the population median is equal to mu_0
.
= 189
med_0 # Create null distribution
<- adelie %>%
median_null specify(response = flipper_length_mm) %>% # Step 1: Specify model
hypothesise(null = "point", med = med_0) %>% # Step 2: Set null hypothesis
generate(type = "bootstrap", reps = 1000) %>% # Step 3: Generate null distribution
calculate(stat = "median") # Step 4: Calculate test statistic
# Get p-value
get_p_value(median_null, obs_stat = median(adelie$flipper_length_mm), direction = "both") # Step 5: Calculate the p value
# A tibble: 1 × 1
p_value
<dbl>
1 0.232
Test whether a set of paired population differ
hypothesise(null = "paired independence")
should be used when testing paired data, where each observation in one sample has a corresponding observation in another sample.
Permutation resampling (sample without replacement, like shuffling)
When using permutation hypothesis testing, we set the type
argument of generate()
to "permute"
.
hypothesise(null = "independence")
should be used to test whether two groups are independent, meaning there is no association or relationship between them in the population (e.g., flipper length between two different penguin species).
In addition:
specify(formula)
needs a formula of the form y \sim x wherex
is the response (e.g., flipper_length_mm) andy
the explanatory variable (e.g., species). Permutation resampling won’t work withinfer
if the explanatory variable is not specified since permutation resampling is commonly used to test for independence or equality between two groups.
Example: difference in median
We will look at where there is a difference in the median flipper length (mm) between female and male adelin penguins.
<-
obs_diff_in_medians %>%
adelie specify(formula = flipper_length_mm ~ sex) %>%
calculate(stat = "diff in medians", order = c("female", "male"))
# get p-value
%>%
adelie specify(formula = flipper_length_mm ~ sex) %>% # Step 1: Specify model
hypothesize(null = "independence") %>% # Step 2: Set null hypothesis
generate(reps = 1000, type = "permute") %>% # Step 3: Generate null distribution
calculate(stat = "diff in medians", order = c("female", "male"))%>% # Step 4: Calculate test statistic
get_p_value(obs_stat = obs_diff_in_medians, direction = "both") # Step 5: Calculate the p value
# A tibble: 1 × 1
p_value
<dbl>
1 0
11.13 Method 2: Theory-based hypothesis testing
Theory-based hypothesis tests require assumptions on the data.
Proportions (categorical): Z-tests
For one group:
The test statistic of the z-test is:
Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}},
Z \sim N(0, 1), (approx.)
where \hat{p} is the sample proportion, p_0 is the hypothesized proportion value, and n is the sample size.
Example: we think that the proportion p of female adelie penguins is greater than p_0 = 0.5. It can be written as follows: H_0: p = 0.5, H_1: p > 0.5
# Hypothesized proportion
<- 0.5
adelie_p0_f # Observed proportion of females
<- mean(adelie$sex == "female")
adelie_prop_f # Estimated standard error
<- sqrt(adelie_p0_f * (1 - adelie_p0_f) / nrow(adelie))
adelie_prop_f_se
# Calculate the p-value
pnorm(adelie_prop_f, 0.5, adelie_prop_f_se, lower.tail = FALSE)
[1] 0.5
Difference between two groups:
The test statistic used when comparing the proportion from two independent populations is:
Z = \frac{(\hat{p}_1 - \hat{p}_2) - \Delta_0}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
Z \sim N(0, 1), (approx.)
where \hat{p}_1 is the sample proportion of sample 1, \hat{p}_2 is the sample proportion of sample 2, \hat{p} is the sample proportion of for both samples combined, \Delta_0 is the hypothesized difference, n_1 is the size of sample 1, and n_2 is the size of sample 2.
Example:
We want to test whether there is a difference in proportion of flipper longer than 200 mm between species.
# Calculate proportions of flipper length > 200 for Adelie and Chinstrap
<- mean(adelie$flipper_length_mm > 200)
prop_adelie <- mean(chinstrap$flipper_length_mm > 200)
prop_chinstrap <- mean(c(chinstrap$flipper_length_mm > 200, adelie$flipper_length_mm > 200))
prop_both
# Standard error of the difference in proportions
<- sqrt((prop_both * (1 - prop_both)) *
std_error_diff 1/nrow(adelie) + 1/nrow(chinstrap)))
(
# Z-score for the difference in proportions
<- (prop_chinstrap - prop_adelie) / std_error_diff
z_score
# P-value for the two-tailed test
<- 2 * pnorm(abs(z_score), lower.tail = FALSE) p_value
Means: t-test and ANOVA
One group: one sample t-test
The test statistic to test whether the population mean differ from a hypothesized value is:
- T = \frac{(\bar{X} - \mu_0)}{s/\sqrt{n}},
- T \sim t_{n-1}, (approx.)
where \bar{X} is the sample mean, \mu_0 is the hypothesized mean, s is the sample standard deviation, n is the sample size, and n-1 is the degrees of freedom of the t-distribution.
Example: we want to test if the average flipper length of adelie penguins is different from 190.
<- 190
mu0
# Calculate sample statistics
<- mean(adelie$flipper_length_mm)
sample_mean <- sd(adelie$flipper_length_mm)
sample_sd
# Calculate the t-statistic
<- (sample_mean - mu0) / (sample_sd / sqrt(nrow(adelie)))
t_statistic
# Calculate the p-value (two-tailed) using pt()
2 * pt(-abs(t_statistic), df = (nrow(adelie) - 1))
[1] 0.8493036
Alternatively we can use the t.test(x,...)
function as follows:
x
: data for the variable of interest (e.g., flipper_length_mm)alternative
: the direction of the alternative. It can be eithertwo.sided
,less
orgreater
.
t.test(x = adelie$flipper_length_mm,
mu = 190,
alternative = "two.sided")
Difference between two independent groups: two-sample t-test
The test statistics to compare the mean of two independent population is:
- T = \frac{\bar{X}_1 - \bar{X}_2 - \Delta_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}},
- T \sim t_{\min(n_1 - 1, n_2 - 1)}, (approx.)
where \bar{X}_1 is the mean of sample 1, \bar{X}_2 is the mean of sample 2, \Delta_0 is the hypothesized difference between the mean of two population under the null, s_1 is the standard deviation of sample 1, s_2 is the standard deviation of sample 2, n_1 is the size of sample 1, n_2 is the size of sample 2, and \min(n_1 - 1, n_2 - 1) is the degrees of freedom of the t-distribution. Note that other, and better ways to calculate the degrees of freedom are often used. We use \min(n_1 - 1, n_2 - 1) for simplicity.
Example: we want to test whether there is significant difference between the average flipper length of difference penguin species.
# Compute the means for both groups
<- mean(adelie$flipper_length_mm)
mean_adelie <- mean(chinstrap$flipper_length_mm)
mean_chinstrap
# Compute sample sd for both groups
<- sd(adelie$flipper_length_mm)
sd_adelie <- sd(chinstrap$flipper_length_mm) sd_chinstrap
# Calculate the pooled standard error
<- sqrt((sd_adelie^2 / nrow(adelie)) + (sd_chinstrap^2 / nrow(chinstrap)))
pooled_se
# Calculate the t-statistic for the difference in means
<- (mean_adelie - mean_chinstrap) / pooled_se
t_statistic
# Calculate the two-tailed p-value using pt()
2 * pt(-abs(t_statistic), df = min( nrow(adelie) - 1, nrow(chinstrap) - 1))
[1] 4.143797e-07
Alternatively, we can use the t.test()
function, where:
x
: variable for group 1y
: variable for group 2mu
: hypothesized difference. The default is0
(i.e. there is no difference between the two populations).alternative
: direction of the alternative hypothesis. The options are “two.sided”, “less”, “greater”.var.equal
: the method we learned in class assumes that the variance is equal in the two population, thus this argument is set toTRUE
. But one useFALSE
if there is evidence that the variance is unequal.
t.test(adelie$flipper_length_mm,
$flipper_length_mm,
chinstrapalternative = "two.sided",
var.equal = TRUE)
Two Sample t-test
data: adelie$flipper_length_mm and chinstrap$flipper_length_mm
t = -5.7979, df = 212, p-value = 2.413e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-7.66579 -3.77579
sample estimates:
mean of x mean of y
190.1027 195.8235
t.test()
uses a different approximation of the degree of freedom df
than \min(n_1 - 1, n_2 - 1), which explains the different result. The use of \min(n_1 - 1, n_2 - 1) is a “conservative” estimate of the degrees of freedom. The function t.test()
uses df = n_1 + n_2 - 2.
Difference between paired groups: paired-sample t-test
The test statistic to compare the mean of two paired populations is:
- T = \frac{\bar{d} - \Delta_0}{\frac{s_d}{\sqrt{n}}},
- T \sim t_{n - 1}, (approx.)
where \bar{d} is the mean difference in paired values from both samples, \Delta_0 is the hypothesized difference between the paired population under the null, s_d is the standard deviation of the differences in paired values from both samples, and n - 1 is the degree of freedom of the t-distribution.
Example:
Using the t.test()
function
x
: variable for sample 1y
: variable for sample 2mu
: hypothesized difference. The default is0
(i.e. there is no difference between the two populations).alternative
: direction of the alternative hypothesis. The options are “two.sided”, “less”, “greater”.paired
: set toTRUE
for paired t-test
Test whether the body mass of Gentoo penguins on Biscoe Island differ between 2008 and 2009.
<- gentoo_mass_diff %>% filter(Year == 2008)
mass_2008 <- gentoo_mass_diff %>% filter(Year == 2009)
mass_2009
t.test(mass_2008$Body_mass_g, mass_2009$Body_mass_g,
alternative = "two.sided", paired = TRUE)
Paired t-test
data: mass_2008$Body_mass_g and mass_2009$Body_mass_g
t = -0.87943, df = 15, p-value = 0.393
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-385.1638 160.1638
sample estimates:
mean difference
-112.5
Difference between three or more independent groups ANOVA
One-way ANOVA
11.14 aov()
Fits an ANOVA to the data.
<- aov(flipper_length_mm ~ species, data = penguins)
one_way_anova summary(one_way_anova)
Df Sum Sq Mean Sq F value Pr(>F)
species 2 52473 26237 594.8 <2e-16 ***
Residuals 339 14953 44
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2 observations deleted due to missingness
aov()
takes formula
as an argument (of usual form y \sim x). This should specify the response of interest (e.g., flipper_length_mm) and the explanatory variable (e.g., species). The function summary returns the following:
Df
: Degrees of freedom for the factor (species) and residuals.Sum Sq
: The sum of squares for species and residuals.Mean Sq
: The mean sum of squares, calculated by dividing the sum of squares by the degrees of freedom.F value
: The F-statistic for the factor (here species). It tests whether the means of species differ significantly).Pr(>F)
: p-value.
11.15 Multiple testing
p.adjust(p, method = "")
Provides a numeric vector of corrected p-values (of the same length as p, with names copied from p).
Specify
p
, a numeric vector of p-valuesSpecify the method that you want to use to adjust p-values in the argument
method
which should be one of:- “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”.
# Create vector of p-values
= c(0.001,0.34,0.21,0.012,0.4,0.01)
p_values <- p.adjust(p_values, method = "bonferroni") pval_bonf