Chapter 7 Statistical Analysis

7.1 t-test

7.1.1 example: Independent t-test

Let’s start by analyzing the difference in strength between superheroes from different publishers.

heroes2 |>
  group_by(Publisher) |>
  summarize(mean_strength = mean(Strength, na.rm = TRUE))
## # A tibble: 4 × 2
##   Publisher     mean_strength
##   <chr>                 <dbl>
## 1 DC Comics              64.6
## 2 George Lucas           33.3
## 3 Image Comics           60  
## 4 Marvel Comics          43.1

We can observe from the dataset that the mean strength of superheroes seems to vary by publisher. But is this difference statistically significant?

Let’s conduct a t-test to compare heroes from DC Comics and Marvel Comics.

heroes2 |>
  filter(Publisher %in% c("Marvel Comics", "DC Comics")) |>
  group_by(Publisher) |>
  summarize(Strengths = list(Strength)) |>
  with(t.test(Strengths[[1]], Strengths[[2]], paired = FALSE))
## 
##  Welch Two Sample t-test
## 
## data:  Strengths[[1]] and Strengths[[2]]
## t = 2.8807, df = 32.441, p-value = 0.00698
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   6.297946 36.652054
## sample estimates:
## mean of x mean of y 
##    64.600    43.125

That looks complicated, and I do not normally write it that way. Here is the step-by-step approach I prefer.

7.1.1.1 Step 1: Subset the data

First, we need to subset the data into two groups: one for DC Comics and one for Marvel Comics.

subset(heroes2, Publisher == "Marvel Comics") -> marvel
subset(heroes2, Publisher == "DC Comics") -> DC

7.1.1.2 Step 2: Conduct the t-test

Next, we use the t.test() function to calculate the difference in strength between the two groups.

t.test(marvel$Strength, DC$Strength, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  marvel$Strength and DC$Strength
## t = -2.8807, df = 32.441, p-value = 0.00698
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -36.652054  -6.297946
## sample estimates:
## mean of x mean of y 
##    43.125    64.600

From the results, we see that there is a significant difference in the strength levels between superheroes from DC Comics and Marvel Comics.

Important Note: Independent vs Paired t-test

In this case, we are comparing superheroes from two different groups (DC vs Marvel), which makes this an independent t-test. A paired t-test would be inappropriate here because the two groups consist of entirely different individuals.

However, when comparing repeated measurements on the same group of people (e.g., before and after a task), a paired t-test would be more suitable.
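To make the distinction concrete, here is a minimal sketch on made-up numbers (the vectors below are hypothetical, not from heroes2). The same two vectors can be treated either way; `paired = TRUE` is only meaningful if position i in both vectors belongs to the same individual.

```r
# Made-up scores: five individuals per group
group_a <- c(55, 60, 48, 72, 66)
group_b <- c(61, 64, 50, 79, 70)

# Independent: the values in each vector come from different individuals
t.test(group_a, group_b, paired = FALSE)$method

# Paired: position i in both vectors is the same individual measured twice
t.test(group_a, group_b, paired = TRUE)$method
```

The first call runs a Welch two-sample t-test, the second a paired t-test; the choice changes both the degrees of freedom and the p-value.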

7.1.2 example: Paired t-test

Remember the performance data from earlier? Let’s say we want to analyze if there’s any difference in exam scores between Chapter 1 and Chapter 3 for the same group of students.

We can perform a paired t-test in this case:

# group them first
subset(performance, Subject_em == "Eng" & Chapter == "C1") -> eng_c1
subset(performance, Subject_em == "Eng" & Chapter == "C3") -> eng_c3

t.test(eng_c1$Score, eng_c3$Score, paired = TRUE)
## 
##  Paired t-test
## 
## data:  eng_c1$Score and eng_c3$Score
## t = -0.44114, df = 5, p-value = 0.6775
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -17.06789  12.06789
## sample estimates:
## mean difference 
##            -2.5

This test checks whether the students’ performance significantly changed between Chapter 1 and Chapter 3. The paired argument ensures that the test accounts for the fact that these are the same individuals in both conditions.
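A useful sanity check (on hypothetical scores, since the performance data are not reproduced here): a paired t-test is mathematically identical to a one-sample t-test on the per-student differences.

```r
# Hypothetical scores for the same six students on two chapters
c1 <- c(50, 62, 55, 70, 65, 58)
c3 <- c(52, 63, 54, 75, 70, 60)

paired    <- t.test(c1, c3, paired = TRUE)
onesample <- t.test(c1 - c3, mu = 0)

# The t statistics and p-values are identical
c(paired$statistic, onesample$statistic)
```

This is why pairing matters: the test operates on within-student differences, removing between-student variability from the comparison.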

7.2 ANOVA

ANOVA is a powerful statistical tool used to compare the means of multiple groups. It helps determine whether the observed differences between group means are statistically significant. Here’s how we can use anova_test() from the rstatix package in R to perform various types of ANOVA.

library(rstatix)
heroes2 |>
  anova_test(
    dv = ,               # dependent variable
    wid = ,              # subject name/ID
    type = NULL,         # default: type II ANOVA
    between = ,          # between-subject independent variable(s)
    within = ,           # within-subject independent variable(s)
    effect.size = "ges", 
    error = NULL,
    white.adjust = FALSE,
    observed = NULL,
    detailed = TRUE)

  • detailed = TRUE: requests a more detailed statistical output. FALSE is the default.

  • type = NULL: type II ANOVA is the default. You can change this to 1 or 3 based on your needs.

  • effect.size =: computes the effect size, using either generalized eta squared ("ges", the default) or partial eta squared ("pes").

    • The Help section in RStudio says I could choose both, but I haven’t figured out how to do so yet.

  • For ANCOVA, you can add covariate = to include continuous variables as covariates. Insert it below within = and above type =.

7.2.1 one-way ANOVA

One-way ANOVA is used when we have one independent variable (IV) and one dependent variable (DV).

7.2.1.1 example 1: Strength ~ Intelligence

  • IV: Intelligence

  • DV: Strength

heroes2 |>
  anova_test(
    dv = Strength, 
    wid = Name, 
    between = Intelligence,
    detailed = T)
## ANOVA Table (type II tests)
## 
##         Effect      SSn    SSd DFn DFd     F     p p<.05   ges
## 1 Intelligence 8758.544 129262   4 136 2.304 0.062       0.063

7.2.1.2 example 2: Strength ~ Gender

  • IV: Gender

  • DV: Strength

heroes2 |>
  anova_test(
    dv = Strength, 
    wid = Name, 
    between = Gender,
    detailed = T)
## ANOVA Table (type II tests)
## 
##   Effect      SSn      SSd DFn DFd     F    p p<.05   ges
## 1 Gender 3772.182 134248.4   1 139 3.906 0.05       0.027

7.2.2 two-way ANOVA

Two-way ANOVA is used when we have two independent variables and one dependent variable. Here, we look at how these two IVs interact with each other.

7.2.2.1 example 1: Strength ~ BMI_status * Gender

heroes2 |>
   anova_test(
    dv = Strength, 
    wid = Name, 
    between = c(Gender, BMI_status), 
    detailed = TRUE) 
## ANOVA Table (type III tests)
## 
##              Effect       SSn      SSd DFn DFd      F        p p<.05      ges
## 1       (Intercept) 68473.044 109803.3   1 133 82.938 1.11e-15     * 0.384000
## 2            Gender    51.278 109803.3   1 133  0.062 8.04e-01       0.000467
## 3        BMI_status  8413.707 109803.3   3 133  3.397 2.00e-02     * 0.071000
## 4 Gender:BMI_status  6885.839 109803.3   3 133  2.780 4.40e-02     * 0.059000

Note: Be mindful of the ANOVA type. Even if you don’t specify a type, anova_test() may default to type III ANOVA, as it did here. If you prefer type II ANOVA, specify type = 2.

7.2.2.2 example 2: strength ~ gender * BMI

heroes2 |>
  anova_test(
    dv = Strength, 
    wid = Name, 
    between = c(Gender, BMI), 
    detailed = T)

However, we would encounter an error here. Why is that?

Step 1: Check for mismatched variable types. For example, putting both categorical (e.g., Gender) and numerical (e.g., BMI) variables in the between = section will cause an error.

Solution 1

  • convert numerical variables into categorical variables

    For example, you can classify BMI into levels like obese, overweight, normal, and underweight. (We did this in the previous section.)

# Solution 1
heroes2 |>
   anova_test(
    dv = Strength, 
    wid = Name, 
    between = c(Gender, BMI_status), 
    detailed = T)   
## ANOVA Table (type III tests)
## 
##              Effect       SSn      SSd DFn DFd      F        p p<.05      ges
## 1       (Intercept) 68473.044 109803.3   1 133 82.938 1.11e-15     * 0.384000
## 2            Gender    51.278 109803.3   1 133  0.062 8.04e-01       0.000467
## 3        BMI_status  8413.707 109803.3   3 133  3.397 2.00e-02     * 0.071000
## 4 Gender:BMI_status  6885.839 109803.3   3 133  2.780 4.40e-02     * 0.059000
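As an aside, the numeric-to-categorical binning used in Solution 1 can be sketched with base R’s cut(). The BMI values and cutoffs below are assumptions for illustration, not necessarily the ones used in the earlier section.

```r
# Hypothetical BMI values; the cutoffs are assumed for illustration
bmi <- c(17, 22, 27, 33)
bmi_status <- cut(bmi,
                  breaks = c(-Inf, 18.5, 25, 30, Inf),
                  labels = c("underweight", "normal", "overweight", "obese"))
as.character(bmi_status)  # "underweight" "normal" "overweight" "obese"
```

cut() returns a factor, which is exactly the categorical type that between = expects.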

Solution 2

  • use ANCOVA (Analysis of Covariance) (continuous * category ~ continuous)

    ANCOVA is suitable when you have a continuous variable that you want to adjust for or control while analyzing the main effects of categorical independent variables. In ANCOVA, you include the continuous variable as a covariate.

# Solution 2 (ANCOVA)
heroes2 |>
   anova_test(
    dv = Strength, 
    wid = Name, 
    between = Gender, 
    covariate = BMI,
    detailed = T)   
## ANOVA Table (type II tests)
## 
##   Effect       SSn      SSd DFn DFd      F        p p<.05      ges
## 1    BMI 23336.275 110912.1   1 138 29.036 2.99e-07     * 1.74e-01
## 2 Gender    10.606 110912.1   1 138  0.013 9.09e-01       9.56e-05

In Solution 2, the BMI variable is used as a covariate. This allows you to account for the continuous variable (BMI) while examining the effect of gender on strength.

7.2.2.3 example 3: strength ~ gender * Intelligence

Let us take a look at another example.

In the heroes2 dataset, Intelligence is a categorical variable. The intelligence levels of superheroes are labelled as high, good, average, moderate, and low.

heroes2 |>
  distinct(Intelligence)
## # A tibble: 5 × 1
##   Intelligence
##   <chr>       
## 1 moderate    
## 2 good        
## 3 high        
## 4 average     
## 5 low

Therefore, we should not encounter the same problem as in example 2.

heroes2 |>
  anova_test(
    dv = Strength, 
    wid = Name, 
    between = c(Gender, Intelligence), 
    detailed = T)

However, we still see errors here. After checking Step 1, we can rule out mismatched variable types: both variables are categorical this time. We can try to debug using the following methods:

Step 2

We examine both independent variables separately to ensure they are functioning properly individually. We can delete one of the variables each time for testing. (Although this is usually not the problem, it’s still good practice to check.)

  • Checking the variable Gender
heroes2 |>
  anova_test(
    dv = Strength, 
    wid = Name, 
    between = Gender, 
    detailed = T)
## ANOVA Table (type II tests)
## 
##   Effect      SSn      SSd DFn DFd     F    p p<.05   ges
## 1 Gender 3772.182 134248.4   1 139 3.906 0.05       0.027

  • Checking the variable Intelligence
heroes2 |>
  anova_test(
    dv = Strength, 
    wid = Name, 
    between = Intelligence, 
    detailed = T)
## ANOVA Table (type II tests)
## 
##         Effect      SSn    SSd DFn DFd     F     p p<.05   ges
## 1 Intelligence 8758.544 129262   4 136 2.304 0.062       0.063

Both variables look fine, and the code works smoothly when tested individually.

Step 3

We examine whether any cell of the design is empty, i.e., whether some combination of categories has no observations. (I often forget about this step and spend hours confused about why my ANOVA isn’t ANOVAing…)

table(heroes2$Gender, heroes2$Intelligence)
##    
##     average good high low moderate
##   F      14   20    4   0        2
##   M      24   41   27   2        7

Here, we see that among female superheroes there are zero observations at the low intelligence level. This empty cell prevents the ANOVA from proceeding. Now that we have identified the problem, we can remove the Intelligence == "low" rows from our analysis.
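This kind of empty-cell check can also be done programmatically; a minimal sketch on hypothetical factors:

```r
# Hypothetical two-factor design with one empty cell
gender <- c("F", "F", "M", "M", "M", "M")
intel  <- c("high", "average", "high", "average", "low", "low")
cell_counts <- table(gender, intel)
any(cell_counts == 0)  # TRUE: there are no F/low observations
```

Running this check before a factorial ANOVA flags designs that would otherwise fail with an unhelpful error.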

heroes2 |>
  filter(Intelligence != "low") |>
   anova_test(
    dv = Strength, 
    wid = Name, 
    between = c(Gender, Intelligence), 
    type = 2,
    detailed = T)   
## ANOVA Table (type II tests)
## 
##                Effect      SSn      SSd DFn DFd     F     p p<.05   ges
## 1              Gender 1982.076 126823.2   1 131 2.047 0.155       0.015
## 2        Intelligence 3719.933 126823.2   3 131 1.281 0.284       0.028
## 3 Gender:Intelligence  406.757 126823.2   3 131 0.140 0.936       0.003

You might think: “I’ll just add group_by() before the anova_test().” However, while this approach would not give you an error message, it would not yield the two-way ANOVA we intended. Instead, it would just split the data into two groups and perform a one-way ANOVA on each. See the code below:

heroes2 |>
  filter(Intelligence != "low") |>
  group_by(Gender) |>
   anova_test(
    dv = Strength, 
    wid = Name, 
    between = Intelligence, 
    detailed = T)   
## # A tibble: 2 × 10
##   Gender Effect         SSn    SSd   DFn   DFd     F     p `p<.05`   ges
## * <chr>  <chr>        <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <chr>   <dbl>
## 1 F      Intelligence 1564. 33535.     3    36  0.56 0.645 ""      0.045
## 2 M      Intelligence 2563. 93288.     3    95  0.87 0.46  ""      0.027

We can see that this result table differs from the two-way ANOVA table above, and the numbers differ as well (compare DFn and DFd in the two results).
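As a cross-check, a genuine two-way between-subjects ANOVA can also be fit with base R’s aov(); the sketch below uses simulated stand-in data, since it assumes heroes2 is not at hand.

```r
set.seed(42)
# Simulated stand-in: two crossed factors and a numeric outcome
dat <- data.frame(
  gender   = rep(c("F", "M"), each = 20),
  intel    = rep(c("high", "average"), times = 20),
  strength = rnorm(40, mean = 50, sd = 15))

fit <- aov(strength ~ gender * intel, data = dat)
summary(fit)  # one row each for gender, intel, gender:intel, Residuals
```

Unlike the group_by() approach, this single model includes the gender:intel interaction term, which is the defining feature of a two-way ANOVA.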

7.2.3 three-way ANOVA

7.2.3.1 example 1: Strength ~ Gender * Intelligence * BMI_status

If we want to add more independent variables, we can simply include additional column names in the between = section.

heroes2 |>
  filter(Intelligence != "low") |>
   anova_test(
    dv = Strength, 
    wid = Name, 
    between = c(Gender, Intelligence, BMI_status), 
    type = 2,
    detailed = T)   
## ANOVA Table (type II tests)
## 
##                           Effect       SSn      SSd DFn DFd     F        p p<.05
## 1                         Gender  1083.194 96626.99   1 118 1.323 0.252000      
## 2                   Intelligence  3465.586 96626.99   3 118 1.411 0.243000      
## 3                     BMI_status 16524.941 96626.99   3 118 6.727 0.000314     *
## 4            Gender:Intelligence  2703.193 96626.99   2 118 1.651 0.196000      
## 5              Gender:BMI_status  8295.111 96626.99   3 118 3.377 0.021000     *
## 6        Intelligence:BMI_status  7715.048 96626.99   7 118 1.346 0.235000      
## 7 Gender:Intelligence:BMI_status        NA 96626.99   0 118    NA       NA  <NA>
##     ges
## 1 0.011
## 2 0.035
## 3 0.146
## 4 0.027
## 5 0.079
## 6 0.074
## 7    NA

Note that if the independent variable is within-subject (in the case of a mixed analysis), we place the within-subject independent variables in the within = section. This is commonly used when analyzing experimental results, such as comparing test scores before learning and after learning. See below for more details on mixed ANOVA.

7.2.4 mixed ANOVA

As mentioned, mixed ANOVA is often used for analyzing experimental results, as it allows for tracking individual changes over time, which is difficult to achieve with other methods.

Let’s assume a high school wants to investigate whether different teaching styles affect students’ exam scores. They conducted an experiment, dividing students into two groups: one received education in a fun way, while the other was taught in a more rigid manner. The school also tested the students before and after they learned the subject.

  • Independent variable: teaching style (fun/rigid), exam time (before/after)

  • Dependent variable: exam score

teaching_style <- data.frame(
  student = c("1", "2", "3", "4", "5", "6", 
              "1", "2", "3", "4", "5", "6"),
  style = as.factor(c("fun", "rigid", "fun", "rigid", "fun", "rigid",
                      "fun", "rigid", "fun", "rigid", "fun", "rigid")),
  timing = as.factor(c("before", "before", "before", "before", "before", "before", 
                       "after", "after", "after", "after", "after", "after")),
  score = c(50, 13, 10, 70, 23, 70, 70, 90, 66, 83, 80, 80))

The data frame is created with 6 students: 3 participated in the fun teaching style, and 3 experienced the rigid style. All of their before and after exam scores were recorded.

Here, we want to see if the teaching style influenced their exam scores.

teaching_style |>
   anova_test(
    dv = score, 
    wid = student, 
    between = style, 
    within = timing,
    detailed = T)   
## ANOVA Table (type II tests)
## 
##         Effect DFn DFd       SSn      SSd       F       p p<.05   ges
## 1  (Intercept)   1   4 41418.750 1278.667 129.569 0.00034     * 0.929
## 2        style   1   4   954.083 1278.667   2.985 0.15900       0.232
## 3       timing   1   4  4524.083 1876.667   9.643 0.03600     * 0.589
## 4 style:timing   1   4    90.750 1876.667   0.193 0.68300       0.028

From the analysis, we can observe that teaching style does not have a significant impact on their scores. However, there is a main effect of the timing variable.

By examining the mean scores across different times, we can see that students performed better after learning, regardless of the teaching style applied.

teaching_style |>
  group_by(timing) |>
  summarise(mean_score = mean(score))
## # A tibble: 2 × 2
##   timing mean_score
##   <fct>       <dbl>
## 1 after        78.2
## 2 before       39.3
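The same mixed design can be reproduced with base R’s aov() by adding an Error() term for the within-subject factor. This is only a rough cross-check, rebuilding the teaching_style data from above.

```r
# Rebuild the teaching_style data from above
teaching_style <- data.frame(
  student = rep(as.character(1:6), times = 2),
  style   = factor(rep(c("fun", "rigid"), times = 6)),
  timing  = factor(rep(c("before", "after"), each = 6)),
  score   = c(50, 13, 10, 70, 23, 70, 70, 90, 66, 83, 80, 80))

# timing varies within each student; style varies between students
fit <- aov(score ~ style * timing + Error(student/timing),
           data = teaching_style)
summary(fit)
```

The Error(student/timing) term tells aov() to test style against between-student variability and timing (plus the interaction) against within-student variability, mirroring what anova_test() does with between = and within =.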

7.2.5 Additional notes

  • Use filter() to choose specific groups for analysis.

    For instance, filter(Intelligence %in% c("high", "good")) or filter(Hair_color != "No Hair").

  • group_by() is useful if you want to run the same ANOVA separately for each group in a single step.

I recently learned that the anova_test() function does not support more than one within-subject variable, so you might encounter error messages when you try (e.g., within = c(Var1, Var2)).

One way to avoid this is to temporarily collapse the dataset down to just the variables you need.

method 1: collapse the data temporarily

heroes2 |>
  filter(Publisher %in% c("Marvel Comics", "DC Comics")) |>
  filter(Gender %in% c("F", "M")) |>
  group_by(Publisher, Gender, Name) |>
  summarize(Height = mean(Height, na.rm = TRUE), .groups = "drop") -> temporary_hero_data_for_anova

temporary_hero_data_for_anova
## # A tibble: 137 × 4
##    Publisher Gender Name         Height
##    <chr>     <chr>  <chr>         <dbl>
##  1 DC Comics F      Miss Martian   178.
##  2 DC Comics F      Raven          165.
##  3 DC Comics F      Starfire       194.
##  4 DC Comics F      Supergirl      165.
##  5 DC Comics F      Vixen          176.
##  6 DC Comics F      Wonder Woman   183.
##  7 DC Comics M      Adam Strange   185.
##  8 DC Comics M      Alan Scott     181.
##  9 DC Comics M      Aqualad        179.
## 10 DC Comics M      Aquaman        186.
## # ℹ 127 more rows

Here, as you can see from the temporary dataset, I only need two main variables: Publisher and Gender. Therefore, I tell R that none of the other variables matter to me right now. I only need these two variables and Name to identify each unique hero. I use the summarize() function to calculate each hero’s mean Height across different records, using na.rm = TRUE to ignore missing values.

The .groups = "drop" argument tells R to drop all grouping levels after the summarize() operation is complete. This means that the resulting dataset will no longer retain any of the grouping structure (Publisher, Gender, or Name), making the output a regular data frame without group attributes. This is useful when we don’t need to continue further grouped calculations on this summarized data.

Then you could use this data for further ANOVA analysis.

temporary_hero_data_for_anova |>
  filter(Publisher %in% c("Marvel Comics", "DC Comics")) |>
  filter(Gender %in% c("F", "M")) |>
  group_by(Publisher) |>
  anova_test(
    dv = Height,
    wid = Name,
    between = Gender,
    effect.size = "ges",  
    detailed = FALSE
  )
## # A tibble: 2 × 8
##   Publisher     Effect   DFn   DFd     F     p `p<.05`   ges
## * <chr>         <chr>  <dbl> <dbl> <dbl> <dbl> <chr>   <dbl>
## 1 DC Comics     Gender     1    23  1.93 0.178 ""      0.078
## 2 Marvel Comics Gender     1   110  3.62 0.06  ""      0.032

method 2: ez ANOVA

library(ez)
ezANOVA(
  final_not_temp,        # your data frame
  dv = acc,
  wid = Subject_pid, 
  within = c(Var1, Var2),
  between = NULL,        # add between-subject variables here, if any
  observed = NULL,
  diff = NULL,
  reverse_diff = FALSE,
  type = 2,
  white.adjust = FALSE,
  detailed = FALSE,
  return_aov = FALSE)

7.3 Post hoc analysis

After conducting an ANOVA, if we want to explore the data further (assuming the ANOVA results are significant), we can perform a post hoc analysis.

I used to be confused about the difference between a post hoc analysis and a pairwise t-test. It’s important to note that a post hoc analysis is done after obtaining a significant ANOVA result, while a pairwise t-test can be applied to any two groups of data.
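For reference, a classic post hoc test after a one-way ANOVA is Tukey’s HSD, available in base R as TukeyHSD(); the sketch below uses simulated data, not the teaching_style example.

```r
set.seed(7)
scores <- data.frame(
  group = rep(c("A", "B", "C"), each = 10),
  score = c(rnorm(10, 60, 8), rnorm(10, 65, 8), rnorm(10, 80, 8)))

fit   <- aov(score ~ group, data = scores)  # one-way ANOVA first
tukey <- TukeyHSD(fit)                      # then all pairwise comparisons
tukey  # adjusted p-values for B-A, C-A, and C-B
```

TukeyHSD() compares every pair of group means while controlling the family-wise error rate, which is the role a post hoc test plays after a significant ANOVA.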

teaching_style |>
 group_by(style) |>
  pairwise_t_test(
    score ~ timing,
    paired = TRUE,
    p.adjust.method = "bonferroni")
## # A tibble: 2 × 11
##   style .y.   group1 group2    n1    n2 statistic    df     p p.adj p.adj.signif
## * <fct> <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl> <dbl> <chr>       
## 1 fun   score after  before     3     3      3.64     2 0.068 0.068 ns          
## 2 rigid score after  before     3     3      1.53     2 0.267 0.267 ns

In this example, we use pairwise_t_test() for the post hoc analysis:

  • score ~ timing: Here, we put the dependent variable on the left and the independent variable on the right. We are comparing scores across different exam timing.

  • paired =: Set this to TRUE if it’s a repeated-measures design, and FALSE if it’s a between-subject design (i.e., different participants in each group).

  • p.adjust.method: This argument allows us to choose a method for adjusting p-values for multiple comparisons.
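The adjustment itself is easy to inspect with base R’s p.adjust(); the raw p-values below are made up for illustration.

```r
p <- c(0.010, 0.020, 0.030)         # hypothetical raw p-values
p.adjust(p, method = "bonferroni")  # 0.03 0.06 0.09 (each multiplied by 3)
p.adjust(p, method = "holm")        # 0.03 0.04 0.04 (less conservative)
```

Bonferroni simply multiplies each p-value by the number of comparisons (capped at 1), while Holm applies decreasing multipliers to the sorted p-values, making it uniformly at least as powerful.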

7.4 Regression analysis