How to run ANOVA on a wide format data.frame?

Asked 29/4, 2018 at 23:3 Answered 30/4, 2018 at 1:1

I've been taught to run an ANOVA with the formula: aov(dependent variable~independent variable, dataset)

but I am struggling with how to run an ANOVA on this particular dataset because the data are split across three columns: Newborn, adolescent and adult (the hamster's age group). The values in each column are blood pressure readings. I need to test whether there is a relationship between blood pressure and age.

This is what the data looks like in R:

> hamster
   Newborn adolescent adult
1      108        110   105
2      110        105   100
3       90        100    95
4       80         90    85
5      100        102    97
6      120        110   105
7      125        105   100
8      130        115   110
9      120        100    95
10     130        120   115
11     145        130   125
12     150        125   120
13     130        135   130
14     155        130   125
15     140        120   115

I'm confused because the dependent variable is the set of values within each column, rather than a single column of its own.

Contrite answered 29/4, 2018 at 23:3 Comment(0)

R has a useful function called stack to convert your data format into the one needed for ANOVA.

aov(values ~ ind, stack(hamster))

# Call:
#
# aov(formula = values ~ ind, data = stack(hamster))
#
# Terms:
#                       ind Residuals
# Sum of Squares   1525.378 11429.867
# Deg. of Freedom         2        42
#
# Residual standard error: 16.49666
# Estimated effects may be unbalanced
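To see what stack() hands to aov(), it can help to inspect the reshaped data first. The sketch below rebuilds the question's data frame by hand and uses base R only; the names long and fit are illustrative choices, not part of the original answer.

```r
# Rebuild the data frame from the question (values copied from the post)
hamster <- data.frame(
  Newborn    = c(108, 110, 90, 80, 100, 120, 125, 130, 120, 130, 145, 150, 130, 155, 140),
  adolescent = c(110, 105, 100, 90, 102, 110, 105, 115, 100, 120, 130, 125, 135, 130, 120),
  adult      = c(105, 100, 95, 85, 97, 105, 100, 110, 95, 115, 125, 120, 130, 125, 115)
)

# stack() returns two columns: 'values' (the measurements) and
# 'ind' (a factor naming the column each value came from)
long <- stack(hamster)
str(long)
head(long, 3)

# One-way ANOVA that treats all 45 values as independent observations
fit <- aov(values ~ ind, data = long)
summary(fit)
```

Note that this model has no notion of which hamster a value belongs to, which is why the residual degrees of freedom are 45 - 3 = 42.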

Subulate answered 29/4, 2018 at 23:34 Comment(2)

There is just a little problem: the output is all wrong. The residuals should have 28 degrees of freedom, not 42, and the SSerror should equal 1417, not 11429... An id column is missing, and the error term must be specified in aov with "+ Error(factor(id))". – Inez 8/12, 2023 at 12:18

@DenisCousineau there is no "id" in the table of the original question. There are just dataframe row numbers, which in R are always there. The design might as well be measuring the blood pressure of 3 different hamsters (one young, one middle aged, one old) across 15 different days. Or might be completely different animals, with equal number of samples per blood pressure group. – Vulcanism 11/12, 2023 at 10:39
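The two sets of numbers in these comments come from the same total variation, split differently. As a rough sketch (base R only, variable names are illustrative), the sums of squares can be computed by hand. Whether the rows really are the same 15 hamsters measured three times is an assumption about the design, as the second comment points out.

```r
# Blood pressure values from the question, one row per data frame row
bp <- cbind(
  Newborn    = c(108, 110, 90, 80, 100, 120, 125, 130, 120, 130, 145, 150, 130, 155, 140),
  adolescent = c(110, 105, 100, 90, 102, 110, 105, 115, 100, 120, 130, 125, 135, 130, 120),
  adult      = c(105, 100, 95, 85, 97, 105, 100, 110, 95, 115, 125, 120, 130, 125, 115)
)
n <- nrow(bp)   # 15 rows
k <- ncol(bp)   # 3 age groups
grand <- mean(bp)

ss_total   <- sum((bp - grand)^2)
ss_age     <- n * sum((colMeans(bp) - grand)^2)   # between age groups
ss_subject <- k * sum((rowMeans(bp) - grand)^2)   # between rows (hamsters, if rows are subjects)

# One-way ANOVA: everything that is not 'age' is residual, df = n*k - k = 42
ss_resid_oneway <- ss_total - ss_age

# Repeated measures (assumes each row is one hamster): subject variation
# is removed from the error term, df = (n - 1) * (k - 1) = 28
ss_error_rm <- ss_total - ss_age - ss_subject

c(age = ss_age, oneway_resid = ss_resid_oneway,
  subject = ss_subject, rm_error = ss_error_rm)

# F ratio for age under the repeated-measures split
(ss_age / (k - 1)) / (ss_error_rm / ((n - 1) * (k - 1)))
```

The one-way residual is the sum of the subject and repeated-measures error terms, so the choice between 42 and 28 degrees of freedom depends on whether the rows identify individual animals.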

The first step is to rearrange your data so it's in a "long" format instead of a "wide" format. This can be done in base R using the reshape function, but it's much easier to use the gather function in the tidyr package:

library(tidyr)
result <- hamster %>%
  gather(age, bp) %>%
  aov(bp ~ age, .)

Loading tidyr also gives us the pipe operator (%>%), which lets you chain commands together in a readable way. By default, it takes the result of the previous function and inserts it as the first argument of the next function. In the aov call, we override this with the . placeholder to explicitly pass the data set produced by gather as the 2nd argument.
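To make the pipe behaviour concrete, here is a small sketch that writes the same steps both ways and compares the fitted coefficients. It assumes the tidyr package is installed; the names piped, long and unrolled are illustrative only.

```r
library(tidyr)  # provides gather() and re-exports %>%

# Data frame from the question (values copied from the post)
hamster <- data.frame(
  Newborn    = c(108, 110, 90, 80, 100, 120, 125, 130, 120, 130, 145, 150, 130, 155, 140),
  adolescent = c(110, 105, 100, 90, 102, 110, 105, 115, 100, 120, 130, 125, 135, 130, 120),
  adult      = c(105, 100, 95, 85, 97, 105, 100, 110, 95, 115, 125, 120, 130, 125, 115)
)

# Piped form, as in the answer
piped <- hamster %>%
  gather(age, bp) %>%
  aov(bp ~ age, .)

# The same steps written as ordinary calls
long     <- gather(hamster, age, bp)  # hamster is passed as the first argument
unrolled <- aov(bp ~ age, long)       # the '.' marked this second argument

# Both routes fit the same model
all.equal(coef(piped), coef(unrolled))
```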

Freeness answered 29/4, 2018 at 23:12 Comment(3)

Though understand that you are violating the assumption of independence by having repeated measures of the same hamster. – Nalchik 29/4, 2018 at 23:15

If that is the case, that's a different question entirely - and requires a different analysis tool (or at a minimum, an extra step or two to keep track of which animal is which). My understanding was that the OP wanted to rearrange the data set to make aov work. – Freeness 29/4, 2018 at 23:18

Yes that is what was asked, which you answered, and I upvoted the answer. However OP should know that what I said is an issue. – Nalchik 29/4, 2018 at 23:21

The following code runs a repeated measures analysis of variance with one within-subject variable and no between-subjects variables. Note that the hamster id column is carried into the long data (it is not one of the gathered columns; the code also groups by id with group_by() from the dplyr package), so we can use it as the error term in the ANOVA.

hamsterData <- "id   Newborn adolescent adult
1      108        110   105
2      110        105   100
3       90        100    95
4       80         90    85
5      100        102    97
6      120        110   105
7      125        105   100
8      130        115   110
9      120        100    95
10     130        120   115
11     145        130   125
12     150        125   120
13     130        135   130
14     155        130   125
15     140        120   115"

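# Read the embedded table; the first column is the hamster id
hamster <- read.table(text = hamsterData, header = TRUE)
library(tidyr)
library(dplyr)
# Wide to long: one row per hamster and age group; id is kept because it is not gathered
result <- hamster %>% group_by(id) %>%
     gather(age, bp, Newborn, adolescent, adult)
# Order the age levels chronologically rather than alphabetically
result$age <- factor(result$age, levels = c("Newborn", "adolescent", "adult"))
# Note: this changes the contrast settings for the whole R session
options(contrasts = c("contr.sum", "contr.poly"))
# Error(factor(id)) treats each hamster as its own block (repeated measures)
modelAOV <- aov(bp ~ age + Error(factor(id)), data = result)
summary(modelAOV)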

...and the output:

> modelAOV <- aov(bp ~ age + Error(factor(id)),data = result)
> summary(modelAOV)

Error: factor(id)
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 14  10013   715.2               

Error: Within
          Df Sum Sq Mean Sq F value  Pr(>F)    
age        2   1525   762.7   15.07 3.6e-05 ***
Residuals 28   1417    50.6                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
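For readers who prefer to avoid extra packages, the same repeated-measures model can probably be fitted with base R alone, using reshape() for the wide-to-long step. This is a sketch under the same assumption as above (each row is one hamster measured three times). The id column is added by hand here, and the default contrasts are left unchanged, which does not affect the F test for age.

```r
# Data from the question, with an explicit id column for each hamster
hamster <- data.frame(
  id         = 1:15,
  Newborn    = c(108, 110, 90, 80, 100, 120, 125, 130, 120, 130, 145, 150, 130, 155, 140),
  adolescent = c(110, 105, 100, 90, 102, 110, 105, 115, 100, 120, 130, 125, 135, 130, 120),
  adult      = c(105, 100, 95, 85, 97, 105, 100, 110, 95, 115, 125, 120, 130, 125, 115)
)
ages <- c("Newborn", "adolescent", "adult")

# Wide to long: one row per hamster and age group
long <- reshape(hamster, direction = "long",
                varying = ages, v.names = "bp",
                timevar = "age", times = ages, idvar = "id")

# Both the grouping variable and the subject identifier must be factors
long$age <- factor(long$age, levels = ages)
long$id  <- factor(long$id)

# Error(id) removes between-hamster variation from the error term
fit <- aov(bp ~ age + Error(id), data = long)
summary(fit)
```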

Acquisitive answered 30/4, 2018 at 1:1 Comment(0)
