Trying to use dplyr to group_by and apply scale()
Trying to use dplyr to group_by the stud_ID variable in the following data frame, as in this SO question:

> str(df)
'data.frame':   4136 obs. of  4 variables:
 $ stud_ID         : chr  "ABB112292" "ABB112292" "ABB112292" "ABB112292" ...
 $ behavioral_scale: num  3.5 4 3.5 3 3.5 2 NA NA 1 2 ...
 $ cognitive_scale : num  3.5 3 3 3 3.5 2 NA NA 1 1 ...
 $ affective_scale : num  2.5 3.5 3 3 2.5 2 NA NA 1 1.5 ...

I tried the following to obtain scale scores by student (rather than scale scores for observations across all students):

scaled_data <- 
          df %>%
              group_by(stud_ID) %>%
                  mutate(behavioral_scale_ind = scale(behavioral_scale),
                         cognitive_scale_ind = scale(cognitive_scale),
                         affective_scale_ind = scale(affective_scale))

Here is the result:

> str(scaled_data)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 4136 obs. of  7 variables:
 $ stud_ID             : chr  "ABB112292" "ABB112292" "ABB112292" "ABB112292" ...
 $ behavioral_scale    : num  3.5 4 3.5 3 3.5 2 NA NA 1 2 ...
 $ cognitive_scale     : num  3.5 3 3 3 3.5 2 NA NA 1 1 ...
 $ affective_scale     : num  2.5 3.5 3 3 2.5 2 NA NA 1 1.5 ...
 $ behavioral_scale_ind: num [1:12, 1] 0.64 1.174 0.64 0.107 0.64 ...
  ..- attr(*, "scaled:center")= num 2.9
  ..- attr(*, "scaled:scale")= num 0.937
 $ cognitive_scale_ind : num [1:12, 1] 1.17 0.64 0.64 0.64 1.17 ...
  ..- attr(*, "scaled:center")= num 2.4
  ..- attr(*, "scaled:scale")= num 0.937
 $ affective_scale_ind : num [1:12, 1] 0 1.28 0.64 0.64 0 ...
  ..- attr(*, "scaled:center")= num 2.5
  ..- attr(*, "scaled:scale")= num 0.782

The three scaled variables (behavioral_scale_ind, cognitive_scale_ind, and affective_scale_ind) have only 12 observations, which is the number of observations for the first student, ABB112292.

What's going on here? How can I obtain scaled scores by individual?

Georgiana answered 3/3, 2016 at 15:3 Comment(3)
Have you looked into summarise() in dplyr ?Benedict
I think you should mutate before you group, or you are going to center every student's score on him/herselfDetour
@C8H10N4O2, on him/herself, so each student's observations will have M = 0 and SD = 1Georgiana
D
46

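The problem seems to be that the base scale() function works on matrices: given a vector, it returns a one-column matrix carrying "scaled:center" and "scaled:scale" attributes (visible in your str() output), and mutate() does not combine those cleanly across groups. Try writing your own function that returns a plain vector.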

scale_this <- function(x){
  (x - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)
}

Then this works:

library("dplyr")

# reproducible sample data
set.seed(123)
n = 1000
df <- data.frame(stud_ID = sample(LETTERS, size=n, replace=TRUE),
                 behavioral_scale = runif(n, 0, 10),
                 cognitive_scale = runif(n, 1, 20),
                 affective_scale = runif(n, 0, 1) )
scaled_data <- 
  df %>%
  group_by(stud_ID) %>%
  mutate(behavioral_scale_ind = scale_this(behavioral_scale),
         cognitive_scale_ind = scale_this(cognitive_scale),
         affective_scale_ind = scale_this(affective_scale))
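
One way to confirm that the scaling really happened within each student is to summarise the new column by group and check that every group has mean 0 and SD 1. This is only a minimal sketch on a small made-up data frame (`toy` and `check` are hypothetical names, not part of the original data):

```r
library(dplyr)

# Same helper as above: returns a plain numeric vector, not a matrix
scale_this <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Hypothetical toy data: 3 students, 5 observations each
set.seed(123)
toy <- data.frame(stud_ID = rep(c("A", "B", "C"), each = 5),
                  behavioral_scale = runif(15, 0, 10))

# Scale within student, then summarise the scaled column per student
check <- toy %>%
  group_by(stud_ID) %>%
  mutate(behavioral_scale_ind = scale_this(behavioral_scale)) %>%
  summarise(group_mean = mean(behavioral_scale_ind, na.rm = TRUE),
            group_sd   = sd(behavioral_scale_ind, na.rm = TRUE))

print(check)
```

Each row of `check` should show a mean of (numerically) zero and an SD of one. If the means differ from zero by group, the scaling was applied across all students instead.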

Or, if you're open to a data.table solution:

library("data.table")

setDT(df)

cols_to_scale <- c("behavioral_scale","cognitive_scale","affective_scale")

df[, lapply(.SD, scale_this), .SDcols = cols_to_scale, keyby = factor(stud_ID)] 
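
Note that the call above returns only the grouping column and the scaled columns. If you would rather keep the original columns and add the scaled ones next to them, a possible variant uses `:=` to assign by reference. This is a sketch on made-up data; `dt` and `new_cols` are hypothetical names:

```r
library(data.table)

scale_this <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Hypothetical toy data: 2 students, 4 observations each
dt <- data.table(stud_ID = rep(c("A", "B"), each = 4),
                 behavioral_scale = c(1, 2, 3, 4, 10, 20, 30, 40),
                 cognitive_scale  = c(2, 4, 6, 8, 1, 3, 5, 7))

cols_to_scale <- c("behavioral_scale", "cognitive_scale")
new_cols <- paste0(cols_to_scale, "_ind")

# Add the scaled columns in place, computed separately for each student
dt[, (new_cols) := lapply(.SD, scale_this), .SDcols = cols_to_scale, by = stud_ID]

print(dt)
```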
Detour answered 3/3, 2016 at 15:30 Comment(0)
O
18

This was a known problem in dplyr; a fix has been merged into the development version, which you can install via

# install.packages("devtools")
devtools::install_github("hadley/dplyr")

In the stable version, the following should work, too:

scale_this <- function(x) as.vector(scale(x))
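
To see why the as.vector() wrapper helps, it may be useful to look at what scale() returns for a plain vector. The sketch below uses only base R; the vector `x` is made up for illustration:

```r
# Hypothetical input vector with a missing value
x <- c(3.5, 4, 3.5, 3, NA, 2)

m <- scale(x)
is.matrix(m)          # TRUE: scale() returns a matrix even for vector input
dim(m)                # 6 rows, 1 column
names(attributes(m))  # includes "dim", "scaled:center", "scaled:scale"

# as.vector() drops the dim and the scaled:* attributes
v <- as.vector(m)
is.matrix(v)          # FALSE
attributes(v)         # NULL

# The values match the hand-written formula; NA stays NA
manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
all.equal(v, manual)
```

So the one-line wrapper and the hand-written helper should give the same numbers; the wrapper simply strips the matrix structure that scale() adds.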
One answered 24/9, 2016 at 2:13 Comment(5)
Hi krlmlr, given the age of this answer, would it be fair to say that this behaviour has been reverted in a subsequent version change?Drusie
Yes, I should have added a version number to the answer back then. These days, in dplyr >= 1.0.2, matrices can be used without problems in columns, so I suspect the original problem no longer occurs?One
Ah nevermind. It works, but the column names then show up with [,1] which can be a little confusing.Drusie
The column1[,1] does not vanish even after piping set_colnamesNole
To remove [,1], it seems you can either: df %>% mutate(scaled = as.vector(scale(value))) or df %>% mutate(scaled = scale(value)[,1])Coumarin
D
12
df <- df %>% mutate(across(is.numeric, ~ as.numeric(scale(.))))
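
As written, this line scales every numeric column across all rows. For the per-student scaling asked about in the question, a group_by() step would be needed first. The sketch below assumes dplyr >= 1.0.0, wraps the predicate in where() (bare predicates inside across() are deprecated in recent tidyselect versions), and uses the `.names` argument to add new columns instead of overwriting the originals. The data frame `toy` is made up for illustration:

```r
library(dplyr)

# Hypothetical toy data: 2 students, 4 observations each
toy <- data.frame(stud_ID = rep(c("A", "B"), each = 4),
                  behavioral_scale = c(1, 2, 3, 4, 10, 20, 30, 40),
                  cognitive_scale  = c(2, 4, 6, 8, 1, 3, 5, 7))

scaled <- toy %>%
  group_by(stud_ID) %>%
  # as.numeric() drops the matrix structure that scale() returns
  mutate(across(where(is.numeric),
                ~ as.numeric(scale(.x)),
                .names = "{.col}_ind")) %>%
  ungroup()

print(scaled)
```

The character column stud_ID is left alone because it is not numeric, and grouping columns are skipped by across() in any case.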
Dionne answered 9/10, 2021 at 0:59 Comment(5)
While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. You can find more information on how to write good answers in the help center: stackoverflow.com/help/how-to-answer . Good luck 🙂Carpenter
@novonimo's point is always important, but it's especially important here where there's already a well-established accepted answer that's been validated by the community with nearly 40 upvotes. Under what circumstances is your approach preferable over the accepted answer? Are you taking advantage of new capabilities or syntax?Frumenty
A simple and elegant solution! In my opinion, the best of all of the above. Brawo @SparklingWater!Solidary
As of 2024-04-30 this works but gives a warning that "use of bare predicate functions was deprecated in tidyselect 1.1.0. Please use wrap predicates in where() instead." Following that suggestion, you can use mutate(across(where(is.numeric), scale)).Bronchiole
I shortened the code in my previous comment too much; this should work better: mutate(across(where(is.numeric), ~ as.numeric(scale(.))))Bronchiole
