Trying to use dplyr to group_by and apply scale()

Asked 3/3, 2016 at 15:3 Answered 9/10, 2021 at 0:59

I am trying to use dplyr to group_by the stud_ID variable in the following data frame, as in this SO question:

> str(df)
'data.frame':   4136 obs. of  4 variables:
 $ stud_ID         : chr  "ABB112292" "ABB112292" "ABB112292" "ABB112292" ...
 $ behavioral_scale: num  3.5 4 3.5 3 3.5 2 NA NA 1 2 ...
 $ cognitive_scale : num  3.5 3 3 3 3.5 2 NA NA 1 1 ...
 $ affective_scale : num  2.5 3.5 3 3 2.5 2 NA NA 1 1.5 ...

I tried the following to obtain scale scores by student (rather than scale scores for observations across all students):

scaled_data <-
  df %>%
  group_by(stud_ID) %>%
  mutate(behavioral_scale_ind = scale(behavioral_scale),
         cognitive_scale_ind = scale(cognitive_scale),
         affective_scale_ind = scale(affective_scale))

Here is the result:

> str(scaled_data)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 4136 obs. of  7 variables:
 $ stud_ID             : chr  "ABB112292" "ABB112292" "ABB112292" "ABB112292" ...
 $ behavioral_scale    : num  3.5 4 3.5 3 3.5 2 NA NA 1 2 ...
 $ cognitive_scale     : num  3.5 3 3 3 3.5 2 NA NA 1 1 ...
 $ affective_scale     : num  2.5 3.5 3 3 2.5 2 NA NA 1 1.5 ...
 $ behavioral_scale_ind: num [1:12, 1] 0.64 1.174 0.64 0.107 0.64 ...
  ..- attr(*, "scaled:center")= num 2.9
  ..- attr(*, "scaled:scale")= num 0.937
 $ cognitive_scale_ind : num [1:12, 1] 1.17 0.64 0.64 0.64 1.17 ...
  ..- attr(*, "scaled:center")= num 2.4
  ..- attr(*, "scaled:scale")= num 0.937
 $ affective_scale_ind : num [1:12, 1] 0 1.28 0.64 0.64 0 ...
  ..- attr(*, "scaled:center")= num 2.5
  ..- attr(*, "scaled:scale")= num 0.782

The three scaled variables (behavioral_scale_ind, cognitive_scale_ind, and affective_scale_ind) have only 12 observations, which is the number of observations for the first student, ABB112292.

What's going on here? How can I obtain scaled scores by individual?

Georgiana answered 3/3, 2016 at 15:3 Comment(3)

Have you looked into summarise() in dplyr ? – Benedict 3/3, 2016 at 15:8

I think you should mutate before you group, or you are going to center every student's score on him/herself – Detour 3/3, 2016 at 15:12

@C8H10N4O2, on him/herself, so each student's observations will have M = 0 and SD = 1 – Georgiana 3/3, 2016 at 15:23

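The problem seems to be in the base scale() function, which is written for matrices: even when given a plain vector, it returns a one-column matrix carrying "scaled:center" and "scaled:scale" attributes, as the str() output in the question shows. Try writing your own function that returns a plain vector.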

scale_this <- function(x){
  (x - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)
}

Then this works:

library("dplyr")

# reproducible sample data
set.seed(123)
n <- 1000
df <- data.frame(stud_ID = sample(LETTERS, size=n, replace=TRUE),
                 behavioral_scale = runif(n, 0, 10),
                 cognitive_scale = runif(n, 1, 20),
                 affective_scale = runif(n, 0, 1) )
scaled_data <- 
  df %>%
  group_by(stud_ID) %>%
  mutate(behavioral_scale_ind = scale_this(behavioral_scale),
         cognitive_scale_ind = scale_this(cognitive_scale),
         affective_scale_ind = scale_this(affective_scale))
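To confirm that the scaling really happened within each student, one option is to summarise the new column by group and check that every group has a mean of about 0 and an SD of about 1. This is only a minimal sketch on the same kind of simulated data; the object name check and the 1e-8 tolerance are my own choices, not part of the answer above.

```r
library(dplyr)

# same helper as above: returns a plain numeric vector
scale_this <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# simulated data (one scale only, to keep the check short)
set.seed(123)
n <- 1000
df <- data.frame(stud_ID = sample(LETTERS, size = n, replace = TRUE),
                 behavioral_scale = runif(n, 0, 10))

scaled_data <- df %>%
  group_by(stud_ID) %>%
  mutate(behavioral_scale_ind = scale_this(behavioral_scale)) %>%
  ungroup()

# per-student mean and SD of the scaled column
check <- scaled_data %>%
  group_by(stud_ID) %>%
  summarise(m = mean(behavioral_scale_ind, na.rm = TRUE),
            s = sd(behavioral_scale_ind, na.rm = TRUE))

all(abs(check$m) < 1e-8)      # group means are numerically zero
all(abs(check$s - 1) < 1e-8)  # group SDs are numerically one
```

Note that a student with a single non-missing observation has an undefined SD, so the scaled value for that student comes out as NA.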

Or, if you're open to a data.table solution:

library("data.table")

setDT(df)

cols_to_scale <- c("behavioral_scale","cognitive_scale","affective_scale")

df[, lapply(.SD, scale_this), .SDcols = cols_to_scale, keyby = factor(stud_ID)]
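The call above returns only the grouping column and the scaled columns, ordered by group. If you would rather keep the original columns and row order, a possible variant is to add the scaled columns by reference with :=. This is a sketch under the same simulated-data assumptions; the names dt and new_cols are mine.

```r
library(data.table)

scale_this <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# simulated data, built directly as a data.table
set.seed(123)
n <- 1000
dt <- data.table(stud_ID = sample(LETTERS, size = n, replace = TRUE),
                 behavioral_scale = runif(n, 0, 10),
                 cognitive_scale = runif(n, 1, 20),
                 affective_scale = runif(n, 0, 1))

cols_to_scale <- c("behavioral_scale", "cognitive_scale", "affective_scale")
new_cols <- paste0(cols_to_scale, "_ind")

# add the three *_ind columns in place, scaled within each stud_ID
dt[, (new_cols) := lapply(.SD, scale_this), .SDcols = cols_to_scale, by = stud_ID]
```

Because := modifies dt in place, there is no need to assign the result back.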

Detour answered 3/3, 2016 at 15:30 Comment(0)

This was a known problem in dplyr; a fix has been merged into the development version, which you can install via:

# install.packages("devtools")
devtools::install_github("hadley/dplyr")

In the stable version, the following should work, too:

scale_this <- function(x) as.vector(scale(x))
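To see why the as.vector() wrapper helps, here is a small base R sketch (the vector x is made up for illustration). scale() hands back a one-column matrix with extra attributes, and as.vector() strips those, leaving the same numbers as a plain vector.

```r
x <- c(3.5, 4, 3.5, 3, NA, 2)   # illustrative values, including a missing one

m <- scale(x)              # one-column matrix with attributes
v <- as.vector(scale(x))   # plain numeric vector

is.matrix(m)               # TRUE
dim(m)                     # 6 1
names(attributes(m))       # includes "scaled:center" and "scaled:scale"

is.matrix(v)               # FALSE
is.null(attributes(v))     # TRUE: dim and scaling attributes are gone

# the values match the hand-written version, and the NA stays NA
manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
all.equal(v, manual)
```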

One answered 24/9, 2016 at 2:13 Comment(5)

Hi krlmlr, given the age of this answer, would it be fair to say that this behaviour has been reverted in a subsequent version change? – Drusie 17/10, 2020 at 19:11

Yes, I should have added a version number to the answer back then. These days, in dplyr >= 1.0.2, matrices can be used without problems in columns, so I suspect the original problem no longer occurs? – One 18/10, 2020 at 4:22

Ah nevermind. It works, but the column names then show up with [,1] which can be a little confusing. – Drusie 18/10, 2020 at 6:10

The column1[,1] does not vanish even after piping set_colnames – Nole 25/3, 2021 at 9:57

To remove [,1], it seems you can use either df %>% mutate(scaled = as.vector(scale(value))) or df %>% mutate(scaled = scale(value)[,1]) – Coumarin 3/5, 2021 at 21:17

df <- df %>% mutate(across(is.numeric, ~ as.numeric(scale(.))))
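As written, that line scales each numeric column across all rows and overwrites it. For the per-student scaling the question asks about, a possible adaptation is to group first and write the results to new columns. This is a sketch assuming a recent dplyr (1.0.0 or later) and simulated data like that used earlier in the thread; the "_ind" suffix follows the question's naming.

```r
library(dplyr)

# simulated data with the same column names as in the question
set.seed(123)
n <- 1000
df <- data.frame(stud_ID = sample(LETTERS, size = n, replace = TRUE),
                 behavioral_scale = runif(n, 0, 10),
                 cognitive_scale = runif(n, 1, 20),
                 affective_scale = runif(n, 0, 1))

scaled_data <- df %>%
  group_by(stud_ID) %>%
  # where() avoids the bare-predicate deprecation warning;
  # as.numeric() drops the matrix dim that scale() adds;
  # .names keeps the originals and appends "_ind" to the new columns
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x)),
                .names = "{.col}_ind")) %>%
  ungroup()
```

Dropping the .names argument would overwrite the original columns instead, as the one-liner above does.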

Dionne answered 9/10, 2021 at 0:59 Comment(5)

While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value. You can find more information on how to write good answers in the help center: stackoverflow.com/help/how-to-answer . Good luck 🙂 – Carpenter 10/10, 2021 at 11:20

@novonimo's point is always important, but it's especially important here where there's already a well-established accepted answer that's been validated by the community with nearly 40 upvotes. Under what circumstances is your approach preferable over the accepted answer? Are you taking advantage of new capabilities or syntax? – Frumenty 11/10, 2021 at 2:17

A simple and elegant solution! In my opinion, the best of all of the above. Brawo @SparklingWater! – Solidary 10/1, 2022 at 18:40

As of 2024-04-30 this works but gives a warning that "use of bare predicate functions was deprecated in tidyselect 1.1.0. Please use wrap predicates in where() instead." Following that suggestion, you can use mutate(across(where(is.numeric), scale)). – Bronchiole 30/4 at 20:30

I shortened the code in my previous comment too much; this should work better: mutate(across(where(is.numeric), ~ as.numeric(scale(.)))) – Bronchiole 30/4 at 20:52
