scale/normalize columns by group

Asked 15/11, 2014 at 19:55 Answered 30/10, 2021 at 12:30

I have a data frame that looks like this:

  Store Temperature Unemployment Sum_Sales
1     1       42.31        8.106   1643691
2     1       38.51        8.106   1641957
3     1       39.93        8.106   1611968
4     1       46.63        8.106   1409728
5     1       46.50        8.106   1554807
6     1       57.79        8.106   1439542

For each 'Store', I want to normalize/scale two columns ("Sum_sales" and "Temperature").

Desired output:

  Store Temperature Unemployment Sum_Sales
1     1       1.000        8.106   1.00000
2     1       0.000        8.106   0.94533
3     1       0.374        8.106   0.00000
4     2       0.012        8.106   0.00000
5     2       0.000        8.106   1.00000
6     2       1.000        8.106   0.20550

Here is the normalizing function that I created:

 normalit<-function(m){
   (m - min(m))/(max(m)-min(m))
 }

What I have tried:

df2 <- df %.%
  group_by('Store') %.%
  summarise(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales)))

Any suggestions/help would be greatly appreciated. Thanks.

Sapro answered 15/11, 2014 at 19:55 Comment(2)

You have an unemployment column in your result. Is this the same for each store> – Astri 15/11, 2014 at 20:2

yes.. sorry for not specifying – Sapro 15/11, 2014 at 22:46

The issue is that you are using the wrong dplyr verb. Summarize will create one result per group per variable. What you want is mutate. Mutate changes variables and returns a result of the same length as the original. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below two approaches using dplyr:

df %>%
    group_by(Store) %>%
    mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))

df %>%
    group_by(Store) %>%
    mutate_each(funs(normalit), Temperature, Sum_Sales)

Note: The Store variable is different between your data and desired result. I assumed that @jlhoward got the right data.

Dao answered 16/11, 2014 at 8:6 Comment(4)

I am getting an error "could not find function normalit", is there some package I need to install for this? @Dao – Schnabel 17/12, 2016 at 1:26

See OPs normalit function – Dao 14/3, 2017 at 12:45

Link to vignette is dead – Isolde 29/3, 2019 at 5:16

Link fixed. Thanks @AlexReynolds – Dao 29/3, 2019 at 21:43

Update with dplyr 1.0.0

The solution by @Vincent can be written in the new dplyr syntax that provides across:

df %>%
    group_by(Store) %>%
    mutate(across(c(Temperature, Sum_Sales), normalit)

If you don't remember the definition of normalit, you can use the scales package that provides the rescale function which normalizes with default parameters:

df %>%
    group_by(Store) %>%
    mutate(across(c(Temperature, Sum_Sales), scales::rescale)

Caritacaritas answered 30/10, 2021 at 12:30 Comment(0)

Here's a data.table solution. I changed your example a bit to have two type of store.

df <- read.table(header=T,text="Store Temperature Unemployment Sum_Sales
1     1       42.31        8.106   1643691
2     1       38.51        8.106   1641957
3     1       39.93        8.106   1611968
4     2       46.63        8.106   1409728
5     2       46.50        8.106   1554807
6     2       57.79        8.106   1439542")

library(data.table)
DT <- as.data.table(df)
DT[,list(Temperature=normalit(Temperature),Sum_Sales=normalit(Sum_Sales)),
    by=list(Store,Unemployment)]
#    Store Unemployment Temperature Sum_Sales
# 1:     1        8.106  1.00000000 1.0000000
# 2:     1        8.106  0.00000000 0.9453393
# 3:     1        8.106  0.37368421 0.0000000
# 4:     2        8.106  0.01151461 0.0000000
# 5:     2        8.106  0.00000000 1.0000000
# 6:     2        8.106  1.00000000 0.2055018

Note that your normalization will have problems if there is only 1 row for a stoer.

Astri answered 15/11, 2014 at 23:18 Comment(3)

Nice solution @jlhoward. I assume Unemployment is added to the by list so it doesn't get dropped. Just out of curiosity, how would you keep Unemployment if it were not constant in each store and still get the desired result? – Dao 16/11, 2014 at 8:14

OP wants 1 row per Store. If Unemployment is not constant for a given store, what value would you use? – Astri 16/11, 2014 at 17:24

I just wondered if there are three rows per store (as in OP's desired output), how would you keep Unemployment if not through by=list(). – Dao 17/11, 2014 at 6:44

Update with dplyr 1.0.0

Recommended topics

Hot tags