scale/normalize columns by group
Asked Answered
S

3

17

I have a data frame that looks like this:

  Store Temperature Unemployment Sum_Sales
1     1       42.31        8.106   1643691
2     1       38.51        8.106   1641957
3     1       39.93        8.106   1611968
4     1       46.63        8.106   1409728
5     1       46.50        8.106   1554807
6     1       57.79        8.106   1439542

For each 'Store', I want to normalize/scale two columns ("Sum_sales" and "Temperature").

Desired output:

  Store Temperature Unemployment Sum_Sales
1     1       1.000        8.106   1.00000
2     1       0.000        8.106   0.94533
3     1       0.374        8.106   0.00000
4     2       0.012        8.106   0.00000
5     2       0.000        8.106   1.00000
6     2       1.000        8.106   0.20550

Here is the normalizing function that I created:

 normalit<-function(m){
   (m - min(m))/(max(m)-min(m))
 }

What I have tried:

df2 <- df %.%
  group_by('Store') %.%
  summarise(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales)))

Any suggestions/help would be greatly appreciated. Thanks.

Sapro answered 15/11, 2014 at 19:55 Comment(2)
You have an unemployment column in your result. Is this the same for each store>Astri
yes.. sorry for not specifyingSapro
D
23

The issue is that you are using the wrong dplyr verb. Summarize will create one result per group per variable. What you want is mutate. Mutate changes variables and returns a result of the same length as the original. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below two approaches using dplyr:

df %>%
    group_by(Store) %>%
    mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))

df %>%
    group_by(Store) %>%
    mutate_each(funs(normalit), Temperature, Sum_Sales)

Note: The Store variable is different between your data and desired result. I assumed that @jlhoward got the right data.

Dao answered 16/11, 2014 at 8:6 Comment(4)
I am getting an error "could not find function normalit", is there some package I need to install for this? @DaoSchnabel
See OPs normalit functionDao
Link to vignette is deadIsolde
Link fixed. Thanks @AlexReynoldsDao
C
4

Update with dplyr 1.0.0

The solution by @Vincent can be written in the new dplyr syntax that provides across:

df %>%
    group_by(Store) %>%
    mutate(across(c(Temperature, Sum_Sales), normalit)

If you don't remember the definition of normalit, you can use the scales package that provides the rescale function which normalizes with default parameters:

df %>%
    group_by(Store) %>%
    mutate(across(c(Temperature, Sum_Sales), scales::rescale)
Caritacaritas answered 30/10, 2021 at 12:30 Comment(0)
A
2

Here's a data.table solution. I changed your example a bit to have two type of store.

df <- read.table(header=T,text="Store Temperature Unemployment Sum_Sales
1     1       42.31        8.106   1643691
2     1       38.51        8.106   1641957
3     1       39.93        8.106   1611968
4     2       46.63        8.106   1409728
5     2       46.50        8.106   1554807
6     2       57.79        8.106   1439542")

library(data.table)
DT <- as.data.table(df)
DT[,list(Temperature=normalit(Temperature),Sum_Sales=normalit(Sum_Sales)),
    by=list(Store,Unemployment)]
#    Store Unemployment Temperature Sum_Sales
# 1:     1        8.106  1.00000000 1.0000000
# 2:     1        8.106  0.00000000 0.9453393
# 3:     1        8.106  0.37368421 0.0000000
# 4:     2        8.106  0.01151461 0.0000000
# 5:     2        8.106  0.00000000 1.0000000
# 6:     2        8.106  1.00000000 0.2055018

Note that your normalization will have problems if there is only 1 row for a stoer.

Astri answered 15/11, 2014 at 23:18 Comment(3)
Nice solution @jlhoward. I assume Unemployment is added to the by list so it doesn't get dropped. Just out of curiosity, how would you keep Unemployment if it were not constant in each store and still get the desired result?Dao
OP wants 1 row per Store. If Unemployment is not constant for a given store, what value would you use?Astri
I just wondered if there are three rows per store (as in OP's desired output), how would you keep Unemployment if not through by=list().Dao

© 2022 - 2024 — McMap. All rights reserved.