R SparkR - equivalent to melt function
Asked Answered
S

1

2

Is there a function similar to melt in SparkR library?

Transform data with 1 row and 50 columns to 50 rows and 3 columns?

Superabundant answered 12/10, 2018 at 15:19 Comment(2)
@sramalingam24 That's not even remotely true. In general utilities designed to work with data.frame are not compatible with SparkDatafarme.Corruption
@sramalingam24 Well, if you do that, data will be no longer distributed and likely crash your program (if using SparkR makes any sense in the first place). In general there is no efficient and scalable way to use arbitrary R function with SparkR data structures.Corruption
C
6

There is no built-in function that provides a similar functionality in SparkR. You can built your own with explode

library(magrittr)

df <- createDataFrame(data.frame(
  A = c('a', 'b', 'c'),
  B = c(1, 3, 5),
  C = c(2, 4, 6)
))

melt <- function(df, id.vars, measure.vars, 
                 variable.name = "key", value.name = "value") {

   measure.vars.exploded <- purrr::map(
       measure.vars, function(c) list(lit(c), column(c))) %>% 
     purrr::flatten() %>% 
     (function(x) do.call(create_map, x)) %>% 
     explode()
   id.vars <- id.vars %>% purrr::map(column)

   do.call(select, c(df, id.vars, measure.vars.exploded)) %>%
     withColumnRenamed("key", variable.name) %>%
     withColumnRenamed("value", value.name)
}

melt(df, c("A"), c("B", "C")) %>% head()
  A key value                                                                   
1 a   B     1
2 a   C     2
3 b   B     3
4 b   C     4
5 c   B     5
6 c   C     6

or use SQL with Hive's stack UDF:

stack <- function(df, id.vars, measure.vars, 
                  variable.name = "key", value.name = "value") { 
  measure.vars.exploded <- glue::glue('"{measure.vars}", `{measure.vars}`') %>%  
    glue::glue_collapse(" , ") %>%
    (function(x) glue::glue(
      "stack({length(measure.vars)}, {x}) as ({variable.name}, {value.name})"
    )) %>%
    as.character()
    do.call(selectExpr, c(df, id.vars, measure.vars.exploded))
}

stack(df, c("A"), c("B", "C")) %>% head()
  A key value
1 a   B     1
2 a   C     2
3 b   B     3
4 b   C     4
5 c   B     5
6 c   C     6

Related questions:

Corruption answered 19/10, 2018 at 17:25 Comment(1)
very elegant solution. thanks! I think it can be further simplified by replacing id.vars <- id.vars %>% purrr::map(column) do.call(select, c(df, id.vars, measure.vars.exploded)) with df %>% select(c(id.vars, measure.vars.exploded)) at least it worked on my version of RSpark (2.4)Platinize

© 2022 - 2024 — McMap. All rights reserved.