Using SparkR, how to split a string column into 'n' multiple columns?
I'm working with SparkR 1.6 and I have a DataFrame of millions of rows. One of the df's columns, named "categories", contains strings with the following pattern:

      categories
1 cat1,cat2,cat3
2      cat1,cat2
3     cat3, cat4
4           cat5

I would like to split each string and create "n" new columns, where "n" is the number of possible categories (here n = 5, but in reality it could be more than 50).
Each new column will contain a boolean for the presence/absence of the category, such as:

   cat1  cat2  cat3  cat4  cat5
1  TRUE  TRUE  TRUE FALSE FALSE
2  TRUE  TRUE FALSE FALSE FALSE
3 FALSE FALSE  TRUE  TRUE FALSE
4 FALSE FALSE FALSE FALSE  TRUE

How can this be performed using the SparkR API only?

Thanks for your time.
Regards.

Naughty answered 10/3, 2016 at 14:26 Comment(0)
Let's start with imports and dummy data:

library(magrittr)  # for %>% and extract2; SparkR itself is assumed attached (sparkR shell)

df <- createDataFrame(sqlContext, data.frame(
  categories=c("cat1,cat2,cat3", "cat1,cat2", "cat3,cat4", "cat5")
))

Separate the strings:

separated <- selectExpr(df, "split(categories, ',') AS categories")
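For reference, `split` turns the string column into an `array<string>` column, which is what `explode` and `array_contains` below operate on. You can confirm the new type with `printSchema` (a quick check, assuming a running SparkR 1.6 session):

```r
printSchema(separated)
## root
##  |-- categories: array (nullable = true)
##  |    |-- element: string (containsNull = true)
```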

Get the distinct categories:

categories <- select(separated, explode(separated$categories)) %>% 
  distinct() %>% 
  collect() %>%
  extract2(1)

Build the list of column expressions:

exprs <- lapply(
  categories, function(x) 
  alias(array_contains(separated$categories, x), x)
)
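Each element of `exprs` is an aliased `array_contains` column, so for a fixed set of categories the same result could be written by hand with `selectExpr` (a sketch for just two of the five categories, assuming the same `separated` DataFrame):

```r
selectExpr(separated,
           "array_contains(categories, 'cat1') AS cat1",
           "array_contains(categories, 'cat2') AS cat2")
```

The `lapply` version simply generates one such expression per distinct category, so it scales to the 50+ categories mentioned in the question.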

Select and check the results:

select(separated, exprs) %>% head()
##    cat1  cat2  cat3  cat4  cat5
## 1  TRUE  TRUE  TRUE FALSE FALSE
## 2  TRUE  TRUE FALSE FALSE FALSE
## 3 FALSE FALSE  TRUE  TRUE FALSE
## 4 FALSE FALSE FALSE FALSE  TRUE
Kisner answered 10/3, 2016 at 18:59 Comment(1)
Thanks for the answer @zero323, but it's overkill! – Shreveport
This is a pure Spark solution that avoids SparkR::collect(). If the column of the given Spark DataFrame contains a fixed number of separators, here is my solution, under the following assumptions:

# separator = '::'
# number of separators = 3
# name of the respective column = col

First you have to create the schema of the output dataframe with the split columns:

AddFieldsToSchema = function(existingSchema, newFieldNames, newFieldTypes) {
  # This somewhat tortured syntax is necessary because the existingSchema
  # variable is actually a Java object under the hood
  existingNames = unlist(lapply(existingSchema$fields(), function(field) {
    field$name()
  }))
  existingTypes = unlist(lapply(existingSchema$fields(), function(field) {
    field$dataType.simpleString()
  }))
  
  combinedNames = c(existingNames, newFieldNames)
  combinedTypes = c(existingTypes, newFieldTypes)
  
  # Build the combined schema from structFields (SparkR has no
  # CreateSchema helper, so structType/structField are used directly)
  newFields = mapply(SparkR::structField, combinedNames, combinedTypes,
                     SIMPLIFY = FALSE)
  return(do.call(SparkR::structType, newFields))
}
num_separator = 3
sdf_schema = SparkR::schema(sdf) %>%
  AddFieldsToSchema(paste0('col_', seq(1, num_separator)),
                    rep('string', num_separator))

Then you have a splitting function for the given column that will be used in SparkR::dapply:

my_func = function(x) {
  # appends one new column per piece; the 3 must match num_separator above
  cbind(x, stringr::str_split_fixed(x$col, '::', 3))
}
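Locally, `stringr::str_split_fixed` returns a character matrix with one column per piece, which is why `cbind` appends exactly `num_separator` new columns to each partition. A small non-Spark illustration of that shape (assumes only that the stringr package is installed):

```r
library(stringr)

# two sample values of the 'col' column, split into 3 fixed pieces
pieces <- str_split_fixed(c("a::b::c", "d::e::f"), '::', 3)
pieces
##      [,1] [,2] [,3]
## [1,] "a"  "b"  "c"
## [2,] "d"  "e"  "f"
```

Inside `dapply`, `cbind(x, pieces)` attaches these pieces to the partition's data.frame, matching the three string columns added to the schema.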

sdf_split = sdf %>%
  SparkR::dapply(my_func, sdf_schema)

Finzer answered 11/9, 2022 at 8:1 Comment(0)