Say I have 40 continuous (DoubleType) variables that I've bucketed into quartiles using ft_quantile_discretizer. Identifying the quartiles on all of the variables is super fast, as the function supports processing multiple variables at once.
Next, I want to one-hot encode those bucketed variables, but there is currently no supported way to one-hot encode all of them with a single call. So I'm piping ft_string_indexer, ft_one_hot_encoder, and sdf_separate_column for each of the bucketed variables one at a time, by looping through the variables. This gets the job done. However, as the loop progresses, it slows down considerably. I suspect it's running out of memory, but I can't figure out how to program this so that it executes at the same speed across the variables.
If q_vars is a character vector of variable names (say 40 of them) for continuous variables, how can I code this up in a more Spark-efficient way?
for (v in q_vars) {
  data_sprk_q <- data_sprk_q %>%
    ft_string_indexer(v, paste0(v, "b"), "keep", string_order_type = "alphabetAsc") %>%
    ft_one_hot_encoder(paste0(v, "b"), paste0(v, "bc")) %>%
    sdf_separate_column(paste0(v, "bc"), into = q_vars_cat_list[[v]])
}
I also tried executing this as a single massive pipeline with all of the variables referenced, but that didn't solve the issue either, so I don't think it has anything to do with the loop itself.
test_text<-paste0("data_sprk_q<-data_sprk_q %>% ", paste0("ft_string_indexer('",q_vars,"',paste0('",q_vars,"','b'),'keep',string_order_type = 'alphabetAsc') %>% ft_one_hot_encoder(paste0('",q_vars,"','b'),paste0('",q_vars,"','bc')) %>% sdf_separate_column(paste0('",q_vars,"','bc'),into=",q_vars_cat_list,")",collapse=" %>% "))
eval(parse(text=test_text))
Any help would be appreciated.
sdf_separate_column? – Maniac