cSplit library(splitstackshape) is always dropping the column
Asked Answered
H

2

8

I was searching for a way to split the column content by a separator and converting a table into a long format. I found cSplit from the splitstackshape package and it is almost doing what I was looking for.

Problem is now with the drop option. I expected my split column to be copied in a way, but this does not happen. Am I doing it wrong? Somebody experienced the problem?

I am not sure if I do something wrong, but the drop = FALSE option is not working in my case.

Here is an example:

library(splitstackshape)
jnk <- data.table(a = '1,2,3,4,5', b = 5)
jnk
#            a b
# 1: 1,2,3,4,5 5

cSplit(jnk, 'a', ',', 'long', drop = FALSE)
#    a b
# 1: 1 5
# 2: 2 5
# 3: 3 5
# 4: 4 5
# 5: 5 5

What I expected was something like this:

cSplit(jnk, 'a', ',', 'long', drop = FALSE)
#    a b    a.orig
# 1: 1 5 1,2,3,4,5
# 2: 2 5 1,2,3,4,5
# 3: 3 5 1,2,3,4,5
# 4: 4 5 1,2,3,4,5
# 5: 5 5 1,2,3,4,5

I am using version 1.4.2

Hubris answered 13/5, 2015 at 6:45 Comment(1)
This does for sure not work if there are more lines in the df, since the jnk[['a']] needs to have the same length or it gets appended each time... So it is not working with jnk <- data.table(a=c('1,2,3,4,5','1,2,3','2,3'),b=c(5,4,3))Hubris
E
10

The "long" format modifies the column in place using list(unlist(...)) within "data.table", assigned with :=. Hence, if drop were used, you would be splitting the column and then removing it!

I'll try to make it explicit in the documentation that drop is only for the wide format, or add a message if a user attempts to use drop in the long format. Feel free to file a FR or submit a PR.

The workaround would be to assign another column (say, "a_orig") and then do the splitting:

jnk <- data.table(a=c('1,2,3,4,5','1,2,3','2,3'),b=c(5,4,3))
cSplit(jnk[, a_orig := a], "a", ",", "long")
#     a b    a_orig
#  1: 1 5 1,2,3,4,5
#  2: 2 5 1,2,3,4,5
#  3: 3 5 1,2,3,4,5
#  4: 4 5 1,2,3,4,5
#  5: 5 5 1,2,3,4,5
#  6: 1 4     1,2,3
#  7: 2 4     1,2,3
#  8: 3 4     1,2,3
#  9: 2 3       2,3
# 10: 3 3       2,3

I haven't tested extensively, but a possible fix could be:

cSplit2 <- function(indt, splitCols, sep = ",", direction = "wide", 
                    fixed = TRUE, drop = TRUE, stripWhite = TRUE, 
                    makeEqual = NULL, type.convert = TRUE) {
  if (direction == "long" & !drop) {
    indt <- as.data.table(indt)
    indt[, `:=`(eval(paste(splitCols, "orig", sep = "_")),
                lapply(splitCols, function(x) indt[[x]]))]
  }
  cSplit(indt, splitCols, sep, direction, fixed, drop, stripWhite, 
         makeEqual, type.convert)
}

The basic idea is to only change the input dataset if direction == "wide" and drop = FALSE. This is similar to the idea that you had, but can possibly be the solution integrated into the actual package, somewhere around line 94. In this case, only the indt[, `:=`(eval(paste(splitCols, "orig", sep = "_")), lapply(splitCols, function(x) indt[[x]]))] part should be necessary.

Execution answered 13/5, 2015 at 7:0 Comment(0)
H
0

Thanks for the feedback, I wrote a little function as a workaround. I had to change the data.table into data.frame to get it to work properly. In case of the data.table I need to set an additional parameter but it crashes with data.frame. In my case I need most of the time data.frame so I optimised it for it.

library(splitstackshape)
jnk <- data.frame(a = c('1,2,3,4,5','1,2,3','2,3'),
                  b = c('a,b,c,d,e','a,b,c','a,b'),
                  c = c(5,4,3))
jnk

myCSplit <- function(data_set, splitCols, sep = ',', direction = 'long', drop = TRUE, ...) {
  if(direction == 'long' & !drop) {
    orig_names <- sub('$', '_orig', splitCols)
    df <- as.data.frame(data_set[,splitCols])
    names(df) <- orig_names
    df2 <- cbind(data_set, df)
    return(cSplit(df2, splitCols, sep, 'long'))
  } else {
    return(cSplit(data_set, splitCols, sep, direction, drop = drop,...))
  }
}
myCSplit(jnk, 'a', ',')
myCSplit(jnk, 'a', ',', drop = FALSE)
myCSplit(jnk, 'a', ',', 'wide')
myCSplit(jnk, 'a', ',', 'wide', drop = FALSE)
myCSplit(jnk, c('a','b'), ',', 'long', drop = FALSE)
Hubris answered 13/5, 2015 at 7:29 Comment(2)
Hey @AnandaMahto maybe you could add this feature to the next release... In my case it would be very useful.Hubris
As far as I can tell, it should be possible for me to make the fix with just one line. I'll have to do some tests to see though.Execution

© 2022 - 2024 — McMap. All rights reserved.