How to prevent data.table to force numeric variables into character variables without manually specifying these?

Asked 22/7, 2015 at 16:8 Answered 22/7, 2015 at 16:14

Consider the following dataset:

dt <- structure(list(lllocatie = structure(c(1L, 6L, 2L, 4L, 3L), .Label = c("Assen", "Oosterwijtwerd", "Startenhuizen", "t-Zandt", "Tjuchem", "Winneweer"), class = "factor"), 
                 lat = c(52.992, 53.32, 53.336, 53.363, 53.368), 
                 lon = c(6.548, 6.74, 6.808, 6.765, 6.675), 
                 mag.cat = c(3L, 2L, 1L, 2L, 2L), 
                 places = structure(c(2L, 4L, 5L, 6L, 3L), .Label = c("", "Amen,Assen,Deurze,Ekehaar,Eleveld,Geelbroek,Taarlo,Ubbena", "Eppenhuizen,Garsthuizen,Huizinge,Kantens,Middelstum,Oldenzijl,Rottum,Startenhuizen,Toornwerd,Westeremden,Zandeweer", "Loppersum,Winneweer", "Oosterwijtwerd", "t-Zandt,Zeerijp"), class = "factor")),
            .Names = c("lllocatie", "lat", "lon", "mag.cat", "places"), 
            class = c("data.table", "data.frame"), 
            row.names = c(NA, -5L))

When I want to split the strings in the last column into separate rows, I use (with data.table version 1.9.5+):

dt.new <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by=list(lllocatie,lat,lon,mag.cat)]

However, when I use:

dt.new2 <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by=lllocatie]

I get the the same result except that all columns are forced into character variables. The problem is that for small datasets it is not a big problem to specify the variables that do not have to split in the by argument, but for datasets with many columns/variables it is. I know it is possible to do this with the splitstackshape package (as is mentioned by @ColonelBeauvel in his answer), but I'm looking for a data.table solution as i want to chain more operations to this.

How can I prevent that without manually specifying the variables that do not have to be split in the by argument?

Sodom answered 22/7, 2015 at 16:8 Comment(3)

Use type.convert=TRUE argument for tstrsplit(). – Immure 22/7, 2015 at 16:15

If it's an issue of specifying a lot of variables, can you use something like setdiff(names(dt),badcol) to construct your by and .SD? – Scuff 22/7, 2015 at 16:18

@Immure that worked (and it sounds like an answer :-) ) – Sodom 22/7, 2015 at 16:30

Two solutions with data.table:

1: Use the type.convert=TRUE argument inside tstrsplit() as proposed by @Arun:

dt.new1 <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE, type.convert=TRUE))), by=lllocatie]

2: Use setdiff(names(dt),"places") in the by argument as proposed by @Frank:

dt.new2 <- dt[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by=setdiff(names(dt),"places")]

Both approaches give the same result:

> identical(dt.new1,dt.new2)
[1] TRUE

The advantage of the second solution is that when you have more thanone columns with string values, only the one you specify in setdiff(names(dt),"places") is being split (supposing you want only that specific one, in this case places, to split). The splitstackshape package also offers this advantage.

Sodom answered 22/7, 2015 at 16:8 Comment(0)

It's exactly a job for cSplit from splitstackshape package:

library(splitstackshape)

cSplit(dt, 'places', ',')

Phonologist answered 22/7, 2015 at 16:14 Comment(4)

Thanx, I know I can do that with splitstackshape (but forgot to mention that in the question), but wanted a data.table solution as want to chain more stuff to that ( a +1 though). – Sodom 22/7, 2015 at 16:23

@Sodom the splitstackshape is a data.table based package. But ofcourse Arun provided the correct answer. I'm surprised you didn't know that though, as we discussed this over a week when I eventually filed a bug report regarding type.convert and it eventually will be fixed in the next release of R :) – Basketball 22/7, 2015 at 17:0

@Jaap, if you can import data.table you can import splitstackshape ! and you can chain operation after the cSplit! – Phonologist 22/7, 2015 at 17:20

Didn't know that splitstackshape is a data.table based package. Consequently it's logical that you can chain operations after cSplit. I consider this as a learning moment :-) – Sodom 22/7, 2015 at 18:24

Recommended topics

Hot tags