best way to transpose data.table
Asked Answered
P

6

61

[UPDATE: there is now a native transpose() function in data.table package]

I often need to transpose a data.table, every time it takes several lines of code and I am wondering if there's any better solution than mine.

if we take sample table

library(data.table)
mydata <- data.table(col0=c("row1","row2","row3"),
                     col1=c(11,21,31),
                     col2=c(12,22,32),
                     col3=c(13,23,33))

mydata
# col0 col1 col2 col3
# row1   11   12   13
# row2   21   22   23
# row3   31   32   33

and just transpose it with t(), it will be transposed to the matrix with conversion to character type, while applying data.table to such matrix will lose row.names:

t(mydata)
# [,1]   [,2]   [,3]  
# col0 "row1" "row2" "row3"
# col1 "11"   "21"   "31"  
# col2 "12"   "22"   "32"  
# col3 "13"   "23"   "33"  

data.table(t(mydata))
#   V1   V2   V3
# row1 row2 row3
#   11   21   31
#   12   22   32
#   13   23   33

so I had to write a function for this:

tdt <- function(inpdt){
  transposed <- t(inpdt[,-1,with=F]);
  colnames(transposed) <- inpdt[[1]];
  transposed <- data.table(transposed, keep.rownames=T);
  setnames(transposed, 1, names(inpdt)[1]);
  return(transposed);
}

 tdt(mydata)
# col0 row1 row2 row3
# col1   11   21   31
# col2   12   22   32
# col3   13   23   33

is there anything I could optimize here or do it in "nicer" way?

Potentilla answered 22/2, 2015 at 2:35 Comment(6)
I'm curious why you'd need to transpose a data.table? Transposing is a natural operation for numeric matrices but I have a hard time thinking of problems where it would be a good idea to transpose a by-definition non-uniform object like a data.table or a data.frame. Care to give an example?Pasteurize
well, most of my experimental data files are tables where main part is indeed numeric matrix, but accompanied by several non-numeric fields, so using data.table is quite logical for me. Most typical example would be the table where each column is a sample and each row is a probe; characteristics of probes and/or samples are recorded in the same table.Potentilla
can you fread with header = FALSE and then use transpose?Radon
@Pasteurize one example of where transposing a data.frame is useful: in single cell genomics two concepts of storing data are present. one with genes in rows and cells in columns and one vice versa. so if you want to use different software packages you would need to transpose..Northman
Given that there now is the native data.table function transpose, it might be worth re-considering what answer to accept.Stacked
thanks @gofvonx, I've added this info.Potentilla
B
62

Why not just melt and dcast the data.table?

require(data.table)

dcast(melt(mydata, id.vars = "col0"), variable ~ col0)
#    variable row1 row2 row3
# 1:     col1   11   21   31
# 2:     col2   12   22   32
# 3:     col3   13   23   33
Basically answered 22/2, 2015 at 3:24 Comment(4)
hmmm, it looks neat, thanks! (I will keep the question open for a while to see if there are other suggestions)Potentilla
@VasilyA, I think that the main question comes back to why you would be using a data.table for matrix-type data. Are you doing a lot of work with grouping and so on?Basically
yes, I use a lot of grouping and "subsetting", often need to select only a part of that big matrix.Potentilla
@buhtz, I suppose you know that you should use library(data.table) before dcast and melt? If that doesn't solve your problem, please share the version of "data.table" that you're using.Basically
A
42

The current docs show a builtin transpose method.

Specifically, you can do:

transpose(mydata, keep.names = "col", make.names = "col0")
##     col row1 row2 row3
## 1: col1   11   21   31
## 2: col2   12   22   32
## 3: col3   13   23   33
About answered 31/10, 2016 at 10:51 Comment(3)
This should be the way to go. But at the moment it doesn't keep the column names as rownames. There is a issue opened github.com/Rdatatable/data.table/issues/1886 to ask for it. I guess if we vote there it will be implemented sooner.Ildaile
Is there a way to keep numeric as they are instead of converting to characters?Derian
In case anybody else had trouble with it: I had to call it explicitly as data.table::transpose() to not get purrr's transpose().Northman
M
21

Here's an alternative solution that only uses data.table and that is closer to the original idea of using t to transpose.

mydata[, data.table(t(.SD), keep.rownames=TRUE), .SDcols=-"col0"]
##      rn V1 V2 V3
## 1: col1 11 21 31
## 2: col2 12 22 32
## 3: col3 13 23 33

If keeping the rownames is important, setnames can be used. Admittedly this becomes a bit clumsy and probably the recast solution is preferable.

setnames(mydata[, data.table(t(.SD), keep.rownames=TRUE), .SDcols=-"col0"], 
         mydata[, c('rn', col0)])[]
##      rn row1 row2 row3
## 1: col1   11   21   31
## 2: col2   12   22   32
## 3: col3   13   23   33
Motherwell answered 23/2, 2015 at 13:58 Comment(2)
This approach is certainly easier to understand. By the way, release notes for v1.9.5 of data.table suggest that reshape2 will soon not be required for melt and dcast. github.com/Rdatatable/…Udella
thanks, @shadow! I like that we only use data.table, but I should note that column names (i.e. original row names - "row1","row2","row3") - are lost here.Potentilla
P
8
df <- as.data.frame(t(mydata))

is what I tried and df is a data.frame and the column names on mydata are now row names on df

Pisarik answered 20/2, 2019 at 23:21 Comment(0)
T
2

Here's a solution that uses a wrapper to tidy up the output of the data.table transpose function.

With really large data sets this seems to be more efficient than the dcast/melt approach (I tested it on a 8000 row x 29000 column data set, the below function works in about 3 minutes but dcast/melt crashed R):

# Function to clean up output of data.table transpose:

transposedt <- function(dt, varlabel) {
  require(data.table)
  dtrows = names(dt)
  dtcols = as.list(c(dt[,1]))
  dtt = transpose(dt)
  dtt[, eval(varlabel) := dtrows]
  setnames(dtt, old = names(dtt), new = c(dtcols[[1]], eval(varlabel)))
  dtt = dtt[-1,]
  setcolorder(dtt, c(eval(varlabel), names(dtt)[1:(ncol(dtt) - 1)]))
  return(dtt)
}

# Some dummy data 
mydt <- data.table(col0 = c(paste0("row", seq_along(1:100))), 
                   col01 = c(sample(seq_along(1:100), 100)),
                   col02 = c(sample(seq_along(1:100), 100)),
                   col03 = c(sample(seq_along(1:100), 100)),
                   col04 = c(sample(seq_along(1:100), 100)),
                   col05 = c(sample(seq_along(1:100), 100)),
                   col06 = c(sample(seq_along(1:100), 100)),
                   col07 = c(sample(seq_along(1:100), 100)),
                   col08 = c(sample(seq_along(1:100), 100)),
                   col09 = c(sample(seq_along(1:100), 100)),
                   col10 = c(sample(seq_along(1:100), 100)))


# Apply the function:
mydtt <- transposedt(mydt, "myvariables")

# View the results:
> mydtt[,1:10]
    myvariables row1 row2 row3 row4 row5 row6 row7 row8 row9
 1:       col01   58   53   14   96   51   30   26   15   68
 2:       col02    6   72   46   62   69    9   63   32   78
 3:       col03   21   36   94   41   54   74   82   64   15
 4:       col04   68   41   66   30   31   78   51   67   26
 5:       col05   49   30   52   78   73   71    5   66   44
 6:       col06   89   35   79   67    6   88   62   97   73
 7:       col07   66   15   27   29   58   40   35   82   57
 8:       col08   55   47   83   30   23   65   48   56   87
 9:       col09   41   10   21   33   55   81   94   25   34
10:       col10   35   17   41   44   21   66   69   61   46

What is also useful is that columns (ex rows) occur in their original order and you can name the variables column something meaningful.

Taco answered 28/4, 2017 at 18:39 Comment(0)
T
-1

The tdt function which I provide below should be faster

tdt <- function(DT, transpose.col, ...) {
# The transpose function is efficient, but lacks the keeping of row and colnames
new.row.names <- colnames(DT)
new.row.names <- new.row.names[!new.row.names %in% transpose.col]
new.col.names <- DT[, transpose.col, with = F]
DT <- DT[, !colnames(DT) %in% transpose.col, with = F]
DT <- transpose(DT, ...)
colnames(DT) <- unlist(new.col.names)
DT$var <- new.row.names
# change order of DT after transposing 
setcolorder(DT, c("var", setdiff(names(DT), "var")))
colnames(DT)[1] <- transpose.col
return(DT)
}

library(microbenchmark); library(microbenchmarkCore)
DT <- data.table(x=1:1000, y=paste("name", 1:1000, sep = "_"), z = paste("test", 1:1000, sep = "."))

rbind(microbenchmark(tdt(DT, "y")), 
microbenchmark(dcast(melt(DT, id.vars = "y"), variable ~ y)),
microbenchmark(DT[, data.table(t(.SD), keep.rownames=TRUE), .SDcols=-"y"]))

Unit: milliseconds
expr       min        lq      mean    median        uq        max neval cld

tdt(DT, "y")  3.463842  3.719341  4.308158  3.911599  4.576477  20.406940   100  a 

dcast(melt(DT, id.vars = "y"), variable ~ y)  5.146119  5.496761  5.826647  5.580796  5.870584   9.536541   100  a 

DT[, data.table(t(.SD), keep.rownames = TRUE), .SDcols = -"y"] 29.975567 34.554989 40.807036 36.724430 39.102396 104.242218   100   b

d <- tdt(DT, "y") 

d[1:2, 1:11]
y name_1 name_2 name_3 name_4 name_5 name_6 name_7 name_8 name_9 name_10
1: x      1      2      3      4      5      6      7      8      9      10
2: z test.1 test.2 test.3 test.4 test.5 test.6 test.7 test.8 test.9 test.10

DT[1:10, 1:3]
x       y       z
1:  1  name_1  test.1
2:  2  name_2  test.2
3:  3  name_3  test.3
4:  4  name_4  test.4
5:  5  name_5  test.5
6:  6  name_6  test.6
7:  7  name_7  test.7
8:  8  name_8  test.8
9:  9  name_9  test.9
10: 10 name_10 test.10

class(d)
[1] "data.table" "data.frame"
Truelove answered 8/2, 2018 at 15:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.