Is there an alternative to reshape2::melt for multidimensional arrays in base R or tidyverse?

Asked 25/8, 2022 at 17:4 Answered 2/6, 2023 at 8:9

Suppose I have a 3-dimensional array g with dimensions [x,y,z]. reshape2::melt(g) produces a data frame with one column per dimension giving the x, y and z indices, plus a value column holding the corresponding entry of the original array.

Given that reshape2 is superseded, is there a "one function" alternative to the functionality of reshape2::melt in base R or a more actively supported tidyverse package that I'm missing?

reshape2 recommends that people use tidyr instead, but I can't find a tidyr solution for multi-dimensional arrays. There is cubelyr, but it doesn't seem very active these days either.

I can write a custom solution; I'm just looking for something stable with the easy functionality of reshape2::melt for this kind of problem.

library(reshape2)

g_as_array <- array(rnorm(27), dim = c(3,3,3)) # create a 3D array (27 cells, so draw 27 values)

g_as_data_frame <- reshape2::melt(g_as_array) # melt down to "tidy" format

head(g_as_data_frame)
#>   Var1 Var2 Var3      value
#> 1    1    1    1  1.4092362
#> 2    2    1    1 -2.1606972
#> 3    3    1    1  0.4334404
#> 4    1    2    1  0.2390544
#> 5    2    2    1 -0.9673617
#> 6    3    2    1  0.5668378

Created on 2022-08-25 by the reprex package (v2.0.1)

Flyfish asked 25/8, 2022 at 17:4 Comment(6)

Base approach and data.table answers here: #63311905 as.data.frame(ftable(g_as_array)) or data.table::as.data.table(g_as_array) – Caldwell 25/8, 2022 at 17:10

That's great, thank you. For the base approach would just need to convert factor labels back to numeric indices then e.g. as.data.frame(ftable(g_as_array)) %>% dplyr::mutate(dplyr::across(dplyr::starts_with("Var"), as.numeric)) – Flyfish 26/8, 2022 at 17:2

pivot_longer and pivot_wider from tidyr seem to be the main alternatives these days. – Took 15/9, 2022 at 19:59

True in general but pivot_longer and pivot_wider only work for two-dimensional table-like data, not multi-dimensional arrays, unless I'm missing some functionality there? – Flyfish 20/9, 2022 at 21:36

An ideal answer would preserve dimnames if they existed – Brentbrenton 31/5, 2023 at 20:46

as.data.frame.table(g_as_array) will give the results but using LETTERS instead of numbers – Adrianople 1/6, 2023 at 19:30

a <- array(1:27, dim = c(3,3,3))

library(reshape2)
DF1 <- melt(a)

DF2 <- data.frame(
  expand.grid(lapply(dim(a), seq_len)),
  value = as.vector(a)
)

identical(DF1, DF2)
#[1] TRUE

If the array has dimension names:

a <- array(letters[1:27], dim = c(3, 3, 3), dimnames = list(letters[1:3],
                                                            letters[4:6],
                                                            letters[7:9]))

library(reshape2)
DF1 <- melt(a)
    
DF2 <- data.frame(
  expand.grid(dimnames(a)),
  value = as.vector(a)
)

identical(DF1, DF2)
#[1] TRUE

If not all dimensions have names, you would need to fill in the missing names first, e.g.:

Map(\(x, y) if (is.null(x)) seq_len(y) else x , dimnames(a), dim(a))
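
To make the last step concrete, here is a minimal sketch with a made-up array in which only the second dimension is named. The array and the names labels and DF are illustrative assumptions, and function(x, n) is used in place of \(x, y) only so that it also runs on R versions before 4.1:

```r
# Illustrative array: only the second dimension carries names.
a <- array(1:12, dim = c(2, 3, 2),
           dimnames = list(NULL, c("u", "v", "w"), NULL))

# Replace each missing dimnames entry with the integer index 1..n.
labels <- Map(function(x, n) if (is.null(x)) seq_len(n) else x,
              dimnames(a), dim(a))

# Unnamed dimensions become integer columns, the named one a factor.
DF <- data.frame(expand.grid(labels), value = as.vector(a))
head(DF, 3)
```

Dimensions without names end up as integer index columns, while the named dimension keeps its labels.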

Punctate answered 1/6, 2023 at 8:46 Comment(3)

Nice solution! +1! I think you can use expand.grid(lapply(dim(g),seq.int)) without any do.call, i.e., cbind(expand.grid(lapply(dim(g), seq.int)), value = c(g)) – Taxiway 1/6, 2023 at 9:1

Thanks, I always forget that. Luckily, the overhead from do.call is minimal. – Punctate 1/6, 2023 at 9:3

And I remember a BDR fortune quote regarding misuse of c to strip attributes. – Punctate 1/6, 2023 at 9:4

An option would be to use arrayInd.

A <- array(1:8, c(2,2,2))

data.frame(arrayInd(seq_along(A), dim(A)), value = as.vector(A))
#  X1 X2 X3 value
#1  1  1  1     1
#2  2  1  1     2
#3  1  2  1     3
#4  2  2  1     4
#5  1  1  2     5
#6  2  1  2     6
#7  1  2  2     7
#8  2  2  2     8

Or quite similar to @ThomasIsCoding using which.

data.frame(which(array(TRUE, dim(A)), arr.ind = TRUE), value = as.vector(A))
#  dim1 dim2 dim3 value
#1    1    1    1     1
#2    2    1    1     2
#3    1    2    1     3
#4    2    2    1     4
#5    1    1    2     5
#6    2    1    2     6
#7    1    2    2     7
#8    2    2    2     8

If the array has dimension names:

A <- array(1:8, c(2,2,2), list(X=c("a","b"), Y=c("c","d"), Z=c("e","f")))

i <- arrayInd(seq_along(A), dim(A), dimnames(A), TRUE)
data.frame(mapply(`[`, dimnames(A), asplit(i, 2)), value = as.vector(A))
#  X Y Z value
#1 a c e     1
#2 b c e     2
#3 a d e     3
#4 b d e     4
#5 a c f     5
#6 b c f     6
#7 a d f     7
#8 b d f     8

But this can be achieved, as shown in the comments, with as.data.frame(ftable(A)) @Jon Spring or as.data.frame.table(A) @Onyambu.
If you look at the source of as.data.frame.table you see that it is using expand.grid.

as.data.frame.table(A)    #@Onyambu.
#as.data.frame(ftable(A)) #@Jon Spring
#  X Y Z Freq
#1 a c e    1
#2 b c e    2
#3 a d e    3
#4 b d e    4
#5 a c f    5
#6 b c f    6
#7 a d f    7
#8 b d f    8
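
If a melt-like result is the goal, the Freq column name and the factor columns can be adjusted with two arguments documented for as.data.frame.table, responseName and stringsAsFactors. A small sketch, assuming R 4.0 or later and reusing the named example array (df is just an illustrative name):

```r
# Named example array, as above.
A <- array(1:8, c(2, 2, 2),
           list(X = c("a", "b"), Y = c("c", "d"), Z = c("e", "f")))

# responseName renames the value column; stringsAsFactors = FALSE keeps
# the dimension labels as character vectors instead of factors.
df <- as.data.frame.table(A, responseName = "value", stringsAsFactors = FALSE)

str(df)
```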

But if numeric indices are wanted this can be used.

sapply(as.data.frame.table(A), unclass)
#     X Y Z Freq
#[1,] 1 1 1    1
#[2,] 2 1 1    2
#[3,] 1 2 1    3
#[4,] 2 2 1    4
#[5,] 1 1 2    5
#[6,] 2 1 2    6
#[7,] 1 2 2    7
#[8,] 2 2 2    8

Or more robust and giving a data.frame:

tt <- as.data.frame.table(A)
tt[-length(tt)] <- lapply(tt[-length(tt)], unclass)
tt
#  X Y Z Freq
#1 1 1 1    1
#2 2 1 1    2
#3 1 2 1    3
#4 2 2 1    4
#5 1 1 2    5
#6 2 1 2    6
#7 1 2 2    7
#8 2 2 2    8

#or
list2DF(lapply(as.data.frame.table(A), unclass))

Or a variant - Thanks to @Onyambu for the hint!

type.convert(as.data.frame.table(`dimnames<-`(A, NULL),
             base = list(as.character(seq_len(max(dim(A)))))), as.is = TRUE)
#  Var1 Var2 Var3 Freq
#1    1    1    1    1
#2    2    1    1    2
#3    1    2    1    3
#4    2    2    1    4
#5    1    1    2    5
#6    2    1    2    6
#7    1    2    2    7
#8    2    2    2    8

Another option is to calculate it "by hand" with %% and %/%.

cbind(1 + mapply(`%%`,
    Reduce(`%/%`, dim(A)[-length(dim(A))], 0:(length(A)-1), accumulate = TRUE),
    dim(A)), Value=as.vector(A))
#           Value
#[1,] 1 1 1     1
#[2,] 2 1 1     2
#[3,] 1 2 1     3
#[4,] 2 2 1     4
#[5,] 1 1 2     5
#[6,] 2 1 2     6
#[7,] 1 2 2     7
#[8,] 2 2 2     8

#Alternative
. <- 0:(length(A)-1)
cbind(1 +
    t(t(cbind(., outer(., cumprod(dim(A)[-length(dim(A))]), `%/%`))) %% dim(A)),
    Value=A)
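
Both variants rely on R storing arrays in column-major order: for a zero-based linear index k, the index along dimension j is (k %/% stride_j) %% d_j + 1, where stride_j is the product of the sizes of the preceding dimensions. A small sketch that checks this against arrayInd (the names k, stride and idx are only for illustration):

```r
A <- array(1:24, dim = c(2, 3, 4))
d <- dim(A)

k <- seq_along(A) - 1L                    # zero-based linear index
stride <- c(1, cumprod(d[-length(d)]))    # 1, 2, 6 for dim = c(2, 3, 4)

# Index along dimension j: (k %/% stride[j]) %% d[j] + 1
idx <- sapply(seq_along(d), function(j) (k %/% stride[j]) %% d[j] + 1)

# Compare with the built-in converter, ignoring storage mode and attributes.
all.equal(idx, arrayInd(seq_along(A), d), check.attributes = FALSE)
```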

Or using rep:

list2DF(c(Map(\(i, j, n) rep(rep(1:i, each=j), length.out=n),
    dim(A),
    c(1, cumprod(dim(A)[-length(dim(A))])),
    length(A)), Value=list(as.vector(A))))
#        Value
#1 1 1 1     1
#2 2 1 1     2
#3 1 2 1     3
#4 2 2 1     4
#5 1 1 2     5
#6 2 1 2     6
#7 1 2 2     7
#8 2 2 2     8

Or basically the same but keeping names and make use of auto repetition.

d <- setNames(dim(A), names(dimnames(A)))
do.call(data.frame, c(
  Map(\(i,j) rep(1:i, each=j), d, c(1, cumprod(d[-length(d)]))),
  Value=list(as.vector(A) ), fix.empty.names = FALSE) )
#  X Y Z Value
#1 1 1 1     1
#2 2 1 1     2
#3 1 2 1     3
#4 2 2 1     4
#5 1 1 2     5
#6 2 1 2     6
#7 1 2 2     7
#8 2 2 2     8
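
For repeated use, the rep idea could be wrapped in a small function. The following is only a sketch: array_to_long and its value_name argument are hypothetical names, not part of base R or any package, and it assumes R 4.0 or later for list2DF. It returns integer indices and falls back to Var1, Var2, ... for dimensions without a name.

```r
# Hypothetical helper built on the rep approach; returns integer indices.
array_to_long <- function(x, value_name = "value") {
  d <- dim(x)
  stopifnot(!is.null(d))
  # Block length for each dimension: product of the preceding dimensions.
  each <- c(1, cumprod(d[-length(d)]))
  idx <- Map(function(n, e) rep(rep(seq_len(n), each = e), length.out = length(x)),
             d, each)
  # Use the names of dimnames where present, otherwise Var1, Var2, ...
  nms <- names(dimnames(x))
  if (is.null(nms)) nms <- rep("", length(d))
  empty <- !nzchar(nms)
  nms[empty] <- paste0("Var", which(empty))
  names(idx) <- nms
  idx[[value_name]] <- as.vector(x)
  list2DF(idx)
}

A <- array(1:8, c(2, 2, 2), list(X = c("a", "b"), Y = c("c", "d"), Z = c("e", "f")))
res <- array_to_long(A)
res
```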

Benchmark

A <- array(0, c(1e5, 12, 30), list(T=NULL, Month=NULL, Year=NULL))

bench::mark(check=FALSE,
reshape2 = reshape2::melt(A),
expand.grid = {data.frame(  #@Roland
  expand.grid(lapply(dim(A), seq_len)),
  value = as.vector(A)) },
data.frame.table = {tt <- as.data.frame.table(A)
  tt[-length(tt)] <- lapply(tt[-length(tt)], unclass)
  tt},
rep = {d <- setNames(dim(A), names(dimnames(A)))
do.call(data.frame, c(
  Map(\(i,j) rep(1:i, each=j), d, c(1, cumprod(d[-length(d)]))),
  Value=list(as.vector(A) ), fix.empty.names = FALSE) )} )
#  expression            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 reshape2            812ms    812ms      1.23    1.21GB     1.23     1     1
#2 expand.grid         733ms    733ms      1.36    1.21GB     2.73     1     2
#3 data.frame.table    605ms    605ms      1.65    1.23GB     3.31     1     2
#4 rep                 293ms    331ms      3.02  691.99MB     1.51     2     1

In this case the variant using rep is the fastest and allocates the lowest amount of memory.

Genovera answered 1/6, 2023 at 12:35 Comment(4)

also data.frame(arrayInd(seq_along(A), dim(A)), value = c(A)) – Adrianople 2/6, 2023 at 4:21

Thanks! I had something similar, but with as.vector instead of c (see Roland's comment on his answer); see the edit history. But arrayInd "wants" a logical vector so I provide one. – Genovera 2/6, 2023 at 4:25

Why would you say arrayInd wants a logical vector? arrayInd takes an integer-valued vector, not a logical vector. Please read the help page for arrayInd. – Adrianople 2/6, 2023 at 5:18

sapply...unclass is quite risky, I would suggest type.convert(as.data.frame.table(A, base = list(as.character(1:20))), as.is = TRUE) – Adrianople 2/6, 2023 at 19:56

Here are some base R alternatives with the which trick that should work for general arrays, i.e., numeric and character:

which(1^is.na(g) > 0, arr.ind = TRUE)

cbind(as.data.frame(which(1^is.na(g) > 0, arr.ind = TRUE)), value = c(g))

which(TRUE | is.na(g), arr.ind = TRUE)

cbind(as.data.frame(which(TRUE | is.na(g), arr.ind = TRUE)), value = c(g))

which(nchar(g, "width") > -1, arr.ind = TRUE)

cbind(as.data.frame(which(nchar(g, "width") > -1, arr.ind = TRUE)), value = c(g))

and we will obtain

   dim1 dim2 dim3 value
1     1    1    1     a
2     2    1    1     b
3     3    1    1     c
4     1    2    1     d
5     2    2    1     e
6     3    2    1     f
7     1    3    1     g
8     2    3    1     h
9     3    3    1     i
10    1    1    2     j
11    2    1    2     k
12    3    1    2     l
13    1    2    2     m
14    2    2    2     n
15    3    2    2     o
16    1    3    2     p
17    2    3    2     q
18    3    3    2     r
19    1    1    3     s
20    2    1    3     t
21    3    1    3     u
22    1    2    3     v
23    2    2    3     w
24    3    2    3     x
25    1    3    3     y
26    2    3    3     z
27    3    3    3  <NA>

Dummy Data

> (g <- array(letters[1:27], dim = c(3, 3, 3)))
, , 1

     [,1] [,2] [,3]
[1,] "a"  "d"  "g"
[2,] "b"  "e"  "h"
[3,] "c"  "f"  "i"

, , 2

     [,1] [,2] [,3]
[1,] "j"  "m"  "p"
[2,] "k"  "n"  "q"
[3,] "l"  "o"  "r"

, , 3

     [,1] [,2] [,3]
[1,] "s"  "v"  "y"
[2,] "t"  "w"  "z"
[3,] "u"  "x"  NA

Taxiway answered 31/5, 2023 at 20:57 Comment(11)

is is.finite() just a trick to get all of the elements? I guess I could define isTRUE(x) as is.finite(x) | !is.finite(x) ... – Brentbrenton 31/5, 2023 at 21:13

@BenBolker Yes, I assume all entries are finite to play that trick. but yours is absolutely more generalized :) – Taxiway 31/5, 2023 at 21:15

@BenBolker Probably ^ is more efficient for numerical entries or NA or Inf. – Taxiway 31/5, 2023 at 21:23

You mean ^0 ? – Brentbrenton 31/5, 2023 at 21:42

@BenBolker I think both 1^g and g^0 should work – Taxiway 31/5, 2023 at 21:44

Based on minimal testing, looks like you're right. (Why would you prefer 1^g > 0 to 1^g == 1 or as.logical(1^g) ? fewer characters/code golfing?) – Brentbrenton 31/5, 2023 at 22:2

Solutions should work for any data type. Try yours with (g <- array(letters[1:27], dim = c(3, 3, 3))) – Punctate 1/6, 2023 at 8:50

@Punctate yes, you are right. then we can try 1^nchar(g) instead – Taxiway 1/6, 2023 at 8:58

@BenBolker yes, I am used to code golfing. – Taxiway 1/6, 2023 at 8:59

@Punctate But nchar won't work if there is "". Anyway, thanks for your feedback. I just assumed OP has numeric arrays so I played that trick. – Taxiway 1/6, 2023 at 9:6

@Punctate I updated my solution with 1^is.na(g), which should fit for arrays with all types of entries. – Taxiway 1/6, 2023 at 11:54

Benchmarking, Just for Fun

Here are some interesting benchmarking observations for arrays of different dimensions (ignoring dimension names for simplicity), taking into account several of the existing solutions to the posted question.

Disclaimer: We DON'T draw a conclusion about which approach is the "best". You (not only the OP, but anyone who needs this sort of functionality, i.e., indexing of multi-dimensional arrays) are free to pick the one that suits your purpose best.

Below is the benchmarking function, which takes the dimensions of a random array as its argument:

library(microbenchmark)
library(data.table)

fbench <- function(dims) {
    # dummy data for test
    set.seed(0)
    g <- array(sample(prod(dims)), dim = dims)

    # list of approaches
    expgrd <- function() {
        data.frame(expand.grid(lapply(dim(g), seq_len)), value = as.vector(g))
    }

    arrind <- function() {
        data.frame(arrayInd(seq_along(g), dim(g)), value = as.vector(g))
    }

    which0 <- function() {
        data.frame(which(array(TRUE, dim(g)), arr.ind = TRUE), value = as.vector(g))
    }

    which1 <- function() {
        cbind(as.data.frame(which(1^is.na(g) > 0, arr.ind = TRUE)), value = c(g))
    }

    which2 <- function() {
        cbind(as.data.frame(which(TRUE | is.na(g), arr.ind = TRUE)), value = c(g))
    }

    which3 <- function() {
        cbind(as.data.frame(which(nchar(g, "width") > -1, arr.ind = TRUE)), value = c(g))
    }

    dftable0 <- function() {
        list2DF(lapply(as.data.frame.table(g), unclass))
    }

    dftable1 <- function() {
        list2DF(lapply(as.data.frame(ftable(g)), unclass))
    }

    dttable <- function() {
        as.data.table(g, sorted = FALSE, na.rm = FALSE)
    }

    rem0 <- function() {
        as.data.frame(cbind(1 + mapply(
            `%%`,
            Reduce(`%/%`, dim(g)[-length(dim(g))], 0:(length(g) - 1), accumulate = TRUE),
            dim(g)
        ), Value = as.vector(g)))
    }

    rem1 <- function() {
        . <- 0:(length(g) - 1)
        as.data.frame(cbind(
            1 +
                t(t(cbind(., outer(., cumprod(dim(g)[-length(dim(g))]), `%/%`))) %% dim(g)),
            Value = g
        ))
    }

    reprep <- function() {
        list2DF(c(Map(
            \(i, j, n) rep(rep(1:i, each = j), length.out = n),
            dim(g),
            c(1, cumprod(dim(g)[-length(dim(g))])),
            length(g)
        ), Value = list(as.vector(g))))
    }

    # benchmarking module
    mbm <- microbenchmark(
        expgrd(),
        arrind(),
        which0(),
        which1(),
        which2(),
        which3(),
        dftable0(),
        dftable1(),
        dttable(),
        rem0(),
        rem1(),
        reprep(),
        times = 50L,
        check = "equivalent"
    )

    boxplot(mbm, main = sprintf("dim = [%s]", toString(dims)), las = 2)
}

We run fbench(dims) for dims <- rep(5, k) with k = 3, 4, 5, 6, 7, 8. Each call draws a boxplot of the timings per approach, titled with the dimensions (for example "dim = [5, 5, 5]" for k = 3).
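
Before comparing timings, it can be worth confirming that the approaches return the same indices and values. A minimal base R sketch for three of them, on a small array, comparing columns by position because the column names differ between approaches (the helper cols and the result names r_expgrd, r_arrind, r_which0 and agree are illustrative assumptions):

```r
set.seed(0)
g <- array(sample(24), dim = c(2, 3, 4))

r_expgrd <- data.frame(expand.grid(lapply(dim(g), seq_len)), value = as.vector(g))
r_arrind <- data.frame(arrayInd(seq_along(g), dim(g)), value = as.vector(g))
r_which0 <- data.frame(which(array(TRUE, dim(g)), arr.ind = TRUE), value = as.vector(g))

# Strip column names so that only the column contents are compared.
cols <- function(df) unname(as.list(df))

agree <- c(
  arrind = isTRUE(all.equal(cols(r_expgrd), cols(r_arrind))),
  which0 = isTRUE(all.equal(cols(r_expgrd), cols(r_which0)))
)
agree
```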

Taxiway answered 2/6, 2023 at 8:9 Comment(13)

My takeaway from this would be "dftable0 is usually pretty good"... – Brentbrenton 4/6, 2023 at 21:5

@BenBolker well...I would say that the size of array matters to the performance, where dftable0 is always the middle-class regardless of the size :) – Taxiway 4/6, 2023 at 21:15

I guess (1) I don't expect this component to be a significant performance bottleneck (2) dftable0 never seems to be terrible and (3) I prefer the list2DF solutions on aesthetic grounds ... – Brentbrenton 4/6, 2023 at 21:20

Maybe you can add reshape2::melt and the variants using %% and %/% or rep? – Genovera 5/6, 2023 at 4:31

@BenBolker Yes, that's fair enough :) – Taxiway 5/6, 2023 at 7:37

@Genovera yes, added. Interesting that rep has such a strong performance! Cool! – Taxiway 5/6, 2023 at 8:17

Thanks! Performance will change by size and might not be that important. But anyway nice comparison. – Genovera 5/6, 2023 at 8:50

Now I'm thinking about adding the rep-based solution to gtools ... – Brentbrenton 5/6, 2023 at 14:9

@BenBolker Fine to read that my code will maybe be used in gtools! – Genovera 5/6, 2023 at 16:13

Hmm. I thought these would all preserve dimnames but apparently reprep doesn't ... ?? – Brentbrenton 6/6, 2023 at 17:14

@BenBolker Nope. none of the approaches for this benchmark will preserve the dimension names. – Taxiway 6/6, 2023 at 20:46

I think that's not true -

a <- array(1:8, dim  = c(2,2,2), dimnames=list(d1 = letters[1:2], d2 = LETTERS[1:2], d3 = c("x", "y"))); as.data.frame.table(a)

is built-in, is never worst, and preserves dimnames and dimname-names ... – Brentbrenton 6/6, 2023 at 20:53

@BenBolker Yes, you are right on that point. However, in this benchmark I force all approaches to produce a uniform output for a fair comparison, i.e., integer indices instead of dimension names. That's why as.data.frame.table(a) is followed by unclass in my benchmarking script. – Taxiway 6/6, 2023 at 21:10
