Should I use a data.frame or a matrix?

Asked 1/3, 2011 at 18:36 Answered 3/7, 2023 at 17:3

166

When should one use a data.frame, and when is it better to use a matrix?

Both keep data in a rectangular format, so sometimes it's unclear.

Are there any general rules of thumb for when to use which data type?

Ragi answered 1/3, 2011 at 18:36 Comment(1)

Often a matrix can be better suited to a particular type of data, but if the package you want to use to analyze said matrix expects a data frame, you will always have to needlessly convert it. I think there is no way to avoid remembering which package uses which. – Fregger 4/9, 2013 at 8:26

185

Part of the answer is contained already in your question: You use data frames if columns (variables) can be expected to be of different types (numeric/character/logical etc.). Matrices are for data of the same type.

Consequently, the choice matrix/data.frame is only problematic if you have data of the same type.
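To make the type distinction concrete, here is a minimal sketch with made-up column names (`id`, `name`, `ok`). It assumes nothing beyond base R: the data frame keeps one type per column, while binding the same columns into a matrix forces a single common type.

```r
# A data frame keeps each column's own type.
d <- data.frame(id = 1:3, name = c("a", "b", "c"), ok = c(TRUE, FALSE, TRUE),
                stringsAsFactors = FALSE)
col_classes <- sapply(d, class)

# Binding the same columns into a matrix coerces everything to one type
# (character here, the most general of the three).
m <- cbind(id = 1:3, name = c("a", "b", "c"), ok = c(TRUE, FALSE, TRUE))
m_type <- typeof(m)

print(col_classes)
print(m_type)
```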

The answer depends on what you are going to do with the data in the data.frame/matrix. If it is going to be passed to other functions, then the argument types those functions expect determine the choice.

Also:

Matrices are more memory efficient:

m = matrix(1:4, 2, 2)
d = as.data.frame(m)
object.size(m)
# 216 bytes
object.size(d)
# 792 bytes

Matrices are a necessity if you plan to do any linear algebra-type of operations.

Data frames are more convenient if you frequently refer to their columns by name (via the compact $ operator).

Data frames are also IMHO better for reporting (printing) tabular information as you can apply formatting to each column separately.
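The linear algebra and column-access points can be sketched as follows. This is only an illustration with a small made-up 2x2 matrix, using base R functions (`solve()`, `%*%`):

```r
m <- matrix(c(2, 1, 1, 3), nrow = 2, dimnames = list(NULL, c("x", "y")))
d <- as.data.frame(m)

# Linear algebra works directly on the matrix.
m_inv <- solve(m)          # inverse
prod_check <- m %*% m_inv  # should be (numerically) the identity

# Columns are addressed by name with `$` on the data frame,
# but with [ , "name"] on the matrix.
col_d <- d$x
col_m <- m[, "x"]

print(round(prod_check, 10))
```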

Insurgent answered 1/3, 2011 at 19:0 Comment(1)

One thing I would add to this answer is that if you plan on using the ggplot2 package to make graphs, ggplot2 only works with data.frames and not matrices. Just something to be aware of! – Scaphoid 28/3, 2017 at 15:1

Something not mentioned by @Michal is that a matrix is not only smaller than the equivalent data frame: code that uses matrices can also be considerably more efficient than code that uses data frames. That is one reason why, internally, a lot of R functions coerce data held in data frames to matrices.

Data frames are often far more convenient; one doesn't always have solely atomic chunks of data lying around.

Note that you can have a character matrix; you don't just have to have numeric data to build a matrix in R.

In converting a data frame to a matrix, note that there is a data.matrix() function, which handles factors appropriately by converting them to numeric values based on the internal levels. Coercing via as.matrix() will result in a character matrix if any of the factor labels is non-numeric. Compare:

> head(as.matrix(data.frame(a = factor(letters), B = factor(LETTERS))))
     a   B  
[1,] "a" "A"
[2,] "b" "B"
[3,] "c" "C"
[4,] "d" "D"
[5,] "e" "E"
[6,] "f" "F"
> head(data.matrix(data.frame(a = factor(letters), B = factor(LETTERS))))
     a B
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 5 5
[6,] 6 6

I nearly always use a data frame for my data analysis tasks as I often have more than just numeric variables. When I code functions for packages, I almost always coerce to matrix and then format the results back out as a data frame. This is because data frames are convenient.

Longshore answered 1/3, 2011 at 19:14 Comment(3)

I've been wondering about the difference between data.matrix() and as.matrix(), too. Thanks for clarifying it and for your programming tips. – Ragi 2/3, 2011 at 0:17

Thanks for sharing @Gavin Simpson! Could you introduce a bit more about how to go back from 1-6 to a-f? – Preliminary 20/7, 2015 at 14:40

@YZhang You'd need to store the labels for each factor and a logical vector indicating which columns of the matrix were factors. Then it would be relatively trivial to convert just those columns that were factors back into factors with the correct labels. Comments aren't good places for code, so see if the Q has been asked & answered before and if not ask a new question. – Longshore 20/7, 2015 at 14:56
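One possible way to do what that last comment describes, as a rough sketch rather than the commenter's own code: the helper names `is_fac`, `labs`, and `back` are made up for illustration, and the example frame is a shortened version of the one above with an extra numeric column `n`.

```r
d <- data.frame(a = factor(letters[1:6]), B = factor(LETTERS[1:6]), n = 1:6)

# Remember which columns were factors, and their level labels, before converting.
is_fac <- vapply(d, is.factor, logical(1))
labs   <- lapply(d[is_fac], levels)

m <- data.matrix(d)   # factors become their integer codes

# Rebuild a data frame, restoring only the former factor columns.
back <- as.data.frame(m)
for (nm in names(labs)) {
  back[[nm]] <- factor(labs[[nm]][back[[nm]]], levels = labs[[nm]])
}

print(head(back))
```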

@Michal: Matrices aren't really more memory efficient:

m <- matrix(1:400000, 200000, 2)
d <- data.frame(m)
object.size(m)
# 1600200 bytes
object.size(d)
# 1600776 bytes

... unless you have a large number of columns:

m <- matrix(1:400000, 2, 200000)
d <- data.frame(m)
object.size(m)
# 1600200 bytes
object.size(d)
# 22400568 bytes

Sculpsit answered 2/2, 2012 at 19:19 Comment(1)

the memory efficiency argument is really about data.frames offering more flexibility over column types. data.frame(a = rnorm(1e6), b = sample(letters, 1e6, TRUE)) will be much smaller (6x by my quick calculation) in memory than the matrix version because of type coercion. – Eyeshot 13/12, 2017 at 11:12
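A quick way to check this claim yourself is sketched below. It uses a smaller `n` than the comment's 1e6 so that it runs quickly, and builds the matrix with `cbind()` (an assumption; the comment does not say how the matrix version was made). The exact sizes depend on your R version and platform.

```r
set.seed(1)
n <- 1e4  # smaller than the 1e6 in the comment, to keep the example quick
a <- rnorm(n)
b <- sample(letters, n, TRUE)

d <- data.frame(a = a, b = b)   # numeric column stays numeric
m <- cbind(a = a, b = b)        # everything is coerced to character

size_d <- object.size(d)
size_m <- object.size(m)

print(typeof(m))
print(size_d)
print(size_m)
```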

A matrix is actually a vector with additional methods, while a data.frame is a list. The difference comes down to vector vs. list. For computational efficiency, stick with a matrix; use a data.frame only if you have to.
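The vector-vs-list distinction can be seen directly in base R. In this small sketch (with made-up values), a plain vector becomes a matrix once it is given a `dim` attribute, while a data frame reports itself as a list of columns:

```r
x <- 1:6
dim(x) <- c(2, 3)          # adding a dim attribute makes the vector a matrix
is_mat <- is.matrix(x)

d <- data.frame(a = 1:2, b = c("u", "v"), stringsAsFactors = FALSE)
is_lst <- is.list(d)       # a data frame is a list of equal-length columns

# Indexing reflects the difference: x[5] is the 5th element in
# column-major order; d[[2]] is the whole second column.
fifth  <- x[5]
second <- d[[2]]

print(c(is_mat = is_mat, is_lst = is_lst))
```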

Maddis answered 1/3, 2011 at 21:28 Comment(1)

Hmm, a matrix is a vector with dimensions, I don't see where methods come in to it? – Longshore 1/3, 2011 at 22:7

I cannot stress enough the efficiency difference between the two! It is true that DFs are more convenient in some cases, especially in data analysis, that they allow heterogeneous data, and that some libraries accept only them, but all of this is secondary unless you are writing one-off code for a specific task.

Let me give you an example. There was a function that calculated the 2D path of an MCMC method. Basically, this means we take an initial point (x,y) and iterate a certain algorithm to find a new point (x,y) at each step, constructing the whole path this way. The algorithm involves evaluating a fairly complex function and generating some random variables at each iteration, so when it ran for 12 seconds I thought that was fine given how much it does at each step. The function collected all points of the constructed path, together with the value of an objective function, in a 3-column data.frame. Three columns is not many, and the number of steps, 10,000, was also more than reasonable (in this kind of problem, paths of length 1,000,000 are typical, so 10,000 is nothing). So I thought a 10,000x3 DF was definitely not an issue. The reason a DF was used is simple: after calling the function, ggplot() was called to draw the resulting (x,y)-path, and ggplot() does not accept a matrix.

Then, at some point, out of curiosity, I decided to change the function to collect the path in a matrix. Fortunately, the syntax of DFs and matrices is similar, so all I did was change the line that initialized df as a data.frame to one that initialized it as a matrix. I should also mention that in the original code the DF was initialized at its final size, so the function only wrote new values into already allocated space, and there was no overhead from adding new rows to the DF. This makes the comparison even fairer, and it also made my job simpler, as I did not need to rewrite anything else in the function: just a one-line change from allocating a data.frame of the required size to allocating a matrix of the same size. To adapt the new version of the function to ggplot(), I converted the returned matrix to a data.frame before passing it to ggplot().

After I reran the code I could not believe the result. The code ran in a fraction of a second instead of about 12 seconds! And again, during the 10,000 iterations the function only read and wrote values in already allocated space in a DF (and now in a matrix). And this difference shows up even at the reasonable (or rather small) size of 10,000x3.
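The original function is not shown, so the following is only a rough reconstruction of the experiment under stated assumptions: `fill_path()` is a hypothetical stand-in for the MCMC loop, the per-step values are placeholders, and `n` is smaller than the 10,000 steps described. It writes one row per iteration into preallocated storage, once as a data.frame and once as a matrix, and converts the matrix to a data.frame only at the end. Timings will vary by machine and R version.

```r
n <- 2000  # hypothetical size; the answer used 10,000 steps

fill_path <- function(path, n) {
  # Hypothetical stand-in for the MCMC step: writes one row per iteration
  # into storage that was allocated up front.
  for (i in seq_len(n)) {
    path[i, 1] <- i
    path[i, 2] <- i / 2
    path[i, 3] <- i^2
  }
  path
}

df0 <- data.frame(x = numeric(n), y = numeric(n), value = numeric(n))
m0  <- matrix(0, nrow = n, ncol = 3, dimnames = list(NULL, names(df0)))

t_df <- system.time(res_df <- fill_path(df0, n))
t_m  <- system.time(res_m  <- fill_path(m0, n))
print(c(data.frame = t_df[["elapsed"]], matrix = t_m[["elapsed"]]))

# Convert only at the end, e.g. before handing the path to ggplot().
plot_df <- as.data.frame(res_m)
```

Both versions produce the same values; only the container differs, which is what makes the timing comparison meaningful.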

So, if your only reason to use a DF is compatibility with a library function such as ggplot(), you can always convert to a DF at the last moment; work with matrices for as long as that is convenient. On the other hand, you should not worry about efficiency if there is a more substantial reason to use a DF: for example, you use a data analysis package that would otherwise require constant conversion between matrices and DFs; or you do not do any intensive calculations yourself and only use standard packages (many of them internally convert a DF to a matrix, do their job, and then convert the result back, so they do all the efficiency work for you); or the job is a one-off, so you do not care and feel more comfortable with DFs.

Or a different, more practical rule: if you find yourself asking the question in the OP, use matrices. You would then use DFs only when the question does not arise (because you already know you have to use DFs, or because you do not really care, as the code is a one-off, etc.).

But in general, always keep this efficiency point in mind as a priority.

Maes answered 25/10, 2018 at 6:39 Comment(0)

Matrices and data frames are rectangular 2D arrays and can be heterogeneous by rows and columns. They share some methods and properties, but not all.

Examples:

M <- list(3.14,TRUE,5L,c(2,3,5),"dog",1i)  # a list
dim(M) <- c(2,3)                           # set dimensions
print(M)                                   # print result

#      [,1]  [,2]      [,3]
# [1,] 3.14  5         "dog"
# [2,] TRUE  Numeric,3 0+1i

DF <- data.frame(M)                   # a data frame
print(DF)                             # print result

#      X1      X2   X3
#  1 3.14       5  dog
#  2 TRUE 2, 3, 5 0+1i

M <- matrix(c(1,1,1,1,2,3,1,3,6),3)   # a numeric matrix
DF <- data.frame(M)                   # an all-numeric data frame

solve(M)                              # obtains inverse matrix
solve(DF)                             # obtains inverse matrix
det(M)                                # obtains determinant
det(DF)                               # error
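If you do need the determinant of an all-numeric data frame, one workaround (a small sketch, not part of the answer above) is to convert it explicitly with `as.matrix()` first. The `tryCatch()` call is only there so the failing call does not stop the script:

```r
M  <- matrix(c(1, 1, 1, 1, 2, 3, 1, 3, 6), 3)
DF <- data.frame(M)

# det() fails on a data frame, so capture the failure instead of stopping.
det_df <- tryCatch(det(DF), error = function(e) NA_real_)

# Converting explicitly gives the same answer as for the original matrix.
det_ok <- det(as.matrix(DF))

print(det_df)
print(det_ok)
```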

Highminded answered 10/12, 2017 at 2:38 Comment(0)

Here is an interesting result. Working with matrices is faster than working with tibbles, but the difference shrinks as the matrix gets larger. Note that the overhead of converting a matrix to a tibble is larger than that of converting a tibble to a matrix. As a broad generalization, working exclusively (no type coercion) in matrix space is about 20% faster than working in dplyr space.

library(dplyr)

# big (?) matrix
m <- matrix(runif(50000), 10000, 5, dimnames =list(NULL,c(letters[1:5])))
d <- as_tibble(m)

# return a matrix, convert tibble to matrix after operation
bench::mark(
   zm <- apply(m, 2, FUN = \(x) cumprod(1+x)*100),
   zd <- d |> 
      mutate(across(everything(),.fns = \(x) cumprod(1+x)*100)) |> 
      as.matrix()
)
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                          <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 zm <- apply(m, 2, FUN = function(x… 5.49ms 5.66ms      176.    2.52MB     8.67
#> 2 zd <- as.matrix(mutate(d, across(e…  6.9ms 7.05ms      140.    3.54MB     4.17

# return a tibble. Convert matrix to tibble after operation
bench::mark(
   zm <- as_tibble(apply(m, 2, FUN = \(x) cumprod(1+x)*100)),
   zd <- d |> 
      mutate(across(everything(),.fns = \(x) cumprod(1+x)*100))
)
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                          <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 zm <- as_tibble(apply(m, 2, FUN = … 5.71ms 5.92ms      167.    2.87MB    11.7 
#> 2 zd <- mutate(d, across(everything(… 6.72ms 6.85ms      144.  788.62KB     4.18

# small matrix
m <- matrix(runif(500), 100, 5, dimnames =list(NULL,c(letters[1:5])))
d <- as_tibble(m,.name_repair = "unique")

# return a matrix
bench::mark(
   zm <- apply(m, 2, FUN = \(x) cumprod(1+x)*100),
   zd <- d |> 
      mutate(across(everything(),.fns = \(x) cumprod(1+x)*100)) |> 
      as.matrix()
)
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 zm <- apply(m, 2, FUN = function…  24.1µs  27.2µs    29303.    35.8KB     17.6
#> 2 zd <- as.matrix(mutate(d, across…  1.69ms  1.79ms      548.    19.2KB     17.6
# return a tibble
bench::mark(
   zm <- as_tibble(apply(m, 2, FUN = \(x) cumprod(1+x)*100)),
   zd <- d |> 
      mutate(across(everything(),.fns = \(x) cumprod(1+x)*100))
)
#> # A tibble: 2 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 zm <- as_tibble(apply(m, 2, FU…  111.3µs  126.1µs     7586.    39.9KB     19.2
#> 2 zd <- mutate(d, across(everyth…   1.63ms   1.72ms      565.    15.2KB     17.1

Created on 2023-07-03 with reprex v2.0.2

Cleric answered 3/7, 2023 at 17:3 Comment(0)
