When should one use a data.frame, and when is it better to use a matrix?
Both keep data in a rectangular format, so sometimes it's unclear.
Are there any general rules of thumb for when to use which data type?
Part of the answer is contained already in your question: You use data frames if columns (variables) can be expected to be of different types (numeric/character/logical etc.). Matrices are for data of the same type.
Consequently, the choice matrix/data.frame is only problematic if you have data of the same type.
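A minimal sketch of the type point, using made-up toy values (the names x, g, m and d are only illustrative): a matrix holds a single type, so mixing numbers and strings coerces everything to character, while a data frame keeps one type per column.

```r
# Hypothetical toy data: one numeric and one character variable
x <- c(1.5, 2.5, 3.5)
g <- c("a", "b", "c")

m <- cbind(x, g)               # matrix: all cells coerced to a common type
d <- data.frame(x = x, g = g)  # data frame: each column keeps its own type

typeof(m)         # "character"
sapply(d, class)  # x stays numeric; g is character (a factor before R 4.0)
```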
The answer depends on what you are going to do with the data in the data.frame/matrix. If it is going to be passed to other functions, then the argument types those functions expect determine the choice.
Also:
Matrices are more memory efficient:
m = matrix(1:4, 2, 2)
d = as.data.frame(m)
object.size(m)
# 216 bytes
object.size(d)
# 792 bytes
Matrices are a necessity if you plan to do any linear-algebra type of operations.
Data frames are more convenient if you frequently refer to their columns by name (via the compact $ operator).
Data frames are also IMHO better for reporting (printing) tabular information as you can apply formatting to each column separately.
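To illustrate the linear-algebra and $ points from the list above, here is a small sketch with a made-up 2 x 2 matrix. The commented-out lines are the ones expected to fail.

```r
m <- matrix(c(2, 0, 1, 3), nrow = 2)  # hypothetical 2 x 2 numeric matrix
d <- as.data.frame(m)                 # same values; columns are named V1 and V2

p_mat <- m %*% m                      # matrix product works directly on a matrix
# d %*% d                             # error: %*% needs numeric/complex matrix or vector arguments
p_df <- as.matrix(d) %*% as.matrix(d) # works once the data frame is coerced explicitly

d$V1                                  # column access by name with $
# m$V1                                # error: $ is not valid for atomic vectors
```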
Something not mentioned by @Michal is that not only is a matrix smaller than the equivalent data frame, but using matrices can also make your code run considerably faster than using data frames. That is one reason why, internally, a lot of R functions coerce data held in data frames to matrices.
Data frames are often far more convenient; one doesn't always have solely atomic chunks of data lying around.
Note that you can have a character matrix; you don't just have to have numeric data to build a matrix in R.
In converting a data frame to a matrix, note that there is a data.matrix() function, which handles factors appropriately by converting them to numeric values based on the internal levels. Coercing via as.matrix() will result in a character matrix if the data frame contains any factor column, even one whose labels look numeric. Compare:
> head(as.matrix(data.frame(a = factor(letters), B = factor(LETTERS))))
a B
[1,] "a" "A"
[2,] "b" "B"
[3,] "c" "C"
[4,] "d" "D"
[5,] "e" "E"
[6,] "f" "F"
> head(data.matrix(data.frame(a = factor(letters), B = factor(LETTERS))))
a B
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 5 5
[6,] 6 6
I nearly always use a data frame for my data analysis tasks as I often have more than just numeric variables. When I code functions for packages, I almost always coerce to matrix and then format the results back out as a data frame. This is because data frames are convenient.
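The "coerce to matrix, compute, return a data frame" pattern described above could look roughly like this. center_columns() is a hypothetical helper written for illustration, not a function from any package.

```r
# Hypothetical helper: centre every column of an all-numeric data frame
center_columns <- function(df) {
  m <- data.matrix(df)                 # coerce once at the boundary
  centred <- sweep(m, 2, colMeans(m))  # do the actual work in matrix space
  as.data.frame(centred)               # hand a data frame back to the caller
}

res <- center_columns(data.frame(a = c(1, 2, 3), b = c(10, 20, 60)))
res
```

The caller only ever sees data frames, while the arithmetic runs on a matrix.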
@Michal: Matrices aren't really more memory efficient:
m <- matrix(1:400000, 200000, 2)
d <- data.frame(m)
object.size(m)
# 1600200 bytes
object.size(d)
# 1600776 bytes
... unless you have a large number of columns:
m <- matrix(1:400000, 2, 200000)
d <- data.frame(m)
object.size(m)
# 1600200 bytes
object.size(d)
# 22400568 bytes
data.frames offer more flexibility over column types. data.frame(a = rnorm(1e6), b = sample(letters, 1e6, TRUE)) will be much smaller (6x by my quick calculation) in memory than the matrix version because of type coercion.
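A scaled-down sketch of that comparison, using 1e5 rows instead of 1e6 to keep it quick. The exact sizes depend on the R version and platform, so no figures are quoted here.

```r
set.seed(42)
n <- 1e5
a <- rnorm(n)
b <- sample(letters, n, replace = TRUE)

d <- data.frame(a = a, b = b)  # the numeric column stays numeric
m <- cbind(a, b)               # character matrix: every number becomes a string

typeof(m)                      # "character"
object.size(d)
object.size(m)                 # expected to be noticeably larger than d
```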
The matrix is actually a vector with additional methods, while a data.frame is a list. The difference comes down to vector vs. list. For computational efficiency, stick with a matrix; use a data.frame only if you have to.
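The vector-versus-list distinction can be checked directly in base R. A small sketch with arbitrary values:

```r
m <- matrix(1:6, nrow = 2)
d <- as.data.frame(m)

is.list(m)     # FALSE: a matrix is an atomic vector ...
attributes(m)  # ... whose only attribute here is dim = c(2, 3)
is.list(d)     # TRUE: a data frame is a list of equal-length columns
length(m)      # 6, the number of cells
length(d)      # 3, the number of columns
```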
I cannot stress the efficiency difference between the two enough! It is true that DFs are more convenient in some cases, especially in data analysis, that they allow heterogeneous data, and that some libraries accept only them, but all of this is really secondary unless you are writing one-off code for a specific task.
Let me give you an example. There was a function that calculated the 2D path of an MCMC method. Basically, this means we take an initial point (x,y) and iterate a certain algorithm to find a new point (x,y) at each step, constructing the whole path this way. The algorithm involves evaluating a fairly complex function and generating a random variable at each iteration, so when it ran for 12 seconds I thought that was fine given how much it does at each step. That said, the function collected all points of the constructed path, together with the value of an objective function, in a 3-column data.frame. Three columns is not that large, and the number of steps, 10,000, was also more than reasonable (in this kind of problem, paths of length 1,000,000 are typical, so 10,000 is nothing). So I thought a 10,000x3 DF was definitely not an issue. The reason a DF was used is simple: after calling the function, ggplot() was called to draw the resulting (x,y)-path, and ggplot() does not accept a matrix.
Then, at some point, out of curiosity I decided to change the function to collect the path in a matrix. Fortunately, the syntax for DFs and matrices is similar, so all I did was change the line initializing df as a data.frame to one initializing it as a matrix. I should also mention that in the original code the DF was initialized at its final size, so the rest of the function only wrote new values into already allocated space, and there was no overhead from adding new rows to the DF. This makes the comparison even fairer, and it also made my job simpler, as I did not need to rewrite anything else in the function: just a one-line change from allocating a data.frame of the required size to allocating a matrix of the same size. To adapt the new version of the function to ggplot(), I converted the returned matrix to a data.frame before passing it to ggplot().
After I reran the code, I could not believe the result. The code ran in a fraction of a second, instead of about 12 seconds! And again, during the 10,000 iterations the function only read and wrote values in already allocated space in a DF (and now in a matrix). And this difference shows up even at the reasonable (or rather small) size of 10,000x3.
So, if your only reason to use a DF is compatibility with a library function such as ggplot(), you can always convert to a DF at the last moment; work with matrices for as long as that is convenient. If, on the other hand, there is a more substantial reason to use a DF, then you should not worry about efficiency. That is the case if you use a data analysis package that would otherwise require constant conversion between matrices and DFs, if you do no intensive calculations yourself and only use standard packages (many of them internally transform a DF to a matrix, do their job, and then transform the result back, so they do all the efficiency work for you), or if you are doing a one-time job and simply feel more comfortable with DFs.
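The experiment described above can be approximated with the sketch below. It is not the original code: simulate_path() is a hypothetical stand-in that replaces the MCMC step with a plain 2D random walk and a made-up objective function. Timings depend on the machine and R version, so none are quoted.

```r
# Hypothetical stand-in for the sampler: a plain 2D random walk
simulate_path <- function(n, use_matrix = TRUE) {
  # Preallocate the full path, as in the story above
  path <- matrix(0, nrow = n, ncol = 3,
                 dimnames = list(NULL, c("x", "y", "value")))
  if (!use_matrix) path <- as.data.frame(path)  # same shape, but a data frame
  for (i in 2:n) {
    x <- path[i - 1, "x"] + rnorm(1)
    y <- path[i - 1, "y"] + rnorm(1)
    path[i, "x"] <- x
    path[i, "y"] <- y
    path[i, "value"] <- x^2 + y^2               # made-up objective function
  }
  as.data.frame(path)  # convert only at the end, e.g. for ggplot()
}

set.seed(1)
system.time(p_mat <- simulate_path(10000, use_matrix = TRUE))
system.time(p_df  <- simulate_path(10000, use_matrix = FALSE))
```

Both calls return a data frame of the same shape; only the container used inside the loop differs.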
Or a different, more practical rule: if you have a question like the OP's, use matrices. You would then use DFs only when the question does not arise (because you already know you have to use DFs, or because you do not really care since the code is one-off, etc.).
But in general, always keep this efficiency point in mind as a priority.
Matrices and data frames are rectangular 2D arrays and can be heterogeneous by rows and columns. They share some methods and properties, but not all.
Examples:
M <- list(3.14,TRUE,5L,c(2,3,5),"dog",1i) # a list
dim(M) <- c(2,3) # set dimensions
print(M) # print result
# [,1] [,2] [,3]
# [1,] 3.14 5 "dog"
# [2,] TRUE Numeric,3 0+1i
DF <- data.frame(M) # a data frame
print(DF) # print result
# X1 X2 X3
# 1 3.14 5 dog
# 2 TRUE 2, 3, 5 0+1i
M <- matrix(c(1,1,1,1,2,3,1,3,6),3) # a numeric matrix
DF <- data.frame(M) # an all-numeric data frame
solve(M) # obtains inverse matrix
solve(DF) # obtains inverse matrix
det(M) # obtains determinant
det(DF) # error
Here is an interesting result. Working with matrices is faster than working with tibbles, but the difference shrinks as the matrix gets larger. Note that the overhead of converting a matrix to a tibble is larger than that of converting a tibble to a matrix. As a broad generalization, working exclusively (no type coercion) in matrix space is about 20% faster than working in dplyr space.
library(dplyr)
# big (?) matrix
m <- matrix(runif(50000), 10000, 5, dimnames =list(NULL,c(letters[1:5])))
d <- as_tibble(m)
# return a matrix, convert tibble to matrix after operation
bench::mark(
  zm <- apply(m, 2, FUN = \(x) cumprod(1 + x) * 100),
  zd <- d |>
    mutate(across(everything(), .fns = \(x) cumprod(1 + x) * 100)) |>
    as.matrix()
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl>
#> 1 zm <- apply(m, 2, FUN = function(x… 5.49ms 5.66ms 176. 2.52MB 8.67
#> 2 zd <- as.matrix(mutate(d, across(e… 6.9ms 7.05ms 140. 3.54MB 4.17
# return a tibble. Convert matrix to tibble after operation
bench::mark(
  zm <- as_tibble(apply(m, 2, FUN = \(x) cumprod(1 + x) * 100)),
  zd <- d |>
    mutate(across(everything(), .fns = \(x) cumprod(1 + x) * 100))
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl>
#> 1 zm <- as_tibble(apply(m, 2, FUN = … 5.71ms 5.92ms 167. 2.87MB 11.7
#> 2 zd <- mutate(d, across(everything(… 6.72ms 6.85ms 144. 788.62KB 4.18
# small matrix
m <- matrix(runif(500), 100, 5, dimnames =list(NULL,c(letters[1:5])))
d <- as_tibble(m,.name_repair = "unique")
# return a matrix
bench::mark(
  zm <- apply(m, 2, FUN = \(x) cumprod(1 + x) * 100),
  zd <- d |>
    mutate(across(everything(), .fns = \(x) cumprod(1 + x) * 100)) |>
    as.matrix()
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 zm <- apply(m, 2, FUN = function… 24.1µs 27.2µs 29303. 35.8KB 17.6
#> 2 zd <- as.matrix(mutate(d, across… 1.69ms 1.79ms 548. 19.2KB 17.6
# return a tibble
bench::mark(
  zm <- as_tibble(apply(m, 2, FUN = \(x) cumprod(1 + x) * 100)),
  zd <- d |>
    mutate(across(everything(), .fns = \(x) cumprod(1 + x) * 100))
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 zm <- as_tibble(apply(m, 2, FU… 111.3µs 126.1µs 7586. 39.9KB 19.2
#> 2 zd <- mutate(d, across(everyth… 1.63ms 1.72ms 565. 15.2KB 17.1
Created on 2023-07-03 with reprex v2.0.2