R: avoid turning one-row data frames into a vector when using apply functions

Asked 17/3, 2022 at 9:24 Answered 17/3, 2022 at 9:46

I often have the problem that R converts my one-column data frames into character vectors, which I solve by using the drop=FALSE option.

However, there are some cases where I do not know how to work around this kind of behavior in R, and this is one of them.

I have a data frame like the following:

mydf <- data.frame(ID=LETTERS[1:3], value1=paste(LETTERS[1:3], 1:3), value2=paste(rev(LETTERS)[1:3], 1:3))

that looks like:

> mydf
  ID value1 value2
1  A    A 1    Z 1
2  B    B 2    Y 2
3  C    C 3    X 3

The task here is to replace spaces with _ in every column except the first, and I want to use an apply-family function for this, sapply in this case.

I do the following:

new_df <- as.data.frame(sapply(mydf[,-1,drop=F], function(x) gsub("\\s+","_",x)))
new_df <- cbind(mydf[,1,drop=F], new_df)

The resulting data frame looks exactly how I want it:

> new_df
  ID value1 value2
1  A    A_1    Z_1
2  B    B_2    Y_2
3  C    C_3    X_3

My problem starts with some rare cases where my input can have one row of data only. For some reason I never understood, R has a completely different behavior in these cases, but no drop=FALSE option can save me here...

My input data frame now is:

mydf <- data.frame(ID=LETTERS[1], value1=paste(LETTERS[1], 1), value2=paste(rev(LETTERS)[1], 1))

which looks like:

> mydf
  ID value1 value2
1  A    A 1    Z 1

However, when I apply the same code, my resulting data frame looks hideous like this:

> new_df
       ID sapply(mydf[, -1, drop = F], function(x) gsub("\\\\s+", "_", x))
value1  A                                                              A_1
value2  A                                                              Z_1

How to solve this issue so that the same line of code gives me the same kind of result for input data frames of any number of rows?

A deeper question would be why on earth does R do this? I keep going back to my codes when I have some new weird inputs with one row/column cause they break everything... Thanks!

Yl answered 17/3, 2022 at 9:24 Comment(0)

You can solve your problem by using lapply instead of sapply, and then combining the result with do.call as follows:

new_df <- as.data.frame(lapply(mydf[,-1,drop=F], function(x) gsub("\\s+","_",x)))
new_df <- do.call(cbind, new_df)
new_df
#     value1 value2
#[1,] "A_1"  "Z_1" 

new_df <- cbind(mydf[,1,drop=F], new_df)
#new_df
#  ID value1 value2
#1  A    A_1    Z_1

As for your question about the unpredictable behavior of sapply: the s in sapply stands for simplification, and the simplified result is not a data frame. It can be a list, a vector, or a matrix, depending on what the function returns.

According to the documentation of sapply:

sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array().

On the simplify argument:

logical or character string; should the result be simplified to a vector, matrix or higher dimensional array if possible? For sapply it must be named and not abbreviated. The default value, TRUE, returns a vector or matrix if appropriate, whereas if simplify = "array" the result may be an array of “rank” (=length(dim(.))) one higher than the result of FUN(X[[i]]).

The Details section explains this behavior, which looks similar to what you experienced (emphasis mine):

Simplification in sapply is only attempted if X has length greater than zero and if the return values from all elements of X are all of the same (positive) length. If the common length is one the result is a vector, and if greater than one is a matrix with a column corresponding to each element of X.
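To make the quoted rule concrete, here is a small sketch that runs the same sapply call on a three-row and a one-row input. The names df3, df1, res3 and res1 are made up for this illustration and are not from the question.

```r
# Three-row input: each gsub() call returns a length-3 vector,
# so sapply() binds the results into a 3 x 2 character matrix.
df3 <- data.frame(value1 = paste(LETTERS[1:3], 1:3),
                  value2 = paste(rev(LETTERS)[1:3], 1:3))
res3 <- sapply(df3, function(x) gsub("\\s+", "_", x))
is.matrix(res3)   # TRUE
dim(res3)         # 3 2

# One-row input: each call returns a length-1 vector ("common length is one"),
# so sapply() collapses the results into a named character vector instead.
df1 <- df3[1, , drop = FALSE]
res1 <- sapply(df1, function(x) gsub("\\s+", "_", x))
is.matrix(res1)   # FALSE
names(res1)       # "value1" "value2"

# as.data.frame() keeps one column per matrix column, but turns a plain
# vector into a single column with one row per element.
dim(as.data.frame(res3))   # 3 2
dim(as.data.frame(res1))   # 2 1
```

That last 2 x 1 data frame is what cbind then recycles the single ID against, which gives the two-row result shown in the question.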

Hadley Wickham also recommends not using sapply:

I recommend that you avoid sapply() because it tries to simplify the result, so it can return a list, a vector, or a matrix. This makes it difficult to program with, and it should be avoided in non-interactive settings

He also recommends not to use apply with a data frame. See Advanced R for further explanation.
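If you do not need a separate new_df, a shorter variant is to assign the lapply result straight back into the columns. This is only a sketch: clean_spaces is a hypothetical helper name, and it assumes the first column is the ID column, as in the question.

```r
# Hypothetical helper: replace runs of whitespace with "_" in every column
# except the first. df[-1] uses list-style indexing, which always returns a
# data frame, and lapply() always returns a list, so nothing is simplified.
clean_spaces <- function(df) {
  df[-1] <- lapply(df[-1], function(x) gsub("\\s+", "_", x))
  df
}

three_rows <- data.frame(ID = LETTERS[1:3],
                         value1 = paste(LETTERS[1:3], 1:3),
                         value2 = paste(rev(LETTERS)[1:3], 1:3))
one_row <- three_rows[1, , drop = FALSE]

clean_spaces(three_rows)
clean_spaces(one_row)
```

Both calls return a data frame with the columns ID, value1 and value2, whatever the number of rows.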

Orthohydrogen answered 17/3, 2022 at 9:38 Comment(2)

Thanks for the detailed explanation! lapply solved it! Truth is, I'm never sure which apply function to use... what's the best rule of thumb? – Yl 18/3, 2022 at 2:28

I'm not experienced enough to recommend the best rule of thumb; I hope the R experts here will. My personal view is that for a vector, vapply is the best. apply is good for matrices and OK for data frames. lapply is good for data frames, but the output is a list, not a data frame. A for loop is the best for sequential operations. Map is good for more than one conditional. Other good options are the map family of functions from the purrr package: purrr.tidyverse.org/reference/map.html – Orthohydrogen 18/3, 2022 at 3:10

You can also use the map_df function from the purrr package, which applies a function to each element of an object and returns a data frame:

library(dplyr)
library(purrr)

mydf %>%
  mutate(map_df(select(cur_data(), starts_with("value")), ~ gsub("\\s", "_", .x)))

  ID value1 value2
1  A    A_1    Z_1

And with the original data frame:

  ID value1 value2
1  A    A_1    Z_1
2  B    B_2    Y_2
3  C    C_3    X_3

Semirigid answered 17/3, 2022 at 9:44 Comment(0)

Here's a solution that replaces the original data. Not sure if this fits into your workflow, though. Notice that I used apply, which processes data frames by rows or columns.

mydf <- data.frame(ID=LETTERS[1], value1=paste(LETTERS[1], 1), value2=paste(rev(LETTERS)[1], 1))

xy <- apply(X = mydf[, -1, drop = FALSE],
      MARGIN = 2,
      FUN = function(x) gsub("\\s+", "_", x),
      simplify = FALSE
)
xy <- do.call(cbind, xy)
xy <- as.data.frame(xy)

mydf[, -1] <- as.data.frame(xy)
mydf

  ID value1 value2
1  A    A_1    Z_1
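The simplify = FALSE argument is what keeps the one-row case from collapsing. As far as I recall it was added to apply in R 4.1.0, so treat that version requirement as an assumption and check ?apply on your installation. A minimal sketch of the difference, with made-up names one, collapsed and kept:

```r
one <- data.frame(value1 = "A 1", value2 = "Z 1")

# Default simplification: two length-1 results collapse into a named vector.
collapsed <- apply(one, 2, function(x) gsub("\\s+", "_", x))
is.null(dim(collapsed))   # TRUE

# simplify = FALSE: one list element per column, regardless of row count.
kept <- apply(one, 2, function(x) gsub("\\s+", "_", x), simplify = FALSE)
is.list(kept)             # TRUE
names(kept)               # "value1" "value2"
```

Because kept is a named list, do.call(cbind, kept) produces one column per original column, which is why the code above works for any number of rows.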

Outer answered 17/3, 2022 at 9:46 Comment(0)
