For each row in an R dataframe

Asked 9/11, 2009 at 4:8 Answered 5/8, 2020 at 9:33

212

I have a dataframe, and for each row in that dataframe I have to do some complicated lookups and append some data to a file.

The dataFrame contains scientific results for selected wells from 96 well plates used in biological research so I want to do something like:

for (well in dataFrame) {
  wellName <- well$name    # string like "H1"
  plateName <- well$plate  # string like "plate67"
  wellID <- getWellID(wellName, plateName)
  cat(paste(wellID, well$value1, well$value2, sep=","), file=outputFile)
}

In my procedural world, I'd do something like:

for (row in dataFrame) {
    #look up stuff using data from the row
    #write stuff to the file
}

What is the "R way" to do this?

Astragalus answered 9/11, 2009 at 4:8 Comment(3)

What is your question here? A data.frame is a two-dimensional object and looping over the rows is a perfectly normal way of doing things as rows are commonly sets of 'observations' of the 'variables' in each column. – Sculpturesque 9/11, 2009 at 4:29

what I end up doing is: for (index in 1:nrow(dataFrame)) { row = dataFrame[index, ]; # do stuff with the row } which never seemed very pretty to me. – Astragalus 9/11, 2009 at 5:33

Does getWellID call a database or anything? Otherwise, Jonathan is probably right and you could vectorize this. – Matteson 9/11, 2009 at 14:44

110

You can try this, using the apply() function:

> d
  name plate value1 value2
1    A    P1      1    100
2    B    P2      2    200
3    C    P3      3    300

> f <- function(x, output) {
 wellName <- x[1]
 plateName <- x[2]
 wellID <- 1
 print(paste(wellID, x[3], x[4], sep=","))
 cat(paste(wellID, x[3], x[4], sep=","), file= output, append = T, fill = T)
}

> apply(d, 1, f, output = 'outputfile')

Tarnopol answered 9/11, 2009 at 14:2 Comment(4)

Be careful, as the dataframe is converted to a matrix, and what you end up with (x) is a vector. This is why the above example has to use numeric indexes; the by() approach gives you a data.frame, which makes your code more robust. – Regeniaregensburg 19/12, 2011 at 5:20

did not work for me. The apply function treated every x given to f as a character value and not a row. – Weatherboarding 10/8, 2014 at 7:36

Note too that you can refer to the columns by name. So: wellName <- x[1] could also be wellName <- x["name"]. – Titration 3/9, 2014 at 11:2

When Darren mentioned robust, he meant something like shifting the orders of the columns. This answer would not work whereas the one with by() would still work. – Clearcut 4/1, 2016 at 6:26
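The coercion described in these comments can be seen in a small sketch. It rebuilds the example data frame d with the values printed in the answer, and shows that each x handed to the function by apply() is a character vector. Looking values up by column name, and converting numbers back with as.numeric(), is one way to survive a column reorder; this is an illustration, not the answerer's code.

```r
d <- data.frame(name = c("A", "B", "C"),
                plate = c("P1", "P2", "P3"),
                value1 = 1:3,
                value2 = c(100, 200, 300))

# apply() runs as.matrix(d) first; mixing character and numeric columns
# gives a character matrix, so every row arrives as a character vector.
row_classes <- apply(d, 1, class)

# Index by column name instead of position and convert numbers back.
sums <- apply(d, 1, function(x) {
  as.numeric(x["value1"]) + as.numeric(x["value2"])
})
```

If the function needs real column types (numbers as numbers, strings as strings), by() or an index-based loop avoids the matrix conversion altogether.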

131

You can use the by() function:

by(dataFrame, seq_len(nrow(dataFrame)), function(row) dostuff)

But iterating over the rows directly like this is rarely what you want to do; you should try to vectorize instead. Can I ask what the actual work in the loop is doing?

Otes answered 9/11, 2009 at 5:54 Comment(6)

this will not work well if the data frame has 0 rows because 1:0 is not empty – Polyanthus 21/4, 2013 at 17:8

Easy fix for the 0 row case is to use seq_len(), insert seq_len(nrow(dataFrame)) in place of 1:nrow(dataFrame). – Yerxa 10/6, 2014 at 16:42

How do you actually implement (row)? Is it dataframe$column? dataframe[somevariableNamehere]? How do you actually say its a row. The pseudocode "function(row) dostuff" how would that actually look? – Phrenetic 7/4, 2016 at 11:0

@Mike, change dostuff in this answer to str(row) You'll see multiple lines printed in the console beginning with " 'data.frame': 1 obs of x variables." But be careful, changing dostuff to row does not return a data.frame object for the outer function as a whole. Instead it returns a list of one row data-frames. – Unconditional 1/5, 2017 at 15:22

Not everything should be vectorized. But in this case it would make sense I guess. – Braud 11/9, 2020 at 8:50

I fixed the issue noted by sds and Jim with an edit. – Crutcher 14/12, 2020 at 23:40
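To make the "function(row) dostuff" placeholder concrete, here is a minimal sketch. The data frame contents are made up for illustration; only by() and seq_len() come from the answer. It also shows why seq_len() is the safer index for an empty data frame.

```r
dataFrame <- data.frame(name = c("H1", "H2"),
                        plate = c("plate67", "plate68"),
                        value1 = c(0.5, 0.7),
                        stringsAsFactors = FALSE)

# Each `row` is a one-row data.frame, so columns are reachable with `$`.
res <- by(dataFrame, seq_len(nrow(dataFrame)),
          function(row) paste(row$plate, row$name, row$value1, sep = ","))

# Strip the "by" class and dimensions to get a plain character vector.
lines <- as.vector(res)

# Why seq_len() matters: 1:0 counts down, seq_len(0) is empty.
length(1:0)         # 2
length(seq_len(0))  # 0
```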

112

First, Jonathan's point about vectorizing is correct. If your getWellID() function is vectorized, then you can skip the loop and just use cat or write.csv:

write.csv(data.frame(wellid=getWellID(well$name, well$plate), 
         value1=well$value1, value2=well$value2), file=outputFile)

If getWellID() isn't vectorized, then Jonathan's recommendation of using by or knguyen's suggestion of apply should work.

Otherwise, if you really want to use for, you can do something like this:

for(i in seq_len(nrow(dataFrame))) {
    row <- dataFrame[i,]
    # do stuff with row
}

You can also try to use the foreach package, although it requires you to become familiar with that syntax. Here's a simple example:

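library(foreach)
library(iterators)  # provides iter()
d <- data.frame(x=1:10, y=rnorm(10))
# %dopar% needs a registered parallel backend; without one it warns and runs sequentially
s <- foreach(row=iter(d, by='row'), .combine=rbind) %dopar% row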

A final option is to use a function out of the plyr package, in which case the convention will be very similar to the apply function.

library(plyr)
ddply(dataFrame, .(x), function(x) {
  # do stuff
})

Matteson answered 9/11, 2009 at 14:4 Comment(5)

Shane, thank you. I'm not sure how to write a vectorized getWellID. What I need to do right now is to dig into an existing list of lists to look it up or pull it out of a database. – Astragalus 9/11, 2009 at 23:45

Feel free to post the getWellID question (i.e. can this function be vectorized?) separately, and I'm sure I (or someone else) will answer it. – Matteson 10/11, 2009 at 1:30

Even if getWellID is not vectorized, I think you should go with this solution, and replace getWellId with mapply(getWellId, well$name, well$plate). – Otes 10/11, 2009 at 2:28

Even if you pull it from a database, you can pull them all at once and then filter the result in R; that will be faster than an iterative function. – Matteson 10/11, 2009 at 3:13

+1 for foreach - I'm going to use the hell out of that one. – Enwind 24/1, 2013 at 6:52
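The mapply() suggestion from the comments could look roughly like the sketch below. The body of getWellID here is a hypothetical stand-in (the asker's real function looks the ID up in a list of lists or a database), and the data frame values are invented for illustration.

```r
# Hypothetical stand-in for the asker's lookup; it handles one well at a time.
getWellID <- function(wellName, plateName) {
  paste(plateName, wellName, sep = "_")
}

dataFrame <- data.frame(name = c("H1", "A3"),
                        plate = c("plate67", "plate12"),
                        value1 = c(1.5, 2.5),
                        value2 = c(10, 20),
                        stringsAsFactors = FALSE)

# mapply() calls getWellID once per (name, plate) pair.
wellIDs <- mapply(getWellID, dataFrame$name, dataFrame$plate,
                  USE.NAMES = FALSE)

out <- data.frame(wellid = wellIDs,
                  value1 = dataFrame$value1,
                  value2 = dataFrame$value2)

# Write everything in one call instead of appending row by row.
outputFile <- tempfile(fileext = ".csv")
write.csv(out, file = outputFile, row.names = FALSE)
```

The per-row lookup still happens, but the file is written once, which avoids reopening it for every row.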

I think the best way to do this with basic R is:

for( i in rownames(df) )
   print(df[i, "column1"])

The advantage over the for( i in 1:nrow(df))-approach is that you do not get into trouble if df is empty and nrow(df)=0.
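The empty-data-frame case can be checked with a small sketch; count_iterations is a helper invented here purely to count how often a loop body would run.

```r
df <- data.frame(column1 = character(0), stringsAsFactors = FALSE)

# Helper (hypothetical): count how many times a for loop over `index` runs.
count_iterations <- function(index) {
  n <- 0
  for (i in index) n <- n + 1
  n
}

count_iterations(1:nrow(df))         # 2: the body runs for i = 1 and i = 0
count_iterations(rownames(df))       # 0
count_iterations(seq_len(nrow(df)))  # 0
```

Both rownames(df) and seq_len(nrow(df)) skip the loop for an empty data frame; the difference is that the first yields character row names and the second integer positions.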

Peignoir answered 16/7, 2017 at 16:7 Comment(0)

I use this simple utility function:

rows = function(tab) lapply(
  seq_len(nrow(tab)),
  function(i) unclass(tab[i,,drop=F])
)

Or a faster, less clear form:

rows = function(x) lapply(seq_len(nrow(x)), function(i) lapply(x,"[",i))

This function just splits a data.frame to a list of rows. Then you can make a normal "for" over this list:

tab = data.frame(x = 1:3, y=2:4, z=3:5)
for (A in rows(tab)) {
    print(A$x + A$y * A$z)
}

Your code from the question will work with a minimal modification:

for (well in rows(dataFrame)) {
  wellName <- well$name    # string like "H1"
  plateName <- well$plate  # string like "plate67"
  wellID <- getWellID(wellName, plateName)
  cat(paste(wellID, well$value1, well$value2, sep=","), file=outputFile)
}

Walkup answered 27/8, 2015 at 18:44 Comment(4)

It's faster to access a straight list than a data.frame. – Marelda 15/5, 2016 at 8:38

Just realized it's even faster to make the same thing with double lapply: rows = function(x) lapply(seq_len(nrow(x)), function(i) lapply(x,function(c) c[i])) – Marelda 15/5, 2016 at 16:45

So the inner lapply iterates over the columns of the entire dataset x, giving each column the name c, and then extracting the ith entry from that column vector. Is this correct? – Jagannath 16/5, 2016 at 12:3

Very nice! In my case, I had to convert from "factor" values to the underlying value: wellName <- as.character(well$name). – Bluecoat 3/2, 2017 at 19:2

I was curious about the time performance of the non-vectorised options. For this purpose, I have used the function f defined by knguyen:

f <- function(x, output) {
  wellName <- x[1]
  plateName <- x[2]
  wellID <- 1
  print(paste(wellID, x[3], x[4], sep=","))
  cat(paste(wellID, x[3], x[4], sep=","), file= output, append = T, fill = T)
}

and a dataframe like the one in his example:

n = 100; #number of rows for the data frame
d <- data.frame( name = LETTERS[ sample.int( 25, n, replace=T ) ],
                  plate = paste0( "P", 1:n ),
                  value1 = 1:n,
                  value2 = (1:n)*10 )

I included two vectorised functions (for sure quicker than the others) in order to compare the cat() approach with a write.table() one...

library("ggplot2")
library( "microbenchmark" )
library( foreach )
library( iterators )

tm <- microbenchmark(S1 =
                       apply(d, 1, f, output = 'outputfile1'),
                     S2 = 
                       for(i in 1:nrow(d)) {
                         row <- d[i,]
                         # do stuff with row
                         f(row, 'outputfile2')
                       },
                     S3 = 
                       foreach(d1=iter(d, by='row'), .combine=rbind) %dopar% f(d1,"outputfile3"),
                     S4= {
                       print( paste(wellID=rep(1,n), d[,3], d[,4], sep=",") )
                       cat( paste(wellID=rep(1,n), d[,3], d[,4], sep=","), file= 'outputfile4', sep='\n',append=T, fill = F)                           
                     },
                     S5 = {
                       print( (paste(wellID=rep(1,n), d[,3], d[,4], sep=",")) )
                       write.table(data.frame(rep(1,n), d[,3], d[,4]), file='outputfile5', row.names=F, col.names=F, sep=",", append=T )
                     },
                     times=100L)
autoplot(tm)

The resulting image shows that apply gives the best performance for a non-vectorised version, whereas write.table() seems to outperform cat().

[Image: ForEachRunningTime]

Fernald answered 14/7, 2015 at 13:12 Comment(0)

You can use the by_row function from the package purrrlyr for this:

myfn <- function(row) {
  #row is a tibble with one row, and the same 
  #number of columns as the original df
  #If you'd rather it be a list, you can use as.list(row)
}

purrrlyr::by_row(df, myfn)

By default, the returned value from myfn is put into a new list column in the df called .out.

If this is the only output you desire, you could write purrrlyr::by_row(df, myfn)$.out

Tremor answered 3/6, 2017 at 19:10 Comment(0)

Well, since you asked for R equivalent to other languages, I tried to do this. Seems to work though I haven't really looked at which technique is more efficient in R.

> myDf <- head(iris)
> myDf
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nRowsDf <- nrow(myDf)
> for(i in 1:nRowsDf){
+ print(myDf[i,4])
+ }
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.4

For the categorical columns though, indexing a single cell this way would fetch you a factor, which you could convert using as.character() if needed.
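A quick check of what single-cell indexing returns for the categorical Species column, using the same head(iris) data as above:

```r
myDf <- head(iris)

cell <- myDf[1, 5]               # one cell of the Species column
class(cell)                      # "factor"
as.character(cell)               # "setosa"

# Keeping drop = FALSE preserves the data.frame structure instead.
class(myDf[1, 5, drop = FALSE])  # "data.frame"
```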

Vampirism answered 13/2, 2015 at 15:4 Comment(0)

You can do something like this for a list object:

data("mtcars")
rownames(mtcars)
data <- list(mtcars ,mtcars, mtcars, mtcars);data

out1 <- NULL 
for(i in seq_along(data)) { 
  out1[[i]] <- data[[i]][rownames(data[[i]]) != "Volvo 142E", ] } 
out1

Or a data frame,

data("mtcars")
df <- mtcars
out1 <- NULL 
for(i in 1:nrow(df)) {
  row <- rownames(df[i,])
  # do stuff with row
  out1 <- df[rownames(df) != "Volvo 142E",]
  
}
out1

Microhenry answered 5/8, 2020 at 9:33 Comment(0)
