For each row in an R dataframe
Asked Answered
A

9

212

I have a dataframe, and for each row in that dataframe I have to do some complicated lookups and append some data to a file.

The dataFrame contains scientific results for selected wells from 96 well plates used in biological research so I want to do something like:

for (well in dataFrame) {
  wellName <- well$name    # string like "H1"
  plateName <- well$plate  # string like "plate67"
  wellID <- getWellID(wellName, plateName)
  cat(paste(wellID, well$value1, well$value2, sep=","), file=outputFile)
}

In my procedural world, I'd do something like:

for (row in dataFrame) {
    #look up stuff using data from the row
    #write stuff to the file
}

What is the "R way" to do this?

Astragalus answered 9/11, 2009 at 4:8 Comment(3)
What is your question here? A data.frame is a two-dimensional object and looping over the rows is a perfectly normal way of doing things as rows are commonly sets of 'observations' of the 'variables' in each column.Sculpturesque
what I end up doing is: for (index in 1:nrow(dataFrame)) { row = dataFrame[index, ]; # do stuff with the row } which never seemed very pretty to me.Astragalus
Does getWellID call a database or anything? Otherwise, Jonathan is probably right and you could vectorize this.Matteson
T
110

You can try this, using apply() function

> d
  name plate value1 value2
1    A    P1      1    100
2    B    P2      2    200
3    C    P3      3    300

> f <- function(x, output) {
 wellName <- x[1]
 plateName <- x[2]
 wellID <- 1
 print(paste(wellID, x[3], x[4], sep=","))
 cat(paste(wellID, x[3], x[4], sep=","), file= output, append = T, fill = T)
}

> apply(d, 1, f, output = 'outputfile')
Tarnopol answered 9/11, 2009 at 14:2 Comment(4)
Be careful, as the dataframe is converted to a matrix, and what you end up with (x) is a vector. This is why the above example has to use numeric indexes; the by() approach gives you a data.frame, which makes your code more robust.Regeniaregensburg
did not work for me. The apply function treated every x given to f as a character value and not a row.Weatherboarding
Note too that you can refer to the columns by name. So: wellName <- x[1] could also be wellName <- x["name"].Titration
When Darren mentioned robust, he meant something like shifting the orders of the columns. This answer would not work whereas the one with by() would still work.Clearcut
O
131

You can use the by() function:

by(dataFrame, seq_len(nrow(dataFrame)), function(row) dostuff)

But iterating over the rows directly like this is rarely what you want to; you should try to vectorize instead. Can I ask what the actual work in the loop is doing?

Otes answered 9/11, 2009 at 5:54 Comment(6)
this will not work well if the data frame has 0 rows because 1:0 is not emptyPolyanthus
Easy fix for the 0 row case is to use seq_len(), insert seq_len(nrow(dataFrame)) in place of 1:nrow(dataFrame).Yerxa
How do you actually implement (row)? Is it dataframe$column? dataframe[somevariableNamehere]? How do you actually say its a row. The pseudocode "function(row) dostuff" how would that actually look?Phrenetic
@Mike, change dostuff in this answer to str(row) You'll see multiple lines printed in the console beginning with " 'data.frame': 1 obs of x variables." But be careful, changing dostuff to row does not return a data.frame object for the outer function as a whole. Instead it returns a list of one row data-frames.Unconditional
Not everything should be vectorized. But in this case it would make sense I guess.Braud
I fixed the issue noted by sds and Jim with an edit.Crutcher
M
112

First, Jonathan's point about vectorizing is correct. If your getWellID() function is vectorized, then you can skip the loop and just use cat or write.csv:

write.csv(data.frame(wellid=getWellID(well$name, well$plate), 
         value1=well$value1, value2=well$value2), file=outputFile)

If getWellID() isn't vectorized, then Jonathan's recommendation of using by or knguyen's suggestion of apply should work.

Otherwise, if you really want to use for, you can do something like this:

for(i in 1:nrow(dataFrame)) {
    row <- dataFrame[i,]
    # do stuff with row
}

You can also try to use the foreach package, although it requires you to become familiar with that syntax. Here's a simple example:

library(foreach)
d <- data.frame(x=1:10, y=rnorm(10))
s <- foreach(d=iter(d, by='row'), .combine=rbind) %dopar% d

A final option is to use a function out of the plyr package, in which case the convention will be very similar to the apply function.

library(plyr)
ddply(dataFrame, .(x), function(x) { # do stuff })
Matteson answered 9/11, 2009 at 14:4 Comment(5)
Shane, thank you. I'm not sure how to write a vectorized getWellID. What I need to do right now is to dig into an existing list of lists to look it up or pull it out of a database.Astragalus
Feel free to post the getWellID question (i.e. can this function be vectorized?) separately, and I'm sure I (or someone else) will answer it.Matteson
Even if getWellID is not vectorized, I think you should go with this solution, and replace getWellId with mapply(getWellId, well$name, well$plate).Otes
Even if you pull it from a database, you can pull them all at once and then filter the result in R; that will be faster than an iterative function.Matteson
+1 for foreach - I'm going to use the hell out of that one.Enwind
T
110

You can try this, using apply() function

> d
  name plate value1 value2
1    A    P1      1    100
2    B    P2      2    200
3    C    P3      3    300

> f <- function(x, output) {
 wellName <- x[1]
 plateName <- x[2]
 wellID <- 1
 print(paste(wellID, x[3], x[4], sep=","))
 cat(paste(wellID, x[3], x[4], sep=","), file= output, append = T, fill = T)
}

> apply(d, 1, f, output = 'outputfile')
Tarnopol answered 9/11, 2009 at 14:2 Comment(4)
Be careful, as the dataframe is converted to a matrix, and what you end up with (x) is a vector. This is why the above example has to use numeric indexes; the by() approach gives you a data.frame, which makes your code more robust.Regeniaregensburg
did not work for me. The apply function treated every x given to f as a character value and not a row.Weatherboarding
Note too that you can refer to the columns by name. So: wellName <- x[1] could also be wellName <- x["name"].Titration
When Darren mentioned robust, he meant something like shifting the orders of the columns. This answer would not work whereas the one with by() would still work.Clearcut
P
32

I think the best way to do this with basic R is:

for( i in rownames(df) )
   print(df[i, "column1"])

The advantage over the for( i in 1:nrow(df))-approach is that you do not get into trouble if df is empty and nrow(df)=0.

Peignoir answered 16/7, 2017 at 16:7 Comment(0)
W
21

I use this simple utility function:

rows = function(tab) lapply(
  seq_len(nrow(tab)),
  function(i) unclass(tab[i,,drop=F])
)

Or a faster, less clear form:

rows = function(x) lapply(seq_len(nrow(x)), function(i) lapply(x,"[",i))

This function just splits a data.frame to a list of rows. Then you can make a normal "for" over this list:

tab = data.frame(x = 1:3, y=2:4, z=3:5)
for (A in rows(tab)) {
    print(A$x + A$y * A$z)
}        

Your code from the question will work with a minimal modification:

for (well in rows(dataFrame)) {
  wellName <- well$name    # string like "H1"
  plateName <- well$plate  # string like "plate67"
  wellID <- getWellID(wellName, plateName)
  cat(paste(wellID, well$value1, well$value2, sep=","), file=outputFile)
}
Walkup answered 27/8, 2015 at 18:44 Comment(4)
It's faster to access a straight list then a data.frame.Marelda
Just realized it's even faster to make the same thing with double lapply: rows = function(x) lapply(seq_len(nrow(x)), function(i) lapply(x,function(c) c[i]))Marelda
So the inner lapply iterates over the columns of the entire dataset x, giving each column the name c, and then extracting the ith entry from that column vector. Is this correct?Jagannath
Very nice! In my case, I had to convert from "factor" values to the underlying value: wellName <- as.character(well$name).Bluecoat
F
11

I was curious about the time performance of the non-vectorised options. For this purpose, I have used the function f defined by knguyen

f <- function(x, output) {
  wellName <- x[1]
  plateName <- x[2]
  wellID <- 1
  print(paste(wellID, x[3], x[4], sep=","))
  cat(paste(wellID, x[3], x[4], sep=","), file= output, append = T, fill = T)
}

and a dataframe like the one in his example:

n = 100; #number of rows for the data frame
d <- data.frame( name = LETTERS[ sample.int( 25, n, replace=T ) ],
                  plate = paste0( "P", 1:n ),
                  value1 = 1:n,
                  value2 = (1:n)*10 )

I included two vectorised functions (for sure quicker than the others) in order to compare the cat() approach with a write.table() one...

library("ggplot2")
library( "microbenchmark" )
library( foreach )
library( iterators )

tm <- microbenchmark(S1 =
                       apply(d, 1, f, output = 'outputfile1'),
                     S2 = 
                       for(i in 1:nrow(d)) {
                         row <- d[i,]
                         # do stuff with row
                         f(row, 'outputfile2')
                       },
                     S3 = 
                       foreach(d1=iter(d, by='row'), .combine=rbind) %dopar% f(d1,"outputfile3"),
                     S4= {
                       print( paste(wellID=rep(1,n), d[,3], d[,4], sep=",") )
                       cat( paste(wellID=rep(1,n), d[,3], d[,4], sep=","), file= 'outputfile4', sep='\n',append=T, fill = F)                           
                     },
                     S5 = {
                       print( (paste(wellID=rep(1,n), d[,3], d[,4], sep=",")) )
                       write.table(data.frame(rep(1,n), d[,3], d[,4]), file='outputfile5', row.names=F, col.names=F, sep=",", append=T )
                     },
                     times=100L)
autoplot(tm)

The resulting image shows that apply gives the best performance for a non-vectorised version, whereas write.table() seems to outperform cat(). ForEachRunningTime

Fernald answered 14/7, 2015 at 13:12 Comment(0)
T
7

You can use the by_row function from the package purrrlyr for this:

myfn <- function(row) {
  #row is a tibble with one row, and the same 
  #number of columns as the original df
  #If you'd rather it be a list, you can use as.list(row)
}

purrrlyr::by_row(df, myfn)

By default, the returned value from myfn is put into a new list column in the df called .out.

If this is the only output you desire, you could write purrrlyr::by_row(df, myfn)$.out

Tremor answered 3/6, 2017 at 19:10 Comment(0)
V
2

Well, since you asked for R equivalent to other languages, I tried to do this. Seems to work though I haven't really looked at which technique is more efficient in R.

> myDf <- head(iris)
> myDf
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nRowsDf <- nrow(myDf)
> for(i in 1:nRowsDf){
+ print(myDf[i,4])
+ }
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.2
[1] 0.4

For the categorical columns though, it would fetch you a Data Frame which you could typecast using as.character() if needed.

Vampirism answered 13/2, 2015 at 15:4 Comment(0)
M
0

you can do something for a list object,

data("mtcars")
rownames(mtcars)
data <- list(mtcars ,mtcars, mtcars, mtcars);data

out1 <- NULL 
for(i in seq_along(data)) { 
  out1[[i]] <- data[[i]][rownames(data[[i]]) != "Volvo 142E", ] } 
out1

Or a data frame,

data("mtcars")
df <- mtcars
out1 <- NULL 
for(i in 1:nrow(df)) {
  row <- rownames(df[i,])
  # do stuff with row
  out1 <- df[rownames(df) != "Volvo 142E",]
  
}
out1 
Microhenry answered 5/8, 2020 at 9:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.