How to apply a function to all combinations of rows in a data frame?

I am having trouble with the following problem concerning the data frame 'annotations' below (simplified by limiting the number of columns).

require(irr)
# data
annotations <- read.table(text = "Obj1    Obj2    Obj3
Rater1     a       b       c
Rater2     a       b       b
Rater3     a       b       c", header = TRUE, stringsAsFactors = FALSE)

I would like to apply the function agree from the irr package to all combinations (not permutations) of rows, resulting in the following.

Agreement rater 1-2: 67%
Agreement rater 1-3: 100%
Agreement rater 2-3: 67%

I need to run a function on all combinations of rows and the function would need to access a number of/all columns.

I have worked out parts of the answer to the problem; I have generated a list of combinations running combn(rownames(annotations), 2), but I don't see how to use this list without writing inefficient for loops.

I have tried apply, as in apply(annotations, 1, agree), but I can only get this to work on one row, not the combinations mentioned before.

Does anyone have an idea how to proceed?

UPDATE: The following solution, based on your suggestions, works. (I have used kappa2 from the irr package instead of agree, but the solution to the main question remains the same.)

require(irr) #require the irr library for agreement calculations
annotations <- read.table(text = "Obj1    Obj2    Obj3
Rater1     a       b       c
Rater2     a       b       b
Rater3     a       b       c
Rater4     c       a       a", header = TRUE, stringsAsFactors = FALSE)

annotations <- t(annotations) #transpose annotations (rows become columns and vice versa)
# For every pair of columns (combination of 2 raters), apply kappa2 to the
# two rating columns and keep its value; combn collects the values in a vector.
kappa_list <- combn(colnames(annotations), 2, FUN = function(x)
  kappa2(matrix(c(annotations[, x[1]], annotations[, x[2]]), ncol = 2))$value)
kappa_list # display the list of values
Youmans answered 9/11, 2016 at 21:8 Comment(2)
For inline code, use backticks, not single quotes. – Dependency
Thank you. I have edited the post by including the function agree from the package irr. The data frame is only simplified in the number of columns (the real one has close to 100); apart from that, the data provided are representative of the real data. – Youmans

You are close: you just need to apply your function to the result of combn instead. I do not know which function you are referring to, but this should work the same way once you plug in your own.

First, save the combinations as a list, because that makes it easier to add names (which I build by pasting the two entries of each pair together):

toCheck <- combn(rownames(annotations), 2, simplify = FALSE)

names(toCheck) <-
  sapply(toCheck, paste, collapse = " - ")

Then, use sapply to work through your combinations. Here, I am using mean to do the comparison, but substitute whatever function you need. If your function returns more than a single value per pair, use lapply instead and then process the resulting list to print it as desired.

sapply(toCheck, function(x){
  mean(annotations[x[1], ] == annotations[x[2], ])
})

Which returns:

Rater1 - Rater2 Rater1 - Rater3 Rater2 - Rater3 
      0.6666667       1.0000000       0.6666667 
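
If the function returns more than one value per pair, the lapply route mentioned above might look like the sketch below. The helper compare_pair and the column names n_agree and prop_agree are made up for illustration; the data are the three-rater example from the question.

```r
# Example data from the question (three raters, three objects)
annotations <- read.table(text = "Obj1 Obj2 Obj3
Rater1 a b c
Rater2 a b b
Rater3 a b c", header = TRUE, stringsAsFactors = FALSE)

# All unordered pairs of row names, kept as a named list
toCheck <- combn(rownames(annotations), 2, simplify = FALSE)
names(toCheck) <- sapply(toCheck, paste, collapse = " - ")

# Hypothetical helper that returns two values per pair
compare_pair <- function(x) {
  same <- unlist(annotations[x[1], ]) == unlist(annotations[x[2], ])
  data.frame(n_agree = sum(same), prop_agree = mean(same))
}

# lapply yields one small data frame per pair; rbind stacks them
pair_results <- do.call(rbind, lapply(toCheck, compare_pair))
pair_results
```

Stacking the per-pair results into one data frame is only one option; keeping the list as it is works just as well when the pieces differ in shape.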
Oar answered 9/11, 2016 at 21:17 Comment(2)
You can do this within combn: combn(rownames(annotations), 2, FUN=function(x) mean(annotations[x[1], ] == annotations[x[2], ])) – Educational
This suggestion got me on the right track. I have updated my question with the solution that solves the problem. – Youmans

This answer applies the function f(x) := 2x+5 to every entry of a data column attached to a set of combinations. You can substitute your own function for f(x) := 2x+5.

Step 1: Build a matrix of the specific combinations you need. (The following was written for my own case.)

causalitycombinations <- function (nvars, ncausers, ndependents)
{
    independents <- combn(nvars, ncausers)
    swingnumber <- dim(combn(nvars - ncausers, ndependents))[[2]]
    numberofallcombinations <- dim(combn(nvars, ncausers))[[2]] * swingnumber
    dependents <- matrix(, nrow = dim(combn(nvars, ncausers))[[2]] * swingnumber, ncol = ndependents)
    for (i in as.integer(1:dim(combn(nvars, ncausers))[[2]])) {
        dependents[(swingnumber * (i - 1) + 1):(swingnumber * i), ] <- t(combn(setdiff(seq(1:nvars), independents[, i]), ndependents))
    }
    swingedindependents <- matrix(, nrow = dim(combn(nvars, ncausers))[[2]] * swingnumber, ncol = ncausers)
    for (i in as.integer(1:dim(combn(nvars, ncausers))[[2]])) {
        for (j in as.integer(1:swingnumber)) {
            swingedindependents[(i - 1) * swingnumber + j, ] <- independents[, i]
        }
    }
    independentsdependents <- cbind(swingedindependents, dependents)
    others <- matrix(, nrow = dim(combn(nvars, ncausers))[[2]] * swingnumber, ncol = nvars - ncausers - ndependents)
    for (i in as.integer(1:((dim(combn(nvars, ncausers))[[2]]) * swingnumber))) {
        others[i, ] <- setdiff(seq(1:nvars), independentsdependents[i, ])
    }
    causalitiestemplate <- cbind(independentsdependents, others)
    causalitiestemplate
}

    causalitycombinations(3,1,1)
#     [,1] [,2] [,3]
#[1,]    1    2    3
#[2,]    1    3    2
#[3,]    2    1    3
#[4,]    2    3    1
#[5,]    3    1    2
#[6,]    3    2    1

Step 2: Append the data to the combinations.
(Multiple columns can be appended; I added only one for simplicity.)

set.seed(1)
mydataframer <- cbind(causalitycombinations(3,1,1), rnorm(6))
mydataframer
 #     [,1] [,2] [,3]       [,4]
 #[1,]    1    2    3 -0.6264538
 #[2,]    1    3    2  0.1836433
 #[3,]    2    1    3 -0.8356286
 #[4,]    2    3    1  1.5952808
 #[5,]    3    1    2  0.3295078
 #[6,]    3    2    1 -0.8204684

Step 3: Apply the function via lapply, iterating over the rows of the combined matrix. (lapply returns a list; only the values are shown below.)

lapply(1: dim(mydataframer)[[1]], function(x) {2*mydataframer[x,4] + 5})

# 3.747092
# 5.367287
# 3.328743
# 8.190562
# 5.659016
# 3.359063

That is it.
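
Because arithmetic in R is already vectorized, the same values can probably be obtained without lapply. The sketch below rebuilds the six-row matrix by hand (so that it runs without causalitycombinations) and uses a made-up function f standing in for 2x+5.

```r
# The six rows printed above, rebuilt by hand so the example is self-contained
combos <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
                c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))
set.seed(1)
mydataframer <- cbind(combos, rnorm(6))

# Hypothetical stand-in for the user's own function
f <- function(x) 2 * x + 5

# Vectorized call on the whole data column at once
vectorized <- f(mydataframer[, 4])

# Row-by-row version, flattened from a list to a numeric vector
rowwise <- unlist(lapply(seq_len(nrow(mydataframer)),
                         function(i) f(mydataframer[i, 4])))

# Check that both routes agree
all.equal(vectorized, rowwise)
```

The row-by-row form is still the one to reach for when the function needs several columns of a row and is not itself vectorized.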

By the way, the help file ?irr::agree states that the n x m ratings matrix/data frame holds "n subjects, m raters". Hence, the questioner may want to lay out the data like this instead:

annotations <- read.table(text = "Rater1    Rater2    Rater3
Subject1     a       b       c
Subject2     a       b       b
Subject3     a       b       c", header = TRUE, stringsAsFactors = FALSE)

annotations
#         Rater1 Rater2 Rater3
# Subject1      a      b      c
# Subject2      a      b      b
# Subject3      a      b      c
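
For data already stored with raters in rows, as in the question, transposing gives the subjects-by-raters layout. A minimal sketch, using plain equality instead of irr::agree so that it runs without the package; the names by_subject, rater_pairs and pct_agree are made up for illustration.

```r
# Question's layout: one row per rater, one column per object
annotations <- read.table(text = "Obj1 Obj2 Obj3
Rater1 a b c
Rater2 a b b
Rater3 a b c", header = TRUE, stringsAsFactors = FALSE)

# Transpose: rows become subjects (objects), columns become raters
by_subject <- as.data.frame(t(annotations), stringsAsFactors = FALSE)

# Percentage agreement for every pair of rater columns
rater_pairs <- combn(colnames(by_subject), 2, simplify = FALSE)
pct_agree <- sapply(rater_pairs, function(p) {
  100 * mean(by_subject[[p[1]]] == by_subject[[p[2]]])
})
names(pct_agree) <- sapply(rater_pairs, paste, collapse = "-")
round(pct_agree)
```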

Also, one thing still needs to be clarified: whether the questioner wants to loop over all possible versions of annotations. If that is the case, i.e.,

annotations
#         Rater1 Rater2 Rater3
# Subject1      a      a      a
# Subject2      a      a      a
# Subject3      a      a      a

annotations
#         Rater1 Rater2 Rater3
# Subject1      a      a      b
# Subject2      a      a      a
# Subject3      a      a      a

annotations
#         Rater1 Rater2 Rater3
# Subject1      a      a      c
# Subject2      a      a      a
# Subject3      a      a      a

annotations
#         Rater1 Rater2 Rater3
# Subject1      a      b      a
# Subject2      a      a      a
# Subject3      a      a      a

# .... after consuming all Subject1 possibilities, this time consuming Subject2 possibilities,

annotations
#         Rater1 Rater2 Rater3
# Subject1      a      a      a
# Subject2      a      a      b
# Subject3      a      a      a

and then Subject3 possibilities, and thereby collecting all the possibilities of agreements, then the problem changes completely.

The irr::agree function is designed for multiple rows (subjects). Observe the example from its help file:

data(video)
video
#   rater1 rater2 rater3 rater4
# 1       4      4      3      4
# 2       4      4      4      5
# ..............................
# 20      4      5      5      4

agree(video)     # Simple percentage agreement
# Percentage agreement (Tolerance=0)
# Subjects = 20; Raters = 4; %-agree = 35 

agree(video, 1)  # Extended percentage agreement
# Percentage agreement (Tolerance=1)
# Subjects = 20; Raters = 4; %-agree = 90 

whereas if the questioner wants to apply agree row by row (only 1 subject!), %-agree can only be 0 or 100, and it is 0 whenever the raters do not all agree:

agree(video[1,])
# Percentage agreement (Tolerance=0)
# Subjects = 1; Raters = 4; %-agree = 0

...

agree(video[20,])
# Percentage agreement (Tolerance=0)
# Subjects = 1; Raters = 4; %-agree = 0
Cu answered 9/11, 2016 at 21:55 Comment(0)

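A fast approach is to build two index vectors that hold, for every combination, the row number of its first and of its second member. Indexing the data with them gives one matrix with the rows on one side and another matrix with the rows on the other side, and a vectorized function can then be applied to the two matrices. The benchmark below compares this with several alternatives: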

es=1:3
r=sapply(es,function(e){
  nrow=10^e
  ncol=8
  m=matrix(rnorm(ncol*nrow),nrow)

  b=microbenchmark(times=100,
    vectorized={
      z=1:nrow
      i1=rep(z[-nrow],rev(z[-nrow]))
      i2=unlist(lapply(z[-1],function(x)x:nrow))
      o=m[i1,]+m[i2,]
    },
    vectorized_without_saving_sequence_in_variable={
      i1=rep(1:(nrow-1),(nrow-1):1)
      i2=unlist(lapply(2:nrow,function(x)x:nrow))
      o=m[i1,]+m[i2,]
    },
    vectorized_combn={
      c=combn(nrow,2)
      o=m[c[1,],]+m[c[2,],]
    },
    vectorized_for_loops={
      ncomb=nrow*(nrow-1)/2
      i1=integer(ncomb)
      i2=integer(ncomb)
      n=1;for(i in 1:(nrow-1))for(j in(i+1):nrow){i1[n]=i;i2[n]=j;n=n+1}
      o=m[i1,]+m[i2,]
    },
    for_loops={
      o=matrix(nrow=nrow*(nrow-1)/2,ncol=ncol)
      n=1;for(i in 1:(nrow-1))for(j in(i+1):nrow){o[n,]=m[i,]+m[j,];n=n+1}
    },
    combn_with_function_argument={o=t(combn(nrow,2,function(x)m[x[1],]+m[x[2],]))},
    combn_with_apply={o=t(apply(combn(nrow,2),2,function(x)m[x[1],]+m[x[2],]))}
  )
  a=aggregate(b$time,list(b$expr),median)
  setNames(a[,2],gsub(" ","",a[,1]))/1e6
})

r2=apply(r,2,function(x)formatC(x,max(0,2-ceiling(log10(min(x,na.rm=T)))),format="f"))
r3=apply(rbind(paste0("1e",es),r2),2,function(x)formatC(x,max(nchar(x)),format="s"))
writeLines(apply(cbind(r3,c("",rownames(r))),1,paste,collapse=" "))

Median time in ms for input with 10, 100, and 1000 rows:

  1e1   1e2  1e3 
0.022  0.58   31 vectorized
0.017  0.58   34 vectorized_without_saving_sequence_in_variable
0.052  3.41  331 vectorized_combn
0.068  7.12  718 vectorized_for_loops
0.095 10.24 1058 for_loops
0.126 11.84 1194 combn_with_function_argument
0.182 14.99 1636 combn_with_apply

The following is a faster alternative to c=combn(nrow,2);i1=c[1,];i2=c[2,] (see How to efficiently generate lower triangle indices of a symmetric matrix):

z=1:nrow
i1=rep(z[-nrow],rev(z[-nrow]))
i2=unlist(lapply(z[-1],function(x)x:nrow))

The following variant is only slightly slower but easier to understand:

i1=rep(1:(nrow-1),(nrow-1):1)
i2=unlist(lapply(2:nrow,function(x)x:nrow))
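
One way to check that these index vectors enumerate the same pairs, in the same order, as combn is to compare them directly. A small sketch with an arbitrarily chosen nrow of 6; the names ref, same_first and same_second are made up for the check.

```r
nrow <- 6  # arbitrary small size for the check

# Index vectors built without combn
i1 <- rep(1:(nrow - 1), (nrow - 1):1)
i2 <- unlist(lapply(2:nrow, function(x) x:nrow))

# Reference pairs from combn: row 1 holds the first member, row 2 the second
ref <- combn(nrow, 2)

same_first <- identical(i1, ref[1, ])
same_second <- identical(i2, ref[2, ])
c(same_first, same_second)

# The number of pairs should equal nrow * (nrow - 1) / 2
length(i1) == nrow * (nrow - 1) / 2
```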
Helvetii answered 31/7, 2022 at 8:55 Comment(0)
