How do I find equal columns in R?
H

4

6

Given the following:

a <- c(1,2,3)
b <- c(1,2,3)
c <- c(4,5,6)
A <- cbind(a,b,c)

I want to find which columns of A are equal to a given vector, for example my vector a.

My first attempt would be:

> which(a==A)
[1] 1 2 3 4 5 6

That did not do what I wanted. (To be honest, I don't even understand what it did.) My second attempt was:

a==A
        a    b     c
[1,] TRUE TRUE FALSE
[2,] TRUE TRUE FALSE
[3,] TRUE TRUE FALSE

This is definitely a step in the right direction, but the result is expanded into a full matrix. What I would have preferred is a single logical value per column, like just one of the rows. How do I compare a vector to the columns of a matrix, and how do I find the columns that are equal to that vector?
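A minimal sketch of what the first attempt did, assuming the same a and A as above. R stores a matrix column by column, == recycles a down that flattened order, and which() on a logical matrix reports positions in the same order. The values 1 to 6 are therefore just the six cells of columns a and b. Passing arr.ind = TRUE reports row/column pairs instead.

```r
a <- c(1, 2, 3)
b <- c(1, 2, 3)
c <- c(4, 5, 6)
A <- cbind(a, b, c)

# which() on a logical matrix returns linear (column-major) indices
linear <- which(a == A)
linear

# arr.ind = TRUE returns one (row, col) pair per TRUE cell instead
hits <- which(a == A, arr.ind = TRUE)
hits

# Columns containing at least one matching cell; note this is weaker
# than "the whole column equals a"
cols_with_hit <- unique(hits[, "col"])
cols_with_hit
```

Because this only shows where individual cells match, it still needs a per-column reduction (for example with all) to answer the actual question.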

Hypnoanalysis answered 19/10, 2012 at 8:38 Comment(3)
"Could not find function 'nbind'". Always cut n paste your code.Ranie
Fixed. (And adding some more text so I can press "Add Comment")Hypnoanalysis
Protip: never test with a square matrix (too easy to confuse rows with columns). Not saying you have, but you will....Ranie
R
8

If you add an extra column:

> A
     a b c  
[1,] 1 1 4 4
[2,] 2 2 5 2
[3,] 3 3 6 1

Then you can see that this function is correct:

> hasCol=function(A,a){colSums(a==A)==nrow(A)}
> A[,hasCol(A,a)]
     a b
[1,] 1 1
[2,] 2 2
[3,] 3 3

But the earlier version, from the accepted answer, isn't:

> oopsCol=function(A,a){colSums(a==A)>0}
> A[,oopsCol(A,a)]
     a b  
[1,] 1 1 4
[2,] 2 2 2
[3,] 3 3 1

It also returns the 4,2,1 column, because the 2 in its second row matches the 2 in the second position of 1,2,3.
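The four-column A is only printed above, not constructed. The sketch below assumes the extra column is c(4, 2, 1) and binds it under the illustrative name extra (so it prints with a header rather than a blank one). It shows the per-column match counts that both functions are built on.

```r
a <- c(1, 2, 3)
b <- c(1, 2, 3)
c <- c(4, 5, 6)
extra <- c(4, 2, 1)          # assumed: the unnamed fourth column shown above
A <- cbind(a, b, c, extra)

# Number of rows in which each column agrees with a
matches <- colSums(a == A)
matches

all_rows <- matches == nrow(A)   # every row agrees: columns a and b only
any_row  <- matches > 0          # any row agrees: also picks up 'extra'
all_rows
any_row
```

The extra column scores 1 out of 3, which passes the > 0 test but not the == nrow(A) test.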

Ranie answered 19/10, 2012 at 11:24 Comment(3)
Why not use all? I don't understand the point of converting a logical to a numeric in this situation.Heterotypic
I wanted to show and correct the bug in the accepted answer. Better answers are available!Ranie
@Heterotypic Because there is no colAlls in R. I'll update my answer with a benchmark.Rebane
H
9

Use identical. That is R's "scalar" comparison operator; it returns a single logical value, not a vector.

apply(A, 2, identical, a)
#    a     b     c 
# TRUE  TRUE FALSE 

If A is a data frame in your real case, you're better off using sapply or vapply, because apply coerces its input to a matrix.

d <- c("a", "b", "c")
B <- data.frame(a, b, c, d)

apply(B, 2, identical, a) # incorrect!
#     a     b     c     d 
# FALSE FALSE FALSE FALSE 

sapply(B, identical, a) # correct
#    a     b     c     d 
# TRUE  TRUE FALSE FALSE
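To see why apply goes wrong here, it may help to look at the coercion directly. This is a small sketch using the same a, b, c, d as above; the vapply call is one possible stricter variant of the sapply call, not something the answer itself ran.

```r
a <- c(1, 2, 3)
b <- c(1, 2, 3)
c <- c(4, 5, 6)
d <- c("a", "b", "c")
B <- data.frame(a, b, c, d)

# apply() works on as.matrix(B); one non-numeric column makes every cell character
M <- as.matrix(B)
typeof(M)      # type of the matrix that apply() actually iterates over
typeof(B$a)    # type of the original column

# vapply() goes column by column and insists on one logical per column
res <- vapply(B, identical, logical(1), a)
res
```

Since each column of M is a character vector, none of them can be identical to the numeric vector a.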

But note that data.frame coerces character inputs to factors unless you ask otherwise (this was the default before R 4.0.0; from 4.0.0 on, stringsAsFactors defaults to FALSE):

sapply(B, identical, d) # incorrect
#     a     b     c     d 
# FALSE FALSE FALSE FALSE 

C <- data.frame(a, b, c, d, stringsAsFactors = FALSE)
sapply(C, identical, d) # correct
#     a     b     c     d 
# FALSE FALSE FALSE  TRUE 

identical is also considerably faster than using all + == when comparing a single pair of vectors:

library(microbenchmark)

a <- 1:1000
b <- c(1:999, 1001)

microbenchmark(
  all(a == b), 
  identical(a, b))
# Unit: microseconds
#              expr   min    lq median     uq    max
# 1     all(a == b) 8.053 8.149 8.2195 8.3295 17.355
# 2 identical(a, b) 1.082 1.182 1.2675 1.3435  3.635
Heterotypic answered 19/10, 2012 at 12:42 Comment(4)
@Ranie identical will be even faster in situations where == would coerce or recycle.Heterotypic
Note that this will fail (and silently) if the vector you test against ('a') has any other attributes, such as names. c(1,2,3) is not identical to c(n=1,m=2,x=3) even though the values are the same. I'm wary of using identical.Ranie
@Ranie Well personally, I'd consider those vectors different, but ymmv. As in any problem, you need to think about the definition of equality you need. Using == has its own risks: all(c(1,2) == c(1,2,1,2)).Heterotypic
When do we need to use all.equal() ?Profane
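As a rough illustration of the trade-offs raised in these comments (the vectors below are made up for the example): identical is strict about attributes and storage type, while all.equal compares numbers within a small tolerance, which tends to matter for values that come out of floating-point arithmetic.

```r
x <- c(1, 2, 3)
y <- c(n = 1, m = 2, x = 3)    # same values, but carries names

id_named   <- identical(x, y)            # attributes differ
id_unnamed <- identical(x, unname(y))    # names dropped first
id_integer <- identical(x, 1:3)          # double vs integer storage

# all.equal() returns TRUE or a character description of the difference,
# so wrap it in isTRUE() when a plain logical is needed
id_float <- identical(0.1 + 0.2, 0.3)
ae_float <- isTRUE(all.equal(0.1 + 0.2, 0.3))
ae_integer <- isTRUE(all.equal(x, 1:3))

c(id_named, id_unnamed, id_integer, id_float, ae_float, ae_integer)
```

Which of these counts as "equal" depends on the problem; for whole-number data built with c() and cbind, as in the question, the strict and tolerant comparisons agree.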
M
4

Surely there's a better solution but the following works:

> a <- c(1,2,3)
> b <- c(1,2,3)
> c <- c(4,5,6)
> A <- cbind(a,b,c)
> sapply(seq_len(ncol(A)), function(i) all(a==A[,i]))
[1]  TRUE  TRUE FALSE

And to get the indices:

> which(sapply(seq_len(ncol(A)), function(i) all(a==A[,i])))
[1] 1 2
Mcmasters answered 19/10, 2012 at 8:50 Comment(1)
I did something similar: which(apply(A==a,2,all))Leitmotif
R
-1
colSums(a==A)==nrow(A)

Because == recycles its shorter argument, a is effectively treated as a matrix with the same dimensions as A whose columns are all equal to a. colSums then sums each column of the resulting logical matrix; since TRUE behaves like 1 and FALSE like 0, a column equal to a has a sum equal to the number of rows. Comparing the sums with nrow(A) reduces the answer to a logical vector with one element per column.
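The recycling argument can be checked directly. The sketch below wraps the expression in a helper; the name cols_equal_to and the length check are illustrative additions, not part of the answer. The check is there because a vector whose length differs from nrow(A) would otherwise be recycled across column boundaries and compared against the wrong cells.

```r
# Hypothetical helper: name and input checks are assumptions for illustration
cols_equal_to <- function(A, a) {
  # Guard against silent recycling of a vector with the wrong length
  stopifnot(is.matrix(A), length(a) == nrow(A))
  colSums(A == a) == nrow(A)
}

a <- c(1, 2, 3)
A <- cbind(a, b = c(1, 2, 3), c = c(4, 5, 6))

# What == effectively compares A against: a repeated once per column
recycled <- matrix(a, nrow = nrow(A), ncol = ncol(A))
same_as_recycled <- all((A == a) == (A == recycled))
same_as_recycled

result <- cols_equal_to(A, a)
result
which(result)
```

A column containing NA would give an NA sum here, so data with missing values needs an explicit decision about how NA should compare.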

EDIT:

library(microbenchmark)

A <- matrix(rep(1:14, 1000), nrow = 7)  # 7 x 2000 matrix
a <- 1:7

microbenchmark(
 apply(A,2,function(b) identical(a,b)),
 apply(A,2,function(b) all(a==b)),
 colSums(A==a)==nrow(A))

# Unit: microseconds
#                                     expr      min        lq    median
# 1     apply(A, 2, function(b) all(a == b)) 9446.210 9825.6465 10278.335
# 2 apply(A, 2, function(b) identical(a, b)) 9324.203 9915.7935 10314.833
# 3               colSums(A == a) == nrow(A)  120.252  121.5885   140.185
#         uq       max
# 1 10648.7820 30588.765
# 2 10868.5970 13905.095
# 3   141.7035   162.858
Rebane answered 19/10, 2012 at 9:33 Comment(1)
Wrong! The >0 will be TRUE if any of the elements of 'a' match the element of the column of A. You need to check if all the values are TRUE, hence: colSums(a==A)==nrow(A)Ranie
