How do I find equal columns in R?

Asked 19/10, 2012 at 8:38 Answered 19/10, 2012 at 12:42

Given the following:

a <- c(1,2,3)
b <- c(1,2,3)
c <- c(4,5,6)
A <- cbind(a,b,c)

I want to find which columns in A are equal to, for example, my vector a.

My first attempt would be:

> which(a==A)
[1] 1 2 3 4 5 6

Which did not do that. (To be honest, I don't even understand what that did.) My second attempt was:

a==A
        a    b     c
[1,] TRUE TRUE FALSE
[2,] TRUE TRUE FALSE
[3,] TRUE TRUE FALSE

That is definitely a step in the right direction, but the result is expanded into a matrix. What I would prefer is something like just one of its rows. How do I compare a vector to the columns of a matrix, and how do I find the columns that are equal to the vector?

Hypnoanalysis answered 19/10, 2012 at 8:38 Comment(3)

"Could not find function 'nbind'". Always cut n paste your code. – Ranie 19/10, 2012 at 8:49

Fixed. (And adding some more text so I can press "Add Comment") – Hypnoanalysis 19/10, 2012 at 9:0

Protip: never test with a square matrix (too easy to confuse rows with columns). Not saying you have, but you will.... – Ranie 19/10, 2012 at 11:21

If you add an extra column:

> A
     a b c  
[1,] 1 1 4 4
[2,] 2 2 5 2
[3,] 3 3 6 1

Then you can see that this function is correct:

> hasCol=function(A,a){colSums(a==A)==nrow(A)}
> A[,hasCol(A,a)]
     a b
[1,] 1 1
[2,] 2 2
[3,] 3 3

But the earlier version, from the accepted answer, is not:

> oopsCol=function(A,a){colSums(a==A)>0}
> A[,oopsCol(A,a)]
     a b  
[1,] 1 1 4
[2,] 2 2 2
[3,] 3 3 1

It returns the 4,2,1 column because the 2 matches the 2 in 1,2,3.
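
To see why the two versions differ, it can help to look at the per-column match counts directly. A minimal sketch, assuming the same four-column A as above (the name counts is only for illustration):

```r
a <- c(1, 2, 3)
# Same matrix as above: columns a, b, c plus an unnamed column 4, 2, 1
A <- cbind(a = a, b = c(1, 2, 3), c = c(4, 5, 6), c(4, 2, 1))

# Number of positions in each column that match a
counts <- colSums(a == A)
print(unname(counts))              # 3 3 0 1

# "> 0" asks whether ANY position matches, so the last column passes
print(unname(counts > 0))          # TRUE TRUE FALSE TRUE

# "== nrow(A)" asks whether ALL positions match
print(unname(counts == nrow(A)))   # TRUE TRUE FALSE FALSE
```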

Ranie answered 19/10, 2012 at 11:24 Comment(3)

Why not use all? I don't understand the point of converting a logical to a numeric in this situation. – Heterotypic 19/10, 2012 at 12:43

I wanted to show and correct the bug in the accepted answer. Better answers are available! – Ranie 19/10, 2012 at 12:48

@Heterotypic Because there is no colAlls in R. I'll update my answer with a benchmark. – Rebane 19/10, 2012 at 14:0

Use identical. That is R's "scalar" comparison operator; it returns a single logical value, not a vector.

apply(A, 2, identical, a)
#    a     b     c 
# TRUE  TRUE FALSE

If A is a data frame in your real case, you're better off using sapply or vapply because apply coerces its input to a matrix.

d <- c("a", "b", "c")
B <- data.frame(a, b, c, d)

apply(B, 2, identical, a) # incorrect!
#     a     b     c     d 
# FALSE FALSE FALSE FALSE 

sapply(B, identical, a) # correct
#    a     b     c     d 
# TRUE  TRUE FALSE FALSE

But note that data.frame coerces character inputs to factors unless you ask otherwise (this was the default before R 4.0.0; since then, stringsAsFactors defaults to FALSE):

sapply(B, identical, d) # incorrect
#     a     b     c     d 
# FALSE FALSE FALSE FALSE 

C <- data.frame(a, b, c, d, stringsAsFactors = FALSE)
sapply(C, identical, d) # correct
#     a     b     c     d 
# FALSE FALSE FALSE  TRUE
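
The sapply calls above could also be written with vapply, which additionally checks that every result is a single logical value. A minimal sketch, assuming the same a, b, c and d as above (matches_a and matches_d are illustrative names):

```r
a <- c(1, 2, 3)
b <- c(1, 2, 3)
c <- c(4, 5, 6)
d <- c("a", "b", "c")
C <- data.frame(a, b, c, d, stringsAsFactors = FALSE)

# FUN.VALUE = logical(1) makes vapply fail loudly if identical()
# ever returned anything other than one TRUE/FALSE per column
matches_a <- vapply(C, identical, FUN.VALUE = logical(1), a)
matches_d <- vapply(C, identical, FUN.VALUE = logical(1), d)

print(matches_a)   # TRUE for columns a and b only
print(matches_d)   # TRUE for column d only
```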

identical is also considerably faster than using all + ==:

library(microbenchmark)

a <- 1:1000
b <- c(1:999, 1001)

microbenchmark(
  all(a == b), 
  identical(a, b))
# Unit: microseconds
#              expr   min    lq median     uq    max
# 1     all(a == b) 8.053 8.149 8.2195 8.3295 17.355
# 2 identical(a, b) 1.082 1.182 1.2675 1.3435  3.635

Heterotypic answered 19/10, 2012 at 12:42 Comment(4)

@Ranie identical will be even faster in situations where == would coerce or recycle. – Heterotypic 19/10, 2012 at 13:42

Note that this will fail (and silently) if the vector you test against ('a') has any other attributes, such as names. c(1,2,3) is not identical to c(n=1,m=2,x=3) even though the values are the same. I'm wary of using identical. – Ranie 19/10, 2012 at 15:50

@Ranie Well personally, I'd consider those vectors different, but ymmv. As in any problem, you need to think about the definition of equality you need. Using == has its own risks: all(c(1,2) == c(1,2,1,2)). – Heterotypic 19/10, 2012 at 16:57

When do we need to use all.equal()? – Profane 8/7, 2016 at 9:39
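
The definitions of equality discussed in these comments can be compared side by side. This is only a sketch with made-up values (x, y and near are illustrative), and the all.equal part shows just one common use, tolerant comparison of floating-point numbers:

```r
x <- c(1, 2, 3)
y <- c(n = 1, m = 2, x = 3)   # same values, but with a names attribute

# identical() also compares attributes, so the names make these differ
strict <- identical(x, y)                        # FALSE
# == only looks at the values
by_value <- all(x == y)                          # TRUE
# Dropping the names first makes identical() agree
stripped <- identical(x, unname(y))              # TRUE

# == silently recycles the shorter vector, so lengths are not checked
recycled <- all(c(1, 2) == c(1, 2, 1, 2))        # TRUE

# Floating-point arithmetic: exact comparison fails, all.equal()
# compares within a small tolerance instead
near <- 0.1 + 0.2
exact <- near == 0.3                             # FALSE
tolerant <- isTRUE(all.equal(near, 0.3))         # TRUE
```

all.equal returns a character description rather than FALSE when the values differ, which is why it is wrapped in isTRUE here.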

Surely there's a better solution but the following works:

> a <- c(1,2,3)
> b <- c(1,2,3)
> c <- c(4,5,6)
> A <- cbind(a,b,c)
> sapply(1:ncol(A), function(i) all(a==A[,i]))
[1]  TRUE  TRUE FALSE

And to get the indices:

> which(sapply(1:ncol(A), function(i) all(a==A[,i])))
[1] 1 2

Mcmasters answered 19/10, 2012 at 8:50 Comment(1)

I did something similar: which(apply(A==a,2,all)) – Leitmotif 19/10, 2012 at 15:43

colSums(a==A)==nrow(A)

Recycling by == effectively turns a into a matrix with the same dimensions as A in which every column equals a. colSums then sums each column; since TRUE behaves like 1 and FALSE like 0, a column equal to a has a sum equal to the number of rows. We use this observation to reduce the answer to a logical vector.
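
The steps described above can be sketched one at a time. This assumes the three-column A from the question; the intermediate names (eq, cell_positions, col_matches, col_indices) and the length check are additions for illustration, not part of the one-liner:

```r
a <- c(1, 2, 3)
A <- cbind(a = a, b = c(1, 2, 3), c = c(4, 5, 6))

# The comparison only lines up with columns when a has one value per row
stopifnot(length(a) == nrow(A))

# a is recycled down each column, giving a logical matrix shaped like A
eq <- a == A

# which() on a matrix returns column-major positions of the TRUE cells,
# which is why which(a == A) printed 1 to 6 in the question
cell_positions <- which(eq)

# Reduce per column instead: a column matches when every row is TRUE
col_matches <- colSums(eq) == nrow(A)
col_indices <- unname(which(col_matches))
print(col_indices)   # 1 2
```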

EDIT:

library(microbenchmark)

A <- rep(1:14, 1000); dim(A) <- c(7, 2000)
a <- 1:7

microbenchmark(
 apply(A,2,function(b) identical(a,b)),
 apply(A,2,function(b) all(a==b)),
 colSums(A==a)==nrow(A))

# Unit: microseconds
#                                     expr      min        lq    median
# 1     apply(A, 2, function(b) all(a == b)) 9446.210 9825.6465 10278.335
# 2 apply(A, 2, function(b) identical(a, b)) 9324.203 9915.7935 10314.833
# 3               colSums(A == a) == nrow(A)  120.252  121.5885   140.185
#         uq       max
# 1 10648.7820 30588.765
# 2 10868.5970 13905.095
# 3   141.7035   162.858
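
Timings are only comparable if the three expressions return the same answer, so it may be worth checking that first. A small sketch on the same benchmark data (the by_* names are illustrative):

```r
# Same data as the benchmark: columns alternate between 1:7 and 8:14
A <- matrix(rep(1:14, 1000), nrow = 7)
a <- 1:7

by_identical <- apply(A, 2, function(b) identical(a, b))
by_all       <- apply(A, 2, function(b) all(a == b))
by_colsums   <- colSums(A == a) == nrow(A)

# All three approaches should flag exactly the same columns
stopifnot(identical(by_identical, by_all),
          identical(by_all, by_colsums))

print(sum(by_colsums))   # 1000 of the 2000 columns equal a
```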

Rebane answered 19/10, 2012 at 9:33 Comment(1)

Wrong! The >0 will be TRUE if any of the elements of 'a' match the element of the column of A. You need to check if all the values are TRUE, hence: colSums(a==A)==nrow(A) – Ranie 19/10, 2012 at 11:19
