grep using a character vector with multiple patterns

11

181

I am trying to use grep to test whether the strings in one vector are present in another vector, and to output the values that are present (the matching patterns).

I have a data frame like this:

FirstName Letter   
Alex      A1
Alex      A6
Alex      A7
Bob       A1
Chris     A9
Chris     A6

I have a vector of string patterns to be found in the "Letter" column, for example: c("A1", "A9", "A6").

I would like to check whether any of the strings in the pattern vector are present in the "Letter" column. If they are, I would like the unique matching values as output.

The problem is, I don't know how to use grep with multiple patterns. I tried:

matches <- unique (
    grep("A1| A9 | A6", myfile$Letter, value=TRUE, fixed=TRUE)
)

But it gives me 0 matches, which is not correct. Any suggestions?

Envision answered 29/9, 2011 at 12:48 Comment(2)
You can't use fixed=TRUE because your pattern is a true regular expression. - Navigate
Using match or %in% or even == is the only correct way to compare exact matches. Regex is very dangerous for such a task and can lead to unexpected results. - Chervil
339

In addition to @Marek's comment about not including fixed=TRUE, you also need to remove the spaces from your regular expression. It should be "A1|A9|A6".

You also mention that there are lots of patterns. Assuming that they are in a vector

toMatch <- c("A1", "A9", "A6")

Then you can create your regular expression directly using paste and collapse = "|".

matches <- unique (grep(paste(toMatch,collapse="|"), 
                        myfile$Letter, value=TRUE))
Longlongan answered 5/10, 2011 at 16:35 Comment(6)
Any way to do this when your list of strings includes regex operators as punctuation? - Chiefly
@user1987097 It should work the same way, with or without any other regex operators. Did you have a specific example this didn't work for? - Longlongan
@user1987097 Use two backslashes before a dot or bracket. The first backslash escapes the second within the R string, and the second is the one that disables the regex operator. - Ramses
Using regex for exact matches seems dangerous to me and can have unexpected results. Why not just toMatch %in% myfile$Letter? - Chervil
@user4050 No specific reason. The version in the question had it and I probably just carried it through without thinking about whether it was necessary. - Longlongan
This method also works for matching multiple patterns not in a data frame, but within a character vector. - Mammillary
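
When the strings to match contain regex metacharacters and should be taken literally, one option is to skip the combined pattern and test each string separately with fixed = TRUE, then combine the results. This is only a minimal sketch and not part of the answer above; the vector x and its contents are made up for illustration.

```r
# Hypothetical data: some entries contain regex metacharacters
x <- c("a.b", "axb", "c+d", "e")
toMatch <- c("a.b", "c+d")

# Combined regex: "." matches any character and "+" is a quantifier,
# so this does not behave like a literal search
regex_hits <- grep(paste(toMatch, collapse = "|"), x, value = TRUE)

# Literal search: one grepl() call per string with fixed = TRUE,
# then OR the logical vectors together
is_hit <- Reduce(`|`, lapply(toMatch, grepl, x = x, fixed = TRUE))
literal_hits <- x[is_hit]

print(regex_hits)    # "a.b" "axb"
print(literal_hits)  # "a.b" "c+d"
```

The per-string loop is slower than a single regex on long pattern lists, but it avoids having to escape anything by hand.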
47

Good answers, however don't forget about filter() from dplyr:

library(dplyr)

patterns <- c("A1", "A9", "A6")
>your_df
  FirstName Letter
1      Alex     A1
2      Alex     A6
3      Alex     A7
4       Bob     A1
5     Chris     A9
6     Chris     A6

result <- filter(your_df, grepl(paste(patterns, collapse="|"), Letter))

>result
  FirstName Letter
1      Alex     A1
2      Alex     A6
3       Bob     A1
4     Chris     A9
5     Chris     A6
Sirloin answered 12/5, 2017 at 8:42 Comment(3)
I think that grepl works with one pattern at a time (it needs a vector of length 1). We have 3 patterns (a vector of length 3), so we combine them into one using a separator that grepl understands: |. Try your luck with others :) - Sirloin
Oh, I get it now. So it's a compact way to output something like A1 | A2, so if one wanted all conditions then the collapse would be with an & sign. Cool, thanks. - Selby
Hi, using )|( to separate patterns might make this more robust: paste0("(", paste(patterns, collapse=")|("), ")"). Unfortunately it also becomes slightly less elegant. This results in the pattern (A1)|(A9)|(A6). - Missive
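
On the & idea raised in the comments: as far as I know, regular expressions have no & operator, so collapsing with "&" just searches for a literal ampersand. Below is a hedged sketch of one way to require every pattern instead, using made-up strings in x.

```r
# Hypothetical strings that may contain several codes each
x <- c("A1 A9", "A1", "A9 A6 A1", "A6")
patterns <- c("A1", "A9")

# "|" means OR inside a regex: any one pattern is enough
any_hit <- grepl(paste(patterns, collapse = "|"), x)

# "&" has no special meaning: this looks for the literal text "A1&A9"
amp_hit <- grepl(paste(patterns, collapse = "&"), x)

# To require all patterns, run grepl() once per pattern and AND the results
all_hit <- Reduce(`&`, lapply(patterns, grepl, x = x))

print(x[any_hit])  # "A1 A9" "A1" "A9 A6 A1"
print(x[amp_hit])  # character(0)
print(x[all_hit])  # "A1 A9" "A9 A6 A1"
```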
41

This should work:

grep(pattern = 'A1|A9|A6', x = myfile$Letter)

Or even more simply:

library(data.table)
myfile$Letter %like% 'A1|A9|A6'
Somatic answered 1/11, 2018 at 15:15 Comment(2)
%like% isn't in base R, so you should mention which package(s) are needed to use it. - Compassion
For others looking at this answer, %like% is part of the data.table package. Also similar in data.table are like(...), %ilike%, and %flike%. - Scarron
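
For readers who want to stay in base R: as far as I understand, %like% returns a logical vector much like grepl() does, whereas grep() returns positions (or the matching strings with value = TRUE). A small sketch of the three return shapes, rebuilding the question's data as myfile:

```r
# Sample data taken from the question
myfile <- data.frame(
  FirstName = c("Alex", "Alex", "Alex", "Bob", "Chris", "Chris"),
  Letter    = c("A1", "A6", "A7", "A1", "A9", "A6"),
  stringsAsFactors = FALSE
)

idx  <- grep("A1|A9|A6", myfile$Letter)                # integer positions
vals <- grep("A1|A9|A6", myfile$Letter, value = TRUE)  # matching strings
lgl  <- grepl("A1|A9|A6", myfile$Letter)               # TRUE/FALSE per element

print(idx)           # 1 2 4 5 6
print(unique(vals))  # "A1" "A6" "A9"
print(lgl)           # TRUE TRUE FALSE TRUE TRUE TRUE
```

The logical form is the one to use for row subsetting, e.g. myfile[lgl, ].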
10

Based on Brian Diggs' post, here are two helpful functions for filtering lists:

# Returns all items in theList that do not match any pattern in toMatch
# toMatch can be a single pattern or a vector of patterns
exclude <- function (theList, toMatch){
  return(setdiff(theList,include(theList,toMatch)))
}

# Returns all items in theList that DO match at least one pattern in toMatch
# toMatch can be a single pattern or a vector of patterns
include <- function (theList, toMatch){
  matches <- unique (grep(paste(toMatch,collapse="|"), 
                          theList, value=TRUE))
  return(matches)
}
Coppice answered 22/8, 2015 at 20:15 Comment(0)
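
A possible shortcut for the exclude case is grep()'s own invert argument, which returns the non-matching elements directly. This is a sketch with a made-up input vector, not a drop-in replacement: unlike setdiff(), it keeps duplicates unless unique() is applied.

```r
theList <- c("A1", "A6", "A7", "A1", "A9", "B2")   # hypothetical input
toMatch <- c("A1", "A9", "A6")
pattern <- paste(toMatch, collapse = "|")

kept    <- grep(pattern, theList, value = TRUE)                 # matches
dropped <- grep(pattern, theList, value = TRUE, invert = TRUE)  # non-matches

print(unique(kept))     # "A1" "A6" "A9"
print(unique(dropped))  # "A7" "B2"
```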
6

Have you tried the match() or charmatch() functions?

Example use:

match(c("A1", "A9", "A6"), myfile$Letter)
Gil answered 25/7, 2014 at 13:16 Comment(1)
One thing to note with match is that it does not use patterns; it expects an exact match. - Scarron
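
If exact matches are what is wanted, %in% and match() avoid regular expressions altogether. A minimal sketch follows; letters_col stands in for myfile$Letter, with an extra "A10" added purely to show that exact matching ignores it.

```r
letters_col <- c("A1", "A6", "A7", "A1", "A9", "A6", "A10")  # "A10" added for illustration
toMatch <- c("A1", "A9", "A6")

# Which patterns occur at all (exact comparison, no regex)
present <- toMatch[toMatch %in% letters_col]

# Unique column values that equal one of the patterns
matched_values <- unique(letters_col[letters_col %in% toMatch])

# match() returns the position of the first exact match for each pattern
first_pos <- match(toMatch, letters_col)

print(present)         # "A1" "A9" "A6"
print(matched_values)  # "A1" "A6" "A9"
print(first_pos)       # 1 5 2
```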
5

To add to Brian Diggs' answer:

Another way, using grepl, returns a data frame containing all the matching rows.

toMatch <- c("A1", "A9", "A6")

matches <- myfile[grepl(paste(toMatch, collapse="|"), myfile$Letter), ]

matches

Letter Firstname
1     A1      Alex 
2     A6      Alex 
4     A1       Bob 
5     A9     Chris 
6     A6     Chris

Maybe a bit cleaner... maybe?

Swarth answered 23/1, 2017 at 0:14 Comment(0)
4

Not sure whether this answer has already appeared...

For the particular pattern in the question, you can just do it with a single grep() call,

grep("A[169]", myfile$Letter)
Bunion answered 19/4, 2017 at 16:0 Comment(0)
2

Using sapply:

patterns <- c("A1", "A9", "A6")
df <- data.frame(name=c("A","Ale","Al","lex","x"),Letters=c("A1","A2","A9","A1","A9"))



   name Letters
1    A      A1
2  Ale      A2
3   Al      A9
4  lex      A1
5    x      A9


df[unlist(sapply(patterns, grep, df$Letters, USE.NAMES = FALSE)), ]
  name Letters
1    A      A1
4  lex      A1
3   Al      A9
5    x      A9
Curator answered 9/2, 2018 at 7:56 Comment(0)
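
Note that the sapply approach returns rows grouped by pattern, and a row matching two patterns would appear twice. As a hedged sketch, keeping the per-pattern results as a named list and then deduplicating the indices restores the original row order:

```r
patterns <- c("A1", "A9", "A6")
Letters  <- c("A1", "A2", "A9", "A1", "A9")   # same values as df$Letters above

# simplify = FALSE always returns a list; names default to the patterns
per_pattern <- sapply(patterns, grep, x = Letters, simplify = FALSE)
print(per_pattern)

# Row indices in original order, without duplicates
rows <- sort(unique(unlist(per_pattern, use.names = FALSE)))
print(rows)  # 1 3 4 5
```

The named list also shows at a glance which patterns matched nothing (here A6).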
2

Take away the spaces, and drop fixed=TRUE so that | is treated as a regex operator. So do:

matches <- unique(grep("A1|A9|A6", myfile$Letter, value=TRUE))
Mesic answered 4/5, 2018 at 22:26 Comment(0)
0

Another option is a pattern like '\\b(A1|A9|A6)\\b'. Here \\b is the regular expression word boundary, which comes in handy when a cell holds several values: for example, if Alex had the letters "A7,A1", the row can still be extracted. Here is a reproducible example for both options:

df <- read.table(text="FirstName Letter   
Alex      A1
Alex      A6
Alex      A7
Bob       A1
Chris     A9
Chris     A6", header = TRUE)
df
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex     A7
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6
with(df, df[grep('\\b(A1|A9|A6)\\b', Letter),])
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6

df2 <- read.table(text="FirstName Letter   
Alex      A1
Alex      A6
Alex      A7,A1
Bob       A1
Chris     A9
Chris     A6", header = TRUE)
df2
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex  A7,A1
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6
with(df2, df2[grep('A1|A9|A6', Letter),])
#>   FirstName Letter
#> 1      Alex     A1
#> 2      Alex     A6
#> 3      Alex  A7,A1
#> 4       Bob     A1
#> 5     Chris     A9
#> 6     Chris     A6

Created on 2022-07-16 by the reprex package (v2.0.1)

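Please note: in an ordinary R string literal the backslash must be doubled (\\b). From R 4.0.0 on, a raw string such as r"(\b(A1|A9|A6)\b)" lets you write a single \b instead.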

Birdhouse answered 16/7, 2022 at 18:1 Comment(0)
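
To see what the word boundary buys you, here is a small sketch comparing a plain alternation, a bounded one, and a fully anchored one. The values in x are made up to provoke partial matches, and perl = TRUE is my own choice to select the PCRE engine for the bounded pattern.

```r
# Hypothetical values chosen to show partial matches
x <- c("A1", "A10", "A7,A1", "XA6", "A9")

plain    <- grep("A1|A9|A6", x, value = TRUE)
bounded  <- grep("\\b(A1|A9|A6)\\b", x, value = TRUE, perl = TRUE)
anchored <- grep("^(A1|A9|A6)$", x, value = TRUE)

print(plain)     # "A1" "A10" "A7,A1" "XA6" "A9"
print(bounded)   # "A1" "A7,A1" "A9"
print(anchored)  # "A1" "A9"
```

The plain pattern also hits "A10" and "XA6", the bounded one accepts codes inside a comma-separated cell, and the anchored one only accepts cells that are exactly one of the codes.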
-1

I suggest writing a little script and doing multiple searches with Grep. I've never found a way to search for multiple patterns, and believe me, I've looked!

Like so, your shell file, with an embedded string:

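 #!/bin/bash
 # grep reads from a file or stdin, so the string is fed in with a here-string (<<<);
 # -o prints only the part of the input that matched
 grep -o 'A6' <<< "Alex A1 Alex A6 Alex A7 Bob A1 Chris A9 Chris A6"
 grep -o 'A7' <<< "Alex A1 Alex A6 Alex A7 Bob A1 Chris A9 Chris A6"
 grep -o 'A8' <<< "Alex A1 Alex A6 Alex A7 Bob A1 Chris A9 Chris A6"

Then make it executable (chmod +x myshell.sh) and run it by typing ./myshell.sh.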

If you want to be able to pass in the string on the command line, do it like this, with a shell argument--this is bash notation btw:

 #!/bin/bash
 # First command-line argument; no "$" and no spaces around "=" in an assignment
 stringtomatch="${1}"
 grep -o 'A6' <<< "${stringtomatch}"
 grep -o 'A7' <<< "${stringtomatch}"
 grep -o 'A8' <<< "${stringtomatch}"

And so forth.

If there are a lot of patterns to match, you can put it in a for loop.

Helman answered 29/9, 2011 at 13:0 Comment(2)
Thank you ChrisBean. There are actually a lot of patterns, so maybe it would be better to use a file. I am new to Bash, but maybe something like this should work… #!/bin/bash for i in 'pattern.txt' do echo $i j='grep -c "${i}" myfile.txt' echo $j if [$j -eq o ] then echo $i >> matches.txt fi done - Envision
It doesn't work… the error message is '[grep: command not found'… I have grep in the /bin folder, and /bin is on my $PATH… Not sure what is happening… Can you please help? - Envision
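
Since the original question is about R, the "patterns in a file" idea can also stay in R. This is a minimal sketch under assumptions: the file is created with tempfile() as a stand-in for a real pattern file with one pattern per line, and Letter repeats the values from the question.

```r
# Stand-in for a real pattern file: one pattern per line (hypothetical content)
pattern_file <- tempfile(fileext = ".txt")
writeLines(c("A1", "A9", "A6"), pattern_file)

Letter <- c("A1", "A6", "A7", "A1", "A9", "A6")  # values from the question

patterns <- readLines(pattern_file)
patterns <- patterns[nzchar(patterns)]   # drop empty lines, which would match everything

matches <- unique(grep(paste(patterns, collapse = "|"), Letter, value = TRUE))
print(matches)  # "A1" "A6" "A9"

unlink(pattern_file)  # clean up the temporary file
```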
