Partial animal string matching in R

I have a dataframe,

d<-data.frame(name=c("brown cat", "blue cat", "big lion", "tall tiger",
                     "black panther", "short cat", "red bird",
                     "short bird stuffed", "big eagle", "bad sparrow",
                     "dog fish", "head dog", "brown yorkie",
                     "lab short bulldog"), label=1:14)

I'd like to search the name column: if any of the words "cat", "lion", "tiger", or "panther" appears, I want to assign the character string feline to the corresponding row of a new column, species.

If any of the words "bird", "eagle", or "sparrow" appears, I want to assign the character string avian to the corresponding row of species.

If any of the words "dog", "yorkie", or "bulldog" appears, I want to assign the character string canine to the corresponding row of species.

Ideally, I'd store this in a list or something similar that I can keep at the beginning of the script, because as new variants of the species show up in the name category, it would be nice to have easy access to update what qualifies as a feline, avian, and canine.

This question is almost answered here (How to create new column in dataframe based on partial string matching other column in R), but it doesn't address the multiple name twist that is present in this problem.

Zuber answered 8/4, 2014 at 22:18 Comment(0)

There may be a more elegant solution than this, but you could use grep with | to specify alternative matches.

d[grep("cat|lion|tiger|panther", d$name), "species"] <- "feline"
d[grep("bird|eagle|sparrow", d$name), "species"] <- "avian"
d[grep("dog|yorkie", d$name), "species"] <- "canine"

I've assumed you meant "avian", and left out "bulldog" since it contains "dog".

You might want to add ignore.case = TRUE to the grep.

output:

#                 name label species
#1           brown cat     1  feline
#2            blue cat     2  feline
#3            big lion     3  feline
#4          tall tiger     4  feline
#5       black panther     5  feline
#6           short cat     6  feline
#7            red bird     7   avian
#8  short bird stuffed     8   avian
#9           big eagle     9   avian
#10        bad sparrow    10   avian
#11           dog fish    11  canine
#12           head dog    12  canine
#13       brown yorkie    13  canine
#14  lab short bulldog    14  canine
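
As a small illustration of the ignore.case suggestion, here is a sketch on a made-up vector (the names in x are assumptions for illustration, not taken from the question's data). It also shows that a bare pattern such as "cat" matches inside longer words like "caterpillar"; wrapping the alternatives in \\b word boundaries is one possible way to avoid that.

```r
# Hypothetical names, not from the question's data
x <- c("Brown Cat", "big LION", "green caterpillar", "red bird")

# Case-sensitive: "Cat" and "LION" are missed, but "caterpillar" matches
case_sensitive <- grep("cat|lion", x)

# Case-insensitive: elements 1, 2 and 3 match
case_insensitive <- grep("cat|lion", x, ignore.case = TRUE)

# Word boundaries: only the whole words "cat" or "lion" match
whole_words <- grep("\\b(cat|lion)\\b", x, ignore.case = TRUE)
```

Note that with word boundaries "dog" no longer matches "bulldog", so in that case "bulldog" would have to be listed explicitly.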
Bonanno answered 8/4, 2014 at 22:36 Comment(0)

An elegant-ish way of doing this (I say elegant-ish because, while it's the most elegant way I know of, it's not great) is something like:

#Define the regexes at the beginning of the code
regexes <- list(c("(cat|lion|tiger|panther)","feline"),
                c("(bird|eagle|sparrow)","avian"),
                c("(dog|yorkie|bulldog)","canine"))

....


#Create a vector, the same length as the df
output_vector <- character(nrow(d))

#For each regex..
for(i in seq_along(regexes)){

    #Grep through d$name, and when you find matches, insert the relevant 'tag' into
    #The output vector
    output_vector[grepl(x = d$name, pattern = regexes[[i]][1])] <- regexes[[i]][2]

} 

#Insert that now-filled output vector into the dataframe
d$species <- output_vector

The advantages of this method are several-fold:

  1. You only have to modify the data frame once in the entire process, which increases the speed of the loop (data frames do not have modification-in-place; to modify a data frame 3 times, you're essentially relabelling and recreating it 3 times).
  2. By specifying the length of the vector in advance, since we know what it's going to be, you increase speed even more by ensuring that the output vector never needs more memory allotted after it is created.
  3. Because it's a loop, rather than repeated, manual calls, the addition of more rows and categories to the 'regexes' object will not require further modification of the code. It'll run just as it does now.

The only disadvantage (and this applies, I think, to most solutions you're likely to get) is that if a name matches multiple patterns, the last pattern in the list that it matches determines its 'species' tag.
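
To make the last-match-wins behaviour concrete, here is a minimal sketch. The names keywords, patterns, last_wins and first_wins, and the test strings "cat bird" and "green frog", are assumptions introduced for illustration. It keeps the keywords in a named list, as the question asked, and uses NA rather than "" for names that match nothing.

```r
# Keyword lists kept at the top of the script; list names are the species tags
keywords <- list(
  feline = c("cat", "lion", "tiger", "panther"),
  avian  = c("bird", "eagle", "sparrow"),
  canine = c("dog", "yorkie", "bulldog")
)

# Collapse each keyword vector into one alternation, e.g. "cat|lion|tiger|panther"
patterns <- vapply(keywords, paste, character(1), collapse = "|")

# "cat bird" is a hypothetical name that matches two patterns
x <- c("brown cat", "red bird", "cat bird", "green frog")

last_wins  <- rep(NA_character_, length(x))
first_wins <- rep(NA_character_, length(x))
for (tag in names(patterns)) {
  hit <- grepl(patterns[[tag]], x)
  last_wins[hit] <- tag                       # later tags overwrite earlier ones
  first_wins[hit & is.na(first_wins)] <- tag  # only fill rows that are still empty
}
```

Here last_wins tags "cat bird" as avian, while first_wins tags it as feline; which behaviour is preferable depends on the data.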

Zucchetto answered 8/4, 2014 at 22:47 Comment(1)
Good point about whether there could be multiple matches. @Brocolli-Rob: maybe having one TRUE/FALSE column for each species would be a better method if this situation is likely in your dataset. - Bonanno
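
The one-column-per-species idea from the comment could be sketched as follows. This is only an illustration under assumptions: the small data frame and the name "cat bird" are made up, and the pattern vector is a hypothetical stand-in for whatever keyword store the script uses.

```r
# Small hypothetical data frame; "cat bird" matches two patterns
d <- data.frame(name = c("brown cat", "red bird", "cat bird"), label = 1:3)

# Named patterns: each name becomes a logical column
patterns <- c(feline = "cat|lion|tiger|panther",
              avian  = "bird|eagle|sparrow",
              canine = "dog|yorkie")

# One TRUE/FALSE column per species, so multiple matches are all recorded
for (tag in names(patterns)) {
  d[[tag]] <- grepl(patterns[[tag]], d$name, ignore.case = TRUE)
}
d
```

With this layout a row such as "cat bird" is TRUE in both the feline and avian columns instead of being forced into a single tag.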

Another way is to create a lookup table and combine index-based matching using grep and match:

d<-data.frame(name=c("brown cat", "blue cat", "big lion", "tall tiger",
                     "black panther", "short cat", "red bird",
                     "short bird stuffed", "big eagle", "bad sparrow",
                     "dog fish", "head dog", "brown yorkie",
                     "lab short bulldog"), label=1:14)

avian <- c("bird", "eagle", "sparrow")
canine <- c("dog", "yorkie", "bulldog")
feline <-  c("cat", "lion", "tiger", "panther")

lu <- stack(tibble::lst(avian, canine, feline))
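# simplify = FALSE keeps the result a named list even if every keyword has the same number of hits
lu2 <- stack(sapply(lu$values, grep, x = d$name, ignore.case = TRUE, simplify = FALSE))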
lu2$ind <- as.character(lu$ind[match(as.character(lu2$ind), lu$values)])

d$species <- d$name
d$species[lu2$values] <- as.character(lu2$ind)

d
#>                  name label species
#> 1           brown cat     1  feline
#> 2            blue cat     2  feline
#> 3            big lion     3  feline
#> 4          tall tiger     4  feline
#> 5       black panther     5  feline
#> 6           short cat     6  feline
#> 7            red bird     7   avian
#> 8  short bird stuffed     8   avian
#> 9           big eagle     9   avian
#> 10        bad sparrow    10   avian
#> 11           dog fish    11  canine
#> 12           head dog    12  canine
#> 13       brown yorkie    13  canine
#> 14  lab short bulldog    14  canine

Created on 2021-11-13 by the reprex package (v2.0.1)
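
To see what the lookup table in this approach looks like, here is a minimal sketch that builds it with base R only. It assumes that a list with explicitly written names is an acceptable substitute for tibble::lst(), which names the elements automatically.

```r
avian  <- c("bird", "eagle", "sparrow")
canine <- c("dog", "yorkie", "bulldog")
feline <- c("cat", "lion", "tiger", "panther")

# Base-R stand-in for tibble::lst(avian, canine, feline): names written out by hand
lu <- stack(list(avian = avian, canine = canine, feline = feline))

# 'values' holds the keywords, 'ind' is a factor giving the species of each keyword
str(lu)

# Look up the species for a single keyword
tiger_species <- as.character(lu$ind[match("tiger", lu$values)])
```

Adding a new keyword then only requires editing one of the three vectors at the top.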

Acreinch answered 13/11, 2021 at 16:7 Comment(0)
