Create new column in dataframe based on partial string matching other column

Asked 2/11, 2013 at 22:2 Answered 13/11, 2021 at 15:2

I have a dataframe with 2 columns GL and GLDESC and want to add a 3rd column called KIND based on some data that is inside of column GLDESC.

The dataframe is as follows:

      GL                             GLDESC
1 515100         Payroll-Indir Salary Labor
2 515900 Payroll-Indir Compensated Absences
3 532300                           Bulk Gas
4 539991                     Area Charge In
5 551000        Repairs & Maint-Spare Parts
6 551100                 Supplies-Operating
7 551300                        Consumables

For each row of the data table:

If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll
If GLDESC contains the word Gas anywhere in the string then I want KIND to be Materials
In all other cases I want KIND to be Other

I looked for similar examples on stackoverflow but could not find any, also looked in R for dummies on switch, grep, apply and regular expressions to try and match only part of the GLDESC column and then fill the KIND column with the kind of account but was unable to make it work.

Cremator answered 2/11, 2013 at 22:2 Comment(0)

Since you have only two conditions, you can use a nested ifelse:

#random data; it wasn't easy to copy-paste yours  
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10), 
  c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12", 
     "asdfg", "GAS--2", "fghfgh", "qweee"), sample(letters, 10), sep = " "))

DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = T), "Materials", 
         ifelse(grepl("payroll", DF$GLDESC, ignore.case = T), "Payroll", "Other"))

DF
#   GL         GLDESC      KIND
#1   8        e gas l Materials
#2   1  c payroll12 y   Payroll
#3  10      m GaSer v Materials
#4   6       t asdf n     Other
#5   2      w qweaa t     Other
#6   4 r PayROll-12 q   Payroll
#7   9      n asdfg a     Other
#8   5     d GAS--2 w Materials
#9   7     s fghfgh e     Other
#10  3      g qweee k     Other

EDIT 10/3/2016 (..after receiving more attention than expected)

A possible solution to deal with more patterns could be to iterate over all patterns and, whenever there is match, progressively reduce the amount of comparisons:

ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
    stopifnot(length(patterns) == length(replacements))

    ans = rep_len(as.character(fill), length(x))    
    empty = seq_along(x)

    for(i in seq_along(patterns)) {
        greps = grepl(patterns[[i]], x[empty], ...)
        ans[empty[greps]] = replacements[[i]]  
        empty = empty[!greps]
    }

    return(ans)
}

ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll"   "Materials" "Other"     "Other"     "Payroll"   "Other"     "Materials" "Other"     "Other"

ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"), 
   c("pat1a|pat1b", "pat2", "pat3"), 
   c("1", "2", "3"), fill = "empty")
#[1] "1"     "1"     "3"     "empty"

ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"), 
   c("pat2", "pat1a|pat1b", "pat3"), 
   c("2", "1", "3"), fill = "empty")
#[1] "2"     "1"     "3"     "empty"

Tereus answered 2/11, 2013 at 22:52 Comment(2)

Hi! Thanks for sharing this regex, its very useful. If we were to define a second variable other than kind i.e I want another string labor to be replaced with xyz in a new variable new then do we have to run the regex individually for each such new variable. Also you did not define kind in the latest iterative function. Thanks! – Madelynmademoiselle 20/3, 2017 at 18:18

@ManasiShah : You mean something like DF$new = ff(DF$GLDESC, "labor", "xyz", "Other", ignore.case = TRUE)? The function (ff) at the end, is a more general one and its result can be used to assign back to the "data.frame" as a second step. Or am I missing your point? – Tereus 21/3, 2017 at 9:39

I personally like matching by index. You can loop grep over your new labels, in order to get the indices of your partial matches, then use this with a lookup table to simply reassign the values.

If you wanna create new labels, use a named vector.

DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
  c(
    "gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
    "asdfg", "GAS--2", "fghfgh", "qweee"
  ), sample(letters, 10),
  sep = " "
))


lu <- stack(sapply(c(Material = "gas", Payroll = "payroll"), grep, x = DF$GLDESC, ignore.case = TRUE))

DF$KIND <- DF$GLDESC
DF$KIND[lu$values] <- as.character(lu$ind)
DF$KIND[-lu$values] <- "Other"

DF
#>    GL         GLDESC     KIND
#> 1   6        x gas f Material
#> 2   3  t payroll12 q  Payroll
#> 3   5      a GaSer h Material
#> 4   4       s asdf x    Other
#> 5   1      m qweaa y    Other
#> 6  10 y PayROll-12 r  Payroll
#> 7   7      g asdfg a    Other
#> 8   2     k GAS--2 i Material
#> 9   9     e fghfgh j    Other
#> 10  8      l qweee p    Other

^{Created on 2021-11-13 by the reprex package (v2.0.1)}

Hardened answered 13/11, 2021 at 15:2 Comment(0)

Recommended topics

Hot tags