R - Is there a way to compare individual items between column strings?
Asked Answered
S

3

1

I have a data set and am trying to compare whether any value in a string appears in any of a set of columns in the same row.

A subset of the data looks like:

df <- data.frame(ID=c(1,2,3,4,5), 
                 A=c("A1","A3","A5","A7","A8"),
                 B=c("B3","B6","B8","B11","B20"),
                 C=c("C2","C1","C2",NA,"C3"),
                 D=c("D7","D11","D1","D5",NA),
                 X=c("A1,B7,B11,C2","A2,A5","A1,B5,B8,D3","","A9,B11,B20,C3"))

I want to see if within a row any element in X (separated by the commas) appears across columns A-D, and output the value which match.

In this example ideally the output would be another column "Z" with values:

Z=c("A1,C2",NA,"B8",NA,"B20,C3")

My first idea as to how to solve this would be to create two vectors for each row, one with the individual items of X and one with the individual items of the other columns and then compare the values in the vectors, but got stuck trying to convert each cell to a vector rather than the column.

I wondered if str_detect might be useful but so far haven't figured out how to compare the individual items not the entire string, and it brings up an error message for missing values (of which there are many)

This is my most recent attempt, but its not comparing the items individually.

mutate(df,unite(df[2:5],Y,sep=",",na.rm=TRUE)) %>%
  select(X,Y) %>%
  mutate(Z=str_detect(Y,X))

I might be going about this completely wrong so any ideas or pointers gratefully received.

Stauder answered 5/1 at 12:28 Comment(0)
C
2

You can build this around intersect() to get items that are present in both input vectors; when you switch to row-wise grouping, you'd get first of the input vectors with c(across(A:D)) and 2nd one with str_split(X, ",", simplify = TRUE), then you can use str_c() or paste0() to get a single string per row:

library(dplyr, warn.conflicts = FALSE)
library(stringr)
df <- data.frame(ID=c(1,2,3,4,5), 
                 A=c("A1","A3","A5","A7","A8"),
                 B=c("B3","B6","B8","B11","B20"),
                 C=c("C2","C1","C2",NA,"C3"),
                 D=c("D7","D11","D1","D5",NA),
                 X=c("A1,B7,B11,C2","A2,A5","A1,B5,B8,D3","","A9,B11,B20,C3"))

df |>
  rowwise() |>
  mutate(Z = intersect(
    c(across(A:D)), 
    str_split(X, ",", simplify = TRUE)
    ) |> str_c(collapse = ",") |> na_if("")) |>
  ungroup()
#> # A tibble: 5 × 7
#>      ID A     B     C     D     X               Z     
#>   <dbl> <chr> <chr> <chr> <chr> <chr>           <chr> 
#> 1     1 A1    B3    C2    D7    "A1,B7,B11,C2"  A1,C2 
#> 2     2 A3    B6    C1    D11   "A2,A5"         <NA>  
#> 3     3 A5    B8    C2    D1    "A1,B5,B8,D3"   B8    
#> 4     4 A7    B11   <NA>  D5    ""              <NA>  
#> 5     5 A8    B20   C3    <NA>  "A9,B11,B20,C3" B20,C3

Created on 2024-01-05 with reprex v2.0.2

Classis answered 5/1 at 12:48 Comment(0)
B
3

I am always interested in the quickest ways to iterate over rows so I ran a few benchmarks of the other two answers and two more approaches. One is exactly the same as the answer by margusl except it uses purrr::pmap() rather than dplyr::rowwise(), which can be a faster alternative.

I also added an approach which converts to matrix, transposes and iterates over columns:

x <- t(as.matrix(df[, c("A", "B", "C", "D")]))
df$Z  <- sapply(seq_len(nrow(df)), \(i){
    overlaps <- intersect(
        x[, i],
        strsplit(df$X[[i]], ",")[[1]]
    )
    if (length(overlaps) == 0) NA_character_ else overlaps
})

This approach turns out to be much faster than the others. I also tried iterating over columns using data.table::transpose() and in this case it is almost as fast as t(), but not faster. So the message is, if you can, iterate over columns instead of rows.

Full benchmark results

I generated random strings in the same format as your data frame for between 10 and 10^5 rows and compared the approaches. purrr::pmap() is comparable in speed to the base R approach. rowwise() is particularly slow with a lot more memory usage (500mb with 1e5 rows, compared with 800kb for Friede's base R approach). With 1e5 rows, the base R approach takes 13 seconds, purrr::pmap() 15 seconds, dplyr::rowwise() 2m40 seconds and iterating over columns takes 4 seconds. The full results are:

# A tibble: 25 × 10
   expression        n        min     median  `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
   <bch:expr>    <dbl>      <dbl>      <dbl>      <dbl> <bch:byt>    <dbl> <int> <dbl>      <dbl>
 1 base             10      1.05       1.12   860.             0B    8.69     99     1      115. 
 2 rowwise          10     15.3       16.8     57.6       53.25KB    0        29     0      503. 
 3 purrr            10      1.29       1.44   599.        10.57KB    0       100     0      167. 
 4 data.table       10      2.50       2.77   345.       187.04KB    3.48     99     1      287. 
 5 base_columns     10      0.319      0.347 2655.           736B    0       100     0       37.7
 6 base            100     11.4       15.6     63.6          848B    2.19     29     1      456. 
 7 rowwise         100    198.       203.       4.92      510.5KB    2.46      2     1      406. 
 8 purrr           100     11.5       12.9     70.8      105.04KB    0        36     0      509. 
 9 data.table      100      5.48       6.14   151.       207.79KB    5.60     27     1      179. 
10 base_columns    100      2.34       2.50   392.         9.09KB    0       100     0      255. 
11 base           1000    110.       117.       8.38       7.86KB    1.68      5     1      596. 
12 rowwise        1000   1437.      1447.       0.683      4.93MB    1.14      3     5     4393. 
13 purrr          1000    124.       152.       6.80       1.02MB    1.70      4     1      588. 
14 data.table     1000     34.5       37.0     26.6      381.34KB    0        14     0      527. 
15 base_columns   1000     21.8       25.3     37.8       86.41KB    1.99     19     1      503. 
16 base          10000   1322.      1479.       0.633      83.5KB    1.69      3     8     4738. 
17 rowwise       10000  15603.     16366.       0.0597    49.18MB    1.19      3    60    50284. 
18 purrr         10000   1282.      1306.       0.770     10.22MB    1.28      3     5     3894. 
19 data.table    10000    392.       416.       2.37       2.12MB    0.790     3     1     1266. 
20 base_columns  10000    276.       281.       3.55     909.53KB    1.18      3     1      844. 
21 base         100000  11919.     11959.       0.0829   781.78KB    0.608     3    22    36184. 
22 rowwise      100000 145657.    158549.       0.00644  491.72MB    0.403     3   188   465947. 
23 purrr        100000  14887.     16189.       0.0615   102.23MB    0.390     3    19    48778. 
24 data.table   100000   4521.      4526.       0.217      19.3MB    0.434     3     6    13832. 
25 base_columns 100000   3061.      4401.       0.233      8.63MB    0.311     3     4    12870. 

enter image description here

Benchmark code

library(dplyr)
library(stringr)
library(data.table)

results <- bench::press(
    n = 10^(1:5),
    {
        rand_strings <- CJ(letter = LETTERS[1:4], number = 0:9)[
            ,
            strings := paste0(letter, number)
        ][]
        df <- data.frame(
            ID = seq(n),
            A = sample(rand_strings[letter == "A", strings], n, replace = TRUE),
            B = sample(rand_strings[letter == "B", strings], n, replace = TRUE),
            C = sample(rand_strings[letter == "C", strings], n, replace = TRUE),
            D = sample(rand_strings[letter == "D", strings], n, replace = TRUE),
            X = sapply(
                seq(n),
                \(i)
                paste(sample(rand_strings$strings, sample(0:10, 1), replace = TRUE), collapse = ",")
            )
        )

        dt <- as.data.table(df)

        bench::mark(
            max_iterations = 100,
            min_iterations = 3,
            time_unit = "ms",
            check = FALSE,
            base = { # Friede approach
                lapply(seq_len(nrow(df)), \(i) {
                    x <- df[i, LETTERS[1L:4L]]
                    x <- x[x %in% unlist(strsplit(df$X[i], ","))]
                    if (ncol(x)) paste0(x, collapse = ",") else NA
                })
            },
            rowwise = { # margusl approach
                df |>
                    rowwise() |>
                    mutate(Z = intersect(
                        c(across(A:D)),
                        str_split(X, ",", simplify = TRUE)
                    ) |> str_c(collapse = ",") |> na_if("")) |>
                    ungroup()
            },
            purrr = {
                df |>
                    purrr::pmap(\(ID, A, B, C, D, X) {
                        intersect(
                            c(A, B, C, D),
                            str_split(X, ",", simplify = TRUE)
                        ) |>
                            str_c(collapse = ",") |>
                            na_if("")
                    }) |>
                    unlist()
            },
            data.table = {
                x <- transpose(dt[, .(A, B, C, D)])

                dt[, Z := sapply(seq_len(nrow(df)), \(i)
                intersect(
                    x[[i]],
                    strsplit(df$X[[i]], ",")[[1]]
                ) |> paste(collapse = ","))][Z == "", Z := NA_character_]
            }
        )
    }
)

# Plotting the results
library(ggplot2)
autoplot(results) +
    facet_wrap(
        vars(n),
        scales = "free_x",
        labeller = labeller(n = \(x) sprintf("Num rows: %s", x))
    ) +
    theme_bw() +
    theme(legend.position = "bottom")
Bothnia answered 5/1 at 14:22 Comment(5)
You definitely put a lot of afford in. To appreciate it, I would like to add another base R option. Hence, I would like to run your benchmark on my machine, but it's not reproducible though: rand_strings is missing. Can you please add it in?Heifetz
@Heifetz oops thanks! It should work now.Bothnia
Thank you! Fyi, I additionally needed library(ggbeeswarm). Error: ggbeeswarm must be installed to use type = "beeswarm" option.Heifetz
@Heifetz I didn't realise that. autoplot() is always a bit mysterious. Your comment inspired me to try the data.table approach in base R and it's slightly faster - so the speed up appears to be about iterating over columns rather than any data.table magic.Bothnia
Well, that is what I wanted to add. Well done (+1). Side note: I prefer if (length(overlaps)) overlaps else NA_character_Heifetz
C
2

You can build this around intersect() to get items that are present in both input vectors; when you switch to row-wise grouping, you'd get first of the input vectors with c(across(A:D)) and 2nd one with str_split(X, ",", simplify = TRUE), then you can use str_c() or paste0() to get a single string per row:

library(dplyr, warn.conflicts = FALSE)
library(stringr)
df <- data.frame(ID=c(1,2,3,4,5), 
                 A=c("A1","A3","A5","A7","A8"),
                 B=c("B3","B6","B8","B11","B20"),
                 C=c("C2","C1","C2",NA,"C3"),
                 D=c("D7","D11","D1","D5",NA),
                 X=c("A1,B7,B11,C2","A2,A5","A1,B5,B8,D3","","A9,B11,B20,C3"))

df |>
  rowwise() |>
  mutate(Z = intersect(
    c(across(A:D)), 
    str_split(X, ",", simplify = TRUE)
    ) |> str_c(collapse = ",") |> na_if("")) |>
  ungroup()
#> # A tibble: 5 × 7
#>      ID A     B     C     D     X               Z     
#>   <dbl> <chr> <chr> <chr> <chr> <chr>           <chr> 
#> 1     1 A1    B3    C2    D7    "A1,B7,B11,C2"  A1,C2 
#> 2     2 A3    B6    C1    D11   "A2,A5"         <NA>  
#> 3     3 A5    B8    C2    D1    "A1,B5,B8,D3"   B8    
#> 4     4 A7    B11   <NA>  D5    ""              <NA>  
#> 5     5 A8    B20   C3    <NA>  "A9,B11,B20,C3" B20,C3

Created on 2024-01-05 with reprex v2.0.2

Classis answered 5/1 at 12:48 Comment(0)
H
1

In base R, one way to do it would be

# Z = c("A1,C2",NA,"B8",NA,"B20,C3")
df$Z = lapply(seq_len(nrow(df)), \(i) {
  x = df[i, LETTERS[1L:4L]]
  x = x[x %in% unlist(strsplit(df$X[i], ","))]
  if(ncol(x)) paste0(x, collapse = ",") else NA
})
df
#>   ID  A   B    C    D             X      Z
#> 1  1 A1  B3   C2   D7  A1,B7,B11,C2  A1,C2
#> 2  2 A3  B6   C1  D11         A2,A5     NA
#> 3  3 A5  B8   C2   D1   A1,B5,B8,D3     B8
#> 4  4 A7 B11 <NA>   D5                   NA
#> 5  5 A8 B20   C3 <NA> A9,B11,B20,C3 B20,C3

There are certainly more elegant ways.


Data

df <- data.frame(ID=c(1,2,3,4,5), 
                 A=c("A1","A3","A5","A7","A8"),
                 B=c("B3","B6","B8","B11","B20"),
                 C=c("C2","C1","C2",NA,"C3"),
                 D=c("D7","D11","D1","D5",NA),
                 X=c("A1,B7,B11,C2","A2,A5","A1,B5,B8,D3","","A9,B11,B20,C3"))
Heifetz answered 5/1 at 13:11 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.