Filter all rows with word next to a specified word in R

Asked 13/2, 2020 at 12:48 Answered 13/2, 2020 at 13:43

I have a column with string content

temp <- c(NA, NA, "grocery pantry all offers", NA, "grocery offers today low price", 
"grocery offers today low price", "tide soap", "tide soap bar", 
"tide detergent powders 2kg", NA, "tide", "tide detergent powders 2kg", 
"liquid detergent tide brand")

My goal is to build bigrams from the words that sit next to "tide". To do that, I need to extract the word immediately to the left or right of "tide". For example, for the vector above the output would be:

tide soap
tide soap
tide detergent
tide detergent
detergent tide
tide brand

Any help?

Timikatiming answered 13/2, 2020 at 12:48 Comment(0)

If you use the quanteda package, this is straightforward. You specify which word you want to target and decide how many words on left/right side of the target you want.

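library(quanteda)

# Recent quanteda releases expect a tokens object rather than a raw character vector
toks <- tokens(temp)

# Keep one word of context on each side of "tide"
as.data.frame(kwic(toks, pattern = "tide", window = 1))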

  docname from to       pre keyword      post pattern
1   text7    1  1              tide      soap    tide
2   text8    1  1              tide      soap    tide
3   text9    1  1              tide detergent    tide
4  text11    1  1              tide              tide
5  text12    1  1              tide detergent    tide
6  text13    3  3 detergent    tide     brand    tide
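
The kwic output keeps the neighbours in separate pre and post columns. As a rough sketch of turning that into the bigrams asked for, the data frame below is typed in by hand from the output above (so it does not need quanteda to run); kw, left, right and bigrams are illustrative names:

```r
# Hand-copied from the kwic output above (assumption: empty strings mark a missing neighbour)
kw <- data.frame(
  pre     = c("", "", "", "", "", "detergent"),
  keyword = "tide",
  post    = c("soap", "soap", "detergent", "", "detergent", "brand"),
  stringsAsFactors = FALSE
)

# Build a bigram on each side only where a neighbour exists
left  <- ifelse(kw$pre  != "", paste(kw$pre, kw$keyword),  NA_character_)
right <- ifelse(kw$post != "", paste(kw$keyword, kw$post), NA_character_)

# Interleave left/right per row, then drop the empty slots
bigrams <- c(rbind(left, right))
bigrams <- bigrams[!is.na(bigrams)]
bigrams
```

A row such as text11, where "tide" stands alone, contributes nothing because both sides are empty.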

Coomer answered 13/2, 2020 at 13:5 Comment(1)

@akrun No worries. I think there are similar cases on SO and you would come across kwic(). Quanteda is very useful if you dig text. – Coomer 14/2, 2020 at 1:50

Is this what you want?

library(stringr)

str_extract(temp, "(tide [:alnum:]*)|([:alnum:]* tide)")

The pattern extracts either "tide" followed by a space and a run of letters and digits of any length ([:alnum:]*), or (|) the reverse: a run of letters and digits followed by a space and "tide" ([:alnum:]* tide).

Btw: if you want to, afterwards you can remove the NAs with

x <- str_extract(temp, "(tide [:alnum:]*)|([:alnum:]* tide)")
x[!is.na(x)]
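
Note that str_extract returns only the first match per string, so for "liquid detergent tide brand" the pattern above yields "detergent tide" but not "tide brand". A minimal sketch of one way around this, assuming it is acceptable to run the two alternatives as separate patterns and interleave the results (left, right and both are illustrative names, and + is used instead of * so that a neighbour word is required):

```r
library(stringr)

temp <- c(NA, NA, "grocery pantry all offers", NA, "grocery offers today low price",
          "grocery offers today low price", "tide soap", "tide soap bar",
          "tide detergent powders 2kg", NA, "tide", "tide detergent powders 2kg",
          "liquid detergent tide brand")

# One extraction per side, so a string can contribute up to two bigrams
left  <- str_extract(temp, "[[:alnum:]]+ tide")   # word before "tide"
right <- str_extract(temp, "tide [[:alnum:]]+")   # word after "tide"

# rbind + c interleaves the two vectors element by element
both <- c(rbind(left, right))
both <- both[!is.na(both)]
both
```

The patterns are not anchored on word boundaries, so a word that merely ends in "tide" would also match; adding \\b around "tide" would tighten that if needed.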

Rima answered 13/2, 2020 at 12:56 Comment(1)

Thanks. Could you please edit it so that it returns both bigrams from the last string? It currently gives only "detergent tide"; can it also give "tide brand"? Also, is this possible with pairwise_count? I tried it, but found that pairwise_count does not work without a grouping column. – Timikatiming 13/2, 2020 at 13:6

You can use the tidytext package to split the text into bigrams and filter for tide.

library(tidytext)
library(dplyr)
library(tibble)
library(stringr)  # provides str_detect()

temp %>% 
  enframe(name = "id") %>%
  filter(str_detect(value, "tide")) %>%
  unnest_tokens(bigrams, value, token = "ngrams", n = 2) %>%
  filter(str_detect(bigrams, "tide"))

# A tibble: 6 x 2
     id bigrams       
  <int> <chr>         
1     7 tide soap     
2     8 tide soap     
3     9 tide detergent
4    12 tide detergent
5    13 detergent tide
6    13 tide brand

Flycatcher answered 13/2, 2020 at 13:26 Comment(0)

This is another option just using tidyverse that grabs anything before and/or after 'tide'.

library(magrittr)  # provides %>%

stringr::str_match_all(temp, "(\\w+)?\\s?tide\\s?(\\w+)?") %>%
   purrr::reduce(rbind) %>%
   as.data.frame() %>%
   dplyr::filter_all(dplyr::any_vars(!is.na(.)))

                    V1        V2        V3
1            tide soap      <NA>      soap
2            tide soap      <NA>      soap
3       tide detergent      <NA> detergent
4                 tide      <NA>      <NA>
5       tide detergent      <NA> detergent
6 detergent tide brand detergent     brand

Wattle answered 13/2, 2020 at 13:10 Comment(0)

Here is a base R solution

r <- unlist(Filter(length,
                   t(do.call(cbind,
                             lapply(c("\\w+\\stide","tide\\s\\w+"), 
                                    function(p) regmatches(temp,gregexpr(p,temp)))))))

such that

> r
[1] "tide soap"      "tide soap"      "tide detergent" "tide detergent" "detergent tide" "tide brand"
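
For readers who find the one-liner above dense, here is a more step-by-step sketch of the same idea in base R. It splits each string on whitespace instead of using regular expressions, and neighbour_bigrams is a hypothetical helper name, not a function from any package:

```r
temp <- c(NA, NA, "grocery pantry all offers", NA, "grocery offers today low price",
          "grocery offers today low price", "tide soap", "tide soap bar",
          "tide detergent powders 2kg", NA, "tide", "tide detergent powders 2kg",
          "liquid detergent tide brand")

# Hypothetical helper: every adjacent word pair in which one word equals `target`
neighbour_bigrams <- function(x, target = "tide") {
  x <- x[!is.na(x)]                       # drop missing strings
  words <- strsplit(x, "\\s+")            # one character vector of words per string
  pairs <- lapply(words, function(w) {
    if (length(w) < 2) return(character(0))   # a lone word has no neighbour
    first  <- head(w, -1)                 # words 1..n-1
    second <- tail(w, -1)                 # words 2..n
    paste(first, second)[first == target | second == target]
  })
  unlist(pairs)
}

res <- neighbour_bigrams(temp)
res
```

Because it compares whole words, this version only matches the exact word "tide" and keeps the bigrams in reading order within each string.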

Parisparish answered 13/2, 2020 at 13:43 Comment(0)
