How do I keep only unique words within each string in a vector?

Asked 19/1, 2015 at 20:55 Answered 17/2, 2023 at 18:37

I have data that looks like this:

vector = c("hello I like to code hello","Coding is fun", "fun fun fun")

I want to remove duplicate words (space-delimited) within each string, i.e. the output should look like:

vector_cleaned

[1] "hello I like to code"
[2] "Coding is fun"
[3] "fun"

Bituminous answered 19/1, 2015 at 20:55 Comment(0)

Split it up (strsplit on spaces), use unique (in lapply), and paste it back together:

vapply(lapply(strsplit(vector, " "), unique), paste, character(1L), collapse = " ")
# [1] "hello I like to code" "Coding is fun"        "fun"

## OR
vapply(strsplit(vector, " "), function(x) paste(unique(x), collapse = " "), character(1L))
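To make the three steps easier to follow, here is the same pipeline written out with intermediate objects. The names `words`, `uniq` and `dedupe_words`, and the `ignore_case` argument, are my own additions for illustration and are not part of the answer above. The case-insensitive variant is only a sketch, assuming you want words that differ only in case to count as duplicates (the question does not say so).

```r
vector <- c("hello I like to code hello", "Coding is fun", "fun fun fun")

# Step by step: split, de-duplicate, recombine
words <- strsplit(vector, " ")   # list with one character vector per string
uniq  <- lapply(words, unique)   # keep the first occurrence of each word
vector_cleaned <- vapply(uniq, paste, character(1L), collapse = " ")
vector_cleaned
# [1] "hello I like to code" "Coding is fun"        "fun"

# Hypothetical helper: optionally compare words case-insensitively,
# while keeping the spelling of the first occurrence
dedupe_words <- function(x, ignore_case = FALSE) {
  vapply(strsplit(x, " ", fixed = TRUE), function(w) {
    key <- if (ignore_case) tolower(w) else w
    paste(w[!duplicated(key)], collapse = " ")
  }, character(1L))
}

dedupe_words("Fun fun FUN stuff", ignore_case = TRUE)
# [1] "Fun stuff"
```

`unique()` and `duplicated()` both keep the first occurrence, so word order within each string is preserved.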

Update based on comments

You can always write a custom function to use with vapply. For instance, here's a function that takes a split string, keeps only the words that are longer than minLen characters (so words of minLen characters or fewer are dropped), and makes the "unique" step a user choice via onlyUnique.

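myFun <- function(x, minLen = 3, onlyUnique = TRUE) {
  # x: character vector of words from one split string
  a <- if (isTRUE(onlyUnique)) unique(x) else x
  # keep only words with more than minLen characters, then rejoin
  paste(a[nchar(a) > minLen], collapse = " ")
}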

Compare the output of the following to see how it would work.

vapply(strsplit(vector, " "), myFun, character(1L))
vapply(strsplit(vector, " "), myFun, character(1L), onlyUnique = FALSE)
vapply(strsplit(vector, " "), myFun, character(1L), minLen = 0)
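For reference, this is what I get from those three calls on the vector from the question, with myFun copied as defined above. The names `words` and `res_*` are just for illustration.

```r
vector <- c("hello I like to code hello", "Coding is fun", "fun fun fun")

# myFun copied from the answer above
myFun <- function(x, minLen = 3, onlyUnique = TRUE) {
  a <- if (isTRUE(onlyUnique)) unique(x) else x
  paste(a[nchar(a) > minLen], collapse = " ")
}

words <- strsplit(vector, " ")

# Defaults: unique words with more than 3 characters
res_default <- vapply(words, myFun, character(1L))
# [1] "hello like code" "Coding"          ""

# Keep duplicates, still dropping short words
res_dups <- vapply(words, myFun, character(1L), onlyUnique = FALSE)
# [1] "hello like code hello" "Coding"                ""

# minLen = 0 keeps every non-empty word, so only duplicates are removed
res_all <- vapply(words, myFun, character(1L), minLen = 0)
# [1] "hello I like to code" "Coding is fun"        "fun"
```

Note that the third string comes back empty with the defaults, because "fun" has exactly 3 characters and the comparison is strictly greater than minLen.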

Pinfeather answered 19/1, 2015 at 20:57 Comment(3)

Can I apply this same technique to remove any words in the split string that have fewer than 3 characters? – Bituminous 19/1, 2015 at 22:47

@shecode, the approach would be similar, but you would have to add one more requirement based on the result of nchar (which would count the number of characters in the string). On my phone right now, so I can't show the code, but I'll try to update later. Ideally, if I do so, the question should also be updated. – Pinfeather 20/1, 2015 at 2:57

Thank you. I figured out how to do it based on the structure of your answer. Very useful. – Bituminous 20/1, 2015 at 21:15

I spent a while looking for a data frame, tidyverse-friendly version of this, so figured I'd paste my verbose solution:

library(tidyverse)

df <- data.frame(vector = c("hello I like to code hello",
                            "Coding is fun", 
                            "fun fun fun"))

df %>%
  mutate(split = str_split(vector, " ")) %>%                    # split on spaces
  mutate(split = map(split, unique)) %>%                        # drop duplicate words
  mutate(split = map_chr(split, ~ paste(.x, collapse = " ")))   # recombine

Result:

#>                       vector                split
#> 1 hello I like to code hello hello I like to code
#> 2              Coding is fun        Coding is fun
#> 3                fun fun fun                  fun

Created on 2021-01-03 by the reprex package (v0.3.0)

Jaques answered 3/1, 2021 at 14:49 Comment(0)

Using tidyverse

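library(dplyr)
library(stringr)
library(tidyr)

# same example data as in the question
df <- data.frame(vector = c("hello I like to code hello",
                            "Coding is fun",
                            "fun fun fun"))

df %>%
  mutate(rn = row_number()) %>%                                       # remember the original row
  separate_longer_delim(vector, delim = regex("\\s+")) %>%            # one word per row
  distinct() %>%                                                      # drop repeated words within a row
  reframe(vector = str_c(vector, collapse = " "), .by = c("rn")) %>%  # recombine per original row
  select(-rn)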

Output:

                vector
1 hello I like to code
2        Coding is fun
3                  fun

Kean answered 17/2, 2023 at 18:37 Comment(0)
