How do I keep only unique words within each string in a vector?
I have data that looks like this:

vector = c("hello I like to code hello","Coding is fun", "fun fun fun")

I want to remove duplicate words (space delimited) i.e. the output should look like

vector_cleaned

[1] "hello I like to code"
[2] "Coding is fun"
[3] "fun"
Bituminous answered 19/1, 2015 at 20:55 Comment(0)
16

Split it up (strsplit on spaces), use unique (in lapply), and paste it back together:

vapply(lapply(strsplit(vector, " "), unique), paste, character(1L), collapse = " ")
# [1] "hello I like to code" "Coding is fun"        "fun"

## OR
vapply(strsplit(vector, " "), function(x) paste(unique(x), collapse = " "), character(1L))
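If the strings may contain irregular spacing, the same split/unique/paste idea can be wrapped in a small helper. This is only a sketch: the name `unique_words` and its `split` argument are made up for illustration, and it assumes that splitting on runs of whitespace (rather than on a single space) is what you want.

```r
# Hypothetical helper (not part of the original answer).
unique_words <- function(x, split = "\\s+") {
  # Trim leading/trailing whitespace, then split on runs of whitespace
  words <- strsplit(trimws(x), split)
  # Keep the first occurrence of each word and rejoin with single spaces
  vapply(words, function(w) paste(unique(w), collapse = " "), character(1L))
}

vector <- c("hello I like to code hello", "Coding is fun", "fun fun fun")
unique_words(vector)
# returns "hello I like to code", "Coding is fun", "fun"

unique_words("fun  fun   code ")  # irregular spacing
# returns "fun code"
```

Matching is case-sensitive, exactly as with `unique` in the one-liners above, so "Coding" and "coding" would count as different words.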

Update based on comments

You can always write a custom function to pass to vapply. For instance, here's a function that takes a split string, drops words that have minLen or fewer characters (it keeps only those where nchar is greater than minLen), and makes the "unique" step a user choice.

myFun <- function(x, minLen = 3, onlyUnique = TRUE) {
  a <- if (isTRUE(onlyUnique)) unique(x) else x
  paste(a[nchar(a) > minLen], collapse = " ")
}

Compare the output of the following to see how it would work.

vapply(strsplit(vector, " "), myFun, character(1L))
vapply(strsplit(vector, " "), myFun, character(1L), onlyUnique = FALSE)
vapply(strsplit(vector, " "), myFun, character(1L), minLen = 0)
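For reference, here is a self-contained sketch of those three calls run against the example data from the question. The result names (`res_default` and so on) are only for illustration; the function itself is the same `myFun` as above.

```r
vector <- c("hello I like to code hello", "Coding is fun", "fun fun fun")

myFun <- function(x, minLen = 3, onlyUnique = TRUE) {
  a <- if (isTRUE(onlyUnique)) unique(x) else x
  # Keep only words with more than minLen characters
  paste(a[nchar(a) > minLen], collapse = " ")
}

# Defaults: duplicates dropped, words of 3 or fewer characters dropped
res_default <- vapply(strsplit(vector, " "), myFun, character(1L))
# "hello like code", "Coding", ""

# Duplicates kept, short words still dropped
res_dups <- vapply(strsplit(vector, " "), myFun, character(1L), onlyUnique = FALSE)
# "hello like code hello", "Coding", ""

# No length filter: same result as the original one-liner
res_all <- vapply(strsplit(vector, " "), myFun, character(1L), minLen = 0)
# "hello I like to code", "Coding is fun", "fun"
```

Note that "fun" has exactly 3 characters, so with the default `minLen = 3` it is removed and the third string collapses to an empty string. If words of exactly `minLen` characters should be kept, change the comparison to `>=`.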
Pinfeather answered 19/1, 2015 at 20:57 Comment(3)
Can I apply this same technique to remove any words in the split string that have fewer than 3 characters? - Bituminous
@shecode, the approach would be similar, but you would have to add one more requirement based on the result of nchar (which counts the number of characters in the string). On my phone right now, so I can't show the code, but I'll try to update later. Ideally, if I do so, the question should also be updated. - Pinfeather
Thank you. I figured out how to do it based on the structure of your answer. Very useful. - Bituminous
2

I spent a while looking for a data frame, tidyverse-friendly version of this, so figured I'd paste my verbose solution:

library(tidyverse)

df <- data.frame(vector = c("hello I like to code hello",
                            "Coding is fun", 
                            "fun fun fun"))

df %>% 
  mutate(split = str_split(vector, " ")) %>% # split
  mutate(split = map(split, unique)) %>% # drop duplicates
  mutate(split = map_chr(split, ~ paste(.x, collapse = " "))) # recombine

Result:

#>                       vector                split
#> 1 hello I like to code hello hello I like to code
#> 2              Coding is fun        Coding is fun
#> 3                fun fun fun                  fun

Created on 2021-01-03 by the reprex package (v0.3.0)

Jaques answered 3/1, 2021 at 14:49 Comment(0)
0

Using the tidyverse (dplyr, stringr, and tidyr):

library(dplyr)
library(stringr)
library(tidyr)
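# Same example data as above; needs tidyr >= 1.3.0 and dplyr >= 1.1.0
df <- data.frame(vector = c("hello I like to code hello",
                            "Coding is fun",
                            "fun fun fun"))
df %>%
  mutate(rn = row_number()) %>% # tag each original string
  separate_longer_delim(vector, delim = regex("\\s+")) %>% # one word per row
  distinct() %>% # drop repeated words within each string
  reframe(vector = str_c(vector, collapse = " "), .by = c("rn")) %>% # recombine
  select(-rn)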

-output

                vector
1 hello I like to code
2        Coding is fun
3                  fun
Kean answered 17/2, 2023 at 18:37 Comment(0)
