I want to match similar words between columns

Asked 11/4, 2023 at 14:22 Answered 11/4, 2023 at 19:19

1.0	2.0	3.0
loud	complaint	problems
pain	stress	confused
dull	pain	stress

this is my data set and I would like to reorganize the rows so that if there is a word that appears in each column it is transferred to a corresponding row. For example

1.0	2.0	3.0
loud	NA	NA
pain	pain	NA
dull	NA	NA
NA	complaint	NA
NA	stress	stress
NA	NA	confused
NA	NA	problems

etc., so that every word gets its own row and if the word in that row appears in the column it is matched up with the other columns

I have been looking how to do this but cannot find good code. One idea i was given was create a list of all appearing words and then try to match them to each column but still havent found code for this.

Missioner answered 11/4, 2023 at 14:22 Comment(0)

We may use match in base R - get the unique elements from the unlisted data as a vector, loop over the columns, get the index of the matching elements, and replace the index with the matched elements and convert to a data.frame after taking care of the lengths

v1 <- unique(unlist(df1))
lst1 <- lapply(df1, \(x) 
   {i1 <- match(x, v1)
   replace(rep(NA, max(i1)), i1, v1[i1])
 })
list2DF(lapply(lst1, `length<-`, max(lengths(lst1))))
 1.0       2.0      3.0
1 loud      <NA>     <NA>
2 pain      pain     <NA>
3 dull      <NA>     <NA>
4 <NA> complaint     <NA>
5 <NA>    stress   stress
6 <NA>      <NA> problems
7 <NA>      <NA> confused

data

df1 <- structure(list(`1.0` = c("loud", "pain", "dull"), `2.0` = c("complaint", 
"stress", "pain"), `3.0` = c("problems", "confused", "stress"
)), class = "data.frame", row.names = c(NA, -3L))

Hanleigh answered 11/4, 2023 at 14:46 Comment(1)

The second lapply in list2DF isn't necessary; you can just use the nrow argument: list2DF(lst1, nrow = max(lengths(lst1)) – Ulises 11/4, 2023 at 15:0

Here is a tidyverse version.

suppressMessages(library(tidyverse))
x = tibble(`1.0` = c("loud", "pain", "dull"),
           `2.0` = c("complaint", "stress", "pain"),
           `3.0` = c("problems", "confused", "stress"))

x %>% 
  gather("version", "value") %>% 
  mutate(id = value) %>% 
  spread(version, value) %>% 
  select(-id)
#> # A tibble: 7 x 3
#>   `1.0` `2.0`     `3.0`   
#>   <chr> <chr>     <chr>   
#> 1 <NA>  complaint <NA>    
#> 2 <NA>  <NA>      confused
#> 3 dull  <NA>      <NA>    
#> 4 loud  <NA>      <NA>    
#> 5 pain  pain      <NA>    
#> 6 <NA>  <NA>      problems
#> 7 <NA>  stress    stress

^{Created on 2023-04-11 by the reprex package (v2.0.0)}

If you need to arrange the rows in the order of occurrence you can change the second statement to

mutate(id = fct_inorder(value)) %>%

NOTE: The functions gather and spread are superseded by pivot_longer and pivot_wider functions. In my opinion, the older are a bit easier to use and are good enough for this case. The new functions are much more powerful.

Grimbly answered 11/4, 2023 at 15:26 Comment(0)

Fast + efficient data.table solution:

x <- data.table(`1.0` = c("loud", "pain", "dull"),
           `2.0` = c("complaint", "stress", "pain"),
           `3.0` = c("problems", "confused", "stress"))

dcast(unique(melt(x, measure.vars = names(x))), value ~ variable)

       value  1.0       2.0      3.0
1: complaint <NA> complaint     <NA>
2:  confused <NA>      <NA> confused
3:      dull dull      <NA>     <NA>
4:      loud loud      <NA>     <NA>
5:      pain pain      pain     <NA>
6:  problems <NA>      <NA> problems
7:    stress <NA>    stress   stress

Pancreatin answered 11/4, 2023 at 16:47 Comment(0)

Here is a base R option using stack + reshape

reshape(
  transform(stack(df), v = values),
  direction = "wide",
  idvar = "values",
  timevar = "ind"
)[-1]

which gives

  v.x1      v.x2     v.x3
1 loud      <NA>     <NA>
2 pain      pain     <NA>
3 dull      <NA>     <NA>
4 <NA> complaint     <NA>
5 <NA>    stress   stress
7 <NA>      <NA> problems
8 <NA>      <NA> confused

Data

> dput(df)
structure(list(x1 = c("loud", "pain", "dull"), x2 = c("complaint", 
"stress", "pain"), x3 = c("problems", "confused", "stress")), class = "data.frame", row.names = c(NA,
-3L))

Sapheaded answered 11/4, 2023 at 19:19 Comment(0)

data

Data

Recommended topics

Hot tags