I want to match similar words between columns
Asked Answered
M

4

9
1.0 2.0 3.0
loud complaint problems
pain stress confused
dull pain stress

this is my data set and I would like to reorganize the rows so that if there is a word that appears in each column it is transferred to a corresponding row. For example

1.0 2.0 3.0
loud NA NA
pain pain NA
dull NA NA
NA complaint NA
NA stress stress
NA NA confused
NA NA problems

etc., so that every word gets its own row and if the word in that row appears in the column it is matched up with the other columns

I have been looking how to do this but cannot find good code. One idea i was given was create a list of all appearing words and then try to match them to each column but still havent found code for this.

Missioner answered 11/4, 2023 at 14:22 Comment(0)
H
8

We may use match in base R - get the unique elements from the unlisted data as a vector, loop over the columns, get the index of the matching elements, and replace the index with the matched elements and convert to a data.frame after taking care of the lengths

v1 <- unique(unlist(df1))
lst1 <- lapply(df1, \(x) 
   {i1 <- match(x, v1)
   replace(rep(NA, max(i1)), i1, v1[i1])
 })
list2DF(lapply(lst1, `length<-`, max(lengths(lst1))))
 1.0       2.0      3.0
1 loud      <NA>     <NA>
2 pain      pain     <NA>
3 dull      <NA>     <NA>
4 <NA> complaint     <NA>
5 <NA>    stress   stress
6 <NA>      <NA> problems
7 <NA>      <NA> confused

data

df1 <- structure(list(`1.0` = c("loud", "pain", "dull"), `2.0` = c("complaint", 
"stress", "pain"), `3.0` = c("problems", "confused", "stress"
)), class = "data.frame", row.names = c(NA, -3L))

Hanleigh answered 11/4, 2023 at 14:46 Comment(1)
The second lapply in list2DF isn't necessary; you can just use the nrow argument: list2DF(lst1, nrow = max(lengths(lst1))Ulises
G
4

Here is a tidyverse version.

suppressMessages(library(tidyverse))
x = tibble(`1.0` = c("loud", "pain", "dull"),
           `2.0` = c("complaint", "stress", "pain"),
           `3.0` = c("problems", "confused", "stress"))

x %>% 
  gather("version", "value") %>% 
  mutate(id = value) %>% 
  spread(version, value) %>% 
  select(-id)
#> # A tibble: 7 x 3
#>   `1.0` `2.0`     `3.0`   
#>   <chr> <chr>     <chr>   
#> 1 <NA>  complaint <NA>    
#> 2 <NA>  <NA>      confused
#> 3 dull  <NA>      <NA>    
#> 4 loud  <NA>      <NA>    
#> 5 pain  pain      <NA>    
#> 6 <NA>  <NA>      problems
#> 7 <NA>  stress    stress

Created on 2023-04-11 by the reprex package (v2.0.0)

If you need to arrange the rows in the order of occurrence you can change the second statement to

mutate(id = fct_inorder(value)) %>% 

NOTE: The functions gather and spread are superseded by pivot_longer and pivot_wider functions. In my opinion, the older are a bit easier to use and are good enough for this case. The new functions are much more powerful.

Grimbly answered 11/4, 2023 at 15:26 Comment(0)
P
4

Fast + efficient data.table solution:

x <- data.table(`1.0` = c("loud", "pain", "dull"),
           `2.0` = c("complaint", "stress", "pain"),
           `3.0` = c("problems", "confused", "stress"))

dcast(unique(melt(x, measure.vars = names(x))), value ~ variable)

       value  1.0       2.0      3.0
1: complaint <NA> complaint     <NA>
2:  confused <NA>      <NA> confused
3:      dull dull      <NA>     <NA>
4:      loud loud      <NA>     <NA>
5:      pain pain      pain     <NA>
6:  problems <NA>      <NA> problems
7:    stress <NA>    stress   stress
Pancreatin answered 11/4, 2023 at 16:47 Comment(0)
S
1

Here is a base R option using stack + reshape

reshape(
  transform(stack(df), v = values),
  direction = "wide",
  idvar = "values",
  timevar = "ind"
)[-1]

which gives

  v.x1      v.x2     v.x3
1 loud      <NA>     <NA>
2 pain      pain     <NA>
3 dull      <NA>     <NA>
4 <NA> complaint     <NA>
5 <NA>    stress   stress
7 <NA>      <NA> problems
8 <NA>      <NA> confused

Data

> dput(df)
structure(list(x1 = c("loud", "pain", "dull"), x2 = c("complaint", 
"stress", "pain"), x3 = c("problems", "confused", "stress")), class = "data.frame", row.names = c(NA,
-3L))
Sapheaded answered 11/4, 2023 at 19:19 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.