How to assign identical unique IDs to matching observations between two dataframes in r?

Asked 13/5, 2015 at 21:33 Answered 13/5, 2015 at 22:24

I have a practical question when I have two (or more) data frames and want to assign unique IDs for each matching observation within each and across both datasets e.g.:

#1. Create dataframe df1:

a1 <- c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1)
b1 <- c(1, 5, 3, 2, 3, 4, 5, 1, 5, 2)
c1 <- c("white", "red", "black", "white", "red", 
        "white", "black", "silver", "red", "green")
df1 <- data.frame(a1, b1, c1)
df1

   a1 b1     c1
1   1  1  white
2   1  5    red
3   1  3  black
4   1  2  white
5   2  3    red
6   2  4  white
7   2  5  black
8   2  1 silver
9   1  5    red
10  1  2  green

#2. Create dataframe df2:

a2 <- c(2, 2, 1, 1, 2, 2, 2, 2, 2, 2)
b2 <- c(3, 1, 3, 2, 1, 3, 4, 5, 3, 5)
c2 <- c("black", "blue", "black", "white", "silver", 
        "green", "green", "red", "blue", "white")
df2 <- data.frame(a2, b2, c2)
df2

   a2 b2     c2
1   2  3  black
2   2  1   blue
3   1  3  black
4   1  2  white
5   2  1 silver
6   2  3  green
7   2  4  green
8   2  5    red
9   2  3   blue
10  2  5  white

#3. Assign unique IDs to each observation in df1:

library(data.table)
df1.2 <- data.table(df1, key="a1,b1,c1") 
df1.2[, id:=.GRP, by=key(df1.2)]
df1.2 <- as.data.frame(df1.2)
df1.2

   a1 b1     c1 id
1   1  1  white  1
2   1  2  green  2
3   1  2  white  3
4   1  3  black  4
5   1  5    red  5
6   1  5    red  5
7   2  1 silver  6
8   2  3    red  7
9   2  4  white  8
10  2  5  black  9

#4. The problematic part!! Assign identical unique IDs to matching observations of df2 as compared to df1.2 
#and assign other unique IDs to all other non-matching obs of df2. 
#Name the resulting dataframe as df2.2 
#My expected result will ideally look as follows:

df2.2

   a2 b2     c2 id
1   2  3  black 10 
2   2  1   blue 11
3   1  3  black  4
4   1  2  white  3
5   2  1 silver  6
6   2  3  green 12
7   2  4  green 13
8   2  5    red 14
9   2  3   blue 15
10  2  5  white 16

Any help on how to get to df2.2 will be very much appreciated. Thanks.

Greco answered 13/5, 2015 at 21:33 Comment(0)

An easy way to approach this is to make a hash:

library(dplyr)
library(digest)

df1 %>%
  rowwise() %>%
  do( data.frame(., id=digest( paste(.$a1,.$b1,.$c1), algo="md5"),
                   stringsAsFactors=FALSE)) %>% ungroup()

df2 %>%
  rowwise() %>%
  do( data.frame(., id=digest( paste(.$a2,.$b2,.$c2), algo="md5"),
               stringsAsFactors=FALSE)) %>% ungroup()

which would produce the following for df1:

   a1 b1     c1                               id
1   1  1  white b86fbb78b27f7db2ee50af2d68cce452
2   1  5    red 68d47f544832989834517630e4a2764c
3   1  3  black 724e37192140cb2009cf3d982f2be1e4
4   1  2  white f731b8b38255b8c312543283f8e1c634
5   2  3    red 2d50b86902056a51faad04d2c566faf2
6   2  4  white 9396667cd51d1e1b61b0b22a7767d3d9
7   2  5  black 9ba1f3e04c61c006d3c5382fcad098e6
8   2  1 silver 38dcd29d200c8b33cd38ac78ef9dd751
9   1  5    red 68d47f544832989834517630e4a2764c
10  1  2  green 7d9b1aadfd79de142b234b83d7867b9b

and the following for df2:

   a2 b2     c2                               id
1   2  3  black d285febc8ab08e99b11609b98f077e66
2   2  1   blue bfa0405276406ac4bc596daf957dfa11
3   1  3  black 724e37192140cb2009cf3d982f2be1e4
4   1  2  white f731b8b38255b8c312543283f8e1c634
5   2  1 silver 38dcd29d200c8b33cd38ac78ef9dd751
6   2  3  green 67eefe9ee2d82486ded30a268289296b
7   2  4  green d773f58cf144eab15ef459e326494a2f
8   2  5    red 0724318a9f59d3960edfe4e90f9c4eff
9   2  3   blue 6883420cc137ba45b773f642176e9ce6
10  2  5  white 5dea9e63b5fbfb31fb81260cb5a5f41c

Marko answered 13/5, 2015 at 22:3 Comment(2)

Thanks. This is actually very smart! – Greco 13/5, 2015 at 22:7

Great that it worked! Please note that it is very important here that paste has a space as a default separator. If one says paste( ...,sep="") then the hash would produce collisions. ("12","3","black") would get the same id as ("1","23","black"). Please be careful with the separator! (Also just added ungroup() at the end to remove rowwise() grouping in the final result for convenience. ) – Marko 13/5, 2015 at 22:19

You can accomplish what you want by writing a function that generates unique IDs and then applying it to the combination of df1 and df2.

# Inspiration: https://mcmap.net/q/366479/-how-to-assign-a-unique-id-number-to-each-group-of-identical-values-in-a-column-duplicate
unique.id <- function(x) as.numeric(factor(x))

(df1.info <- do.call(paste, df1))
#  [1] "1 1 white 1"  "1 5 red 5"    "1 3 black 4"  "1 2 white 3"  "2 3 red 11"  
#  [6] "2 4 white 13" "2 5 black 14" "2 1 silver 7" "1 5 red 5"    "1 2 green 2" 
df2.info <- do.call(paste, df2)
ids <- unique.id(c(df1.info, df2.info))
df1$id <- head(ids, nrow(df1))
df1
#    a1 b1     c1 id
# 1   1  1  white  1
# 2   1  5    red  5
# 3   1  3  black  4
# 4   1  2  white  3
# 5   2  3    red 11
# 6   2  4  white 13
# 7   2  5  black 14
# 8   2  1 silver  7
# 9   1  5    red  5
# 10  1  2  green  2
df2$id <- tail(ids, nrow(df2))
df2
#    a2 b2     c2 id
# 1   2  3  black  8
# 2   2  1   blue  6
# 3   1  3  black  4
# 4   1  2  white  3
# 5   2  1 silver  7
# 6   2  3  green 10
# 7   2  4  green 12
# 8   2  5    red 15
# 9   2  3   blue  9
# 10  2  5  white 16

Stereotropism answered 13/5, 2015 at 21:58 Comment(0)

Assuming that your columns are all the same, your simplest approach might be:

df.all <- rbind(df1, df2)

(You may need to rename the columns to be the same.)

Now do the same trick you already figured out with data table, on the whole dataset. Then split the dataset back up:

df1 <- df.all[1:nrow(df1),]
df2 <- df.all[- (1:nrow(df1)),]

NB: I'm not saying the data table trick is the ideal way to generate numbers for unique combinations! But you have it written out already.

Bowie answered 13/5, 2015 at 22:24 Comment(0)

Recommended topics

Hot tags