Adding a hash to each row using dplyr and digest in R

I need to add a fingerprint to each row of a dataset so that I can compare it against a later version of the same dataset and look for differences.

I know how to add a hash for each row in base R, like below:

library(digest)
data.frame(iris, hash = apply(iris, 1, digest))

I am learning to use dplyr because the dataset is getting huge and I need to store it in SQL Server. I tried something like the code below, but the hashing does not work: every row gets the same hash.

iris %>%
  rowwise() %>%
  mutate(hash=digest(.))

Any clue for row-wise hashing using dplyr? Thanks!

Jeremie answered 21/9, 2017 at 4:38 Comment(0)

We could use do():

res <- iris %>%
         rowwise() %>% 
         do(data.frame(., hash = digest(.)))
head(res, 3)
# A tibble: 3 x 6
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species                             hash
#         <dbl>       <dbl>        <dbl>       <dbl>  <fctr>                            <chr>
#1          5.1         3.5          1.4         0.2  setosa e261621c90a9887a85d70aa460127c78
#2          4.9         3.0          1.4         0.2  setosa 7bf67322858048d82e19adb6399ef7a4
#3          4.7         3.2          1.3         0.2  setosa c20f3ee03573aed5929940a29e07a8bb
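
For context on why the mutate(hash = digest(.)) attempt in the question returns one hash for every row: inside mutate(), `.` is the pipe's placeholder for the whole left-hand side, so digest() sees the entire data frame each time. The sketch below illustrates this and contrasts it with c_across(), which collects the values of the current row. Restricting c_across() to the numeric columns is a simplifying assumption made here, since it combines the values into a single vector.

```r
library(digest)
library(dplyr)

# `.` is the whole piped data frame, so every row receives the same hash
same <- iris %>%
  rowwise() %>%
  mutate(hash = digest(.)) %>%
  ungroup()

n_distinct(same$hash)  # a single distinct value

# c_across() gathers the current row's values instead
# (numeric columns only: an assumption made for this sketch)
per_row <- iris %>%
  rowwise() %>%
  mutate(hash = digest(c_across(where(is.numeric)))) %>%
  ungroup()

n_distinct(per_row$hash)  # more than one distinct value
```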

Note that in the apply approach, all the columns are converted to a single class, because apply converts its input to a matrix and a matrix can hold only a single class. There will be a warning about converting the factor to character class.
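
The coercion mentioned above can be checked directly in base R. This is only a small illustration using the built-in iris data: because iris contains a factor column, the intermediate matrix is character, so each row reaches the hashing function as a vector of strings rather than as numbers plus a factor.

```r
# apply() converts the data frame to a matrix before iterating over rows
row_classes <- apply(iris, 1, class)
unique(row_classes)
# "character"

# The same conversion done explicitly
m <- as.matrix(iris)
typeof(m)
# "character"
```

One consequence is that hashes produced through apply() are computed on text and will generally not match hashes computed on the original typed columns.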

Ruthenic answered 21/9, 2017 at 4:44 Comment(4)
Any idea why the below give different results? key <- c('Sepal.Length', 'Sepal.Width', 'Petal.Length'); iris %>% rowwise() %>% do(data.frame(., hash = digest::digest(.data[!!key]))) %>% `[`(i=1, j='hash') vs. digest::digest(as.character(c(iris[1,'Sepal.Length'], iris[1,'Sepal.Width'], iris[1,'Petal.Length']))) – Trappist
Ah - digest operates by default on a serialization of the object. Is it possible to modify the do call to add serialize = F to the digest function? I'm not entirely clear what object ultimately gets passed to digest in the rowwise case. – Trappist
@Trappist Do you need this? iris %>% select(key) %>% transmute(hash = pmap(., ~ c(...) %>% as.character %>% digest(., serialize = FALSE))) %>% bind_cols(iris, .) – Ruthenic
Or using do: iris %>% rowwise() %>% do(data.frame(., hash = .data[!!key] %>% as.character %>% digest(., serialize = FALSE))) – Ruthenic
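
As a small illustration of the serialization point raised in these comments: by default digest() hashes the serialized R object, while serialize = FALSE hashes the characters of a string directly. The sketch below collapses the row values into one string before hashing without serialization; the "|" separator and the variable names are assumptions made for the example, not something taken from the thread.

```r
library(digest)

row_values <- c(5.1, 3.5, 1.4)

# Default: the R object is serialized first, then hashed
hash_serialized <- digest(row_values)

# serialize = FALSE is meant for character strings, so collapse the row
# into a single string first (the "|" separator is an assumption)
row_string <- paste(row_values, collapse = "|")
hash_raw <- digest(row_string, serialize = FALSE)

# Sanity check against the well-known MD5 of the string "a"
digest("a", algo = "md5", serialize = FALSE)
# "0cc175b9c0f1b6a831c399e269772661"
```

With serialize = FALSE the result can be compared with hashes computed outside R on the same string, which is not the case for the default serialized form.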

Since do is being superseded, this option may be better now:

library(digest)
library(tidyverse)

# Create a tibble for practice
df <- tibble(x = rep(c(1,2), each=2), y = c(1,1,3,4), z = c(1,1,6,4))

# Note that row 1 and 2 are equal.
# This will generate a sha1 over specific columns (column z is excluded)
df %>% rowwise() %>% mutate(m = sha1( c(x, y ) ))

# This will generate over all columns,
# then convert the hash to integer
# (better for joining or other data operations later)

df %>% 
   rowwise() %>% 
   mutate(sha =
     digest2int( # generates a new integer hash
       sha1( c_across(everything() ) ) # across all columns
     )
   )
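
To connect this back to the original goal of comparing two versions of a dataset, here is a minimal sketch of how the row hashes might be used. The helper add_hash() and the two small tibbles are hypothetical and exist only for illustration; the sketch assumes all columns are numeric so that c_across() can combine them.

```r
library(digest)
library(dplyr)

# Hypothetical helper: adds a sha1 hash computed over all columns of each row
add_hash <- function(data) {
  data %>%
    rowwise() %>%
    mutate(hash = sha1(c_across(everything()))) %>%
    ungroup()
}

# Two made-up versions of the same dataset; only the second row differs
old <- tibble(x = c(1, 2, 3), y = c(1, 1, 3))
new <- tibble(x = c(1, 2, 3), y = c(1, 5, 3))

# Rows of the new version whose hash does not appear in the old version
changed <- anti_join(add_hash(new), add_hash(old), by = "hash")
changed
```

Joining on a single hash column avoids listing every column in the join, at the cost of computing one hash per row.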

It may be a better option to convert everything to character and paste it together to use just one hash function call. You can use unite:

df %>% rowwise() %>% 
  unite(allCols, everything(), sep = "", remove = FALSE) %>% 
  mutate(hash = digest2int(allCols)) %>%
  select(-allCols)
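
One thing to keep in mind with this approach is that pasting values together with sep = "" can make different rows produce the same string. The sketch below uses a made-up two-row tibble to show this, and then uses a separator instead; the "|" separator is an assumption and should be a character that cannot occur in the data.

```r
library(digest)
library(dplyr)
library(tidyr)

# Made-up data: the rows differ, but both paste to "111" without a separator
df2 <- tibble(x = c(1, 11), y = c(11, 1))

no_sep <- df2 %>%
  unite(allCols, everything(), sep = "", remove = FALSE)
no_sep$allCols[1] == no_sep$allCols[2]  # TRUE: the strings collide

# With a separator the strings stay distinct before hashing
with_sep <- df2 %>%
  unite(allCols, everything(), sep = "|", remove = FALSE) %>%
  rowwise() %>%
  mutate(hash = digest2int(allCols)) %>%
  ungroup()
with_sep$allCols
```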
Anthocyanin answered 12/9, 2021 at 0:10 Comment(0)
