I have two data frames: df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second-best) matching row in df1 in terms of Hamming distance.
I used the e1071 package to compute the Hamming distance. For example, the Hamming distance between two vectors x and y can be computed as follows:
x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
24197, 610187, 402471, 157122, 866381, 582868, 878)
y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)
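library(e1071)  # provides hamming.distance()
# Expand each value into its 32 bits (one column per vector element)
xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)
# Sum the number of differing bits over all elements
distance <- sum(sapply(seq_len(ncol(xm)), function(i) hamming.distance(xm[,i], ym[,i])))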
The resulting distance is 25. However, I need to do this for every combination of rows from df1 and df2. The trivial method is a pair of nested loops, which looks terribly slow.
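To make the nested-loop approach concrete, here is a minimal sketch on made-up toy data. The toy df1 and df2, the helper ham, and the output column names are all hypothetical, chosen only for illustration. The sketch assumes every column holds integer values, and it uses base R's bitwXor instead of e1071, which should give the same bit count as the example above.

```r
# Hypothetical toy data: each row is one vector of integers.
df1 <- data.frame(a = c(1L, 7L, 0L), b = c(2L, 8L, 0L))  # reference rows
df2 <- data.frame(a = c(1L, 6L), b = c(3L, 8L))          # new rows

# Bitwise Hamming distance between two integer vectors (base R only):
# XOR the values element-wise, then count the bits that are set.
ham <- function(x, y) sum(as.integer(intToBits(bitwXor(x, y))))

m1 <- as.matrix(df1)
m2 <- as.matrix(df2)
n2 <- nrow(m2)

best_id <- integer(n2); best_dist <- integer(n2)
second_id <- integer(n2); second_dist <- integer(n2)

# Naive double loop: compare every row of df2 with every row of df1.
for (j in seq_len(n2)) {
  d <- integer(nrow(m1))
  for (i in seq_len(nrow(m1))) {
    d[i] <- ham(m1[i, ], m2[j, ])
  }
  o <- order(d)  # row ids of df1, sorted by increasing distance
  best_id[j] <- o[1];   best_dist[j] <- d[o[1]]
  second_id[j] <- o[2]; second_dist[j] <- d[o[2]]
}

# Append the four result columns to df2.
df2$best_id <- best_id
df2$best_dist <- best_dist
df2$second_id <- second_id
df2$second_dist <- second_dist
```

This does nrow(df1) * nrow(df2) distance computations in interpreted R, which is what makes it slow on real data.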
Any ideas how to do this more efficiently? In the end I need to append to df2:
- a column with the row id from df1 that gives the lowest distance;
- a column with the lowest distance;
- a column with the row id from df1 that gives the 2nd lowest distance;
- a column with the second lowest distance.
Thanks.