How to Vectorize this R code Using Plyr, Apply, or Similar?


2

5

I wrote the following R code that identifies duplicate files in a directory. How can one vectorize the for-loop using the plyr package (or similar)? I would like to achieve a more idiomatic R solution than the one I came up with.

library("digest") # to compute the MD5 digest
test_dir = "/Users/user/Dropbox/kaggle/r_projects/test_photo"
filelist <- dir(test_dir, pattern = "JPG|AVI", recursive=TRUE, 
                all.files =TRUE, full.names=TRUE)

fl = list() # create an empty list to hold MD5s and filenames

for (itm in filelist) {
  file_digest = digest(itm, file=TRUE, algo="md5")
  fl[[file_digest]]= c(fl[[file_digest]],itm)
}
fl

The output (using a small test directory) is:

> fl
$`5715b719723c5111b3a38a6ff8b7ca56`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3480.JPG"     

$`24fd4d7d252ca66c8d7a88b539c55112`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3481.JPG"     
[3] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3481.JPG"     

$`2a1d668c874dc856b9df0fbf3f2e81ec`
[1] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482 copy.JPG"
[2] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_a/IMG_3482.JPG"     
[3] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482 copy.JPG"
[4] "/Users/user/Dropbox/kaggle/r_projects/test_photo/folder_b/IMG_3482.JPG"

I tried:

h=ldply(filelist, digest, file=TRUE, algo="md5")
h$filenames=filelist

but ended up with a data frame that has one row per (MD5, filename) pair, so duplicate hashes are repeated rather than grouped. I was not able to get the compact output shown above.

(Background: As an exercise, I converted the Python code presented by Raymond Hettinger in his PyCon AU 2011 keynote "What Makes Python Awesome". The slides are here: http://slidesha.re/WKkh9M . I was able to cut the LOC in half, but I think I can do better, and learn more, by vectorizing.)

Keratogenous answered 27/12, 2012 at 19:50 Comment(2)

Or follow your ldply command with split(h,h$digest) ? – Easterly 27/12, 2012 at 19:59

Arun and Ben - my goal is to have a list whose keys are the md5 hashes and the values are lists of filenames corresponding to each unique key (see sample output). When I run ldply(seq_along(filelist), function(idx) c(digest(filelist[idx], file=TRUE, algo="md5"), filelist[idx])) the results are duplicated md5 keys and the associated filename values. I tried stumbling through melt and cast to no avail. – Keratogenous 27/12, 2012 at 20:25


6

Here is a solution in base R that is a little more concise: compute one MD5 per file with sapply, then group the filenames by hash with split.

md5s<-sapply(filelist,digest,file=TRUE,algo="md5")
split(filelist,md5s)
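
As a self-contained illustration of the same idea, here is a minimal sketch that runs against throwaway files instead of a photo directory. The temporary directory, the file names, and the file contents are made up for the example; the use of vapply in place of sapply and the final Filter step (keeping only hashes shared by more than one file) are additions, not part of the answer above.

```r
library(digest)

# Hypothetical test data: three temp files, two with identical contents.
tmp <- tempfile("dupe_demo")
dir.create(tmp)
writeLines("same bytes",  file.path(tmp, "a.JPG"))
writeLines("same bytes",  file.path(tmp, "b.JPG"))
writeLines("other bytes", file.path(tmp, "c.JPG"))

files <- dir(tmp, pattern = "\\.JPG$", full.names = TRUE)

# vapply checks that every digest comes back as a single string.
md5s <- vapply(files, digest, character(1), file = TRUE, algo = "md5")

# Group the file names by hash; unname() drops the path names from the hashes.
groups <- split(files, unname(md5s))

# Optional extra step: keep only hashes that occur for more than one file.
dupes <- Filter(function(x) length(x) > 1, groups)
dupes
```

split returns a named list keyed by hash, which is the shape asked for in the question; the Filter call is only useful if unique files should be dropped from the result.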

Samothrace answered 27/12, 2012 at 20:55 Comment(0)


4

Here's one answer. First collect the MD5s and file names in a data.frame with ldply (the columns are named V1 and V2 by default). Then build the list you want with dlply, splitting on the hash column.

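library("plyr")   # provides ldply and dlply
library("digest") # provides digest

# One row per file: V1 is the MD5, V2 is the file name
fl <- ldply(seq_along(filelist), function(idx) 
          c(digest(filelist[idx], file=TRUE, algo="md5"), 
          filelist[idx]))
# Split on the hash and keep only the file names in each group
fl <- dlply(fl, .(V1), function(x) x$V2)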
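
To see what the dlply step does in isolation, here is a small sketch on a hand-built data frame. The hash values "h1" and "h2", the file names, and the column names md5 and file are placeholders invented for the example, standing in for the V1 and V2 columns that ldply produces above.

```r
library(plyr)

# Hypothetical (hash, file) pairs, one row per file, as ldply would return them.
pairs <- data.frame(
  md5  = c("h1", "h1", "h2"),
  file = c("a.JPG", "b.JPG", "c.JPG"),
  stringsAsFactors = FALSE
)

# Split the rows by hash and keep only the file names from each piece.
by_hash <- dlply(pairs, .(md5), function(x) x$file)

names(by_hash)   # one list element per distinct hash
by_hash[["h1"]]  # the files that share the first hash
```

The result also carries a few plyr bookkeeping attributes (such as the split labels), which is one reason the base R split approach may give slightly cleaner output.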

Lid answered 27/12, 2012 at 20:37 Comment(0)
