How to replace duplicate files with hard links using Python?

I'm a photographer and I do a lot of backups. Over the years I have ended up with a lot of hard drives. I have now bought a NAS and copied all my pictures onto one 3TB RAID 1 volume using rsync. According to my script, about 1TB of those files are duplicates. That comes from doing multiple backups before deleting files from my laptop, and from being very messy. I do have a backup of all those files on the old hard drives, but it would be a pain if my script messes things up. Can you please have a look at my duplicate-finder script and tell me if you think I can run it or not? I tried it on a test folder and it seems OK, but I don't want to mess things up on the NAS.

The script has three steps in three files. In the first part I find all image and metadata files and put them into a shelve database (datenbank) with their file size as the key.

import os
import shelve

datenbank = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step1"), flag='c', protocol=None, writeback=False)

#path_to_search = os.path.join(os.path.dirname(__file__),"test")
path_to_search = "/volume1/backup_2tb_wd/"
file_exts = ["xmp", "jpg", "JPG", "XMP", "cr2", "CR2", "PNG", "png", "tiff", "TIFF"]
walker = os.walk(path_to_search)

counter = 0

for dirpath, dirnames, filenames in walker:
  if filenames:
    for filename in filenames:
      counter += 1
      print str(counter)
      for file_ext in file_exts:
        if file_ext in filename:
          filepath = os.path.join(dirpath, filename)
          filesize = str(os.path.getsize(filepath))
          if not filesize in datenbank:
            datenbank[filesize] = []
          tmp = datenbank[filesize]
          if filepath not in tmp:
            tmp.append(filepath)
            datenbank[filesize] = tmp

datenbank.sync()
print "done"
datenbank.close()

The second part: now I drop all file sizes that have only one file in their list, and create another shelve database with the MD5 hash as the key and a list of files as the value.

import os
import shelve
import hashlib

datenbank = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step1"), flag='c', protocol=None, writeback=False)

datenbank_step2 = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step2"), flag='c', protocol=None, writeback=False)

counter = 0
space = 0

def md5Checksum(filePath):
    with open(filePath, 'rb') as fh:
        m = hashlib.md5()
        while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
        return m.hexdigest()


for filesize in datenbank:
  filepaths = datenbank[filesize]
  filepath_count = len(filepaths)
  if filepath_count > 1:
    counter += filepath_count -1
    space += (filepath_count -1) * int(filesize)
    for filepath in filepaths:
      print counter
      checksum = md5Checksum(filepath)
      if checksum not in datenbank_step2:
        datenbank_step2[checksum] = []
      temp = datenbank_step2[checksum]
      if filepath not in temp:
        temp.append(filepath)
        datenbank_step2[checksum] = temp

print counter
print str(space)

datenbank_step2.sync()
datenbank_step2.close()
print "done"

And finally, the most dangerous part: for every MD5 key I retrieve the file list and compute an additional SHA-1 checksum. If it matches, I delete every file in that list except the first one and create a hard link to replace each deleted file.

import os
import shelve
import hashlib

datenbank = shelve.open(os.path.join(os.path.dirname(__file__),"shelve_step2"), flag='c', protocol=None, writeback=False)

def sha1Checksum(filePath):
    with open(filePath, 'rb') as fh:
        m = hashlib.sha1()
        while True:
            data = fh.read(8192)
            if not data:
                break
            m.update(data)
        return m.hexdigest()

for hashvalue in datenbank:
  switch = True
  for path in datenbank[hashvalue]:
    if switch:
      original = path
      original_checksum = sha1Checksum(path)
      switch = False
    else:
      if sha1Checksum(path) == original_checksum:
        os.unlink(path)
        os.link(original, path)
        print "delete: ", path
print "done"

What do you think? Thank you very much.

*If it is somehow important: it's a Synology 713+ with an ext3 or ext4 filesystem.

Splanchnology answered 21/6, 2013 at 23:35 Comment(7)
Rather than deleting immediately, move the duplicates to another folder, then delete them all when you are satisfied that nothing has been lost. – Felicafelicdad
Unfortunately the 3TB NAS is full. I only have 20GB left, so I have to delete them. Besides, I'm talking about 139,020 duplicate files. There is no way I can check manually that the script didn't mess anything up. – Splanchnology
@JasonTS: Moving files to another directory on the same filesystem won't waste any space, and creating 128K hard links will waste a megabyte or so (probably less than your shelve database), so that probably isn't a good reason to reject suspectus's suggestion. – Farmelo
Meanwhile, I think this question belongs on Code Review, not Stack Overflow. – Farmelo
@abarnert: Ah, sorry, I thought of a copy. Well, that might be nice. But I need the space soon, so I don't really think I have enough time to see if something is wrong or not. Thanks for the tip. I posted it on Code Review as well. – Splanchnology
Moves are also very fast compared to copies, so… are you sure you don't have time? (By the way, I think I'd actually create a whole parallel tree to move them to, instead of moving them all to one flat directory. First, a directory with 128K files in it could cause problems (for the filesystem, for your shell, for your Python script, etc.). Second, even if you lose all the metadata in the database, a parallel tree will make it trivial to undo.) – Farmelo
Well, I have the time to move them to another directory. But if it doesn't fail very, very obviously, I don't think there is a way to check all the folders manually. And since I really need to flush my laptop, I'd have to delete the folder I moved the files to anyway. But aside from that: do you see any errors in my code? Or do you think this should work? – Splanchnology
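
For reference, a minimal sketch of the parallel-tree move suggested in the comments above. The duplicates_root path is an assumption for illustration, it must sit on the same filesystem as the backup so os.rename is a cheap rename rather than a copy, and the extra SHA-1 verification from the third script is omitted for brevity:

import os
import shelve

# Assumption: a sibling directory on the same volume to hold the moved duplicates.
duplicates_root = "/volume1/backup_2tb_wd_duplicates"

datenbank_step2 = shelve.open(os.path.join(os.path.dirname(__file__), "shelve_step2"), flag='r')

for hashvalue in datenbank_step2:
    paths = datenbank_step2[hashvalue]
    original = paths[0]
    for path in paths[1:]:                      # keep the first file, move the rest
        target = os.path.join(duplicates_root, path.lstrip("/"))
        target_dir = os.path.dirname(target)
        if not os.path.isdir(target_dir):
            os.makedirs(target_dir)             # recreate the directory structure
        os.rename(path, target)                 # move the duplicate out of the way
        os.link(original, path)                 # put a hard link in its place
        print("moved: {0}".format(path))

datenbank_step2.close()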

This looked good, and after sanitizing it a bit (to make it work with Python 3.4), I ran it on my NAS. While I already had hard links for files that had not been modified between backups, files that had been moved were duplicated. This recovered that lost disk space for me.

A minor nitpick is that files that are already hard links are deleted and relinked. This does not affect the end result, though.
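
One way to skip files that are already hard-linked (a sketch of my own, not part of the scripts above) is to compare device and inode numbers before relinking; os.path.samefile(original, path) is the one-liner equivalent:

import os

def already_hard_linked(path_a, path_b):
    # Two paths are already hard links to one another when they share both
    # the device number and the inode number.
    stat_a = os.stat(path_a)
    stat_b = os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)

# In the third script this could guard the relinking step, for example:
# if already_hard_linked(original, path):
#     continue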

I did slightly alter the third file ("3.py"):

if sha1Checksum(path) == original_checksum:
    tmp_filename = path + ".deleteme"
    os.rename(path, tmp_filename)
    os.link(original, path)
    os.unlink(tmp_filename)
    print("Deleted {} ".format(path))

This makes sure that, in the case of a power failure or some other similar error, no files are lost, though a trailing ".deleteme" file is left behind. A recovery script should be quite trivial.
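
For completeness, a minimal sketch of such a recovery script (the backup_root path is an assumption; point it at the tree the scripts ran over):

import os

backup_root = "/volume1/backup_2tb_wd/"

for dirpath, dirnames, filenames in os.walk(backup_root):
    for filename in filenames:
        if filename.endswith(".deleteme"):
            leftover = os.path.join(dirpath, filename)
            original_path = leftover[:-len(".deleteme")]
            if os.path.exists(original_path):
                # The new hard link was created before the crash,
                # so the leftover copy is redundant.
                os.unlink(leftover)
            else:
                # The crash happened between the rename and the link;
                # put the file back under its original name.
                os.rename(leftover, original_path)
            print("recovered: {0}".format(original_path))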

Shallot answered 14/8, 2014 at 14:31 Comment(1)
Wow, nice, that was probably the first bit of code of mine running on somebody else's box. Hope it helped! – Splanchnology

Why not compare the files byte for byte instead of computing a second checksum? One time in a billion, two checksums might accidentally match, but a direct comparison shouldn't fail. It shouldn't be slower, and it might even be faster. It could be slower when there are more than two files and you have to read the original once for each of the others; if you really wanted to, you could get around that by comparing blocks of all the files at once.

EDIT:

I don't think it would require more code, just different code. Something like this for the loop body:

data1 = fh1.read(8192)
data2 = fh2.read(8192)
if data1 != data2: return False
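
Fleshed out into a self-contained function (a sketch; the function name and the size short-circuit are mine):

import os

def files_identical(path_a, path_b, chunk_size=8192):
    # Cheap short-circuit: files of different sizes can never be identical.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, 'rb') as fh1, open(path_b, 'rb') as fh2:
        while True:
            data1 = fh1.read(chunk_size)
            data2 = fh2.read(chunk_size)
            if data1 != data2:
                return False
            if not data1:            # both files exhausted at the same point
                return True
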
Inarch answered 22/6, 2013 at 2:8 Comment(1)
Otherwise it looks okay to me, but I'm not familiar with all the APIs you used; I am making an educated guess as to what they do. – Inarch

Note: if you're not wedded to Python, there are existing tools that do the heavy lifting for you:

https://unix.stackexchange.com/questions/3037/is-there-an-easy-way-to-replace-duplicate-files-with-hardlinks

Encounter answered 21/1, 2015 at 14:30 Comment(0)

How do you create a hard link?

On Linux you do:

sudo ln sourcefile linkfile

Sometimes this can fail (for me it fails sometimes). Also, your Python script needs to run with sudo.

So I use symbolic links:

ln -s sourcefile linkfile

I can check for them with os.path.islink.

You can call the commands like this in Python:

os.system("ln -s sourcefile linkfile")

or like this using subprocess:

import subprocess
subprocess.call(["ln", "-s", sourcefile, linkfile], shell = True)

Have a look at execution from command line and hard vs. soft links
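
The standard library can also create both kinds of link directly, without shelling out to ln; a small sketch with placeholder file names:

import os

os.link("sourcefile", "hardlink_name")      # hard link, like "ln sourcefile hardlink_name"
os.symlink("sourcefile", "symlink_name")    # symbolic link, like "ln -s sourcefile symlink_name"
print(os.path.islink("symlink_name"))       # True only for the symbolic link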

When it works, could you post your whole code? I would like to use it, too.

Belsky answered 22/6, 2013 at 9:18 Comment(2)
Thanks! I decided against soft links, because I don't know where each file actually should be. I'll try to tidy up manually later, but for now I really need the space. With a hard link it doesn't matter which 'file' I delete, but with soft links I cannot delete the 'true files' if I want to keep the data. Also, I think some of my photo-editing software wouldn't like soft links. But I think you are right: creating a link might fail, and I should add an exception handler for when it does. I don't need to use sudo, because I'm running as root. There is nothing except the photos on there. – Splanchnology
Soft links are dangerous for this scenario, because deleting one backup would break all other backups that contain the same file. You also do not need to run as root to create hard links. – Shallot
