Martin's answer is correct, but in my case I wanted to ignore the last modified date of each file in the tar as well, so that even if a file was "modified" but with no actual changes, it still has the same hash.
When creating the tar, I can override values I don't care about so they are always the same.
The example below shows that with a plain tar.bz2, re-creating the source file with a new modification timestamp changes the hash (archives 1 and 2 match; after re-creation, 4 differs). However, if I set the time to Unix epoch 0 (or any other fixed time), the archives all hash the same (3, 5 and 6).
To do this you need to pass a filter function to tar.add(DIR, filter=tarInfoStripFileAttrs) that removes the desired fields, as in the example below:
import tarfile, time, os

def createTestFile():
    with open(DIR + "/someFile.txt", "w") as file:
        file.write("test file")

# Takes in a TarInfo and returns the modified TarInfo:
# https://docs.python.org/3/library/tarfile.html#tarinfo-objects
# intended to be passed as a filter to TarFile.add
# https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.add
def tarInfoStripFileAttrs(tarInfo):
    # set time to epoch timestamp 0, aka 00:00:00 UTC on 1 January 1970
    # note that when extracting this tarfile, this time will be shown as the modified date
    tarInfo.mtime = 0
    # file permissions, probably don't want to remove this, but for some use cases you could
    # tarInfo.mode = 0
    # user/group info
    tarInfo.uid = 0
    tarInfo.uname = ''
    tarInfo.gid = 0
    tarInfo.gname = ''
    # stripping paxheaders may not be required
    # see https://mcmap.net/q/619482/-paxheaders-in-tarball
    tarInfo.pax_headers = {}
    return tarInfo

# COMPRESSION_TYPE = "gz" # hashes still differ even with the filter (the gzip header carries its own timestamp)
COMPRESSION_TYPE = "bz2"
DIR = "toTar"
if not os.path.exists(DIR):
    os.mkdir(DIR)
createTestFile()
tar1 = tarfile.open("one.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar1.add(DIR)
tar1.close()
tar2 = tarfile.open("two.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar2.add(DIR)
tar2.close()
tar3 = tarfile.open("three.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar3.add(DIR, filter=tarInfoStripFileAttrs)
tar3.close()
# Overwrite the file with the same content, but an updated time
time.sleep(1)
createTestFile()
tar4 = tarfile.open("four.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar4.add(DIR)
tar4.close()
tar5 = tarfile.open("five.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar5.add(DIR, filter=tarInfoStripFileAttrs)
tar5.close()
tar6 = tarfile.open("six.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar6.add(DIR, filter=tarInfoStripFileAttrs)
tar6.close()
$ md5sum one.tar.bz2 two.tar.bz2 three.tar.bz2 four.tar.bz2 five.tar.bz2 six.tar.bz2
0e51c97a8810e45b78baeb1677c3f946 one.tar.bz2 # same as 2
0e51c97a8810e45b78baeb1677c3f946 two.tar.bz2 # same as 1
54a38d35d48d4aa1bd68e12cf7aee511 three.tar.bz2 # same as 5/6
22cf1161897377eefaa5ba89e3fa6acd four.tar.bz2 # would be same as 1/2, but timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511 five.tar.bz2 # same as 3, even though timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511 six.tar.bz2 # same as 3, even though timestamp has changed
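If you would rather do the comparison from Python than with md5sum, here is a minimal sketch of the same check. The names strip_attrs, archive_digest and run_demo are hypothetical helpers, not part of the script above; it uses SHA-256 instead of MD5 and os.utime instead of time.sleep to change the timestamp, and it works in a temporary directory so it does not touch your files.

```python
import hashlib
import os
import tarfile
import tempfile


def strip_attrs(tar_info):
    # Hypothetical trimmed-down filter: normalise the fields that vary between runs.
    tar_info.mtime = 0
    tar_info.uid = tar_info.gid = 0
    tar_info.uname = tar_info.gname = ""
    return tar_info


def archive_digest(src_dir, out_path, tar_filter=None):
    # Write a bz2-compressed tar of src_dir and return the SHA-256 of the archive.
    with tarfile.open(out_path, "w:bz2") as tar:
        tar.add(src_dir, arcname="toTar", filter=tar_filter)
    with open(out_path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def run_demo():
    with tempfile.TemporaryDirectory() as work:
        src = os.path.join(work, "toTar")
        os.mkdir(src)
        target = os.path.join(src, "someFile.txt")
        with open(target, "w") as fh:
            fh.write("test file")
        os.utime(target, (1_000_000_000, 1_000_000_000))

        plain_before = archive_digest(src, os.path.join(work, "a.tar.bz2"))
        stripped_before = archive_digest(
            src, os.path.join(work, "b.tar.bz2"), strip_attrs
        )

        # Same content, but a different modification time.
        os.utime(target, (1_500_000_000, 1_500_000_000))

        plain_after = archive_digest(src, os.path.join(work, "c.tar.bz2"))
        stripped_after = archive_digest(
            src, os.path.join(work, "d.tar.bz2"), strip_attrs
        )
        return plain_before == plain_after, stripped_before == stripped_after


if __name__ == "__main__":
    same_plain, same_stripped = run_demo()
    print("plain archives match:", same_plain)
    print("filtered archives match:", same_stripped)
```

The first value should come out False and the second True, mirroring the 1/2 vs 4 and 3/5/6 results above.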
You may want to tweak which params are modified and how in your filter function based on your use case.
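On the "gz" case that is commented out in the script: the gzip header stores its own timestamp, so the filter alone cannot make the output stable. One possible workaround, which I have not verified beyond a simple case like this one, is to construct the gzip stream yourself with gzip.GzipFile(mtime=0) and write an uncompressed tar into it. The helper names build_tar_gz and demo_gz are hypothetical, and the archive is built in memory purely for illustration.

```python
import gzip
import io
import os
import tarfile
import tempfile


def strip_attrs(tar_info):
    # Hypothetical trimmed-down filter: normalise the fields that vary between runs.
    tar_info.mtime = 0
    tar_info.uid = tar_info.gid = 0
    tar_info.uname = tar_info.gname = ""
    return tar_info


def build_tar_gz(src_dir):
    # Return the bytes of a .tar.gz whose gzip header has a fixed timestamp.
    buffer = io.BytesIO()
    # mtime=0 fixes the gzip header timestamp; filename="" keeps any file
    # name out of the header as well.
    with gzip.GzipFile(filename="", mode="wb", fileobj=buffer, mtime=0) as gz:
        # Plain "w" mode: the tar stream is uncompressed, gzip wraps it.
        with tarfile.open(fileobj=gz, mode="w") as tar:
            tar.add(src_dir, arcname="toTar", filter=strip_attrs)
    return buffer.getvalue()


def demo_gz():
    with tempfile.TemporaryDirectory() as work:
        src = os.path.join(work, "toTar")
        os.mkdir(src)
        target = os.path.join(src, "someFile.txt")
        with open(target, "w") as fh:
            fh.write("test file")
        first = build_tar_gz(src)
        # Same content, but a different modification time.
        os.utime(target, (1_500_000_000, 1_500_000_000))
        second = build_tar_gz(src)
        return first, second


if __name__ == "__main__":
    first, second = demo_gz()
    print("gz archives match:", first == second)
```

To write the result to disk, pass an open binary file as fileobj instead of the BytesIO buffer.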
os.system("tar -c ./bin |gzip -n >one.tar.gz") – Grindstone
mtime argument to gzip.GzipFile()? – Conclusion
tarfile that is much less trivially converted to shell commands... tar IS NOT PORTABLE. Maybe, like me, the OP is using tarfile for portability reasons. And having to manually construct a GzipFile vs. using :gz is a pain. – Eustacia