Compute hash of only the core image data (excluding metadata) for an image
I'm writing a script to calculate the MD5 sum of an image excluding the EXIF tag.

In order to do this accurately, I need to know where the EXIF tag is located in the file (beginning, middle, end) so that I can exclude it.

How can I determine where in the file the tag is located?

The images that I am scanning are in the format TIFF, JPG, PNG, BMP, DNG, CR2, NEF, and some videos MOV, AVI, and MPG.

Behrens answered 9/4, 2012 at 14:53 Comment(5)
What are you trying to achieve?Fleshings
Trying to efficiently create a hash of an image that does not change when the EXIF data is edited. (ImageMagick has a visual sum function, but this is very slow.)Behrens
Check the specification kodak.com/global/plugins/acrobat/en/service/digCam/…Bova
Note that you probably don't want to just exclude EXIF, but include only "core" parts of the image. Many software packages (such as photo organizers) will add their metadata to the file, if the format supports private data chunks.Schism
possible duplicate of Unique image hash that does not change if EXIF info updatedStaple

One simple way to do it is to hash the core image data. For PNG, you could do this by hashing only the "critical chunks" (i.e. the ones whose type codes start with capital letters). JPEG has a similar but simpler file structure.

The visual hash in ImageMagick decompresses the image as it hashes it. In your case, you could hash the compressed image data right away, so (if implemented correctly) it should be just as quick as hashing the raw file.

This is a small Python script illustrating the idea. It may or may not work for you, but it should at least give an indication of what I mean :)

import struct
import os
import hashlib

def png(fh):
    hash = hashlib.md5()
    assert fh.read(8)[1:4] == b"PNG"
    while True:
        try:
            # chunk length is a big-endian unsigned 32-bit integer
            length, = struct.unpack(">I", fh.read(4))
        except struct.error:
            break  # end of file
        if fh.read(4) == b"IDAT":
            hash.update(fh.read(length))
            fh.read(4)  # skip the CRC
        else:
            fh.seek(length + 4, os.SEEK_CUR)  # skip chunk data and CRC
    print("Hash: %r" % hash.digest())

def jpeg(fh):
    hash = hashlib.md5()
    assert fh.read(2) == b"\xff\xd8"
    while True:
        marker, length = struct.unpack(">2H", fh.read(4))
        assert marker & 0xff00 == 0xff00
        if marker == 0xFFDA:  # start of scan: hash everything that follows
            hash.update(fh.read())
            break
        else:
            fh.seek(length - 2, os.SEEK_CUR)  # skip the rest of the segment
    print("Hash: %r" % hash.digest())


if __name__ == '__main__':
    # files must be opened in binary mode
    with open("sample.png", "rb") as fh:
        png(fh)
    with open("sample.jpg", "rb") as fh:
        jpeg(fh)
Schism answered 9/4, 2012 at 15:1 Comment(4)
Thanks -- how do I count the critical chunks? Any sample code is greatly appreciated.Behrens
Going back to the answer. How can I extend this to work for TIFF, CR2, DNG, MOV, and AVI files? Or more generally, any suggestions on how to find patterns inside the file to see where the critical chunks begin?Behrens
If you are using Python, you are probably like me: willing to trade CPU time for development speed any time. If so, implementing the hash function for a single format and converting every other format to that one seems reasonable; deal with performance problems "a posteriori" (if you ever find any problem at all).Fibrin
@Behrens The file formats you mention (except CR2 and DNG which I have no idea about) are built essentially the same way as PNG in my code example, so you should be able to use the same approach.Schism
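To illustrate the comment above: AVI is a RIFF container built from FOURCC-tagged, length-prefixed chunks, much like PNG, so the same walk-and-skip approach applies. Below is a hedged sketch (the function name is mine, and this is a simplified walker, not a complete RIFF parser) that hashes only the "movi" LIST, which holds the actual audio/video data, while skipping the chunks where metadata such as INFO tags lives:

```python
import struct
import hashlib

def avi_movi_hash(fh):
    """Hash only the 'movi' LIST of a RIFF/AVI file (the A/V data),
    skipping all other chunks. A sketch, not a complete RIFF parser."""
    assert fh.read(4) == b"RIFF"
    fh.read(4)                          # total file size, not needed here
    assert fh.read(4) == b"AVI "
    h = hashlib.md5()
    while True:
        header = fh.read(8)
        if len(header) < 8:
            break                       # end of file
        fourcc, size = struct.unpack("<4sI", header)
        if fourcc == b"LIST" and fh.read(4) == b"movi":
            h.update(fh.read(size - 4))  # hash the A/V payload
        elif fourcc == b"LIST":
            fh.seek(size - 4, 1)         # skip other LISTs (hdrl, INFO, ...)
        else:
            fh.seek(size, 1)             # skip plain chunks (e.g. JUNK)
        if size & 1:
            fh.read(1)                   # chunk data is word-aligned
    return h.hexdigest()
```

Editing the INFO metadata then leaves the hash unchanged, because only the "movi" payload ever reaches the digest.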

It is much easier to use the Python Imaging Library to extract the picture data (example in IPython):

In [1]: from PIL import Image

In [2]: import hashlib

In [3]: im = Image.open('foo.jpg')

In [4]: hashlib.md5(im.tobytes()).hexdigest()
Out[4]: '171e2774b2549bbe0e18ed6dcafd04d5'

This works on any type of image that PIL can handle. The tobytes method returns a bytes object containing the pixel data.
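A quick sketch of why this helps (file names here are hypothetical, and Pillow's `Image.Exif` and the `exif=` save argument are assumed, as in recent Pillow versions): saving the same picture with different EXIF data leaves the pixel hash unchanged, while a whole-file hash changes.

```python
from PIL import Image
import hashlib

im = Image.new("RGB", (64, 64), "red")
exif = Image.Exif()
exif[271] = "SomeCamera"                    # tag 271 = Make

im.save("plain.jpg")                        # no EXIF
im.save("tagged.jpg", exif=exif.tobytes())  # same pixels, extra EXIF segment

pixel = lambda p: hashlib.md5(Image.open(p).tobytes()).hexdigest()
whole = lambda p: hashlib.md5(open(p, "rb").read()).hexdigest()

print(pixel("plain.jpg") == pixel("tagged.jpg"))  # pixel hashes match
print(whole("plain.jpg") == whole("tagged.jpg"))  # file hashes differ
```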

BTW, the MD5 hash is now seen as pretty weak. Better to use SHA512:

In [6]: hashlib.sha512(im.tobytes()).hexdigest()
Out[6]: '6361f4a2722f221b277f81af508c9c1d0385d293a12958e2c56a57edf03da16f4e5b715582feef3db31200db67146a4b52ec3a8c445decfc2759975a98969c34'

On my machine, calculating the MD5 checksum for a 2500x1600 JPEG takes around 0.07 seconds. Using SHA512, it takes 0.10 seconds. Complete example:

#!/usr/bin/env python3

from PIL import Image
import hashlib
import sys

im = Image.open(sys.argv[1])
print(hashlib.sha512(im.tobytes()).hexdigest(), end="")

For movies, you can extract frames from them with e.g. ffmpeg, and then process them as shown above.
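For container formats that PIL itself can read frame by frame (an animated GIF, say), the same idea can be sketched without ffmpeg; the helper name below is mine, not part of PIL, and frames pulled out of a video with ffmpeg could be fed through the same loop:

```python
from PIL import Image, ImageSequence
import hashlib

def frames_hash(path):
    """Hash the decoded pixel data of every frame in a
    multi-frame image, in order."""
    h = hashlib.sha512()
    with Image.open(path) as im:
        for frame in ImageSequence.Iterator(im):
            # normalize to RGB so palette frames hash consistently
            h.update(frame.convert("RGB").tobytes())
    return h.hexdigest()
```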

Warr answered 29/8, 2012 at 10:33 Comment(8)
Please note that MD5 is weak with regard to hash collisions. For quickly checking if a file has changed (after which you can do a byte-per-byte check), it is still an excellent and very quick algorithm.Verdellverderer
On my machine, creating the MD5 checksum of a 2560x1600 JPEG file as shown above takes around 0.07 seconds. Using SHA512 takes around 0.10 seconds. Not a huge difference. The SHA256 checksum takes the longest, around 0.14 seconds. From a human perspective, all are fast. I'll stand by my recommendation to use SHA512.Warr
You have convinced me. Thank you for doing the research.Verdellverderer
@gbin: Video is just a sequence of pictures, cleverly encoded. So extract frames (e.g. the key frames) from the video with e.g. ffmpeg, and then process them as pictures as outlined above.Warr
@RolandSmith I would say "it depends": decoding a video is not a deterministic process (it is not universal; even between versions of decoders it won't decode the same, and you can even change the settings of the same decoder, etc.). Comparing the encoded binary stream proves it is the same stream, with quite high probability that it was encoded once by a person despite the metadata changes in between. My 2 cents from the digital forensics world... So I feel my solution is by far more exact: strip the metadata completely and binary-compare the rest. Fast and sure.Heger
@gbin: But wouldn't your argument that video encoding isn't deterministic also necessarily invalidate your suggestion that comparing binary streams works?Warr
@RolandSmith hence my "it depends": it matters whether your goal is to compare a binary "source" of the media somebody published against a republication of the same media with altered metadata, OR whether your goal is to check if it has been re-encoded from the same source. The second case won't work by comparing the decoded stream for sure; you need a far more complex algorithm with image distance hashing etc., somewhat like the one used in Google Images.Heger
@RolandSmith a 30% to 100% difference in hashing time really adds up when you have 100k images to hash. MD5 is unsuitable for password hashing, but when hashing images you have to consider the risk of collisions very small. You are not hashing random blobs of data, but image files. What are the odds that another file with the exact same hash also happens to be a valid image file of the same type and without visually obvious artefacts?Microeconomics

You can use stream which is part of the ImageMagick suite:

$ stream -map rgb -storage-type short image.tif - | sha256sum
d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64  -

or

$ sha256sum <(stream -map rgb -storage-type short image.tif -)
d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64  /dev/fd/63

This example is for a TIFF file which is RGB with 16 bits per sample (i.e. 48 bits per pixel), so I map to rgb and use a short storage type (you can use char here if the RGB values are 8 bits).

This method reports the same signature hash that the verbose ImageMagick identify command reports:

$ identify -verbose image.tif | grep signature
signature: d39463df1060efd4b5a755b09231dcbc3060e9b10c5ba5760c7dbcd441ddcd64

(for ImageMagick v6.x; the hash reported by identify in version 7 is different from that obtained using stream, but the latter may be reproduced by any tool capable of extracting the raw bitmap data, such as dcraw for some image types.)

Menispermaceous answered 23/2, 2018 at 10:44 Comment(6)
Do you know a way to extract from the image the correct parameters for the -map and -storage-type options? To just get a hash given only the type of hash and the file. Also, as you noticed, it is not clear how to reproduce the identify signature; they even say it will change in v7.Softboiled
The only way I can think of is to interpret the metadata. Unfortunately it isn't consistent across file types. I just tried a TIFF (exiftool -PhotometricInterpretation -BitsPerSample) whereas for a JPEG you'd need to use -ColorMode. Regarding the ImageMagick hash change, that only applies to the value reported by identify, not to the output of stream illustrated above. I was able to generate the same hash as stream using dcraw on some linear DNG files that I had, so I was happy enough that the stream output was reproducible using another tool.Menispermaceous
From this possible duplicate: On PHP docs: Imagemagick Signature Generates an SHA-256 message digest for the image pixel stream. Seems "the" answer, but still pending an alternative clear way to get same signature.Softboiled
@PabloBianchi if the library that PHP uses invokes the same code as the identify command then it will be affected by the v7 algorithm change. The output of stream, however, really is the image pixel stream which, prior to v7, identify also used, but that isn't the case with v7. I have cross-checked the output of stream using another tool (dcraw) so I'm pretty confident that it is the raw stream data. Unfortunately, since v7, the signature calculated by ImageMagick is no longer a SHA-256 message digest of the image pixel stream.Menispermaceous
identify -verbose works as promised (the signature is the same for the same image with different metadata), but it is four times slower than streamVociferance
identify -format '%#' reports just the hash and in my testing is four to nine times as fast as identify -verbose. Cheers :)Runkel

I would use a metadata stripper to preprocess the files before hashing:

From the ImageMagick package you have:

mogrify -strip blah.jpg

and if you do

identify -list format 

it apparently works with all the cited formats.

Heger answered 29/8, 2012 at 21:4 Comment(1)
This solution works for all the video & image formats, which was a requirement for the bounty, no?Heger
