Algorithm for determining a file's identity

For an open source project of mine, I am writing an abstraction layer on top of the filesystem.

This layer allows me to attach metadata and relationships to each file.

I would like the layer to handle renames gracefully and maintain the metadata if a file is renamed, moved, or copied.

To do this I will need a mechanism for calculating the identity of a file. The obvious solution is to calculate an SHA1 hash for each file and then assign metadata against that hash. But ... that is really expensive, especially for movies.

So, I have been thinking of an algorithm that, though not 100% correct, will be right the vast majority of the time and is cheap.

One such algorithm could be to use file size and a sample of bytes for that file to calculate the hash.

Which bytes should I choose for the sample? How do I keep the calculation cheap and reasonably accurate? I understand there is a tradeoff here, but performance is critical. And the user will be able to handle situations where the system makes mistakes.

I need this algorithm to work for very large files (1 GB+) as well as tiny files (5 KB).
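
To make this concrete, something along the lines of the sketch below is what I have in mind (purely illustrative; the chunk size and sample count are placeholders):

    import hashlib
    import os

    def sampled_identity(path, chunk=64 * 1024, samples=4):
        # Cheap, approximate identity: file size plus a SHA1 of a few sampled chunks.
        size = os.path.getsize(path)
        h = hashlib.sha1(str(size).encode())      # the size is part of the identity
        with open(path, "rb") as f:
            if size <= chunk * samples:
                h.update(f.read())                # small files: just hash everything
            else:
                for i in range(samples):          # spread the samples across the file
                    f.seek(i * (size - chunk) // (samples - 1))
                    h.update(f.read(chunk))
        return h.hexdigest()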

EDIT

I need this algorithm to work on NTFS and all SMB shares (Linux- or Windows-based). I would like it to support situations where a file is copied from one spot to another (the 2 physical copies are treated as one identity). I may even consider wanting this to work in situations where MP3s are re-tagged (the physical file is changed), so I may have an identity provider per filetype.

EDIT 2

Related question: Algorithm for determining a file’s identity (Optimisation)

Think asked 19/1, 2009 at 0:5 Comment(0)

How about storing some random integers r_i, and looking up the bytes at positions (r_i mod n), where n is the size of the file? For files with headers, you can ignore them first and then do this process on the remaining bytes.

If your files are actually pretty different (not just a difference in a single byte somewhere, but say at least 1% different), then a random selection of bytes would notice that. For example, with a 1% difference in bytes, 100 random bytes would fail to notice with probability (0.99)^100 ≈ 1/e ≈ 37%; increasing the number of bytes you look at makes this probability go down exponentially.

The idea behind using random bytes is that they are essentially guaranteed (well, probabilistically speaking) to be as good as any other sequence of bytes, except they aren't susceptible to some of the problems with other sequences (e.g. happening to look at every 256-th byte of a file format where that byte is required to be 0 or something).

Some more advice:

  • Instead of grabbing bytes, grab larger chunks to justify the cost of seeking.
  • I would suggest always looking at the first block or so of the file. From this, you can determine filetype and such. (For example, you could use the file program.)
  • At least weigh the cost/benefit of something like a CRC of the entire file. It's not as expensive as a real cryptographic hash function, but still requires reading the entire file. The upside is it will notice single-byte differences.
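
Putting the random offsets and the chunk advice together, a minimal sketch might look like this (the offset count, chunk size, and seed are arbitrary choices):

    import hashlib
    import os
    import random

    # Fix the random offsets once (and persist them) so every file is sampled at
    # the same relative positions each time it is hashed.
    OFFSETS = random.Random(42).sample(range(1_000_000_007), 100)

    def random_sample_hash(path, chunk=4096):
        size = os.path.getsize(path)
        h = hashlib.sha1(str(size).encode())
        if size == 0:
            return h.hexdigest()                  # all empty files share one identity
        with open(path, "rb") as f:
            for r in OFFSETS:
                f.seek(r % size)                  # byte position r_i mod n
                h.update(f.read(chunk))           # grab a chunk rather than one byte
        return h.hexdigest()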
Littleton answered 19/1, 2009 at 0:57 Comment(1)
Awesome, I think something along these lines could work with a few changes. The thing is, since the largest cost is seeking, you probably want to read more than one byte at each random point, since you are already there.Think

Bucketing with multiple layers of comparison should be the fastest and most scalable approach across the range of files you're discussing.

First level of indexing is just the length of the file.

Second level is a hash. Below a certain size it is a whole-file hash; beyond that, yes, I agree with your idea of a sampling algorithm (a rough sketch follows the list below). Issues that I think might affect the sampling speed:

  1. To avoid hitting regularly spaced headers, which may be highly similar or identical, you need to step by a non-conforming amount, e.g. successive primes or multiples of a prime.
  2. Avoid step sizes that might keep landing on regular record headers; if you are getting the same value from your sample bytes despite different locations, try adjusting the step by another prime.
  3. Cope with anomalous files containing large stretches of identical values, either because they are unencoded images or just filled with nulls.
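
A rough sketch of the two levels with a prime stride (the size threshold, chunk size, and prime are arbitrary, and points 2 and 3 above are left out):

    import hashlib
    import os

    SMALL = 1 << 20      # below ~1 MB: hash the whole file
    CHUNK = 4096
    STRIDE = 4099        # a prime number of chunks between samples

    def bucket_key(path):
        size = os.path.getsize(path)              # first level: the file length
        h = hashlib.sha1()
        with open(path, "rb") as f:
            if size <= SMALL:
                h.update(f.read())                # second level: whole-file hash
            else:
                for i in range(0, size // CHUNK + 1, STRIDE):
                    f.seek(i * CHUNK)             # sample every 4099th chunk, so
                    h.update(f.read(CHUNK))       # regular headers rarely line up
        return (size, h.hexdigest())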
Croupier answered 19/1, 2009 at 16:36 Comment(3)
All good points. I'd also say you'd need to do a lot of empirical analysis to solve problems like those mentioned in 2 and 3.Scuppernong
Err, primes have no non-trivial factors.Morion
Badly worded, "factors of a prime" indeed (blush). I think I meant stepping by successive primes or multiples of an odd prime. My math is better than that (just).Croupier

Hash the first 128 KB, another 128 KB at the 1 MB mark, another 128 KB at the 10 MB mark, another 128 KB at the 100 MB mark, another 128 KB at the 1000 MB mark, and so on. As the file sizes get larger, and it becomes more likely that you'll be able to distinguish two files based on their size alone, you hash a smaller and smaller fraction of the data. Everything under 128 KB is hashed in full.
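
A small sketch of that schedule (treat the function and constants as illustrative only):

    import hashlib
    import os

    BLOCK = 128 * 1024

    def staged_hash(path):
        size = os.path.getsize(path)
        h = hashlib.sha1(str(size).encode())
        with open(path, "rb") as f:
            mark = 0                              # 0, then the 1 MB, 10 MB, ... marks
            while mark < size:
                f.seek(mark)
                h.update(f.read(BLOCK))           # 128 KB at each mark
                mark = 1 << 20 if mark == 0 else mark * 10
        return h.hexdigest()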

Maximo answered 19/1, 2009 at 1:11 Comment(2)
+1. Locality of filesystem reads is important for reasonable performance -- reading a few big chunks will be much faster than reading the same number of bytes, scattered throughout the file.Incarnate
Yep, from empirical tests the locality makes a massive difference.Think

Believe it or not, I use the ticks of the last write time for the file. It is as cheap as it gets, and I have yet to see a clash between different files.
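
For what it's worth, a rough equivalent using the nanosecond last-write time that os.stat() reports (illustrative only; the caveats discussed in the comments below apply):

    import os

    def timestamp_identity(path):
        # Nanosecond last-write time from os.stat(); extremely cheap, but it changes
        # whenever the file is modified, and different files can share a timestamp.
        return os.stat(path).st_mtime_ns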

Stockjobber answered 19/1, 2009 at 0:9 Comment(9)
The problem is this would not support renames ... and it seems mighty fragile.Think
Renaming shouldn't change the time the file was created. Anyway, I use it to detect if a file was overwritten by another (same name). I also store the file's hash, but the ticks tell me if it still is the same file.Eckhart
I see .. calculating a full file hash is going to be way too expensive for my purposes.Think
plus... you yourself said the words "cheap", "not 100% correct" and "critical performance". If you want to be precise, you have to calculate the hash.Eckhart
give it a try and let me know of the limitations, that will be useful information for me as well.Eckhart
Also, a big issue with the file time is that Linux filesystems don't store a created date; you only have a modified date. I need my algorithm to work with SMB shares of all types.Think
I thought you were in NTFS, I'm not sure about Linux.Eckhart
:) I keep on adding requirements .... when I write this layer I'll let you know if I find any serious limitations with the created date stuff.Think
+1, interesting idea. Be warned that on Windows, system files sometimes have their various dates set to specific times to indicate the time they were released. E.g. in my C:\WINDOWS directory, there are 21 files all having last-write date "04/08/2004 09:00 a.m.".Incarnate

If you can drop the Linux share requirement and confine yourself to NTFS, then NTFS Alternate Data Streams will be a perfect solution that:

  • doesn't require any kind of hashing;
  • survives renames; and
  • survives moves (even between different NTFS volumes).

You can read more about it here. Basically you just append a colon and a name for your stream (e.g. ":meta") and write whatever you like to it. So if you have a directory "D:\Movies\Terminator", write your metadata using normal file I/O to "D:\Movies\Terminator:meta". You can do the same if you want to save the metadata for a specific file (as opposed to a whole folder).
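
For illustration, a small sketch assuming Windows/NTFS and a hypothetical ":meta" stream name; Python's normal file I/O can address an alternate stream by appending the stream name to the path:

    import json

    def write_meta(path, meta):
        # "path:meta" addresses the "meta" alternate data stream of the file/folder.
        with open(path + ":meta", "w", encoding="utf-8") as stream:
            json.dump(meta, stream)

    def read_meta(path):
        with open(path + ":meta", "r", encoding="utf-8") as stream:
            return json.load(stream)

    # write_meta(r"D:\Movies\Terminator", {"rating": 5})   # works for files too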

If you'd prefer to store your metadata somewhere else and just be able to detect moves/renames on the same NTFS volume, you can use the GetFileInformationByHandle API call (see MSDN /en-us/library/aa364952(VS.85).aspx) to get the unique ID of the folder (combine VolumeSerialNumber and FileIndex members). This ID will not change if the file/folder is moved/renamed on the same volume.
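
If you are in Python rather than calling the Win32 API directly, recent versions are assumed here to surface the same information through os.stat(), with st_dev as the volume serial number and st_ino as the 64-bit file index (worth verifying on your target version):

    import os

    def file_id(path):
        st = os.stat(path)
        # (VolumeSerialNumber, FileIndex): stable across renames/moves on one volume,
        # subject to the FileIndex caveat quoted in the comments below.
        return (st.st_dev, st.st_ino)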

Buzz answered 14/8, 2009 at 22:42 Comment(2)
Sorry about the weird MSDN link - SO wouldn't allow me to post 2 hyperlinks since I'm a new user.Buzz
The documentation for GetFileInformationByHandle says: "nFileIndexLow: Low-order part of a unique identifier that is associated with a file. This value is useful ONLY WHILE THE FILE IS OPEN by at least one process. If no processes have it open, the index may change the next time the file is opened."Elevon

Well, first you need to look more deeply into how filesystems work. Which filesystems will you be working with? Most filesystems support things like hard links and soft links and therefore "filename" information is not necessarily stored in the metadata of the file itself.

Actually, this is the whole point of a stackable layered filesystem, that you can extend it in various ways, say to support compression or encryption. This is what "vnodes" are all about. You could actually do this in several ways. Some of this is very dependent on the platform you are looking at. This is much simpler on UNIX/Linux systems that use a VFS concept. You could implement your own layer on top of ext3, for instance, or what have you.

After reading your edits, a couple more things. Filesystems already do this, as mentioned before, using things like inodes. Hashing is probably going to be a bad idea, not just because it is expensive but because two or more preimages can share the same image; that is to say, two entirely different files can have the same hash value. I think what you really want to do is exploit the metadata that the filesystem already exposes. This would be simpler on an open source system, of course. :)

Scuppernong answered 19/1, 2009 at 0:9 Comment(4)
I'll be working with NTFS .. see my expanded question. Also I would like to support the file living on 2 separate file systems.Think
Re: "Two different files sharing the same hash", that is completely desirable in my situation ... if a movie exists twice in my filesystem, I would like to get the same identity for both files.Think
Well, what I mean is that you could potentially have a movie and a database log file or whatever share the same hash. Unless you are defining identity as "having the same hash value" then I don't think that helps.Scuppernong
Ok, yes I am defining identity as the SHA1 hash of a collection of bytes .... and looking for an algorithm that can sample the bytes (not read all of them) and produce a SHA1 hash that is right the vast majority of the time for (mp3/avi/ogg/doc/xml etc... files)Think

Which bytes should I choose for the sample?

I think I would try to use some fixed sequence like the Fibonacci numbers. These are easy to calculate, and they have a diminishing density: small files would get a higher sample ratio than big files, and the samples would still be spread across the whole file.
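
A quick sketch of sampling at Fibonacci offsets (illustrative; see the caveat in the comment below):

    import hashlib
    import os

    def fibonacci_sample_hash(path):
        size = os.path.getsize(path)
        h = hashlib.sha1(str(size).encode())
        a, b = 1, 2
        with open(path, "rb") as f:
            while a < size:                       # offsets 1, 2, 3, 5, 8, ... thin out
                f.seek(a)
                h.update(f.read(1))               # sample one byte at each offset
                a, b = b, a + b
        return h.hexdigest()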

Convex answered 19/1, 2009 at 0:57 Comment(1)
A problem with using something like the Fibonacci numbers is they can perform poorly for some file sizes (for example, the Fibonacci numbers mod 144 or 6765 have small period). Arithmetic progressions (a+n*b) can perform poorly for certain filetypes. But a "sufficiently random" sequence should work.Littleton

This work sounds like it could be more effectively implemented at the filesystem level or with some loose approximation of a version control system (both?).

To address the original question, you could keep a database of (file size, bytes hashed, hash) for each file and try to minimize the number of bytes hashed for each file size. Whenever you detect a collision you either have an identical file, or you increase the hash length to go just past the first difference.

There are undoubtedly optimizations to be made, and CPU vs. I/O tradeoffs as well, but it's a good start for something that won't produce false positives.
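
A hypothetical sketch of the collision-handling step: when the cheap key matches, hash progressively longer prefixes until the files either differ or are proven identical (the names and the starting length are made up):

    import hashlib
    import os

    def prefix_hash(path, nbytes):
        with open(path, "rb") as f:
            return hashlib.sha1(f.read(nbytes)).hexdigest()

    def same_identity(a, b, start=64 * 1024):
        size = os.path.getsize(a)
        if size != os.path.getsize(b):
            return False                          # different sizes: never identical
        nbytes = start
        while True:
            if prefix_hash(a, nbytes) != prefix_hash(b, nbytes):
                return False                      # found the first difference
            if nbytes >= size:
                return True                       # whole files compared: identical
            nbytes *= 2                           # increase the hash length and retry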

Existent answered 19/1, 2009 at 3:49 Comment(0)
