ire_and_curses's suggestion of using tar c <dir> has some issues:
- tar processes directory entries in the order in which they are stored in the filesystem, and it has traditionally offered no way to change this order (GNU tar 1.28 and later add a --sort=name option, but you cannot rely on it being available everywhere). This can yield completely different results for the "same" directory in different places.
- I usually care about whether the group ID and owner ID numbers are the same, not necessarily whether the string representations of group and owner are the same. This is in line with what, for example, rsync -a --delete --numeric-ids does: it synchronizes virtually everything (minus xattrs and ACLs) and transfers owner and group by numeric ID rather than by name (without --numeric-ids, rsync maps by name where it can). So if you synced to a different system that doesn't necessarily have the same users and groups, you should add the --numeric-owner flag to tar.
- tar will include the name of the directory you're checking in the archive, so the same contents under a differently named directory give a different checksum. Just something to be aware of.
As long as there is no fix for the first problem (or unless you're sure it does not affect you), I would not use this approach.
The proposed find-based solutions are also no good because they only include files, not directories, which becomes an issue if the checksum should take empty directories into account.
Finally, most suggested solutions don't sort consistently, because the collation might be different across systems.
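To make the collation point concrete, here is a minimal sketch (the sample letters are made up for illustration). In the C locale, sort compares raw bytes, so all uppercase ASCII letters come before the lowercase ones. Under a locale such as en_US.UTF-8 the same input is typically interleaved instead, which changes the order of the lines and therefore the final checksum.

```shell
# Sample input is arbitrary; only the resulting order matters.
# paste joins the sorted lines with spaces so the order is easy to read.
sorted_c=$(printf '%s\n' b A a B | LC_ALL=C sort | paste -sd' ' -)
echo "$sorted_c"   # A B a b
```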
This is the solution I came up with:
dir=<mydir>; (find "$dir" -type f -exec md5sum {} +; find "$dir" -type d) | LC_ALL=C sort | md5sum
Notes about this solution:
- The LC_ALL=C is there to ensure a reliable sorting order across systems.
- This doesn't differentiate between a directory "named\nwithanewline" and two directories "named" and "withanewline", but that seems very unlikely to occur. One usually fixes this with the -print0 flag for find, but since there's other stuff going on here, I can only see solutions that would make the command more complicated than it's worth.
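For repeated use, the one-liner can be wrapped in a small function. The sketch below is a variation rather than the exact command above: dir_md5 is a made-up name, and it runs find from inside the directory (via cd and .) so that the directory's own path does not end up in the checksum, which lets two copies in different locations compare equal. It assumes md5sum and mktemp are available.

```shell
# dir_md5 is a hypothetical helper name, not part of the original command.
dir_md5() (
    # The function body is a subshell, so the cd does not affect the caller.
    cd "$1" || exit 1
    {
        find . -type f -exec md5sum {} +   # one "hash  path" line per file
        find . -type d                     # directory names, including empty ones
    } | LC_ALL=C sort | md5sum
)

# Demo on two throwaway trees with identical contents.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/sub" "$tmp/b/sub"
echo hello > "$tmp/a/sub/file"
echo hello > "$tmp/b/sub/file"
sum_a=$(dir_md5 "$tmp/a")
sum_b=$(dir_md5 "$tmp/b")

# Adding an empty directory changes the result, because directory
# names are part of the hashed listing.
mkdir "$tmp/b/empty"
sum_b_empty=$(dir_md5 "$tmp/b")
rm -rf "$tmp"
```

Here sum_a and sum_b should match, while sum_b_empty should differ from both.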
PS: one of my systems uses a limited busybox find which supports neither the -exec nor the -print0 flag, and which also appends '/' to denote directories, while findutils find doesn't seem to, so for this machine I need to run:
dir=<mydir>; (find "$dir" -type f | while IFS= read -r f; do md5sum "$f"; done; find "$dir" -type d | sed 's#/$##') | LC_ALL=C sort | md5sum
Luckily, I have no files/directories with newlines in their names, so this is not an issue on that system.
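A side note on the while/read loop: a plain read strips leading and trailing whitespace and treats backslashes as escape characters, so a file name containing either would reach md5sum mangled. Using IFS= read -r avoids both. A minimal sketch, with a made-up name (a leading space and a literal backslash) standing in for a file path:

```shell
# The name below is arbitrary: one leading space, then a\tb with a literal backslash.
name=' a\tb'

# Plain read drops the leading space and removes the backslash.
plain=$(printf '%s\n' "$name" | while read f; do printf '%s' "$f"; done)

# IFS= keeps the whitespace, -r keeps the backslash.
safe=$(printf '%s\n' "$name" | while IFS= read -r f; do printf '%s' "$f"; done)

printf '[%s]\n[%s]\n' "$plain" "$safe"
```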