Compare checksum of files between two servers and report mismatch

4

8

I have to compare the checksums of all files in the /primary and /secondary folders on machineA with the files in the /bat/snap/ folder on the remote server machineB. The remote server will have lots of other files in addition to the ones we have on machineA.

  • If there is any checksum mismatch, I want to report all affected files on machineA with their full paths and exit with a non-zero status code.
  • If everything matches, exit with zero.

I wrote one command (not sure whether there is a better way to write it) that I am running on machineA, but it's very slow. Is there any way to make it faster?

(cd /primary && find . -type f -exec md5sum {} +; cd /secondary && find . -type f -exec md5sum {} +) | ssh machineB '(cd /bat/snap/ && md5sum -c)'

Also, it prints file names like this: ./abc_monthly_1536_proc_7.data: OK. Is there any way to make it print the full path of that file on machineA?

Running ssh to the remote host for every file definitely isn't very efficient. parallel could speed it up by processing more files concurrently, but the more efficient way is likely to tweak the command so that it connects to machineB once and gets all the md5sums in one shot. Is this possible?

Jesseniajessey answered 27/4, 2018 at 22:4 Comment(7)
To output the absolute paths, just give it the present working directory: find "$(pwd)" -type f...Screeching
Where should I add this in my command?Jesseniajessey
Instead of cd /primary && find . ..., just use find /full/path/primary. find does not care what your current directory is as long as you pass absolute paths.Waites
I see what you mean. Got it now. Also how can I make this command fast? Is there any way to do that? Or any better way to write it?Jesseniajessey
If /primary and /secondary are on different physical disks, you may be able to get a slight speedup by changing the ; before cd /secondary to a &. Otherwise you're already running at very close to max speed AFAICT.Footbridge
How fast do you need this to be? Are you seeing worse performance than shopt -s globstar; time md5sum /primary/**/* plus time md5sum /secondary/**/*?Footbridge
Two separate ideas for you (or anybody answering here): First, I'd try using rsync -ncav (rsync versions before 3.0 use MD4 for the file checksum and later ones MD5 or a newer hash, but more to the point it implements most if not all of what's needed here). If that doesn't work, my second try would be to compare file sizes before calculating MD5 (or perhaps cksum or CRC?); a size mismatch fails without needing to be checksummed.Service
C
4

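If your primary goal is to list differences rather than to compute checksums, a faster (and easier) way may be to run rsync with the --dry-run option. Any file it lists differs between the two sides. Note that without -c (--checksum) rsync decides by size and modification time only, so add -c if the comparison must be by content. For example: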

MBP:~ jhartman$ rsync -avr --dry-run rsync-test 192.168.1.100:/tmp/; echo $?
building file list ... done
rsync-test/file1.txt

sent 172 bytes  received 26 bytes  396.00 bytes/sec
total size is 90  speedup is 0.45

Of course, because of --dry-run, no files are changed on the target.
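
To get the "report and exit non-zero" behaviour from the question, the dry run can be wrapped in a small function. This is only a sketch: list_differing is a made-up helper name, it assumes rsync is installed on both ends, and it relies on the itemized (-i) output format in which lines for files to be transferred start with <f or >f.

```shell
# list_differing SRC DEST (hypothetical helper): print every file under SRC
# whose content differs from, or is missing in, DEST.
# SRC and DEST can be local directories or host:path.
# Exit status: 0 = all match, 1 = differences found, 2 = rsync failed.
list_differing() (
    # -r recurse, -n dry run, -c compare by checksum, -i itemize changes
    out=$(rsync -rnci "$1/" "$2/") || exit 2
    # Keep only regular-file transfer lines and strip the flag column.
    diffs=$(printf '%s\n' "$out" | sed -n 's/^[<>]f[^ ]* //p')
    [ -z "$diffs" ] && exit 0
    # Prefix each relative name with the source directory.
    printf '%s\n' "$diffs" | while IFS= read -r name; do
        printf '%s/%s\n' "$1" "$name"
    done
    exit 1
)
```

For example, list_differing /primary machineB:/bat/snap would print the full /primary/... paths of mismatching files. Extra files that exist only on machineB are not reported, because rsync only considers what is in the source.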

I hope it will help, Jarek

Conformation answered 20/5, 2018 at 18:47 Comment(0)
S
0

If the files are directly in /primary and /secondary rather than in subdirectories below them, lose the find. You may also wish to parallelize the md5 calculation. That would make it:

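#!/bin/bash
# Checksum both directories in parallel, each in its own subshell,
# so a failed cd cannot make md5sum run in the wrong directory.
(cd /primary && md5sum -- * > /tmp/file-p) &
(cd /secondary && md5sum -- * > /tmp/file-s) &
wait
# Verify both lists on machineB over a single ssh connection.
cat /tmp/file-p /tmp/file-s | ssh machineB 'cd /bat/snap/ && md5sum -c'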

With a relatively small set of files:

$ time find . -exec md5sum {} \;
7e74a9f865a91c5b56b5cab9709f1f36  ./file
631f01c98ff2016971fb1ea22be3c2cf  ./hosts
d41d8cd98f00b204e9800998ecf8427e  ./fortune8547
49d05af711e2d473f12375d720fb0a92  ./vboxdrv-Module.symvers
bf4b1d740f7151dea0f42f5e9e2b0c34  ./tmpavG1pB
a9b0d3af1b80a46b92dfe1ce56b2e85c  ./in.clean.4524

real    0m0.046s
user    0m0.035s
sys     0m0.006s
$ time md5sum *
7e74a9f865a91c5b56b5cab9709f1f36  file
d41d8cd98f00b204e9800998ecf8427e  fortune8547
631f01c98ff2016971fb1ea22be3c2cf  hosts
a9b0d3af1b80a46b92dfe1ce56b2e85c  in.clean.4524
bf4b1d740f7151dea0f42f5e9e2b0c34  tmpavG1pB
49d05af711e2d473f12375d720fb0a92  vboxdrv-Module.symvers

real    0m0.005s
user    0m0.003s
sys     0m0.002s

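(just to prove that find is not always the quickest; note that -exec ... \; starts one md5sum process per file, unlike the -exec ... + form used in the question).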

Shotten answered 30/4, 2018 at 15:57 Comment(1)
Hmm, it does look like find adds a bit. If recursion is needed, we can still take find out of the equation by doing shopt -s globstar and then md5sum /primary/**/*.Footbridge
A
0

You can ask md5sum to check files against a list of checksums given as input.

From man md5sum, the following two options are useful:

  • -c, --check: read MD5 sums from the FILEs and check them
  • --quiet : don't print OK for each successfully verified file
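
A small local demonstration of these two options, assuming GNU md5sum. The temporary directory and file names are made up for illustration, and LC_ALL=C is set so that the status text is not translated:

```shell
demo=$(mktemp -d)
printf 'hello\n' > "$demo/good.txt"
printf 'hello\n' > "$demo/bad.txt"

# Record the checksums, then change one of the files.
(cd "$demo" && md5sum good.txt bad.txt > sums.md5)
printf 'changed\n' > "$demo/bad.txt"

# With --quiet only the mismatch is printed; the summary warning goes to
# stderr, and the exit status is non-zero because a check failed.
result=$(cd "$demo" && LC_ALL=C md5sum -c --quiet sums.md5 2>/dev/null)
status=$?

echo "$result"          # bad.txt: FAILED
echo "status=$status"   # status=1
rm -rf "$demo"
```

So md5sum -c already provides the zero or non-zero exit status asked for in the question; what remains is turning the reported names into full paths.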

So all we need to do is build such a list and pass it on. The easiest way is the following (run from machineA):

$ cd /primary && md5sum * | ssh machineB 'cd /bat/snap && md5sum -c --quiet - 2>/dev/null'
$ cd /secondary && md5sum * | ssh machineB 'cd /bat/snap && md5sum -c --quiet - 2>/dev/null'

This will report failures like this:

file1: FAILED
file2: FAILED open or read

This will give you all the failed files per directory. You can do any post processing later on with any flavour of awk.
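
One possible shape for that post-processing, as a sketch: report_failures is a hypothetical helper that reads the md5sum -c --quiet output, prints each failed name with a directory prefix of your choice, and returns non-zero if anything failed. It assumes the English status text shown above and file names without embedded newlines.

```shell
# report_failures PREFIX (hypothetical helper): read `md5sum -c --quiet`
# output on stdin, print PREFIX/name for every failed file,
# and return 1 if at least one failure was seen, 0 otherwise.
report_failures() {
    awk -v prefix="$1" '
        /: FAILED( open or read)?$/ {
            sub(/: FAILED( open or read)?$/, "")   # drop the status suffix
            sub(/^\.\//, "")                        # drop a leading ./
            print prefix "/" $0
            bad = 1
        }
        END { exit bad + 0 }
    '
}
```

It would be appended to the pipeline, for example ... | ssh machineB 'cd /bat/snap && md5sum -c --quiet - 2>/dev/null' | report_failures /primary. Since the function is the last command in the pipeline, its return value becomes the pipeline's exit status.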

Attract answered 3/5, 2018 at 9:32 Comment(0)
D
0

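You can try to parallelize the process mentioned in the other answer: change the + to \; and have bash run each md5sum in the background with &. Note that this starts one background process per file, with no upper limit.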

find "$(pwd)" -type f -exec bash -c 'md5sum "$1" &' _ {} \;
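
A bounded alternative, sketched here with xargs -P rather than background jobs: parallel_md5 is a made-up helper name, and the -0, -r and -P options are GNU/BSD extensions rather than POSIX. The batch size of 32 and the default of 4 jobs are arbitrary starting points.

```shell
# parallel_md5 DIR [JOBS] (hypothetical helper): checksum every regular file
# under DIR, running at most JOBS md5sum processes at a time (default 4).
parallel_md5() {
    # -print0 / -0 keep file names with spaces or newlines intact;
    # -n 32 hands each md5sum a batch of files instead of one file each;
    # -r skips running md5sum when find produces nothing.
    find "$1" -type f -print0 |
        xargs -0 -r -n 32 -P "${2:-4}" md5sum --
}
```

Output lines arrive in no particular order, so sort the result if order matters. Called with an absolute directory it prints absolute paths; to build a list for md5sum -c under /bat/snap, run it as (cd /primary && parallel_md5 .) so the names stay relative.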
Drysalter answered 8/8, 2018 at 18:3 Comment(0)
