I currently use md5_file()
to run through about 15 URLs and verify their MD5 hashes. Is there a way that I can make this faster? It takes far too long to run through all of them.
You're probably doing it sequentially right now, i.e. fetch data 1, process data 1, fetch data 2, process data 2, ..., and the bottleneck is most likely the data transfer.
You could use curl_multi_exec() to parallelize that a bit.
Either register a CURLOPT_WRITEFUNCTION and process each chunk of data as it arrives (tricky, since md5() needs the complete data in one string, so you need incremental hashing).
Or check for curl handles that are already finished and then process the data of that handle.
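The first variant relies on incremental hashing, which PHP's hash extension supports via hash_init()/hash_update()/hash_final(). A minimal standalone sketch of that mechanism, feeding fixed-size string chunks in place of curl's write callback:

```php
<?php
// Feed data to an MD5 context chunk by chunk; the final digest is
// identical to hashing the whole string at once with md5().
$ctx = hash_init('md5');
foreach (str_split('hello world', 4) as $chunk) {
    hash_update($ctx, $chunk);
}
var_dump(hash_final($ctx) === md5('hello world')); // bool(true)
```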
edit: a quick & dirty example using the hash extension (which provides functions for incremental hashing) and a PHP 5.3+ closure:
$urls = array(
    'http://stackoverflow.com/',
    'http://sstatic.net/so/img/logo.png',
    'http://www.gravatar.com/avatar/212151980ba7123c314251b185608b1d?s=128&d=identicon&r=PG',
    'http://de.php.net/images/php.gif'
);

$data = array();
$fnWrite = function($ch, $chunk) use (&$data) {
    // find the hash context belonging to this curl handle and feed it the chunk
    foreach ($data as $d) {
        if ($ch === $d['curlrc']) {
            hash_update($d['hashrc'], $chunk);
        }
    }
    // a CURLOPT_WRITEFUNCTION must return the number of bytes it handled,
    // otherwise curl aborts the transfer
    return strlen($chunk);
};

$mh = curl_multi_init();
foreach ($urls as $u) {
    $current = curl_init();
    curl_setopt($current, CURLOPT_URL, $u);
    curl_setopt($current, CURLOPT_RETURNTRANSFER, 0);
    curl_setopt($current, CURLOPT_HEADER, 0);
    curl_setopt($current, CURLOPT_WRITEFUNCTION, $fnWrite);
    curl_multi_add_handle($mh, $current);
    $hash = hash_init('md5');
    $data[] = array('url' => $u, 'curlrc' => $current, 'hashrc' => $hash);
}

$active = null;
// execute the handles
do {
    $mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

while ($active && $mrc == CURLM_OK) {
    if (curl_multi_select($mh) != -1) {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }
}

foreach ($data as $d) {
    curl_multi_remove_handle($mh, $d['curlrc']);
    echo $d['url'], ': ', hash_final($d['hashrc'], false), "\n";
}
curl_multi_close($mh);
(haven't checked the results though ...it's only a starting point)
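The second variant (let curl buffer each full response and hash it once the handle finishes) can be sketched as below. To keep the example self-contained and runnable it fetches local temp files via file:// URLs, an assumption made purely for the demo; with real http:// URLs the curl_multi code is identical:

```php
<?php
// Sketch of the second approach: let curl buffer each full response
// (CURLOPT_RETURNTRANSFER) and md5() the body once the handle finishes.
// For a self-contained demo we fetch local temp files via file:// URLs;
// with real http:// URLs the curl code is identical.
$paths = [];
foreach (['alpha', 'bravo charlie'] as $i => $content) {
    $path = sys_get_temp_dir() . "/md5demo$i.txt";
    file_put_contents($path, $content);
    $paths[] = $path;
}

$mh = curl_multi_init();
$handles = [];
foreach ($paths as $path) {
    $ch = curl_init('file://' . $path);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$path] = $ch;
}

// drive all transfers to completion
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh);
    }
} while ($active && $status == CURLM_OK);

foreach ($handles as $path => $ch) {
    $body = curl_multi_getcontent($ch);       // the buffered response body
    var_dump(md5($body) === md5_file($path)); // bool(true) per file
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
    unlink($path);
}
curl_multi_close($mh);
```

The trade-off versus the write-callback version: this one holds each full response in memory, which is fine for small files but wasteful for large ones.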
You could also shell out to the md5sum CLI command (e.g. exec('bash -c "md5sum file1 > file1.md5 &"')), or use something like PHP's pcntl_fork() to fork multiple calls to md5_file(). These both have their drawbacks, but in the right context, they may be the best thing to do. – Buzz

The md5 algorithm is pretty much as fast as it can get, and fetching URLs is pretty much as fast as it can get (slow if the files are huge or you have a slow connection). So no, you can't make it faster.
Well, obviously you cannot do anything to make md5_file() itself faster. However, you can use some micro-optimizations or code refactoring to get some speed gain elsewhere; but again, you cannot speed up the built-in function md5_file().
No. Since this is a built-in function, there's no way to make it faster.
But if your code is downloading files before MD5ing them, it may be possible to optimize your downloads to be faster. You may also see a small speed increase by setting the size of the file (using ftruncate()) before writing it, if you know the size ahead of time.
Also, if the files are small enough to hold in memory and you already have them in memory (because they have been downloaded, or are being read for some other purpose), then you could use md5() to operate on the data in memory rather than md5_file(), which requires the file to be read again from disk.
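A quick sketch of that point: hashing data you already hold in memory with md5() yields the same digest as md5_file() reading it back from disk, so the second read can be skipped. The temp file here is created only for the demo:

```php
<?php
// If the bytes are already in memory, md5() avoids a second read from disk.
$path = tempnam(sys_get_temp_dir(), 'md5demo');
$contents = str_repeat('some payload ', 1000);
file_put_contents($path, $contents);

$fromMemory = md5($contents);   // hash the in-memory copy
$fromDisk   = md5_file($path);  // re-reads the file from disk

var_dump($fromMemory === $fromDisk); // bool(true)
unlink($path);
```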
Presumably you are checking the same URLs over a period of time? Could you check the last modified headers for the URL? If the page being checked has not changed then there would be no need to re-compute the MD5.
You could also request the pages asynchronously so they could be processed in parallel, rather than in serial, which should speed it up.
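The Last-Modified idea can be sketched as below, assuming the server actually sends that header. The header string is hard-coded here for illustration; in real code it would come from get_headers($url, true) or a curl HEAD request, and the last-seen timestamp would be persisted between runs:

```php
<?php
// Decide whether a URL needs re-hashing by comparing its Last-Modified
// header against the timestamp recorded on the previous run.
function needsRehash(?string $lastModifiedHeader, int $lastSeen): bool {
    if ($lastModifiedHeader === null) {
        return true; // no header, so we can't tell: re-hash to be safe
    }
    $remote = strtotime($lastModifiedHeader);
    return $remote === false || $remote > $lastSeen;
}

$lastSeen = strtotime('Mon, 01 Jan 2024 00:00:00 GMT');
var_dump(needsRehash('Tue, 02 Jan 2024 00:00:00 GMT', $lastSeen)); // bool(true)
var_dump(needsRehash('Sun, 31 Dec 2023 00:00:00 GMT', $lastSeen)); // bool(false)
var_dump(needsRehash(null, $lastSeen));                            // bool(true)
```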
The speed of the MD5 algorithm is linear. The bigger the input, the more time it will take, so if the file is big, there's not much you can do, really.
Now, as VolkerK already suggested, the problem is most likely not the md5 hashing but retrieving and reading the file over the net.
I see a very good optimization suggestion here: instead of hashing both files, compare them byte by byte and bail out at the first difference. This works especially well for big files, where md5_file() has to read the entire file, while a direct comparison may stop as early as the second byte.
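My reading of that suggestion (an assumption, since the linked code isn't shown) is a chunk-wise comparison with early exit; a sketch:

```php
<?php
// Compare two files chunk by chunk, bailing out on the first difference.
// Unlike hashing both files, this never reads past the first mismatch.
function filesEqual(string $a, string $b): bool {
    if (filesize($a) !== filesize($b)) {
        return false; // cheap size check first
    }
    $fa = fopen($a, 'rb');
    $fb = fopen($b, 'rb');
    $equal = true;
    while (!feof($fa)) {
        if (fread($fa, 8192) !== fread($fb, 8192)) {
            $equal = false;
            break;
        }
    }
    fclose($fa);
    fclose($fb);
    return $equal;
}

$dir = sys_get_temp_dir();
file_put_contents("$dir/cmp_a.txt", 'identical content');
file_put_contents("$dir/cmp_b.txt", 'identical content');
file_put_contents("$dir/cmp_c.txt", 'different content');
var_dump(filesEqual("$dir/cmp_a.txt", "$dir/cmp_b.txt")); // bool(true)
var_dump(filesEqual("$dir/cmp_a.txt", "$dir/cmp_c.txt")); // bool(false)
```

Note this only helps when you are comparing two local files; it doesn't apply when verifying against a stored hash.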
Explaining what you want to do would help. In case you want to verify files against their MD5 hashes:
MD5 is not a secure method for that, as it is prone to collision attacks. You should use multiple hashes (maybe by splitting the file) or use other hash algorithms.
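With the hash extension, switching to a stronger algorithm such as SHA-256 is essentially a one-word change; a small sketch (the temp file is created only for the demo):

```php
<?php
// Same hashing API, stronger algorithm: hash('sha256', ...) and
// hash_file('sha256', ...) are drop-in analogues of md5()/md5_file().
$path = tempnam(sys_get_temp_dir(), 'shademo');
file_put_contents($path, 'payload to verify');

$expected = hash('sha256', 'payload to verify');
var_dump(hash_file('sha256', $path) === $expected); // bool(true)
var_dump(strlen($expected)); // int(64), i.e. 256 bits as hex
unlink($path);
```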
md5_file('http://some.url/foo') in a loop with 15 different URLs? How large are those "files"? – Christianna

md5_file is not the bottleneck. Also, surely the hash of an empty file is always going to be the same. – Media