md5 hash a large file incrementally?
B

5

6

In the browser, I read in a file using the JS FileReader().readAsBinaryString(). Using the CryptoJS library I can MD5 hash the data.

This works fine, but I do not know how to handle large files; e.g. just reading a 2 GiB file crashes the browser window. I can slice blobs from the file data and hash that as I go but wouldn't this prevent anyone else from verifying the same hash without following the same steps as me?

Is there a way to get the MD5 hash of a large file in this situation? How would you calculate the MD5 hash of a 1 TB file, for example? Do I need to read the file in as a stream?

First time cutting my teeth on this one and I'm not sure how to do it.

This resides in an angular directive, hence the scope.

// changeEvent comes from the directive's file-input change listener (not shown here)
var reader = new FileReader();
reader.onload = function (loadEvent) {
    scope.$apply(function () {
        scope.files = changeEvent.target.files;
        scope.fileread = loadEvent.target.result;
        scope.md5Data = CryptoJS.MD5(scope.fileread).toString();
    });
};
// First ten megs of the file
reader.readAsBinaryString((changeEvent.target.files[0]).slice(0, 10 * 1024 * 1024));
Blakley answered 22/5, 2015 at 17:19 Comment(2)
This is programming-related and belongs on SO.Bestialize
A good hashing library should have some kind of init/update/finish API, where you can call update for each chunk of the file.Selfdeceit
L
1

I can slice blobs from the file data and hash that as I go but wouldn't this prevent anyone else from verifying the same hash without following the same steps as me?

No, it won't: processing the input in chunks is exactly how the MD5 algorithm works internally:

  1. you have a file
  2. the file is padded with a single '1' bit followed by '0' bits (plus a 64-bit length field), so its length becomes a multiple of 512 bits.
  3. the algorithm then processes the message one 512-bit block at a time, combining each block with the state left by the previous one.

So as long as you feed your slices into a single incremental MD5 computation, the final digest is identical to hashing the whole file in one go, and nobody else has to repeat your slicing steps to verify it.

Since MD5 is computed block by block, streaming is possible, as you can read here (the example uses the built-in crypto module of Node.js):

http://www.hacksparrow.com/how-to-generate-md5-sha1-sha512-sha256-checksum-hashes-in-node-js.html
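
A minimal Node.js sketch of that streaming approach, using the built-in crypto and fs modules (the file path below is just a placeholder):

// Incrementally hash a large file without loading it all into memory.
const crypto = require('crypto');
const fs = require('fs');

const hash = crypto.createHash('md5');
const stream = fs.createReadStream('large-file.bin'); // placeholder path

stream.on('data', chunk => hash.update(chunk));          // feed every chunk into the same hash context
stream.on('end', () => console.log(hash.digest('hex'))); // same digest as hashing the whole file at once
stream.on('error', err => console.error(err));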

Lithometeor answered 24/5, 2015 at 7:4 Comment(1)
It'd be better to use the word "block" instead of "round". The latter has a specific meaning in cryptography, which is unrelated to what you want to convey.Selfdeceit
W
5

Use spark-md5 and Q

Since none of the other answers provided a full snippet, here's how you would calculate the MD5 hash of a large file:

function calculateMD5Hash(file, bufferSize) {
  var def = Q.defer();

  var fileReader = new FileReader();
  var fileSlicer = File.prototype.slice || File.prototype.mozSlice || File.prototype.webkitSlice;
  var hashAlgorithm = new SparkMD5();
  var totalParts = Math.ceil(file.size / bufferSize);
  var currentPart = 0;
  var startTime = new Date().getTime();

  fileReader.onload = function(e) {
    currentPart += 1;

    def.notify({
      currentPart: currentPart,
      totalParts: totalParts
    });

    var buffer = e.target.result;
    hashAlgorithm.appendBinary(buffer);

    if (currentPart < totalParts) {
      processNextPart();
      return;
    }

    def.resolve({
      hashResult: hashAlgorithm.end(),
      duration: new Date().getTime() - startTime
    });
  };

  fileReader.onerror = function(e) {
    def.reject(e);
  };

  function processNextPart() {
    var start = currentPart * bufferSize;
    var end = Math.min(start + bufferSize, file.size);
    fileReader.readAsBinaryString(fileSlicer.call(file, start, end));
  }

  processNextPart();
  return def.promise;
}

function calculate() {

  var input = document.getElementById('file');
  if (!input.files.length) {
    return;
  }

  var file = input.files[0];
  var bufferSize = Math.pow(1024, 2) * 10; // 10MB

  calculateMD5Hash(file, bufferSize).then(
    function(result) {
      // Success
      console.log(result);
    },
    function(err) {
      // There was an error
    },
    function(progress) {
      // We get notified of the progress as it is executed
      console.log(progress.currentPart, 'of', progress.totalParts, 'Total bytes:', progress.currentPart * bufferSize, 'of', progress.totalParts * bufferSize);
    });
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/q.js/1.4.1/q.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/spark-md5/2.0.2/spark-md5.min.js"></script>


<div>
  <input type="file" id="file"/>
  <input type="button" onclick="calculate();" value="Calculate" class="btn primary" />
</div>
Wagoner answered 18/7, 2016 at 9:13 Comment(0)
R
2

Use SparkMD5: https://github.com/satazor/SparkMD5

var spark = new SparkMD5(); 
spark.append('Hi');
spark.append('there');
var hexHash = spark.end();

and its documentation includes an example of hashing a file slice by slice.

Rightly answered 8/12, 2015 at 10:47 Comment(0)
C
1

You may want to check the "Progressive Hashing" section on the CryptoJS site.

The example:

var sha256 = CryptoJS.algo.SHA256.create();
sha256.update("Message Part 1");
sha256.update("Message Part 2");
sha256.update("Message Part 3");
var hash = sha256.finalize();

Replace SHA256 with MD5 and presto (rename the variable as well; I'll let you choose a good name).
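
For illustration, a minimal sketch of the MD5 variant (this assumes the MD5 algorithm is part of your CryptoJS build; if it is not included, CryptoJS.algo.MD5 will be undefined):

// Progressive MD5 hashing with CryptoJS
var md5 = CryptoJS.algo.MD5.create();
md5.update("Message Part 1");
md5.update("Message Part 2");
md5.update("Message Part 3");
var hash = md5.finalize().toString(); // hex digest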

Corkhill answered 24/5, 2015 at 14:27 Comment(2)
Upon trying MD5, it gives the error "Uncaught (in promise) TypeError: Cannot read property 'create' of undefined". BTW, does SHA256 (or MD5) work incrementally across all browsers and produce the same checksum? In my testing, they give varying results. See How to generate checksum & convert to 64 bit in Javascript for very large files without overflowing RAM?Tweedy
So you haven't been able to find the MD5 class; that's a program configuration issue. And the differing results are probably caused by binary differences in the input, e.g. after Result#text(). Please make sure that it is 100 percent identical.Corkhill
S
1

Usage:

const md5 = await incrementalMD5(file)

incrementalMD5 source:

import SparkMD5 from 'spark-md5'

export const incrementalMD5 = file =>
  new Promise((resolve, reject) => {
    const fileReader = new FileReader()
    const spark = new SparkMD5.ArrayBuffer()
    const chunkSize = 2097152 // Read in chunks of 2MB
    const chunks = Math.ceil(file.size / chunkSize)
    let currentChunk = 0

    fileReader.onload = event => {
      spark.append(event.target.result) // Append array buffer
      ++currentChunk
      currentChunk < chunks ? loadNext() : resolve(spark.end()) // Compute hash
    }

    fileReader.onerror = () => reject(fileReader.error)

    const loadNext = () => {
      const start = currentChunk * chunkSize
      const end = start + chunkSize >= file.size ? file.size : start + chunkSize
      fileReader.readAsArrayBuffer(File.prototype.slice.call(file, start, end))
    }

    loadNext()
  })
Sacellum answered 5/1, 2021 at 5:57 Comment(0)
