Why use an MD5 checksum for large file downloads over HTTP?

Asked 4/1, 2011 at 10:53 Answered 12/11, 2017 at 1:13

MD5 checksums are widely used to check the integrity of large files downloaded over HTTP. My question is this: TCP itself provides a reliability mechanism (a checksum on each TCP packet to ensure its integrity), so TCP is reliable. HTTP is built on TCP, so HTTP should be reliable too. Why, then, do we need another integrity check such as an MD5 checksum?

Thanks in advance, George

Jess answered 4/1, 2011 at 10:53 Comment(2)

The checksum is just for that packet. It doesn't mean that these small chunks of data that are all checked for integrity will produce a big file that has the same integrity. – Lingcod 4/1, 2011 at 10:59

Hi Thai, I am confused. I think that if the integrity of each small packet is OK, the whole file (which consists of those packets) should also be OK. Any comments? – Jess 5/1, 2011 at 9:24

Most often the hash sum is used for an out-of-band check of the download's integrity (for example, against a value printed on the website), not a programmatic one.

This prevents manipulation of the download artifact.
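The comparison is usually done by eye, but it can also be scripted once the published value has been copied from the download page. The following is a minimal sketch, not any site's official tool: `file_md5` and `matches_published` are hypothetical helper names, and the published digest is assumed to be a plain hex string.

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that a large download does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published(path, published_hex):
    """Compare the local digest with the value copied from the
    download page (assumed to be a hex string, any letter case)."""
    return file_md5(path) == published_hex.strip().lower()
```

The check is only as trustworthy as the channel the published value came from, which is the point of fetching it out of band.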

Andriaandriana answered 4/1, 2011 at 11:17 Comment(10)

"manipulation of the download artifact" -- what do you mean manipulation of the download artifact, could you show me a sample please? – Jess 5/1, 2011 at 9:26

What I mean is simply an attack where the download is replaced, for example by breaking into the server or setting up a creative redirection. You simply download a different, manipulated file. But if you compare it with a checksum provided by some other means, you may become aware of the cheat. This other means is often called "out of band": the attacker has to break two mechanisms. – Andriaandriana 5/1, 2011 at 10:17

If the hacker could manipulate the file on the server, why couldn't the hacker also change the checksum? – Jess 6/1, 2011 at 4:19

Maybe he can, but he has to break into both the download area and the content management system (often a DB) that holds the page containing the checksum. – Andriaandriana 6/1, 2011 at 8:36

Thanks. But I am confused. Normally the download link and the checksum are on the same page. If the hacker could alter the download URL on that page, it should be just as easy to alter the checksum, which is just a string/number on the same page. Why do you think hacking both is hard? Could you describe this in more detail? – Jess 6/1, 2011 at 15:0

Have a look at a typical download page like hc.apache.org/downloads.cgi. The checksum is always hosted by Apache, while the download is on some mirror. If you have a secure base, you can use less secure delegates. But you're right: if someone breaks into the Apache server, this will not help you much. If you have higher security needs, you must use another OOB channel (someone sends you a mail after a download, calls you on the phone, ...) or other techniques (signatures provide higher security; you find them on the Apache page, too). – Andriaandriana 6/1, 2011 at 15:19

Thanks mtraut. Your description really makes sense. Besides preventing hacking, do you think an MD5 checksum also guards against data transfer problems over TCP/HTTP (e.g. some bits not being transferred correctly from one end to the other)? – Jess 7/1, 2011 at 2:10

I wouldn't add a hashing feature to a plain TCP-based protocol, like an HTTP request/response scenario. Maybe (I have no practical experience) this is useful if you build a more complex protocol on top of it (a downloader that can resume, for example), where subtle errors could occur (both parties get out of sync on the file pointer while downloading, the file content changes while downloading, ...). – Andriaandriana 7/1, 2011 at 9:3

Nice answer. May I add one more reason: the download manager may have a bug. TCP only ensures that: 1. Once a packet is received, it is error-free. 2. The order of the received packets is the same as the order in which they were sent. What to do with those packets is the job of the download manager. – Dupree 20/6, 2015 at 1:49

There is a chance in TCP/IP that errors are undetected: noahdavids.org/self_published/CRC_and_checksum.html – Castiglione 29/7, 2015 at 6:8
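One well-known way this can happen: the TCP checksum is a 16-bit ones'-complement sum, and a sum does not depend on the order of its terms. The sketch below is a simplified illustration in the style of RFC 1071; it covers only a payload, whereas the real TCP checksum also covers the TCP header and a pseudo-header, and the sample bytes are made up.

```python
import hashlib

def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum over a byte string."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF

original = b"\x12\x34\xab\xcd"
swapped = b"\xab\xcd\x12\x34"  # same 16-bit words, different order

# The 16-bit sum cannot tell the two payloads apart...
print(internet_checksum(original) == internet_checksum(swapped))  # True
# ...while an MD5 digest of the same bytes does differ.
print(hashlib.md5(original).digest() == hashlib.md5(swapped).digest())  # False
```

With only 16 bits, other error patterns can also slip through, which matters more the larger the transfer is.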

More than 3 times in my life I downloaded a broken ISO or EXE and when I downloaded it again it worked. This proves to me that the TCP mechanism isn't enough to ensure integrity.

Pence answered 26/1, 2011 at 12:28 Comment(1)

Happened to me too, might have come from the browser though – Mopey 1/7, 2015 at 17:9

The answer is simple. The source file may already be corrupt before you even begin downloading. TCP only verifies that what you receive is the same as what the server sent. MD5 lets you detect the corruption whether the cause is a problem in transfer or the initial file itself.

Adin answered 4/1, 2011 at 10:57 Comment(2)

"cause be a problem in transfer" -- confused about this point. I think transfer during TCP is reliable. Why do you think there is problem in transfer, example? – Jess 5/1, 2011 at 9:25

TCP is as reliable as a connection can be, but there are still potential problems in a transfer over TCP. If the connection dies (for example, if you unplug your computer from the network), TCP keeps retransmitting and tries to continue where it left off, but after a certain number of attempts it gives up. The unfinished file is left on the disk. Granted, this rarely happens, since most brief interruptions recover within a couple of retries. – Adin 11/1, 2011 at 11:26
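An unfinished file like this is easy to catch after the fact. The sketch below assumes the expected size and MD5 are known from the download page; `check_download` is a hypothetical helper, and the three status strings are made up for illustration.

```python
import hashlib
import os

def check_download(path, expected_size, expected_md5):
    """Classify a finished download as 'truncated', 'corrupt', or 'ok'."""
    actual_size = os.path.getsize(path)
    if actual_size < expected_size:
        return "truncated"  # the transfer stopped early; no need to hash
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if actual_size != expected_size or digest.hexdigest() != expected_md5.lower():
        return "corrupt"  # right length or longer, but wrong content
    return "ok"
```

The size test is cheap and catches the dropped-connection case; the digest is still needed for files that have the right length but the wrong bytes.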

With the 35G TED-LIUM corpus or the even larger 400G tiny-images dataset, there seems to be some error in the downloaded file almost every time. For the 35G TED-LIUM corpus, I ran the download at least 20 times, for a total of about 700G of network transfer over several months. CRC is just a nightmare.

Palimpsest answered 12/11, 2017 at 1:13 Comment(0)
