Correctness of Fletcher32 checksum algorithm

Asked 26/10, 2016 at 19:15 Answered 3/2, 2020 at 12:30

I'm having a hard time figuring out which implementation of the 32-bit variation of the Fletcher checksum algorithm is correct. Wikipedia provides the following optimized implementation:

uint32_t fletcher32( uint16_t const *data, size_t words ) {
        uint32_t sum1 = 0xffff, sum2 = 0xffff;
        size_t tlen;

        while (words) {
                tlen = words >= 359 ? 359 : words;
                words -= tlen;
                do {
                        sum2 += sum1 += *data++;
                } while (--tlen);
                sum1 = (sum1 & 0xffff) + (sum1 >> 16);
                sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        }
        /* Second reduction step to reduce sums to 16 bits */
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        return sum2 << 16 | sum1;
}

In addition, I've adapted the non-optimized 16-bit example from the Wikipedia article to compute a 32-bit checksum:

uint32_t naive_fletcher32(uint16_t *data, int words) {
   uint32_t sum1 = 0;
   uint32_t sum2 = 0;

   int index;
   for( index = 0; index < words; ++index ) {
      sum1 = (sum1 + data[index]) % 0xffff;
      sum2 = (sum2 + sum1) % 0xffff;
   }
   return (sum2 << 16) | sum1;
}

Both these implementations yield the same results, e.g. 0x56502d2a for the string abcdef. To verify that this is indeed correct, I tried to find other implementations of the algorithm:

All of these seem to agree that the checksum for abcdef is 0x8180255 instead of the value given by the implementation on Wikipedia. I've narrowed this down to how the data buffer the implementation operates on. All the above non-wikipedia implementation operate one byte at a time, whereas the Wikipedia implementation computes the checksum using 16-bit words. If I modify the above "naive" Wikipedia implementation to operate per-byte instead, it reads like this:

uint32_t naive_fletcher32_per_byte(uint8_t *data, int words) {
   uint32_t sum1 = 0;
   uint32_t sum2 = 0;

   int index;
   for( index = 0; index < words; ++index ) {
      sum1 = (sum1 + data[index]) % 0xffff;
      sum2 = (sum2 + sum1) % 0xffff;
   }
   return (sum2 << 16) | sum1;
}

The only thing that changes is the signature, really. So this modified naive implementation and the above mentioned implementations (except Wikipedia) agree that the checksum of abcdef is indeed 0x8180255.

My problem now is: which one is correct?

Carrycarryall answered 26/10, 2016 at 19:15 Comment(7)

In the naive_fletcher's the % 0xffff in the loop are not necessary. You can do that after the loop. – Ludmilla 26/10, 2016 at 19:59

That's why its the naive implementation I suppose :D Thanks for the hint, but the question isn't really about optimization :) – Carrycarryall 26/10, 2016 at 22:22

@PaulOgilvie: % 0xffff in the loop are not necessary as long as there is no overflow. – Vann 17/2, 2018 at 9:50

@greybeard, and what would happen in overflow? Bits that will never be used will just fall out of the register. – Ludmilla 17/2, 2018 at 10:43

@greybeard, I don't get your point. Nothing will be added to the low part upon overflow. The high part of 16 bits are not used. – Ludmilla 17/2, 2018 at 15:45

@PaulOgilvie: 0x10000%0xffff is 1, not 0: carries need to be accounted for. – Vann 17/2, 2018 at 16:50

@greybeard, you are correct. I did not see that. I was reasoning about bitwise and. – Ludmilla 18/2, 2018 at 10:27

According to the standard, the right method is the one that Wikipedia provides — except the name:

Note that the 8-bit Fletcher algorithm gives a 16-bit checksum and the 16-bit algorithm gives a 32-bit checksum.

Koppel answered 26/10, 2016 at 19:44 Comment(0)

In the standard quoted in the answer of HideFromKGB, the algorithm is trivial: the 8-bit version uses only 8 bit accumulators ("ints"), producing 8 bit results A and B, and the 16-bit version uses 16 bit "ints", producing 16 bit results A and B.

It should be noted that what Wikipedia calls the "32 bit Fletcher" is actually the "16 bit Fletcher". The number of bits in the name refers in the standard to the number of bits in each D[i] and in each of A and B, but on Wikipedia it refers to the number of bits in the "stacked result", i.e. in A<<16 | B for the 32 bit result.

I did not implement this, but maybe this can explain the difference. I am inclined to say that your interpretation (implementation) is correct.

N.b.: also note that it is necessary to pad data with zeroes to the appropriate number of bytes.

Ludmilla answered 27/10, 2016 at 0:26 Comment(2)

Thanks for the answer! To me, this still leaves the question why all other implementations I have found do not conform to this standard. That is, a standard that nobody implements is kind of worthless, but then again I haven't seen an implementation within TCP either, maybe I should check that out. – Carrycarryall 27/10, 2016 at 7:34

I left a request for clarification with your first reference. The second and third seem quite unofficial and there was no option to put in a request for clarification. In your third reference I did not find a reference to Fletcher or its RFC. – Ludmilla 28/10, 2016 at 21:30

These are test vectors, which are cross checked with two different implementations for 16-bit and for 32-bit check sums:

8-bit implementation (16-bit checksum)
 "abcde" -> 51440 (0xC8F0)
 "abcdef" -> 8279 (0x2057)
 "abcdefgh" -> 1575 (0x0627)

16-bit implementation (32-bit checksum)
 "abcde" -> 4031760169 (0xF04FC729)
 "abcdef" -> 1448095018 (0x56502D2A)
 "abcdefgh" -> 3957429649 (0xEBE19591)

Choragus answered 3/10, 2017 at 16:58 Comment(0)

TCP Alternate Checksum Options describes the Fletcher checksum algorithm for use with TCP: RFC 1146 dated March 1990.

The 8-bit Fletcher algorithm which gives a 16-bit checksum and the 16-bit algorithm which gives a 32-bit checksum are discussed.

The 8-bit Fletcher Checksum Algorithm is calculated over a sequence of data octets (call them D[1] through D[N]) by maintaining 2 unsigned 1's-complement 8-bit accumulators A and B whose contents are initially zero, and performing the following loop where i ranges from 1 to N:

       A := A + D[i]
       B := B + A

The 16-bit Fletcher Checksum algorithm proceeds in precisely the same manner as the 8-bit checksum algorithm, except that A, B and the D[i] are 16-bit quantities. It is necessary (as it is with the standard TCP checksum algorithm) to pad a datagram containing an odd number of octets with a zero octet.

That agrees with Wikipedia algorithms. The simple testing program confirms quoted results:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h> // for uint32_t
    
    uint32_t fletcher32_1(const uint16_t *data, size_t len)
    {
            uint32_t c0, c1;
            unsigned int i;
    
            for (c0 = c1 = 0; len >= 360; len -= 360) {
                    for (i = 0; i < 360; ++i) {
                            c0 = c0 + *data++;
                            c1 = c1 + c0;
                    }
                    c0 = c0 % 65535;
                    c1 = c1 % 65535;
            }
            for (i = 0; i < len; ++i) {
                    c0 = c0 + *data++;
                    c1 = c1 + c0;
            }
            c0 = c0 % 65535;
            c1 = c1 % 65535;
            return (c1 << 16 | c0);
    }
    
    uint32_t fletcher32_2(const uint16_t *data, size_t l)
    {
        uint32_t sum1 = 0xffff, sum2 = 0xffff;
    
        while (l) {
            unsigned tlen = l > 359 ? 359 : l;
            l -= tlen;
            do {
                sum2 += sum1 += *data++;
            } while (--tlen);
            sum1 = (sum1 & 0xffff) + (sum1 >> 16);
            sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        }
        /* Second reduction step to reduce sums to 16 bits */
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        return (sum2 << 16) | sum1;
    }
    
    int main()
    {
        char *str1 = "abcde";  
        char *str2 = "abcdef";
        
        size_t len1 = (strlen(str1)+1) / 2; //  '\0' will be used for padding 
        size_t len2 = (strlen(str2)+1) / 2; // 
        
        uint32_t f1 = fletcher32_1(str1,  len1);
        uint32_t f2 = fletcher32_2(str1,  len1);
    
        printf("%u %X \n",    f1,f1);
        printf("%u %X \n\n",  f2,f2);
       
        f1 = fletcher32_1(str2,  len2);
        f2 = fletcher32_2(str2,  len2);
    
        printf("%u %X \n",f1,f1);
        printf("%u %X \n",f2,f2);
       
        return 0;
    }

Output:

4031760169 F04FC729                                                                                                                                                                                                                              
4031760169 F04FC729                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                 
1448095018 56502D2A                                                                                                                                                                                                                              
1448095018 56502D2A

Deadlight answered 17/2, 2018 at 2:35 Comment(1)

However, fletcher32_1 and fletcher32_2 don't always generate the same result (See analysis by Sven.) E.g., fletcher32_1(0,0) gives 0x0 and fletcher32_2(0,0)=0xffffffffffffffff. In general any code found on Wikipedia that has been modified compared to references should be treated as very suspicious. – Proposition 4/2, 2020 at 9:6

My answer is focusses on the correctness of s = (s & 0xffff) + (s >> 16);. This is obviously supposed to replace the modulo operation. Now the big issue with the modulo operation is the division that needs to be performed. The trick is to not do the division and to estimate floor(s / 65535). So instead of computing s - floor(s/65535)*65535, which would be the same as modulo, we compute s - floor(s/65536)*65535. This will obviously not be equivalent to doing modulo. But it's good enough to quickly reduce the size of s.

Now we have

  s - floor(s / 65536) * 65535
= s - (s >> 16) * 65535
= s - (s >> 16) * (65536 - 1)
= s - (s >> 16) * 65536 + (s >> 16)
= (s & 0xffff) + (s >> 16)

Since the (s & 0xffff) + (s >> 16) is not equivalent to doing the modulo, it does not suffice to use this formula. If s == 65535 then s % 65535 would yield zero. However, the former formula yields 65535. So the optimized Wikipedia implementation posted here is obviously false! The last 3 lines need to be changed to

        /* Second reduction step to reduce sums to 16 bits */
        sum1 = (sum1 & 0xffff) + (sum1 >> 16);
        sum2 = (sum2 & 0xffff) + (sum2 >> 16);
        if (sum1 >= 65535) { sum1 -= 65535; }
        if (sum2 >= 65535) { sum2 -= 65535; }
        return (sum2 << 16) | sum1;

It is noteworthy, that I can't find the optimized implementation on the Wikipedia page anymore (February 2020).

Addendum: Imagine s would be equal to the maximum unsigned 32 bit value, that is 0xFFFF_FFFF. Then the formula (s & 0xffff) + (s >> 16); yields 0x1FFFE. That is exactly two times 65535. So the correction step if (s >= 65535) { s -= 65535; } will not work since it subtracts 65535 at most once. So we want to keep sum1 and sum2 in the loops strictly smaller than 0xFFFF_FFFF. Then the formula yields at most 2*65535-1 and the correction step will work. The following simple python program determines, that sum2 would become too big after 360 iterations. So processing at most 359 16 bit words at a time is exactly right.

s1 = 0x1FFFD
s2 = 0x1FFFD
for i in range(1,1000):
    s1 += 0xFFFF
    s2 += s1
    if s2 >= 0xFFFFFFFF:
        print(i)
        break

Whitefaced answered 3/2, 2020 at 12:30 Comment(3)

After that correction it seems unclear if the optimized version is faster. (It will likely depend on the hardware.) Note that this change only concerns the optimization outside the loop, right? – Proposition 3/2, 2020 at 13:9

I didn't do the math whether 359 is the correct bound. Apart from that, the loop is correct. My correction is not needed inside the loop. It's only needed once in the end. – Whitefaced 3/2, 2020 at 13:32

I added an explanation of the number 359. – Whitefaced 5/2, 2020 at 23:50

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags