The basic idea is to meet alignment constraints: each "word" to be copied one word at a time must be aligned on a "word" boundary.
Some CPUs have this as a fundamental constraint, that loads and stores must occur on a "natural" boundary. On older ARM processors the low bits of the address are actually ignored entirely, so that a load or store of two bytes from an odd-valued address have the same effect as:
short w = *(short *)(addr & ~1);
for instance. On some other CPUs an unaligned load or store results in a trap (MIPS and SPARC for instance), and still others will just do it, but with a performance penalty (x86).
So, suppose you're copying a large number of bytes (say, 4096 of them) from address 0x12345 to address 0x22345, and that the "word size" is 4 bytes. If we first copy three bytes, the addresses will now be 0x12348 and 0x22348. At this point we can copy just 1023 4-byte words, one word at a time, without tripping over any alignment issues. After that we'll have one remaining byte to copy, because 4096 = 3 + (4 * 1023) + 1.
This all makes the assumption that bytes are all addressed individually, even when loading and storing "words". This assumption is false on some machines: for instance, the old Data General MV10000 CPU would address "words" using "word addresses", which are essentially byte addresses divided by two. (It's thus impossible to address a "word" that spans two bytes: the word at location 0 has byte addresses 0 and 1 but word address 0; the word at location 1 has byte addresses 2 and 3; the word at location 2 has byte addresses 4 and 5; and so on.) On machines like this, you may need to use a different version of bcopy.c.
As @Alex noted, the XOR is just making sure that it is actually possible to align the two addresses. If you're copying from 0x12345 to 0x22345, it is; but if you're copying from 0x12345 to 0x22344, the two addresses will never line up.