Right after I looked at this question, I found a MUL in my own generated code when dividing.
The full code converts a large binary number into base-10^9 chunks (groups of nine decimal digits) so that it can easily be turned into a string; a self-contained sketch of the whole routine follows the snippet below.
C++ code:
for_each(TempVec.rbegin(), TempVec.rend(), [&](Short & Num){
    Remainder <<= 32;
    Remainder += Num;
    Num = Remainder / 1000000000;
    Remainder %= 1000000000; // equivalent to Remainder %= DecimalConvert
});
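The surrounding code isn't shown above, so here is a minimal self-contained sketch of how such a routine might look. Everything outside the for_each is my guess, not the original: I'm assuming Short is a 32-bit limb, TempVec stores the number least-significant limb first, Remainder is a 64-bit accumulator, and the chunk collection and padding at the end are made up for illustration.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

using Short = std::uint32_t;                       // assumed: one 32-bit limb of the big number

std::string NumToString(std::vector<Short> TempVec)    // limbs stored least significant first (assumed)
{
    std::vector<std::uint32_t> Chunks;             // base-10^9 digits, least significant first

    // Each pass divides the whole big number by 10^9 in place;
    // the remainder that falls out is one 9-decimal-digit chunk.
    while (std::any_of(TempVec.begin(), TempVec.end(), [](Short s){ return s != 0; }))
    {
        std::uint64_t Remainder = 0;
        std::for_each(TempVec.rbegin(), TempVec.rend(), [&](Short & Num){
            Remainder <<= 32;                      // bring in the next limb, most significant first
            Remainder += Num;
            Num = static_cast<Short>(Remainder / 1000000000);  // the division the compiler strength-reduces
            Remainder %= 1000000000;
        });
        Chunks.push_back(static_cast<std::uint32_t>(Remainder));
    }
    if (Chunks.empty())
        return "0";

    // Most significant chunk unpadded, the rest zero-padded to 9 digits.
    std::string Result = std::to_string(Chunks.back());
    for (auto It = Chunks.rbegin() + 1; It != Chunks.rend(); ++It)
    {
        std::string Part = std::to_string(*It);
        Result += std::string(9 - Part.size(), '0') + Part;
    }
    return Result;
}

int main()
{
    std::cout << NumToString({0, 0, 1}) << '\n';   // 2^64 as limbs {low, mid, high}: prints 18446744073709551616
}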
Optimized Generated Assembly
00007FF7715B18E8 lea r9,[rsi-4]
00007FF7715B18EC mov r13,12E0BE826D694B2Fh
00007FF7715B18F6 nop word ptr [rax+rax]
00007FF7715B1900 shl r8,20h
00007FF7715B1904 mov eax,dword ptr [r9]
00007FF7715B1907 add r8,rax
00007FF7715B190A mov rax,r13
00007FF7715B190D mul rax,r8
00007FF7715B1910 mov rcx,r8
00007FF7715B1913 sub rcx,rdx
00007FF7715B1916 shr rcx,1
00007FF7715B1919 add rcx,rdx
00007FF7715B191C shr rcx,1Dh
00007FF7715B1920 imul rax,rcx,3B9ACA00h
00007FF7715B1927 sub r8,rax
00007FF7715B192A mov dword ptr [r9],ecx
00007FF7715B192D lea r9,[r9-4]
00007FF7715B1931 lea rax,[r9+4]
00007FF7715B1935 cmp rax,r14
00007FF7715B1938 jne NumToString+0D0h (07FF7715B1900h)
Notice the MUL instruction five instructions into the loop body (at ...190D, right after the constant 12E0BE826D694B2Fh is copied into RAX).
This generated code is extremely unintuitive, I know; it looks nothing like the C++ that was compiled. But DIV is extremely slow: roughly 25 cycles for a 32-bit divide, and around 75 for a 64-bit divide according to this chart, compared with around 3 or 4 cycles for MUL or IMUL on modern PCs. So it makes sense to get rid of DIV even if you have to add all sorts of extra instructions.
I don't fully understand the optimization here, but if you would like to see a rationale and a mathematical explanation of using compile-time computation and multiplication to divide by constants, see this paper.
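For what it's worth, translated back into C++ the instruction sequence above computes something like the following. This is a sketch, not the compiler's actual output; DivByBillion is my name, and it assumes the GCC/Clang unsigned __int128 extension to get the high half of the product (on x64 MSVC the _umul128 intrinsic from <intrin.h> gives the same thing).

#include <cassert>
#include <cstdint>

// Division by 1,000,000,000 via the magic constant from the listing.
// The effective multiplier is 2^64 + 0x12E0BE826D694B2F, which doesn't fit
// in 64 bits, so the dividend gets added back in (the sub/shr/add/shr dance).
std::uint64_t DivByBillion(std::uint64_t n)
{
    constexpr std::uint64_t Magic = 0x12E0BE826D694B2FULL;     // mov r13, 12E0BE826D694B2Fh
    std::uint64_t hi = (unsigned __int128)n * Magic >> 64;     // mul: rdx = high 64 bits of n * Magic
    std::uint64_t t  = ((n - hi) >> 1) + hi;                   // sub/shr/add: floor((n + hi) / 2) without overflow
    return t >> 29;                                            // shr rcx, 1Dh: overall n * (2^64 + Magic) / 2^94
}

int main()
{
    for (std::uint64_t n : {0ull, 999999999ull, 1000000000ull,
                            0xFFFFFFFFFFFFFFFFull})
        assert(DivByBillion(n) == n / 1000000000);             // matches plain division
}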
This is an example of the compiler making use of the performance and capability of the full 64 x 64-bit untruncated multiply without showing the C++ coder any sign of it.
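If you ever want that untruncated product yourself, you have to ask for it explicitly. A small sketch (MulHi64 is my name; _umul128 is the documented x64 MSVC intrinsic, unsigned __int128 is a GCC/Clang extension):

#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Full 64 x 64 -> 128-bit multiply: returns the high 64 bits, writes the low 64 bits.
std::uint64_t MulHi64(std::uint64_t a, std::uint64_t b, std::uint64_t &lo)
{
#if defined(_MSC_VER)
    std::uint64_t hi;
    lo = _umul128(a, b, &hi);                         // compiles to a single MUL on x64
    return hi;
#else
    unsigned __int128 p = (unsigned __int128)a * b;   // GCC/Clang extension
    lo = (std::uint64_t)p;
    return (std::uint64_t)(p >> 64);
#endif
}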