Loading/Storing to XMFLOAT4 faster than using XMVECTOR?

Asked 2/5, 2014 at 1:11 Answered 15/12, 2014 at 7:7

I'm going through the DirectX Math/XNA Math library, and I got curious when I read about the alignment requirements for XMVECTOR (now DirectX::XMVECTOR), and how you are expected to use XMFLOAT* for members instead, calling XMLoad* and XMStore* when performing mathematical operations. I was specifically curious about the tradeoffs, so I did an experiment, as I'm sure many others have, to see exactly how much you lose by having to load and store the vectors for each operation. This is the resulting code:

#include <Windows.h>

#include <chrono>
#include <cstdint>
#include <DirectXMath.h>
#include <iostream>

using std::chrono::high_resolution_clock;

#define TEST_COUNT          1000000000l

int main(void)
{
    DirectX::XMVECTOR v1 = DirectX::XMVectorSet(1, 2, 3, 4);
    DirectX::XMVECTOR v2 = DirectX::XMVectorSet(2, 3, 4, 5);
    DirectX::XMFLOAT4 x{ 1, 2, 3, 4 };
    DirectX::XMFLOAT4 y{ 2, 3, 4, 5 };

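    // Use the clock's own time_point type; high_resolution_clock is not
    // guaranteed to be an alias of system_clock on every toolchain.
    high_resolution_clock::time_point start, end;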
    std::chrono::milliseconds duration;

    // Test with just the XMVECTOR
    start = high_resolution_clock::now();
    for (uint64_t i = 0; i < TEST_COUNT; i++)
    {
        v1 = DirectX::XMVectorAdd(v1, v2);
    }
    end = high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    DirectX::XMFLOAT4 z;
    DirectX::XMStoreFloat4(&z, v1);
    std::cout << std::endl << "z = " << z.x << ", " << z.y << ", " << z.z << std::endl;
    std::cout << duration.count() << " milliseconds" << std::endl;

    // Now try with load/store
    start = high_resolution_clock::now();
    for (uint64_t i = 0; i < TEST_COUNT; i++)
    {
        v1 = DirectX::XMLoadFloat4(&x);
        v2 = DirectX::XMLoadFloat4(&y);

        v1 = DirectX::XMVectorAdd(v1, v2);
        DirectX::XMStoreFloat4(&x, v1);
    }
    end = high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

    std::cout << std::endl << "x = " << x.x << ", " << x.y << ", " << x.z << std::endl;
    std::cout << duration.count() << " milliseconds" << std::endl;
}

Running a debug build yields the output:

z = 3.35544e+007, 6.71089e+007, 6.71089e+007
25817 milliseconds

x = 3.35544e+007, 6.71089e+007, 6.71089e+007
84344 milliseconds

Okay, so about thrice as slow, but does anyone really take perf tests on debug builds seriously? Here are the results when I do a release build:

z = 3.35544e+007, 6.71089e+007, 6.71089e+007
1980 milliseconds

x = 3.35544e+007, 6.71089e+007, 6.71089e+007
670 milliseconds

Like magic, XMFLOAT4 runs almost three times faster! Somehow the tables have turned. Looking at the code, this makes no sense to me; the second part runs a superset of the commands that the first part runs! There must be something going wrong, or something I am not taking into account. It is difficult to believe that the compiler was able to optimize the second part nine-fold over the much simpler, and theoretically more efficient, first part.

The only reasonable explanations I have are that (1) cache behavior is involved, (2) there is some crazy out-of-order execution that XMVECTOR can't take advantage of, (3) the compiler is making some insane optimizations, or (4) using XMVECTOR directly has some implicit inefficiency that could be optimized away when using XMFLOAT4. That is, the default way the compiler loads and stores XMVECTORs from memory is less efficient than XMLoad* and XMStore*.

I attempted to inspect the disassembly, but I'm not all that familiar with x86 and/or SSE2, and Visual Studio does some crazy optimizations that make it difficult to follow along with the source code. I also tried the Visual Studio performance analysis tool, but that didn't help, as I can't figure out how to make it show the disassembly instead of the code. The only useful information I get out of it is that the first call to XMVectorAdd accounts for ~48.6% of all cycles, while the second call to XMVectorAdd accounts for ~4.4% of all cycles.

EDIT: After some more debugging, here is the assembly for the code that gets run inside of the loop. For the first part:

002912E0  movups      xmm1,xmmword ptr [esp+18h]     <-- HERE
002912E5  add         ecx,1  
002912E8  movaps      xmm0,xmm2                      <-- HERE
002912EB  adc         esi,0  
002912EE  addps       xmm0,xmm1  
002912F1  movups      xmmword ptr [esp+18h],xmm0     <-- HERE
002912F6  jne         main+60h (0291300h)  
002912F8  cmp         ecx,3B9ACA00h  
002912FE  jb          main+40h (02912E0h)

And for the second part:

00291400  add         ecx,1  
00291403  addps       xmm0,xmm1  
00291406  adc         esi,0  
00291409  jne         main+173h (0291413h)  
0029140B  cmp         ecx,3B9ACA00h  
00291411  jb          main+160h (0291400h)

Note that these two loops are indeed nearly identical. The only difference is that the first for loop appears to be the one doing the loading and storing! It would appear that Visual Studio made a ton of optimizations because x and y were on the stack. After changing them both to be on the heap (so the writes must happen), the machine code is now identical. Is this generally the case? Is there really no negative side effect to using the storage classes, other than in the fully optimized case, I suppose?
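For anyone else not fluent in SSE: movups in the listings is an unaligned 16-byte load or store, and addps is a packed add of four floats. The sketch below spells out the same load, add, store sequence with the raw intrinsics from <xmmintrin.h>. It is an illustration under assumptions, not the DirectXMath source: Float4, add_in_place and accumulate are hypothetical names, Float4 only stands in for a storage type like XMFLOAT4, and a scalar fallback is included so the sketch also builds on targets without SSE.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Hypothetical stand-in for a 4-float storage type such as XMFLOAT4:
// four plain floats with no 16-byte alignment requirement.
struct Float4 { float x, y, z, w; };

// dst += src, written as an explicit load / add / store.
inline void add_in_place(Float4* dst, const Float4* src)
{
    assert(dst != nullptr && src != nullptr);
#if SKETCH_HAS_SSE
    __m128 a = _mm_loadu_ps(&dst->x);   // movups: unaligned 16-byte load
    __m128 b = _mm_loadu_ps(&src->x);   // movups: unaligned 16-byte load
    a = _mm_add_ps(a, b);               // addps: four float adds at once
    _mm_storeu_ps(&dst->x, a);          // movups: unaligned 16-byte store
#else
    // Scalar fallback for targets without SSE.
    dst->x += src->x;
    dst->y += src->y;
    dst->z += src->z;
    dst->w += src->w;
#endif
}

// Applies the add `count` times to a local copy and returns the result.
inline Float4 accumulate(Float4 start, Float4 step, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        add_in_place(&start, &step);
    }
    return start;
}
```

When the address of the storage never escapes the function, an optimizer is free to keep the value in a register across iterations and drop the per-iteration load and store. That is consistent with the second listing above, where only addps remains inside the loop.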

Aspersion answered 2/5, 2014 at 1:11 Comment(4)

FWIW, in the second loop v2 = DirectX::XMLoadFloat4(&y); would probably be optimised away (i.e. put outside of the loop) because it doesn't change inside. I know that doesn't explain the difference though ... – Gronseth 2/5, 2014 at 6:33

Interesting - I get the same 3:1 difference too - +1 for intrigue! – Gronseth 2/5, 2014 at 6:41

See my edit. It appears optimization is at the heart of the issue, though I'm still baffled as to why XMVECTOR did not receive the same optimizations. – Aspersion 2/5, 2014 at 7:3

A more realistic test would be to have the vector content in a memory array allocated from the heap. As it is, the optimizer is figuring out that most of what you are doing has no side-effects and is free to optimize it away. – Maretz 14/6, 2017 at 16:57

Firstly, don't use Visual Studio's "high-resolution clock" for perf timing. You should use QueryPerformanceCounter instead. See Connect.
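As a rough sketch of what that could look like: the helper below reads QueryPerformanceCounter on Windows and, as an assumption made only so the snippet builds elsewhere, falls back to std::chrono::steady_clock on other platforms. The names now_seconds and time_seconds are hypothetical, not part of any library.

```cpp
#include <cassert>
#include <chrono>
#ifdef _WIN32
#include <Windows.h>
#endif

// Returns a monotonic timestamp in seconds.
inline double now_seconds()
{
#ifdef _WIN32
    LARGE_INTEGER freq, count;
    const BOOL ok = QueryPerformanceFrequency(&freq)    // ticks per second
                 && QueryPerformanceCounter(&count);    // current tick count
    assert(ok);
    (void)ok;
    return static_cast<double>(count.QuadPart) / static_cast<double>(freq.QuadPart);
#else
    // Fallback for non-Windows builds (an assumption of this sketch).
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
#endif
}

// Runs `work` once and returns the elapsed wall-clock time in seconds.
template <typename F>
double time_seconds(F&& work)
{
    const double t0 = now_seconds();
    work();
    return now_seconds() - t0;
}
```

You would wrap each loop of the test in a lambda and pass it to time_seconds, then print the returned value in place of duration.count().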

SIMD performance is difficult to measure in these micro tests because the overhead of loading up vector data can often dominate with such trivial ALU usage. You really need to do something substantial with the data to see the benefits. Also keep in mind that depending on your compiler settings, the compiler itself may be using the 'scalar' SIMD functionality or even auto-vectorizing such trivial code loops.

You are also seeing some issues with the way you are generating your test data. You should create something larger than a single vector on the heap and use that as your source/dest.
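A minimal sketch of that kind of setup, with assumptions called out: Float4 is a hypothetical stand-in for XMFLOAT4 so the snippet builds without DirectXMath, and make_ramp, add_arrays and checksum are made-up helper names. In the real test the loop body would be the XMLoadFloat4 / XMVectorAdd / XMStoreFloat4 sequence from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for XMFLOAT4 (four plain floats).
struct Float4 { float x, y, z, w; };

// Builds n heap-allocated elements so the data is not a single
// compile-time-known vector that the optimizer can fold away.
inline std::vector<Float4> make_ramp(std::size_t n, float base)
{
    std::vector<Float4> v(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        const float f = base + static_cast<float>(i);
        v[i] = Float4{ f, f + 1.0f, f + 2.0f, f + 3.0f };
    }
    return v;
}

// dst[i] = a[i] + b[i]: every iteration reads from and writes to heap memory.
inline void add_arrays(const std::vector<Float4>& a,
                       const std::vector<Float4>& b,
                       std::vector<Float4>& dst)
{
    assert(a.size() == dst.size() && b.size() == dst.size());
    for (std::size_t i = 0; i < dst.size(); ++i)
    {
        dst[i] = Float4{ a[i].x + b[i].x, a[i].y + b[i].y,
                         a[i].z + b[i].z, a[i].w + b[i].w };
    }
}

// Folds the results into one number; printing it after the timed loop
// gives the stores an observable effect.
inline double checksum(const std::vector<Float4>& v)
{
    double sum = 0.0;
    for (const Float4& f : v)
    {
        sum += f.x + f.y + f.z + f.w;
    }
    return sum;
}
```

Time the add_arrays call over a large array (large enough that it does not all sit in cache, if that is what you want to measure) and print the checksum afterwards so the work cannot be discarded.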

PS: The best way to create 'static' XMVECTOR data is to use the XMVECTORF32 type.

static const DirectX::XMVECTORF32 v1 = { 1, 2, 3, 4 };

Note that if you want the load/store conversions between XMVECTOR and XMFLOATx to be "automatic", take a look at SimpleMath in the DirectX Tool Kit. You just use types like SimpleMath::Vector4 in your data structures, and the implicit conversion operators take care of calling XMLoadFloat4 / XMStoreFloat4 for you.

Maretz answered 26/6, 2014 at 18:37 Comment(3)

The bug associated with std::high_resolution_clock appears to be in the nanosecond range. My test results are up in the range of seconds... I think I'm good on that front. As for the significant amount of time spent loading, that was the point. I needed to test the performance benefits of using XMVECTOR vs. writing operator+, etc. functions for XMFLOAT4, i.e. whether I could be lazy and not have to deal with converting back and forth. – Aspersion 7/7, 2014 at 7:32

FWIW, I ended up using XMVECTOR as the return type for operator+, etc. for XMFLOAT*. That, coupled with auto rendered this test unnecessary. – Aspersion 7/7, 2014 at 7:36

I would rather suggest to use kernel time instead of wall clock time (i.e. GetProcessTimes). – Horripilation 14/6, 2017 at 12:50

If you define

DirectX::XMVECTOR v3 = DirectX::XMVectorSet(2, 3, 4, 5);

and use v3 instead of v1 as the result: ...

    for (uint64_t i = 0; i < TEST_COUNT; i++)
    {
        v3 = DirectX::XMVectorAdd(v1, v2);
    }

you get code that is faster than the second part, which uses XMLoadFloat4 and XMStoreFloat4.

Queasy answered 15/12, 2014 at 7:7 Comment(1)

Again, static const DirectX::XMVECTORF32 v3 = { 2, 3, 4, 5 }; is better than XMVectorSet if you have a constant value (i.e. you are making a vectorized constant). – Maretz 14/6, 2017 at 16:51
