C code loop performance [continued]
This question continues my earlier question (on the advice of Mystical):

C code loop performance


Continuing from that question: when I use packed instructions instead of scalar instructions, the code using intrinsics looks very similar:

for(int i=0; i<size; i+=16) {
    y1 = _mm_load_ps(&output[i]);
    …
    y4 = _mm_load_ps(&output[i+12]);

    for(k=0; k<ksize; k++){
        for(l=0; l<ksize; l++){
            w  = _mm_set_ps1(weight[i+k+l]);

            x1 = _mm_load_ps(&input[i+k+l]);
            y1 = _mm_add_ps(y1,_mm_mul_ps(w,x1));
            …
            x4 = _mm_load_ps(&input[i+k+l+12]);
            y4 = _mm_add_ps(y4,_mm_mul_ps(w,x4));
        }
    }
    _mm_store_ps(&output[i],y1);
    …
    _mm_store_ps(&output[i+12],y4);
}

The measured performance of this kernel is about 5.6 FP operations per cycle, although I would expect it to be exactly 4x the performance of the scalar version, i.e. 4 × 1.6 = 6.4 FP ops per cycle.

Taking the move of the weight factor into account (thanks for pointing that out), the schedule looks like:

[schedule diagram]

It looks like the schedule doesn't change, although there is an extra instruction after the movss operation: movss moves the scalar weight value into the XMM register, and shufps then broadcasts this scalar value across the entire vector. It seems the weight vector is ready in time for the mulps, even taking the bypass latency from the load domain to the floating-point domain into account, so this shouldn't incur any extra latency.

The movaps (aligned, packed move), addps, and mulps instructions used in this kernel (checked against the assembly) have the same latency and throughput as their scalar counterparts, so they shouldn't incur any extra latency either.

Does anybody have an idea where this extra cycle per 8 cycles is spent, given that the maximum performance this kernel could reach is 6.4 FP ops per cycle while it runs at 5.6 FP ops per cycle?


By the way, here is what the actual assembly looks like:

…
Block x: 
  movapsx  (%rax,%rcx,4), %xmm0
  movapsx  0x10(%rax,%rcx,4), %xmm1
  movapsx  0x20(%rax,%rcx,4), %xmm2
  movapsx  0x30(%rax,%rcx,4), %xmm3
  movssl  (%rdx,%rcx,4), %xmm4
  inc %rcx
  shufps $0x0, %xmm4, %xmm4               {fill weight vector}
  cmp $0x32, %rcx 
  mulps %xmm4, %xmm0 
  mulps %xmm4, %xmm1
  mulps %xmm4, %xmm2 
  mulps %xmm3, %xmm4
  addps %xmm0, %xmm5 
  addps %xmm1, %xmm6 
  addps %xmm2, %xmm7 
  addps %xmm4, %xmm8 
  jl 0x401ad6 <Block x> 
…
Globuliferous answered 4/4, 2012 at 8:11 Comment(21)
So I guess the question now is: "Why does the shufps instruction add 1 cycle every 1.6 iterations?" That's a tough one...Avunculate
I would expect it to have no overhead, since the output of the shufps should be directly available to the mulps op; both are in the FP domainGlobuliferous
Easy to find out. Make sure that the weight vector does not contain any denormalized values. Try the loop without the shuffle instruction. It will not produce any useful results, but maybe you will find which instruction costs you the additional cycles (I suspect the shuffle, of course).Longoria
@Mystical: I see 0.75 cycles per loop iteration added. (Wasn't it my comment about using 5 cycles instead of 4 which led you to your answer there... :-))Longoria
@drhirsch Of course everyone is afraid of denormalized values... Another thing to try is to replace the weight vector with SIMD blocks of identical values. That'll let you do a normal load and not need to shuffle.Avunculate
@DanLeakin It would be helpful if you posted the actual cycle counts as measured instead of the basically useless Flops/cycle value, instead of letting us deduce it.Longoria
@drhirsch Yeah, your comment did indeed tip me in the right direction. :) This one is harder though... Hard to suspect anything in particular. There's too much in a modern CPU. :PAvunculate
@Mystical Actually the answer you have given there was my very first thought. 5 loads - 5 cycles - easy to see the coincidence. But then I remembered that my current SB is able to do 2 loads per cycle, ignoring the fact that the question was about a Nehalem and so I decided this could be the answer :-)Longoria
@drhirsch Yeah, I also hesitated because I thought scalar loads could be multiple issue on Nehalem. Apparently I was wrong when I took a look at Agner's tables. Nehalem isn't able to split its 128-bit/cycle load bandwidth the way that SB can split its 256-bit/cycle into dual-issue SSE loads.Avunculate
OK, I tried to remove the shufps by using a load instruction, but the performance didn't increase, which in my opinion means the shufps isn't the bad guy here. Any other explanations? Maybe the packed movaps instructions have some extra latency from cache effects (misses, misalignment) that isn't there with the movss instructions in the scalar version?Globuliferous
For one, now you're demanding 4x the cache bandwidth. How large are the data sizes? Do they fit into the L1 cache?Avunculate
Yes the data fits in the L1 cacheGlobuliferous
@DanLeakin Could you move the load out of the loop and just remove the shufps completely? So that you have basically the same code, but every scalar instruction is replaced by a vector instruction?Longoria
When moving the load out of the loop and thus removing the shufps instruction from every iteration, the performance remains almost the same (it goes up a little because one load is gone), so I assume it is caused by the cacheGlobuliferous
This is not exactly an answer to your question, but can't you use dpps?Parenteral
Are you using FTZ (flush-to-zero) and DAZ mode?Amphisbaena
I don't use FTZ or DAZ. @Necrolis thanks for the link, I'll check into thatGlobuliferous
If possible, I would use Intel Inspector (or its predecessor - VTune Performance Analyzer) to see where exactly performance is stalled.Geyer
I already analyzed the code using VTune, but in my opinion this doesn't give much insight into the performance bottleneck at cycle levelGlobuliferous
Do you have any sample data we can run to test it out ourselves? (Or a simple way of generating similar data.)Puccoon
of course, just precede the for loop with a loop initializing some values like for(i=0;i<2*size;i++) input[i] = i/3; output[i] = i/5; weight[i] = i/8; and keep the ksize in the loop low (mine is 6)Globuliferous
Try using EMON profiling in VTune, or some equivalent tool like OProfile

EMON (Event Monitoring) profiling is like time-based profiling, but it can tell you which performance event is causing the problem. That said, you should start with a time-based profile first, to see whether a particular instruction jumps out. (And possibly the related events that tell you how often there was a retirement stall at that IP.)

To use EMON profiling, you must run through a list of events, ranging from "the usual suspects" to ...

Here, I would start off with cache misses and alignment. I do not know whether the processor you are using has a counter for RF port limitations (it should), but I added EMON profiling long ago, and I don't know how well they have kept up with adding events appropriate for each microarchitecture.

It may also be possible that it is a front-end (instruction fetch) stall. How many bytes are in these instructions, anyway? There are EMON events for that, too.


Responding to the comment that VTune on Nehalem can't see L3 events: not true. Here is the material I was adding to the comment, but it did not fit:

Actually, there ARE performance counters for the LL3 / L3$ / so-called Uncore. I would be immensely surprised if VTune doesn't support them. See http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf, which covers VTune and other tools such as PTU.

In fact, even without LL3 events, as David Levinthal says: "the Intel® Core™ i7 processor has a “latency event” which is very similar to the Itanium® Processor Family Data EAR event. This event samples loads, recording the number of cycles between the execution of the instruction and actual deliver of the data. If the measured latency is larger than the minimum latency programmed into MSR 0x3f6, bits 15:0, then the counter is incremented. Counter overflow arms the PEBS mechanism and on the next event satisfying the latency threshold, the measured latency, the virtual or linear address and the data source are copied into 3 additional registers in the PEBS buffer. Because the virtual address is captured into a known location, the sampling driver could also execute a virtual to physical translation and capture the physical address. The physical address identifies the NUMA home location and in principle allows an analysis of the details of the cache occupancies."

He also points, on page 35, to VTune events such as L3 CACHE_HIT_UNCORE_HIT and L3 CACHE_MISS_REMOTE_DRAM. Sometimes you need to look up the numeric codes and program them into VTune's lower-level interface, but I think in this case it is visible in the pretty user interface.


OK, in http://software.intel.com/en-us/forums/showthread.php?t=77700&o=d&s=lr a VTune programmer in Russia (I think) "explains" that you can't sample on Uncore events.

He's wrong - you could, for example, enable only one CPU and sample meaningfully. I also believe there is the ability to mark L3-missing data as it returns to the CPU. In fact, overall the L3 knows which CPU it is returning data to, so you can definitely sample. You may not know which hyperthread, but again, you can disable hyperthreading and go into single-thread mode.

But it looks like, as is rather common, you would have to work AROUND VTune, not with it, to do this.

Try latency profiling first. That's entirely inside the CPU, and the VTune folks are unlikely to have messed it up too much.

And, I say again, the likelihood is that your problem is in the core, not in the L3. So VTune should be able to handle that.


Try "Cycle Accounting" per Levinthal.

Capsule answered 17/4, 2012 at 22:14 Comment(9)
Thanks for your reaction. I use VTune to analyze my application, but the problem with the Nehalem architecture is that the L3 cache belongs to the off-core ("uncore") part of the chip, so there are no core performance event counters available for it. Therefore it is hard to estimate cache misses, etcetera.Globuliferous
Actually, there ARE performance counters for the LL3 / L3$ / so-called Uncore. I would be immensely surprised if VTune doesn't support them. See software.intel.com/sites/products/collateral/hpc/vtune/…Capsule
I wrote more than would fit in a comment, tried to move it to the answer and clean up the original comment, but comments can only be edited for 5 minutes. Short version: VTune allows you to see L3 cache misses. Even without Uncore support, using latency profiling - and it has Uncore support.Capsule
And overall I suspect that your problem is not L3 cache misses. More likely a front end event.Capsule
@KrazyGlew: Your guess is right, he is a Russian guy from Russian Federation. Here is his profile on LinkedIn - linkedin.com/in/vtsymbalCusp
@Vlad_Lazarenko: By the way, I certainly do not mean to diss Vlad Tsymbal. In general, Intel's Russian teams were great to work with. I did, however, let my decades spanning frustration with VTune show. A good performance analyst always thinks about disabling stuff in order to measure stuff like L3 cache misses, if that's what it takes. VTune is supposed to encapsulate the knowledge of a good performance analyst.Capsule
// As for the hardware not allowing attribution of LLC misses to CPU - that's silly. Either VTune missed something or the HW missed obvious fixes: (a) there should be marking, and (b) the information is there in the hardware, since the cache miss must be routed back to the correct requesting CPU.Capsule
Indeed, Levinthal says no attribution. Unfortunate. But this is the way HW/SW codevelopment works: if the VTune guys provide software to do attribution by disabling cores and threads, then perhaps there will be justification for HW to do a better job next time. // BTW intel.com/Assets/en_US/PDF/designguide/323535.pdf says that you can do MEM_LOAD_RETIRED.LLC_MISS PEBS profiling, so there is yet another way to measure LLC misses.Capsule
... and use in a profile. Better yet, with PEBs you know where the (non-speculative) miss actually occurred.Capsule
