Understanding Java 17 Vector API slowness with the pow operator
I have a question relating to the pow() function in Java 17's new Vector API. I'm trying to implement the Black-Scholes formula in a vectorized manner, but I'm having difficulty matching the performance of the scalar implementation.

The code is as follows:

  1. I create an array of doubles (currently, just 5.0)
  2. I loop over elements of that array (different looping syntax for scalar and vector)
  3. I create DoubleVectors from those double arrays and do the calculations (for the scalar version, just the calculations). I am trying to compute e^(value), and I believe that is the problem.

Here are some code snippets:

    public static double[] createArray(int arrayLength)
    {
        double[] array0 = new double[arrayLength];
        for(int i=0;i<arrayLength;i++)
        {
            array0[i] = 2.0;
        }
        return array0;
    } 
    @Param({"256000"})
    int arraySize;
    public static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;
    DoubleVector vectorTwo =  DoubleVector.broadcast(SPECIES,2);
    DoubleVector vectorHundred =  DoubleVector.broadcast(SPECIES,100);

    double[] scalarTwo = new double[]{2,2,2,2};
    double[] scalarHundred  = new double[]{100,100,100,100};

    @Setup
    public void Setup()
    {
        javaSIMD = new JavaSIMD();
        javaScalar = new JavaScalar();
        spotPrices = createArray(arraySize);
        timeToMaturity = createArray(arraySize);
        strikePrice = createArray(arraySize);
        interestRate = createArray(arraySize);
        volatility = createArray(arraySize);
        e = new double[arraySize];
        for(int i=0;i<arraySize;i++)
        {
            e[i] = Math.exp(1);
        }
        upperBound = SPECIES.loopBound(spotPrices.length);
    }
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public void testVectorPerformance(Blackhole bh) {
        var upperBound = SPECIES.loopBound(spotPrices.length);
        for (var i=0;i<upperBound; i+= SPECIES.length())
        {
            bh.consume(javaSIMD.calculateBlackScholesSingleCalc(spotPrices,timeToMaturity,strikePrice,
                    interestRate,volatility,e, i));
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public void testScalarPerformance(Blackhole bh) {
        for(int i=0;i<arraySize;i++)
        {
            bh.consume(javaScalar.calculateBlackScholesSingleCycle(spotPrices,timeToMaturity,strikePrice,
                    interestRate,volatility, i,normDist));
        }
    }
    public DoubleVector calculateBlackScholesSingleCalc(double[] spotPrices, double[] timeToMaturity, double[] strikePrice,
                                                        double[] interestRate, double[] volatility, double[] e,int i){
...(skip lines)
        DoubleVector vSpot = DoubleVector.fromArray(SPECIES, spotPrices, i);
...(skip lines)
        DoubleVector powerOperand = vRateScaled
                .mul(vTime)
                .neg();
        DoubleVector call  = (vSpot
                .mul(CDFVectorizedExcelOptimized(d1,vE)))
                .sub(vStrike
                .mul(vE
                        .pow(powerOperand))
                .mul(CDFVectorizedExcelOptimized(d2,vE)));
        return call;
    }
Here are some JMH benchmark results (2 forks, 2 warmups, 2 iterations) on a Ryzen 5800X under WSL. Overall, the vector version seems roughly 2x slower than the scalar one. I also timed the method separately with a simple before/after measurement outside JMH, and the numbers are in line with these:

Result "blackScholes.TestJavaPerf.testScalarPerformance":
  0.116 ±(99.9%) 0.002 ops/ms [Average]
       89873915287      cycles:u                  #    4.238 GHz                      (40.43%)
      242060738532      instructions:u            #    2.69  insn per cycle   

      
Result "blackScholes.TestJavaPerf.testVectorPerformance":
  0.071 ±(99.9%) 0.001 ops/ms [Average]
       90878787665      cycles:u                  #    4.072 GHz                      (39.25%)
      254117779312      instructions:u            #    2.80  insn per cycle  

I also enabled diagnostic options for the JVM ("-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintIntrinsics", "-XX:+PrintAssembly") and see the following:
  0x00007fe451959413:   call   0x00007fe451239f00           ; ImmutableOopMap {rsi=Oop }
                                                            ;*synchronization entry
                                                            ; - jdk.incubator.vector.DoubleVector::arrayAddress@-1 (line 3283)
                                                            ;   {runtime_call counter_overflow Runtime1 stub}
  0x00007fe451959418:   jmp    0x00007fe4519593ce
  0x00007fe45195941a:   movabs $0x7fe4519593ee,%r10         ;   {internal_word}
  0x00007fe451959424:   mov    %r10,0x358(%r15)
  0x00007fe45195942b:   jmp    0x00007fe451193100           ;   {runtime_call SafepointBlob}
  0x00007fe451959430:   nop
  0x00007fe451959431:   nop
  0x00007fe451959432:   mov    0x3d0(%r15),%rax
  0x00007fe451959439:   movq   $0x0,0x3d0(%r15)
  0x00007fe451959444:   movq   $0x0,0x3d8(%r15)
  0x00007fe45195944f:   add    $0x40,%rsp
  0x00007fe451959453:   pop    %rbp
  0x00007fe451959454:   jmp    0x00007fe451231e80           ;   {runtime_call unwind_exception Runtime1 stub}
  0x00007fe451959459:   hlt    
<More halts cut off>   
[Exception Handler]
  0x00007fe451959460:   call   0x00007fe451234580           ;   {no_reloc}
  0x00007fe451959465:   movabs $0x7fe46e76df9a,%rdi         ;   {external_word}
  0x00007fe45195946f:   and    $0xfffffffffffffff0,%rsp
  0x00007fe451959473:   call   0x00007fe46e283d40           ;   {runtime_call}
  0x00007fe451959478:   hlt    
[Deopt Handler Code]
  0x00007fe451959479:   movabs $0x7fe451959479,%r10         ;   {section_word}
  0x00007fe451959483:   push   %r10
  0x00007fe451959485:   jmp    0x00007fe4511923a0           ;   {runtime_call DeoptimizationBlob}
  0x00007fe45195948a:   hlt    
<More halts cut off>
--------------------------------------------------------------------------------

============================= C2-compiled nmethod ==============================
  ** svml call failed for double_pow_32
                                            @ 3   jdk.internal.misc.Unsafe::loadFence (0 bytes)   (intrinsic)
                                            @ 3   jdk.internal.misc.Unsafe::loadFence (0 bytes)   (intrinsic)
                                          @ 2   java.lang.Math::pow (6 bytes)   (intrinsic)

Investigations/Questions:

  1. I'm writing different implementations of the formula, so they are not 1:1 - could this be the cause? Looking at the number of instructions according to JMH, there is roughly a 12 billion difference between the two. With vectorization the processor also runs at a lower clock rate.
  2. Is the choice of input numbers a problem? I've tried i + 10/(array.length) as well.
  3. Is there a reason the SVML call fails for double_pow_32? I don't see this problem for smaller input array sizes, by the way.
  4. I changed the pow to a mul (for both implementations; obviously the equation is now very different), and the vector version becomes much faster, with scalar vs vector results as expected.

Note: I believe it is using 256-bit wide vectors (checked while debugging).
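For reference, a small self-contained sketch to confirm that directly (class name is arbitrary; assumes the incubator module is added with --add-modules jdk.incubator.vector):

    import jdk.incubator.vector.DoubleVector;
    import jdk.incubator.vector.VectorSpecies;

    // Quick check of what SPECIES_PREFERRED resolves to on this machine.
    public class SpeciesInfo {
        static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

        public static void main(String[] args) {
            System.out.println(SPECIES);                              // e.g. something like Species[double, 4, S_256_BIT] on an AVX2 machine
            System.out.println("lanes: " + SPECIES.length());         // doubles processed per vector
            System.out.println("bits:  " + SPECIES.vectorBitSize());  // vector register width in bits
        }
    }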

Lueck answered 10/10, 2022 at 7:4 Comment(7)
I can just add that I don't see any performance increases at all from using the Vector API, rather the reverse. – Inoculation
EDIT: for point 4, here are the results:
Result "blackScholes.TestJavaPerf.testScalarPerformance":
  0.181 ±(99.9%) 0.020 ops/ms [Average]
  (min, avg, max) = (0.177, 0.181, 0.184), stdev = 0.003
  CI (99.9%): [0.161, 0.202] (assumes normal distribution)
Result "blackScholes.TestJavaPerf.testVectorPerformance":
  0.302 ±(99.9%) 0.007 ops/ms [Average]
  (min, avg, max) = (0.301, 0.302, 0.304), stdev = 0.001
  CI (99.9%): [0.296, 0.309] (assumes normal distribution)
– Lueck
@Inoculation could you clarify the instances where you saw regressions with Java 17's Vector API? Sorry, just curious. – Lueck
I did some basic experiments with double arrays of various sizes. I tried out some basic algorithms and basic operations just to see what kind of performance changes would appear. What I saw was a consistent performance decrease when using the Vector API. – Inoculation
I'm not familiar with Java's Vector implementation, but if you want to calculate e^something, there should also be an exp(something) (maybe something.exp()) method for that, which should be faster than a generic pow implementation. Maybe also something like exp2. – Joel
@Joel indeed. While there is no exp() method in DoubleVector, replacing vE.pow(powerOperand) with powerOperand.lanewise(VectorOperators.EXP) should do. – Bite
@Bite that worked!
blackScholes.TestJavaPerf.testScalarPerformance  256000  thrpt  4  0.116 ± 0.003 ops/ms
blackScholes.TestJavaPerf.testVectorPerformance  256000  thrpt  4  0.237 ± 0.005 ops/ms
So now with vector we're getting roughly double the operations per second of scalar, which is much better than before. Do you want to write the answer so I can mark it as the solution, or should I write it? – Lueck

This might be related to JDK-8262275, Math vector stubs are not called for double64 vectors

For Double64Vector, the svml math vector stubs intrinsification is failing and they are not being called from jitted code.
But we do have svml double64 vectors.

You might try alternative operations: e.g. instead of vE.pow(powerOperand), with vE being a vector of e, you can use powerOperand.lanewise(VectorOperators.EXP) to compute e^x for all lanes.
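
As a minimal, self-contained sketch of the two variants (the exponent values are just illustrative; run with --add-modules jdk.incubator.vector on JDK 17):

    import jdk.incubator.vector.DoubleVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class ExpVsPow {
        static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

        public static void main(String[] args) {
            // Illustrative exponents (in the question this would be the -rate * time vector).
            double[] exponents = new double[SPECIES.length()];
            for (int i = 0; i < exponents.length; i++) exponents[i] = -0.05 * (i + 1);

            DoubleVector x = DoubleVector.fromArray(SPECIES, exponents, 0);

            // Generic pow with a broadcast base of e (what the question's code does):
            DoubleVector viaPow = DoubleVector.broadcast(SPECIES, Math.E).pow(x);

            // Dedicated per-lane exponential (the suggested replacement):
            DoubleVector viaExp = x.lanewise(VectorOperators.EXP);

            System.out.println(viaPow);
            System.out.println(viaExp);
        }
    }

The lanewise EXP also lets you avoid the generic pow path that produced the "svml call failed for double_pow_32" message in the question.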

Keep in mind that this API is a work in progress, still in incubator state…

Bite answered 10/10, 2022 at 15:25 Comment(0)
