I have a question relating to the pow() function in Java 17's new (incubating) Vector API. I'm trying to implement the Black-Scholes formula in a vectorized manner, but I'm having difficulty matching the performance of the scalar implementation.
The code is as follows:
- I create an array of doubles (currently, every element is just 5.0)
- I loop over the elements of that array (with different looping syntax for the scalar and vector versions)
- Inside the loop I create DoubleVectors from the double arrays and do the calculations (or just the calculations in the scalar case). I am trying to compute e^(value), and I believe that is the problem.
Here are some code snippets:
public static double[] createArray(int arrayLength)
{
    double[] array0 = new double[arrayLength];
    for (int i = 0; i < arrayLength; i++)
    {
        array0[i] = 2.0;
    }
    return array0;
}
@Param({"256000"})
int arraySize;
public static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;
DoubleVector vectorTwo = DoubleVector.broadcast(SPECIES, 2);
DoubleVector vectorHundred = DoubleVector.broadcast(SPECIES, 100);
double[] scalarTwo = new double[]{2, 2, 2, 2};
double[] scalarHundred = new double[]{100, 100, 100, 100};
@Setup
public void Setup()
{
    javaSIMD = new JavaSIMD();
    javaScalar = new JavaScalar();
    spotPrices = createArray(arraySize);
    timeToMaturity = createArray(arraySize);
    strikePrice = createArray(arraySize);
    interestRate = createArray(arraySize);
    volatility = createArray(arraySize);
    e = new double[arraySize];
    for (int i = 0; i < arraySize; i++)
    {
        e[i] = Math.exp(1);
    }
    upperBound = SPECIES.loopBound(spotPrices.length);
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public void testVectorPerformance(Blackhole bh) {
    var upperBound = SPECIES.loopBound(spotPrices.length);
    for (var i = 0; i < upperBound; i += SPECIES.length())
    {
        bh.consume(javaSIMD.calculateBlackScholesSingleCalc(spotPrices, timeToMaturity, strikePrice,
                interestRate, volatility, e, i));
    }
}
@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public void testScalarPerformance(Blackhole bh) {
    for (int i = 0; i < arraySize; i++)
    {
        bh.consume(javaScalar.calculateBlackScholesSingleCycle(spotPrices, timeToMaturity, strikePrice,
                interestRate, volatility, i, normDist));
    }
}
public DoubleVector calculateBlackScholesSingleCalc(double[] spotPrices, double[] timeToMaturity, double[] strikePrice,
                                                    double[] interestRate, double[] volatility, double[] e, int i) {
    ...(skip lines)
    DoubleVector vSpot = DoubleVector.fromArray(SPECIES, spotPrices, i);
    ...(skip lines)
    DoubleVector powerOperand = vRateScaled
            .mul(vTime)
            .neg();
    DoubleVector call = (vSpot
            .mul(CDFVectorizedExcelOptimized(d1, vE)))
            .sub(vStrike
                    .mul(vE
                            .pow(powerOperand))
                    .mul(CDFVectorizedExcelOptimized(d2, vE)));
    return call;
}
Here are some JMH benchmarks (2 forks, 2 warmups, 2 iterations) on a Ryzen 5800X under WSL. Overall, the vector version is ~2x slower than the scalar version. I also ran a simple before/after timing of the method without JMH, and the numbers are in line with this.
Result "blackScholes.TestJavaPerf.testScalarPerformance":
0.116 ±(99.9%) 0.002 ops/ms [Average]
89873915287 cycles:u # 4.238 GHz (40.43%)
242060738532 instructions:u # 2.69 insn per cycle
Result "blackScholes.TestJavaPerf.testVectorPerformance":
0.071 ±(99.9%) 0.001 ops/ms [Average]
90878787665 cycles:u # 4.072 GHz (39.25%)
254117779312 instructions:u # 2.80 insn per cycle
I also enabled diagnostic options for the JVM ("-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintIntrinsics", "-XX:+PrintAssembly") and see the following:
0x00007fe451959413: call 0x00007fe451239f00 ; ImmutableOopMap {rsi=Oop }
;*synchronization entry
; - jdk.incubator.vector.DoubleVector::arrayAddress@-1 (line 3283)
; {runtime_call counter_overflow Runtime1 stub}
0x00007fe451959418: jmp 0x00007fe4519593ce
0x00007fe45195941a: movabs $0x7fe4519593ee,%r10 ; {internal_word}
0x00007fe451959424: mov %r10,0x358(%r15)
0x00007fe45195942b: jmp 0x00007fe451193100 ; {runtime_call SafepointBlob}
0x00007fe451959430: nop
0x00007fe451959431: nop
0x00007fe451959432: mov 0x3d0(%r15),%rax
0x00007fe451959439: movq $0x0,0x3d0(%r15)
0x00007fe451959444: movq $0x0,0x3d8(%r15)
0x00007fe45195944f: add $0x40,%rsp
0x00007fe451959453: pop %rbp
0x00007fe451959454: jmp 0x00007fe451231e80 ; {runtime_call unwind_exception Runtime1 stub}
0x00007fe451959459: hlt
<More halts cut off>
[Exception Handler]
0x00007fe451959460: call 0x00007fe451234580 ; {no_reloc}
0x00007fe451959465: movabs $0x7fe46e76df9a,%rdi ; {external_word}
0x00007fe45195946f: and $0xfffffffffffffff0,%rsp
0x00007fe451959473: call 0x00007fe46e283d40 ; {runtime_call}
0x00007fe451959478: hlt
[Deopt Handler Code]
0x00007fe451959479: movabs $0x7fe451959479,%r10 ; {section_word}
0x00007fe451959483: push %r10
0x00007fe451959485: jmp 0x00007fe4511923a0 ; {runtime_call DeoptimizationBlob}
0x00007fe45195948a: hlt
<More halts cut off>
--------------------------------------------------------------------------------
============================= C2-compiled nmethod ==============================
** svml call failed for double_pow_32
@ 3 jdk.internal.misc.Unsafe::loadFence (0 bytes) (intrinsic)
@ 3 jdk.internal.misc.Unsafe::loadFence (0 bytes) (intrinsic)
@ 2 java.lang.Math::pow (6 bytes) (intrinsic)
Investigations/Questions:
- I'm writing different implementations of the formula, so they are not 1:1 - could this be the cause? Looking at the instruction counts reported alongside the JMH runs, there is roughly a 12 billion difference in the number of instructions. With vectorization the processor also runs at a lower clock rate.
- Is the choice of input numbers a problem? I've tried i + 10/(array.length) as well.
- Is there a reason the SVML call fails for double_pow_32? I don't see this problem for smaller input array sizes, by the way.
- I changed the pow to a mul (for both; obviously the equation is now very different) and it is much faster as a result, with scalar vs. vector results as expected.
Note: I believe it is using 256-bit wide vectors (checked during debugging).
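As a scalar point of reference (helper names here are mine, purely illustrative): Math.exp is the dedicated primitive for e^x, and it agrees with the Math.pow(Math.E, x) pattern to within floating-point rounding, which is what makes exp a drop-in substitute for base-e pow:

```java
// Minimal sketch: e^x computed two ways. Math.exp(x) is the dedicated
// base-e exponential; Math.pow(Math.E, x) goes through the generic pow
// path. The class and method names are illustrative, not from the
// original benchmark code.
public class ExpVsPow {
    static double viaPow(double x) { return Math.pow(Math.E, x); }
    static double viaExp(double x) { return Math.exp(x); }

    public static void main(String[] args) {
        for (double x : new double[]{-1.0, 0.0, 0.5, 2.0}) {
            // The two routes agree to within rounding error.
            double diff = Math.abs(viaPow(x) - viaExp(x));
            if (diff > 1e-12 * Math.max(1.0, viaExp(x))) {
                throw new AssertionError("mismatch at x=" + x);
            }
        }
        System.out.println("ok");
    }
}
```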
Comments:
- e^something - there should also be an exp(something) (maybe something.exp()) method for that, which should be faster than a generic pow implementation. Maybe also something like exp2. – Joel
- There is an exp() method in DoubleVector; replacing vE.pow(powerOperand) with powerOperand.lanewise(VectorOperators.EXP) should do. – Bite