The version below should be faster than your initial version that uses `_mm256_div_ps`. `vdivps` is quite slow, e.g. on my Haswell Xeon it's 18-21 cycles latency, 14 cycles throughput. Newer CPUs perform better, BTW: it's 11 cycles latency / 5 cycles throughput on Skylake, 10/6 on Ryzen.
As said in the comments, the performance is fixable by replacing the divide with a multiply, and improved further with FMA. The problem with that approach is the quality of the distribution: if you try to force these numbers into your output interval with a rounding mode or by clipping, you'll introduce peaks in the probability distribution of the output numbers.
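For reference, the multiply-based fix mentioned above looks something like this. This is only a sketch of the idea from the comments, not your original code: `randomFloats_mul` is a made-up name, and the `0x1p-31f` scale assumes the input is treated as a signed 32-bit integer, so the output lands in [-1 .. 1) and you'd still have to map it into your target interval, which is exactly where the distribution problems described above come in.

#include <immintrin.h>

__m256 __vectorcall randomFloats_mul( __m256i randomBits )
{
	// Signed 32-bit integers become floats in [ -2^31 .. 2^31 ), then one multiply
	// by 2^-31 scales them into [ -1 .. 1 ). No vdivps involved, just cvtdq2ps + mulps.
	// Note the convert rounds inputs outside the +-2^24 range, as discussed in the comments.
	const __m256 floats = _mm256_cvtepi32_ps( randomBits );
	return _mm256_mul_ps( floats, _mm256_set1_ps( 0x1p-31f ) );
}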
My implementation is not ideal either: it doesn't produce all possible values in the output interval and skips many representable floats, especially near 0. But at least the distribution is very even.
#include <immintrin.h>

__m256 __vectorcall randomFloats( __m256i randomBits )
{
// Convert to random float bits
__m256 result = _mm256_castsi256_ps( randomBits );
// Zero out exponent bits, leave random bits in mantissa.
// BTW, since the mask value is a compile-time constant, we don't actually need AVX2 instructions for this; it's just easier to code with set1_epi32.
const __m256 mantissaMask = _mm256_castsi256_ps( _mm256_set1_epi32( 0x007FFFFF ) );
result = _mm256_and_ps( result, mantissaMask );
// Set sign + exponent bits to that of 1.0, which is sign=0, exponent=2^0.
const __m256 one = _mm256_set1_ps( 1.0f );
result = _mm256_or_ps( result, one );
// Subtract 1.0. The above algorithm generates floats in range [1..2).
// Can't use bit tricks to generate floats in [0..1) because it would cause them to be distributed very unevenly.
return _mm256_sub_ps( result, one );
}
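Usage is just a matter of feeding the function 256 random bits. A minimal sketch, using `_rdrand32_step` purely as a placeholder entropy source (it needs RDRAND hardware support; any vectorized PRNG you already have is a better fit, and the variable names here are made up for the example):

#include <immintrin.h>
#include <cstdio>

int main()
{
	// Gather 8 random 32-bit values. RDRAND is slow but short to write; substitute your PRNG.
	alignas( 32 ) unsigned int bits[ 8 ];
	for( unsigned int& b : bits )
		while( !_rdrand32_step( &b ) ) { }
	const __m256i randomBits = _mm256_load_si256( ( const __m256i* )bits );

	// 8 uniform floats in [ 0 .. 1 )
	alignas( 32 ) float floats[ 8 ];
	_mm256_store_ps( floats, randomFloats( randomBits ) );
	for( float f : floats )
		printf( "%f\n", f );
	return 0;
}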
Update: if you want better precision, use the following version. But it’s no longer “the fastest”.
__m256 __vectorcall randomFloats_32( __m256i randomBits )
{
// Convert to random float bits
__m256 result = _mm256_castsi256_ps( randomBits );
// Zero out exponent bits, leave random bits in mantissa.
const __m256 mantissaMask = _mm256_castsi256_ps( _mm256_set1_epi32( 0x007FFFFF ) );
result = _mm256_and_ps( result, mantissaMask );
// Set sign + exponent bits to that of 1.0, which is sign=0, exponent = 2^0.
const __m256 one = _mm256_set1_ps( 1.0f );
result = _mm256_or_ps( result, one );
// Subtract 1.0. The above algorithm generates floats in range [1..2).
result = _mm256_sub_ps( result, one );
// Use the 9 unused random bits, i.e. the ones masked away from the mantissa above, to add extra randomness to the lower bits of the values.
// This increases precision to 2^-32; however, most floats in the range can't store that many bits, so fmadd will only add them for small enough values.
// If you want uniformly distributed floats with 2^-24 precision, replace the second argument in the following line with _mm256_set1_epi32( 0x80000000 ).
// In that case you don't need to set rounding mode bits in MXCSR.
// 0xFF800000 is the complement of the mantissa mask: it keeps the sign + exponent bits of the input.
__m256i extraBits = _mm256_and_si256( randomBits, _mm256_set1_epi32( (int)0xFF800000 ) );
// Shift those 9 bits into the top of the mantissa, bit positions [14..22].
extraBits = _mm256_srli_epi32( extraBits, 9 );
__m256 extra = _mm256_castsi256_ps( extraBits );
extra = _mm256_or_ps( extra, one );
extra = _mm256_sub_ps( extra, one );
// Round down so the fmadd can't round the sum up, e.g. to exactly 1.0, keeping the output in [0..1).
_MM_SET_ROUNDING_MODE( _MM_ROUND_DOWN );
constexpr float mul = 0x1p-23f; // The initial part of the algorithm has generated a uniform distribution with a step of 2^-23.
return _mm256_fmadd_ps( extra, _mm256_set1_ps( mul ), result );
}
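One caveat: `_MM_SET_ROUNDING_MODE` rewrites the rounding bits in MXCSR for the whole thread, which silently affects all later floating-point math. If that matters in your code base, save and restore the mode around the call; a sketch (`randomFloats_32_safe` is a made-up wrapper name):

#include <immintrin.h>

// Wrapper that leaves the caller's MXCSR rounding mode untouched.
__m256 randomFloats_32_safe( __m256i randomBits )
{
	const unsigned int prevRounding = _MM_GET_ROUNDING_MODE();
	const __m256 floats = randomFloats_32( randomBits );
	_MM_SET_ROUNDING_MODE( prevRounding );
	return floats;
}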
Comments:

… `0x1p-31f`. Division is generally slower or more resource-consuming than multiplication. – Olney

… `float` and use floating-point arithmetic to finish the job. But you are apparently supporting 31 bits (and a sign), so they have to be rounded. The convert instructions are designed for that, so you are not likely to do better. – Olney

… `-ffast-math` (or some part of that) to fuse the addition and multiplication. – Marilumarilyn

… `0x7fffffff` to `float`. FMA doesn't have intermediate rounding, but the convert is a problem for inputs outside the +-2^24 range where `float` can exactly represent every integer. – Adjudge

Should `1ul` (or `0x80000001`) result in `2^-32`? And should `0xffffffff` (or `0x7fffffff`) result in the nearest `float` to `n*2^-32` (i.e., `1.0f`) or the largest `float` smaller than that? – Marilumarilyn

… `__m256i`. So my code assumes they are signed because `_mm256_cvtepi32_ps()` works with signed integers, but there are no other reasons to treat them as signed. – Ultrared

… `[0, 1)` (e.g., do you care if `0x0` gets mapped to `0.5f` or `0.0f`)? And is the "exclusive" important? – Marilumarilyn

… `(0;1)` range. I just need random floats; I don't care how they are mapped to integers. However, by precision I mean that there should be as many different float numbers as possible, and they should be able to get as close to 0 as possible, because I later use this in the Box-Muller transformation for generating normally distributed numbers. – Ultrared

… `sqrt(-2log(x))` for the Box-Muller transformation, you actually need `x>0`, but `x=1` would be OK. And there might be faster ways to calculate that directly from 32 bits, without first creating a `float`. Same for calculating `sin` and `cos` of the second variable. – Marilumarilyn

… `sqrt(-2log(1-x))` – Crumple
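Since the comments end up at Box-Muller: with floats in [0..1) from the functions above, sampling `log(1-x)` instead of `log(x)` sidesteps the `log(0)` singularity, as the last comment suggests. A scalar sketch of that transform (`boxMuller` is a made-up helper name; a vectorized version would use whatever `log`/`sin`/`cos` approximations you already have):

#include <cmath>

// Turns two uniform floats in [ 0 .. 1 ) into two standard normal samples.
// 1.0f - u1 is in ( 0 .. 1 ], so logf never receives zero.
void boxMuller( float u1, float u2, float& z0, float& z1 )
{
	const float r = sqrtf( -2.0f * logf( 1.0f - u1 ) );
	const float theta = 6.283185307f * u2; // 2 * pi
	z0 = r * cosf( theta );
	z1 = r * sinf( theta );
}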