Micro-optimization Questions
2
Solved
I'm studying x86-64 NASM and here is the current situation:
This code is for education only, not for running on a client-facing system or the like.
RCX holds loop count, between 1 and 1000.
At the beginnin...
Improbability asked 8/9 at 15:50
2
Solved
I'm running BenchmarkDotNet against the following code on .NET 8:
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
[StructLayout(LayoutKind.Ex...
Maricruzmaridel asked 29/7 at 14:18
3
Solved
The context of this question is the creation of a side-channel resistant implementation of an IEEE-754-compliant single-precision square root for a 32-bit RISC-V platform without hardware support fo...
Esophagus asked 7/6 at 1:20
3
Solved
Is there any data on AVX2 gather latency?
(for instance a _mm256_i32gather_ps instruction accessing a single cache line)
Parrott asked 22/7, 2013 at 14:18
1
Solved
This question was inspired by a question I recently came across on Stack Overflow. It reminded me that early in the development of the x86-64 ISA I had written 32-bit x86 code for BCD addition witho...
Mueller asked 29/3 at 21:42
2
Solved
Timing results in Python 3.12 (and similar with 3.11 and 3.13 on different machines):
When x = None:
13.8 ns x is None
10.1 ns if x is None: pass
When x = True:
13.9 ns x is None
11.1 ns if x is N...
Rawdin asked 26/3 at 14:12
2
Solved
I need to perform a bit scan reverse with LZCNT on an array of 16-bit words.
The throughput of LZCNT is 1 execution per clock on the latest generation of Intel processors. The throughput on an AMD Ryzen seems to...
Milan asked 15/5, 2019 at 15:43
2
I have a constant (64-bit) address that I want to load into a register. This address is located in the code segment, so it could be addressed relative to RIP. What are the differences between
movabs...
Disinterest asked 5/1 at 15:0
11
Solved
Is it possible to divide an unsigned integer by 10 using only bit shifts, addition, subtraction and maybe multiplication? The target is a processor with very limited resources and a slow divide.
Thespian asked 5/4, 2011 at 21:4
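A common way to approach this kind of question is to replace the division with a multiplication by a fixed-point reciprocal, or with a pure shift-and-add sequence. The sketch below assumes 32-bit unsigned inputs. The function names div10_mul and div10_shift are made up for illustration, and the shift-and-add variant follows the well-known approach from Hacker's Delight rather than anything stated in the question.

```c
#include <stdint.h>

/* Multiply by ceil(2^35 / 10) = 0xCCCCCCCD, then shift right by 35.
   Exact for every 32-bit unsigned n, but needs a 32x32 -> 64-bit multiply. */
uint32_t div10_mul(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* Shifts, adds and subtracts only: approximate n * 0.8 with a short series,
   divide by 8, then add 1 when the remainder shows the estimate is too low. */
uint32_t div10_shift(uint32_t n) {
    uint32_t q = (n >> 1) + (n >> 2);        /* q ~= n * 0.75 */
    q += q >> 4;
    q += q >> 8;
    q += q >> 16;                            /* q ~= n * 0.8 */
    q >>= 3;                                 /* q ~= n / 10, possibly 1 too small */
    uint32_t r = n - (((q << 2) + q) << 1);  /* r = n - q * 10 */
    return q + (r > 9);
}
```

Which variant is faster depends on the target: the first needs a widening multiply, while the second costs more instructions but uses nothing beyond shifts and adds.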
8
Solved
Is there any (non-microoptimization) performance gain by coding
float f1 = 200f / 2
in comparison to
float f2 = 200f * 0.5
A professor of mine told me a few years ago that floating point div...
Desrochers asked 8/11, 2010 at 15:4
2
Solved
I am experimenting with the way parameters are passed to a function when compiling C++ code. I tried to compile the following C++ code using the x64 msvc 19.35/latest compiler to see the resulting ...
Floria asked 29/7, 2023 at 23:33
1
Solved
I want to replace the lowest byte in an integer. On x86 this is exactly mov al, [mem] but I can't seem to get compilers to output this. Am I missing an obvious code pattern that is recognized, am I...
Patten asked 22/6, 2023 at 10:38
4
Solved
For min(ctz(x), ctz(y)), we can use ctz(x | y) to gain better performance. But what about max(ctz(x), ctz(y))?
ctz represents "count trailing zeros".
C++ version (Compiler Explorer)
#incl...
Punt asked 1/6, 2023 at 11:5
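The min identity mentioned in the question can be checked with a small sketch. It assumes both inputs are nonzero 32-bit values; ctz32, min_ctz_naive and min_ctz_or are hypothetical names, and the portable loop stands in for whatever count-trailing-zeros primitive the real code uses.

```c
#include <stdint.h>

/* Portable count-trailing-zeros for a nonzero 32-bit value.
   (GCC and Clang offer __builtin_ctz; this loop avoids depending on it.) */
static unsigned ctz32(uint32_t v) {
    unsigned n = 0;
    while ((v & 1u) == 0) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Baseline: two ctz operations plus a comparison. */
unsigned min_ctz_naive(uint32_t x, uint32_t y) {
    unsigned a = ctz32(x), b = ctz32(y);
    return a < b ? a : b;
}

/* The lowest set bit of (x | y) is the lower of the two lowest set bits,
   so a single ctz is enough. */
unsigned min_ctz_or(uint32_t x, uint32_t y) {
    return ctz32(x | y);
}
```

The sketch covers only the min case; it does not attempt the max case that the question asks about.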
4
Solved
I stumbled across this Reddit post which is a joke on the following code snippet,
void f(int& x) {
    if (x != 1) {
        x = 1;
    }
}
void g(int& x) {
    x = 1;
}
saying that the two functions are ...
Marrowbone asked 16/5, 2023 at 9:40
1
Take this simple example:
struct has_destruct_t {
    int a;
    ~has_destruct_t() {}
};
struct no_destruct_t {
    int a;
};
int bar_no_destruct(no_destruct_t);
int foo_no_destruct(void) {
    no_destruct_...
Arroba asked 15/5, 2023 at 18:35
1
Solved
Changing add to adc in the highlighted line below significantly improves performance. I find this very counter-intuitive, since add can execute on more ports and does not depend on flags.
CPU: Int...
Vickyvico asked 24/1, 2021 at 20:20
4
Solved
Suppose you need to create a 'top' of some sort and have code like this:
$matches=array();
foreach ($array as $v){
    $matches[processing($v)]++;
}
This will output a Notice: Undefined index for ...
Chide asked 10/1, 2013 at 15:5
8
Solved
We find various tricks to replace std::sqrt (Timing Square Root) and some for std::exp (Using Faster Exponential Approximation), but I find nothing to replace std::log.
It's part of loops in my pr...
Primo asked 2/10, 2016 at 20:27
1
In the context of exploratory activity I have started to take a look at integer & fixed-point arithmetic building blocks for 32-bit platforms. My primary target would be ARM32 (specifically arm...
Monophthong asked 7/12, 2022 at 8:22
2
I was looking at a recursive function from here:
int get_steps_to_zero(int n)
{
    if (n == 0) {
        // Base case: we have reached zero
        return 0;
    } else if (n % 2 == 0) {
        // Recursive case 1...
Mclin asked 6/12, 2022 at 10:24
1
Solved
It's well known that we can use the CMOV instruction to write branchless code, but I was wondering if I'm writing the equivalent of x = cond ? 1 : 2, should I prefer
CMOVE rax, 1 #1a
CMOVNE rax, 2 ...
Thylacine asked 23/11, 2022 at 20:41
3
Solved
What's the most efficient way to set and also to clear the zero flag (ZF) in x86-64?
Methods that work without the need for a register with a known value, or without any free registers at all are ...
Monotone asked 3/2, 2019 at 2:4
6
Solved
For example,
let's say we have a variable x,
which could be anything, including 0.
Then we have code like:
if(x==0) {
    y = 1;
}
else {
    y = x;
}
Could I do this without producing branches in C/C++?
I'm trying to ...
Maggs asked 27/10, 2022 at 5:51
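One way to express the mapping from the if/else above without a branch in the source is to use the fact that a comparison in C evaluates to 0 or 1. This is a minimal sketch assuming x is an int; zero_to_one is a hypothetical name.

```c
/* Returns 1 when x is 0, otherwise x.
   (x == 0) evaluates to 1 or 0 in C, so the sum needs no if/else. */
int zero_to_one(int x) {
    return x + (x == 0);
}
```

Whether the generated machine code is actually branch-free depends on the compiler and target. Optimizing compilers often turn the original if/else into a conditional move or set instruction on their own, so it is worth inspecting the assembly before rewriting.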
1
Solved
For a long time I have been trying to understand how to systematically make spaCy run as fast as possible, and I would like this post to become a wiki-style public post if possible.
Here is what I c...
Tanta asked 24/10, 2022 at 13:23
4
The problem for signed 2's complement 32-bit integers:
satMul2 - multiplies by 2, saturating to Tmin or Tmax if overflow.
Examples: satMul2(0x30000000) = 0x60000000
satMul2(0x40000000) ...
Sandoval asked 8/10, 2022 at 12:22
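The stated behaviour (double the value, clamp to Tmin or Tmax on overflow) can be written as a plain reference version against which an optimized one could be compared. The sketch assumes Tmin and Tmax mean INT32_MIN and INT32_MAX. The name sat_mul2 is hypothetical, and any operator restrictions in the elided part of the question are not modeled here.

```c
#include <stdint.h>

/* Reference version: check the range before doubling, so the
   multiplication itself can never overflow. */
int32_t sat_mul2(int32_t x) {
    if (x > INT32_MAX / 2) {
        return INT32_MAX;   /* positive overflow saturates to Tmax */
    }
    if (x < INT32_MIN / 2) {
        return INT32_MIN;   /* negative overflow saturates to Tmin */
    }
    return x * 2;
}
```

For instance, sat_mul2(0x30000000) gives 0x60000000 as in the question's first example, while 0x40000000 is just past the positive limit and clamps to INT32_MAX.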
© 2022 - 2024 — McMap. All rights reserved.