micro-optimization Questions

2

Solved

I'm studying x86-64 NASM, and here is the current situation: This code is for education only, not for running on a client-facing system or the like. RCX holds the loop count, between 1 and 1000. At the beginnin...
Improbability asked 8/9 at 15:50

2

Solved

I'm running BenchmarkDotNet against the following code on .NET 8: using System.Runtime.InteropServices; using BenchmarkDotNet.Attributes; using BenchmarkDotNet.Running; [StructLayout(LayoutKind.Ex...
Maricruzmaridel asked 29/7 at 14:18

3

Solved

The context of this question is the creation of a side-channel resistant implementation of an IEEE-754 compliant single-precision square root for a 32-bit RISC-V platform without hardware support fo...

3

Solved

Is there any data on AVX2 gather latency? (for instance a _mm256_i32gather_ps instruction accessing a single cache line)
Parrott asked 22/7, 2013 at 14:18

1

Solved

This question was inspired by a question I recently came across on Stack Overflow. It reminded me that early in the development of the x86-64 ISA I had written 32-bit x86 code for BCD addition witho...

2

Solved

Timing results in Python 3.12 (and similar with 3.11 and 3.13 on different machines): When x = None: 13.8 ns x is None 10.1 ns if x is None: pass When x = True: 13.9 ns x is None 11.1 ns if x is N...

2

Solved

I need to bit-scan-reverse an array of 16-bit words with LZCNT. The throughput of LZCNT is 1 execution per clock on the latest generation of Intel processors. The throughput on an AMD Ryzen seems to...
Milan asked 15/5, 2019 at 15:43

2

I have a constant (64-bit) address that I want to load into a register. This address is located in the code segment, so it could be addressed relative to RIP. What are the differences between movabs...
Disinterest asked 5/1 at 15:0

11

Solved

Is it possible to divide an unsigned integer by 10 by using pure bit shifts, addition, subtraction and maybe multiply? Using a processor with very limited resources and slow divide.
Thespian asked 5/4, 2011 at 21:4
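
A minimal sketch of what such a division can look like, assuming 32-bit unsigned inputs. The function names are hypothetical. The first variant uses only shifts, additions, and subtractions, adapted from the routine widely attributed to Hacker's Delight; the second assumes the target has a 32x32 -> 64-bit multiply, which may not hold on a very constrained processor.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add only: build an estimate of n * 0.8, then shift right by 3. */
uint32_t divu10_shift(uint32_t n) {
    uint32_t q = (n >> 1) + (n >> 2);        /* q ~= n * 0.75 */
    q += q >> 4;
    q += q >> 8;
    q += q >> 16;                            /* q ~= n * 0.8, slightly low */
    q >>= 3;                                 /* q ~= n / 10, may be one too small */
    uint32_t r = n - (((q << 2) + q) << 1);  /* r = n - q * 10 */
    return q + (r > 9);                      /* correct the estimate */
}

/* Multiply-and-shift: assumes a 64-bit product is available. */
uint32_t divu10_mul(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* Small self-check against the built-in division (hypothetical helper). */
void divu10_self_check(void) {
    for (uint32_t n = 0; n < 100000u; ++n) {
        assert(divu10_shift(n) == n / 10);
        assert(divu10_mul(n) == n / 10);
    }
}
```

Which variant is faster depends on the target: the multiply form is a single operation where a fast multiplier exists, while the shift form trades it for roughly a dozen simple operations.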

8

Solved

Is there any (non-microoptimization) performance gain by coding float f1 = 200f / 2 in comparison to float f2 = 200f * 0.5? A professor of mine told me a few years ago that floating point div...
Desrochers asked 8/11, 2010 at 15:4

2

Solved

I am experimenting with the way parameters are passed to a function when compiling C++ code. I tried to compile the following C++ code using the x64 msvc 19.35/latest compiler to see the resulting ...
Floria asked 29/7, 2023 at 23:33

1

Solved

I want to replace the lowest byte in an integer. On x86 this is exactly mov al, [mem] but I can't seem to get compilers to output this. Am I missing an obvious code pattern that is recognized, am I...
Patten asked 22/6, 2023 at 10:38

4

Solved

For min(ctz(x), ctz(y)), we can use ctz(x | y) to gain better performance. But what about max(ctz(x), ctz(y))? ctz represents "count trailing zeros". C++ version (Compiler Explorer) #incl...
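
The min identity mentioned in the excerpt can be sketched as follows. This is an illustration only: `ctz32`, `min_ctz`, and `max_ctz_baseline` are hypothetical names, the portable loop stands in for a hardware instruction or compiler builtin, and returning 32 for a zero input is an assumed convention. The max version shown is just the straightforward baseline the question wants to improve on, not a proposed answer.

```c
#include <assert.h>
#include <stdint.h>

/* Portable count-trailing-zeros; returns 32 for x == 0 (assumed convention). */
unsigned ctz32(uint32_t x) {
    unsigned n = 0;
    if (x == 0) return 32;
    while ((x & 1u) == 0) { x >>= 1; ++n; }
    return n;
}

/* The lowest set bit of x | y is the lower of the two lowest set bits. */
unsigned min_ctz(uint32_t x, uint32_t y) {
    return ctz32(x | y);
}

/* Baseline for max: two counts plus a comparison. */
unsigned max_ctz_baseline(uint32_t x, uint32_t y) {
    unsigned a = ctz32(x), b = ctz32(y);
    return a > b ? a : b;
}

/* Check the min identity exhaustively over small inputs (hypothetical helper). */
void ctz_self_check(void) {
    for (uint32_t x = 0; x < 256; ++x) {
        for (uint32_t y = 0; y < 256; ++y) {
            unsigned a = ctz32(x), b = ctz32(y);
            assert(min_ctz(x, y) == (a < b ? a : b));
        }
    }
}
```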

4

Solved

I stumbled across this Reddit post which is a joke on the following code snippet, void f(int& x) { if (x != 1) { x = 1; } } void g(int& x) { x = 1; } saying that the two functions are ...
Marrowbone asked 16/5, 2023 at 9:40

1

Take this simple example: struct has_destruct_t { int a; ~has_destruct_t() {} }; struct no_destruct_t { int a; }; int bar_no_destruct(no_destruct_t); int foo_no_destruct(void) { no_destruct_...

1

Solved

Changing add to adc in the highlighted line below significantly improves performance. I find it very counter-intuitive, since add has more ports to execute and it does not depend on flags. CPU: Int...
Vickyvico asked 24/1, 2021 at 20:20

4

Solved

Suppose you need to create a 'top' of some sort and have code like this: $matches=array(); foreach ($array as $v){ $matches[processing($v)]++; } This will output a Notice: Undefined index for ...
Chide asked 10/1, 2013 at 15:5

8

Solved

We find various tricks to replace std::sqrt (Timing Square Root) and some for std::exp (Using Faster Exponential Approximation) , but I find nothing to replace std::log. It's part of loops in my pr...
Primo asked 2/10, 2016 at 20:27

1

In the context of exploratory activity I have started to take a look at integer & fixed-point arithmetic building blocks for 32-bit platforms. My primary target would be ARM32 (specifically arm...
Monophthong asked 7/12, 2022 at 8:22

2

I was looking at some recursive function from here: int get_steps_to_zero(int n) { if (n == 0) { // Base case: we have reached zero return 0; } else if (n % 2 == 0) { // Recursive case 1...
Mclin asked 6/12, 2022 at 10:24

1

Solved

It's well known that we can use the CMOV instruction to write branchless code, but I was wondering if I'm writing the equivalent of x = cond ? 1 : 2, should I prefer CMOVE rax, 1 #1a CMOVNE rax, 2 ...
Thylacine asked 23/11, 2022 at 20:41

3

Solved

What's the most efficient way to set and also to clear the zero flag (ZF) in x86-64? Methods that work without the need for a register with a known value, or without any free registers at all are ...
Monotone asked 3/2, 2019 at 2:4

6

Solved

For example, say we have a variable x that could be anything, including 0. Then we have code like: if(x==0) { y = 1; } else { y = x; } Could I do this without producing branches in C/C++? I'm trying to ...
Maggs asked 27/10, 2022 at 5:51
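
One way to write this without an explicit branch in the source is to use the fact that a comparison in C evaluates to 0 or 1. This is a sketch with hypothetical function names; whether the compiled code is actually branch-free depends on the compiler, the optimization level, and the target, so the generated assembly should be checked.

```c
#include <assert.h>

/* y = (x == 0) ? 1 : x, written as arithmetic: adds 1 only when x is 0. */
int one_if_zero_add(int x) {
    return x + (x == 0);
}

/* Same result with OR: for nonzero x the comparison contributes 0. */
int one_if_zero_or(int x) {
    return x | (x == 0);
}

/* Compare both forms with the original if/else (hypothetical helper). */
void one_if_zero_self_check(void) {
    for (int x = -1000; x <= 1000; ++x) {
        int y;
        if (x == 0) { y = 1; } else { y = x; }
        assert(one_if_zero_add(x) == y);
        assert(one_if_zero_or(x) == y);
    }
}
```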

1

Solved

For a long time I have been trying to understand how to systematically make spaCy run as fast as possible, and I would like this post to become a wiki-style public post if possible. Here is what I c...
Tanta asked 24/10, 2022 at 13:23

4

The problem for signed 2's complement 32-bit integers: satMul2 - multiplies by 2, saturating to Tmin or Tmax if overflow. Examples: satMul2(0x30000000) = 0x60000000 satMul2(0x40000000) ...
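
A minimal sketch of the stated behavior, assuming Tmin and Tmax mean INT32_MIN and INT32_MAX. The name `sat_mul2` and the approach are illustrative, not the asker's solution or the puzzle's allowed operator set. The doubling is done on unsigned values to avoid signed-overflow undefined behavior; overflow is detected by checking whether the sign bit changed.

```c
#include <assert.h>
#include <stdint.h>

int32_t sat_mul2(int32_t x) {
    uint32_t ux  = (uint32_t)x;
    uint32_t dbl = ux << 1;                   /* wrapped 2 * x */
    uint32_t ovf = 0u - ((ux ^ dbl) >> 31);   /* all ones if the sign bit flipped */
    uint32_t sat = (ux >> 31) + 0x7FFFFFFFu;  /* Tmax for x >= 0, Tmin for x < 0 */
    uint32_t res = (dbl & ~ovf) | (sat & ovf);
    /* Converting back relies on the usual two's complement wraparound,
       which is implementation-defined before C23 but near-universal. */
    return (int32_t)res;
}

/* Compare with a plain range-checked version (hypothetical helper). */
void sat_mul2_self_check(void) {
    const int32_t samples[] = { 0, 1, -1, 0x30000000, 0x3FFFFFFF, 0x40000000,
                                INT32_MAX, INT32_MIN, -0x40000000, -0x40000001 };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        int32_t x = samples[i];
        int32_t expect = x > INT32_MAX / 2 ? INT32_MAX
                       : x < INT32_MIN / 2 ? INT32_MIN
                       : x * 2;
        assert(sat_mul2(x) == expect);
    }
}
```

By the stated definition, an input that cannot be doubled without overflow, such as 0x40000000, saturates rather than wrapping to a negative value.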
