micro-optimization

2

Solved

Is it "too clever" for using LEA to load constant to register?

I'm studying x86-64 NASM, and here is the current situation: this code is for education only, not for running on a client-facing system or the like. RCX holds the loop count, between 1 and 1000. At the beginnin...

assembly x86-64 nasm micro-optimization

Improbability asked 8/9 at 15:50

2

Solved

Why is my operator ++ more than twice as fast as its equivalent instance method?

I'm running BenchmarkDotNet against the following code on .NET 8: using System.Runtime.InteropServices; using BenchmarkDotNet.Attributes; using BenchmarkDotNet.Running; [StructLayout(LayoutKind.Ex...

c# .net-core micro-optimization benchmarkdotnet

Maricruzmaridel asked 29/7 at 14:18

3

Solved

Branchless count-leading-zeros on 32-bit RISC-V without Zbb extension

The context of this question is the creation of a side-channel resistant implementation of an IEEE-754 compliant single-precision square root for a 32-bit RISC-V platform without hardware support fo...

algorithm bit-manipulation riscv micro-optimization riscv32

Esophagus asked 7/6 at 1:20

3

Solved

Is there any data on the latency of an AVX2 gather instruction?

Is there any data on AVX2 gather latency? (for instance a _mm256_i32gather_ps instruction accessing a single cache line)

performance x86 latency micro-optimization avx2

Parrott asked 22/7, 2013 at 14:18

1

Solved

Fast BCD addition

This question was inspired by a question I recently came across on Stack Overflow. It reminded me that early in the development of the x86-64 ISA I had written 32-bit x86 code for BCD addition witho...

c algorithm bit-manipulation micro-optimization bcd

Mueller asked 29/3 at 21:42

2

Solved

Why is `if x is None: pass` faster than `x is None` alone?

Timing results in Python 3.12 (and similar with 3.11 and 3.13 on different machines): When x = None: 13.8 ns x is None 10.1 ns if x is None: pass When x = True: 13.9 ns x is None 11.1 ns if x is N...

python performance cpython micro-optimization python-internals

Rawdin asked 26/3 at 14:12

2

Solved

Is using AVX2 can implement a faster processing of LZCNT on a word array?

I need to bit-scan-reverse, with LZCNT, an array of 16-bit words. The throughput of LZCNT is 1 execution per clock on latest-generation Intel processors. The throughput on an AMD Ryzen seems to...

x86 simd avx micro-optimization avx2

Milan asked 15/5, 2019 at 15:43

2

LEA vs MOV imm64 for loading address-constant into register

I have a constant (64-bit) address that I want to load into a register. This address is located in the code segment, so it could be addressed relative to RIP. What are the differences between movabs...

assembly x86-64 micro-optimization

Disinterest asked 5/1 at 15:0

11

Solved

Divide by 10 using bit shifts?

Is it possible to divide an unsigned integer by 10 by using pure bit shifts, addition, subtraction and maybe multiply? Using a processor with very limited resources and slow divide.

math bit micro-optimization low-level integer-division

Thespian asked 5/4, 2011 at 21:4
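
One common way to approach the question above, sketched in C under stated assumptions: for a 32-bit unsigned input, multiplying by the constant 0xCCCCCCCD and shifting right by 35 yields the quotient by 10, provided a 32x32->64-bit multiply is available. Where it is not, a shift-and-add sequence with a final correction step (the form given in Hacker's Delight) does the same job. The function names are hypothetical, and neither variant is taken from the answers to this question.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply-and-shift: 0xCCCCCCCD is ceil(2^35 / 10). Needs a 64-bit product. */
uint32_t div10_mul(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDull) >> 35);
}

/* Shift-and-add only: build an estimate of n * 0.8, divide it by 8, then
   add one if the remainder shows the estimate was one too small. */
uint32_t div10_shift(uint32_t n)
{
    uint32_t q = (n >> 1) + (n >> 2);        /* roughly n * 0.75 */
    q += q >> 4;
    q += q >> 8;
    q += q >> 16;                            /* roughly n * 0.8, never above */
    q >>= 3;                                 /* n / 10, possibly one too small */
    uint32_t r = n - (((q << 2) + q) << 1);  /* r = n - q * 10 */
    return q + (r > 9);                      /* correction step */
}

/* Hypothetical self-check comparing both variants against the / operator. */
void check_div10(void)
{
    for (uint32_t n = 0; n < 100000u; n++) {
        assert(div10_mul(n) == n / 10u);
        assert(div10_shift(n) == n / 10u);
    }
    assert(div10_mul(UINT32_MAX) == UINT32_MAX / 10u);
    assert(div10_shift(UINT32_MAX) == UINT32_MAX / 10u);
}
```

Which variant is faster depends on the target: the multiply form is usually preferable when a widening multiply is cheap, while the shift-and-add form suits cores with no multiplier at all.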

8

Solved

Floating point division vs floating point multiplication

Is there any (non-microoptimization) performance gain by coding float f1 = 200f / 2 in comparison to float f2 = 200f * 0.5 A professor of mine told me a few years ago that floating point div...

c++ floating-point micro-optimization

Desrochers asked 8/11, 2010 at 15:4

2

Solved

Advantage of using LEA over MOV for passing parameters in Assembly compiled from C++

I am experimenting with the way parameters are passed to a function when compiling C++ code. I tried to compile the following C++ code using the x64 msvc 19.35/latest compiler to see the resulting ...

c++ assembly visual-c++ x86-64 micro-optimization

Floria asked 29/7, 2023 at 23:33

1

Solved

Missing optimization: mov al, [mem] to bitfield-insert a new low byte into an integer

I want to replace the lowest byte in an integer. On x86 this is exactly mov al, [mem] but I can't seem to get compilers to output this. Am I missing an obvious code pattern that is recognized, am I...

c assembly x86-64 micro-optimization

Patten asked 22/6, 2023 at 10:38

4

Solved

Is there a faster algorithm for max(ctz(x), ctz(y))?

For min(ctz(x), ctz(y)), we can use ctz(x | y) to gain better performance. But what about max(ctz(x), ctz(y))? ctz represents "count trailing zeros". C++ version (Compiler Explorer) #incl...

c++ algorithm rust bit-manipulation micro-optimization

Punt asked 1/6, 2023 at 11:5
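
The min identity quoted in the excerpt above can be checked with a small C sketch. It assumes 32-bit inputs that are both nonzero and uses a portable loop-based ctz chosen for clarity, not speed; the function names are hypothetical. It only illustrates the min case and does not attempt the max case the question asks about.

```c
#include <assert.h>
#include <stdint.h>

/* Count trailing zeros of a nonzero 32-bit value (simple loop version). */
unsigned ctz32(uint32_t v)
{
    unsigned n = 0;
    while ((v & 1u) == 0u) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Reference form: two ctz calls and a comparison. */
unsigned min_ctz_direct(uint32_t x, uint32_t y)
{
    unsigned a = ctz32(x);
    unsigned b = ctz32(y);
    return a < b ? a : b;
}

/* Single-ctz form: the lowest set bit of (x | y) is the lower of the two
   lowest set bits, so one ctz call is enough. */
unsigned min_ctz_or(uint32_t x, uint32_t y)
{
    return ctz32(x | y);
}

/* Hypothetical self-check over a small range of nonzero inputs. */
void check_min_ctz(void)
{
    for (uint32_t x = 1; x < 512u; x++)
        for (uint32_t y = 1; y < 512u; y++)
            assert(min_ctz_direct(x, y) == min_ctz_or(x, y));
}
```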

4

Solved

Why do none of the major compilers optimize this conditional store that checks if the value is already set?

I stumbled across this Reddit post which is a joke on the following code snippet, void f(int& x) { if (x != 1) { x = 1; } } void g(int& x) { x = 1; } saying that the two functions are ...

c++ compiler-optimization micro-optimization

Marrowbone asked 16/5, 2023 at 9:40

1

Why does the act of introducing a destructor result in worse codegen? (Passed by reference instead of by value in a register)

Take this simple example: struct has_destruct_t { int a; ~has_destruct_t() {} }; struct no_destruct_t { int a; }; int bar_no_destruct(no_destruct_t); int foo_no_destruct(void) { no_destruct_...

c++ compiler-optimization calling-convention micro-optimization abi

Arroba asked 15/5, 2023 at 18:35

1

Solved

ADD slower than ADC in the first step of a bigint multiply on Coffee Lake (Skylake)

Changing add to adc in the highlighted line below significantly improves performance. I find it very counter-intuitive, since add has more ports to execute and it does not depend on flags. CPU: Int...

performance assembly x86 cpu-architecture micro-optimization

Vickyvico asked 24/1, 2021 at 20:20

4

Solved

How to properly increment some array key, even if key needs to be created?

Suppose you need to create a 'top' of some sort and have code like this: $matches=array(); foreach ($array as $v){ $matches[processing($v)]++; } This will output a Notice: Undefined index for ...

php optimization micro-optimization

Chide asked 10/1, 2013 at 15:5

8

Solved

Very fast approximate Logarithm (natural log) function in C++?

We find various tricks to replace std::sqrt (Timing Square Root) and some for std::exp (Using Faster Exponential Approximation) , but I find nothing to replace std::log. It's part of loops in my pr...

c++ math logarithm micro-optimization sqrt

Primo asked 2/10, 2016 at 20:27

1

Unsigned 64x64->128 bit integer multiply on 32-bit platforms

In the context of exploratory activity I have started to take a look at integer & fixed-point arithmetic building blocks for 32-bit platforms. My primary target would be ARM32 (specifically arm...

c arm micro-optimization bigint riscv32

Monophthong asked 7/12, 2022 at 8:22

2

Why would gcc -O3 generate multiple ret instructions? [duplicate]

I was looking at some recursive function from here: int get_steps_to_zero(int n) { if (n == 0) { // Base case: we have reached zero return 0; } else if (n % 2 == 0) { // Recursive case 1...

c assembly gcc x86-64 micro-optimization

Mclin asked 6/12, 2022 at 10:24

1

Solved

In assembly, should branchless code use complementary CMOVs?

It's well known that we can use the CMOV instruction to write branchless code, but I was wondering if I'm writing the equivalent of x = cond ? 1 : 2, should I prefer CMOVE rax, 1 #1a CMOVNE rax, 2 ...

assembly x86 micro-optimization branchless conditional-move

Thylacine asked 23/11, 2022 at 20:41

3

Solved

Setting and clearing the zero flag in x86

What's the most efficient way to set and also to clear the zero flag (ZF) in x86-64? Methods that work without the need for a register with a known value, or without any free registers at all are ...

performance assembly x86 x86-64 micro-optimization

Monotone asked 3/2, 2019 at 2:4

6

Solved

How can I guarantee that a variable will never be zero without using a conditional statement in C?

For example, let's say there is a variable x, and x could be anything, including 0. Then we have code like: if(x==0) { y = 1; } else { y = x; } Could I do this without producing branches in C/C++? I'm trying to ...

c compiler-optimization micro-optimization

Maggs asked 27/10, 2022 at 5:51
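
A minimal sketch of one way to write the mapping from the excerpt above without an if/else in the source, assuming an unsigned x. The function name is hypothetical. Whether the generated machine code is actually branch-free depends on the compiler and target, so the assembly output is worth inspecting before relying on it.

```c
#include <assert.h>

/* Returns x when x is nonzero and 1 when x is zero.
   In C, (x == 0) evaluates to the int 1 or 0, so adding it to x
   changes the result only in the zero case. */
unsigned nonzero_or_one(unsigned x)
{
    return x + (x == 0);
}

/* Hypothetical self-check against the if/else form from the excerpt. */
void check_nonzero_or_one(void)
{
    for (unsigned x = 0; x < 1000u; x++) {
        unsigned y;
        if (x == 0) { y = 1; } else { y = x; }
        assert(nonzero_or_one(x) == y);
    }
}
```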

1

Solved

A checklist for Spacy optimization?

I have been trying to understand how to systematically make Spacy run as fast as possible for a long time and I would like this post to become a wiki-style public post if possible. Here is what I c...

optimization nlp spacy micro-optimization

Tanta asked 24/10, 2022 at 13:23

4

Multiply by 2 with signed saturation in 6 operations in C?

The problem for signed 2's complement 32-bit integers: satMul2 - multiplies by 2, saturating to Tmin or Tmax if overflow. Examples: satMul2(0x30000000) = 0x60000000 satMul2(0x40000000) ...

c bit-manipulation bitwise-operators micro-optimization saturation-arithmetic

Sandoval asked 8/10, 2022 at 12:22

