neon - McMap

4

How to check the existence of NEON on arm?

How can I determine whether the NEON engine exists on a given ARM processor? Is there a status or flag register that can be queried for this purpose?

arm neon

Rimrock asked 2/11, 2014 at 15:58
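
As a rough illustration of the question above, here is a minimal sketch of two common checks: a compile-time test of the `__ARM_NEON` macro, and a runtime query of the Linux auxiliary vector. It assumes a Linux target with glibc-style `getauxval`; the function names are hypothetical, and on other platforms the sketch simply falls back to the compile-time answer.

```c
#include <stdbool.h>

#if defined(__linux__) && (defined(__arm__) || defined(__aarch64__))
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_NEON (32-bit ARM), HWCAP_ASIMD (AArch64) */
#endif

/* True if the compiler was allowed to emit NEON instructions for this build. */
bool neon_enabled_at_compile_time(void) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return true;
#else
    return false;
#endif
}

/* True if the running CPU reports NEON / Advanced SIMD support. */
bool neon_available_at_runtime(void) {
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
#elif defined(__linux__) && defined(__arm__)
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
    /* Assumption: no runtime query is known here, so trust the build flags. */
    return neon_enabled_at_compile_time();
#endif
}
```

On Linux the same information is also visible in the "Features" line of /proc/cpuinfo ("neon" on 32-bit ARM, "asimd" on AArch64).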

4

fast bit-matrix (64x64) transpose algorithm using SIMD (ARM)

I am trying to understand whether there is a fast way to do a matrix transpose (64x64 bits) using ARM SIMD instructions. I tried to explore the VTRN instruction of ARM SIMD but am not sure of its effec...

assembly arm transpose simd neon

Soggy asked 21/3, 2022 at 4:19
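
For reference, a scalar baseline for the 64x64 bit-matrix transpose can be sketched with the classic recursive block-swap technique (swap the off-diagonal 32x32 blocks, then 16x16, and so on down to single bits). This is not a NEON implementation and not taken from the question; it assumes row i is stored in `a[i]` and column c is bit c (least significant bit first), and the function name is hypothetical.

```c
#include <stdint.h>

/* In-place transpose of a 64x64 bit matrix.
   Assumed layout: a[i] is row i, bit c of a[i] is column c. */
void transpose64(uint64_t a[64]) {
    uint64_t m = 0x00000000FFFFFFFFULL;  /* selects the low j bits of each 2j-bit group */
    for (int j = 32; j != 0; j >>= 1, m ^= m << j) {
        /* k visits every row index whose bit j is clear; its partner is row k + j */
        for (int k = 0; k < 64; k = (k + j + 1) & ~j) {
            /* swap the high-column block of row k with the low-column block of row k + j */
            uint64_t t = ((a[k] >> j) ^ a[k + j]) & m;
            a[k]     ^= t << j;
            a[k + j] ^= t;
        }
    }
}
```

The six passes each touch all 64 rows once, which makes this a useful correctness reference when experimenting with VTRN-based versions.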

2

Solved

sse/avx equivalent for neon vuzp

Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, it does this: i...

sse simd neon avx

Womenfolk asked 28/7, 2017 at 14:36

1

Does vfmaq_f32 really have higher running accuracy?

Does vfmaq_f32 really have higher running accuracy? I guess the accuracy of vfmaq_f32 varies depending on the length of the bit extension of the floating point processing unit in different architec...

c++ arm neon

Galway asked 14/9, 2023 at 7:50

2

Solved

ARM NEON vectorization failure

I would like to enable NEON vectorization on my ARM Cortex-A9, but I get this output at compile time: "not vectorized: relevant stmt not supported: D.14140_82 = D.14143_77 * D.14141_81" Here is my loo...

compiler-construction arm vectorization neon

Cavein asked 5/3, 2013 at 13:50

3

bitpack ascii string into 7-bit binary blob using ARM-v8 Neon SIMD

Following my x86 question, I would like to know how to efficiently vectorize the following code on Armv8: static inline uint64_t Compress8x7bit(uint64_t x) { x = ((x & 0x7F00...

simd arm64 intrinsics neon

Shake asked 19/12, 2022 at 5:14

6

Solved

SSE _mm_movemask_epi8 equivalent method for ARM NEON

I decided to continue the Fast corners optimisation and got stuck at the _mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with uint8x16_t input?

arm sse neon

Bluster asked 8/8, 2012 at 18:33
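
One way the operation asked about above is often emulated can be sketched as follows: turn each byte's sign bit into a full-byte mask, AND it with per-lane bit weights, and sum each 8-byte half horizontally. This is a sketch under assumptions, not a definitive answer: the NEON path uses `vaddv_u8`, which exists only on AArch64; the function takes a pointer rather than a `uint8x16_t` so that the scalar fallback compiles anywhere; and the name `movemask_u8x16` is hypothetical.

```c
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Collects the top bit of each of 16 bytes into a 16-bit mask (bit i comes from byte i),
   mirroring what _mm_movemask_epi8 returns on SSE2. */
uint16_t movemask_u8x16(const uint8_t bytes[16]) {
#if defined(__aarch64__) && defined(__ARM_NEON)
    static const uint8_t weights[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                        1, 2, 4, 8, 16, 32, 64, 128};
    uint8x16_t v = vld1q_u8(bytes);
    /* arithmetic shift right by 7: each byte becomes 0x00 or 0xFF by its sign bit */
    uint8x16_t sign = vreinterpretq_u8_s8(vshrq_n_s8(vreinterpretq_s8_u8(v), 7));
    uint8x16_t bits = vandq_u8(sign, vld1q_u8(weights));
    /* each half sums distinct powers of two, so the total fits in 8 bits */
    uint16_t lo = vaddv_u8(vget_low_u8(bits));
    uint16_t hi = vaddv_u8(vget_high_u8(bits));
    return (uint16_t)(lo | (hi << 8));
#else
    /* portable reference path */
    uint16_t mask = 0;
    for (int i = 0; i < 16; ++i)
        mask |= (uint16_t)((bytes[i] >> 7) << i);
    return mask;
#endif
}
```

On ARMv7, where `vaddv_u8` is unavailable, the horizontal sums would need a chain of pairwise adds instead.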

2

A64 Neon SIMD - 256-bit comparison

I would like to compare two little-endian 256-bit values with A64 Neon instructions (asm) efficiently. Equality (=) For equality, I already got a solution: bool eq256(const UInt256 *lhs, const...

arm comparison simd neon arm64

Conspicuous asked 20/4, 2015 at 8:34

1

Solved

Handling elements that are odd number using neon intrinsics

I am new to NEON intrinsics. I have two arrays containing 99 elements each, which I am trying to add element-wise using NEON intrinsics. Since 99 is not a multiple of 8, 16, or 32, 96 elements can be hand...

c raspberry-pi simd neon armv8

Bolin asked 11/3, 2022 at 11:8
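
The usual pattern for a length that is not a multiple of the vector width is a vector main loop followed by a scalar tail. The sketch below assumes `int32_t` elements (the question does not state the element type) and uses a hypothetical function name; with 99 elements the vector loop handles 96 and the scalar loop the remaining 3. Without NEON the scalar loop handles everything.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* out[i] = a[i] + b[i] for i in [0, n); n need not be a multiple of 4. */
void add_arrays(const int32_t *a, const int32_t *b, int32_t *out, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON)
    /* main loop: four 32-bit lanes per iteration */
    for (; i + 4 <= n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);
        int32x4_t vb = vld1q_s32(b + i);
        vst1q_s32(out + i, vaddq_s32(va, vb));
    }
#endif
    /* scalar tail: at most 3 leftover elements on the NEON path */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```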

3

Solved

Loop takes more cycles to execute than expected in an ARM Cortex-A72 CPU

Consider the following code, running on an ARM Cortex-A72 processor (optimization guide here). I have included what I expect are resource pressures for each execution port: Instruction B I0 I1 ...

performance assembly optimization arm neon

Graphics asked 5/11, 2021 at 15:31

1

Solved

Why doesn’t Clang use vcnt for __builtin_popcountll on AArch32?

The simple test, unsigned f(unsigned long long x) { return __builtin_popcountll(x); } when compiled with clang --target=arm-none-linux-eabi -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a15 -Os,⁎ re...

arm clang bit-manipulation micro-optimization neon

Audiovisual asked 17/11, 2021 at 16:57

2

Solved

Why does gcc, with -O3, unnecessarily clear a local ARM NEON array?

Consider the following code (Compiler Explorer link), compiled under gcc and clang with -O3 optimization: #include <arm_neon.h> void bug(int8_t *out, const int8_t *in) { for (int i = 0; i &...

c gcc arm64 neon compiler-bug

Feinberg asked 7/10, 2021 at 22:30

2

Optimizing horizontal boolean reduction in ARM NEON

I'm experimenting with a cross-platform SIMD library ala ecmascript_simd aka SIMD.js, and part of this is providing a few "horizontal" SIMD operations. In particular, the API that library offers in...

arm simd neon

Konrad asked 3/7, 2015 at 1:40

1

ARM NEON: How to implement a 256bytes Look Up table

I am porting some code I wrote to NEON using inline assembly. One of the things I need is to convert byte values ranging [0..128] to other byte values in a table which take the full range [0..255]...

optimization assembly arm neon

Impart asked 3/3, 2014 at 21:48

2

Solved

What is the most efficient way to support CMGT with 64bit signed comparisons on ARMv7a with Neon?

This question was originally posed for SSE2 here. Since every single algorithm overlapped with ARMv7a+NEON's support for the same operations, the question was updated to include the ARMv7+NEON vers...

assembly arm simd webassembly neon

Yila asked 7/12, 2020 at 23:45

3

Solved

ARM Neon: Store n-th position(s) of non-zero byte(s) in a 8-byte vector lane

I want to convert a Neon 64-bit vector lane to get the n-th position(s) of non-zero (aka. 0xFF) 8-bit value(s), and then fill the rest of the vector with zeros. Here are some examples: 0 1 2 3 4 ...

assembly arm neon

Dispersant asked 15/9, 2016 at 8:34

5

Fastest way to test a 128 bit NEON register for a value of 0 using intrinsics?

I'm looking for the fastest way to test if a 128-bit NEON register contains all zeros, using NEON intrinsics. I'm currently using 3 OR operations, and 2 MOVs: uint32x4_t vr = vorrq_u32(vcmp0, vcmp1); ...

neon

Donniedonnish asked 13/3, 2013 at 15:29
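
Two commonly used shapes for this zero test can be sketched as below; neither is claimed to be the fastest on any particular core. On AArch64 a single horizontal maximum (`vmaxvq_u32`) suffices; on ARMv7 the two 64-bit halves are ORed and one lane is moved to a general register. The function name is hypothetical, and it takes a pointer so the scalar fallback compiles on non-ARM hosts.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Returns true if all four 32-bit lanes are zero. */
bool is_all_zero_u32x4(const uint32_t lanes[4]) {
#if defined(__ARM_NEON) && defined(__aarch64__)
    /* the maximum of unsigned lanes is zero only if every lane is zero */
    return vmaxvq_u32(vld1q_u32(lanes)) == 0;
#elif defined(__ARM_NEON)
    /* ARMv7: OR the two 64-bit halves, then read the single remaining lane */
    uint64x2_t v = vreinterpretq_u64_u32(vld1q_u32(lanes));
    uint64x1_t folded = vorr_u64(vget_low_u64(v), vget_high_u64(v));
    return vget_lane_u64(folded, 0) == 0;
#else
    return (lanes[0] | lanes[1] | lanes[2] | lanes[3]) == 0;
#endif
}
```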

3

Solved

Performance of unaligned SIMD load/store on aarch64

An older answer indicates that aarch64 supports unaligned reads/writes and mentions a performance cost, but it's unclear if the answer covers only the ALU or SIMD (128-bit register) operat...

alignment simd neon arm64

Dianoia asked 16/8, 2017 at 13:11

4

Solved

ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?

I recently discovered the vreinterpret{q}_dsttype_srctype casting operator. However this doesn't seem to support conversion in the data type described at this link (bottom of the page): So...

c++ c arm vectorization neon

Unaccomplished asked 20/4, 2017 at 13:38

4

Solved

Arm Neon Intrinsics vs hand assembly

https://web.archive.org/web/20170227190422/http://hilbert-space.de/?p=22 This site, which is quite dated, shows that hand-written asm would give a much greater improvement than the intrinsics....

arm neon intrinsics

Betseybetsy asked 22/3, 2012 at 18:48

7

Coding for ARM NEON: How to start?

I'm looking to optimize C++ code (mainly some for loops) using the NEON capability of computing 4 or 8 array elements at a time. Is there some kind of library or set of functions that can be used i...

c++ arm neon

Bibliotherapy asked 16/2, 2015 at 18:9

3

Is numpy optimized for raspberry-pi automatically

The Raspberry Pi (armv7l architecture) has neon vfpv4 support which can be used for optimization. Does the standard version of numpy include these optimizations when installing the command pip3...

numpy optimization raspberry-pi arm neon

Subalternate asked 4/9, 2018 at 7:32

2

Solved

How portable are the new ARM SVE instructions?

I am looking for information about the new Scalable Vector Extension (SVE) from Arm. It looks amazingly good to me for doing image processing, being able to compute 2048 bits in parallel and so on. ...

arm neon arm64 sve

Hobbes asked 21/12, 2016 at 13:4

2

Solved

gcc; arm64; aarch64; unrecognized command line option '-mfpu=neon'

I got a compilation error, unrecognized command line option '-mfpu=neon', when I tried to compile with the -mfpu=neon flag. In fact, every 'mfpu' option I tried failed. However, in the documentation this ...

gcc arm neon arm64 linaro

Rockyrococo asked 24/4, 2015 at 15:11

3

Add all elements in a lane

Is there an intrinsic which allows one to add all of the elements in a lane? I am using Neon to multiply 8 numbers together, and I need to sum the result. Here is some paraphrased code to show what...

c arm simd neon

Seasoning asked 29/8, 2012 at 4:55
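
A horizontal sum across the lanes of one vector can be sketched as follows. The example assumes `float` data and 8-element inputs (the question's code is not shown, so both are assumptions), and the name `dot8` is hypothetical. AArch64 offers the across-lanes add `vaddvq_f32`; on ARMv7 the same result is built from a half-vector add and one pairwise add.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Multiplies 8 pairs of floats and returns the sum of the 8 products. */
float dot8(const float *a, const float *b) {
#if defined(__ARM_NEON)
    /* four products, then four more accumulated on top */
    float32x4_t p = vmulq_f32(vld1q_f32(a), vld1q_f32(b));
    p = vmlaq_f32(p, vld1q_f32(a + 4), vld1q_f32(b + 4));
#if defined(__aarch64__)
    return vaddvq_f32(p);                 /* add across all four lanes */
#else
    float32x2_t s = vadd_f32(vget_low_f32(p), vget_high_f32(p));
    s = vpadd_f32(s, s);                  /* pairwise add folds the last two lanes */
    return vget_lane_f32(s, 0);
#endif
#else
    float sum = 0.0f;
    for (int i = 0; i < 8; ++i)
        sum += a[i] * b[i];
    return sum;
#endif
}
```

Note that the lane-wise summation order differs from the scalar loop, so results may differ in the last bits for general floating-point inputs.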
