neon Questions

4

How can I determine whether a NEON engine exists on a given ARM processor? Is there a status/flag register that can be queried for this purpose?
Rimrock asked 2/11, 2014 at 15:58
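
One common way to approach the question above from user space on Linux is to ask the kernel through the auxiliary vector rather than read a CPU register directly. The sketch below assumes a Linux target with glibc or Bionic; the helper name `cpu_has_neon` is hypothetical, and on other systems it falls back to the compile-time `__ARM_NEON` macro.

```c
#include <stdbool.h>

#if defined(__linux__) && (defined(__arm__) || defined(__aarch64__))
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_NEON (32-bit), HWCAP_ASIMD (64-bit) */
#endif

/* Hypothetical helper: true if Advanced SIMD (NEON) is reported as usable. */
static bool cpu_has_neon(void)
{
#if defined(__linux__) && defined(__aarch64__)
    /* On AArch64 the capability bit is called ASIMD. */
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
#elif defined(__linux__) && defined(__arm__)
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* No runtime query in this sketch: trust the compile-time target. */
    return true;
#else
    return false;
#endif
}
```

A caller would typically check this once at startup and select between a NEON code path and a scalar one.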

4

I am trying to understand whether there is a fast way to do a matrix transpose (64x64 bits) using ARM SIMD instructions. I tried to explore the VTRN instruction of ARM SIMD but am not sure of its effec...
Soggy asked 21/3, 2022 at 4:19

2

Solved

Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, it does this: i...
Womenfolk asked 28/7, 2017 at 14:36
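
The NEON counterpart usually suggested for the SSE unpack pair is the zip family. A minimal sketch, assuming 4 x 32-bit lanes as in the excerpt: `vzipq_u32` returns both interleaved halves at once (AArch64 also offers `vzip1q_u32` and `vzip2q_u32` separately). The wrapper name `unpack_u32x4` is hypothetical, and a scalar fallback is included so the snippet also builds on non-ARM hosts.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Interleave two 4 x 32-bit vectors, like _mm_unpacklo_epi32 and
   _mm_unpackhi_epi32:
   lo = {a0, b0, a1, b1}, hi = {a2, b2, a3, b3}. */
static void unpack_u32x4(const uint32_t a[4], const uint32_t b[4],
                         uint32_t lo[4], uint32_t hi[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4x2_t z = vzipq_u32(vld1q_u32(a), vld1q_u32(b));
    vst1q_u32(lo, z.val[0]);  /* low halves interleaved  */
    vst1q_u32(hi, z.val[1]);  /* high halves interleaved */
#else
    /* Scalar fallback with the same result layout. */
    for (int i = 0; i < 2; ++i) {
        lo[2 * i]     = a[i];
        lo[2 * i + 1] = b[i];
        hi[2 * i]     = a[i + 2];
        hi[2 * i + 1] = b[i + 2];
    }
#endif
}
```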

1

Does vfmaq_f32 really have higher running accuracy? I guess the accuracy of vfmaq_f32 varies depending on the length of the bit extension of the floating point processing unit in different architec...
Galway asked 14/9, 2023 at 7:50

2

Solved

I would like to enable NEON vectorization on my ARM Cortex-A9, but I get this output at compile time: "not vectorized: relevant stmt not supported: D.14140_82 = D.14143_77 * D.14141_81" Here is my loo...
Cavein asked 5/3, 2013 at 13:50

3

Following my x86 question, I would like to know how to efficiently vectorize the following code on Arm-v8: static inline uint64_t Compress8x7bit(uint64_t x) { x = ((x & 0x7F00...
Shake asked 19/12, 2022 at 5:14

6

Solved

I decided to continue the Fast corners optimisation and got stuck at the _mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with a uint8x16_t input?
Bluster asked 8/8, 2012 at 18:33
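
NEON has no single-instruction equivalent of `_mm_movemask_epi8`, so one widely used emulation masks each byte's sign bit with a per-lane bit weight and then sums the lanes pairwise. The sketch below is one such variant, not necessarily the fastest; the names `movemask_u8` and `movemask_u8x16` are hypothetical, and the pointer-based wrapper plus scalar fallback exist only so the snippet builds on any host.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Collect the most significant bit of each of the 16 bytes into bits 0..15. */
static inline uint16_t movemask_u8(uint8x16_t v)
{
    static const uint8_t weights[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                        1, 2, 4, 8, 16, 32, 64, 128};
    /* 0xFF in lanes whose top bit is set (negative when viewed as int8). */
    uint8x16_t msb  = vcltq_s8(vreinterpretq_s8_u8(v), vdupq_n_s8(0));
    uint8x16_t bits = vandq_u8(msb, vld1q_u8(weights));
    /* Three pairwise adds fold each 8-byte half into a single byte. */
    uint8x8_t sum = vpadd_u8(vget_low_u8(bits), vget_high_u8(bits));
    sum = vpadd_u8(sum, sum);
    sum = vpadd_u8(sum, sum);
    return (uint16_t)(vget_lane_u8(sum, 0) | (vget_lane_u8(sum, 1) << 8));
}
#endif

/* Same operation on 16 bytes in memory; scalar fallback off ARM. */
static uint16_t movemask_u8x16(const uint8_t p[16])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return movemask_u8(vld1q_u8(p));
#else
    uint16_t mask = 0;
    for (int i = 0; i < 16; ++i)
        mask |= (uint16_t)((p[i] >> 7) << i);
    return mask;
#endif
}
```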

2

I would like to compare two little-endian 256-bit values with A64 Neon instructions (asm) efficiently. Equality (=) For equality, I already got a solution: bool eq256(const UInt256 *lhs, const...
Conspicuous asked 20/4, 2015 at 8:34

1

Solved

I am new to NEON intrinsics. I have two arrays of 99 elements each, which I am trying to add element-wise using NEON intrinsics. 99 is not a multiple of 8, 16 or 32. 96 elements can be hand...
Bolin asked 11/3, 2022 at 11:8
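
A usual pattern for a length that is not a multiple of the vector width is a vector main loop followed by a scalar tail. The sketch below assumes `float` elements (the excerpt does not state the type), so 96 elements go through 4-lane vectors and the last 3 are handled one by one; the function name `add_f32` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). */
static void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Main loop: four floats per iteration while a full vector remains. */
    for (; i + 4 <= n; i += 4)
        vst1q_f32(dst + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
#endif
    /* Scalar tail: the leftover elements (or everything, off ARM). */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The tail loop never reads or writes past element `n - 1`, which avoids padding the arrays.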

3

Solved

Consider the following code, running on an ARM Cortex-A72 processor (optimization guide here). I have included what I expect are resource pressures for each execution port: Instruction B I0 I1 ...
Graphics asked 5/11, 2021 at 15:31

1

Solved

The simple test, unsigned f(unsigned long long x) { return __builtin_popcountll(x); } when compiled with clang --target=arm-none-linux-eabi -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a15 -Os,⁎ re...
Audiovisual asked 17/11, 2021 at 16:57

2

Solved

Consider the following code (Compiler Explorer link), compiled under gcc and clang with -O3 optimization: #include <arm_neon.h> void bug(int8_t *out, const int8_t *in) { for (int i = 0; i &...
Feinberg asked 7/10, 2021 at 22:30

2

I'm experimenting with a cross-platform SIMD library ala ecmascript_simd aka SIMD.js, and part of this is providing a few "horizontal" SIMD operations. In particular, the API that library offers in...
Konrad asked 3/7, 2015 at 1:40

1

I am porting some code I wrote to NEON using inline assembly. One of the things I need is to convert byte values ranging [0..128] to other byte values in a table which take the full range [0..255]...
Impart asked 3/3, 2014 at 21:48

2

Solved

This question was originally posed for SSE2 here. Since every single algorithm overlapped with ARMv7a+NEON's support for the same operations, the question was updated to include the ARMv7+NEON vers...
Yila asked 7/12, 2020 at 23:45

3

Solved

I want to convert a Neon 64-bit vector lane to get the n-th position(s) of non-zero (aka. 0xFF) 8-bit value(s), and then fill the rest of the vector with zeros. Here are some examples: 0 1 2 3 4 ...
Dispersant asked 15/9, 2016 at 8:34

5

I'm looking for the fastest way to test if a 128 NEON register contains all zeros, using NEON intrinsics. I'm currently using 3 OR operations, and 2 MOVs: uint32x4_t vr = vorrq_u32(vcmp0, vcmp1); ...
Donniedonnish asked 13/3, 2013 at 15:29
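
Two commonly suggested ways to do this test: on AArch64 a horizontal maximum (`vmaxvq_u32`) reduces the register in one step, while on ARMv7 the two 64-bit halves can be ORed and read back as a single 64-bit lane. The sketch below shows both under those assumptions; the names `is_all_zero_q` and `is_all_zero_u32x4` are hypothetical, and the scalar branch is only there so the snippet builds on non-ARM hosts.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* True if every bit of the 128-bit register is zero. */
static inline bool is_all_zero_q(uint32x4_t v)
{
#if defined(__aarch64__)
    /* The largest unsigned lane is zero only if all lanes are zero. */
    return vmaxvq_u32(v) == 0;
#else
    /* OR the two halves together, then test the result as one 64-bit lane. */
    uint32x2_t folded = vorr_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u64(vreinterpret_u64_u32(folded), 0) == 0;
#endif
}
#endif

/* Same test on four 32-bit words in memory. */
static bool is_all_zero_u32x4(const uint32_t w[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return is_all_zero_q(vld1q_u32(w));
#else
    return (w[0] | w[1] | w[2] | w[3]) == 0;
#endif
}
```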

3

Solved

An older answer indicates that aarch64 supports unaligned reads/writes and has a mention about performance cost, but it's unclear if the answer covers only the ALU or SIMD (128-bit register) operat...
Dianoia asked 16/8, 2017 at 13:11

4

Solved

I recently discovered the vreinterpret{q}_dsttype_srctype casting operator. However, this doesn't seem to support the data type conversion described at this link (bottom of the page): So...
Unaccomplished asked 20/4, 2017 at 13:38

4

Solved

https://web.archive.org/web/20170227190422/http://hilbert-space.de/?p=22 This site, which is quite dated, shows that hand-written asm would give a much greater improvement than the intrinsics....
Betseybetsy asked 22/3, 2012 at 18:48

7

I'm looking to optimize C++ code (mainly some for loops) using the NEON capability of computing 4 or 8 array elements at a time. Is there some kind of library or set of functions that can be used i...
Bibliotherapy asked 16/2, 2015 at 18:9

3

The Raspberry Pi ( armv7l architecture ) has neon vfpv4 support which can be used for optimization. Does the standard version of numpy include these optimizations when installing the command pip3...
Subalternate asked 4/9, 2018 at 7:32

2

Solved

I am looking for information about the new Scalable Vector Extension (SVE) from Arm. It looks amazingly good to me for image processing, being able to compute 2048 bits in parallel and so on. ...
Hobbes asked 21/12, 2016 at 13:4

2

Solved

I got a compilation error: unrecognized command line option '-mfpu=neon' when I tried to compile with the -mfpu=neon flag. Actually, every 'mfpu' option I tried failed. However in documentation this ...
Rockyrococo asked 24/4, 2015 at 15:11

3

Is there an intrinsic which allows one to add all of the elements in a lane? I am using Neon to multiply 8 numbers together, and I need to sum the result. Here is some paraphrased code to show what...
Seasoning asked 29/8, 2012 at 4:55
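
For a horizontal sum, AArch64 provides across-vector add intrinsics such as `vaddvq_f32`, while ARMv7 code typically folds the register with pairwise adds. The sketch below assumes four `float` lanes (the excerpt does not show the element type); the names `hsum_f32x4` and `hsum4` are hypothetical, with a scalar fallback so it also builds off ARM.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Sum of the four float lanes of v. */
static inline float hsum_f32x4(float32x4_t v)
{
#if defined(__aarch64__)
    return vaddvq_f32(v);
#else
    /* {v0 + v1, v2 + v3}, then add those two partial sums. */
    float32x2_t pair = vpadd_f32(vget_low_f32(v), vget_high_f32(v));
    pair = vpadd_f32(pair, pair);
    return vget_lane_f32(pair, 0);
#endif
}
#endif

/* Same reduction on four floats in memory. */
static float hsum4(const float p[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return hsum_f32x4(vld1q_f32(p));
#else
    return (p[0] + p[1]) + (p[2] + p[3]);
#endif
}
```

In a multiply-then-sum loop it is usually cheaper to keep accumulating in a vector and perform this reduction once at the end.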
