My image processing project works with grayscale images. I have an ARM Cortex-A8 platform and want to make use of NEON.
I have a grayscale image (see the example below), and in my algorithm I only have to add up the columns.
How can I load four 8-bit pixel values (uint8_t) in parallel as four uint32_t values into one of the 128-bit NEON registers? Which intrinsic do I have to use to do this?
I have to load them as 32-bit values because the moment I add 255 + 255 I get 510, which can't be held in 8 bits.
For example:
255 255 255 255 ......... (640 pixels)
255 255 255 255
255 255 255 255
255 255 255 255
.
.
.
.
.
(480 pixels)
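
To make the goal concrete, here is a minimal sketch of the column sum I am trying to compute. It has not been tested on the target, and several things in it are assumptions: the name `column_sums` is made up for illustration, the image is assumed to be stored row by row with no padding between rows, and the NEON path is only compiled when the compiler defines `__ARM_NEON` (otherwise the plain loop does all the work). As far as I can tell there is no single load that turns four uint8_t directly into four uint32_t, so the sketch loads eight pixels with `vld1_u8`, widens them to 16 bits with `vmovl_u8`, and adds the halves into 32-bit accumulators with `vaddw_u16`. The totals are kept in 32 bits because a full column can reach 480 * 255 = 122400, which does not fit in 16 bits either.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name and signature are illustrative only):
   adds every row of a width x height 8-bit image into sums[0..width-1],
   giving one 32-bit total per column. Assumes rows are stored
   back to back with no padding. */
void column_sums(const uint8_t *img, size_t width, size_t height,
                 uint32_t *sums)
{
    assert(img != NULL && sums != NULL);

    for (size_t x = 0; x < width; ++x)
        sums[x] = 0;

    for (size_t y = 0; y < height; ++y) {
        const uint8_t *row = img + y * width;
        size_t x = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* Process 8 pixels per iteration. */
        for (; x + 8 <= width; x += 8) {
            uint8x8_t  p8  = vld1_u8(row + x);        /* 8 x uint8  */
            uint16x8_t p16 = vmovl_u8(p8);            /* 8 x uint16 */
            uint32x4_t lo  = vld1q_u32(sums + x);     /* running totals */
            uint32x4_t hi  = vld1q_u32(sums + x + 4);
            lo = vaddw_u16(lo, vget_low_u16(p16));    /* widen and add */
            hi = vaddw_u16(hi, vget_high_u16(p16));
            vst1q_u32(sums + x, lo);
            vst1q_u32(sums + x + 4, hi);
        }
#endif

        /* Scalar path: remaining pixels, or everything without NEON. */
        for (; x < width; ++x)
            sums[x] += row[x];
    }
}
```

For the 640 x 480 image above I would call it as `column_sums(image, 640, 480, sums)` with `sums` pointing at 640 uint32_t values. Reloading and storing the accumulators for every row is probably not optimal, so the sketch is only meant to show the widening step I am asking about.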