glPixelStorei(GL_UNPACK_ALIGNMENT, 1) Disadvantages?

3

37

What are the disadvantages of always using an alignment of 1?

glPixelStorei(GL_UNPACK_ALIGNMENT, 1)
glPixelStorei(GL_PACK_ALIGNMENT, 1)

Will it impact performance on modern gpus?

Cene answered 14/6, 2012 at 22:7 Comment(8)
You mean, besides the fact that some of your data may not have 1-byte aligned rows?Syneresis
For non-POTS textures it may affect upload/download speed. For POTS textures it should have no effect.Jacobine
@DietrichEpp: What is POTS-textures?Cene
@NicolBolas: How can data not be 1-byte aligned?Cene
@ronag: POTS = "power of two sized", e.g., 512x64. NPOTS = "non power of two sized", e.g., 640x480. Basically, changing the alignment on a texture does not affect anything for textures that are a multiple of 16 pixels wide, since the data will still be aligned. If the texture has an odd width (like 501 pixels) then it may take slightly longer to upload it to the GPU. But you will probably not see any difference, since most programs aren't limited by texture upload performance.Jacobine
@DietrichEpp: My application is limited by upload/download performance. I guess I will have to benchmark; I just want to get a better understanding of the performance characteristics. I don't quite understand why it would be so much slower. As with an SSE-optimized memcpy, I think it should be able to copy most of the data at the correct alignment and then have a special case for the last bytes.Cene
@ronag: If your application is actually limited by upload/download performance, you may wish to refer to developer.download.nvidia.com/assets/gamedev/docs/… There are similar papers from AMD (and probably Intel), but the same techniques and texture formats will probably work across most modern graphics cards. Long story short: if you choose the right format and alignment, then you get a direct copy from application memory to GPU memory, which isn't significantly faster but it is measurably faster.Jacobine
This isn't like SSE optimized memcpy, since there are literally millions of different cases to handle. There are a number of different fast and slow paths through the packer and unpacker, which might be implemented in hardware or software, might be bypassed completely (simple DMA), and it all depends on the formats and alignments you choose. Choosing a different format might move you from a hardware path to a software path.Jacobine
51

How can data not be 1-byte aligned?

This strongly suggests a lack of understanding of what the row alignment in pixel transfer operations means.

Image data that you pass to OpenGL is expected to be grouped into rows. Each row contains width number of pixels, with each pixel being the size as defined by the format and type parameters. So a format of GL_RGB with a type of GL_UNSIGNED_BYTE will result in a pixel that is 24-bits in size. Pixels are otherwise expected to be packed, so a row of 16 of these pixels will take up 48 bytes.

Each row is expected to be aligned on a specific value, as defined by the GL_PACK/UNPACK_ALIGNMENT. This means that the value you add to the pointer to get to the next row is: align(pixel_size * width, GL_*_ALIGNMENT). If the pixel size is 3-bytes, the width is 2, and the alignment is 1, the row byte size is 6. If the alignment is 4, the row byte size is eight.
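
To make the arithmetic above concrete, here is a minimal sketch in C. The name `row_stride` and its parameters are illustrative assumptions, not part of the OpenGL API, and the pixel size is assumed to be given in whole bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a GL function): bytes from the start of one
   row to the start of the next. This is the tightly packed row size
   rounded up to the next multiple of `alignment`. */
size_t row_stride(size_t bytes_per_pixel, size_t width, size_t alignment)
{
    /* OpenGL only accepts 1, 2, 4 or 8 for GL_PACK/UNPACK_ALIGNMENT. */
    assert(alignment == 1 || alignment == 2 ||
           alignment == 4 || alignment == 8);

    size_t packed = bytes_per_pixel * width;
    return (packed + alignment - 1) / alignment * alignment;
}
```

With 3-byte pixels and a width of 2, this gives 6 for an alignment of 1 and 8 for an alignment of 4, matching the numbers above.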

See the problem?

Image data, which may come from some image file format as loaded with some image loader, has a row alignment. Sometimes this is 1-byte aligned, and sometimes it isn't. DDS images have an alignment specified as part of the format. In many cases, images have 4-byte row alignments; pixel sizes less than 32-bits will therefore have padding at the end of rows with certain widths. If the alignment you give OpenGL doesn't match that, then you get a malformed texture.

You set the alignment to match the image format's alignment. Unless you know or otherwise can ensure that your row alignment is always 1 (and that's unlikely unless you've written your own image format or DDS writer), you need to set the row alignment to be exactly what your image format uses.
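
If the image loader reports the number of bytes between row starts (often called stride or pitch), one way to find a matching alignment is to try each value OpenGL accepts. This is only a sketch under that assumption: `pick_unpack_alignment` is a hypothetical helper, not a GL call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the largest alignment (8, 4, 2 or 1) whose
   padded row size equals the loader's stride, or 0 if none matches.
   Any matching value addresses the same bytes, so the largest is as good
   as the smallest. */
int pick_unpack_alignment(size_t packed_row_bytes, size_t stride_bytes)
{
    static const size_t candidates[] = {8, 4, 2, 1};

    assert(stride_bytes >= packed_row_bytes);

    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; ++i) {
        size_t a = candidates[i];
        /* Round the packed row size up to a multiple of a. */
        size_t padded = (packed_row_bytes + a - 1) / a * a;
        if (padded == stride_bytes)
            return (int)a;
    }
    return 0; /* no alignment reproduces this stride */
}
```

A nonzero result would then be passed to glPixelStorei(GL_UNPACK_ALIGNMENT, ...) before the upload. A result of 0 means something else is needed, such as GL_UNPACK_ROW_LENGTH or repacking the rows.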

Syneresis answered 15/6, 2012 at 5:52 Comment(13)
Yea that makes sense, I wasn't thinking in terms of rows. So if I can control the "linesize/row bytes" (which I can since I have to copy the data to a PBO anyways), what alignment would be most optimal? Is there any performance improvement with using 8 byte alignment?Cene
If I have GL_RGB data, should I convert it myself to GL_RGBA while copying it over to the PBO?Cene
@ronag: You are unlikely to beat the performance of the unpacker, and that's a notoriously complicated function to write using SIMD. It sounds like a lot of work, which would have to be justified with benchmarks showing that your application is limited by upload speed and that upload speed is adversely affected by the unpacker.Jacobine
@DietrichEpp: Well, if I am already copying to the PBO, converting it to GL_RGBA and fixing the alignment while copying would probably be on the order of 1% overhead, which would be faster than the packer/unpacker, assuming that the GPU driver does these operations on the CPU anyway.Cene
However, what I'm not sure of is whether or not the driver will convert it to GL_BGRA before performing the DMA upload, or whether this conversion occurs on the gpu level.Cene
@ronag: There are so many different variables here, and you're considering the most complicated option, which is basically to reimplement the OpenGL unpacker in your application in the hopes that it may be faster. The fact that you guess that RGB -> BGRA conversion has "probably on the order of 1% overhead" means that you probably aren't ready for this -- that guess doesn't pass the smell test for a couple of reasons (chat if you want them). Perhaps you would like to switch to chat, or open another question about improving your application performance, assuming you can post benchmark data.Jacobine
@ronag: 1) You are not going to beat OpenGL's unpacker. 2) Even if you were going to beat it, PBO unpacking is asynchronous, which would require explicit thread usage from your application. 3) The proper way to optimize this is to benchmark various different formats until you find the one(s) that your hardware natively supports (i.e. the ones that are fastest), and then preprocess your data so that it conforms to those. The format/type values are far more important to this than the row alignment.Syneresis
@Dietrich Epp: "RGB -> BGRA conversion has "probably on the order of 1% overhead". Not to argue, but since I already have the data in the registers while copying to the PBO, I think all that is needed is 2 extra PSHUFB instructions per 16-byte block to do the conversion, which is a rather small overhead compared to reading/writing to/from system memory. Though I might be wrong.Cene
@Nicolas: I'm not talking about beating the unpacker, just that since I already have to read the data into the CPU cache while copying, I might as well do the "preprocessing" while copying to the PBO and avoid having the driver read/write system memory again, instead of just starting the asynchronous DMA transfer.Cene
@ronag: Come back with some benchmarks and make a new question, otherwise this is all idle speculation (we could be discussing the number of angels that fit on the head of a pin). I can explain why 1% is unlikely if you put it in a question, but a long series of comments is not the right place.Jacobine
Nicol's answer is spot on, but I still managed to be confused for a while about how the data could not be 1-byte aligned. For those stuck on this point: assume that you are getting the data from an image loading library, so you didn't control the data and thus didn't control the layout. The data may therefore be imported with padding bytes at the end of each row. This alignment is what you are specifying. Hope this helps someone!Relevant
Dear @Nicol Bolas, here in the link that you suggested in your answer, we read: "For example, if the format is GL_RGB, and the type is GL_UNSIGNED_BYTE, and the width is 9, then each row is 27 bytes long. If the alignment is 8, this means that the second row begins 32 bytes after the first. If the alignment is 1, then it begins 27 bytes after the first row." Can you explain to me why the next row starts at the 32nd byte when the alignment is 8?Distorted
@sepideh: Because that's what alignment means. Each line must start on a whole multiple of the alignment. 32 is the next multiple of 8 that is higher than 27.Syneresis
11

Will it impact performance on modern gpus?

No, because the pixel store settings are only relevant for the transfer of data from or to the GPU, namely the alignment of your data. Once in GPU memory, it's aligned in whatever way the GPU and driver desire.

Killian answered 14/6, 2012 at 23:50 Comment(3)
I think he's asking if one or more GPUs would require CPUs to do some modification to the data before uploading. That is, if the GPU can't handle byte-aligned rows.Syneresis
I'm asking whether and when it might affect gpu uploads/downloads and if there are any other aspects I need to think about.Cene
@ronag: Well, if the GPU cannot deal with the data format of your original data, then the driver needs to convert the data first. However, since you'll only be uploading texture data occasionally, overall it will consume very little CPU time. Besides, data transfers are not what I'd consider a GPU performance problem. The bottleneck is clearly the CPU and the data bus.Killian
2

There will be no impact on performance. Setting a higher alignment (in OpenGL) doesn't improve anything or speed anything up.

All alignment does is tell OpenGL where to expect the next row of pixels. You should always use an alignment of 1 if your image pixels are tightly packed, i.e. if there are no gaps between where a row of bytes ends and where a new row starts.

The default alignment is 4 (i.e. OpenGL expects each row of pixels to start at an offset that is a multiple of 4 bytes), which may cause problems when you load R, RG or RGB textures whose components are not 4-byte floats and whose width is not divisible by 4. If your image pixels are tightly packed, you have to change the alignment to 1 for the unpacking to work.

You could (I personally haven't encountered one) have an image of, say, 3x3 RGB ubyte whose rows are aligned to 4 bytes, with 3 extra bytes used as padding at the end. Each row would look like this:

R - G - B - R - G - B - R - G - B - X - X - X (12 bytes in total)

The reason for it is that aligned data improves processor performance (I'm not sure how true or justified that is on today's processors). If you have any control over how the original image is laid out, then aligning it one way or another may improve the handling of it. But this is done PRIOR to OpenGL. OpenGL has no way of changing anything about this; it only cares about where to find the pixels.

So, back to the 3x3 image row above: setting the alignment to 4 would be good (and necessary) to skip over the trailing padding. If you set it to 1, it will mess up your result, so you need to keep or restore it to 4. (Note that you could also use ROW_LENGTH to skip over it, as this is the parameter used when dealing with subsets of the image, in which case you sometimes have to skip much more than 3 or 7 bytes, 7 being the most an alignment of 8 can give you. In our example, supplying a row length of 4 and an alignment of 1 will also work.)
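
As a rough check of that last claim, the sketch below models how the row stride is derived when GL_UNPACK_ROW_LENGTH is in play: a nonzero row length replaces the image width before the alignment rounding is applied. `unpack_stride` is a hypothetical helper for illustration, not a GL function, and it assumes whole-byte pixel sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes between row starts for an upload.
   row_length == 0 means "use the image width", mirroring the default
   value of GL_UNPACK_ROW_LENGTH. */
size_t unpack_stride(size_t bytes_per_pixel, size_t width,
                     size_t row_length, size_t alignment)
{
    assert(alignment == 1 || alignment == 2 ||
           alignment == 4 || alignment == 8);

    size_t pixels_per_row = row_length > 0 ? row_length : width;
    size_t packed = bytes_per_pixel * pixels_per_row;
    return (packed + alignment - 1) / alignment * alignment;
}
```

For the 3x3 RGB ubyte image, a row length of 4 with an alignment of 1 and a row length of 0 with an alignment of 4 both give a 12-byte stride, while a row length of 0 with an alignment of 1 gives 9 and would misread the padded rows.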

Same goes for packing. You can tell OpenGL to align each pixel row to 1, 2, 4 or 8 bytes. If you're saving a 3x3 RGB ubyte image, you should set the alignment to 1. Technically, if you want the resulting rows to be tightly packed, you should always give 1. If you want (for whatever reason) to create some padding, you can give another value. Giving (in our example) a PACK_ALIGNMENT of 4 would result in rows that look like the row above (with the 3 extra padding bytes at the end). Note that in that case your containing object (OpenCV Mat, bitmap, etc.) should be able to receive that extra padding.
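
For the packing direction, a conservative way to size the destination buffer is to reserve a full padded row for every row. The sketch below assumes whole-byte pixel sizes and default values for the other pack parameters; `readback_buffer_size` is a hypothetical helper, not part of OpenGL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: a buffer size that is large enough to receive
   `height` rows of `width` pixels when rows are padded to
   `pack_alignment` bytes. The final row is counted with its padding too,
   which is the safe upper bound. */
size_t readback_buffer_size(size_t bytes_per_pixel, size_t width,
                            size_t height, size_t pack_alignment)
{
    assert(pack_alignment == 1 || pack_alignment == 2 ||
           pack_alignment == 4 || pack_alignment == 8);

    size_t packed = bytes_per_pixel * width;
    size_t stride = (packed + pack_alignment - 1)
                    / pack_alignment * pack_alignment;
    return stride * height;
}
```

For the 3x3 RGB ubyte example, this gives 27 bytes with a pack alignment of 1 and 36 bytes with a pack alignment of 4.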

Apostolate answered 28/3, 2019 at 20:39 Comment(0)
