What are assembly instructions like PEXT actually used for?
I watched a YouTube video on the Top 10 Craziest Assembly Language Instructions, and some of these instructions have no obvious application to me. What's the point of something like PEXT, which takes only the bits from the second argument that match the positions of 1s in the first argument? How would the compiler know when to use this instruction? Same/similar questions about carry-less multiplication.

Disclaimer: I know little to nothing about assembly language. Maybe I should read up on it!

I hope this question is stackoverflow-appropriate.

Indophenol answered 14/11, 2021 at 19:20 Comment(8)
When you searched the internet for uses of pext and carry-less multiplication, what did you find? - Gentlemanfarmer
There are a number of different applications. A popular one is computing bishop attacks in chess programming. I used it as part of a perfect hash function for combinatorial search problems. - Brooch
Mostly I found descriptions of what they are and similar implementations of bitwise manipulation. - Indophenol
A couple of uses I've run across include How to unset N right-most set bits, and multiple parts of AVX2 what is the most efficient way to pack left based on a mask? where I used pext as a bit version of left-packing, and pdep to expand a bitmask to bytes. Google site:stackoverflow.com pext pdep for more. IDK if there are any cases where a compiler would use it for you; normally you'd use it via intrinsics, i.e. _pdep_u64. - Burglary
Carry-less multiplication is equivalent to multiplication of polynomials with coefficients in the integers mod 2; see en.wikipedia.org/wiki/Carry-less_product. Also known as Galois field multiplication. It shows up in cryptography algorithms like AES and in error-correcting codes. So there are definite uses for this. - Prakash
This is gonna sound kinda stupid, but forgive me. When would one have to manipulate bits so finely like that? Definitely a "lack of experience" question. It seems very useful for the compiler, I mean, more than for the programmer. Sorry if I'm misunderstanding something here! - Indophenol
Unfortunately(?) it's the other way around: compilers are relatively clueless when it comes to these instructions, but humans can use them in various inventive ways. BTW, a neat use of PEXT is in calculating the indices of set bits. - Cantrell
My impression of pdep/pext is that they're sort of "because we can" instructions. They have various special cases that could be useful (packing/unpacking values scattered across bitfields, interleaving bits for Z-ordering, etc.), so perhaps the designers just went ahead and implemented a generic version to allow programmers to be creative. It's one of those things, like popcount or lzcount, that can be done many times more efficiently in hardware with a dedicated instruction than in software using existing instructions. - Prakash
You can find some applications listed in the paper on the hardware unit for PDEP/PEXT:

There are many emerging applications, such as cryptography, imaging and biometrics, where more advanced bit manipulation operations are needed. While these can be built from the simpler logical and shift operations, the applications using these advanced bit manipulation operations are significantly sped up if the processor can support more powerful bit manipulation instructions. Such operations include arbitrary bit permutations, performing multiple bit-field extract operations in parallel, and performing multiple bit-field deposit operations in parallel. We call these permutation (perm), parallel extract (pex) or bit gather, and parallel deposit (pdep) or bit scatter operations, respectively.

Performing Advanced Bit Manipulations Efficiently in General-Purpose Processors

Bit permutation is extremely common in bitboards, for example reversing bytes/words or mirroring bit arrays. Many bitboard algorithms require extensive bit manipulation, and people had to get creative before the era of PEXT/PDEP. Later, many card game engines also used the technique to hold a whole game set in just one or a few registers.

PDEP/PEXT is also used to greatly improve bit-interleaving performance, which is common in algorithms like Morton codes. For example:

The multiplication technique invented for bitboards is also commonly used in many algorithms in Bit Twiddling Hacks, for example interleaving bits with a 64-bit multiply. This technique is no longer needed when PDEP/PEXT is available.

You can find more detailed information in Bit permutations and Hacker's Delight.

Another use of PDEP/PEXT is extracting or combining fields whose bits are not in contiguous positions, for example when disassembling RISC-V instructions, where immediate bits are scattered around to simplify the hardware design but become messier to work with in software without PDEP/PEXT.

Some other applications:

I think the pext / pdep instructions have HUGE implications to 4-coloring problem, 3-SAT, Constraint Solvers, etc. etc. More researchers probably should look into those two instructions.

Just look at Binary Decision Diagrams, and other such combinatorial data structures, and you can definitely see the potential uses of PEXT / PDEP all over the place.

https://news.ycombinator.com/item?id=19137260


How would the compiler know when to use this instruction?

Compilers can recognize common patterns and optimize instruction sequences, but for advanced things like this, programmers usually need to call intrinsics explicitly from high-level code.

Zizith answered 21/3, 2022 at 17:6 Comment(2)
You write that "Later many card game engines also use that technique to deal with a single game set in just one or a few registers". Do you have any reference or info where I can read more about this? - Agent
@PetterT I don't remember where I read that; it was long ago. However, some people definitely use it, especially in games that only care about the value and not the suits: More efficient way to store Playing Cards in bits? - Zizith
PDEP (Parallel Deposit) and PEXT (Parallel Extract) are meant to be a convenient way to extract and deposit bit fields. I'd bet there are good low-level use cases for them.

For actual uses: I wrote a Sudoku solver that used PEXT in a couple of functions to extract bit values. Thanks to PEXT I was able to extract 4 elements in a single instruction (versus 1 with the normal approach). It was really convenient. If you really want, I could put up a code snippet on Compiler Explorer to show the difference.

Lyndonlyndsay answered 16/3, 2022 at 20:19 Comment(0)
Here's a real world use case from our recent publication Transcoding Unicode Characters with AVX-512. A total of 4 pext instructions are used to convert from UTF-8 to UTF-16.

64 bytes of UTF-8 text are loaded into a vector. We classify* the bytes into ASCII bytes (00 to 7f), follow bytes (80 to bf), 2-byte lead bytes (c0 to df), 3-byte lead-bytes (e0 to ef) and 4-byte lead bytes (the others). ASCII bytes, 2-byte lead bytes, 3-byte lead bytes, and 4-byte lead bytes are lead bytes.

  1. The first pext instruction is used in the fast-path for when there are no 3-byte or 4-byte lead bytes. In such a case, we extract the lead byte mask with itself using pext to get a mask of the first n bits set where n is the number of bits in the lead byte mask. We need this mask to then store n bytes of output into memory. Using pext is faster than alternatives like popcnt followed by bzhi.

  2. In the fast path for “no 4-byte lead bytes”, after transcoding has finished, we have a mask that tells us where the last bytes of each UTF-8 sequence were. We take the mask of 3-byte lead bytes, shift it to the left by 2 (to instead indicate the final bytes of 3-byte sequences) and extract it with pext through the last byte mask, giving us a mask that for each UTF-8 sequence (corresponding to one UTF-16 word in the output) tells us if that word corresponds to a 3-byte sequence. We then use this mask to check if there were any invalid 3-byte sequences in the input (i.e. surrogates or overlong sequences).

  3. In the main code path, we use a similar approach to find out which of the generated UTF-16 words are supposed to be surrogates so we can check for correct sequencing and apply some postprocessing.

  4. Lastly, the same approach from (2) is also used in the main code path, once again to validate 3-byte sequences.

All of this would have been significantly harder without pdep and pext and it's really important to have them.


* invalid bytes c0, c1, and f8 to ff are sorted out in a later step.

Brooch answered 6/11, 2023 at 0:21 Comment(0)
The following isn't directly about the uses of PDEP/PEXT but about their performance, which affects whether using them makes sense at all. I have a Zen 2 Ryzen Threadripper 3990X under Windows 11, and I tested the throughput of PDEP and PEXT with the intrinsics of MSVC++ and Intel C++ under Windows, and clang++ and g++ under Linux. Here's the code:

#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <cstdint>
#include <atomic>
#if defined(_MSC_VER)
    #include <intrin.h>
#elif defined(__GNUC__) || defined(__llvm__)
    #include <immintrin.h>
#endif

using namespace std;
using namespace chrono;

atomic_uint64_t aSum( 0 );

int main()
{
    constexpr size_t
        N = 0x1000,
        ROUNDS = 10'000;
    vector<uint64_t> data( N, 0 );
    mt19937_64 mt;
    uniform_int_distribution<uint64_t> uid( 0, UINT64_MAX );
    for( uint64_t &d : data )
        d = uid( mt );
    auto pdep = []( uint64_t data, uint64_t mask ) -> uint64_t { return _pdep_u64( data, mask ); };
    auto pext = []( uint64_t data, uint64_t mask ) -> uint64_t { return _pext_u64( data, mask ); };
    auto bench = [&]<typename Permute>( Permute permute ) -> double
    {
        uint64_t sum = 0;
        auto start = high_resolution_clock::now();
        constexpr uint64_t MASK = 0x5555555555555555u;
        for( size_t r = ROUNDS; r--; )
            for( uint64_t d : data )
                sum += permute( d, MASK );
        double ns = (double)(int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / ((double)N * ROUNDS);
        ::aSum = sum;
        return ns;
    };
    cout << bench( pdep ) << endl;
    cout << bench( pext ) << endl;
}

According to the data on agner.org, PDEP/PEXT should have a latency and throughput of slightly below 20 clock cycles on my Zen 2 CPU. On Intel CPUs since Haswell, the latency is only 3 clock cycles and the throughput is a whopping one instruction per clock cycle.
But according to my measurements, each instruction takes about 35 ns, i.e. about 150 clock cycles, on my CPU. There's no measurement error, and the disassembly I checked matches what you'd write in assembly. So I'm curious about the numbers for other CPUs; maybe you'll report them here. That would help assess whether using PDEP or PEXT makes sense.

Althorn answered 14/5, 2022 at 14:49 Comment(10)
I've seen it said that PEXT/PDEP microcode on AMD before Zen 3 has data-dependent performance, so some inputs are faster than others. Maybe Agner only checked with one input value which happened to be simple, like all-zero? - Burglary
Anyway, this is clearly not an answer to this SO question, but rather a question you should post separately, with a title like "How do PEXT/PDEP perform on Zen 2 and earlier?". The question body can be mostly this, your experiment and results. Details on exactly how slow it is before Zen 3 added HW support would be nice to have; it'd be a good question to have on SO. (We can edit this question and/or one of the answers to mention it being slow on many existing AMD CPUs, since that's good info for future readers to beware of. But it doesn't belong as a full answer here.) - Burglary
@PeterCordes: I know that my question only partially fits here, but when I google for "PDEP PEXT Stack Overflow" this is the only article about PDEP and PEXT here, so it is not the worst idea to post it here. And you're right, Peter, the results vary according to the mask. If I have a mask of 0, i.e. extracting or depositing no bits, the results match Agner's data. Should I inform him? Better not; I've already mailed him for another reason and got a dismissive reply. - Althorn
This isn't an "article", it's a question, about something else. Finding a very different question involving the same function or instruction never makes it OK to post a new question as an answer on Stack Overflow. (Also, my Google search for site:stackoverflow.com pext pdep found 137 hits. When I typoed PEXT as PEX, I only got 2, including this and a typoed comment on the AVX2 left-packing Q&A.) I guess I can see why you thought it would make sense to post a performance note on a question about using it, so it's not as totally obvious as usual, but you wrote this like a question. - Burglary
Re: emailing Agner Fog or posting on his blog forum: if you have specific facts to share, like that PEXT's data-dependent performance on Zen 2 has 20 cycles as the best case, with the worst case being way worse, I think he'd find that useful. IDK what you emailed him about before, but this is fully relevant. - Burglary
OTOH, Agner Fog's tables are no longer the best-maintained source for such info; uops.info automated the testing process enough to avoid human error in updating a spreadsheet, and to explore more corner cases. Andreas Abel maintains it, and is occasionally active on Stack Overflow. The uops.info tables do try to cover some fast vs. slow cases to give a throughput and latency range for div, IIRC, so he might be interested in adding that for PEXT/PDEP. (He might be aware and just not have gotten around to it, but maybe not, or maybe he didn't realize how variable it can be.) - Burglary
The uops.info tables don't actually show the throughput or uop range for variable instructions like idiv on Conroe, for example, only for latency. But slow vs. fast division is in the throughput test results: uops.info/html-tp/CON/IDIV_R64-Measurements.html - Burglary
IDIV / DIV performance is obviously low since iterative subtract-and-shift is the only working solution. But the performance of PDEP and PEXT is not necessarily so low. - Althorn
My point was that uops.info does already have some limited support in its UI for displaying variable latency, and in its database for at least having the slow vs. fast throughput details available for applicable instructions, even if that part isn't shown in the table. So the infrastructure is there to make the web site do something with PDEP/PEXT, once you tell the relevant human about it. (And BTW, no, iterative subtract-and-shift isn't the only working algorithm. My understanding is that modern dividers use a table for an initial approximation and then pipelined Newton-Raphson.) - Burglary
I've downvoted this answer as it does not attempt to answer the question. - Brooch
