I have a problem in which eight elements can each hold 0, 1, or 2. I can easily represent this in 16 bits, but for SIMD efficiency reasons, I need it to occupy 13 bits (it is not the only thing present in the lane).
Fortunately, 2^13 == 8192 and 3^8 == 6561, so the states I want can fit. However, here's where things get interesting. Naively, I would just represent these states by counting the ternary numeral states. For example, to represent the tritmask 0t12211012 (I'll use this as an example throughout), I could just write 0t12211012 = 2*3^0 + 1*3^1 + 0*3^2 + 1*3^3 + 1*3^4 + 2*3^5 + 2*3^6 + 1*3^7 = 4244 = 0b1000010010100.
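A minimal sketch of this naive encoding in C++, assuming the trits are held least-significant first in an array. The names `encode_trits` and `kExample` are illustrative only and do not come from any library:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: packs eight trits (each 0, 1, or 2) into one integer.
// trits[k] is the digit that multiplies 3^k, so trits[0] is least significant.
uint16_t encode_trits(const std::array<uint8_t, 8>& trits) {
    uint16_t value = 0;
    // Horner's rule, starting from the most significant trit.
    for (int k = 7; k >= 0; --k) {
        assert(trits[k] <= 2);
        value = static_cast<uint16_t>(value * 3 + trits[k]);
    }
    return value;
}

// The running example 0t12211012, written least-significant trit first.
const std::array<uint8_t, 8> kExample = {2, 1, 0, 1, 1, 2, 2, 1};
```

With this layout, `encode_trits(kExample)` evaluates to 4244, and the largest possible value (all trits equal to 2) is 6560, which is below 2^13.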
I have a set of operations I need to support:
- Modify trits. This is easy in the default representation. For instance, if I have tritmask 0t12211012 and I wish to place a 2 in the position holding a zero, I can simply add 0t200 = 18. (Note that the conversion to tritspace is easy, because I only have 8 trits, so I can store the base powers in a register and index it with pshufw.)
- Find all elements set to a particular value. For example, given the tritmask 0t12211012, I want to be able to extract the bitmask for 0, which is 0b00000100, for 1, which is 0b10011010, and for 2, which is 0b01100001. This I have not figured out how to do, and is what I would like assistance with. How can I do this in a small number of operations suitable for x86 SIMD?
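For concreteness, the first operation (modifying a trit) can be sketched in scalar C++ as follows. This assumes a small table of powers of three stands in for the register lookup; `kPow3` and `add_to_trit` are hypothetical names used only for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Powers of three for the eight trit positions; in SIMD code these would sit
// in a register and be selected by a shuffle.
constexpr uint16_t kPow3[8] = {1, 3, 9, 27, 81, 243, 729, 2187};

// Hypothetical helper: adds `delta` (in -2..2) to the trit at `pos`.
// The caller must ensure the trit stays in 0..2, otherwise the change
// carries or borrows into the neighbouring trits.
uint16_t add_to_trit(uint16_t encoded, unsigned pos, int delta) {
    assert(pos < 8);
    return static_cast<uint16_t>(encoded + delta * kPow3[pos]);
}
```

For the running example, `add_to_trit(4244, 2, 2)` adds 2 * 3^2 = 18 and yields 4262, which is 0t12211212.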
Thank you!
Edit 11/18/20: To give an example of an approach I consider too slow: we can iteratively find the value mod 3 and divide by 3 to pull trits off the least-significant end of the representation, then assemble the mask that way. C++ snippet:
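    #include <cstdint>  // uint8_t, uint32_t

    uint32_t trits = <something>;
    uint8_t mask0 = 0, mask1 = 0, mask2 = 0;
    for (uint8_t shift = 0; shift < 8; ++shift) {
        // Peel off the least-significant remaining trit.
        const uint32_t remainder = trits % 3;
        // Set bit `shift` in whichever mask matches this trit's value.
        mask0 |= (!remainder) << shift;
        mask1 |= (remainder == 1) << shift;
        mask2 |= (remainder == 2) << shift;
        // Shift right by one position in tritspace.
        trits /= 3;
    }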
When actually writing this in a SIMD language, we would use the standard multiply-and-shift trick for division by a constant. But you can see it's linear in the number of trits, and has a lot of ops per iteration. We could code-golf this down a bit, but I think it is fundamentally the wrong approach. It should ideally be possible to do something in parallel for each trit... but I don't see it.
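As a scalar illustration of the multiply-and-shift trick mentioned above, here is a sketch for 16-bit inputs. The function names `div3` and `mod3` are made up for this example. The constant 0xAAAB is ceil(2^17 / 3), which gives the exact quotient for every 16-bit value:

```cpp
#include <cstdint>

// Quotient x / 3 for any 16-bit x, using a multiply and a shift.
// 0xAAAB == ceil(2^17 / 3), which is exact over the whole uint16_t range.
inline uint16_t div3(uint16_t x) {
    return static_cast<uint16_t>((static_cast<uint32_t>(x) * 0xAAABu) >> 17);
}

// Remainder x % 3, recovered from the quotient.
inline uint16_t mod3(uint16_t x) {
    return static_cast<uint16_t>(x - 3 * div3(x));
}
```

In a SIMD version, the same computation would presumably map to a 16-bit unsigned high multiply (PMULHUW on x86) followed by a one-bit right shift, but that mapping is an assumption here and not something measured.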
Edit 11/20/20: I've made a halfhearted effort to apply Aha to this problem without success. Maybe an interesting subproblem to solve instead is - is there a short sequence of bitwise ops under the same constraints as above that acts as a 'ternary bitwise AND'? That is, an op that compares two encoded numbers in tritspace and returns a bitmask that is 1 when the corresponding trits are equal and zero otherwise? That would be a primitive from which we could build up the ops needed. We have left and right shift in tritspace (just multiply or divide by 3); and we have +/- a value. So what we are missing is the ability to test if trits are particular values...
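To pin down the semantics of that hypothetical primitive, here is a scalar reference sketch. It is not a fast implementation, only a specification that a candidate op sequence could be tested against, and the name `trit_equal_mask` is invented for this example:

```cpp
#include <cassert>
#include <cstdint>

// Reference semantics only (hypothetical name): bit k of the result is 1
// exactly when trit k of `a` equals trit k of `b`.
uint8_t trit_equal_mask(uint16_t a, uint16_t b) {
    assert(a < 6561 && b < 6561);  // both must be valid 8-trit encodings
    uint8_t mask = 0;
    for (unsigned k = 0; k < 8; ++k) {
        mask |= static_cast<uint8_t>((a % 3 == b % 3) << k);
        a /= 3;
        b /= 3;
    }
    return mask;
}
```

With such a primitive, the per-value masks from the original question fall out by comparing against a broadcast constant: 0t00000000 = 0, 0t11111111 = 3280, and 0t22222222 = 6560. For the running example, `trit_equal_mask(4244, 3280)` gives 0b10011010.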
[...] pmovzxbw to load from that 8-bit array and line up metadata with trit elements) – Doan
[...] uint16_t three2four[6561] pre-computed array? From 4 quadinary to 4 trits, maybe 2 stages of unsigned uint8_t four2three[256]? – Tafia
[...] uint16_t three2four[6561] though – Trinitarianism
[...] k by 3^k and calculate the remainder mod 3 of that (all using inverse-multiplication, of course) -- there are probably more elegant solutions, though. – Schreib