Testing which trits are set in a binary representation

I have a problem for which I have eight elements that can contain 0, 1, or 2. I can easily represent this in 16 bits, but for SIMD efficiency reasons, I need it to occupy 13 bits (it is not the only thing present in the lane).

Fortunately, 2^13 == 8192 and 3^8 == 6561, so the states I want can fit. However, here's where things get interesting. Naively, I would represent each state by the value of its ternary numeral. For example, to represent the tritmask 0t12211012 (I'll use this as an example throughout), I could just write 0t12211012 = 2*3^0 + 1*3^1 + 0*3^2 + 1*3^3 + 1*3^4 + 2*3^5 + 2*3^6 + 1*3^7 = 4244 = 0b1000010010100.
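For reference, a minimal scalar sketch of this encoding. The helper name `pack_trits` and the convention that array index 0 holds the least-significant trit are assumptions made for illustration, not part of the original problem statement.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: pack 8 trits (index 0 = least-significant trit)
// into a single integer in [0, 6561), which fits in 13 bits.
constexpr std::uint16_t pack_trits(const std::array<std::uint8_t, 8>& trits) {
    std::uint16_t value = 0;
    // Horner's rule, starting from the most-significant trit.
    for (int i = 7; i >= 0; --i) {
        value = static_cast<std::uint16_t>(value * 3 + trits[i]);
    }
    return value;
}
```

With the trits of 0t12211012 listed least-significant first, that is {2, 1, 0, 1, 1, 2, 2, 1}, this returns 4244.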

I have a set of operations I need to support:

  • Modify trits. This is easy in the default representation. For instance, if I have tritmask 0t12211012 and I wish to place a 2 in the position holding a zero, I can simply add 0t200=18. (Note that the conversion to tritspace is easy, because I only have 8 trits, so I can store the base powers in a register and index it with pshufw).
  • Find all elements set to a particular value. For example, given the tritmask 0t12211012, I want to be able to extract the bitmask for 0, which is 0b00000100, for 1, which is 0b10011010, and for 2, which is 0b01100001. This I have not figured out how to do, and is what I would like assistance with. How can I do this in a small number of operations suitable for x86 SIMD?
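The first operation (modifying a trit) can be sketched in scalar form as below. The names `kPow3` and `add_to_trit` are hypothetical, and the sketch assumes the caller keeps the affected trit within 0..2, so no carry or borrow into neighbouring trits ever occurs.

```cpp
#include <cstdint>

// Powers of three 3^0 .. 3^7. A SIMD version would keep these in a register
// and select one with a shuffle, as described in the text.
constexpr std::uint16_t kPow3[8] = {1, 3, 9, 27, 81, 243, 729, 2187};

// Hypothetical helper: add `delta` (which may be negative) to the trit at
// position `pos`. Assumes the resulting trit stays in [0, 2]; carries are
// not handled.
constexpr std::uint16_t add_to_trit(std::uint16_t packed, int pos, int delta) {
    return static_cast<std::uint16_t>(packed + delta * kPow3[pos]);
}
```

For the running example, `add_to_trit(4244, 2, 2)` adds 0t200 = 18 and yields 4262, which is 0t12211212.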

Thank you!

Edit 11/18/20: To give an example of an approach I consider too slow: we can iteratively find the value mod 3 and divide by 3 to pull trits off the least-significant end of the representation, then assemble the mask that way. C++ snippet:

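uint32_t trits = <something>;  // packed value, expected to be in [0, 6561)
uint8_t mask0 = 0, mask1 = 0, mask2 = 0;
for (uint8_t shift = 0; shift < 8; ++shift) {
  // The least-significant trit is the value mod 3.
  const uint32_t remainder = trits % 3;
  // Set bit `shift` in whichever mask matches this trit's value.
  mask0 |= (!remainder) << shift;
  mask1 |= (remainder == 1) << shift;
  mask2 |= (remainder == 2) << shift;
  // Shift right by one trit.
  trits /= 3;
}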

When actually writing this in a SIMD language, we would use the standard multiply-and-shift trick for division by a constant. But you can see it's linear in the number of trits, and has a lot of ops per iteration. We could code-golf this down a bit, but I think it is fundamentally the wrong approach. It should ideally be possible to do something in parallel for each trit... but I don't see it.
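To make the multiply-and-shift remark concrete, here is a hedged scalar sketch of the same loop with the division replaced by a multiplication. The names `TritMasks`, `div3`, and `extract_masks` are made up for this example. The constant 43691 is ceil(2^17 / 3), which gives an exact quotient for inputs below 65536; the product also stays below 2^32 in that range.

```cpp
#include <cstdint>

struct TritMasks {
    std::uint8_t mask0, mask1, mask2;
};

// x / 3 for x < 65536, computed as (x * ceil(2^17 / 3)) >> 17.
constexpr std::uint32_t div3(std::uint32_t x) {
    return (x * 43691u) >> 17;
}

// Same linear-time approach as the loop above; still one iteration per trit.
constexpr TritMasks extract_masks(std::uint32_t trits) {
    TritMasks m{0, 0, 0};
    for (int shift = 0; shift < 8; ++shift) {
        const std::uint32_t quotient = div3(trits);
        const std::uint32_t remainder = trits - 3 * quotient;  // trits % 3
        m.mask0 |= static_cast<std::uint8_t>((remainder == 0) << shift);
        m.mask1 |= static_cast<std::uint8_t>((remainder == 1) << shift);
        m.mask2 |= static_cast<std::uint8_t>((remainder == 2) << shift);
        trits = quotient;
    }
    return m;
}
```

For the example value 4244 this produces the masks 0b00000100, 0b10011010, and 0b01100001 for the values 0, 1, and 2 respectively. It removes the hardware divide but not the dependency chain across iterations, which is the actual complaint.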

Edit 11/20/20: I've made a halfhearted effort to apply Aha to this problem without success. Maybe an interesting subproblem to solve instead is this: is there a short sequence of bitwise ops, under the same constraints as above, that acts as a 'ternary bitwise AND'? That is, an op that compares two encoded numbers in tritspace and returns a bitmask that is 1 where the corresponding trits are equal and 0 otherwise? That would be a primitive from which we could build up the ops needed. We have left and right shift in tritspace (just multiply or divide by 3), and we have +/- a value. So what we are missing is the ability to test whether trits hold particular values...
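To pin down what that primitive should compute, here is a slow reference definition that could serve as a test oracle for any candidate bit trick. It is not the short sequence being asked for, and the name `trit_equal_mask` is invented for this sketch.

```cpp
#include <cstdint>

// Reference (slow) definition of the desired primitive: bit i of the result
// is 1 when trit i of `a` equals trit i of `b`, and 0 otherwise.
constexpr std::uint8_t trit_equal_mask(std::uint32_t a, std::uint32_t b) {
    std::uint8_t mask = 0;
    for (int i = 0; i < 8; ++i) {
        mask |= static_cast<std::uint8_t>(((a % 3) == (b % 3)) << i);
        a /= 3;
        b /= 3;
    }
    return mask;
}
```

Comparing against the constants 0 (0t00000000), 3280 (0t11111111), and 6560 (0t22222222) then gives the three masks from the question: for 4244 these are 0b00000100, 0b10011010, and 0b01100001.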

Symbolism answered 19/11, 2020 at 2:39 Comment(7)
If you have your number in binary, but need the base 3 digits, AFAIK there's no more efficient way than repeated division and mod (using a multiplicative inverse, of course). Same as when you need the decimal digits of a binary integer. You're probably better off using BCD, err BCT I guess you'd call it, with 2-bit fields for each trit, even if it means lower data density (like 2 vectors to hold all your data, or keeping the 3 bits of metadata(?) in a separate uint8_t array, like array of structs style. You can pmovzxbw to load from that 8-bit array and line up metadata with trit elements) - Doan
To go from 8 trits to 8 quaternary digits, perhaps a uint16_t three2four[6561] pre-computed array? From 4 quaternary digits to 4 trits, maybe 2 stages of unsigned uint8_t four2three[256]? - Tafia
Sounds like you have to make a choice: the convenient representation on 16 bits, or the compressed representation on 13 bits. Now you are asking for the 13-bit representation but without the inconvenience! - Aluminiferous
You can pack 5 trits into 8 bits (3^5 = 243 < 256). But you still need to use a division to extract the trits. You can use a smaller lookup table than uint16_t three2four[6561], though. - Trinitarianism
This paper may be interesting, though they focus on trits with value set {-1, 0, 1} rather than {0, 1, 2}. - Attachment
If you want to SIMDify your algorithm, you could broadcast the input to 8 lanes, divide element k by 3^k, and calculate the remainder mod 3 of that (all using inverse multiplication, of course). There are probably more elegant solutions, though. - Schreib
PeterCordes: Unfortunately, this is a field in a 32-bit lane, where these three bits are pushing it to 35 bits. I agree that this can be stored separately. chux-ReinstateMonica: This works, but is not vectorizable. Stef: That is in fact what I am hoping to avoid. phuclv: Acknowledged. MarkDickinson: Indeed this paper is very interesting. They are focused on constructing an arbitrary mapping between bitspace and tritspace, so it loses all the properties that make it usable in tritspace, though :( chtz: Yes; but this is homeomorphic to doing a slow op in parallel in SIMD. - Symbolism
