Packing 4 Integers as ONE BYTE?
Asked Answered
D

5

17

I have four integers {a, b, c, d} that can have the following range of values:

a - {0 or 1} (1 bit)

b - {0 or 1} (1 bit)

c - {0, 1, 2, ..., 7} (3 bits)

d - {0, 1, 2, ..., 7} (3 bits)

at first, I would like to pack them into a one byte that can be then written to a binary file. later, I would like to unpack that one byte and get from it to a tuple of the form (a, b, c, d).

I know how to read/write a byte to a binary file in python. But how do I do the bits packing/unpacking?

Derickderide answered 14/3, 2011 at 17:53 Comment(1)
If you need to do this a lot, check out construct.wikispaces.com and bitbucket.org/haypo/hachoir/wiki/Home which both pack and unpack according to a declarative format definition.Bentlee
B
32

Use shift and bitwise OR, then convert to a character to get a "byte":

x = chr(a | (b << 1) | (c << 2) | (d << 5))

To unpack this byte again, first convert to an integer, then shift and use bitwise AND:

i = ord(x)
a = i & 1
b = (i >> 1) & 1
c = (i >> 2) & 7
d = (i >> 5) & 7

Explanation: Initially, you have

0000000a
0000000b
00000ccc
00000ddd

The left-shifts give you

0000000a
000000b0
000ccc00
ddd00000

The bitwise OR results in

dddcccba

Converting to a character will convert this to a single byte.

Unpacking: The four different right-shifts result in

dddcccba
0dddcccb
00dddccc
00000ddd

Masking (bitwise AND) with 1 (0b00000001) or 7 (0b00000111) results in

0000000a
0000000b
00000ccc
00000ddd

again.

Best answered 14/3, 2011 at 17:56 Comment(8)
Note that strings are only strings of bytes in Python 2; in Python 3 they're Unicode, so chr(x) will actually give you a 2- or 4-byte word. You'll want to construct an actual bytes object instead of a string. I'm not sure off-hand what the bytes equivalent of chr is, if there is one.Longford
(I think the correct thing to do is generally as sjr does: simply return it as an int, and handle changing it to a single byte at the point you actually use it.)Longford
I also favour ANDing before you do the bit shift in constructing the byte to ensure that the the data is of the expected width. Other wise you can get all sorts of nasty and subtle errors.Mosley
@Glenn: regarding your first comment: Very good point. It would probably be best to use bytearray.append() for the conversion from integer to byte and and item access on a bytearray for the conversion back. This would work in Python 2.6 or newer.Best
@Peter: You may as well just assert your requirements, eg. assert 0 <= a <= 1, a; if that's what you expect to receive, then it's better to throw an error if you get invalid data instead of silently returning something bogus.Longford
@Glen Maynard .. point taken. I suppose its a question of philosophy. In general the stuff I work on has a greater penalty for stopping than producing erroneous data under a specific set of circumstances - so that has coloured my thinking towards not throwing an error. Although "correct data in" is still the better option!Mosley
@GlennMaynard: Using assert for data validation is a bad idea -- if/when somebody runs your program with optimization turned on, the asserts (and your checks) are removed.Conway
@Ethan: Checking for invalid parameters (programming errors) is exactly the sort of thing you want removed by that optimization. If you want a check that's intended to be thrown and caught in production, then you don't want AssertionError anyway; you probably want ValueError.Longford
M
10
def encode(a, b, c, d):
  return a | b << 1 | c << 2 | d << 5

def decode(x):
  return x & 1, (x >> 1) & 1, (x >> 2) & 7, (x >> 5) & 7
Matless answered 14/3, 2011 at 18:2 Comment(2)
Note that the parentheses around multiple return values are typically omitted in Python: return x & 1, (x >> 1) & 1, (x >> 2) & 7, (x >> 5) & 7).Longford
Note that you have a hanging closing paranthesis. ;-)Theocritus
C
5

If you need to this kind of thing a lot then bit shifting can become tedious and error prone. There are third-party libraries that can help - I wrote one called bitstring:

To pack and convert to a byte:

x = bitstring.pack('2*uint:1, 2*uint:3', a, b, c, d).bytes

and to unpack:

a, b, c, d = bitstring.BitArray(bytes=x).unpack('2*uint:1, 2*uint:3')

This is probably overkill for your example, but it's helpful when things get more complicated.

Complexity answered 14/3, 2011 at 19:1 Comment(0)
S
4

Pretty simple. Mask (for range), shift them into place, and or them together.

packed = ((a & 1) << 7) | ((b & 1) << 6) | ((c & 7) << 3) | (d & 7)

a = (packed >> 7) & 1
b = (packed >> 6) & 1
c = (packed >> 3) & 7
d = packed & 7
Suit answered 14/3, 2011 at 17:59 Comment(0)
B
0

nested arithmetic form (NAF) :

p = 8 * (8 * (2 * a + b) + c) + d                  # "a" at MSB
p =               a + (b + (c + d * 8) * 2) * 2    # "a" at LSB

nested shifting form (NSF) :

p = ( ( ( ( (a << 1) + b) << 3) + c) << 3) + d     # "a" at MSB
p = ( ( ( ( (d << 3) + c) << 1) + b) << 1) + a     # "a" at LSB

Personally I think NAF is much cleaner (even if not the fastest), because it keeps all the digit groups on the same side and all the scaling factors on the other regardless of endian-ness, all while minimizing the layers of nesting required, and total arithmetic ops.

"NAF" and "NSF" aren't formal terms - it's just a colloquial way of describing them


So packing it would be chr(p) or just sprintf("%c", p)

as for decoding, using the "a" at MSB approach, it's basically splitting out a byte's octal codes back out to its components ("a" at LSB is the same decoding process in little-endian) :

     2a+b c  d   a  b
        \  \/     \/
======================
62      \0 76     00
63      \0 77     00
64      \1 00     01
65      \1 01     01

126     \1 76     01
127     \1 77     01
128     \2 00     10
129     \2 01     10

190     \2 76     10
191     \2 77     10
192     \3 00     11
193     \3 01     11

254     \3 76     11
255     \3 77     11
0       \0 00     00
1       \0 01     00

2a+b is really useful when it comes to encoding UTF-8 :

\0 ## - all the ASCII digits and arithmetic operators, most of the ASCII linguistic punctuation symbols, and nearly all the ASCII [[:cntrl:]] bytes (sans \177 DEL)

\1 ## - all the ASCII letters, plus misc [[:punct:]]


\2 ## - trailing/"continuation" bytes for Unicode code points U+0080 - U+10FFFFF

\3 ## - all the leading byte for Unicode code points U+0080 - U+10FFFFF

— sans \300 xC0 - \301 xC1 and \365 xF5 - \377 xFF

Bigeye answered 17/11, 2023 at 13:18 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.