Short Integers in Python

Python allocates integers automatically based on the underlying system architecture. Unfortunately, I have a huge dataset which needs to be fully loaded into memory.

So, is there a way to force Python to use only 2 bytes for some integers (equivalent of C++ 'short')?

Sidesman answered 23/9, 2008 at 10:35 Comment(2)
If you're doing any sort of manipulation of this huge dataset, you'll probably want to use NumPy, which has support for a wide variety of numeric types, and efficient operations on arrays of them. – Cipango
Just a heads up: C++'s short is not necessarily 2 bytes wide; its width is implementation-dependent. – Eldwun

Nope. But you can use short integers in arrays:

from array import array
a = array("h") # h = signed short, H = unsigned short

As long as the value stays in that array it will be a short integer.
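
A quick usage sketch (the itemsize of "h" is usually 2 bytes, but that is platform-dependent):

from array import array

a = array("h")           # signed short storage
a.extend([1, -2, 300])   # values are stored internally as C shorts
print(a.itemsize)        # bytes per item, typically 2
print(a[1])              # reading an element creates a Python int on the fly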

Team answered 23/9, 2008 at 10:36 Comment(3)
A better and more complete answer than my own. :) – Rabid
So, is an array('h') with only one element the same as creating a short integer? – Sidesman
@Arnav: nope, that would be a PyObject plus a short integer. – Team

Thanks to Armin for pointing out the 'array' module. I also found the 'struct' module, which packs C-style structs into a string:

From the documentation (https://docs.python.org/library/struct.html):

>>> from struct import *
>>> pack('hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
>>> calcsize('hhl')
8
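
For the "huge dataset" case in the question, a whole list of shorts can also be packed in one call; a small sketch (the format string is built at runtime, the values are arbitrary):

>>> values = [1, 2, 3, 4, 5]
>>> data = pack('%dh' % len(values), *values)   # pack the whole list as C shorts
>>> len(data)                                   # 2 bytes per value, no per-object overhead
10
>>> unpack('%dh' % (len(data) // 2), data)
(1, 2, 3, 4, 5)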
Sidesman answered 23/9, 2008 at 11:34 Comment(0)

You can use NumPy's fixed-width integer types, such as np.int8 or np.int16.
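
For instance, a short sketch (assuming NumPy is installed; int16 is 2 bytes per element, like a C short):

import numpy as np

a = np.array([1, 2, 3, 4], dtype=np.int16)  # 2 bytes per element
print(a.itemsize)                           # 2
print(a.nbytes)                             # 8 bytes of data for 4 elements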

Bolte answered 12/2, 2020 at 14:43 Comment(1)
My name is numpy – with an umpy. All other data packages allow me to bump thee. – Armallas

Armin's suggestion of the array module is probably best. Two possible alternatives:

  • You can create an extension module yourself that provides the data structure that you're after. If it's really just something like a collection of shorts, then that's pretty simple to do.
  • You can cheat and manipulate bits, so that you're storing one number in the lower half of the Python int, and another one in the upper half. You'd write some utility functions to convert to/from these within your data structure (see the sketch just after this list). Ugly, but it can be made to work.
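
A rough sketch of that second option, assuming two unsigned 32-bit values packed into the low and high halves of one Python int (pack_pair/unpack_pair are made-up names for illustration):

def pack_pair(low, high):
    # store `low` in the lower 32 bits and `high` in the upper 32 bits
    return (low & 0xFFFFFFFF) | ((high & 0xFFFFFFFF) << 32)

def unpack_pair(packed):
    return packed & 0xFFFFFFFF, (packed >> 32) & 0xFFFFFFFF

packed = pack_pair(123, 456)
print(unpack_pair(packed))  # (123, 456)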

It's also worth realising that a Python integer object is not 4 bytes - there is additional overhead. So if you have a really large number of shorts, then you can save more than two bytes per number by using a C short in some way (e.g. the array module).

I had to keep a large set of integers in memory a while ago, and a dictionary with integer keys and values was too large (I had 1GB available for the data structure IIRC). I switched to using an IIBTree (from ZODB) and managed to fit it. (The ints in an IIBTree are real C ints, not Python integers, and I hacked up an automatic switch to an IOBTree when the number was larger than 32 bits.)

Ludmilla answered 23/9, 2008 at 11:35 Comment(2)
Can I use IIBTree without installing all of Zope? Where do I get it? What's an IOBTree? – Monogenetic
Just install ZODB (pypi.python.org/pypi/ZODB3/3.8.0). An IOBTree is a BTree that has integer keys (the I) and object values (the O). – Ludmilla

You can also pack multiple integers, of any size, into a single large Python integer.

For example, as shown below, in Python 3 on a 64-bit x86 system, storing 1024 bits this way takes 164 bytes of memory. That means on average one byte stores around 6.24 bits. With even larger integers you get an even higher storage density, for example around 7.50 bits per byte with an integer that is 2**20 bits wide.

Obviously you will need some wrapper logic to access the individual short numbers stored inside the larger integer, which is easy to implement (a sketch follows the measurements below).

One issue with this approach is that data access slows down because of the large-integer operations involved.

If you access a big batch of consecutively stored integers at once, minimizing the number of large-integer operations, then the slower access won't be an issue.

I guess using numpy will be the easier approach.

>>> import sys
>>> a = 2**1024
>>> sys.getsizeof(a)
164
>>> 1024/164
6.2439024390243905

>>> a = 2**(2**20)
>>> sys.getsizeof(a)
139836
>>> 2**20 / 139836
7.49861266054521
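
A minimal sketch of that wrapper logic (get_short/set_short are made-up helper names), assuming unsigned 16-bit fields packed into one big int:

def get_short(big, index):
    # return the unsigned 16-bit field stored at position `index`
    return (big >> (16 * index)) & 0xFFFF

def set_short(big, index, value):
    # return a new big int with the field at `index` replaced by `value`
    shift = 16 * index
    cleared = big & ~(0xFFFF << shift)           # blank the old field
    return cleared | ((value & 0xFFFF) << shift)

big = 0
big = set_short(big, 0, 1000)
big = set_short(big, 1, 2000)
print(get_short(big, 0), get_short(big, 1))      # 1000 2000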
Annalee answered 5/1, 2020 at 21:25 Comment(0)

Using a bytearray in Python, which is basically a C unsigned char array under the hood, is a better solution than using large integers. There is no overhead for manipulating a byte array, and it has much less storage overhead compared to large integers. It's possible to get a storage density of 7.99+ bits per byte with bytearrays.

>>> import sys
>>> a = bytearray(2**32)
>>> sys.getsizeof(a)
4294967353
>>> 8 * 2**32 / 4294967353
7.999999893829228
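
To actually store 2-byte integers in such a buffer, one option (a sketch, not from the answer itself) is to cast a memoryview of the bytearray to C shorts:

buf = bytearray(10)                 # room for five 2-byte values
shorts = memoryview(buf).cast('h')  # view the same bytes as signed C shorts
shorts[0] = -12345
shorts[4] = 32000
print(shorts[0], shorts[4])         # -12345 32000
print(len(buf))                     # still only 10 bytes of storage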
Annalee answered 6/1, 2020 at 20:14 Comment(0)

You can treat a single int as a bunch of smaller ints and access specific groups of bits within it:

n = 4532  # 0b1000110110100
mask = 0b000011110000  # We want to access the middle 4 bits
mid = (n & mask) >> 4  # Keep only the masked bits and shift them back down

For putting data in, first use the inverse of the mask to blank out the part of the int you want to overwrite, then shift the new data into position and 'or' them together:

n = (n & ~mask) | ((yourvalue & 0b1111) << 4)

The downside is that you have to keep track of where each value lives within the integer yourself; you are effectively managing the memory layout by hand.

Sortie answered 10/7 at 21:10 Comment(0)
