Data size in memory vs. on disk

How does the RAM required to store data in memory compare to the disk space required to store the same data in a file? Or is there no generalized correlation?

For example, say I simply have a billion floating point values. Stored in binary form, that'd be 4 billion bytes or 3.7GB on disk (not including headers and such). Then say I read those values into a list in Python... how much RAM should I expect that to require?

Chrissa asked 10/4, 2014 at 21:52 Comment(5)
More RAM! There is list overhead, among other things. If you’re worried, a) find out, and b) consider just storing the raw data in memory and unpacking it on the fly (it depends on what you’re doing with it).Brittain
Related: https://mcmap.net/q/18890/-what-are-the-advantages-of-numpy-over-regular-python-listsNotwithstanding
My first thought is that the user would have to wait quite a while for all that data to load into RAM.Generic
My first thought is why the hell wouldn't you use mmap?Plait
Both in RAM and on disk, you use exactly as many bytes as you ask for (though this asking may be hidden deep inside libraries), modulo metadata for the {filesystem, memory manager}, which is hard to compare or quantify and rarely significant.Epileptoid

Python Object Data Size

If the data is stored in a Python object, there will be some overhead attached to the actual data in memory.

This may be easily tested.
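For instance, a quick check with sys.getsizeof (a rough sketch; the exact byte counts depend on the Python version and platform, and getsizeof reports only the container itself, not the float objects it points to):

import sys
import array

doubles = [0.1 * i for i in range(1000)]           # a thousand doubles, 8000 bytes of raw data

print(sys.getsizeof(array.array('d', doubles)))    # close to the raw 8000 bytes
print(sys.getsizeof(doubles))                      # list header plus one 8-byte pointer per element
print(sys.getsizeof(tuple(doubles)))               # tuple header plus one pointer per element
print(sys.getsizeof(set(doubles)))                 # hash table, considerably larger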

[Plot: "The size of data in various forms", log-log memory size in bytes vs. number of doubles for the raw bytes, array, string, list, set, and tuple forms]

It is interesting to note how the overhead of the Python container is significant for small amounts of data but quickly becomes negligible.

Here is the IPython code used to generate the plot:

%matplotlib inline
import random
import sys
import array
import matplotlib.pyplot as plt

max_doubles = 10000

raw_size = []
array_size = []
string_size = []
list_size = []
set_size = []
tuple_size = []
size_range = range(max_doubles)

# measure the memory footprint of n doubles stored in various containers
for n in size_range:
    double_array = array.array('d', [random.random() for _ in range(n)])
    double_string = double_array.tobytes()  # tostring() was removed in Python 3.9
    double_list = double_array.tolist()
    double_set = set(double_list)
    double_tuple = tuple(double_list)

    raw_size.append(double_array.buffer_info()[1] * double_array.itemsize)  # element count * 8-byte itemsize = raw binary size
    array_size.append(sys.getsizeof(double_array))
    string_size.append(sys.getsizeof(double_string))
    list_size.append(sys.getsizeof(double_list))
    set_size.append(sys.getsizeof(double_set))
    tuple_size.append(sys.getsizeof(double_tuple))

# display
plt.figure(figsize=(10,8))
plt.title('The size of data in various forms', fontsize=20)
plt.xlabel('Data Size (double, 8 bytes)', fontsize=15)
plt.ylabel('Memory Size (bytes)', fontsize=15)
plt.loglog(
    size_range, raw_size, 
    size_range, array_size, 
    size_range, string_size,
    size_range, list_size,
    size_range, set_size,
    size_range, tuple_size
)
plt.legend(['Raw (Disk)', 'Array', 'String', 'List', 'Set', 'Tuple'], fontsize=15, loc='best')
Expellant answered 2/5, 2015 at 22:44 Comment(3)
This answer is not correct. The documentation for sys.getsizeof states that "Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to." So you only accounted for the memory allocated in the containers and did not consider the additional memory allocated for the number objects themselves.Aleutian
Do you have a recommendation for how to determine the full memory allocation? I'll redo the plot!Expellant
I think that you need to add len(double_list) * sys.getsizeof(1.0) to the reported memory size for list, set and tuple. There is probably some additional memory needed to manage the allocations, but I don't know how to measure it and it should be negligible.Aleutian

In a plain Python list, every double-precision number requires at least 32 bytes of memory, but only 8 of those bytes store the actual number; the rest is needed to support the dynamic nature of Python.

The float object used in CPython is defined in floatobject.h:

typedef struct {
    PyObject_HEAD
    double ob_fval;
} PyFloatObject;

where PyObject_HEAD is a macro that expands to the PyObject struct:

typedef struct _object {
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

Therefore, every floating point object in Python stores two pointer-sized fields (so each takes 8 bytes on a 64-bit architecture) besides the 8-byte double, giving 24 bytes of heap-allocated memory per number. This is confirmed by sys.getsizeof(1.0) == 24.
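A quick REPL check (on 64-bit CPython 3; exact values vary between versions) shows both the 24-byte float object and the 8 bytes that each list slot adds for its pointer:

>>> import sys
>>> sys.getsizeof(1.0)          # 16-byte PyObject_HEAD + 8-byte double
24
>>> sys.getsizeof([]), sys.getsizeof([1.0]), sys.getsizeof([1.0, 2.0])
(56, 64, 72)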

This means that a list of n doubles in Python takes at least 8*n bytes of memory just to store the pointers (PyObject*) to the number objects, and each number object requires an additional 24 bytes. To test this, try running the following lines in the Python REPL:

>>> import math
>>> list_of_doubles = [math.sin(x) for x in range(10*1000*1000)]

and see the memory usage of the Python interpreter (I got around 350 MB of allocated memory on my x86-64 computer). Note that if you tried:

>>> list_of_doubles = [1.0 for __ in range(10*1000*1000)]

you would obtain just about 80 MB, because all elements in the list refer to the same instance of the floating point number 1.0.
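As a rough cross-check of both figures (a back-of-the-envelope estimate assuming 64-bit CPython; the list comprehension also over-allocates its pointer array, which is why the measured number comes out somewhat higher than the estimate):

>>> import sys
>>> n = 10 * 1000 * 1000
>>> (n * 8 + n * sys.getsizeof(1.0)) / 10**6   # distinct floats: pointers plus 24-byte objects
320.0
>>> (n * 8) / 10**6                            # the same 1.0 repeated: pointers only
80.0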

Cosmism answered 28/7, 2016 at 12:4 Comment(0)
