I'm running into some trouble with memory management related to `bytes` in Python 3.2. In some cases the `ob_sval` buffer seems to contain memory that I cannot account for.
For a particular secure application I need to be able to ensure that memory is "zeroed" and returned to the OS as soon as possible after it is no longer being used. Since re-compiling Python isn't really an option, I'm writing a module that can be used with `LD_PRELOAD` to:
- Disable memory pooling by replacing `PyObject_Malloc` with `PyMem_Malloc`, `PyObject_Realloc` with `PyMem_Realloc`, and `PyObject_Free` with `PyMem_Free` (i.e., what you would get if you compiled without `WITH_PYMALLOC`). I don't really care whether the memory is pooled or not, but this seems to be the easiest approach.
- Wrap `malloc`, `realloc`, and `free` so as to track how much memory is requested and to `memset` everything to `0` when it is released.
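As a complement to zeroing in the `free` wrapper, a buffer can also be wiped in place from the Python side before the last reference is dropped. This is a minimal sketch, assuming CPython's `PyBytesObject` layout (the character data begins `getsizeof(b'') - 1` bytes into the object, the `-1` being the trailing NUL that `getsizeof` counts); `zero_bytes` is a hypothetical helper, not part of any library, and mutating a `bytes` object like this is only safe when nothing else references it:

```python
import ctypes
import os
import sys

def zero_bytes(b):
    # Hypothetical helper: overwrite a bytes object's internal buffer
    # in place. Only safe for a freshly created, unshared, unhashed
    # object (a cached hash would no longer match the zeroed payload).
    offset = sys.getsizeof(b'') - 1  # header size; -1 for the trailing NUL
    ctypes.memset(id(b) + offset, 0, len(b))

secret = os.urandom(16)  # fresh object, never interned or shared
zero_bytes(secret)
assert secret == b'\x00' * 16
```

This does not replace the allocator wrappers (it cannot catch copies made by `realloc` or slicing), but it narrows the window during which the plaintext sits in memory.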
At a cursory glance, this approach seems to work great:
>>> from ctypes import string_at
>>> from sys import getsizeof
>>> from binascii import hexlify
>>> a = b"Hello, World!"; addr = id(a); size = getsizeof(a)
>>> print(string_at(addr, size))
b'\x01\x00\x00\x00\xd4j\xb2x\r\x00\x00\x00<J\xf6\x0eHello, World!\x00'
>>> del a
>>> print(string_at(addr, size))
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x13\x00'
The errant `\x13` at the end is odd, but it doesn't come from my original value, so at first I assumed it was okay. I quickly found examples where things were not so good, though:
>>> a = b'Superkaliphragilisticexpialidocious'; addr = id(a); size = getsizeof(a)
>>> print(string_at(addr, size))
b'\x01\x00\x00\x00\xd4j\xb2x#\x00\x00\x00\x9cb;\xc2Superkaliphragilisticexpialidocious\x00'
>>> del a
>>> print(string_at(addr, size))
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00))\n\x00\x00ous\x00'
Here the last three bytes, `ous`, survived.
So, my question: What's going on with the leftover bytes for `bytes` objects, and why don't they get deleted when `del` is called on them?

I'm guessing that my approach is missing something similar to a `realloc`, but I can't see what that would be in `bytesobject.c`.
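For reference, the probes above lean on CPython's `bytes` layout: a fixed object header, the payload, and a trailing NUL, so `getsizeof` is a constant overhead plus exactly one byte per character. A quick sanity check:

```python
import sys

# A bytes object is header + payload + trailing NUL, so getsizeof
# grows by exactly one byte per character of payload:
header = sys.getsizeof(b'') - 1  # object header size; the -1 is the NUL
for n in (1, 13, 35):
    assert sys.getsizeof(b'x' * n) == header + n + 1
```

On the 32-bit build shown in the dumps above the header is 16 bytes, which is what the `[16:-1]` slice in the test harness strips off (along with the trailing NUL).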
I've attempted to quantify the number of 'leftover' bytes that remain after garbage collection and it appears to be predictable to some extent.
from collections import defaultdict
from ctypes import string_at
import gc
import os
from sys import getsizeof

def get_random_bytes(length=16):
    return os.urandom(length)

def test_different_bytes_lengths():
    rc = defaultdict(list)
    for ii in range(1, 101):
        while True:
            value = get_random_bytes(ii)
            if b'\x00' not in value:
                break
        check = [b for b in value]
        addr = id(value)
        size = getsizeof(value)
        del value
        gc.collect()
        garbage = string_at(addr, size)[16:-1]
        for jj in range(ii, 0, -1):
            if garbage.endswith(bytes(bytearray(check[-jj:]))):
                # for bytes of length ii, a tail of length jj survived
                rc[jj].append(ii)
                break
    return {k: len(v) for k, v in rc.items()}, dict(rc)
# The runs all look something like this (there is some variation):
# ({1: 2, 2: 2, 3: 81}, {1: [1, 13], 2: [2, 14], 3: [3, 4, 5, 6, 7, 8, 9, 10, 11, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 83, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]})
# That is:
# - One byte left over, twice (always when the original bytes object was of length 1 or 13; the first is likely because of the internal 'characters' list kept by Python)
# - Two bytes left over, twice (always when the original bytes object was of length 2 or 14)
# - Three bytes left over in most other cases (the exact lengths vary between runs but never include 12)
# For added fun, if I replace the get_random_bytes call with one that returns an encoded string of random alphanumerics, the results change slightly: lengths of 13 and 14 are now fully cleared too. My original test string was 13 bytes of encoded alphanumerics, of course!
Edit 1
I had originally expressed concern about the fact that if the `bytes` object is used in a function it doesn't get cleaned up at all:
>>> def hello_forever():
... a = b"Hello, World!"; addr = id(a); size = getsizeof(a)
... print(string_at(addr, size))
... del a
... print(string_at(addr, size))
... gc.collect()
... print(string_at(addr, size))
... return addr, size
...
>>> addr, size = hello_forever()
b'\x02\x00\x00\x00\xd4J0x\r\x00\x00\x00<J\xf6\x0eHello, World!\x00'
b'\x01\x00\x00\x00\xd4J0x\r\x00\x00\x00<J\xf6\x0eHello, World!\x00'
b'\x01\x00\x00\x00\xd4J0x\r\x00\x00\x00<J\xf6\x0eHello, World!\x00'
>>> print(string_at(addr, size))
b'\x01\x00\x00\x00\xd4J0x\r\x00\x00\x00<J\xf6\x0eHello, World!\x00'
It turns out that this is an artificial concern that isn't covered by my requirements. You can see the comments on this question for details, but the problem comes from the way the `hello_forever.__code__.co_consts` tuple retains a reference to `b'Hello, World!'` even after `a` is deleted from the locals.
In the real code, the "secure" values would be coming from an external source and would never be hard-coded and later deleted like this.
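The retention through `co_consts` is easy to confirm directly:

```python
def hello_forever():
    a = b"Hello, World!"
    del a  # removes the local name, not the compiled-in constant

hello_forever()
# The literal is still reachable through the function's code object:
assert b"Hello, World!" in hello_forever.__code__.co_consts
```

So `del` only drops the local binding; the literal itself lives as long as the function does.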
Edit 2
I had also expressed confusion over the behaviour of strings. It has been pointed out that they likely suffer the same problem as `bytes` with respect to hard-coding them in functions (i.e., an artifact of my test code). There are two other risks with them that I have not been able to demonstrate as a problem but will continue to investigate:

- String interning is done by Python at various points to speed up access. This shouldn't be a problem, since interned strings are supposed to be removed when the last reference is lost. If it proves to be a concern, it should be possible to replace `PyUnicode_InternInPlace` so that it doesn't do anything.
- Strings and other 'primitive' object types in Python often keep a 'free list' to make it faster to get memory for new objects. If this proves to be a problem, the `*_dealloc` methods in `Objects/*.c` can be replaced.
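On the interning point, `sys.intern` makes the sharing visible: two equal strings built at runtime come back as the very same object once interned, so that one object stays alive for as long as any interned reference to it exists:

```python
import sys

# Build equal strings at runtime; literals are avoided because the
# compiler may intern them on its own, hiding the effect:
p, q = "sec", "ret"
a = sys.intern(p + q)
b = sys.intern(p + q)
assert a is b  # one shared, interned object
```

This is why replacing `PyUnicode_InternInPlace` with a no-op is a plausible mitigation: without interning, each copy can be zeroed independently when its own last reference dies.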
I had also believed that I was seeing a problem with class instances not getting zeroed correctly, but I now believe that was an error on my part.
Thanks
Much thanks to @Dunes and @Kevin for pointing out the issues that obfuscated my original question. Those issues have been left in the "edit" sections above for reference.
Comments

- … `hello_forever.__code__.co_consts`. – Cremona
- … `_Py_Dealloc` or `Py_DECREF` macros to zero the memory after deallocation? As opposed to messing around with memory allocation. – Cremona
- … `bytes` and the fact that strings are often not being zeroed at all. @Kevin, are you referring to the same thing as @Dunes, or is there another kind of automatic interning happening? In real life, the actual strings would be coming from an external source (file or TCP), not hard-coded. – Eliason
- … `.intern()` method. Other Python implementations could intern any string at any time, unless you've specifically examined them and confirmed they don't. – Fahland
- … `PyUnicode_InternInPlace` to my `LD_PRELOAD` library might be worth exploring. – Eliason