How to determine the number of interned strings in Python 2.7.5?

Asked 14/10, 2016 at 9:3 Answered 14/11, 2016 at 19:30

In an earlier version of Python (I don't remember which), calling gc.get_referrers on an arbitrary interned string could be used to obtain a reference to the interned dict, which could then be queried for its length.

This no longer works in Python 2.7.5: gc.get_referrers(...) no longer includes the interned dict in the list it returns.

Is there any other way, in Python 2.7.5, to determine the number of interned strings? If so, how?

Elastin answered 14/10, 2016 at 9:3 Comment(6)

Why do you care? What are you trying to accomplish with such a low-level version-specific hack? Oh yeah, 2.7.12 is the current version, so why do you need this level of detail on a release that's over 3 years old? I don't mean to be hostile, but I can't fathom why this would ever matter. – Charmion 14/10, 2016 at 9:9

(a) I care, because I'm interested in understanding the memory usage of our Python processes, and this is one additional data point. (b) I'm interested in Python 2.7.5 because that's the version that we use in our product, though I suspect the answer would be the same in Python 2.7.12. – Elastin 14/10, 2016 at 9:29

Thanks for the answer. I've never taken the time to investigate the size of the interned dict, since the (non-literal) strings in my applications have always been of more consequence, so ensuring I only have one copy of each of those strings has been where I've spent my time. As a result, I'm still curious as to what your goal is - if you have the information you're asking for, how would you use it? – Charmion 14/10, 2016 at 9:45

It's true that the specific piece of data I've asked for here (the number of interned strings) probably isn't that helpful on its own, but it's somewhere to start. What would be more interesting are: the total size of the interned strings; the size of the interned dict itself; the number (and size) of interned strings that are referred to from nowhere else; the number (and size) of interned strings that are referred to from only one other place. Together, these help answer the question: are we wasting significant amounts of memory by interning strings unnecessarily? – Elastin 14/10, 2016 at 11:0

The docs ( docs.python.org/2/library/… ) say that interned strings are not immortal (since 2.3), so there should be no interned strings without at least one outside reference to keep them alive. – Charmion 14/10, 2016 at 22:54

Interning strings doesn't prolong their lifetime, so you're very unlikely to waste substantial amounts of space by overaggressive interning. – Lobell 10/11, 2016 at 21:58

You can sort of do this, but all options are messy and full of caveats to the point of near-uselessness, so first, let's consider whether you really want to.

Interning a string doesn't prolong its lifetime. You don't have to worry about the interned dict growing forever, full of strings you don't need. Thus, string interning is unlikely to be an actual memory problem, and learning how many strings have been interned might be pretty useless.

If you still want to do this, let's go through your options.

The Right Way would probably be to use your own interning implementation... except that Python's lackluster weak reference support doesn't let you create weak references to strings. That means that if you try this approach, you're stuck either passing around your own weak-referenceable string wrappers or keeping interned strings alive forever. Both options are terrible.
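To make the wrapper variant concrete, here is a minimal sketch, not a recommendation. The names StrBox, intern_boxed and interned_count are hypothetical. It only counts strings interned through this pool, not the interpreter's own interned strings, and it relies on CPython's reference counting to drop entries promptly.

```python
import weakref


class StrBox(object):
    """Hypothetical wrapper: plain str objects cannot be weakly referenced,
    but an ordinary object holding the string can."""
    __slots__ = ("value", "__weakref__")

    def __init__(self, value):
        self.value = value


# Maps each string to its wrapper. An entry disappears automatically once
# nothing outside the pool refers to the wrapper any more.
_pool = weakref.WeakValueDictionary()


def intern_boxed(s):
    """Return the canonical StrBox for s, creating it on first use."""
    box = _pool.get(s)
    if box is None:
        box = StrBox(s)
        _pool[s] = box
    return box


def interned_count():
    """Number of strings currently alive in this pool."""
    return len(_pool)
```

The cost is visible right away: every caller has to pass StrBox objects around and unwrap .value, which is the "terrible" part mentioned above.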

There is actually a function that prints the information you're asking about... but it also de-interns everything. Its existence is an implementation detail, and it's only accessible through the C API, so we'll need to use ctypes.pythonapi to get at it.

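import ctypes

# Private CPython function; not part of the public C API.
_Py_ReleaseInternedStrings = ctypes.pythonapi._Py_ReleaseInternedStrings

# It takes no arguments and returns nothing.
_Py_ReleaseInternedStrings.argtypes = ()
_Py_ReleaseInternedStrings.restype = None

# Prints the report, then de-interns every string.
_Py_ReleaseInternedStrings()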

Output:

releasing 3461 interned strings
total size of all interned strings: 33685/0 mortal/immortal

The total sizes listed are sums of string lengths, so they don't include object headers or null terminators.
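If you capture that report, a small parser can turn it into numbers. The sketch below assumes the two-line format shown in the sample output above; parse_release_report is a hypothetical helper name. As far as I can tell the report is written by C code to the process's standard error rather than through sys.stderr, so capturing it probably means redirecting the file descriptor or running the call in a child process. That part is left out here.

```python
import re

# Patterns based on the sample output above; the exact format is an
# implementation detail and may differ between builds.
_COUNT_RE = re.compile(r"releasing (\d+) interned strings")
_SIZE_RE = re.compile(
    r"total size of all interned strings: (\d+)/(\d+) mortal/immortal"
)


def parse_release_report(text):
    """Extract the string count and the summed lengths from the report."""
    count = _COUNT_RE.search(text)
    sizes = _SIZE_RE.search(text)
    if count is None or sizes is None:
        raise ValueError("unrecognized report format")
    return {
        "count": int(count.group(1)),
        "mortal_size": int(sizes.group(1)),
        "immortal_size": int(sizes.group(2)),
    }


report = (
    "releasing 3461 interned strings\n"
    "total size of all interned strings: 33685/0 mortal/immortal\n"
)
stats = parse_release_report(report)
```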

You're probably not happy about having to release all interned strings every time you want to check how many there were. Unfortunately, Python doesn't expose the interned dict, even through the C API or through GC hooks. What else could you try? Well, moving on to even crazier options, there's the debugger.

ecatmur posted a crazy hack launching a GDB process in unattended mode and using a conditional breakpoint to get at errnomap, a very similar dict to the interned dict you'd like to access. This could be adapted to access the interned dict instead. It would be highly non-portable and extremely difficult to maintain.

Launching a debugger is also a terrible option. What else could you try? Well, you could always build your own custom build of Python. Download the source from python.org, add

PyObject *
AwfulHackToGetTheInternedDict(void)
{
    if (interned == NULL) {
        // No interned dict yet.
        Py_RETURN_NONE;
    }
    Py_INCREF(interned);
    return interned;
}

to Objects/stringobject.c, build, and install. You'll probably want to use a virtualenv to keep this separate from your normal Python interpreter. With this awful hack in place, you can do

import ctypes

AwfulHackToGetTheInternedDict = ctypes.pythonapi.AwfulHackToGetTheInternedDict

AwfulHackToGetTheInternedDict.argtypes = ()
AwfulHackToGetTheInternedDict.restype = ctypes.py_object

interned = AwfulHackToGetTheInternedDict()

to get the dict of all interned strings.
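Once you have that dict, the numbers asked about in the question are a few lines away. This is a rough sketch: summarize_interned is a hypothetical helper, it works on any dict whose keys are strings, and sys.getsizeof only reports each object's own size as CPython sees it.

```python
import sys


def summarize_interned(interned):
    """Return basic statistics for a dict whose keys are strings."""
    strings = list(interned)  # snapshot the keys before measuring
    return {
        "count": len(strings),
        # Sum of lengths, comparable to the figure printed by
        # _Py_ReleaseInternedStrings.
        "total_chars": sum(len(s) for s in strings),
        # Per-object sizes, including object headers.
        "total_bytes": sum(sys.getsizeof(s) for s in strings),
        # Size of the dict structure itself, excluding the strings.
        "dict_bytes": sys.getsizeof(interned),
    }
```

With the custom build above, summarize_interned(interned)["count"] would be the number of interned strings at that moment.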

So, those are your options, or at least, the options I've thought of. I also tried forcing the GC to track a string and then interning it to make the interned dict visible through the GC, but calling PyObject_GC_Track on a string caused a fatal error, so that doesn't work.

Lobell answered 14/11, 2016 at 19:30 Comment(1)

Thanks for the very comprehensive answer. – Elastin 4/12, 2016 at 10:21

For your purposes, I think the real answer is to use a more robust memory profiling solution.

There are several options for doing this, such as the free memory_profiler package on PyPI.

Coolie answered 10/11, 2016 at 21:47 Comment(0)
