A.I, Data and Software Engineering

Determine the sizes of python objects

D

When working with data structures, you may want to analyse the space complexity of a particular object. For example, you want to compare the size of a sparse matrix and a dense matrix which represent the same data.

With python, “Just use sys.getsizeof” is not a complete answer. It does work for builtin objects directly, but it does not account for what those objects may contain, specifically, what types, such as custom objects, tuples, lists, dicts, and sets contain. They can contain instances each other, as well as numbers, strings and other objects.

A More Complete Answer

Using 64 bit Python 3.X from the Anaconda distribution, with sys.getsizeof, I have determined the minimum size of the following objects, and note that sets and dicts preallocate space so empty ones don’t grow again until after a set amount (which may vary by the implementation of the language):

python object size
Python object map

Python 3.X:

Empty
Bytes  type        scaling notes
28     int         +4 bytes about every 30 powers of 2
37     bytes       +1 byte per additional byte
49     str         +1-4 per additional character (depending on max width)
48     tuple       +8 per additional item
64     list        +8 for each additional
224    set         5th increases to 736; 21nd, 2272; 85th, 8416; 341, 32992
240    dict        6th increases to 368; 22nd, 1184; 43rd, 2280; 86th, 4704; 171st, 9320
136    func def    does not include default args and other attrs
1056   class def   no slots
56     class inst  has a __dict__ attr, same scaling as dict above
888    class def   with slots
16     __slots__   seems to store in mutable tuple-like structure
                   first slot grows to 48, and so on.

How do you interpret this? Well say you have a set with 10 items in it. If each item is 100 bytes each, how big is the whole data structure? The set is 736 itself because it has sized up one time to 736 bytes. Then you add the size of the items, so that’s 1736 bytes in total.

Some caveats for function and class definitions:

Note each class definition has a proxy __dict__ (48 bytes) structure for class attrs. Each slot has a descriptor (like a property) in the class definition.

Slotted instances start out with 48 bytes on their first element, and increase by 8 each additional. Only empty slotted objects have 16 bytes, and an instance with no data makes very little sense.

Also, each function definition has code objects, maybe docstrings, and other possible attributes, even a __dict__.

Python 2.7 analysis, confirmed with guppy.hpy and sys.getsizeof:

Bytes  type        empty + scaling notes
24     int         NA
28     long        NA
37     str         + 1 byte per additional character
52     unicode     + 4 bytes per additional character
56     tuple       + 8 bytes per additional item
72     list        + 32 for first, 8 for each additional
232    set         sixth item increases to 744; 22nd, 2280; 86th, 8424
280    dict        sixth item increases to 1048; 22nd, 3352; 86th, 12568 *
120    func def    does not include default args and other attrs
64     class inst  has a __dict__ attr, same scaling as dict above
16     __slots__   class with slots has no dict, seems to store in
                   mutable tuple-like structure.
904    class def   has a proxy __dict__ structure for class attrs
104    old class   makes sense, less stuff, has real dict though.

Note that dictionaries (but not sets) got a more compact representation in Python 3.6

I think 8 bytes per additional item to reference makes a lot of sense on a 64-bit machine. Those 8 bytes point to the place in memory the contained item is at. The 4 bytes are fixed width for Unicode in Python 2, if I recall correctly, but in Python 3, str becomes a unicode of width equal to the max width of the characters (And for more on slots, see this reference )

A More Complete sizing Function

We want a function that searches the elements in lists, tuples, sets, dicts, obj.__dict__‘s, and obj.__slots__, as well as other things we may not have yet thought of.

We want to rely on gc.get_referents to do this search because it works at the C level (making it very fast). The downside is that get_referents can return redundant members, so we need to ensure we don’t double count.

Classes, modules, and functions are singletons – they exist one time in memory. We’re not so interested in their size, as there’s not much we can do about them – they’re a part of the program. So we’ll avoid counting them if they happen to be referenced.

We’re going to use a blacklist of types so we don’t include the entire program in our size count.

import sys
from types import ModuleType, FunctionType
from gc import get_referents
# Custom objects know their class.
# Function objects seem to know way too much, including modules.
# Exclude modules as well.
BLACKLIST = type, ModuleType, FunctionType
def getsize(obj):
    """sum size of object & members."""
    if isinstance(obj, BLACKLIST):
        raise TypeError('getsize() does not take argument of type: '+ str(type(obj)))
    seen_ids = set()
    size = 0
    objects = [obj]
    while objects:
        need_referents = []
        for obj in objects:
            if not isinstance(obj, BLACKLIST) and id(obj) not in seen_ids:
                seen_ids.add(id(obj))
                size += sys.getsizeof(obj)
                need_referents.append(obj)
        objects = get_referents(*need_referents)
    return size

To contrast this with the following whitelisted function, most objects know how to traverse themselves for the purposes of garbage collection (which is approximately what we’re looking for when we want to know how expensive in memory certain objects are. This functionality is used by gc.get_referents.) However, this measure is going to be much more expansive in scope than we intended if we are not careful.

For example, functions know quite a lot about the modules they are created in.

Another point of contrast is that strings that are keys in dictionaries are usually interned so they are not duplicated. Checking for id(key) will also allow us to avoid counting duplicates, which we do in the next section. The blacklist solution skips counting keys that are strings altogether.

Whitelisted Types, Recursive visitor (old implementation)

To cover most of these types myself, instead of relying on the gc module, I wrote this recursive function to try to estimate the size of most Python objects, including most builtins, types in the collections module, and custom types (slotted and otherwise).

This sort of function gives much more fine-grained control over the types we’re going to count for memory usage, but has the danger of leaving types out:

import sys
from numbers import Number
from collections import Set, Mapping, deque
try: # Python 2
    zero_depth_bases = (basestring, Number, xrange, bytearray)
    iteritems = 'iteritems'
except NameError: # Python 3
    zero_depth_bases = (str, bytes, Number, range, bytearray)
    iteritems = 'items'
def getsize(obj_0):
    """Recursively iterate to sum size of object & members."""
    _seen_ids = set()
    def inner(obj):
        obj_id = id(obj)
        if obj_id in _seen_ids:
            return 0
        _seen_ids.add(obj_id)
        size = sys.getsizeof(obj)
        if isinstance(obj, zero_depth_bases):
            pass # bypass remaining control flow and return
        elif isinstance(obj, (tuple, list, Set, deque)):
            size += sum(inner(i) for i in obj)
        elif isinstance(obj, Mapping) or hasattr(obj, iteritems):
            size += sum(inner(k) + inner(v) for k, v in getattr(obj, iteritems)())
        # Check for custom object instances - may subclass above too
        if hasattr(obj, '__dict__'):
            size += inner(vars(obj))
        if hasattr(obj, '__slots__'): # can have __slots__ with __dict__
            size += sum(inner(getattr(obj, s)) for s in obj.__slots__ if hasattr(obj, s))
        return size
    return inner(obj_0)

Test object sizes and results

>>> getsize(['a', tuple('bcd'), Foo()])
344
>>> getsize(Foo())
16
>>> getsize(tuple('bcd'))
194
>>> getsize(['a', tuple('bcd'), Foo(), {'foo': 'bar', 'baz': 'bar'}])
752
>>> getsize({'foo': 'bar', 'baz': 'bar'})
400
>>> getsize({})
280
>>> getsize({'foo':'bar'})
360
>>> getsize('foo')
40
>>> class Bar():
...     def baz():
...         pass
>>> getsize(Bar())
352
>>> getsize(Bar().__dict__)
280
>>> sys.getsizeof(Bar())
72
>>> getsize(Bar.__dict__)
872
>>> sys.getsizeof(Bar.__dict__)
280

Test with sparse and dense matrix of numpy.

# load coo_matrix from Scipy.sparse module
from scipy.sparse import coo_matrix
from scipy import sparse
# import numpy
import numpy as np
A_coo = coo_matrix(np.eye(4000, 5000))
A_dense = A_coo.todense()
print(getsize(A_dense.data))
#result 216
print(getsize(A_coo.data))
#result 32096
print(getsize(sparse.csr_matrix(A_dense).data))
#result 96

Conclusion

The provided function can calculate the size of most basic data types in python. Nevertheless, if you want a library, try the Pympler package’s asizeof module can do this.

Add comment

💬

A.I, Data and Software Engineering

PetaMinds focuses on developing the coolest topics in data science, A.I, and programming, and make them so digestible for everyone to learn and create amazing applications in a short time.

Categories