pythonpandas

How does pandas.Series.nbytes work for strings? The result doesn't match my expectations


The help doc for pandas.Series.nbytes shows the following example:

s = pd.Series(['Ant', 'Bear', 'Cow'])
s  

0 Ant
1 Bear
2 Cow
dtype: object

s.nbytes

24
<< end example >>

How is that 24 bytes?
I tried looking at three different encodings, none of which seems to yield that total.

print(s.str.encode('utf-8').str.len().sum())
print(s.str.encode('utf-16').str.len().sum())
print(s.str.encode('ascii').str.len().sum())

10
26
10


Solution

  • nbytes does not measure the bytes needed to store the string data in any particular encoding such as UTF-8, UTF-16, or ASCII. It reports the number of bytes consumed in memory by the underlying array that backs the Series.

    With the object dtype, that array is a NumPy array of pointers to the Python string objects, not the string data itself.

    On a 64-bit system, each pointer (reference) takes 8 bytes.

    3 pointers × 8 bytes = 24 bytes, regardless of how long each string is.

    Link: nbytes source code

    Link: ndarray documentation
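
To check the pointer arithmetic yourself, here is a minimal sketch. It passes dtype=object explicitly, which is an assumption on my part so that the result does not depend on which string dtype your pandas version picks by default. It also calls memory_usage(deep=True), which is not part of the original answer, to show that the string objects themselves take additional memory beyond the pointers.

```python
import numpy as np
import pandas as pd

# Assumption: force the object dtype so the Series is backed by an
# array of pointers, matching the example in the question.
s = pd.Series(['Ant', 'Bear', 'Cow'], dtype=object)

# Size of one object pointer in the backing NumPy array
# (8 on a typical 64-bit build).
pointer_size = np.dtype(object).itemsize

# nbytes counts only the pointer array: one pointer per element.
shallow = s.nbytes

# deep=True additionally counts the Python string objects being pointed to.
deep = s.memory_usage(index=False, deep=True)

print('pointer size:', pointer_size)
print('nbytes:', shallow)
print('deep memory usage:', deep)
```

On a 64-bit build, nbytes should come out as 3 × 8 = 24. The deep figure is larger and varies by Python version, because each str object carries its own overhead in addition to its characters.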