pandas, dataframe, numpy, integer-overflow, dtype

Unreported integer overflow in pandas arithmetic and pandas.eval() when using np.int16, np.int32


I need to process a rather huge DataFrame (~80M records), and memory consumption is essentially the issue. So columns with numerical data are shrunk to the smallest possible dtype, like np.int8, np.int16, np.int32. At some stage I need to compute a new column using some math on existing columns, and it needs the capacity of int64. That is where most pandas arithmetic constructions fail. It took me a while to track down that the reason was integer overflow. In simple words, calculations like

   newCol = col16*col16, 
   newCol = col32*value16, 

often produce an incorrect result, even though newCol is created as int64. Here is a simple explicit example: calculate newCol = A * 100000, which for any A = aaaaa should obviously compute to a value like aaaaa00000.

However, see below:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1<<7,1<<15, size=(int(5))), columns=list('A'), dtype=np.int16)
df.eval('Q = A * 100000', inplace=True) # first naive approach, off the top of my head
df['W'] = df['A'] * 100000

# trying to use a constant c = np.int64() to force the expression evaluator to use int64
c = np.int64(100000)
df.eval('R = @c * A', inplace=True)

# trying to create a new int64 column first and use it in the calculation:
df['T']=0    # this creates a new column 'T' with dtype=int64, filled with 0
df.eval('T = 100000 * A', inplace=True)

df['S']=0    
# trying to force int64 via the first operand 'S', which is int64
df['S'] = df['S'] + df['A'] * 100000 

# finally, this approach works; the calculation is done in int64:
df['X']=1   
df.eval('X = X * 100000 * A', inplace=True)

# just preformatting
pd.set_option('display.max_columns', None)
pd.options.display.width=222
df.index=[''] * len(df)

print(df)
df.info()

      A           Q           W           R           T           S           X
   3396   339600000   339600000   339600000   339600000   339600000   339600000
  26508 -1644167296 -1644167296 -1644167296 -1644167296 -1644167296  2650800000
  27942 -1500767296 -1500767296 -1500767296 -1500767296 -1500767296  2794200000
   3441   344100000   344100000   344100000   344100000   344100000   344100000
  27880 -1506967296 -1506967296 -1506967296 -1506967296 -1506967296  2788000000
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries,  to 
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int16
 1   Q       5 non-null      int32
 2   W       5 non-null      int32
 3   R       5 non-null      int32
 4   T       5 non-null      int64
 5   S       5 non-null      int64
 6   X       5 non-null      int64
dtypes: int16(1), int32(3), int64(3)
memory usage: 230.0+ bytes

With six different ways to do this trivial math, only column 'X' produced what I (and I'd guess most users) expected.

The obvious way to solve this would be to first convert the source column's dtype to int64 ('on the fly'), like

df['x'] = df['A'].astype(np.int64) * 100000

but I don't see that as a good solution in my case: the data is already too big to create an even bigger temporary copy, and performance will also degrade with the redundant conversions.

So my question: how can I calculate this on the fly, without creating an int64 copy of the whole source columns (I don't have the RAM for that), i.e. directly and correctly compute newCol64 = srcCol8 * srcCol16 * srcCol16 on a 64-bit CPU?

Is there an explicit type-conversion syntax available for pandas.eval() that could be applied on the fly while computing the result row by row?


Solution

  • If you multiply an array by a scalar value, the resulting dtype will be that of the array. However, if you do a math operation between two arrays, as in the X case (X * A), the result will be "upcast":

    >>> df['A'] * c
    0   -31392
    1    14992
    2   -14064
    3     6944
    4   -13904
    Name: A, dtype: int16  # bad result, bad dtype
    
    >>> df['A'] * [c]
    0    196380000
    1    288570000
    2    317770000
    3    168500000
    4     76270000
    Name: A, dtype: int64  # everything is ok
    

    You can read the documentation about output type determination.
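    The promotion rules can be reproduced with plain NumPy (a minimal sketch; the array names are illustrative, and `b` is made int64 explicitly so the result does not depend on the platform's default integer):

```python
import numpy as np

a = np.array([30000], dtype=np.int16)
b = np.array([100000], dtype=np.int64)

wrapped = a * a   # int16 * int16 stays int16 and silently wraps around
upcast = a * b    # int16 * int64 is upcast to int64

print(wrapped.dtype, wrapped[0])  # int16 plus a wrapped-around garbage value
print(upcast.dtype, upcast[0])    # int64 3000000000 (the correct product)
```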

    In your case, you can do:

    c = np.array([10000])
    df.eval('R = @c * A', inplace=True)
    

    Output:

    >>> df
           A          R
    0  12399  123990000
    1   6026   60260000
    2  17133  171330000
    3  30974  309740000
    4  28216  282160000
    
    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 5 entries, 0 to 4
    Data columns (total 2 columns):
     #   Column  Non-Null Count  Dtype
    ---  ------  --------------  -----
     0   A       5 non-null      int16
     1   R       5 non-null      int64
    dtypes: int16(1), int64(1)
    memory usage: 178.0 bytes
    

    EDIT

    For some reason, on my machine df['A'] * [100000] is computed as int32 even though Python is 64-bit (and so the result is wrong). Using c = np.array([100000], dtype=np.int64) instead of [100000] works, though.

    Even if your system is 64-bit, it seems the default integer type is 32-bit in your case (this is typically the case on Windows, where NumPy's default integer matches the 32-bit C long). To check the default dtype, you can use:

    # My system
    >>> np.int_
    numpy.int64
    
    # Your system
    >>> np.int_
    numpy.int32
    

    In your case, if you do df['A'] * [10000], [10000] will be converted to np.array([10000], dtype=np.int32), which is why the result is wrong. You have to be explicit and create the array with np.array([10000], dtype=np.int64).
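    For the column-times-column case from the question (newCol64 = srcCol8 * srcCol16 * srcCol16), one option that avoids first materializing int64 copies of the source columns is the `dtype` argument of NumPy ufuncs: the ufunc then computes directly in the requested dtype, casting the inputs in small internal buffers. This is a sketch beyond the original answer, with made-up column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a8':  np.array([100, 120], dtype=np.int8),
    'b16': np.array([30000, 25000], dtype=np.int16),
    'c16': np.array([30000, 25000], dtype=np.int16),
})

# dtype=np.int64 makes the multiplication run in 64-bit from the start,
# so no full-size int64 copy of the inputs is created beforehand
tmp = np.multiply(df['b16'].to_numpy(), df['c16'].to_numpy(), dtype=np.int64)
df['new64'] = np.multiply(df['a8'].to_numpy(), tmp, dtype=np.int64)

print(df['new64'].tolist())  # [90000000000, 75000000000]
```

    Note that `tmp` is still a full-length int64 array (it is the intermediate result), but the int8/int16 source columns themselves are never copied wholesale to int64.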