pandas dataframe numpy integer-overflow dtype

Unreported integer-overflow in pandas math arithmetic's, pandas.eval() when using np.int16, np.int32

I need to process quite huge dataframe ~80M records, and essentially memory consumption is an issue. So columns with numerical data are shrank to possible minimum dtype, like np.int8, np.int16, np.int32. at some stage I need to compute new column, using some math from existing columns, and it needs capacity of int64. that where most pandas arithmetic constructions fail. I took me a while to track that the reason was integer overflow: in simple words, calculations like

   newCol = col16*col16, 
   newCol = col32*value16,

produce often incorrect result, despite newCol is created as int64. here are some simple explicit example: calculate newCol = A * 100000, which obviously for any A=aaaaa should compute to value like aaaaa00000.

however, see below:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1<<7,1<<15, size=(int(5))), columns=list('A'), dtype=np.int16)
df.eval('Q = A * 100000', inplace=True) # 1st naive approach from a head
df['W'] = df['A'] * 100000

# trying to use const c=int64() to force expr evaluator to use int64
c = np.int64(10000)
df.eval('R = @c * A', inplace=True)     

# trying to create new int64 col 1st and use it in calc:
df['T']=0    # this creates new col 'T' dtype=int64 filled with 0
df.eval('T = 100000 * A', inplace=True)

df['S']=0    
# trying to force int64  via 1st element 'S', which is int64
df['S'] = df['S'] + df['A'] * 100000 

# here finally this approach works, calculation is using int64 instructions:
df['X']=1   
df.eval('X = X * 100000 * A', inplace=True)

# just preformatting
pd.set_option('display.max_columns', None)
pd.options.display.width=222
df.index=[''] * len(df)

print(df)
df.info()

A           Q           W           R           T           S           X
   3396   339600000   339600000   339600000   339600000   339600000   339600000
  26508 -1644167296 -1644167296 -1644167296 -1644167296 -1644167296  2650800000
  27942 -1500767296 -1500767296 -1500767296 -1500767296 -1500767296  2794200000
   3441   344100000   344100000   344100000   344100000   344100000   344100000
  27880 -1506967296 -1506967296 -1506967296 -1506967296 -1506967296  2788000000
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries,  to 
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int16
 1   Q       5 non-null      int32
 2   W       5 non-null      int32
 3   R       5 non-null      int32
 4   T       5 non-null      int64
 5   S       5 non-null      int64
 6   X       5 non-null      int64
dtypes: int16(1), int32(3), int64(3)
memory usage: 230.0+ bytes

with 6 different ways to do trivial math, only col 'X' produced what I (and I'd guess most users) expected.

obvious way to solve would be, at first to convert source column's dtype to int64 (like doing it 'on the fly') like

df['x'] = df['A'].astype(np.int64) * 100000

but I don't see its a good solution in my case, as data already is to big to create an even bigger tmp copy, and performance is also will degrade with redundant conversions..

So my question: how to calculate it on the fly, not creating a copy of whole source data columns in int64 (don't have RAM for that), e.g. directly properly calc newCol64 = srcCol8 * srcCol16 * srcCol16 in 64 bits cpu.

is there explicit type translation syntaxis available for pandas.eval()? which could be done on the fly while computing result row by row?

Solution

If you multiply an array by a scalar value, the selected dtype will be that of the array. However if you do a math operation between two arrays like X (X * A), the array will be "upcast":

>>> df['A'] * c
0   -31392
1    14992
2   -14064
3     6944
4   -13904
Name: A, dtype: int16  # bad result, bad dtype

>>> df['A'] * [c]
0    196380000
1    288570000
2    317770000
3    168500000
4     76270000
Name: A, dtype: int64  # everything is ok

You can read the documentation about output type determination.

In your case, you can do:

c = np.array([10000])
df.eval('R = @c * A', inplace=True)

Output:

>>> df
       A          R
0  12399  123990000
1   6026   60260000
2  17133  171330000
3  30974  309740000
4  28216  282160000

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int16
 1   R       5 non-null      int64
dtypes: int16(1), int64(1)
memory usage: 178.0 bytes

EDIT

For some reason on my machine df['A'] * [100000] is calculated as int32, despite python is 64bit.. (and so result is wrong). Using c=np.array([100000],dtype=np.int64) instead of [100000] is works though.

Even if your system is 64bit, it seems the default int type is 32bit in your case (and I don't know why). To check the default dtype, you can use:

# My system
>>> np.int_
numpy.int64

# Your system
>>> np.int_
numpy.int32

In your case, if you do df['A'] * [10000], [10000] will be converted as np.array([10000], dtype=np.int32) that's why the result will be wrong. You have to be explicit by creating the array with np.array([10000], dtype=np.int64).