I am learning the code from this GitHub repo: Intrusion Detection (CIC-IDS2017).
Here is the code, and its output, that the authors use to reduce memory usage. I don't understand why the author downcasts based on each column's maximum and minimum values.
import numpy as np

# data: the CIC-IDS2017 dataframe loaded earlier
old_memory_usage = data.memory_usage().sum() / 1024 ** 2
print(f'Initial memory usage: {old_memory_usage:.2f} MB')

for col in data.columns:
    col_type = data[col].dtype
    if col_type != object:
        c_min = data[col].min()  # <-- retrieve c_min
        c_max = data[col].max()
        # Downcasting float64 to float32
        if str(col_type).find('float') >= 0 and c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
            data[col] = data[col].astype(np.float32)
        # Downcasting int64 to int32
        elif str(col_type).find('int') >= 0 and c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
            data[col] = data[col].astype(np.int32)

new_memory_usage = data.memory_usage().sum() / 1024 ** 2
print(f"Final memory usage: {new_memory_usage:.2f} MB")
reference: Intrusion Detection (CIC-IDS2017)
I have tried printing c_min and np.finfo(np.float32).min to look for a pattern. Why do numbers exist (np.finfo(np.float32).min) that are smaller than the minimum value (c_min) I have in the dataframe column?
Format: column name ; c_min ; np.finfo(np.float32).min
Initial memory usage: 798.63 MB
Fwd Packet Length Mean 0.0 -3.4028235e+38
Fwd Packet Length Std 0.0 -3.4028235e+38
Bwd Packet Length Mean 0.0 -3.4028235e+38
Bwd Packet Length Std 0.0 -3.4028235e+38
Flow Bytes/s -261000000.0 -3.4028235e+38
Flow Packets/s -2000000.0 -3.4028235e+38
Flow IAT Mean -13.0 -3.4028235e+38
Flow IAT Std 0.0 -3.4028235e+38
Fwd IAT Mean 0.0 -3.4028235e+38
Fwd IAT Std 0.0 -3.4028235e+38
Bwd IAT Mean 0.0 -3.4028235e+38
Bwd IAT Std 0.0 -3.4028235e+38
Fwd Packets/s 0.0 -3.4028235e+38
Bwd Packets/s 0.0 -3.4028235e+38
Packet Length Mean 0.0 -3.4028235e+38
Packet Length Std 0.0 -3.4028235e+38
Packet Length Variance 0.0 -3.4028235e+38
Average Packet Size 0.0 -3.4028235e+38
Avg Fwd Segment Size 0.0 -3.4028235e+38
Avg Bwd Segment Size 0.0 -3.4028235e+38
Active Mean 0.0 -3.4028235e+38
Active Std 0.0 -3.4028235e+38
Idle Mean 0.0 -3.4028235e+38
Idle Std 0.0 -3.4028235e+38
Final memory usage: 798.63 MB
I have also searched for what np.finfo() is. The official documentation describes it as: "Machine limits for floating point types".
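For instance, the float32 machine limits can be printed directly (a quick sketch of my own):

```python
import numpy as np

# np.finfo reports the machine limits of a floating point type
info = np.finfo(np.float32)
print(info.min)  # -3.4028235e+38, the most negative representable float32
print(info.max)  # 3.4028235e+38, the largest representable float32
print(info.eps)  # 1.1920929e-07, smallest step away from 1.0
```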
A small toy example:
import numpy as np
import pandas as pd
# A value just above the float32 range, so float64 is needed to hold it
max_32 = np.finfo(np.float32).max
needs_64 = np.float64(max_32) + 1e32  # exceeds float32 max by more than half a float32 ULP
# Downcasting would be problematic
print(needs_64.astype(np.float32))
# Output:
# <stdin>:1: RuntimeWarning: overflow encountered in cast
# inf
If you have a dataframe with such large values, you cannot cast everything to float32. The script you posted makes sure that you cast only the columns that do not produce inf values.
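A minimal sketch of that guard on a toy dataframe (the column names here are mine, not from the original script): only the column whose min/max fit inside the float32 range gets downcast.

```python
import numpy as np
import pandas as pd

# One safe column and one whose values exceed the float32 range
df = pd.DataFrame({
    'safe': [0.0, 1.5, 2.5],
    'too_big': [0.0, 1e300, -1e300],  # far outside the float32 range
})

f32 = np.finfo(np.float32)
for col in df.columns:
    c_min, c_max = df[col].min(), df[col].max()
    if c_min > f32.min and c_max < f32.max:
        df[col] = df[col].astype(np.float32)  # guard passed: downcast

print(df.dtypes)
# safe       float32
# too_big    float64
# dtype: object
```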
# Dataframe with two float64 columns
df = pd.DataFrame([max_32, needs_64]).T
print(df.memory_usage().sum()) # 148
# Cast all columns to float32; you lose some values
df_float32 = df.astype(np.float32)
print(df_float32)
# 0 1
#0 3.402823e+38 inf
Note that all values in a column share the same data type, which you can check with:
df.dtypes
0 float64
1 float64
So you should only cast to float32 if your column-wise min/max values (c_min/c_max) are within the float32 min and max range:
df_compressed = df.copy()
# Cast only the columns that are safe, as your code does, i.e. update:
df_compressed[0] = df_compressed[0].astype(np.float32)
print(df_compressed.memory_usage().sum())
# 144
print(df_compressed.dtypes)
# 0 float32
# 1 float64
# No inf value present
print(df_compressed)
#             0             1
#0 3.402823e+38 3.402824e+38
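One caveat worth adding to the above: the min/max guard only prevents overflow to inf. Values that fit inside the float32 range can still lose precision, since float32 keeps only about 7 significant decimal digits:

```python
import numpy as np

# Well within the float32 range, so the guard would happily downcast it...
x = np.float64(123456789.123456789)
y = np.float32(x)

# ...but float32 carries only ~7 significant decimal digits
print(y == x)    # False: the cast rounded the value
print(float(y))  # 123456792.0
```

For measurement-style features like the flow statistics in CIC-IDS2017 this rounding is usually acceptable, but it is a trade-off the range check alone does not capture.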