Tags: python, pandas, numpy, memory, precision

What role do the min and max values play in reducing memory usage?


I am studying the code from this GitHub repository: Intrusion Detection (CIC-IDS2017)

Here are the code and the output the authors use to reduce memory usage, but I don't understand why the author downcasts based on each column's maximum and minimum values.

import numpy as np

# 'data' is the CIC-IDS2017 DataFrame loaded earlier
old_memory_usage = data.memory_usage().sum() / 1024 ** 2
print(f'Initial memory usage: {old_memory_usage:.2f} MB')
for col in data.columns:
    col_type = data[col].dtype
    if col_type != object:
        c_min = data[col].min()  # <-- retrieve c_min
        c_max = data[col].max()
        # Downcasting float64 to float32
        if str(col_type).find('float') >= 0 and c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
            data[col] = data[col].astype(np.float32)

        # Downcasting int64 to int32
        elif str(col_type).find('int') >= 0 and c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
            data[col] = data[col].astype(np.int32)

new_memory_usage = data.memory_usage().sum() / 1024 ** 2
print(f"Final memory usage: {new_memory_usage:.2f} MB")

reference: Intrusion Detection (CIC-IDS2017)

I have tried printing c_min and np.finfo(np.float32).min to work out their roles. Why is there a number (np.finfo(np.float32).min) smaller than the minimum value (c_min) of every column in my dataframe?

Format:

column names ; c_min ; np.finfo(np.float32).min

Initial memory usage: 798.63 MB
Fwd Packet Length Mean 0.0 -3.4028235e+38
Fwd Packet Length Std 0.0 -3.4028235e+38
Bwd Packet Length Mean 0.0 -3.4028235e+38
Bwd Packet Length Std 0.0 -3.4028235e+38
Flow Bytes/s -261000000.0 -3.4028235e+38
Flow Packets/s -2000000.0 -3.4028235e+38
Flow IAT Mean -13.0 -3.4028235e+38
Flow IAT Std 0.0 -3.4028235e+38
Fwd IAT Mean 0.0 -3.4028235e+38
Fwd IAT Std 0.0 -3.4028235e+38
Bwd IAT Mean 0.0 -3.4028235e+38
Bwd IAT Std 0.0 -3.4028235e+38
Fwd Packets/s 0.0 -3.4028235e+38
Bwd Packets/s 0.0 -3.4028235e+38
Packet Length Mean 0.0 -3.4028235e+38
Packet Length Std 0.0 -3.4028235e+38
Packet Length Variance 0.0 -3.4028235e+38
Average Packet Size 0.0 -3.4028235e+38
Avg Fwd Segment Size 0.0 -3.4028235e+38
Avg Bwd Segment Size 0.0 -3.4028235e+38
Active Mean 0.0 -3.4028235e+38
Active Std 0.0 -3.4028235e+38
Idle Mean 0.0 -3.4028235e+38
Idle Std 0.0 -3.4028235e+38
Final memory usage: 798.63 MB

I also searched for what np.finfo() is. The official documentation introduces it as: "Machine limits for floating point types".
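
Printing the limits directly gives (exact formatting may vary with the NumPy version):

import numpy as np

info = np.finfo(np.float32)
print(info.min)  # -3.4028235e+38, the most negative value float32 can represent
print(info.max)  # 3.4028235e+38, the largest value float32 can represent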


Solution

  • A small toy example:

    import numpy as np
    import pandas as pd
    
    # A value we need float64 for: twice the float32 maximum
    max_32 = np.float64(np.finfo(np.float32).max)
    needs_64 = max_32 * 2
    
    # Downcasting would be problematic
    print(needs_64.astype(np.float32))
    # Output:
    # <stdin>:1: RuntimeWarning: overflow encountered in cast
    # inf
    

    If you have a dataframe with such large values, you cannot cast everything to float32. The script you posted makes sure that you only cast the columns that do not produce inf values.

    # Dataframe with two float64 columns
    df = pd.DataFrame([max_32, needs_64]).T
    print(df.memory_usage().sum()) # 148
    
    # Cast all columns to float32; you lose some values
    df_float32 = df.astype(np.float32)
    print(df_float32)
    #              0    1
    #0  3.402823e+38  inf
    

    Note that all values in a column share the same data type, which you can check with:

    df.dtypes
    0    float64
    1    float64
    

    So you should only cast to float32 if the column-wise min/max values c_min/c_max lie within the float32 range:

    df_compressed = df.copy()
    # Cast only the columns that are safe, as your posted code does, i.e. update:
    df_compressed[0] = df_compressed[0].astype(np.float32)
    
    print(df_compressed.memory_usage().sum())
    # 144
    
    print(df_compressed.dtypes)
    # 0    float32
    # 1    float64
    
    # No inf value present
    print(df_compressed)
    #              0             1
    # 0  3.402823e+38  6.805647e+38
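
    The same check matters for the integer branch of your script: np.iinfo gives the representable range of each integer dtype, and casting an out-of-range value to int32 does not produce inf but silently wraps around. A minimal sketch (the wrapped result assumes NumPy's usual two's-complement cast behaviour):

    import numpy as np

    print(np.iinfo(np.int32).max)    # 2147483647
    too_big = np.array([2**31], dtype=np.int64)  # one past the int32 maximum
    print(too_big.astype(np.int32))  # [-2147483648] -- silently wrapped, data corrupted

    As an alternative to hand-written min/max checks, pd.to_numeric(column, downcast='float') (or downcast='integer') lets pandas pick the smallest dtype that can hold all of a column's values.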