Tags: python, pandas, plotly-python, xlsxwriter, float32

Converting Pandas dataframe to float32 changes value of low precision number (by adding incorrect high precision decimals)


I have a very large dataset with values that don't require a lot of decimal point precision. In one test scenario, my dataframe is 102 MB, and all columns have a float64 datatype.

I was hoping to reduce the memory usage, and potentially the output file sizes, by changing my pandas dataframe to hold float32 values. With this dataframe, I am creating several output files, including an .html file (via Plotly) and an .xlsx file (via XlsxWriter).

After adding one line casting my datatype to float32 via df = df.astype('float32'), I was surprised to find that some of my files were quite a bit larger than they were before. One .html file went from 30 MB to 44 MB. One .xlsx file went from 31 MB to ~39 MB. When I look at the data I stored, I see more (and inaccurate) digits after the decimal point.

Digging into this more, I am finding unexpected behavior in how Pandas downcasts to float32 - or maybe it's how various methods and functions represent a float32 datatype vs a float64.

Given a simple csv file:

59.11,59,59.86,59.86,59.0839945
60.28,59.7817845,59.75,59.75,59

A simple script:

import pandas as pd

df = pd.read_csv('float_test.csv', header=None)
s = df.iloc[0,:]

print(df.info())
print(df)
print(f's dtype: {s.dtype}')  # Show the datatype
print(s)  # Show the data as pandas prints it
print(f's to list: {s.to_list()}')  # Show the data converted to a list
s32 = s.astype('float32')
print(s32)
print(f's32 to list: {s32.to_list()}')  # Convert to float32 and print as a list

Looking at the output from this script, I am confused by what is happening. Each column of the dataframe is a float64. When I print the dataframe, it shows decimal values padded to match the most precise float in any column (to a max of 6 digits after the decimal point). Similarly, when I grab just the first row (it is then treated as a Series), it is printed with padding to 6 decimal places.

Now once I convert the series to float32, I see the values at index 2 and 3 change. Instead of 59.86 or 59.860000, they become 59.860001.

When I use to_list(), I find the original series has the correct values from the CSV file (with ".0" added to indicate a float). But the float32 series has .00000061035156 added to some of the values, and the last value has a similar (but not identical) amount added.
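
For reference, here is roughly what the two lists print on my machine (the exact digits come from Python's shortest round-trip repr of the underlying IEEE 754 values, so they may differ slightly elsewhere):

s to list: [59.11, 59.0, 59.86, 59.86, 59.0839945]
s32 to list: [59.11000061035156, 59.0, 59.86000061035156, 59.86000061035156, 59.0839958190918]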

There is something happening here that I am just not understanding. Why would a float32 of "59.11" not just be "59.11000"? (<< pretend that's the right # of zeros)

I understand losing precision when downcasting from 64-bit to 32-bit. But I don't understand how a number that should be exact in float32 is becoming non-exact.

And while this seems to be a Pandas issue, I am seeing the storage bloat with other libraries too (XlsxWriter and Plotly). I'm guessing this is because float32 forces the data to be kept to a certain number of decimal places, yet somehow float64 is okay with 59.11. However, it could also be that the number itself is changing slightly (at the most granular level), forcing that many decimal places to be kept.


Solution

  • I'm going to flag your question as a duplicate of this question, but to help you understand why, I will also submit this answer.

    Part 1:

    Why would a float32 of "59.11", not just be "59.11000"?

    Answer: there is no way to represent exactly 59.11 as a binary floating point number (float).

    The float representation of 59.11 is some other number really close to 59.11, but not exactly equal. The exact number depends on the machine's implementation of floating point numbers, but in any case if you look at enough decimal places to get beyond the floating point's precision then you could see "garbage". For float32, there are about 7 precise digits (including those left of the .), and the rest is "garbage". For a float64, there are about 15 precise digits before the "garbage" starts.
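
    You can see this without pandas at all. Passing a Python float to decimal.Decimal shows the exact binary value stored in memory, with no rounding. A minimal sketch, assuming standard IEEE 754 floats (which virtually every platform uses):

    from decimal import Decimal
    import numpy as np

    # The exact float64 nearest to 59.11 -- begins 59.1099999999999994...
    print(Decimal(59.11))

    # The exact float32 nearest to 59.11, widened back to a Python float
    # -- 59.1100006103515625, not 59.11
    print(Decimal(float(np.float32(59.11))))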

    To prove this to yourself, try configuring pandas to print an absurd number of decimal places, and then look at your data again. This way you are looking at exactly the number that is stored in memory, not some rounded version of it.

    import pandas as pd

    data = [
        [59.11, 59, 59.86, 59.86, 59.0839945],
        [60.28, 59.7817845, 59.75, 59.75, 59],
    ]
    df = pd.DataFrame(data)
    s = df.iloc[0, :]
    pd.set_option("display.precision", 18)
    print(s)
    s32 = s.astype('float32')
    print(s32)
    

    Which prints:

    0    59.109999999999999432
    1    59.000000000000000000
    2    59.859999999999999432
    3    59.859999999999999432
    4    59.083994500000002859
    Name: 0, dtype: float64
    0    59.110000610351562500
    1    59.000000000000000000
    2    59.860000610351562500
    3    59.860000610351562500
    4    59.083995819091796875
    Name: 0, dtype: float32
    

    Part 2:

    I was surprised to find that some of my files were quite a bit larger than they were before.

    I am guessing that pandas prints the binary floating point representation of the number after rounding it to a certain number of decimal places. For a float64, the stored value is so close to what you typed that rounding to 6 or 7 decimal places leaves only trailing 0s after the digits you want to see, and those trailing 0s are omitted. For a float32, the "garbage" starts early enough that it isn't hidden by the rounding, and therefore takes up more characters in the text version of the result.
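
    You can demonstrate the size impact directly. A quick sketch (the digits come from Python's shortest round-trip repr):

    import numpy as np

    x64 = float(np.float64(59.86))  # stays 59.86 in text form
    x32 = float(np.float32(59.86))  # becomes 59.86000061035156 in text form
    print(repr(x64), len(repr(x64)))  # 59.86 5
    print(repr(x32), len(repr(x32)))  # 59.86000061035156 17

    Multiply that extra dozen characters by every cell in a 102 MB dataframe and the larger .html and .xlsx files make sense.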

    If you wanted to store the data as numbers directly on disk instead of storing a text version of the rounded floating point numbers, you could try saving to a binary format like .pkl (see DataFrame.to_pickle). Just keep in mind that this file would not be human-readable.
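
    For example, a minimal sketch (the filename is illustrative):

    df.to_pickle('float_test.pkl')  # binary on disk; dtypes, including float32, are preserved
    df_restored = pd.read_pickle('float_test.pkl')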