Tags: python, pandas, set, nan

Why does it take so long to convert a Pandas column to a set when it only contains NaNs?


I have a pandas DataFrame, created this way:

import pandas as pd
wb = pd.ExcelFile('/path/to/data.xlsx')  # pd.io.parsers.ExcelFile in older pandas
df = wb.parse(wb.sheet_names[0])

The resulting dataframe has about a dozen columns and about 150K rows.

For most columns, the following operation is nearly instantaneous:

aset = set(df.acolumn)

But for some columns, the same operation, i.e.

aset = set(df.weirdcolumn)

takes > 10 minutes! (Or rather, the operation fails to complete before the 10-minute timeout period expires.) Same number of elements!

Stranger still:

In [106]: set([type(c) for c in df.weirdcolumn])
Out[106]: set([numpy.float64])

In [107]: df.weirdcolumn.value_counts()
Out[107]: []

It appears that the content of the column is all NaNs (value_counts() drops NaN by default, which explains the empty result):

In [118]: all(np.isnan(df.weirdcolumn.values))
Out[118]: True

But this does not explain the slowdown mentioned before, because the following operation takes only a couple of seconds:

In [121]: set([np.nan for _ in range(len(df))])
Out[121]: set([nan])

I have run out of ideas for tracking down the cause of this massive slowdown. Suggestions welcome.


Solution

  • One weird thing about nans is that they don't compare as equal. This means that "different" nan objects will be inserted into a set separately:

    >>> float('nan') == float('nan')
    False
    >>> float('nan') is float('nan')
    False
    >>> len(set([float('nan') for _ in range(1000)]))
    1000
    

    This doesn't happen for your test of np.nan, because it's the same object over and over:

    >>> np.nan == np.nan
    False
    >>> np.nan is np.nan
    True
    >>> len(set([np.nan for _ in range(1000)]))
    1
    

    This is probably your problem: you're building a 150,000-element set where every single element has the exact same hash (hash(float('nan')) == 0 on Pythons before 3.10; since 3.10, nan hashes by object identity). This means that inserting a new nan into a set that already has n nans takes at least O(n) time, so building a set of N nans takes at least O(N^2) time. 150,000^2 is... big.
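
    For intuition, here is a rough timing sketch (the quadratic pattern assumes a pre-3.10 Python; on 3.10+ the collision, and the blowup, disappear):

    import time

    # Build sets of distinct nan objects of doubling size. Every nan lands
    # in the same hash bucket, so each insert must probe past all the nans
    # already stored; doubling n roughly quadruples the build time.
    for n in (1000, 2000, 4000):
        nans = [float('nan') for _ in range(n)]
        start = time.perf_counter()
        set(nans)
        print(n, time.perf_counter() - start)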

    So yeah, nans suck. You could work around this by doing something like

    import numpy as np

    # Pull the nans out first, build the set from the remaining values,
    # then add back a single canonical nan if any were present.
    nan_idx = np.isnan(df.weirdcolumn)
    s = set(df.weirdcolumn[~nan_idx])
    if np.any(nan_idx):
        s.add(np.nan)
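
    A shorter route, if pandas' deduplication semantics suit you: Series.unique() collapses all nans into one, so the resulting set is cheap to build.

    # Alternative sketch: deduplicate at the array level first;
    # unique() returns at most one nan, so the set sees no collisions.
    s = set(df.weirdcolumn.unique())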