I have a pandas DataFrame, created this way:
import pandas as pd

wb = pd.ExcelFile('/path/to/data.xlsx')   # pd.io.parsers.ExcelFile in older pandas
df = wb.parse(wb.sheet_names[0])          # read the first sheet into a DataFrame
The resulting dataframe has about a dozen columns and about 150K rows.
For most columns, the following operation is nearly instantaneous:
aset = set(df.acolumn)
But for some columns, the same operation, i.e.
aset = set(df.weirdcolumn)
takes > 10 minutes! (Or rather, the operation fails to complete before the 10-minute timeout period expires.) Same number of elements!
Stranger still:
In [106]: set([type(c) for c in df.weirdcolumn])
Out[106]: set([numpy.float64])
In [107]: df.weirdcolumn.value_counts()
Out[107]: []
It appears that the content of the column is all nans:
In [118]: all(np.isnan(df.weirdcolumn.values))
Out[118]: True
But this does not explain the slowdown mentioned before, because the following operation takes only a couple of seconds:
In [121]: set([np.nan for _ in range(len(df))])
Out[121]: set([nan])
I have run out of ways to find out the cause of the massive slowdown mentioned above. Suggestions welcome.
One weird thing about nans is that they don't compare as equal. This means that "different" nan objects get inserted into sets separately:
>>> float('nan') == float('nan')
False
>>> float('nan') is float('nan')
False
>>> len(set([float('nan') for _ in range(1000)]))
1000
This doesn't happen for your test of np.nan, because it's the same object over and over:
>>> np.nan == np.nan
False
>>> np.nan is np.nan
True
>>> len(set([np.nan for _ in range(1000)]))
1
This is probably your problem: you're making a 150,000-element set where every single element has the exact same hash (hash(float('nan')) == 0). This means that inserting a new nan into a set that already holds n nans takes at least O(n) time, so building a set of N nans takes at least O(N^2) time. 150k^2 is... big.
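To see that blowup directly, here is a minimal timing sketch (mine, not from the original post; it assumes a Python version where hash(float('nan')) == 0, which held before Python 3.10 switched NaN hashing to object identity):

import time

for n in (10000, 20000, 40000):
    nans = [float('nan') for _ in range(n)]  # n distinct nan objects, all landing in the same hash bucket
    start = time.time()
    set(nans)
    print(n, 'nans ->', round(time.time() - start, 2), 'seconds')

Each doubling of n should roughly quadruple the time, which is exactly the O(N^2) behaviour described above.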
So yeah, nans suck. You could work around this by doing something like:

import numpy as np

nan_idx = np.isnan(df.weirdcolumn)          # boolean mask of the nan positions
s = set(df.weirdcolumn[~nan_idx])           # build the set from the non-nan values only
if np.any(nan_idx):
    s.add(np.nan)                           # represent all the nans with the single np.nan object
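An alternative along the same lines (my suggestion, not part of the original answer) is to lean on pandas' own NaN handling instead of an explicit mask:

s = set(df.weirdcolumn.dropna())            # dropna() strips the nans before the set is built
if df.weirdcolumn.isnull().any():
    s.add(np.nan)                           # add back one canonical nan if the column had any

Either way, the point is that only the single shared np.nan object ends up in the set, so every insertion hashes and compares in O(1).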