python-3.x, pandas, dataframe, vectorization, numexpr

"AttributeError: 'PandasExprVisitor'...." Error when trying to optimize pandas performance using pd.eval()


Here are the first 16 rows of my data:

    col1 col2     col3     col4     col5     col6     col7     col8
0   f43   e1   -44.30   442.24  2353.55  1964.71   326.74  2424.52
1   f58   e1   -44.30   442.24  2353.55  1964.71   326.74  2424.52
2   f13   e2   715.04  1656.92  -142.41  2979.43  2279.57  2629.24
3    f8   e2   715.04  1656.92  -142.41  2979.43  2279.57  2629.24
4   f15   e3   850.45   -80.54  1468.52  2540.86  2108.47  2175.76
5   f19   e3   850.45   -80.54  1468.52  2540.86  2108.47  2175.76
6   f56   e4  2540.86  2108.47  2175.76   497.09   314.43  2498.20
7   f21   e4  2540.86  2108.47  2175.76   497.09   314.43  2498.20
8   f45   e5  1906.87  2632.99   569.87   497.09   314.43  2498.20
9   f59   e5  1906.87  2632.99   569.87   497.09   314.43  2498.20
10  f16   e6  1529.13  2369.10  2995.27   968.80   -10.81  2706.25
11  f41   e6  1529.13  2369.10  2995.27   968.80   -10.81  2706.25
12  f48   e7   607.30  1570.72  2206.87   590.33    33.49  2123.76
13  f38   e7   607.30  1570.72  2206.87   590.33    33.49  2123.76
14   f1   e8   590.33    33.49  2123.76   831.43  2887.28  1306.86
15   f9   e8   590.33    33.49  2123.76   831.43  2887.28  1306.86
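
For reproducibility, here is a small snippet (my own addition, not from the original post) that rebuilds the first six rows so the code below can be run directly:

    import pandas as pd

    # first six rows of the sample data above
    df = pd.DataFrame({
        'col1': ['f43', 'f58', 'f13', 'f8', 'f15', 'f19'],
        'col2': ['e1', 'e1', 'e2', 'e2', 'e3', 'e3'],
        'col3': [-44.30, -44.30, 715.04, 715.04, 850.45, 850.45],
        'col4': [442.24, 442.24, 1656.92, 1656.92, -80.54, -80.54],
        'col5': [2353.55, 2353.55, -142.41, -142.41, 1468.52, 1468.52],
        'col6': [1964.71, 1964.71, 2979.43, 2979.43, 2540.86, 2540.86],
        'col7': [326.74, 326.74, 2279.57, 2279.57, 2108.47, 2108.47],
        'col8': [2424.52, 2424.52, 2629.24, 2629.24, 2175.76, 2175.76],
    })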

I have around 3600 records of data like this, and my expected output is the same as the one described in a previous SO question of mine.

Jezrael's answer takes around 5 seconds to process 3600 records, and the time taken will only grow as my dataset increases in size. In a bid to future-proof my code I am attempting to use pd.eval() (as suggested by the Pandas docs), but I am running into this error: AttributeError: 'PandasExprVisitor' object has no attribute 'visit_ExtSlice'

The working code, using Jezrael's method with my own modifications to get the desired output:

df["v1"] = df.apply(lambda row: row['col3':'col5'].tolist(), axis=1)
df["v2"] = df.apply(lambda row: row['col6':'col8'].tolist(), axis=1)
v1 = df['v1'].to_numpy()
v2 = df['v2'].to_numpy()

m = (v1 == v1[:, None]) | (v2 == v2[:, None]) | (v1 == v2[:, None]) | (
            v2 == v1[:, None])

np.fill_diagonal(m, False)

df['col9'] = np.dot(m, df['col2'] + ',')
df['col9'] = df['col9'].str[:-1].replace('', np.nan, regex=True)
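
A note on the np.dot line (my explanation, not part of the original answer): on object arrays NumPy multiplies a boolean by a string, so True * 'e1,' gives 'e1,' and False * 'e1,' gives '', and the dot-product sum concatenates the pieces, collecting the labels of each row's matches. A minimal sketch:

    import numpy as np

    m = np.array([[False, True, True],
                  [True, False, False],
                  [True, False, False]])
    labels = np.array(['e1,', 'e2,', 'e3,'], dtype=object)

    # True * 'e2,' -> 'e2,' and False * 'e2,' -> '', and the dot-product
    # sum concatenates, so row 0 collects 'e2,e3,'.
    print(np.dot(m, labels))  # ['e2,e3,' 'e1,' 'e1,']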

The code that breaks, which I am trying to optimize using pd.eval(), is as follows:

df["v1"] = df.apply(lambda row: row['col3':'col5'].tolist(), axis=1)
df["v2"] = df.apply(lambda row: row['col6':'col8'].tolist(), axis=1)
v1 = df['v1'].to_numpy()
v2 = df['v2'].to_numpy()

# check if current value in column A is equal to any other values in column A
mask_A_A = pd.eval('v1[:, None] == v1')

# check if current value in column A is equal to any other values in column B
mask_A_B = pd.eval('v1[:, None] == v2')

# check if current value in column B is equal to any other values in column B
mask_B_B = pd.eval('v2[:, None] == v2')

# check if any values in column B is equal to any other values in column A
mask_B_A = pd.eval('v2[:, None] == v1')

# combine masks and use numexpr to create col9 column
mask = ne.evaluate(
        '(mask_A_A | mask_A_B | mask_B_B | mask_B_A) & (v1[:, None] != v1) & (v2[:, None] != v2)')
col9 = np.where(mask, df['col2'].values[:, None], '')
df['col9'] = pd.Series([', '.join(names) if len(names) > 0 else None for names in col9])

My Pandas version is 1.5.3 and my numexpr version is 2.8.4.

EDIT:

pd.eval() does not support a column containing numpy arrays, so if you have to optimize columns holding such data, it is best to use Cython or Numba.
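
For reference, a minimal sketch of the failure (my own repro; the exact failure mode can depend on the Python and pandas versions). The extended slice [:, None] inside the expression string is what the pandas expression visitor cannot handle:

    import numpy as np
    import pandas as pd

    v1 = np.array([[1.0, 2.0], [3.0, 4.0]])

    # On pandas 1.5.3 this raises:
    # AttributeError: 'PandasExprVisitor' object has no attribute 'visit_ExtSlice'
    # because pd.eval's parser has no visitor for the extended slice [:, None].
    pd.eval('v1[:, None] == v1')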


Solution

  • In my opinion, the bottleneck is at the start of the code:

    df = pd.concat([df] * 100, ignore_index=True)
        
    In [226]: %timeit df["v1"] = df.apply(lambda row: row['col3':'col5'].tolist(), axis=1)
    243 ms ± 34.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [227]: %timeit df["v2"] = df.loc[:, 'col3':'col5'].to_numpy().tolist()
    774 µs ± 9.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    The next optimization is to use 2D numpy arrays instead of arrays of lists; the col9 solution was also changed to avoid empty , separators between the col2 values (see the broadcasting sketch after the timings below):

    df = pd.concat([df] * 100, ignore_index=True)
    
    
    In [258]: %%timeit
         ...: v1 = df.loc[:, 'col3':'col5'].to_numpy()
         ...: v2 = df.loc[:, 'col6':'col8'].to_numpy()
         ...: df["v1"] = v1.tolist()
         ...: df["v2"] = v2.tolist()
         ...: 
         ...: v11 = v1[:, None]
         ...: v22 = v2[:, None]
         ...: m = ((v1 == v11).all(axis=2) | (v2 == v22).all(axis=2) |
         ...:      (v1 == v22).all(axis=2) | (v2 == v11).all(axis=2))
         ...: 
         ...: np.fill_diagonal(m, False)
         ...: 
         ...: col9 = np.where(m, df['col2'].add(','), '')
         ...: df['col9'] = [''.join(names).strip(',')
         ...:               if len(names) > 0
         ...:               else None for names in col9]
         ...: 
    142 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    In [259]: %%timeit
         ...: df["v1"] = df.apply(lambda row: row['col3':'col5'].tolist(), axis=1)
         ...: df["v2"] = df.apply(lambda row: row['col6':'col8'].tolist(), axis=1)
         ...: v1 = df['v1'].to_numpy()
         ...: v2 = df['v2'].to_numpy()
         ...: 
         ...: m = (v1 == v1[:, None]) | (v2 == v2[:, None]) | (v1 == v2[:, None]) | (
         ...:             v2 == v1[:, None])
         ...: 
         ...: np.fill_diagonal(m, False)
         ...: 
         ...: df['col9'] = np.dot(m, df['col2'] + ',')
         ...: df['col9'] = df['col9'].str[:-1].replace('', np.nan, regex=True)
         ...: 
         ...: 
    1.05 s ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
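
    To illustrate why the 2D approach works (my sketch, not part of the original answer): v1[:, None] has shape (n, 1, 3), comparing it with v1 of shape (n, 3) broadcasts to (n, n, 3), and .all(axis=2) collapses that into an (n, n) mask of row pairs whose triplets match exactly:

    import numpy as np

    v1 = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [1.0, 2.0, 3.0]])

    # (n, 1, 3) == (n, 3) broadcasts to (n, n, 3); .all(axis=2) keeps
    # only the row pairs where all three values are equal.
    m = (v1 == v1[:, None]).all(axis=2)
    print(m)
    # [[ True False  True]
    #  [False  True False]
    #  [ True False  True]]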