Here are the first 16 rows of my data:
col1 col2 col3 col4 col5 col6 col7 col8
0 f43 e1 -44.30 442.24 2353.55 1964.71 326.74 2424.52
1 f58 e1 -44.30 442.24 2353.55 1964.71 326.74 2424.52
2 f13 e2 715.04 1656.92 -142.41 2979.43 2279.57 2629.24
3 f8 e2 715.04 1656.92 -142.41 2979.43 2279.57 2629.24
4 f15 e3 850.45 -80.54 1468.52 2540.86 2108.47 2175.76
5 f19 e3 850.45 -80.54 1468.52 2540.86 2108.47 2175.76
6 f56 e4 2540.86 2108.47 2175.76 497.09 314.43 2498.20
7 f21 e4 2540.86 2108.47 2175.76 497.09 314.43 2498.20
8 f45 e5 1906.87 2632.99 569.87 497.09 314.43 2498.20
9 f59 e5 1906.87 2632.99 569.87 497.09 314.43 2498.20
10 f16 e6 1529.13 2369.10 2995.27 968.80 -10.81 2706.25
11 f41 e6 1529.13 2369.10 2995.27 968.80 -10.81 2706.25
12 f48 e7 607.30 1570.72 2206.87 590.33 33.49 2123.76
13 f38 e7 607.30 1570.72 2206.87 590.33 33.49 2123.76
14 f1 e8 590.33 33.49 2123.76 831.43 2887.28 1306.86
15 f9 e8 590.33 33.49 2123.76 831.43 2887.28 1306.86
I have around 3600 records of data like this, and my expected output is the one described in a previous SO question of mine.
Jezrael's answer takes around 5 seconds to process 3600 records, and the time taken will only grow as my dataset does. In a bid to future-proof my code I am attempting to use pd.eval()
(as suggested by the pandas docs), but I am running into this error: AttributeError: 'PandasExprVisitor' object has no attribute 'visit_ExtSlice'
The working code using Jezrael's method, with my own modifications to get the desired output:
import numpy as np
import pandas as pd

df["v1"] = df.apply(lambda row: row['col3':'col5'].tolist(), axis=1)
df["v2"] = df.apply(lambda row: row['col6':'col8'].tolist(), axis=1)
v1 = df['v1'].to_numpy()
v2 = df['v2'].to_numpy()
m = (v1 == v1[:, None]) | (v2 == v2[:, None]) | (v1 == v2[:, None]) | (
v2 == v1[:, None])
np.fill_diagonal(m, False)
df['col9'] = np.dot(m, df['col2'] + ',')
df['col9'] = df['col9'].str[:-1].replace('', np.nan, regex=True)
The code that is breaking, which I'm trying to optimize using pd.eval(), is as follows:
import numexpr as ne
import numpy as np
import pandas as pd

df["v1"] = df.apply(lambda row: row['col3':'col5'].tolist(), axis=1)
df["v2"] = df.apply(lambda row: row['col6':'col8'].tolist(), axis=1)
v1 = df['v1'].to_numpy()
v2 = df['v2'].to_numpy()
# check if v1 of the current row equals v1 of any other row
mask_A_A = pd.eval('v1[:, None] == v1')
# check if v1 of the current row equals v2 of any other row
mask_A_B = pd.eval('v1[:, None] == v2')
# check if v2 of the current row equals v2 of any other row
mask_B_B = pd.eval('v2[:, None] == v2')
# check if v2 of the current row equals v1 of any other row
mask_B_A = pd.eval('v2[:, None] == v1')
# combine masks and use numexpr to create col9 column
mask = ne.evaluate(
'(mask_A_A | mask_A_B | mask_B_B | mask_B_A) & (v1[:, None] != v1) & (v2[:, None] != v2)')
col9 = np.where(mask, df['col2'].values[:, None], '')
df['col9'] = pd.Series([', '.join(names) if len(names) > 0 else None for names in col9])
My pandas version is 1.5.3 and my numexpr version is 2.8.4.
pd.eval() does not support a column containing numpy arrays, so if you have to optimise columns holding such data it is best to use Cython or Numba.
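As a quick illustration of why the expression engines fail here (a minimal sketch with made-up values, not the original frame): storing per-row slices as lists produces an object-dtype column, while pd.eval() and numexpr only operate on plain primitive dtypes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col3': [1.0, 2.0], 'col4': [3.0, 4.0]})

# A plain numeric column has dtype float64 -- fine for pd.eval / numexpr
print(df['col3'].dtype)

# Storing per-row slices as lists yields an object-dtype column,
# which the expression engines cannot evaluate elementwise
df['v1'] = df.loc[:, 'col3':'col4'].to_numpy().tolist()
print(df['v1'].dtype)
```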
In my opinion the bottleneck is at the start of the code:
df = pd.concat([df] * 100, ignore_index=True)
In [226]: %timeit df["v1"] = df.apply(lambda row: row['col3':'col5'].tolist(), axis=1)
243 ms ± 34.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [227]: %timeit df["v1"] = df.loc[:, 'col3':'col5'].to_numpy().tolist()
774 µs ± 9.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
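The timings above compare the same slice built two ways; to confirm they produce identical values (a small check on toy data, not the original frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4),
                  columns=['col3', 'col4', 'col5', 'col6'])

# row-wise apply: a Python-level loop over rows
slow = df.apply(lambda row: row['col3':'col5'].tolist(), axis=1)

# vectorized: one block slice, one conversion
fast = df.loc[:, 'col3':'col5'].to_numpy().tolist()

assert slow.tolist() == fast
```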
The next optimisation is to use 2d numpy arrays instead of arrays of lists; the solution for col9
was also changed to avoid an empty ,
between values of col2
:
df = pd.concat([df] * 100, ignore_index=True)
In [258]: %%timeit
...: v1 = df.loc[:, 'col3':'col5'].to_numpy()
...: v2 = df.loc[:, 'col6':'col8'].to_numpy()
...: df["v1"] = v1.tolist()
...: df["v2"] = v2.tolist()
...:
...: v11 = v1[:, None]
...: v22 = v2[:, None]
     ...: m = ((v1 == v11).all(axis=2) | (v2 == v22).all(axis=2) |
     ...:      (v1 == v22).all(axis=2) | (v2 == v11).all(axis=2))
...:
...: np.fill_diagonal(m, False)
...:
...: col9 = np.where(m, df['col2'].add(','), '')
     ...: df['col9'] = [''.join(names).strip(',') if len(names) > 0
     ...:               else None for names in col9]
...:
142 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
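Putting the optimized steps together on a tiny hand-made frame (illustrative values only), the broadcasted comparison flags rows that share v1 or v2 in any combination:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col2': ['e1', 'e2', 'e3'],
    'col3': [1.0, 1.0, 9.0], 'col4': [2.0, 2.0, 9.0], 'col5': [3.0, 3.0, 9.0],
    'col6': [4.0, 7.0, 0.0], 'col7': [5.0, 8.0, 0.0], 'col8': [6.0, 9.0, 0.0],
})

v1 = df.loc[:, 'col3':'col5'].to_numpy()
v2 = df.loc[:, 'col6':'col8'].to_numpy()
v11 = v1[:, None]
v22 = v2[:, None]

# m[i, j] is True when rows i and j share v1 or v2 (in any combination)
m = ((v1 == v11).all(axis=2) | (v2 == v22).all(axis=2) |
     (v1 == v22).all(axis=2) | (v2 == v11).all(axis=2))
np.fill_diagonal(m, False)

# pick up col2 of every matching row, then strip the trailing comma
col9 = np.where(m, df['col2'].add(','), '')
df['col9'] = [''.join(names).strip(',') if len(names) > 0
              else None for names in col9]
print(df['col9'].tolist())  # rows 0 and 1 share v1; row 2 matches nothing
```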
In [259]: %%timeit
...: df["v1"] = df.apply(lambda row: row['col3':'col5'].tolist(), axis=1)
...: df["v2"] = df.apply(lambda row: row['col6':'col8'].tolist(), axis=1)
...: v1 = df['v1'].to_numpy()
...: v2 = df['v2'].to_numpy()
...:
...: m = (v1 == v1[:, None]) | (v2 == v2[:, None]) | (v1 == v2[:, None]) | (
...: v2 == v1[:, None])
...:
...: np.fill_diagonal(m, False)
...:
...: df['col9'] = np.dot(m, df['col2'] + ',')
...: df['col9'] = df['col9'].str[:-1].replace('', np.nan, regex=True)
...:
...:
1.05 s ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)