Taking as example the following table:
index | column_1 | column_2 |
---|---|---|
0 | bli bli | d e |
1 | bla bla | a b c d e |
2 | ble ble | a b c |
If I give a token_list = ['c', 'e']
I want to order the table by the number of times the tokens each row contains in column number 2.
By ordering the table I should get the following:
index | column_1 | column_2 | score_tmp |
---|---|---|---|
1 | bla bla | a b c d e | 2 |
0 | bli bli | d e | 1 |
2 | ble ble | a b c | 1 |
Currently, I have reached the following way of doing this, but it is taking a lot of time. How could I improve the time? Thank you in advance.
df['score_tmp'] = df[['column_2']].apply(
lambda x: len([True for token in token_list if
token in str(x['column_2'])]), axis=1)
results = df.sort_values('score_tmp', ascending=False).loc[df['score_tmp'] == len(token_list)].reset_index(inplace=False).to_dict('records')
You can split
column_2 based on whitespaces, convert each row into a set
and then use df.apply
with set intersection
with sort_values
:
In [200]: df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()
In [204]: df.sort_values('matches', ascending=False).drop('matches', 1)
Out[204]:
index column_1 column_2
1 1 bla bla a b c d e
0 0 bli bli d e
2 2 ble ble a b c
Timings:
In [208]: def f1():
...: df['score_tmp'] = df[['column_2']].apply(lambda x: len([True for token in token_list if token in str(x['column_2'])]), axis=1)
...: results = df.sort_values('score_tmp', ascending=False).loc[df['score_tmp'] == len(token_list)].reset_index(inplace=False).to_dict('records')
...:
In [209]: def f2():
...: df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()
...: df.sort_values('matches', ascending=False).drop('matches', 1)
...:
In [210]: %timeit f1() # solution provided in question
2.36 ms ± 55.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [211]: %timeit f2() # my solution
1.22 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)