I have a pandas dataframe. From multiple columns therein, I need to select the value from only one into a single new column, according to the ID (bar
in this example) of that row.
I need the fastest way to do this.
Dataframe for application is like this:
foo bar ID_A ID_B ID_C ID_D ID_E ...
1 B 1.5 2.3 4.1 0.5 6.6 ...
2 E 3 4 5 6 7 ...
3 A 9 6 3 8 1 ...
4 C 13 5 88 9 0 ...
5 B 6 4 6 9 4 ...
...
An example of a way to do it (my fastest at present) is thus - however, it is too slow for my purposes.
df.loc[df.bar=='A', 'baz'] = df.ID_A
df.loc[df.bar=='B', 'baz'] = df.ID_B
df.loc[df.bar=='C', 'baz'] = df.ID_C
df.loc[df.bar=='D', 'baz'] = df.ID_D
df.loc[df.bar=='E', 'baz'] = df.ID_E
df.loc[df.bar=='F', 'baz'] = df.ID_F
df.loc[df.bar=='G', 'baz'] = df.ID_G
Result will be like this (after dropping used columns):
foo baz
1 2.3
2 7
3 9
4 88
5 4
...
I have tried with .apply()
and it was very slow.
I tried with np.where()
which was still much slower than the example shown above (which was 1000% faster than np.where()
).
Would appreciate recommendations! Many thanks
EDIT: after the first few answers, I think I need to add this:
"whilst I would appreciate runtime estimate relative to the example, I know it's a small example so may be tricky.
My actual data has 280000 rows and an extra 50 columns (which I need to keep along with foo
and baz
). I have to reduce 13 columns to the single column per the example.
The speed is the only reason for asking, & no mention of speed thus far in first few responses. Thanks again!"
You can use a variant of the indexing lookup:
idx, cols = pd.factorize('ID_'+df['bar'])
out = pd.DataFrame({'foo': df['foo'],
'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
output:
foo baz
0 1 2.3
1 2 7.0
2 3 9.0
3 4 88.0
4 5 4.0
Setting up a test dataset (280k rows, 54 ID columns):
from string import ascii_uppercase, ascii_lowercase
letters = list(ascii_lowercase+ascii_uppercase)
N = 280_000
np.random.seed(0)
df = (pd.DataFrame({'foo': np.arange(1, N+1),
'bar': np.random.choice(letters, size=N)})
.join(pd.DataFrame(np.random.random(size=(N, len(letters))),
columns=[f'ID_{l}' for l in letters]
))
)
speed testing:
%%timeit
idx, cols = pd.factorize('ID_'+df['bar'])
out = pd.DataFrame({'foo': df['foo'],
'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
output:
54.4 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)