pythonpandasdataframeprocessing-efficiency

Python/Pandas: use one column's value to be the suffix of the column name from which I want a value


I have a pandas dataframe. From multiple columns therein, I need to select the value from only one into a single new column, according to the ID (bar in this example) of that row. I need the fastest way to do this.

Dataframe for application is like this:

foo bar ID_A    ID_B    ID_C    ID_D    ID_E    ...
1   B   1.5     2.3     4.1     0.5     6.6     ...
2   E   3       4       5       6       7       ...
3   A   9       6       3       8       1       ...
4   C   13      5       88      9       0       ...
5   B   6       4       6       9       4       ...
...

An example of a way to do it (my fastest at present) is thus - however, it is too slow for my purposes.

df.loc[df.bar=='A', 'baz'] = df.ID_A
df.loc[df.bar=='B', 'baz'] = df.ID_B
df.loc[df.bar=='C', 'baz'] = df.ID_C
df.loc[df.bar=='D', 'baz'] = df.ID_D
df.loc[df.bar=='E', 'baz'] = df.ID_E
df.loc[df.bar=='F', 'baz'] = df.ID_F
df.loc[df.bar=='G', 'baz'] = df.ID_G

Result will be like this (after dropping used columns):

foo baz
1   2.3
2   7
3   9
4   88
5   4
...

I have tried with .apply() and it was very slow.
I tried with np.where() which was still much slower than the example shown above (which was 1000% faster than np.where()).

Would appreciate recommendations! Many thanks

EDIT: after the first few answers, I think I need to add this:
"whilst I would appreciate runtime estimate relative to the example, I know it's a small example so may be tricky. My actual data has 280000 rows and an extra 50 columns (which I need to keep along with foo and baz). I have to reduce 13 columns to the single column per the example. The speed is the only reason for asking, & no mention of speed thus far in first few responses. Thanks again!"


Solution

  • You can use a variant of the indexing lookup:

    idx, cols = pd.factorize('ID_'+df['bar'])
    out = pd.DataFrame({'foo': df['foo'],
                        'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
    

    output:

       foo   baz
    0    1   2.3
    1    2   7.0
    2    3   9.0
    3    4  88.0
    4    5   4.0
    

    testing speed

    Setting up a test dataset (280k rows, 54 ID columns):

    from string import ascii_uppercase, ascii_lowercase
    
    letters = list(ascii_lowercase+ascii_uppercase)
    N = 280_000
    
    np.random.seed(0)
    
    df = (pd.DataFrame({'foo': np.arange(1, N+1),
                        'bar': np.random.choice(letters, size=N)})
            .join(pd.DataFrame(np.random.random(size=(N, len(letters))),
                               columns=[f'ID_{l}' for l in letters]
                              ))
         )
    

    speed testing:

    %%timeit
    idx, cols = pd.factorize('ID_'+df['bar'])
    out = pd.DataFrame({'foo': df['foo'],
                        'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
    

    output:

    54.4 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)