pythonpandasdataframeexploratory-data-analysis

Pandas Data Wrangling / Dataframe Assignment


sample_data = {
    'Class': ['TEST', 'NOT TEST', 'TEST 123', 'SOMETHING ELSE', 'TEST'],
    'Values': ['1.3 / 4.5', '1.5', '0.7 / 5.2', '123', '1.5 / 4.0']
}

df = pd.DataFrame(sample_data)
mask = df['Class'].str.startswith('TEST')
# I WANT TO DO SOMETHING LIKE THIS -->
df.loc[mask, ['COL1', 'COL2']] = df.loc[mask, 'Values'].str.split(' / ', expand=True)

df.head()

I am getting Nan for COL1 and COL2 for all rows. My expectation would be that I would get the split values for only the rows matching the mask and Nan for others.

I have read the pandas documentation (not enough apparently) and have been working with AI, but it is sending me in circles. Clearly I am lost in the weeds and not understanding something that I think should be very simple. Maybe something with copies and such.

Pandas

I would expect, and what I want, is that I would get values in COL1 and COL2 for just rows which start with 'TEST'. Other rows would be NaN for COL1 and COL2.

I understand why the below code doesn't work (am working on a copy), but I don't understand why the original snipet doesn't work:

df_subset = df[mask]
df_subset[['COL1', 'COL2']] = df_subset['Values'].str.split(' / ', expand=True)
df.head() #NOTHING CHANGED AS EXPECTED BUT df_subset is as expected

Solution

  • Since the column names are not aligned with the output of split (columns are 0/1), you should use the underlying numpy array:

    df.loc[mask, ['COL1', 'COL2']] = df.loc[mask, 'Values'].str.split(' / ', expand=True).values
    

    Output:

                Class     Values COL1 COL2
    0            TEST  1.3 / 4.5  1.3  4.5
    1        NOT TEST        1.5  NaN  NaN
    2        TEST 123  0.7 / 5.2  0.7  5.2
    3  SOMETHING ELSE        123  NaN  NaN
    4            TEST  1.5 / 4.0  1.5  4.0
    

    Output of df.loc[mask, 'Values'].str.split(' / ', expand=True):

         0    1 # not aligned with COL1/COL2
    0  1.3  4.5
    2  0.7  5.2
    4  1.5  4.0