pandas, conditional-statements

Pandas Dataframe: Creating a new column and filling it with values according to 2 conditional statements on other columns


I've written this script, which creates new columns whose values are kept only when they meet two conditions on other columns.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.DataFrame()

df['variable 1']= np.arange(0,1.1,0.1)
df['variable 2']= 0.2*df['variable 1']
df['variable 3']= 0.4 -0.2*df['variable 1']


# Create new columns 

slope = [2, 1.5, 1, 0.5]

for i in range(len(slope)):

    df['slope = ' + str(slope[i])]=''
    for j in range(len(df['variable 1'])):
    # Calculating Scl_disp_sd with equation 1
        curve = 0.5 - slope[i]*df['variable 1'][j]
        df['slope = ' + str(slope[i])][j]= np.where((curve>df['variable 2'][j]) & (curve<df['variable 3'][j]), curve,np.nan)

display(df)

plt.plot(df['variable 1'], df['variable 2'], 'o', label='variable 2')
plt.plot(df['variable 1'], df['variable 3'], 'o', label='variable 3')
plt.plot(df['variable 1'], df.filter(like='slope =', axis=1), marker='.')
plt.legend()


The script works; however, I get these warnings:

/var/folders/m0/_y1fs5x50xx99pjg2yf42y7r0000gp/T/ipykernel_1964/2618301266.py:11: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df['slope = ' + str(slope[i])][j]= np.where((curve>df['variable 2'][j]) & (curve<df['variable 3'][j]),
/var/folders/m0/_y1fs5x50xx99pjg2yf42y7r0000gp/T/ipykernel_1964/2618301266.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['slope = ' + str(slope[i])][j]= np.where((curve>df['variable 2'][j]) & (curve<df['variable 3'][j]),

I'd appreciate any suggestions on how to rewrite this script so that the warnings no longer appear.


Solution

  • There is no need for the nested loop. Just apply your operation to the whole column at once (vectorized):

    slope = [2, 1.5, 1, 0.5]
    
    for s in slope:
        # curve is a whole Series here, so each new column is filled in one step
        curve = 0.5 - s*df['variable 1']
        df[f'slope = {s}'] = np.where((curve>df['variable 2'])
                                      & (curve<df['variable 3']),
                                      curve,np.nan)
    

    Or fully vectorized, using NumPy broadcasting so that all slopes are handled at once:

    curve = 0.5 - slope*df['variable 1'].to_numpy()[:, None]
    cols = [f'slope = {c}' for c in slope]
    df[cols] = np.where(  (curve > df[['variable 2']].to_numpy())
                        & (curve < df[['variable 3']].to_numpy()),
                        curve, np.nan)
    

    Output:

        variable 1  variable 2  variable 3  slope = 2  slope = 1.5  slope = 1  slope = 0.5
    0          0.0        0.00        0.40        NaN          NaN        NaN          NaN
    1          0.1        0.02        0.38        0.3         0.35        NaN          NaN
    2          0.2        0.04        0.36        0.1         0.20        0.3          NaN
    3          0.3        0.06        0.34        NaN          NaN        0.2          NaN
    4          0.4        0.08        0.32        NaN          NaN        0.1         0.30
    5          0.5        0.10        0.30        NaN          NaN        NaN         0.25
    6          0.6        0.12        0.28        NaN          NaN        NaN         0.20
    7          0.7        0.14        0.26        NaN          NaN        NaN         0.15
    8          0.8        0.16        0.24        NaN          NaN        NaN          NaN
    9          0.9        0.18        0.22        NaN          NaN        NaN          NaN
    10         1.0        0.20        0.20        NaN          NaN        NaN          NaN
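
If you would rather keep a row-by-row loop, the warning text itself points at the fix: assign through `df.loc[row, column]` in a single step instead of `df[column][row]`. The following is only a minimal sketch under that assumption. It rebuilds the frame from the question, fills the columns with `.loc`, and checks the result against the broadcast version. The names `looped`, `vectorized`, `x` and `mask` are made up for this illustration, and the new columns are initialised with `np.nan` rather than `''` so they stay numeric.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({'variable 1': np.arange(0, 1.1, 0.1)})
df['variable 2'] = 0.2 * df['variable 1']
df['variable 3'] = 0.4 - 0.2 * df['variable 1']

slope = [2, 1.5, 1, 0.5]
cols = [f'slope = {s}' for s in slope]

# Row-by-row version: df.loc[row, column] reads and writes in a single
# indexing step, which is what the warning message recommends.
looped = df.copy()
for s, col in zip(slope, cols):
    looped[col] = np.nan  # float column instead of '' (object dtype)
    for j in looped.index:
        curve_j = 0.5 - s * looped.loc[j, 'variable 1']
        if looped.loc[j, 'variable 2'] < curve_j < looped.loc[j, 'variable 3']:
            looped.loc[j, col] = curve_j

# Broadcast version: a column vector of x values against a row of slopes
# gives one row per x value and one column per slope.
x = df['variable 1'].to_numpy()[:, None]
curve = 0.5 - np.asarray(slope) * x
mask = ((curve > df[['variable 2']].to_numpy())
        & (curve < df[['variable 3']].to_numpy()))
vectorized = df.copy()
vectorized[cols] = np.where(mask, curve, np.nan)

# Both approaches should agree, with NaN in the same places.
same = np.allclose(looped[cols].to_numpy(), vectorized[cols].to_numpy(),
                   equal_nan=True)
print(same)
```

The loop is still much slower than the vectorized forms on large frames, so it is mainly useful when the per-row logic cannot easily be expressed as array operations.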