pythonpandasnumpytypes

Data type checking, conditional logic and use of np.where,df.sort_values and cumsum


This is a question regarding calculating a Flag value of 0 or 1 using multiple input values within a pandas dataframe. At the same time, if any of the input-values are not floating-point (or are blank i.e. NaN in pandas) then the Flag would be blank i.e. ''

Say, cols_to_check represents the list of columns that serve as the input.

# Function to check if a value is a floating point number
def is_floating_point(aString):
    if (aString==aString): #this is to handle NaN values
        try:
            float(aString)
            return True
        except (ValueError, TypeError):
            return False
    else:
        return False

# Check if all input values in the specified columns are floating points
data['valid'] = data[cols_to_check].applymap(is_floating_point).all(axis=1)
    
# Calculate the flag value
    data['flag'] = np.where(data['valid'], data.sort_values(cols_to_check, ascending=[True, True, False, False, False,False]).groupby("emp_f8")["Rank"].cumsum(), '')
    # data['flag'] = data.sort_values(cols_to_check, ascending=[True, True, False, False, False,False]).groupby("emp_f8")["Rank"].cumsum()
    
# Ensure flag is 0 or 1 (unless input data quality issue then it'd be '')
data['flag'] = np.where(data['flag'].astype(str) == '1','1', np.where(data['flag'].astype(str) == '', '', '0'))

In the code above, I seem to be getting the right flag values if I do not check for a valid input data set. If I do check for validity of the input data set (i.e. all input values are floating point and not Null/ blank/ NaN) then I am not getting the right flag values.

Why would it not work correctly if I include the part that says np.where(data['valid'],

In my opinion, the datatype-checking and the np.where statements are causing unexpected behavior. I am getting flag values of 1 whereas I should be getting a flag value of 0 and 1 for a given input data set.

Update: I am tryign to provide sample data but having a hard time to present the data neatly in tables. Tried the table formatting tips explained here, but it's not working. Anyways here it is -

emp_f8      emp_l3  3gee    3hee    3iee    4bee       4dee
60033312    002      19     1       64      454133     426717.24
23232354    001      2      1       15.2    98353.39    98538.43
23232354    005      3      1       60.8    106493.24   21262.34

The flag'd be 1 for the first row of data. The next 2 rows have the same emp_f8 values; so only 1 row will have a Flag value of 1 and the other would be 0. I hope that helps.


Solution

  • It's likely because of the sort_values within the np.where. Try sorting first.

    data = data.sort_values(cols_to_check, ascending=[True, True, False, False, False,False]).reset_index(drop=True)
    
    data['valid'] = data[cols_to_check].applymap(is_floating_point).all(axis=1)
        
    data['flag'] = np.where(data['valid'], data.groupby("emp_f8", sort=False)["Rank"].cumsum(), '')
    
    data['flag'] = np.where(data['flag'].astype(str) == '1','1', np.where(data['flag'].astype(str) == '', '', '0'))