This is a question regarding calculating a Flag value of 0 or 1 using multiple input values within a pandas dataframe. At the same time, if any of the input-values are not floating-point (or are blank i.e. NaN in pandas) then the Flag would be blank i.e. ''
Say, cols_to_check represents the list of columns that serve as the input.
# Function to check if a value is a floating point number
def is_floating_point(aString):
if (aString==aString): #this is to handle NaN values
try:
float(aString)
return True
except (ValueError, TypeError):
return False
else:
return False
# Check if all input values in the specified columns are floating points
data['valid'] = data[cols_to_check].applymap(is_floating_point).all(axis=1)
# Calculate the flag value
data['flag'] = np.where(data['valid'], data.sort_values(cols_to_check, ascending=[True, True, False, False, False,False]).groupby("emp_f8")["Rank"].cumsum(), '')
# data['flag'] = data.sort_values(cols_to_check, ascending=[True, True, False, False, False,False]).groupby("emp_f8")["Rank"].cumsum()
# Ensure flag is 0 or 1 (unless input data quality issue then it'd be '')
data['flag'] = np.where(data['flag'].astype(str) == '1','1', np.where(data['flag'].astype(str) == '', '', '0'))
In the code above, I seem to be getting the right flag values if I do not check for a valid input data set. If I do check for validity of the input data set (i.e. all input values are floating point and not Null/ blank/ NaN) then I am not getting the right flag values.
Why would it not work correctly if I include the part that says np.where(data['valid'],
In my opinion, the datatype-checking and the np.where statements are causing unexpected behavior. I am getting flag values of 1 whereas I should be getting a flag value of 0 and 1 for a given input data set.
Update: I am tryign to provide sample data but having a hard time to present the data neatly in tables. Tried the table formatting tips explained here, but it's not working. Anyways here it is -
emp_f8 emp_l3 3gee 3hee 3iee 4bee 4dee
60033312 002 19 1 64 454133 426717.24
23232354 001 2 1 15.2 98353.39 98538.43
23232354 005 3 1 60.8 106493.24 21262.34
The flag'd be 1 for the first row of data. The next 2 rows have the same emp_f8 values; so only 1 row will have a Flag value of 1 and the other would be 0. I hope that helps.
It's likely because of the sort_values
within the np.where
. Try sorting first.
data = data.sort_values(cols_to_check, ascending=[True, True, False, False, False,False]).reset_index(drop=True)
data['valid'] = data[cols_to_check].applymap(is_floating_point).all(axis=1)
data['flag'] = np.where(data['valid'], data.groupby("emp_f8", sort=False)["Rank"].cumsum(), '')
data['flag'] = np.where(data['flag'].astype(str) == '1','1', np.where(data['flag'].astype(str) == '', '', '0'))