I have a problem of this type: A customer creates an order by hand, which might be erroneous. Submitting a wrong order is costly, which is why we try to reduce the error rate.
I need to detect which factors cause an error, so that a new rule can be created, such as "product A and type B must not go together". All explanatory variables are categorical.
I have 2 questions:
Below is a sample dataset and a simple approach I took: finding variables with a high proportion of errors to be proposed as rules. I create a single interaction term by hand (based on prior knowledge, but I might be missing others).
I also tried classification models (LASSO, decision tree, random forest), but I ran into two issues: (1) high dimensionality, especially when including many interactions, and (2) difficulty extracting simple rules, since the models use many coefficients even with regularization.
import pandas as pd

# Create sample dataset for the task
df = pd.DataFrame(data={'error': [0, 1, 0, 0, 0, 0, 0, 1, 1, 1],
                        'product': [1, 2, 1, 2, 2, 3, 4, 2, 2, 2],
                        'type': [1, 1, 2, 3, 3, 1, 2, 1, 4, 4],
                        'discount_level': [5, 3, 3, 4, 1, 2, 2, 1, 4, 5],
                        'extra1': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                        'extra2': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
                        'extra3': [6, 6, 9, 9, 8, 8, 7, 7, 6, 6]})

# Variable interaction based on prior knowledge
df['product_type'] = df['product'].astype(str) + '_' + df['type'].astype(str)
X = df.drop('error', axis=1)

# Find groups with a high proportion of errors
groups_expl = pd.DataFrame()
for col in X.columns:
    groups = df.groupby(col).agg(count_all=('error', 'count'),
                                 count_error=('error', 'sum'))
    groups['portion_error'] = groups['count_error'] / groups['count_all']
    groups['column'] = col
    # Keep groups with a high proportion of errors
    groups_expl = pd.concat([groups_expl,
                             groups.loc[groups['portion_error'] > 0.8, :]], axis=0)
groups_expl['col_val'] = groups_expl.index
print(groups_expl)
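To address the "I might be missing others" concern, one option is to generate every pairwise interaction automatically instead of hand-picking a single one, then run the same per-group error-rate scan. A minimal sketch (the `cat_cols` list and the 0.8 threshold are assumptions, not from my real data):

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame(data={'error': [0, 1, 0, 0, 0, 0, 0, 1, 1, 1],
                        'product': [1, 2, 1, 2, 2, 3, 4, 2, 2, 2],
                        'type': [1, 1, 2, 3, 3, 1, 2, 1, 4, 4]})

# Build every pairwise interaction instead of hand-picking one
cat_cols = ['product', 'type']
for a, b in combinations(cat_cols, 2):
    df[f'{a}_{b}'] = df[a].astype(str) + '_' + df[b].astype(str)

# Same per-group error-rate scan as before, now on the interaction column
rates = (df.groupby('product_type')['error']
           .agg(count_all='count', count_error='sum'))
rates['portion_error'] = rates['count_error'] / rates['count_all']
print(rates.loc[rates['portion_error'] > 0.8])
```

With many categorical columns the number of pairs grows quadratically, so in practice a minimum group-size filter (e.g. `count_all >= 5`) is needed to avoid proposing rules based on one or two orders.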
Thank you for your help!
What approach do I take to extract simple but useful rules to give to a human expert for further review?
You could experiment with a shallow boosted ensemble, for example `XGBClassifier(n_estimators=100, max_depth=2)`.
The idea is that each tree in the ensemble comes to represent some feature combination that corresponds to elevated risk.
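As a sketch of that idea, the snippet below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` (assumed interchangeable for illustration) and prints each depth-2 tree so a human expert can read off the candidate rules; the one-hot encoding via `pd.get_dummies` is my addition:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import export_text

# Sample data from the question, one-hot encoded so splits read as "product == 2"
df = pd.DataFrame({'error': [0, 1, 0, 0, 0, 0, 0, 1, 1, 1],
                   'product': [1, 2, 1, 2, 2, 3, 4, 2, 2, 2],
                   'type': [1, 1, 2, 3, 3, 1, 2, 1, 4, 4]})
X = pd.get_dummies(df[['product', 'type']].astype(str))
y = df['error']

# Shallow boosted ensemble: each tree encodes at most a two-way combination
model = GradientBoostingClassifier(n_estimators=10, max_depth=2).fit(X, y)

# Print each tree's split logic for human review
for i, tree in enumerate(model.estimators_.ravel()):
    print(f'--- tree {i} ---')
    print(export_text(tree, feature_names=list(X.columns)))
```

Trees whose high-value leaves recur across the ensemble are the feature combinations worth turning into rules.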
How do I make sure variable interactions are taken into account?
Decision tree models are easy to visualize and interpret, and they capture feature interactions automatically.
Imagine the following split logic:
if product == 1:
    if extra == 3:
        return "high risk"
    else:
        return "no risk"
else:
    return "no risk"
As you can see, this decision tree only contributes to the total risk score when `product == 1` and `extra == 3`. That's a feature interaction.
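The split logic above can be written as a plain function to make the interaction explicit; the risk contribution is nonzero only when both conditions hold (the leaf values 1.0 and 0.0 are placeholders, not fitted scores):

```python
def tree_risk(product, extra):
    """One depth-2 tree: contributes risk only when product == 1 AND extra == 3."""
    if product == 1:
        if extra == 3:
            return 1.0  # "high risk" leaf
        return 0.0      # "no risk" leaf
    return 0.0          # "no risk" leaf

print(tree_risk(1, 3))  # 1.0 — both conditions hold
print(tree_risk(1, 2))  # 0.0 — product matches, extra does not
print(tree_risk(2, 3))  # 0.0 — extra matches, product does not
```

Summing many such shallow trees, as a boosted ensemble does, amounts to adding up small interaction rules of exactly this shape.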