I have a dataframe with two columns: hash and results.
The hash column contains the task ID, it can be repeated several times.
The result column contains the user's answer to this task. The answers contain specific string values like: "Yes", "No", "Perhaps", 404"
My task is to calculate the consistency of user answers.
For example, if 3 users answered the same way to a task with one hash, then the consistency is 100%. If 2 out of 3 answered the same way, then 66.6. If one of 3, then 33.3%. If everyone answered differently, then 0.
I tried to calculate the maximum number of repeated answer values for each hash via "duplicated", but this does not work when there are no repeated values.
Please tell me what functions can be used to solve this problem.
df for example:
df = pd.DataFrame({'hash': list('AAABBBCCCCDDD'),
'results': ['Yes', 'No', 'No',
'Yes', 'Yes', 'Yes',
'Perhaps', 'Yes', '404', 'Perhaps',
'Yes', 'No', 'Perhaps',
]
})
Since you do not provide any reference code, I create a mini python script to simulate your code. This is what I used:
First I create a function that can take the df and make all the operation
def calculate_consistency(df):
Then, first of all, I grouped all the records at the hash, and result columns:
df.groupby('hash')['result'].value_counts().unstack(fill_value=0)
Then, calculate total responses for each hash
total_responses = grouped.sum(axis=1)
To find the maximum count for each hash
max_counts = grouped.max(axis=1)
And finally, calculate consistency percentage and return the consistency.
consistency = (max_counts / total_responses) * 100
return consistency
The full code, with the example looks like this:
import pandas as pd
def calculate_consistency(df):
grouped = df.groupby('hash')['result'].value_counts().unstack(fill_value=0)
total_responses = grouped.sum(axis=1)
max_counts = grouped.max(axis=1)
consistency = (max_counts / total_responses) * 100
return consistency
df = pd.DataFrame({'hash': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'], 'result': ['Yes', 'Yes', 'No', 'Yes', 'No', 'Perhaps', '404', 'Yes']})
result = calculate_consistency(df)
print(result)
Let me know if this is useful for you, good luck!