pythonpandasdataframeduplicates

Count the number of identical values ​in a DataFrame Python


I have a dataframe with two columns: hash and results.

The hash column contains the task ID, it can be repeated several times.

The result column contains the user's answer to this task. The answers contain specific string values ​​like: "Yes", "No", "Perhaps", 404"

My task is to calculate the consistency of user answers.

For example, if 3 users answered the same way to a task with one hash, then the consistency is 100%. If 2 out of 3 answered the same way, then 66.6. If one of 3, then 33.3%. If everyone answered differently, then 0.

I tried to calculate the maximum number of repeated answer values ​​for each hash via "duplicated", but this does not work when there are no repeated values.

Please tell me what functions can be used to solve this problem.

df for example:

df = pd.DataFrame({'hash': list('AAABBBCCCCDDD'),
               'results': ['Yes', 'No', 'No',
                           'Yes', 'Yes', 'Yes',
                           'Perhaps', 'Yes', '404', 'Perhaps',
                           'Yes', 'No', 'Perhaps',
                          ]
              })

Solution

  • Since you do not provide any reference code, I create a mini python script to simulate your code. This is what I used:

    First I create a function that can take the df and make all the operation

    def calculate_consistency(df):
    

    Then, first of all, I grouped all the records at the hash, and result columns:

    df.groupby('hash')['result'].value_counts().unstack(fill_value=0)
    

    Then, calculate total responses for each hash

    total_responses = grouped.sum(axis=1)
    

    To find the maximum count for each hash

    max_counts = grouped.max(axis=1)
    

    And finally, calculate consistency percentage and return the consistency.

    consistency = (max_counts / total_responses) * 100
    return consistency
    

    The full code, with the example looks like this:

    import pandas as pd
    
    def calculate_consistency(df):
        grouped = df.groupby('hash')['result'].value_counts().unstack(fill_value=0)
        total_responses = grouped.sum(axis=1)
        max_counts = grouped.max(axis=1)
        consistency = (max_counts / total_responses) * 100
        return consistency
    
    
    df = pd.DataFrame({'hash': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'], 'result': ['Yes', 'Yes', 'No', 'Yes', 'No', 'Perhaps', '404', 'Yes']})
    
    result = calculate_consistency(df)
    print(result)
    

    Let me know if this is useful for you, good luck!