python, pandas, count, append, frequency-analysis

Calculating the frequency and % frequency of all categorical variables in a DataFrame in Python


I am new to Python. I am working on a requirement to list all unique values in each categorical column of a DataFrame, along with the frequency and the % frequency of each value, using a for loop to run over the complete dataset. I am also not sure whether I have to use pd.Series to append the data into a DataFrame (as in the attached screenshot), because the columns have different lengths depending on the number of unique values in each column.
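
For a single column, the kind of result I am after looks like this sketch ("Name" here is just a placeholder for one of my categorical columns):

df["Name"].value_counts()                 # frequency of each unique value
df["Name"].value_counts(normalize=True)   # % frequency (as a fraction)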

Appreciate your help.

Below is the code I tried, but I cannot get it to cover the other columns (unique values and % frequency) or to combine everything into one DataFrame that I can export to CSV:

import pandas as pd

Count_df = []
for item in df.columns:
    # counts of each unique value in this column
    Count_df_ = pd.DataFrame(df[item].value_counts())
    Count_df.append(Count_df_)
# this is the step that fails: a list of DataFrames cannot be
# turned into a single DataFrame this way
Count_dfdf = pd.DataFrame(Count_df)
Count_dfdf
Count_dfdf.to_csv(path_or_buf = Output + '_' + 'Count_.csv')

The input and the expected output are as below, and the same are attached as an image:

[Input data and expected Output][1]

Thanks in advance


Solution

  • No magic. Just build the per-column frequency tables one by one and concatenate them at the end.

    Here I assume a 4-column output in a single .csv file. From personal work experience, this format is handier for spreadsheet software than separate files. However, separate per-column output is also possible inside the loop.

    Code:

    import pandas as pd
    
    # please provide copy-able sample data next time
    df = pd.DataFrame(
        data={
            "Name": ["A", "B", "C", "C", "A", "F"],
            "col2": [True, False, False, False, False, True],
            "col3": [1, 2, 3, 1, 1, 3],
        }
    )
    
    # Collect the per-column frequency tables in a list and
    # concatenate them once at the end (appending to an empty,
    # all-float DataFrame would silently cast the counts to float).
    frames = []
    
    # process each column
    for col in df.columns:
    
        # get variable name and count
        df_col_count = df[col].value_counts().to_frame().reset_index()
        # rename columns
        df_col_count.columns = ["var_name", "var_count"]
    
        # compute frequency
        df_col_count["var_freq"] = df_col_count["var_count"] / df_col_count["var_count"].sum()
    
        # append column name
        df_col_count["col_name"] = col
    
        # sort (optional)
        # (1) by name
        df_col_count.sort_values(by="var_name", inplace=True)
        # (2) by descending frequency
        # df_col_count.sort_values(by="var_freq", ascending=False, inplace=True)
    
        # collect this column's table
        frames.append(df_col_count)
    
        # For separated CSV output, output here (and "col_name" can be removed)
        #df_col_count.to_csv(f"/path/to/{col}_freq.csv")
    
    # combine all per-column tables into one frame
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    df_ans = pd.concat(frames, ignore_index=True)
    # reorder columns
    df_ans = df_ans[["col_name", "var_name", "var_count", "var_freq"]]
    
    # write csv
    # df_ans.to_csv(f"/path/to/all_freq.csv")
    

    Output

    # Each column (variable) is sorted by name.
    df_ans   
    
    Out[12]: 
      col_name var_name  var_count  var_freq
    0     Name        A          2  0.333333
    1     Name        B          1  0.166667
    2     Name        C          2  0.333333
    3     Name        F          1  0.166667
    4     col2    False          4  0.666667
    5     col2     True          2  0.333333
    6     col3        1          3  0.500000
    7     col3        2          1  0.166667
    8     col3        3          2  0.333333
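
    As a side note, here is a slightly more compact variation of the same idea (my own sketch, not part of the code above): value_counts(normalize=True) returns the shares directly, so the per-column table can be built without the manual division.

    import pandas as pd

    # assumes the same sample df as above
    parts = []
    for col in df.columns:
        counts = df[col].value_counts()                 # absolute counts
        shares = df[col].value_counts(normalize=True)   # counts / total
        part = pd.DataFrame({
            "col_name": col,             # scalar, broadcast to all rows
            "var_name": counts.index,
            "var_count": counts.to_numpy(),
            "var_freq": shares.to_numpy(),
        })
        parts.append(part.sort_values("var_name"))

    df_ans = pd.concat(parts, ignore_index=True)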