pythonpandasdataframegroup-byanalysis

Python Column in Percentage


I need someone's help with a data analysis issue.

I am analysing a dataframe with data about sleep. It has 5 columns as follows: data frame

Enough - Do you think that you get enough sleep? Hours - On average, how many hours of sleep do you get on a weeknight? PhoneReach - Do you sleep with your phone within arms reach? PhoneTime - Do you use your phone within 30 minutes of falling asleep? Tired - On a scale from 1 to 5, how tired are you throughout the day? (1 being not tired, 5 being very tired) Breakfast - Do you typically eat breakfast?

grouped_df = df.groupby(['Hours', 'Enough']).count().sort_values('Hours', ascending=False)
reduced_df = grouped_df[['Tired']]
total_count = reduced_df['Tired'].sum()
reduced_df['Tired_percentage'] = (reduced_df['Tired'] / total_count) * 100
reduced_df

I wrote the following lines to group the data by columns 'Enough' and 'Hours'.

I wanted the 'Tired Percentage' column to show the percentage of Yes and and No of 'Enough' for each 'Hour'. But as you can see below this percentage is returning incorrectly, in the first row it shpuld be 100% for example. What am I doing wrong? grouped data frame

I tried what I described above and expect to have a percentage column showing correct values


Solution

  • The code you've written is calculating the percentage of each ['Hours', 'Enough'] group with respect to the total count of the 'Tired' column in the entire DataFrame. However, what you are trying to achieve is to get the percentage of 'Yes' and 'No' values of 'Enough' for each 'Hour'. To accomplish that, you should calculate the sum of 'Tired' values for each 'Hour' and then calculate the percentage of each ['Hours', 'Enough'] group with respect to that sum.

    Here is how you might adjust your code:

    # First, group by 'Hours' and 'Enough', and count the occurrences.
    grouped_df = df.groupby(['Hours', 'Enough']).size().reset_index(name='count')
    
    # Now, group by 'Hours' and sum the counts to get the total count for each 'Hour'.
    hours_sum = grouped_df.groupby('Hours')['count'].sum().reset_index(name='hours_sum')
    
    # Merge the two DataFrames on the 'Hours' column.
    merged_df = pd.merge(grouped_df, hours_sum, on='Hours')
    
    # Now, calculate the percentage.
    merged_df['Tired_percentage'] = (merged_df['count'] / merged_df['hours_sum']) * 100
    
    # If you want, you can sort the DataFrame.
    sorted_df = merged_df.sort_values(['Hours', 'Tired_percentage'], ascending=[False, False])
    
    # Display the DataFrame.
    sorted_df