pythonpandasdataframebindiscretization

Discretization : converting continuous values into a certain number of categories


1   Create a column Usage_Per_Year from Miles_Driven_Per_Year by discretizing the values into three equally sized categories. The names of the categories should be Low, Medium, and High.

2   Group by Usage_Per_Year and print the group sizes as well as the ranges of each.

3   Do the same as in #1, but instead of equally sized categories, create categories that have the same number of points per category.

4   Group by Usage_Per_Year and print the group sizes as well as the ranges of each.

My codes are below

df["Usage_Per_Year "], bins = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2, retbins=True)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.2
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))
#3.3.3
Year2 = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.4
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))

the out put is below:

               Usage_Per_Year     0 Low       (-1925.883, 663476.235]  6018 Medium  (663476.235, 1326888.118]     0 High     (1326888.118, 1990300.0]     1
               Usage_Per_Year     0 Low       (-1925.883, 663476.235]  6018 Medium  (663476.235, 1326888.118]     0 High     (1326888.118, 1990300.0]     1

but -1925 is wrong...

The right answer should be like this. enter image description here

How can I do...


Solution

  • Maybe a typo on line 1: df["Usage_Per_Year "]? There is a space at the end of the column name.

    pd.cut bins values into equal size. That's why all of your bins have same size. It seems that you should compute the min and max of each group after binning.

    Also, to bin value into equal frequency, you should use pd.qcut.


    Example input:

    import numpy as np
    import pandas as pd
    
    rng = np.random.default_rng(20210514)
    df = pd.DataFrame({
        'Miles_Driven_Per_Year': rng.gamma(1.05, 10000, (1000,)).astype(int)
    })
    
    # 1
    group_label = ['Low', 'Medium', 'High']
    df['Usage_Per_Year'] = pd.cut(df['Miles_Driven_Per_Year'],
                                  bins=3, labels=group_label)
    
    # 2
    print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))
    
    # 3
    df['Usage_Per_Year'] = pd.qcut(df['Miles_Driven_Per_Year'],
                                   q=3, labels=group_label)
    
    # 4
    print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))
    

    Example output:

                   Miles_Driven_Per_Year              
                                   count    min    max
    Usage_Per_Year                                    
    Low                              878     31  20905
    Medium                           107  20955  41196
    High                              15  41991  62668
                   Miles_Driven_Per_Year              
                                   count    min    max
    Usage_Per_Year                                    
    Low                              334     31   4378
    Medium                           333   4449  11424
    High                             333  11442  62668