1 Create a column Usage_Per_Year from Miles_Driven_Per_Year by discretizing the values into three equally sized categories. The names of the categories should be Low, Medium, and High.
2 Group by Usage_Per_Year and print the group sizes as well as the ranges of each.
3 Do the same as in #1, but instead of equally sized categories, create categories that have the same number of points per category.
4 Group by Usage_Per_Year and print the group sizes as well as the ranges of each.
My codes are below
df["Usage_Per_Year "], bins = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2, retbins=True)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.2
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))
#3.3.3
Year2 = pd.cut(df["Miles_Driven_Per_Year"], 3, precision=2)
group_label = pd.Series(["Low", "Medium", "High"])
#3.3.4
group_size = df.groupby("Usage_Per_Year").size()
#print(group_size)
print(group_size.reset_index().set_index(group_label))
the out put is below:
Usage_Per_Year 0 Low (-1925.883, 663476.235] 6018 Medium (663476.235, 1326888.118] 0 High (1326888.118, 1990300.0] 1 Usage_Per_Year 0 Low (-1925.883, 663476.235] 6018 Medium (663476.235, 1326888.118] 0 High (1326888.118, 1990300.0] 1
but -1925 is wrong...
The right answer should be like this.
How can I do...
Maybe a typo on line 1: df["Usage_Per_Year "]
? There is a space at the end of the column name.
pd.cut
bins values into equal size. That's why all of your bins have same size. It seems that you should compute the min and max of each group after binning.
Also, to bin value into equal frequency, you should use pd.qcut
.
Example input:
import numpy as np
import pandas as pd
rng = np.random.default_rng(20210514)
df = pd.DataFrame({
'Miles_Driven_Per_Year': rng.gamma(1.05, 10000, (1000,)).astype(int)
})
# 1
group_label = ['Low', 'Medium', 'High']
df['Usage_Per_Year'] = pd.cut(df['Miles_Driven_Per_Year'],
bins=3, labels=group_label)
# 2
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))
# 3
df['Usage_Per_Year'] = pd.qcut(df['Miles_Driven_Per_Year'],
q=3, labels=group_label)
# 4
print(df.groupby('Usage_Per_Year').agg(['count', 'min', 'max']))
Example output:
Miles_Driven_Per_Year
count min max
Usage_Per_Year
Low 878 31 20905
Medium 107 20955 41196
High 15 41991 62668
Miles_Driven_Per_Year
count min max
Usage_Per_Year
Low 334 31 4378
Medium 333 4449 11424
High 333 11442 62668