I have a dataset, df
, that looks like this:
Date | Code | City | State | Population | Quantity | QTDPERCAPITA |
---|---|---|---|---|---|---|
2020-01 | 11001 | Los Angeles | CA | 5000000 | 100000 | 0.02 |
2020-02 | 11001 | Los Angeles | CA | 5000000 | 125000 | 0.025 |
2020-03 | 11001 | Los Angeles | CA | 5000000 | 135000 | 0.027 |
2020-01 | 12002 | Houston | TX | 3000000 | 150000 | 0.05 |
2020-02 | 12002 | Houston | TX | 3000000 | 100000 | 0.033 |
2020-03 | 12002 | Houston | TX | 3000000 | 200000 | 0.066 |
... | ... | ... | ... | ... | ... | ... |
2021-07 | 11001 | Los Angeles | CA | 5500499 | 340000 | 0.062 |
2021-07 | 12002 | Houston | TX | 3250012 | 211000 | 0.065 |
WhereQTDPERCAPITA
is simply Quantity/Population
. I have multiple cities (4149 to be more precise).
The quantities change according to every month, and so does the population.
I would like to create a new variable that serve as a ranking, ranging from [0,1]
, where 0
is the city with the lowest QTDPERCAPITA
in that month, and 1
is the city with the most quantity per capita in that month. Essentially, I want to create a new column that looks like this:
Date | Code | City | State | Population | Quantity | QTDPERCAPITA | RANKING |
---|---|---|---|---|---|---|---|
2020-01 | 11001 | Los Angeles | CA | 5000000 | 100000 | 0.02 | 0 |
2020-02 | 11001 | Los Angeles | CA | 5000000 | 125000 | 0.025 | 0 |
2020-03 | 11001 | Los Angeles | CA | 5000000 | 135000 | 0.027 | 0 |
2020-01 | 12002 | Houston | TX | 3000000 | 150000 | 0.05 | 1 |
2020-02 | 12002 | Houston | TX | 3000000 | 100000 | 0.033 | 1 |
2020-03 | 12002 | Houston | TX | 3000000 | 200000 | 0.066 | 1 |
... | ... | ... | ... | ... | ... | ... | ... |
2021-07 | 11001 | Los Angeles | CA | 5500499 | 340000 | 0.062 | 0 |
2021-07 | 12002 | Houston | TX | 3250012 | 211000 | 0.065 | 1 |
How can I create this column such that the RANKING
changes every month? I was thinking of a for
loop that extracts the QTDPERCAPITA
for every city on every unique date, and creates a new column, df['RANKING']
with the same date
and city
.
You can use:
# MinMax scaler: (rank - min) / (max - min)
ranking = lambda x: (x.rank() - 1) / (len(x) - 1)
# Rank between [0, 1] -> 0 the lowest, 1 the highest
df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].apply(ranking)
# Rank between [1, 4149] -> 1 the lowest, 4149 the highest
# df['RANKING'] = df.groupby('Date')['QTDPERCAPITA'].rank('dense')
Output:
Date | Code | City | State | Population | Quantity | QTDPERCAPITA | RANKING |
---|---|---|---|---|---|---|---|
2020-01 | 11001 | Los Angeles | CA | 5000000 | 100000 | 0.02 | 0 |
2020-02 | 11001 | Los Angeles | CA | 5000000 | 125000 | 0.025 | 0 |
2020-03 | 11001 | Los Angeles | CA | 5000000 | 135000 | 0.027 | 0 |
2020-01 | 12002 | Houston | TX | 3000000 | 150000 | 0.05 | 1 |
2020-02 | 12002 | Houston | TX | 3000000 | 100000 | 0.033 | 1 |
2020-03 | 12002 | Houston | TX | 3000000 | 200000 | 0.066 | 1 |
2021-07 | 11001 | Los Angeles | CA | 5500499 | 340000 | 0.618 | 1 |
2021-07 | 12002 | Houston | TX | 3250012 | 211000 | 0.065 | 0 |