python pandas group-by calculated-columns

Dynamic calculation for columns in pandas


I have a DataFrame like this:

Bus_no.   City 
3412.     Kolkata 
7658.     Nagpur
3412.     Mumbai
5516.     Kolkata 
7658.     Chennai
3412.     Mumbai

I need to do a dynamic calculation based on these columns. For example, Bus_no. 3412 appears 3 times in the whole dataset: once in Kolkata and twice in Mumbai.

So for 3412 in Kolkata the value will be (1/3)*100, and for Mumbai it will be (2/3)*100.

Similarly, I need this calculation for each Bus_no. in each city.

I am using Python 3.6.

Output should be like:

Bus_no.  Kolkata  Mumbai  Chennai  Nagpur
3412.    value    value   value    value

Kindly help


Solution

  • If you want to include all cities per bus (i.e. including the ones with 0%), you can use pd.crosstab with normalize='index', then chain df.mul and df.stack:

    import pandas as pd
    
    data = {'Bus_no.': {0: '3412.', 1: '7658.', 2: '3412.', 3: '5516.', 4: '7658.', 
                        5: '3412.'}, 
            'City': {0: 'Kolkata', 1: 'Nagpur', 2: 'Mumbai', 3: 'Kolkata', 4: 'Chennai', 
                     5: 'Mumbai'}}
    df = pd.DataFrame(data)
    
    out = (pd.crosstab(index=df['Bus_no.'], 
                       columns=df['City'], 
                       normalize='index')
           .mul(100)
           .stack()
           )
    

    Output

    Bus_no.  City   
    3412.    Chennai      0.000000
             Kolkata     33.333333
             Mumbai      66.666667
             Nagpur       0.000000
    5516.    Chennai      0.000000
             Kolkata    100.000000
             Mumbai       0.000000
             Nagpur       0.000000
    7658.    Chennai     50.000000
             Kolkata      0.000000
             Mumbai       0.000000
             Nagpur      50.000000
    dtype: float64
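
To make it clearer what normalize='index' does, here is a minimal sketch that computes the same percentages by hand: build the raw count table, then divide each row by its row total. The names counts, manual and normalized are only illustrative, and the DataFrame is rebuilt from the sample data in the question.

```python
import pandas as pd

# Sample data from the question (bus numbers kept as strings, as above).
df = pd.DataFrame({
    "Bus_no.": ["3412.", "7658.", "3412.", "5516.", "7658.", "3412."],
    "City": ["Kolkata", "Nagpur", "Mumbai", "Kolkata", "Chennai", "Mumbai"],
})

# Raw counts: one row per bus, one column per city.
counts = pd.crosstab(index=df["Bus_no."], columns=df["City"])

# Divide every row by the number of times that bus appears, then scale to percent.
manual = counts.div(counts.sum(axis=1), axis=0).mul(100)

# The same table using the built-in row normalization.
normalized = pd.crosstab(index=df["Bus_no."], columns=df["City"],
                         normalize="index").mul(100)

print(manual)
```

Both tables should agree, and each row should add up to 100, which is a quick sanity check for this kind of per-group share.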
    

    Edit: to get this in the desired format, drop .stack() and chain df.reset_index and df.rename_axis instead:

    (pd.crosstab(index=df['Bus_no.'], 
                 columns=df['City'], 
                 normalize='index')
     .mul(100).reset_index().rename_axis(columns=None)
     )
    
      Bus_no.  Chennai     Kolkata     Mumbai  Nagpur
    0   3412.      0.0   33.333333  66.666667     0.0
    1   5516.      0.0  100.000000   0.000000     0.0
    2   7658.     50.0    0.000000   0.000000    50.0
    

    Alternative with df.groupby

    If you do not need all cities per bus, you could also use df.groupby and get the normalized value_counts:

    out2 = df.groupby('Bus_no.')['City'].value_counts(normalize=True).mul(100)
    

    Output alternative

    Bus_no.  City   
    3412.    Mumbai      66.666667
             Kolkata     33.333333
    5516.    Kolkata    100.000000
    7658.    Chennai     50.000000
             Nagpur      50.000000
    Name: proportion, dtype: float64
    

    But in that case, the answer by @mozway will be better.
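
If you start from the groupby version but still want one column per city, one possible approach (a sketch, not the only way) is to unstack the City level and fill the missing bus/city combinations with 0. The name wide is only illustrative, and the DataFrame is rebuilt from the sample data in the question.

```python
import pandas as pd

# Sample data from the question (bus numbers kept as strings, as above).
df = pd.DataFrame({
    "Bus_no.": ["3412.", "7658.", "3412.", "5516.", "7658.", "3412."],
    "City": ["Kolkata", "Nagpur", "Mumbai", "Kolkata", "Chennai", "Mumbai"],
})

# Percentage of each city within each bus, as a Series indexed by (Bus_no., City).
pct = df.groupby("Bus_no.")["City"].value_counts(normalize=True).mul(100)

# Pivot the City level into columns; combinations that never occur become 0.
wide = pct.unstack(fill_value=0)

print(wide)
```

This should give the same numbers as the pd.crosstab version, just reached through groupby.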