python pandas dataframe group-by running-count

Iterating through pandas groupby groups


I have a pandas dataframe school_df that looks like this:

  school_id date_posted date_completed
0         A  2014-01-01     2014-01-01
1         A  2014-01-01     2014-01-08
2         A  2014-04-29     2014-05-01
3         B  2014-01-01     2014-01-01
4         B  2014-01-20     2014-02-23

Each row represents one project by that school. I'd like to add two columns: for each row, a count of how many of that school's projects were posted before the row's date_posted, and a count of how many of that school's projects were completed before the row's date_posted.

The code below works, but I have ~300,000 unique schools, so it's taking a long time to run. Is there a faster way to get what I am looking for? Thank you for your assistance!

import pandas as pd
groups = school_df.groupby("school_id")
blank_df = pd.DataFrame()
for g, df in groups:
    df['school_previous_projects'] = df.date_posted.map(lambda x: len(df[df.date_posted < x]))
    df['school_previous_completed'] = df.date_posted.map(lambda x: len(df[df.date_completed < x]))
    blank_df = pd.concat([blank_df, df])

Solution

  • Here is a version using cumcount (I simplified the dates, but it should still work):

    import pandas as pd

    # Simplified sample data: one row per project, already sorted by date.
    df = pd.DataFrame({'school_id': ['A', 'A', 'A', 'B', 'B'],
                       'date_posted': pd.date_range('2014-01-01', '2014-01-05'),
                       'date_completed': pd.date_range('2014-01-01', '2014-01-05')})

    # cumcount numbers the rows within each school_id as 0, 1, 2, ... in their
    # current row order, so the rows are assumed to be sorted by date.
    posted = df.set_index('date_posted').groupby('school_id').cumcount()
    comp = df.set_index('date_completed').groupby('school_id').cumcount()

    # Assign by position (.values) because the results are indexed by date.
    df['posted'] = posted.values
    df['comp'] = comp.values

    print(df)

    Results in:

      school_id date_posted date_completed  posted  comp
    0         A  2014-01-01     2014-01-01       0     0
    1         A  2014-01-02     2014-01-02       1     1
    2         A  2014-01-03     2014-01-03       2     2
    3         B  2014-01-04     2014-01-04       0     0
    4         B  2014-01-05     2014-01-05       1     1
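
One caveat: cumcount counts the rows that come before each row within its group, so it matches the question's columns only when the rows are sorted by date and a school has no repeated dates. The question's sample data has a tie (school A posted two projects on 2014-01-01), and its second column compares date_completed against date_posted. Below is a minimal sketch of one way to handle both cases, assuming the column names from the question. The helper name add_previous_counts is made up for illustration. It still loops over the groups, but it sorts each group once and uses np.searchsorted instead of filtering the frame for every row, and it concatenates only once at the end.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
school_df = pd.DataFrame({
    "school_id": ["A", "A", "A", "B", "B"],
    "date_posted": pd.to_datetime(
        ["2014-01-01", "2014-01-01", "2014-04-29", "2014-01-01", "2014-01-20"]),
    "date_completed": pd.to_datetime(
        ["2014-01-01", "2014-01-08", "2014-05-01", "2014-01-01", "2014-02-23"]),
})


def add_previous_counts(group):
    """Return a copy of one school's rows with the two count columns added.

    Hypothetical helper, not part of pandas.
    """
    posted = group["date_posted"].to_numpy()
    completed = group["date_completed"].to_numpy()
    out = group.copy()
    # With side="left", searchsorted returns how many sorted values are
    # strictly smaller than each query date, so tied dates are not counted.
    out["school_previous_projects"] = np.searchsorted(
        np.sort(posted), posted, side="left")
    out["school_previous_completed"] = np.searchsorted(
        np.sort(completed), posted, side="left")
    return out


# Build every piece first and concatenate once, instead of growing a
# DataFrame inside the loop.
pieces = [add_previous_counts(g) for _, g in school_df.groupby("school_id")]
result = pd.concat(pieces)
print(result)
```

On the question's sample data both new columns come out as 0, 0, 2, 0, 1, which is what the original loop produces. I have not timed this against 300,000 schools, so treat the speedup as something to measure rather than a given.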