python pandas csv datatables combinations

Pair of employees that worked together for the longest period of time - Python/Pandas

I recently had to do a code, which returns the pair of employees that have worked the most together on a common project. This is the code I came up with:

Note 1: Null is read by the program as "Today"

Note 2: The data comes from a .txt file in this form:

EmpID,ProjectID,DateFrom,DateTo
1,101,2014-11-01,2015-05-01
1,103,2013-11-01,2016-05-01
2,101,2013-12-06,2014-10-06
2,103,2014-06-05,2015-05-14
3,100,2016-03-01,2018-07-03
3,102,2015-06-04,2017-09-04
3,103,2015-06-04,2017-09-04
4,102,2013-11-13,2014-03-13
4,103,2016-02-14,2017-03-15
4,104,2014-10-01,2015-12-01
5,100,2013-03-07,2015-11-07
5,101,2015-07-09,2019-01-19
5,102,2014-03-15,NULL
6,101,2014-03-15,2014-03-16

The problem that I currently have is that I have to adapt/change the code to return the pair of employees that have worked together with each other the longest (not on a single project, but all projects combined). I am having troubles of adapting my current code, which runs perfectly fine for what it is, and I am wondering if I should just scratch all of this and start from the beginning (but it would cost me a lot of time, which I don't have currently). I am having difficulties with obtaining the combinations of employees that have worked together on projects.

I would very much appreciate it if anyone can give me any tips! Thanks!

Edit 1: A person in the comments reminded me to mention that overlapping days should be counted as for example:

Person A and B work on two projects for the entirety of June. This means that it should be counted as 30 days total common work (for the two projects), not adding both project times together, which would result in 60 days.

Solution

Here's one of the more straight-forward ways I can think of doing this.

Expand the timespans to a single row per date.
Merge all Days on the same project (to get all combinations of people who worked together)
Remove duplicated rows of people who work together on the same day, but different projects.
Just find how many rows are within each worker pairing.

Code:

import pandas as pd
import numpy as np

def expand_period_daily(df, start, stop):
    # Allows it to work for one day spans. 
    df.loc[df[stop].notnull(), stop] = (df.loc[df[stop].notnull(), stop] 
                                        + pd.Timedelta(hours=1))

    real_span = df[[start, stop]].notnull().all(1)

    # Resample timespans to daily fields. 
    df['temp_id'] = range(len(df))
    dailydf = (df.loc[real_span, ['temp_id', start, stop]].set_index('temp_id').stack()
                 .reset_index(level=-1, drop=True).rename('period').to_frame())
    dailydf = (dailydf.groupby('temp_id').apply(lambda x: x.set_index('period')
                      .resample('d').asfreq()).reset_index())

    # Merge back other information
    dailydf = (dailydf.merge(df, on=['temp_id'])
                      .drop(columns=['temp_id', start, stop]))

    return dailydf

# Make dates, fill missings.
df[['DateFrom', 'DateTo']] = df[['DateFrom', 'DateTo']].apply(pd.to_datetime, errors='coerce')
df[['DateFrom', 'DateTo']] = df[['DateFrom', 'DateTo']].fillna(pd.to_datetime('today').normalize())

dailydf = expand_period_daily(df.copy(), start='DateFrom', stop='DateTo')

# Merge, remove rows of employee with him/herself.
m = (dailydf.merge(dailydf, on=['period', 'ProjectID'])
            .loc[lambda x: x.EmpID_x != x.EmpID_y])

# Ensure A-B and B-A are grouped the same
m[['EmpID_x', 'EmpID_y']] = np.sort(m[['EmpID_x', 'EmpID_y']].to_numpy(), axis=1)

# Remove duplicated projects on same date between employee pairs
m = m.drop_duplicates(['period', 'EmpID_x', 'EmpID_y'])

m.groupby(['EmpID_x', 'EmpID_y']).size().to_frame('Days_Together')

Output:

                 Days_Together
EmpID_x EmpID_y               
1       2                  344
        3                  333
        4                   78
2       6                    2
3       4                  396
        5                  824

Test Case

To give a bit more clarity on how it handles overlaps, and combines different projects, Here's the following test case:

   EmpID  ProjectID   DateFrom     DateTo
0      1        101 2014-11-01 2014-11-15
1      1        103 2014-11-01 2014-11-15
2      1        105 2015-11-02 2015-11-03
3      2        101 2014-11-01 2014-11-15
4      2        103 2014-11-01 2014-11-15
5      2        105 2015-10-02 2015-11-05
6      3        101 2014-11-01 2014-11-15

Employees 1 and 2 perfectly overlap for 15 days on 2 projects in Nov 2014. They then work together for 2 additional days on another project in 2015. 1, 2 and 3 all work together for 15 days on a single Project.

Running with this test case we obtain:

                 Days_Together
EmpID_x EmpID_y               
1       2                   17
        3                   15
2       3                   15