pythonpandasdataframemergepandas-merge

PANDAS How to include data in a MERGE that has missing data in some ROWS


I have two dataframes. One called SERVICES and one called TIMES.

I am joining them together like so:

servicesMerged = pd.merge(services, times, left_on='Ref_Id', right_on='Ref_ID')

This is fine and works, except some of the TIMES data is missing a ref_id.

This is service data for a booking system, so for example we might have this:

**TIMES**

Ref_Id  |  TIMES
1  |  30
2  |  15
3  |  10


**SERVICES**

Ref_ID  |  Name
1  |  Mowing
2  |  Raking
3  |  Blowing
4  |  Trimming

What is happening, is we're getting a nice merge, but the service Trimming does not come into the new dataset, as it's missing the time in the times dataframe.

What we need it to do, is, if the time is missing (as per this example) that we add some data, so we'd add say 15 minutes.

Something you would traditionally do like so:

If not exists time:
    Create a time and make it 15

I've tried how = inner, outer, left, right, but nothing works.

How can I, if a row is missing like above, force the data to be added to the merged data?

Thank you.


Solution

  • Creating the dfs like this:

    times = pd.DataFrame({'Ref_Id':[1,2,3],
                          'TIMES':[30, 15, 10]})
    
    services = pd.DataFrame({'Ref_ID':[1,2,3,4],
                             'Name':['Mowing', 'Raking', 'Blowing', 'Trimming']})
    

    Then you should just be able to add how='left' to your code (note I had to swap your left_on and right_on, as the capital D in Ref_ID is in the left service table):

    servicesMerged = pd.merge(services, times, left_on='Ref_ID', right_on='Ref_Id', how='left')
    
       Ref_ID      Name  Ref_Id  TIMES
    0       1    Mowing     1.0   30.0
    1       2    Raking     2.0   15.0
    2       3   Blowing     3.0   10.0
    3       4  Trimming     NaN    NaN
    

    alternatively, you can write it like this:

    servicesMerged = services.merge(times, left_on='Ref_ID', right_on='Ref_Id', how='left')
    

    To fill in the blank times, you can use .fillna():

    servicesMerged['TIMES'] = servicesMerged['TIMES'].fillna(15)
    
       Ref_ID      Name  Ref_Id  TIMES
    0       1    Mowing     1.0   30.0
    1       2    Raking     2.0   15.0
    2       3   Blowing     3.0   10.0
    3       4  Trimming     NaN   15.0
    

    NB: If you were to have the Ref_Id column name to match in both tables (either both Ref_Id or both Ref_ID) you could then just use on='Ref_Id' instead of both left and right on, and then you wouldn't get the second Ref_Id column in the output.