pythondataframenumpycomparedifferentiation

Automating Winmerge comparison in Python


I require an assistant with coding a comparison between two tables in Python, that is currently done in winmerge.

The code is as follows

import pandas as pd

Last week's table

df1=pd.read_csv(r"C:\Users\ri0a\OneDrive - Department of Environment, Land, Water and Planning\Python practice\pvmodules+_210326.csv")

This week table with new model numbers, and expire dates

df2=pd.read_csv(r"C:\Users\ri0a\OneDrive - Department of Environment, Land, Water and Planning\Python practice\pvmodules+_210401.csv")

The table head is as below

enter image description here

the third column is PV_module certificate: Expiry date. I want to set a logic similar to excel logic '=IF (D2<DATEVALUE("19/04/2021"),"Expired","OK). The objective here is to delete the entire rows where the expiry date is below a specific date/ today's date.

Next,Importing dataframe_diff package

from dataframe_diff import dataframe_diff

Executing the difference

d1_column,d2_additional=dataframe_diff(df1,df2,key=['PV Module Certificate: Licensee/Certificate Holder Account','Model Number/s'])

With this package d2_additional shows if there are new rows associated with model numbers added this week compared to last week. enter image description here

However, I am trying to replicate the following output

enter image description here

The tasks involved are

  1. If some model, in this case a row, was included in the last week's table, but is missing in current week's table, I want to assign a new field "Expired" in a new column "Status" next to it./ Or create a new dataframe, d2_expires, from only those missing rows.
  2. Another dataframe, where the rows or the product models that were missing last week but added this week remains...As d2_additional.
  3. A third dataframe, where any changes (for example expiry date) for same rows (same cerificate + same model but different new expiry date) is captures as d3_comparison.

Now: as with

d2_expires = merged_df[merged_df._merge == 'left_only']

and with

d2_additional = merged_df[merged_df._merge == 'right_only']

I get the same output. Same rows are returned, which should not be the case. As seen from the screen below

Expires output

This is the same as additions d2_aDDITIONS IS THE SAME OUTPUT AS D2_EXPIRES

And Finally, I get an error with d2_comaprison.

d3_comparison = merged_df[merged_df._merge == 'both'].\
        loc[lambda x: x.PV Module Certificate: Expiry Date_last_week != x.PV Module Certificate: Expiry Date_this_week]

error


Solution

  • You have to ensure to convert the dates into datetime format after loading the data, and rename the columns to something easier to work with (for example 'cert_holder', 'model_no','approval_date','expiry_date')

    I want to set a logic similar to excel logic '=IF (D2<DATEVALUE("19/04/2021"),"Expired","OK). The objective here is to delete the entire rows where the expiry date is below a specific date/ today's date.

    This (removing) can be done with:

    df = df[df['expiry_date'] >= pd.Timestamp('today')]
    
    # Or
    
    df = df[df['expiry_date'] >= pd.Timestamp('2021-04-23')]
    

    But this only works, if your expiry dates are in datetime format.

    Next merge the two dataframes:

    merged_df = pd.merge(df1,df2, how='outer', on=['cert_holder','model_no'],\
                         suffixes=['_last_week','_this_week'], indicator=True)
    

    If some model, in this case a row, was included in the last week's table, but is missing in current week's table, I want to assign a new field "Expired" in a new column "Status" next to it./ Or create a new dataframe, d2_expires, from only those missing rows.

    d2_expires = merged_df[merged_df._merge == 'left_only']
    

    Another dataframe, where the rows or the product models that were missing last week but added this week remains...As d2_additional.

    d2_additional = merged_df[merged_df._merge == 'right_only']
    

    A third dataframe, where any changes (for example expiry date) for same rows (same certificate + same model but different new expiry date) is captures as d3_comparison.

    d3_comparison = merged_df[merged_df._merge == 'both'].\
            loc[lambda x: x.expiry_date_last_week != x.expiry_date_this_week]