I require an assistant with coding a comparison between two tables in Python, that is currently done in winmerge.
The code is as follows
import pandas as pd
Last week's table
df1=pd.read_csv(r"C:\Users\ri0a\OneDrive - Department of Environment, Land, Water and Planning\Python practice\pvmodules+_210326.csv")
This week table with new model numbers, and expire dates
df2=pd.read_csv(r"C:\Users\ri0a\OneDrive - Department of Environment, Land, Water and Planning\Python practice\pvmodules+_210401.csv")
The table head is as below
the third column is PV_module certificate: Expiry date. I want to set a logic similar to excel logic '=IF (D2<DATEVALUE("19/04/2021"),"Expired","OK). The objective here is to delete the entire rows where the expiry date is below a specific date/ today's date.
Next,Importing dataframe_diff package
from dataframe_diff import dataframe_diff
Executing the difference
d1_column,d2_additional=dataframe_diff(df1,df2,key=['PV Module Certificate: Licensee/Certificate Holder Account','Model Number/s'])
With this package d2_additional shows if there are new rows associated with model numbers added this week compared to last week.
However, I am trying to replicate the following output
The tasks involved are
Now: as with
d2_expires = merged_df[merged_df._merge == 'left_only']
and with
d2_additional = merged_df[merged_df._merge == 'right_only']
I get the same output. Same rows are returned, which should not be the case. As seen from the screen below
And Finally, I get an error with d2_comaprison.
d3_comparison = merged_df[merged_df._merge == 'both'].\
loc[lambda x: x.PV Module Certificate: Expiry Date_last_week != x.PV Module Certificate: Expiry Date_this_week]
You have to ensure to convert the dates into datetime format after loading the data, and rename the columns to something easier to work with (for example 'cert_holder', 'model_no','approval_date','expiry_date')
I want to set a logic similar to excel logic '=IF (D2<DATEVALUE("19/04/2021"),"Expired","OK). The objective here is to delete the entire rows where the expiry date is below a specific date/ today's date.
This (removing) can be done with:
df = df[df['expiry_date'] >= pd.Timestamp('today')]
# Or
df = df[df['expiry_date'] >= pd.Timestamp('2021-04-23')]
But this only works, if your expiry dates are in datetime format.
Next merge the two dataframes:
merged_df = pd.merge(df1,df2, how='outer', on=['cert_holder','model_no'],\
suffixes=['_last_week','_this_week'], indicator=True)
If some model, in this case a row, was included in the last week's table, but is missing in current week's table, I want to assign a new field "Expired" in a new column "Status" next to it./ Or create a new dataframe, d2_expires, from only those missing rows.
d2_expires = merged_df[merged_df._merge == 'left_only']
Another dataframe, where the rows or the product models that were missing last week but added this week remains...As d2_additional.
d2_additional = merged_df[merged_df._merge == 'right_only']
A third dataframe, where any changes (for example expiry date) for same rows (same certificate + same model but different new expiry date) is captures as d3_comparison.
d3_comparison = merged_df[merged_df._merge == 'both'].\
loc[lambda x: x.expiry_date_last_week != x.expiry_date_this_week]