pythonpython-3.xpandasdataframepandas-merge

Duplicate rows and incorrect indicators in merged DataFrame with Pandas merge


The date & function is:

pd.DataFrame({'Column1': ['dummy_text1'],
              'Column2': [''],
              'Column3': [''],
              'Column4': ['dummy_text2'],
              'Column5': [''],
              'Column6': ['Dummy Role 1']
              })
left_df.merge(right_df, 
              on=['Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6'], 
              how='outer', 
              indicator=True)

The question is:

the result contains two rows, one as 'left_only' and one as 'right_only', instead of a single row with 'both' as indicator.

Other info.:

I have already tried the following steps to address the issue:

Edit:

Sharing code snippet: I forgot to mention this part, this is to resolve case of the values:

left_df['Column1_U'] = left_df['Column1'].copy()
left_df['Column6_U'] = left_df['Column6'].copy()
right_df['Column1_U'] = right_df['Column1'].copy()
right_df['Column6_U'] = right_df['Column6'].copy()

this is the rest of the code:

columns_to_exclude = ['Column1_U', 'Column6_U']

left_df = left_df.apply(lambda s: s.str.lower() if s.name not in columns_to_exclude else s)
right_df = right_df.apply(lambda s: s.str.lower() if s.name not in columns_to_exclude else s)

merged_df = merge(left = left_df
                ,right = right_df
                ,on=['Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6']
                ,how='outer'
                ,indicator=True)

but the output that i'm getting is this: Column1,Column2,Column3,Column4,Column5,Column6,Column1_U_x,Column6_U_x,Column1_U_y,Column6_U_y,_merge dummy_text1,,,dummy_text1,,dummy role 1,Dummy_text1,Dummy Role 1,,,left_only dummy_text1,,,dummy_text1,,dummy role 1,,,Dummy_text1,Dummy Role 1,right_only


Solution

  • It is solved. The issue turned out to be with NULL value and '', both these dataframes were getting created from SQL tables and in the query I was using ISNULL(Column, '') as Column in one and NULL AS Column in the other. Turns out after doing dataframe.fillna(''), it was treating the '' differently between both dataframes, after I updated the queries the issue was resolved.