[SOLVED] Merge more than 2 dataframes if they exist and initialised

Merge more than 2 dataframes if they exist and initialised

I am trying to merge three dataframes using intersection(). How can we check that all dataframes exists/initialised before running the intersection() without multiple if-else check blocks. If any dataframe is not assigned, then don't take it while doing the intersection(). Sometimes I am getting error - UnboundLocalError: local variable 'df_2' referenced before assignment, because file2 does not exist.

OR is there any other easy way to achieve below?

Below is my approach:

if os.path.exists(file1):
        df_1 = pd.read_csv(file1, header=None, names=header_1, sep=',', index_col=None)
if os.path.exists(file2):
        df_2 = pd.read_csv(file2, header=None, names=header_2, sep=',', index_col=None)
if os.path.exists(file3):
        df_3 = pd.read_csv(file3, header=None, names=header_3, sep=',', index_col=None)

common_columns = df_1.columns.intersection(df_2.columns).intersection(df_3.columns)
filtered_1 = df_1[common_columns]
filtered_2 = df_2[common_columns]
filtered_3 = df_3[common_columns]
concatenated_df = pd.concat([filtered_1, filtered_2, filtered_3], ignore_index=True)

Solution

Your code is already very good. You have many repeated elements in your current version. To make it cleaner, you could use list comprehension like [function(x) for x in a_list]

files = [file1, file2, file3]
headers = [header_1, header_2, header_3]

dfs = [pd.read_csv(f, header=None, names=h, sep=',') for f, h in zip(files, headers) if os.path.exists(f)]

if dfs:
    common_columns = set.intersection(*(set(df.columns) for df in dfs))
    concatenated_df = pd.concat([df[list(common_columns)] for df in dfs], ignore_index=True)
else:
    concatenated_df = pd.DataFrame()