I am trying to merge three dataframes using intersection(). How can I check that all the dataframes exist/are initialised before running intersection(), without multiple if-else blocks? If a dataframe is not assigned, it should simply be left out of the intersection(). At the moment I sometimes get the error UnboundLocalError: local variable 'df_2' referenced before assignment, because file2 does not exist.
Or is there any other easy way to achieve the below?
Below is my approach:
if os.path.exists(file1):
    df_1 = pd.read_csv(file1, header=None, names=header_1, sep=',', index_col=None)
if os.path.exists(file2):
    df_2 = pd.read_csv(file2, header=None, names=header_2, sep=',', index_col=None)
if os.path.exists(file3):
    df_3 = pd.read_csv(file3, header=None, names=header_3, sep=',', index_col=None)

common_columns = df_1.columns.intersection(df_2.columns).intersection(df_3.columns)
filtered_1 = df_1[common_columns]
filtered_2 = df_2[common_columns]
filtered_3 = df_3[common_columns]
concatenated_df = pd.concat([filtered_1, filtered_2, filtered_3], ignore_index=True)
Your code is already quite good, but it contains a lot of repetition. To make it cleaner, you could use a list comprehension, i.e. [function(x) for x in a_list]:
files = [file1, file2, file3]
headers = [header_1, header_2, header_3]

# Read only the files that actually exist, so no dataframe is ever left unassigned
dfs = [pd.read_csv(f, header=None, names=h, sep=',') for f, h in zip(files, headers) if os.path.exists(f)]

if dfs:
    # Intersect the column sets of all dataframes that were loaded
    common_columns = set.intersection(*(set(df.columns) for df in dfs))
    concatenated_df = pd.concat([df[list(common_columns)] for df in dfs], ignore_index=True)
else:
    # None of the input files exists
    concatenated_df = pd.DataFrame()
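One caveat: Python sets are unordered, so common_columns may not keep the original column order. If that matters, here is a minimal sketch (assuming the same files and headers lists as above) that reduces over pandas Index.intersection instead, which with the default sort=False preserves the column order of the first loaded dataframe:

from functools import reduce

dfs = [pd.read_csv(f, header=None, names=h, sep=',') for f, h in zip(files, headers) if os.path.exists(f)]

if dfs:
    # Index.intersection keeps the order of the calling index (sort=False by default)
    common_columns = reduce(lambda left, right: left.intersection(right), (df.columns for df in dfs))
    concatenated_df = pd.concat([df[common_columns] for df in dfs], ignore_index=True)
else:
    concatenated_df = pd.DataFrame()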