pythonpandascsvawkdata-cleaning

How to merge two CSV files based on matching values in different columns and keep unmatched rows with placeholders?


I'm working on a data cleaning task and could use some help. I have two CSV files with thousands of rows each:

File A contains product shipment records. File B contains product descriptions and categories. Here’s a simplified example:

File A (shipments.csv):

shipment_id,product_code,quantity,date
S001,P123,10,2025-07-01
S002,P456,5,2025-07-02
S003,P789,8,2025-07-03 

File B (products.csv):

product_code,description,category
P123,Widget A,Tools
P456,Widget B,Hardware

I want to create a merged file where each row from File A is enriched with the matching product description and category from File B (based on product_code). If there's no match, I’d like to keep the row from File A and fill the missing columns with "N/A".

Expected Output:

shipment_id,product_code,quantity,date,description,category
S001,P123,10,2025-07-01,Widget A,Tools
S002,P456,5,2025-07-02,Widget B,Hardware
S003,P789,8,2025-07-03,N/A,N/A

I tried using pandas.merge() in Python but it drops unmatched rows unless I use how='left', and I’m not sure how to fill missing values properly.

Any help? Thanks in advance!


Solution

  • Yor way is correct, first you should merge two dfs, based on the product_code , then fill missing values with "N\A". Because when the left dataframe does not find its match on the right dataframe, it automatically puts NaN to those rows.

    import pandas as pd
    shipments= [
        ["S001", "P123", 10, "2025-07-01"],
        ["S002", "P456", 5, "2025-07-02"],
        ["S003", "P789", 8, "2025-07-03"]
    ]
    
    columns = ["shipment_id", "product_code", "quantity", "date"]
    df_shipments = pd.DataFrame(shipments, columns=columns)
    
    products = [
        ["P123", "Widget A", "Tools"],
        ["P456", "Widget B", "Hardware"]
    ]
    
    columns = ["product_code", "description", "category"]
    
    df_products = pd.DataFrame(products, columns=columns)
    
    merged_df = pd.merge(df_shipments, df_products,how = 'left').fillna("N/A")
    

    At the last line of the code, it is merging as you said and then we are using fillna to fill those missing values.

    Additional: Since you have product_code on both dataframes, it automatically matches those columns. But if you have more than one column that have same name on both dataframes, then it try to match both of them. So if you want to stick to the selected column, then better use left_on and right_on .

    Moreover, if you want only category from the right dataframe, you can use something like:

    merged_df = pd.merge(df_shipments, df_products[['product_code', 'category']], how = 'left').fillna("N/A") 
    

    This could be better as it does not alter the original dataframe of df_products .