pythoncsvpandascsvkit

Python Pandas compare CSV keyerror


I am using Python Pandas to try and match the references from CSV2 to the data in CSV1 and create a new output file.

CSV1

reference,name,house
234 8A,john,37
564 68R,bill,3
RT4 VV8,kate,88
76AA,harry ,433

CSV2

reference
234 8A
RT4 VV8

CODE

 import pandas as pd
    df1 = pd.read_csv(r'd:\temp\data1.csv')
    df2 = pd.read_csv(r'd:\temp\data2.csv')
    df3 = pd.merge(df1,df2, on= 'reference', how='inner')
    df3.to_csv('outpt.csv')

I am getting a keyerror for reference when I run it, could it be the spaces in the data that is causing the issue? The data is comma delimited.


Solution

  • most probably you have either leading or trailing white spaces in reference column after reading your CSV files.

    you can check it in this way:

    print(df1.columns.tolist())
    print(df2.columns.tolist())
    

    you can "fix" it by adding sep=r'\s*,\s*' parameter to your pd.read_csv() calls

    Example:

    In [74]: df1
    Out[74]:
      reference    name  house
    0    234 8A    john     37
    1   564 68R    bill      3
    2   RT4 VV8    kate     88
    3      76AA  harry     433
    
    In [75]: df2
    Out[75]:
      reference
    0     234 8A
    1    RT4 VV8
    
    In [76]: df2.columns.tolist()
    Out[76]: ['reference ']
    
    In [77]: df1.columns.tolist()
    Out[77]: ['reference', 'name', 'house']
    
    In [78]: df1.merge(df2, on='reference')
    
    ...
    
    KeyError: 'reference'
    

    fixing df2:

    data = """\
    reference 
    234 8A
    RT4 VV8"""
    df2 = pd.read_csv(io.StringIO(data), sep=r'\s*,\s*')
    

    now it works:

    In [80]: df1.merge(df2, on='reference')
    Out[80]:
      reference  name  house
    0    234 8A  john     37
    1   RT4 VV8  kate     88