Tags: python, data-cleaning, keyerror

Error filtering multiple csv files based on rows


I have a folder containing 20 CSV files. Each file has about 10 columns and thousands of rows. The files look something like this:

gene p-value xyz
acan 0.05 123
mmp2 0.02 456
mmp9 0.07 789
nnos 0.09 123
gfap 0.01 456

I have written the following script to go through each file, keep only the rows for the genes of interest (in this case mmp2 and mmp9), and save the CSV with just those rows.

# the goal is to edit and save the csv files so they only contain the genes of interest
import glob
import os

import pandas as pd

path = '/Users/adriana/Library/Documents/raw_data'
all_files = glob.glob(os.path.join(path, "*.csv"))  # make list of file paths
genes = ["mmp2", "mmp9"]

for file in all_files:
    df = pd.read_csv(file, delimiter='\t')  # files are tab-delimited
    cleaned = df[df['gene'].isin(genes)]
    cleaned.to_csv(file, sep='\t', index=False)  # overwrite with only the rows of interest

However, I get the following error:

KeyError: 'gene'

I'm not sure why I am getting this error, since gene is a column in each of my files.


Solution

  • Try skipping the DataFrames that are missing a gene column, and check whether the column names exactly match the word gene by printing df.info(). That way the code won't fail, and you can see which files are causing the issue:

    if 'gene' in df.columns:
        cleaned = df[df['gene'].isin(genes)]
        ...
    else:
        print(file)       # which file is missing the column
        print(df.info())  # inspect its actual column names
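A frequent cause of this kind of KeyError is whitespace or casing differences in the header row (e.g. "gene " instead of "gene"). A minimal sketch of normalizing column names before filtering, using a hypothetical in-memory sample rather than your actual files:

```python
import io

import pandas as pd

# Hypothetical tab-delimited sample with a stray trailing space in the
# "gene " header -- a common cause of KeyError: 'gene'.
raw = "gene \tp-value\txyz\nacan\t0.05\t123\nmmp2\t0.02\t456\nmmp9\t0.07\t789\n"
df = pd.read_csv(io.StringIO(raw), delimiter="\t")

# Normalize headers: strip surrounding whitespace and lowercase them.
df.columns = df.columns.str.strip().str.lower()

genes = ["mmp2", "mmp9"]
cleaned = df[df["gene"].isin(genes)]
print(list(cleaned["gene"]))  # -> ['mmp2', 'mmp9']
```

If the whitespace-stripped names match, you can apply the same normalization line inside your loop before the isin filter.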