pythonpandascsvkeyerror

KeyError when selecting pandas columns


I'm trying to read in a CSV file into a pandas dataframe and select a column, but keep getting a key error.

The file reads in successfully and I can view the dataframe in an iPython notebook, but when I want to select a column any other than the first one, it throws a key error.

I am using this code:

import pandas as pd

transactions = pd.read_csv('transactions.csv',low_memory=False, delimiter=',', header=0, encoding='ascii')
transactions['quarter']

Traceback

KeyError                                  Traceback (most recent call last)
Cell In[10], line 2
      1 transactions = pd.read_csv('transactions.csv',low_memory=False, delimiter=',', header=0, encoding='ascii')
----> 2 transactions['quarter']

File ~\anaconda3\envs\py312\Lib\site-packages\pandas\core\frame.py:3896, in DataFrame.__getitem__(self, key)
   3894 if self.columns.nlevels > 1:
   3895     return self._getitem_multilevel(key)
-> 3896 indexer = self.columns.get_loc(key)
   3897 if is_integer(indexer):
   3898     indexer = [indexer]

File ~\anaconda3\envs\py312\Lib\site-packages\pandas\core\indexes\base.py:3797, in Index.get_loc(self, key)
   3792     if isinstance(casted_key, slice) or (
   3793         isinstance(casted_key, abc.Iterable)
   3794         and any(isinstance(x, slice) for x in casted_key)
   3795     ):
   3796         raise InvalidIndexError(key)
-> 3797     raise KeyError(key) from err
   3798 except TypeError:
   3799     # If we have a listlike key, _check_indexing_error will raise
   3800     #  InvalidIndexError. Otherwise we fall through and re-raise
   3801     #  the TypeError.
   3802     self._check_indexing_error(key)

KeyError: 'quarter'

transactions.csv

Full data file

product_id, customer_id, store_id, promotion_id, month_of_year, quarter, the_year, store_sales, store_cost, unit_sales, fact_count
1,157,24,1869,12,'Q4',1997,'8.5500','2.9925','3.0000',1
1,456,15,0,6,'Q2',1997,'11.4000','4.3320','4.0000',1
1,638,11,0,9,'Q3',1997,'8.5500','2.9925','3.0000',1
1,916,7,0,4,'Q2',1997,'11.4000','4.9020','4.0000',1
1,923,15,0,7,'Q3',1997,'8.5500','2.7360','3.0000',1
1,1312,3,0,5,'Q2',1997,'8.5500','3.6765','3.0000',1
1,1565,24,0,9,'Q3',1997,'8.5500','4.1895','3.0000',1
1,2270,11,0,11,'Q4',1997,'8.5500','4.0185','3.0000',1
1,3065,3,0,11,'Q4',1997,'5.7000','2.5080','2.0000',1
1,3441,3,0,8,'Q3',1997,'8.5500','3.4200','3.0000',1
1,3528,17,0,10,'Q4',1997,'8.5500','3.8475','3.0000',1
1,4461,11,0,4,'Q2',1997,'8.5500','2.9925','3.0000',1

Solution

  • Use sep=r'\s*,\s*' to parse a file where the columns may have some number of spaces preceding or following the delimiter (e.g. , ):

    transactions = pd.read_csv('transactions.csv', sep=r'\s*,\s*',
                               header=0, encoding='ascii', engine='python')
    

    Prove:

    print(transactions.columns)
    

    Output:

    Index(['product_id', 'customer_id', 'store_id', 'promotion_id', 'month_of_year', 'quarter', 'the_year', 'store_sales', 'store_cost', 'unit_sales', 'fact_count'], dtype='object')
    

    Alternatively, remove unquoted spaces in the CSV file, and use your command (unchanged).