Tags: python, pandas, csv, csv-import

pandas read_csv and filter columns with usecols


I have a CSV file that isn't read correctly by pandas.read_csv when I filter the columns with usecols and use a multi-column index.

import pandas as pd
csv = r"""dummy,date,loc,x
   bar,20090101,a,1
   bar,20090102,a,3
   bar,20090103,a,5
   bar,20090101,b,1
   bar,20090102,b,3
   bar,20090103,b,5"""

# Write the sample data to disk so read_csv can load it by path
with open('foo.csv', 'w') as f:
    f.write(csv)

df1 = pd.read_csv('foo.csv',
        header=0,
        names=["dummy", "date", "loc", "x"], 
        index_col=["date", "loc"], 
        usecols=["dummy", "date", "loc", "x"],
        parse_dates=["date"])
print(df1)

# Ignore the dummy columns
df2 = pd.read_csv('foo.csv', 
        index_col=["date", "loc"], 
        usecols=["date", "loc", "x"], # <----------- Changed
        parse_dates=["date"],
        header=0,
        names=["dummy", "date", "loc", "x"])
print(df2)

I expected df1 and df2 to be the same except for the missing dummy column, but in df2 the columns come in mislabeled. Also, the date is not being parsed as a date in df2.

In [118]: %run test.py
               dummy  x
date       loc
2009-01-01 a     bar  1
2009-01-02 a     bar  3
2009-01-03 a     bar  5
2009-01-01 b     bar  1
2009-01-02 b     bar  3
2009-01-03 b     bar  5
              date
date loc
a    1    20090101
     3    20090102
     5    20090103
b    1    20090101
     3    20090102
     5    20090103

Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step, but I'm trying to understand what is going wrong. I'm using pandas 0.10.1.
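For reference, the workaround described above (read every column, then drop dummy afterwards) might look like the sketch below. It assumes a recent pandas release where DataFrame.drop accepts the columns keyword (not available in 0.10.1) and Python 3; the variable name csv_text and the use of StringIO instead of foo.csv are illustrative choices, not part of the original script.

```python
from io import StringIO

import pandas as pd

# Same sample data as in the question, held in memory instead of foo.csv
csv_text = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

# Read every column using the header row for names (no names=, no usecols=)
df = pd.read_csv(StringIO(csv_text),
                 header=0,
                 index_col=["date", "loc"],
                 parse_dates=["date"])

# Drop the unwanted column after the fact
df = df.drop(columns=["dummy"])
print(df)
```

This avoids the mislabeling, at the cost of parsing the dummy column only to discard it.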

edit: fixed bad header usage.


Solution

  • The solution lies in understanding how read_csv's keyword arguments interact here.

    Because the file has a header row, passing header=0 is sufficient; additionally passing names appears to confuse pd.read_csv.

    Removing names from the second call gives the desired output:

    import pandas as pd
    from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
    
    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""
    
    df = pd.read_csv(StringIO(csv),
            header=0,
            index_col=["date", "loc"], 
            usecols=["date", "loc", "x"],
            parse_dates=["date"])
    print(df)
    

    Which gives us:

                    x
    date       loc
    2009-01-01 a    1
    2009-01-02 a    3
    2009-01-03 a    5
    2009-01-01 b    1
    2009-01-02 b    3
    2009-01-03 b    5
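For contrast, names is meant for files that have no header row at all. The following is a minimal sketch of that case, assuming a recent pandas release and Python 3; the headerless sample data and the variable name csv_no_header are made up for illustration and do not come from the question.

```python
from io import StringIO

import pandas as pd

# Hypothetical variant of the data with the header row removed
csv_no_header = """bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5"""

# header=None says the first line is data; names supplies the labels,
# and usecols then filters by those labels
df_named = pd.read_csv(StringIO(csv_no_header),
                       header=None,
                       names=["dummy", "date", "loc", "x"],
                       usecols=["date", "loc", "x"],
                       index_col=["date", "loc"],
                       parse_dates=["date"])
print(df_named)
```

Here names and usecols do not conflict with a header row, because there is none to infer labels from.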