
What's the difference between dtype and converters in pandas.read_csv?


The pandas function read_csv() reads a .csv file.

According to its documentation:

dtype : Type name or dict of column -> type, default None. Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (Unsupported with engine='python')

and

converters : dict, default None. Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

When using this function, I can call either pandas.read_csv('file',dtype=object) or pandas.read_csv('file',converters=object). The name converters suggests that it converts the data type, but then what does dtype do?


Solution

  • The semantic difference is that dtype allows you to specify how to treat the values, for example, either as numeric or string type.

    converters lets you pass a conversion function that is applied to each value in a column as it is read, e.g., parsing a string value to datetime or to some other desired type.
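To make the contrast concrete, here is a minimal sketch. The sample data and the strip_zeros helper are made up for illustration and are not from the question. It reads the same column once with dtype and once with a converter:

```python
import io
import pandas as pd

# Hypothetical sample data: a column whose values have leading zeros.
csv_text = "code,amount\n007,1.50\n042,2.25\n"

# dtype only declares the target type; the text "007" is kept as a string.
by_dtype = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})

# A converter is a function that receives each raw cell as a string
# and returns the value that should be stored instead.
def strip_zeros(cell):
    return int(cell.lstrip("0") or "0")

by_converter = pd.read_csv(io.StringIO(csv_text), converters={"code": strip_zeros})

print(by_dtype["code"].tolist())      # strings, leading zeros kept
print(by_converter["code"].tolist())  # integers returned by strip_zeros
```

In short, dtype answers "what type should this column have?", while a converter answers "what should be done to each value?".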

    Here we see that pandas tries to sniff the types:

    In [2]:
    import io
    import pandas as pd

    t="""int,float,date,str
    001,3.31,2015/01/01,005"""
    df = pd.read_csv(io.StringIO(t))
    df.info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null int64
    float    1 non-null float64
    date     1 non-null object
    str      1 non-null int64
    dtypes: float64(1), int64(2), object(1)
    memory usage: 40.0+ bytes
    

    You can see from the above that 001 and 005 are treated as int64 (so their leading zeros are lost), while the date string stays as object, i.e. str.

    If we say everything is object then essentially everything is str:

    In [3]:
    pd.read_csv(io.StringIO(t), dtype=object).info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null object
    float    1 non-null object
    date     1 non-null object
    str      1 non-null object
    dtypes: object(4)
    memory usage: 40.0+ bytes
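As a quick check of the "everything is str" claim, the following sketch reuses the same sample text t and inspects the Python type of each value rather than only the column dtype:

```python
import io
import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

# With dtype=object no type inference happens, so each cell stays as text.
df = pd.read_csv(io.StringIO(t), dtype=object)
row = df.iloc[0]

# Map each column name to the Python type of its single value.
print({col: type(value).__name__ for col, value in row.items()})

# The leading zeros survive because the values were never parsed as numbers.
print(row["int"], row["str"])
```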
    

    Here we force the int column to object (str) with dtype and use parse_dates to parse the date column as datetime:

    In [6]:
    pd.read_csv(io.StringIO(t), dtype={'int':'object'}, parse_dates=['date']).info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null object
    float    1 non-null float64
    date     1 non-null datetime64[ns]
    str      1 non-null int64
    dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
    memory usage: 40.0+ bytes
    

    Similarly, we could have passed the to_datetime function as a converter for the date column:

    In [5]:
    pd.read_csv(io.StringIO(t), converters={'date':pd.to_datetime}).info()
    
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null int64
    float    1 non-null float64
    date     1 non-null datetime64[ns]
    str      1 non-null int64
    dtypes: datetime64[ns](1), float64(1), int64(2)
    memory usage: 40.0 bytes
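One more point worth knowing: the read_csv documentation states that if converters are specified, they are applied instead of dtype conversion. The sketch below gives the same column both a dtype and a converter. The lambda is a made-up converter for illustration, and the warnings are captured only so they can be printed:

```python
import io
import warnings
import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

# Capture any warning pandas emits about the conflicting settings.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df = pd.read_csv(
        io.StringIO(t),
        dtype={"int": str},                              # asks for text
        converters={"int": lambda cell: int(cell) + 1},  # asks for a number
    )

# The converter's result is what ends up in the column.
print(df.loc[0, "int"])
print([str(w.message) for w in caught])
```

So for any given column, pick one of the two: dtype when the raw text only needs to be interpreted as a particular type, converters when each value needs custom processing.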