Tags: python, pandas, casting, type-conversion, dtype

Pandas reading csv as string type


I have a data frame with alpha-numeric keys which I want to save as a csv and read back later. For various reasons I need to read this key column explicitly as strings: some keys are strictly numeric, or even worse, things like 1234E5, which Pandas interprets as a float. This obviously makes the key completely useless.
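
For example, a minimal sketch of the float mangling, using an in-memory buffer in place of the CSV file:

import io
import pandas as pd

# A purely numeric-looking key in scientific notation comes back as a float.
buf = io.StringIO("key,value\n1234E5,x\n")
print(pd.read_csv(buf)['key'])  # 123400.0, not the string '1234E5'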

The problem is that when I specify a string dtype for the data frame, or for any column of it, I just get garbage back. Here is some example code:

import numpy as np
import pandas as pd

savefile = 'data.csv'  # placeholder path for this example
df = pd.DataFrame(np.random.rand(2, 2),
                  index=['1A', '1B'],
                  columns=['A', 'B'])
df.to_csv(savefile)

The data frame looks like:

           A         B
1A  0.209059  0.275554
1B  0.742666  0.721165

Then I read it like so:

df_read = pd.read_csv(savefile, dtype=str, index_col=0)

and the result is:

   A  B
B  (  <

Is this a problem with my computer, something I'm doing wrong here, or just a bug?


Solution

  • Update: this has been fixed: as of 0.11.1, passing str/np.str is equivalent to using object.

    Use the object dtype:

    In [11]: pd.read_csv('a', dtype=object, index_col=0)
    Out[11]:
                          A                     B
    1A  0.35633069074776547     0.745585398803751
    1B  0.20037376323337375  0.013921830784260236
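
    Since that fix, passing dtype=str does the same thing. A quick check, assuming a pandas version at or past 0.11.1:

    import pandas as pd

    df_read = pd.read_csv('a', dtype=str, index_col=0)
    print(df_read.dtypes)  # every column comes back as object, i.e. strings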
    

    or better yet, just don't specify a dtype:

    In [12]: pd.read_csv('a', index_col=0)
    Out[12]:
               A         B
    1A  0.356331  0.745585
    1B  0.200374  0.013922
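
    Note that this works here only because the index values are alpha-numeric; the sniffer leaves them alone while converting the data columns. A quick check under the same setup:

    import pandas as pd

    df = pd.read_csv('a', index_col=0)
    print(df.dtypes)       # A and B are inferred as float64
    print(df.index.dtype)  # object: '1A' and '1B' survive as strings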
    

    but bypassing the type sniffer and truly returning only strings requires a hacky use of converters:

    In [13]: pd.read_csv('a', converters={i: str for i in range(100)}, index_col=0)
    Out[13]:
                          A                     B
    1A  0.35633069074776547     0.745585398803751
    1B  0.20037376323337375  0.013921830784260236
    

    where 100 is some number equal to or greater than your total number of columns.
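
    A sketch of one way to avoid hard-coding 100: read just the header to count the columns, then build the converters from that count (assuming the file has a header row):

    import pandas as pd

    n_cols = len(pd.read_csv('a', nrows=0).columns)  # header only, no data rows
    df = pd.read_csv('a', converters={i: str for i in range(n_cols)},
                     index_col=0)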

    It's best to avoid the str dtype; see, for example, here.