I'm using pandas to load data from Excel with the resulting DataFrame containing both strings and dates. The columns containing strings are of dtype "object" while the date-columns are of dtype "datetime64[ns]". At some point in my code I need to convert one column from datetime to string for writing back to Excel, but pandas will not let me do that if I try it the way that would be most obvious to me and seems to be the recommended way of doing it according to documentation: using .loc to get the column to be changed and assign them with the same column converted to strings.
I have found ways to circumvent the problem and get pandas to do what I need, but either this is a bug or I do not understand some underlying mechanic which could come back to bite me in the longer run, hence my question.
The code to reproduce this (occurs in both pandas 2.0.0 and 2.0.1, this might cause the problem) is as follows (in the actual DataFrame I am using there are many more than a single column):
import pandas as pd
not_yet_datetime_df = pd.DataFrame([["2023-01-06", "2023-01-06", "2023-01-06", "2023-01-06", "2023-01-06"]]).T
datetime_df = not_yet_datetime_df.astype("datetime64[ns]")
datetime_df.loc[:, 0] = datetime_df.loc[:, 0].dt.strftime("%d.%m.%Y")
datetime_df.loc[:, 0] = datetime_df.loc[:, 0].astype("object") # neither of these two will work for me
print(datetime_df.dtypes) # will return "datetime64[ns]" for this single column
There are multiple ways to circumvent this that work for me, including simply replacing line 5 with
datetime_df[0] = datetime_df.loc[:, 0].dt.strftime("%d.%m.%Y")
(omitting the .loc left of the equals sign) and I at least can get the column to being "object" dtype with datetime_df = datetime_df.astype({0:"object"})
, but I don't quite understand why especially the first solution works and what I misunderstood about .loc - or about datetimes in general.
I read a bit into the pandas 2.0.0 change on returning views vs. copies but to my (limited) understanding this should not be affected by any of the 2.0.0 changes.
Could anyone help me understand what's happening here under the hood? I like using .loc over assigning just with []-brackets and I feel like it's not as intuitive as I had hoped.
See also Does .loc[:, ['A','B']] assignment allow to change the dtype of the columns?
- .loc
tries to cast back to the original type. There's a note in the docs on this, that might be hard to find:
When trying to convert a subset of columns to a specified type using astype() and loc(), upcasting occurs.
Anyways, that's why it keeps being datetime64[ns]. For demonstration, if you change the datetime format to something that pandas' parser doesn't accept, "upcasting" fails and the dtype is kept:
import pandas as pd
not_yet_datetime_df = pd.DataFrame([["2023-01-06", "2023-01-06", "2023-01-06", "2023-01-06", "2023-01-06"]]).T
datetime_df = not_yet_datetime_df.astype("datetime64[ns]")
datetime_df.loc[:, 0] = datetime_df.loc[:, 0].dt.strftime("%d. asdf %m. xx %Y")
print(datetime_df.dtypes)
# 0 object
# dtype: object
In general, I would still argue that in this case, the most basic [ ]
gets you safely where you want to. If you create a new column, you're neither slicing nor selecting nor indexing. Same goes for if you replace a column (e.g. string with datetime dtype Series; ok you're selecting something here...). So I don't see the point of using loc
here in the first place.