I am using pandas.read_csv to load a CSV file. I want the file read mostly as-is, avoiding automatic data type conversions, except for one specific column (send_date) that I want parsed as a datetime.
The reason I want most columns read as strings or objects is to preserve data such as zip codes with leading zeros (04321) and Boolean-like values (true, false, unknown) that are stored as strings.
Using read_csv without specifying dtype causes unwanted type conversions.
df = pandas.read_csv("test.csv", parse_dates=['send_date'])
# name: Madeline (type: object) - correct
# zip_code: 4321 (type: int64) - wrong (missing leading 0)
# send_date: 2025-04-13 00:00:00 (type: datetime64[ns]) - correct
# is_customer: True (type: bool) - wrong (not a string)
Using dtype=object correctly preserves zip_code and is_customer as string-like values, but it prevents send_date from being set to type datetime64[ns].
df = pandas.read_csv("test.csv", dtype=object, parse_dates=['send_date'])
# name: Madeline (type: object) - correct
# zip_code: 04321 (type: object) - correct
# send_date: 2025-04-13 00:00:00 (type: object) - wrong (not datetime)
# is_customer: true (type: object) - correct
Manually setting the dtype for send_date to datetime64 raises an error.
df = pandas.read_csv("test.csv", dtype={"send_date":"datetime64"}, parse_dates=['send_date'])
# TypeError: the dtype datetime64 is not supported for parsing, pass this column using parse_dates instead
Setting dtype=str causes send_date to be returned as an integer timestamp instead of a date.
df = pandas.read_csv("test.csv", dtype=str, parse_dates=['send_date'])
# name: Madeline (type: object) - correct
# zip_code: 04321 (type: object) - correct
# send_date: 1744502400000000000 (type: object) - wrong (not a date)
# is_customer: true (type: object) - correct
Input file (test.csv)
name | zip_code | send_date | is_customer |
---|---|---|---|
Madeline | 04321 | 2025-04-13 | true |
Theo | 32255 | 2025-04-08 | true |
Granny | 84564 | 2025-04-15 | false |
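To reproduce the examples, the file is assumed to be plain comma-separated text matching the table above; a minimal sketch that writes it out:
# Write test.csv (assumed contents, taken from the table above)
csv_text = """name,zip_code,send_date,is_customer
Madeline,04321,2025-04-13,true
Theo,32255,2025-04-08,true
Granny,84564,2025-04-15,false
"""
with open("test.csv", "w") as f:
    f.write(csv_text)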
Full code to reproduce the results above:
import pandas

def print_first_row_value_and_dtype(df: pandas.DataFrame):
    row = df.iloc[0]
    for col in df.columns:
        print(f"{col}: {row[col]} (type: {df[col].dtype})")

filename = 'test.csv'

# No dtype: zip_code loses its leading zero, is_customer becomes bool
df = pandas.read_csv(filename, parse_dates=['send_date'])
print_first_row_value_and_dtype(df)

# dtype=object: send_date stays object instead of datetime64[ns]
df = pandas.read_csv(filename, dtype=object, parse_dates=['send_date'])
print_first_row_value_and_dtype(df)

# dtype=str: send_date comes back as an integer timestamp string
df = pandas.read_csv(filename, dtype=str, parse_dates=['send_date'])
print_first_row_value_and_dtype(df)

# Explicit datetime64 dtype: not supported by read_csv
dtypes = {"name": "object", "zip_code": "object", "send_date": "datetime64", "is_customer": "object"}
df = pandas.read_csv(filename, dtype=dtypes, parse_dates=['send_date'])  # raises TypeError
How can I make pandas.read_csv() parse one column (send_date) as a datetime while treating all other columns as strings or objects, so that unwanted data type conversions are avoided?
Call read_csv with dtype="string" and parse_dates=['send_date'].
Code
import pandas
df = pandas.read_csv("test.csv", dtype="string", parse_dates=['send_date'])
print(df.dtypes)
# name string[python]
# zip_code string[python]
# send_date datetime64[ns]
# is_customer string[python]
# dtype: object
print(df)
# name zip_code send_date is_customer
# 0 Madeline 04321 2025-04-13 true
# 1 Theo 32255 2025-04-08 true
# 2 Granny 84564 2025-04-15 false
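Note that dtype="string" gives you the pandas nullable string dtype (string[python] above) rather than plain object columns. If you prefer ordinary str/object values, an alternative sketch (not part of the answer above) is to read every column as str and convert the one column afterwards with pandas.to_datetime:
import pandas

# Read everything as str (leading zeros and "true"/"false" strings are preserved),
# then convert only send_date to datetime64[ns] after loading.
df = pandas.read_csv("test.csv", dtype=str)
df["send_date"] = pandas.to_datetime(df["send_date"])
print(df.dtypes)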
Input file (test.csv)
name | zip_code | send_date | is_customer |
---|---|---|---|
Madeline | 04321 | 2025-04-13 | true |
Theo | 32255 | 2025-04-08 | true |
Granny | 84564 | 2025-04-15 | false |