I'm using the Netflix Movies and TV Shows dataset to better understand pandas.
The column date_added is in the format: "September 21, 2024" which, as I understand, would be parsed as "%B-%d-%Y". Simply Reading with pd.read_csv() returns this column as an object type.
Using parse_dates=['date_added']
doesn't seem to work, and neither specifying the format. Here is my code:
df = pd.read_csv('netflix_titles.csv', parse_dates=['date_added'], date_format='%B-%d-%Y')
df.info()
resulting in:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8807 non-null object
1 type 8807 non-null object
2 title 8807 non-null object
3 director 6173 non-null object
4 cast 7982 non-null object
5 country 7976 non-null object
6 date_added 8797 non-null object
7 release_year 8807 non-null int64
8 rating 8803 non-null object
9 duration 8804 non-null object
10 listed_in 8807 non-null object
11 description 8807 non-null object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
What am I doing wrong?
I see several problems here.
date_format
(as wrote @iBeMeltin and @ayki)nan
valuesHow to check:
df = pd.read_csv('netflix_titles.csv')
for rec in df.to_dict('records'):
try:
datetime.strptime(rec['date_added'], '%B %d, %Y')
except (TypeError, ValueError):
print(rec)
# {'show_id': 's6067', 'type': 'TV Show', 'title': "A Young Doctor's Notebook and Other Stories", 'director': nan, 'cast': 'Daniel Radcliffe, Jon Hamm, Adam Godley, Christopher Godwin, Rosie Cavaliero, Vicki Pepperdine, Margaret Clunie, Tim Steed, Shaun Pye', 'country': 'United Kingdom', 'date_added': nan ...
# {'show_id': 's8756', 'type': 'TV Show', 'title': 'Women Behind Bars', 'director': nan, 'cast': nan, 'country': 'United States', 'date_added': ' November 1, 2016'...
...
How to fix:
df = pd.read_csv('netflix_titles.csv')
df['date_added'] = df['date_added'].fillna('January 1, 1970') # default for nan. You can skip if you don't need all values as datetime
df['date_added'] = df['date_added'].str.strip() # remove spaces
df['date_added'] = pd.to_datetime(df['date_added'], format='%B %d, %Y') # convert to dt
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8807 non-null object
1 type 8807 non-null object
2 title 8807 non-null object
3 director 6173 non-null object
4 cast 7982 non-null object
5 country 7976 non-null object
6 date_added 8807 non-null datetime64[ns]
7 release_year 8807 non-null int64
8 rating 8803 non-null object
9 duration 8804 non-null object
10 listed_in 8807 non-null object
11 description 8807 non-null object
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 825.8+ KB