pythonpandas

pd.read_csv() not working with parse_dates


I'm using the Netflix Movies and TV Shows dataset to better understand pandas.

The column date_added is in the format: "September 21, 2024" which, as I understand, would be parsed as "%B-%d-%Y". Simply Reading with pd.read_csv() returns this column as an object type.

Using parse_dates=['date_added'] doesn't seem to work, and neither specifying the format. Here is my code:

df = pd.read_csv('netflix_titles.csv', parse_dates=['date_added'], date_format='%B-%d-%Y')
df.info()

resulting in:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB

What am I doing wrong?


Solution

  • I see several problems here.

    1. incorrect date_format (as wrote @iBeMeltin and @ayki)
    2. nan values
    3. spaces in values

    How to check:

    df = pd.read_csv('netflix_titles.csv')
    for rec in df.to_dict('records'):
        try:
            datetime.strptime(rec['date_added'], '%B %d, %Y')
        except (TypeError, ValueError):
            print(rec)
    
    # {'show_id': 's6067', 'type': 'TV Show', 'title': "A Young Doctor's Notebook and Other Stories", 'director': nan, 'cast': 'Daniel Radcliffe, Jon Hamm, Adam Godley, Christopher Godwin, Rosie Cavaliero, Vicki Pepperdine, Margaret Clunie, Tim Steed, Shaun Pye', 'country': 'United Kingdom', 'date_added': nan ...
    # {'show_id': 's8756', 'type': 'TV Show', 'title': 'Women Behind Bars', 'director': nan, 'cast': nan, 'country': 'United States', 'date_added': ' November 1, 2016'...
    ...
    

    How to fix:

    df = pd.read_csv('netflix_titles.csv')
    df['date_added'] = df['date_added'].fillna('January 1, 1970')           # default for nan. You can skip if you don't need all values ​​as datetime
    df['date_added'] = df['date_added'].str.strip()                         # remove spaces
    df['date_added'] = pd.to_datetime(df['date_added'], format='%B %d, %Y') # convert to dt
    
    df.info()
    

    Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 8807 entries, 0 to 8806
    Data columns (total 12 columns):
     #   Column        Non-Null Count  Dtype         
    ---  ------        --------------  -----         
     0   show_id       8807 non-null   object        
     1   type          8807 non-null   object        
     2   title         8807 non-null   object        
     3   director      6173 non-null   object        
     4   cast          7982 non-null   object        
     5   country       7976 non-null   object        
     6   date_added    8807 non-null   datetime64[ns]
     7   release_year  8807 non-null   int64         
     8   rating        8803 non-null   object        
     9   duration      8804 non-null   object        
     10  listed_in     8807 non-null   object        
     11  description   8807 non-null   object        
    dtypes: datetime64[ns](1), int64(1), object(10)
    memory usage: 825.8+ KB