pythonpandasdata-cleaning

Python cleaning dates for conversion to year only in Pandas


I have a large data set which some users put in data on an csv. I converted the CSV into a dataframe with panda. The column is over 1000 entries here is a sample

datestart
5/5/2013
6/12/2013
11/9/2011
4/11/2013
10/16/2011
6/15/2013
6/19/2013
6/16/2013
10/1/2011
1/8/2013
7/15/2013
7/22/2013
7/22/2013
5/5/2013
7/12/2013
7/29/2013
8/1/2013
7/22/2013
3/15/2013
6/17/2013
7/9/2013
3/5/2013
5/10/2013
5/15/2013
6/30/2013
6/30/2013
1/1/2006
00/00/0000
7/1/2013
12/21/2009
8/14/2013
Feb 1 2013

Then I tried converting the dates into years using

df['year']=df['datestart'].astype('timedelta64[Y]')

But it gave me an error:

ValueError: Value cannot be converted into object Numpy Time delta

Using Datetime64

df['year']=pd.to_datetime(df['datestart']).astype('datetime64[Y]')

it gave:

"ValueError: Error parsing datetime string ""03/13/2014"" at position 2"

Since that column was filled in by users, the majority was in this format MM/DD/YYYY but some data was put in like this: Feb 10 2013 and there was one entry like this 00/00/0000. I am guessing the different formats screwed up the processing.

Is there a try loop, if statement, or something that I can skip over problems like these?

If date time fails I will be force to use a str.extract script which also works:

year=df['datestart'].str.extract("(?P<month>[0-9]+)(-|\/)(?P<day>[0-9]+)(-|\/)(?P<year>[0-9]+)")


del df['month'], df['day']  

and use concat to take the year out.

With df['year']=pd.to_datetime(df['datestart'],coerce=True, errors ='ignore').astype('datetime64[Y]') The error message is:

Message File Name   Line    Position    
Traceback               
    <module>    C:\Users\0\Desktop\python\Example.py    23      
    astype  C:\Python33\lib\site-packages\pandas\core\generic.py    2062        
    astype  C:\Python33\lib\site-packages\pandas\core\internals.py  2491        
    apply   C:\Python33\lib\site-packages\pandas\core\internals.py  3728        
    astype  C:\Python33\lib\site-packages\pandas\core\internals.py  1746        
    _astype C:\Python33\lib\site-packages\pandas\core\internals.py  470     
    _astype_nansafe C:\Python33\lib\site-packages\pandas\core\common.py 2222        
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [datetime64[Y]]        

Solution

  • You first have to convert the column with the date values to datetime's with to_datetime():

    df['datestart'] = pd.to_datetime(df['datestart'], coerce=True)
    

    This should normally parse the different formats flexibly (the coerce=True is important here to convert invalid dates to NaT).

    If you then want the year part of the dates, you can do the following (seems doing astype directly on the pandas column gives an error, but with values you can get the underlying numpy array):

    df['datestart'].values.astype('datetime64[Y]')
    

    The problem with this is that it gives again an error when assigning this to a column due to the NaT value (this seems a bug, you can solve this by doing df = df.dropna()). But also, when you assign this to a column, it get converted back to a datetime64[ns] as this is the way pandas stores datetimes. So I personally think if you want a column with the years, you can better do the following:

    df['year'] =  pd.DatetimeIndex(df['datestart']).year
    

    This last one will return the year as an integer.