I have a large data set which some users put in data on an csv. I converted the CSV into a dataframe with panda
. The column is over 1000 entries here is a sample
datestart
5/5/2013
6/12/2013
11/9/2011
4/11/2013
10/16/2011
6/15/2013
6/19/2013
6/16/2013
10/1/2011
1/8/2013
7/15/2013
7/22/2013
7/22/2013
5/5/2013
7/12/2013
7/29/2013
8/1/2013
7/22/2013
3/15/2013
6/17/2013
7/9/2013
3/5/2013
5/10/2013
5/15/2013
6/30/2013
6/30/2013
1/1/2006
00/00/0000
7/1/2013
12/21/2009
8/14/2013
Feb 1 2013
Then I tried converting the dates into years using
df['year']=df['datestart'].astype('timedelta64[Y]')
But it gave me an error:
ValueError: Value cannot be converted into object Numpy Time delta
Using Datetime64
df['year']=pd.to_datetime(df['datestart']).astype('datetime64[Y]')
it gave:
"ValueError: Error parsing datetime string ""03/13/2014"" at position 2"
Since that column was filled in by users, the majority was in this format MM/DD/YYYY but some data was put in like this: Feb 10 2013 and there was one entry like this 00/00/0000. I am guessing the different formats screwed up the processing.
Is there a try loop
, if statement
, or something that I can skip over problems like these?
If date time fails I will be force to use a str.extract
script which also works:
year=df['datestart'].str.extract("(?P<month>[0-9]+)(-|\/)(?P<day>[0-9]+)(-|\/)(?P<year>[0-9]+)")
del df['month'], df['day']
and use concat
to take the year out.
With df['year']=pd.to_datetime(df['datestart'],coerce=True, errors ='ignore').astype('datetime64[Y]')
The error message is:
Message File Name Line Position
Traceback
<module> C:\Users\0\Desktop\python\Example.py 23
astype C:\Python33\lib\site-packages\pandas\core\generic.py 2062
astype C:\Python33\lib\site-packages\pandas\core\internals.py 2491
apply C:\Python33\lib\site-packages\pandas\core\internals.py 3728
astype C:\Python33\lib\site-packages\pandas\core\internals.py 1746
_astype C:\Python33\lib\site-packages\pandas\core\internals.py 470
_astype_nansafe C:\Python33\lib\site-packages\pandas\core\common.py 2222
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [datetime64[Y]]
You first have to convert the column with the date values to datetime's with to_datetime()
:
df['datestart'] = pd.to_datetime(df['datestart'], coerce=True)
This should normally parse the different formats flexibly (the coerce=True
is important here to convert invalid dates to NaT
).
If you then want the year part of the dates, you can do the following (seems doing astype directly on the pandas column gives an error, but with values
you can get the underlying numpy array):
df['datestart'].values.astype('datetime64[Y]')
The problem with this is that it gives again an error when assigning this to a column due to the NaT
value (this seems a bug, you can solve this by doing df = df.dropna()
). But also, when you assign this to a column, it get converted back to a datetime64[ns]
as this is the way pandas stores datetimes. So I personally think if you want a column with the years, you can better do the following:
df['year'] = pd.DatetimeIndex(df['datestart']).year
This last one will return the year as an integer.