I have a dataframe which includes around 50 features. In my experiment I have a classification problem so I want to train the model by "GradientBoostingClassifier". The dataframe (mydata) is considered as a training set. One of those 50 features (feature20) is a date, and I need to consider this feature in my training set as well, so I have tried to convert the date to datetime64 as follows:
mydata['feature20']=pd.to_datetime(mydata['feature20'])
Now, when I tried to train the model with the classifier, it gives me the following error:
float() argument must be a string or a number, not 'Timestamp'
Any idea to solve this problem?
You can easily convert your dates to integers: df["feature20"].astype("int64") // 10**9.
NOTE: but it's not a good idea to leave a datetime feature as it is, unless you are working with the time series. Usually you would want to extract additional information from that datetime - day of week, day of month, week of year, month #, etc
Demo:
In [9]: df = pd.DataFrame({'feature20':pd.date_range('2010-01-01', periods=10)})
In [10]: df["new"] = df["feature20"].astype("int64") // 10**9
In [11]: df
Out[11]:
feature20 new
0 2010-01-01 1262304000
1 2010-01-02 1262390400
2 2010-01-03 1262476800
3 2010-01-04 1262563200
4 2010-01-05 1262649600
5 2010-01-06 1262736000
6 2010-01-07 1262822400
7 2010-01-08 1262908800
8 2010-01-09 1262995200
9 2010-01-10 1263081600
In [12]: df["date"] = pd.to_datetime(df["new"], unit="s")
In [13]: df
Out[13]:
feature20 new date
0 2010-01-01 1262304000 2010-01-01
1 2010-01-02 1262390400 2010-01-02
2 2010-01-03 1262476800 2010-01-03
3 2010-01-04 1262563200 2010-01-04
4 2010-01-05 1262649600 2010-01-05
5 2010-01-06 1262736000 2010-01-06
6 2010-01-07 1262822400 2010-01-07
7 2010-01-08 1262908800 2010-01-08
8 2010-01-09 1262995200 2010-01-09
9 2010-01-10 1263081600 2010-01-10
if you have microsecond precision:
In [28]: df = pd.DataFrame({'feature20':pd.date_range('2010-01-01 01:01:01.123456', freq="123S", periods=10)})
In [29]: df
Out[29]:
feature20
0 2010-01-01 01:01:01.123456
1 2010-01-01 01:03:04.123456
2 2010-01-01 01:05:07.123456
3 2010-01-01 01:07:10.123456
4 2010-01-01 01:09:13.123456
5 2010-01-01 01:11:16.123456
6 2010-01-01 01:13:19.123456
7 2010-01-01 01:15:22.123456
8 2010-01-01 01:17:25.123456
9 2010-01-01 01:19:28.123456
In [30]: df["new"] = df["feature20"].astype("int64") // 10**3
In [31]: df
Out[31]:
feature20 new
0 2010-01-01 01:01:01.123456 1262307661123456
1 2010-01-01 01:03:04.123456 1262307784123456
2 2010-01-01 01:05:07.123456 1262307907123456
3 2010-01-01 01:07:10.123456 1262308030123456
4 2010-01-01 01:09:13.123456 1262308153123456
5 2010-01-01 01:11:16.123456 1262308276123456
6 2010-01-01 01:13:19.123456 1262308399123456
7 2010-01-01 01:15:22.123456 1262308522123456
8 2010-01-01 01:17:25.123456 1262308645123456
9 2010-01-01 01:19:28.123456 1262308768123456
In [32]: df["date"] = pd.to_datetime(df["new"], unit="us")
In [33]: df
Out[33]:
feature20 new date
0 2010-01-01 01:01:01.123456 1262307661123456 2010-01-01 01:01:01.123456
1 2010-01-01 01:03:04.123456 1262307784123456 2010-01-01 01:03:04.123456
2 2010-01-01 01:05:07.123456 1262307907123456 2010-01-01 01:05:07.123456
3 2010-01-01 01:07:10.123456 1262308030123456 2010-01-01 01:07:10.123456
4 2010-01-01 01:09:13.123456 1262308153123456 2010-01-01 01:09:13.123456
5 2010-01-01 01:11:16.123456 1262308276123456 2010-01-01 01:11:16.123456
6 2010-01-01 01:13:19.123456 1262308399123456 2010-01-01 01:13:19.123456
7 2010-01-01 01:15:22.123456 1262308522123456 2010-01-01 01:15:22.123456
8 2010-01-01 01:17:25.123456 1262308645123456 2010-01-01 01:17:25.123456
9 2010-01-01 01:19:28.123456 1262308768123456 2010-01-01 01:19:28.123456