Problem description:
I'm trying to read historical stock prices from CSV format into pandas Dataframe, but one funny thing I have noticed so far - when reading certain row numbers, the dates column type is changed from pandas.Timestamp
into datetime
- how does that work? How can I read pandas.Timestamp
then?
Minimum reproduction example:
I have inspected my CSV file, and here is a minimal needed data example from it.
import pandas as pd
file = open('temp.csv', 'w')
file.write(
"""Local time,Open,High,Low,Close,Volume
28.02.2014 02:00:00.000 GMT+0200,1.37067,1.38250,1.36943,1.38042,176839.0313
01.04.2014 03:00:00.000 GMT+0300,1.37742,1.38156,1.37694,1.37937,95386.0703""")
file.close()
data = pd.read_csv('temp.csv', parse_dates = ["Local time"])
print(type(data['Local time'][0]))
Result: <class 'datetime.datetime'>
VERSUS
import pandas as pd
file = open('temp.csv', 'w')
file.write(
"""Local time,Open,High,Low,Close,Volume
28.02.2014 02:00:00.000 GMT+0200,1.37067,1.38250,1.36943,1.38042,176839.0313""")
file.close()
data = pd.read_csv('temp.csv', parse_dates = ["Local time"])
print(type(data['Local time'][0]))
file = open('temp.csv', 'w')
file.write(
"""Local time,Open,High,Low,Close,Volume
01.04.2014 03:00:00.000 GMT+0300,1.37742,1.38156,1.37694,1.37937,95386.0703""")
file.close()
data = pd.read_csv('temp.csv', parse_dates = ["Local time"])
print(type(data['Local time'][0]))
file = open('temp.csv', 'w')
file.write(
"""Local time,Open,High,Low,Close,Volume
02.03.2014 02:00:00.000 GMT+0200,1.37620,1.37882,1.37586,1.37745,5616.04
03.03.2014 02:00:00.000 GMT+0200,1.37745,1.37928,1.37264,1.37357,136554.6563
04.03.2014 02:00:00.000 GMT+0200,1.37356,1.37820,1.37211,1.37421,124863.8203""")
file.close()
data = pd.read_csv('temp.csv', parse_dates = ["Local time"])
print(type(data['Local time'][0]))
Result: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Result: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Result: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Versions:
pandas==1.2.3
pandas-datareader==0.9.0
Summary:
I need to read pandas.Timestamp
because of some latter data manipulations, not datetime
, and have no ideas what's wrong here - hope you guys can help...
I have created a GitHub issue as well, but it's not triaged yet.
You can specify which date_parser
function to be used:
data = pd.read_csv('temp.csv',
parse_dates = ["Local time"],
date_parser=pd.Timestamp)
Output:
>>> data
Local time Open High Low Close Volume
0 2014-02-03 02:00:00-02:00 1.37620 1.37882 1.37586 1.37745 5616.0400
1 2014-03-03 02:00:00-03:00 1.37745 1.37928 1.37264 1.37357 136554.6563
2 2014-04-03 02:00:00-02:00 1.37356 1.37820 1.37211 1.37421 124863.8203
>>> type(data['Local time'][0])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
By my observation pandas automatically parses each entry as datetime when timezone are different for individual observation.
The above should work if you really need to use pd.Timestamp
.
Running the above however also gives me a FutureWarning, which I researched and found to be harmless as of now.
EDIT
After a bit more research:
pandas tries to convert a date type column to DatetimeIndex
for more efficiency in datetime based operations.
But for this pandas needs to have a common timezone for the entire column.
On explicitly trying to convert to pd.DatetimeIndex
>>> data
Local time Open High Low Close Volume
0 2014-02-03 02:00:00-02:00 1.37620 1.37882 1.37586 1.37745 5616.0400
1 2014-03-03 02:00:00-03:00 1.37745 1.37928 1.37264 1.37357 136554.6563
2 2014-04-03 02:00:00-04:00 1.37356 1.37820 1.37211 1.37421 124863.8203
>>> pd.DatetimeIndex(data['Local time'])
ValueError: Array must be all same time zone
During handling of the above exception, another exception occurred:
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
So when converting to DatetimeIndex
fails pandas then keeps the data as strings (dtype : object) internally and individual entries to be processed as datetime
.
Documentation recommends that if timezones in the data are different specify UTC=True, so the timezone would be set as UTC and time values would be changed accordingly.
From Documentation:
pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains columns with a mixture of timezones, the default result will be an object-dtype column with strings, even with parse_dates.
To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with utc=True
In a data that already has the same timezone DatetimeIndex works seamlessly:
>>> data
Local time Open High Low Close Volume
0 2014-02-03 02:00:00-02:00 1.37620 1.37882 1.37586 1.37745 5616.0400
1 2014-03-03 02:00:00-02:00 1.37745 1.37928 1.37264 1.37357 136554.6563
2 2014-04-03 02:00:00-02:00 1.37356 1.37820 1.37211 1.37421 124863.8203
>>> pd.DatetimeIndex(data['Local time'])
DatetimeIndex(['2014-02-03 02:00:00-02:00', '2014-03-03 02:00:00-02:00',
'2014-04-03 02:00:00-02:00'],
dtype='datetime64[ns, pytz.FixedOffset(-120)]', name='Local time', freq=None)
>>> type(pd.DatetimeIndex(data['Local time'])[0])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
References: