I am using win32com
to pull some data out of large excel files and am running into an issue with pywintypes.datetime
and subsequent Pandas DataFrame creation. DataBody = tbl.DataBodyRange()
gives a large tuple which I convert into a numpy array using DataBody_array = np.array(DataBody)
.
The issue is that the .DataBodyRange()
from win32com
gets the time data column as pywintypes.datetime
and when I try and get this array into a pandas dataframe directly, there is an AttributeError: 'NoneType' object has no attribute 'total_seconds'
Based on my understanding, pywintypes.datetime
are COM objects representing date/time in windows, but when working with NumPy, these should get converted to numpy.datetime64
. What would be the best method to convert these objects from pywintypes.datetime
to numpy.datetime64
? Since the excel tables contain a large amount of data I would like to avoid iteration, I was thinking of using something like np.where
but I am struggling to get it to do the conversion/replacements.
Below is a full sample of the code and an image of the very simple excel table for the example.
Python Code
import win32com.client as MyWinCOM
import numpy as np
import pandas as pd
# Open excel instance
xl = MyWinCOM.gencache.EnsureDispatch('Excel.Application')
xl.Visible = True
# Open workbook
wb = xl.Workbooks.Open('Test Excel.xlsx')
# Get worksheet object
ws = wb.Worksheets('Sheet1')
# Get table object
tbl = ws.ListObjects('Table1')
# Get the table headers, convert to list
ColumnNames = tbl.HeaderRowRange()
ColumnNamesList = list(np.array(ColumnNames))
# Get the data body range, convery to np array
DataBody = tbl.DataBodyRange()
DataBody_array = np.array(DataBody)
# Print databody tuple
for item in DataBody:
print (item)
# Print databody array
for item in DataBody_array:
print (item)
# Attempt to put into dataframe
Data_df = pd.DataFrame(DataBody_array, columns=ColumnNamesList)
print (Data_df)
Output:
('a', pywintypes.datetime(2022, 6, 22, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True)), 1.0)
('b', pywintypes.datetime(2023, 10, 18, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True)), None)
('c', pywintypes.datetime(2022, 6, 21, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True)), 3.0)
['a'
pywintypes.datetime(2022, 6, 22, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True))
1.0]
['b'
pywintypes.datetime(2023, 10, 18, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True))
None]
['c'
pywintypes.datetime(2022, 6, 21, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True))
3.0]
Traceback
Traceback (most recent call last):
File "***.py", line 32, in <module>
print (Data_df)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\frame.py", line 1011, in __repr__
return self.to_string(**repr_params)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\frame.py", line 1192, in to_string
return fmt.DataFrameRenderer(formatter).to_string(
File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 1128, in to_string
string = string_formatter.to_string()
File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\string.py", line 25, in to_string
text = self._get_string_representation()
File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\string.py", line 40, in _get_string_representation
strcols = self._get_strcols()
File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\string.py", line 31, in _get_strcols
strcols = self.fmt.get_strcols()
File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 611, in get_strcols
strcols = self._get_strcols_without_index()
File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 875, in _get_strcols_without_index
fmt_values = self.format_col(i)
File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 889, in format_col
return format_array(
File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 1316, in format_array
return fmt_obj.get_result()
File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 1347, in get_result
fmt_values = self._format_strings()
File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 1810, in _format_strings
values = self.values.astype(object)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\arrays\datetimes.py", line 666, in astype
return dtl.DatetimeLikeArrayMixin.astype(self, dtype, copy)
File "C:\Program Files\Python39\lib\site-packages\pandas\core\arrays\datetimelike.py", line 415, in astype
converted = ints_to_pydatetime(
File "pandas\_libs\tslibs\vectorized.pyx", line 158, in pandas._libs.tslibs.vectorized.ints_to_pydatetime
File "pandas\_libs\tslibs\timezones.pyx", line 266, in pandas._libs.tslibs.timezones.get_dst_info
AttributeError: 'NoneType' object has no attribute 'total_seconds'
# Fix all the timestamp columns
from pandas.api.types import is_datetime64_any_dtype as is_datetime
for column in Data_df.columns:
if is_datetime(Data_df[column]):
Data_df[column] = Data_df[column].astype('int64').astype('datetime64[ns]')
I took Mark's answer and combined it with this answer to come up with the above code that seems to work and be quite performant.