pythonpandasnumpywin32com

win32com pywintypes.datetime in Pandas DataFrame


I am using win32com to pull some data out of large excel files and am running into an issue with pywintypes.datetime and subsequent Pandas DataFrame creation. DataBody = tbl.DataBodyRange() gives a large tuple which I convert into a numpy array using DataBody_array = np.array(DataBody).

The issue is that the .DataBodyRange() from win32com gets the time data column as pywintypes.datetime and when I try and get this array into a pandas dataframe directly, there is an AttributeError: 'NoneType' object has no attribute 'total_seconds'

Based on my understanding, pywintypes.datetime are COM objects representing date/time in windows, but when working with NumPy, these should get converted to numpy.datetime64. What would be the best method to convert these objects from pywintypes.datetime to numpy.datetime64? Since the excel tables contain a large amount of data I would like to avoid iteration, I was thinking of using something like np.where but I am struggling to get it to do the conversion/replacements.

Below is a full sample of the code and an image of the very simple excel table for the example.

Python Code

import win32com.client as MyWinCOM
import numpy as np
import pandas as pd

# Open excel instance
xl = MyWinCOM.gencache.EnsureDispatch('Excel.Application')
xl.Visible = True
# Open workbook
wb = xl.Workbooks.Open('Test Excel.xlsx')
# Get worksheet object
ws = wb.Worksheets('Sheet1')
# Get table object
tbl = ws.ListObjects('Table1')
# Get the table headers, convert to list
ColumnNames = tbl.HeaderRowRange()
ColumnNamesList = list(np.array(ColumnNames))

# Get the data body range, convery to np array
DataBody = tbl.DataBodyRange()
DataBody_array = np.array(DataBody)

# Print databody tuple
for item in DataBody:
    print (item)
# Print databody array
for item in DataBody_array:
    print (item)

# Attempt to put into dataframe
Data_df = pd.DataFrame(DataBody_array, columns=ColumnNamesList)

print (Data_df)

Output:

('a', pywintypes.datetime(2022, 6, 22, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True)), 1.0)
('b', pywintypes.datetime(2023, 10, 18, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True)), None)
('c', pywintypes.datetime(2022, 6, 21, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True)), 3.0)
['a'
 pywintypes.datetime(2022, 6, 22, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True))
 1.0]
['b'
 pywintypes.datetime(2023, 10, 18, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True))
 None]
['c'
 pywintypes.datetime(2022, 6, 21, 0, 0, tzinfo=TimeZoneInfo('GMT Standard Time', True))
 3.0]

Excel File/Table Test Excel.xlsx

Traceback

Traceback (most recent call last):
  File "***.py", line 32, in <module>
    print (Data_df)
  File "C:\Program Files\Python39\lib\site-packages\pandas\core\frame.py", line 1011, in __repr__
    return self.to_string(**repr_params)
  File "C:\Program Files\Python39\lib\site-packages\pandas\core\frame.py", line 1192, in to_string
    return fmt.DataFrameRenderer(formatter).to_string(
  File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 1128, in to_string
    string = string_formatter.to_string()
  File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\string.py", line 25, in to_string
    text = self._get_string_representation()
  File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\string.py", line 40, in _get_string_representation
    strcols = self._get_strcols()
  File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\string.py", line 31, in _get_strcols
    strcols = self.fmt.get_strcols()
  File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 611, in get_strcols
    strcols = self._get_strcols_without_index()
  File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 875, in _get_strcols_without_index
    fmt_values = self.format_col(i)
  File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 889, in format_col
    return format_array(
  File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 1316, in format_array
    return fmt_obj.get_result()
  File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 1347, in get_result
    fmt_values = self._format_strings()
  File "C:\Program Files\Python39\lib\site-packages\pandas\io\formats\format.py", line 1810, in _format_strings
    values = self.values.astype(object)
  File "C:\Program Files\Python39\lib\site-packages\pandas\core\arrays\datetimes.py", line 666, in astype
    return dtl.DatetimeLikeArrayMixin.astype(self, dtype, copy)
  File "C:\Program Files\Python39\lib\site-packages\pandas\core\arrays\datetimelike.py", line 415, in astype
    converted = ints_to_pydatetime(
  File "pandas\_libs\tslibs\vectorized.pyx", line 158, in pandas._libs.tslibs.vectorized.ints_to_pydatetime
  File "pandas\_libs\tslibs\timezones.pyx", line 266, in pandas._libs.tslibs.timezones.get_dst_info
AttributeError: 'NoneType' object has no attribute 'total_seconds'

Solution

  • # Fix all the timestamp columns
    from pandas.api.types import is_datetime64_any_dtype as is_datetime
    for column in Data_df.columns:
        if is_datetime(Data_df[column]):
            Data_df[column] = Data_df[column].astype('int64').astype('datetime64[ns]')
    

    I took Mark's answer and combined it with this answer to come up with the above code that seems to work and be quite performant.