
Airflow error with pandas: AttributeError: 'Pendulum' object has no attribute 'nanosecond'


I have a pandas.DataFrame df whose df.index looks like this:

DatetimeIndex(['2014-10-06 00:55:11.357899904',
               '2014-10-06 00:56:39.046799898',
               '2014-10-06 00:56:39.057499886',
               '2014-10-06 00:56:40.684299946',
               '2014-10-06 00:56:41.115299940',
               '2014-10-06 01:03:52.764300108',
               '2014-10-06 01:21:18.448499918',
               '2014-10-06 01:21:18.457200050',
               '2014-10-06 01:21:18.584199905',
               '2014-10-06 01:21:18.594700098',
               ...
               '2014-11-05 00:25:47.996000051',
               '2014-11-05 00:56:45.081799984',
               '2014-11-05 00:56:45.096899986',
               '2014-11-05 05:50:57.639699936',
               '2014-11-05 06:08:56.365000010',
               '2014-11-05 06:11:20.519099950',
               '2014-11-05 06:15:03.470400095',
               '2014-11-05 06:15:03.981600046',
               '2014-11-05 06:25:31.514300108',
               '2014-11-05 06:25:59.310400009'],
              dtype='datetime64[ns]', name='time', length=1000, freq=None)

I am running a DAG on Airflow, which fails at the line df.loc[start_date:end_date] with:

AttributeError: 'Pendulum' object has no attribute 'nanosecond'

I cannot reproduce the error without running the code in Airflow. The same code runs just fine without Airflow.

The start_date is the Airflow macro execution_date and end_date is the next_execution_date.

I guess the issue is that the datetime dtype of the df index is not compatible with the types of start_date and end_date, but I have no idea how to address it.

I tried removing time zones and changing the dtype, but nothing worked.


Solution

  • After some searching, I found the source of the problem and a solution.

    the problem

    The issue is caused by the two macros passed down from Airflow, execution_date and next_execution_date.

    Their type is pendulum.datetime, not datetime.datetime as the Airflow documentation states. This is what causes the clash with pandas.DataFrame.

    pandas and pendulum currently don't work well together, and the problem is well described in this StackOverflow answer.

    the solution

    The solution seems to be to convert start_date and end_date from pendulum.datetime to datetime.datetime.

    For this I wrote a simple function that converts the date to a string before parsing it back into a datetime.datetime. I am sure there are better ways to do it, but this was simple and safe, which is why I used it.

    Here is the function itself:

    from datetime import datetime


    def pendulum_to_datetime(pendulum_date):
        """
        Convert pendulum to datetime format.
    
        The conversion is done from pendulum -> string -> datetime.
    
        Args:
            pendulum_date (pendulum): The date you wish to convert.
    
        Returns:
            (datetime) The converted date.
        """
        # The format has no %f, so sub-second precision is dropped.
        fmt = '%Y-%m-%dT%H:%M:%S%z'
        string_date = pendulum_date.strftime(fmt)
        return datetime.strptime(string_date, fmt)
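
As one possible alternative to the string round trip, the sketch below rebuilds a plain datetime.datetime field by field, which also keeps microseconds (the format string above drops them). It assumes the incoming object is a datetime subclass exposing the usual attributes, as Pendulum objects do. The names to_plain_datetime and FakePendulum are hypothetical: FakePendulum is only a stand-in so the example runs without Airflow or pendulum installed.

```python
from datetime import datetime, timezone


def to_plain_datetime(dt):
    """Rebuild a plain datetime.datetime from a datetime-like object.

    Copies each field explicitly, so the result is exactly of type
    datetime.datetime even if ``dt`` is an instance of a subclass.
    The tzinfo object is carried over unchanged.
    """
    return datetime(
        dt.year, dt.month, dt.day,
        dt.hour, dt.minute, dt.second, dt.microsecond,
        tzinfo=dt.tzinfo,
    )


class FakePendulum(datetime):
    """Hypothetical stand-in for a Pendulum object (a datetime subclass)."""


# Stand-in for the value Airflow would pass as execution_date.
start = FakePendulum(2014, 10, 6, 0, 55, 11, 357899, tzinfo=timezone.utc)
plain = to_plain_datetime(start)

print(type(plain).__name__)  # the exact type, not the subclass
print(plain.isoformat())
```

The converted values can then be used in place of the original macros, for example df.loc[to_plain_datetime(start_date):to_plain_datetime(end_date)]. Whether pandas also needs the time zone stripped or replaced depends on whether the index is timezone-aware; the index shown in the question is naive (datetime64[ns]), so that part is left as an open assumption here.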