pythonpandasdataframedatepython-holidays

Use the holidays library on python timestamp type under specific conditions


I have the following dataframe:

|      ID      |     date                                |
|--------------|-----------------------------------------|
|      0       |         2022-01-01 12:00:01+05:00       |
|      1       |         2022-01-30 21:30:01+03:00       |
|      2       |         2022-02-15 13:04:02+02:00       |
|      3       |         2022-09-05 15:30:01+00:00       |
|      4       |         2022-04-21 13:18:02+02:00       |

The date column is a python TimeStamp. I am using the python holidays library, I would like to use the following code:

  from datetime import date
  import holidays
    
  usa_holidays = holidays.country_holidays('US')
  texas_holidays = holidays.country_holidays('US', subdiv='TX') 
  florida_holidays = holidays.country_holidays('US', subdiv='FL')
  california_holidays = holidays.country_holidays('US', subdiv='CA')
    
  # df is the dataframe above 
  # It doesn't work.
  df['only_date'] = df['date'].apply(lambda x: x.date())
  df['federal_holiday'] = df['only_date'].isin(usa_holidays)
  # Returns holiday name 'New Year's Day'
  print(usa_holidays.get('2022-01-01'))

I would like to add the following columns:

  1. federal_holiday: True or False if the day is a bank holiday (country dictionary).
  2. holiday_state: True if it is a holiday in at least one of the state-related dictionaries. False in other case.
  3. name_state: names of the states in which that day is a public holiday, if it falls on all days, write all.
  4. holiday_name: Name of the festival.

The resulting dataframe would look as follows:

| ID | date                      | federal_holiday | holiday_state | name_state | holiday_name         |
|----|---------------------------|-----------------|---------------|------------|----------------------|
| 0  | 2022-01-01 12:00:01+05:00 | True            | True          | all        | New Year's Day       |
| 1  | 2022-01-30 21:30:01+03:00 | False           | False         | NaN        | NaN                  |
| 2  | 2022-02-15 13:04:02+02:00 | False           | True          | FL,CA      | Susan B. Anthony Day |
| 3  | 2022-09-05 15:30:01+00:00 | True            | True          | all        | Labor Day            |
| 4  | 2022-04-21 13:18:02+02:00 | False           | True          | TX         | San Jacinto Day      |

Solution

  • With the following dataframe:

    import holidays
    import pandas as pd
    
    pd.options.display.max_columns = 500
    
    df = pd.DataFrame(
        {
            "ID": [0, 1, 2, 3, 4],
            "date": [
                "2022-01-01 12:00:01+05:00",
                "2022-01-30 21:30:01+03:00",
                "2022-02-15 13:04:02+02:00",
                "2022-09-05 15:30:01+00:00",
                "2022-04-21 13:18:02+02:00",
            ],
        }
    )
    

    You could try this:

    cal = {
        "USA": holidays.country_holidays("US"),
        "TX": holidays.country_holidays("US", subdiv="TX"),
        "FL": holidays.country_holidays("US", subdiv="FL"),
        "CA": holidays.country_holidays("US", subdiv="CA"),
    }
    
    fmt = "%Y-%m-%d"
    
    
    df = (
        df.assign(
            date=lambda df_: pd.to_datetime(
                df_["date"], format="%Y-%m-%d %H:%M:%S", utc=True
            )
        )  # convert values to datetime
        .assign(
            federal_holiday=lambda df: df["date"].apply(
                lambda x: True if cal["USA"].get(x.strftime(fmt)) else False
            )
        )  # add a new column for federal holidays
        .assign(
            holiday_name=lambda df: df["date"].apply(
                lambda x: cal["USA"].get(x.strftime(fmt))
            )
        )  # add a new column for holiday name
        .assign(
            name_state=lambda df: df["date"].apply(
                lambda x: [
                    state
                    for state, calendar in cal.items()
                    if calendar.get(x.strftime(fmt)) and state != "USA"
                ]
            )
        )  # add a new column for state names
        .assign(
            holiday_name=lambda df: df["date"].apply(
                lambda x: list(
                    set(
                        [
                            calendar.get(x.strftime(fmt))
                            for calendar in cal.values()
                            if calendar.get(x.strftime(fmt))
                        ]
                    )
                )
            )
        )  # add state holiday names
        .assign(
            holiday_name=lambda df: df["holiday_name"].apply(
                lambda x: ", ".join(x) if len(x) > 0 else pd.NA
            )
        )  # convert list of names to string
        .assign(
            name_state=lambda df: df["name_state"]
            .apply(lambda x: ", ".join(x) if len(x) > 0 else pd.NA)
            .str.replace("TX, FL, CA", "all")
        )  # convert list of names to string and replace with 'all'
        .assign(holiday_state=lambda df: ~df["name_state"].isna())  # add new column
        .reindex(
            [
                "ID",
                "date",
                "federal_holiday",
                "holiday_state",
                "name_state",
                "holiday_name",
            ],
            axis=1,
        )  # reorder columns order
    )
    

    And so:

    print(df)
    
    # Output
       ID                      date federal_holiday  holiday_state name_state  \
    0   0 2022-01-01 07:00:01+00:00            True           True        all   
    1   1 2022-01-30 18:30:01+00:00           False          False       <NA>   
    2   2 2022-02-15 11:04:02+00:00           False           True     FL, CA   
    3   3 2022-09-05 15:30:01+00:00            True           True        all   
    4   4 2022-04-21 11:18:02+00:00           False           True         TX   
    
               holiday_name  
    0        New Year's Day  
    1                  <NA>  
    2  Susan B. Anthony Day  
    3             Labor Day  
    4       San Jacinto Day