I'm trying to add a column that indicates whether a date is a holiday or not. I found some code online, but I believe there's a more efficient way to do it, possibly using a polar method instead of map elements and lambda.
Example code:
import polars as pl
import holidays
# Initialize the holidays for Chile
cl_holidays = holidays.CL()
# Sample data
data = {
"Date": ["2024-06-20 00:00:00", "2024-06-21 00:00:00", "2024-06-22 00:00:00", "2024-06-23 00:00:00", "2024-06-24 00:00:00"],
"Amount": [100, 200, 300, 400, 500],
"User_Count" : [1, 2, 3, 4, 5]
}
# Create DataFrame
df = pl.DataFrame(data)
# Add a new column 'Is_Holiday' based on the Date column
df = df.with_columns(
(pl.col("Date").map_elements(lambda x: x.split(" ")[0] in cl_holidays, return_dtype=pl.Boolean)).alias("Is_Holiday")
).with_columns(pl.col("Date").str.strptime(pl.Datetime))
df
Expected output:
shape: (5, 4)
┌─────────────────────┬────────┬────────────┬────────────┐
│ Date ┆ Amount ┆ User_Count ┆ Is_Holiday │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ i64 ┆ bool │
╞═════════════════════╪════════╪════════════╪════════════╡
│ 2024-06-20 00:00:00 ┆ 100 ┆ 1 ┆ true │
│ 2024-06-21 00:00:00 ┆ 200 ┆ 2 ┆ false │
│ 2024-06-22 00:00:00 ┆ 300 ┆ 3 ┆ false │
│ 2024-06-23 00:00:00 ┆ 400 ┆ 4 ┆ false │
│ 2024-06-24 00:00:00 ┆ 500 ┆ 5 ┆ false │
└─────────────────────┴────────┴────────────┴────────────┘
UPDATE: i tried with @ignoring_gravity aproach, and also tried changing the date format but i keep getting false instead of true
UPDATE2: If i try @Hericks aproach i keep getting false. (I'm using polars 0.20.31 )
import polars as pl
import holidays
# Initialize the holidays for Chile
cl_holidays = holidays.CL()
# Sample data
data = {
"Date": ["2024-06-20 00:00:00", "2024-06-21 00:00:00", "2024-06-22 00:00:00", "2024-06-23 00:00:00", "2024-06-24 00:00:00"],
"Amount": [100, 200, 300, 400, 500],
"User_Count" : [1, 2, 3, 4, 5]
}
# Create DataFrame
df = pl.DataFrame(data)
# Add a new column 'Is_Holiday' based on the Date column
df.with_columns(
Is_Holiday=pl.col('Date').str.to_datetime().dt.date().is_in(cl_holidays.keys())
)
Output:
shape: (5, 4)
┌─────────────────────┬────────┬────────────┬────────────┐
│ Date ┆ Amount ┆ User_Count ┆ Is_Holiday │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ bool │
╞═════════════════════╪════════╪════════════╪════════════╡
│ 2024-06-20 00:00:00 ┆ 100 ┆ 1 ┆ false │
│ 2024-06-21 00:00:00 ┆ 200 ┆ 2 ┆ false │
│ 2024-06-22 00:00:00 ┆ 300 ┆ 3 ┆ false │
│ 2024-06-23 00:00:00 ┆ 400 ┆ 4 ┆ false │
│ 2024-06-24 00:00:00 ┆ 500 ┆ 5 ┆ false │
└─────────────────────┴────────┴────────────┴────────────┘
After doing some research, I found that I need to pass the years into my 'holidays.CL' constructor
import polars as pl
import holidays
# Initialize the holidays for Chile
cl_holidays = holidays.CL()
# Sample data
data = {
"Date": ["2024-06-20 00:00:00", "2024-06-21 00:00:00", "2024-06-22 00:00:00", "2024-06-23 00:00:00", "2024-06-24 00:00:00"],
"Amount": [100, 200, 300, 400, 500],
"User_Count" : [1, 2, 3, 4, 5]
}
# Create DataFrame
df = pl.DataFrame(data)
# Add a new column 'Is_Holiday' based on the Date column
df.with_columns(
pl.col("Date").str.to_date(format="%Y-%m-%d %H:%M:%S")
).with_columns(
pl.col("Date").is_in(list(holidays.CL(years=[2024]).keys())).alias("holiday")
)
Output :
shape: (5, 4)
┌────────────┬────────┬────────────┬─────────┐
│ Date ┆ Amount ┆ User_Count ┆ holiday │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ i64 ┆ bool │
╞════════════╪════════╪════════════╪═════════╡
│ 2024-06-20 ┆ 100 ┆ 1 ┆ true │
│ 2024-06-21 ┆ 200 ┆ 2 ┆ false │
│ 2024-06-22 ┆ 300 ┆ 3 ┆ false │
│ 2024-06-23 ┆ 400 ┆ 4 ┆ false │
│ 2024-06-24 ┆ 500 ┆ 5 ┆ false │
└────────────┴────────┴────────────┴─────────┘