Data engineer here, just loading Excel xlsx files to a SQL Server database table. The problem is, the person creating the Excel files enters dates as dd/mm/yyyy (European format). If the day is > 12, the resultant xlsx reads the date as Text (eg. 31/03/2024). But if the day is 12 or less, the xlsx reads the date as a Date (eg. 06/04/2024 looks like June 4, but it's really April 6). It seems Pandas respects these datatype choices when loading the file into a DataFrame with pd.read_excel(). This means the above two dates end up in the database table as "2024-03-31" (correct) and "2024-06-04" (wrong).
How do I get these Excel files into the table, but correctly parse the date every time? I don't have control over the Excel files before they're saved and dropped into the target file folder.
Thanks to @Temunel, whose solution almost works. I modified it with an idea from here: Is there a Python equivalent to C#'s DateTime.TryParse()?
The real problem is, if Excel saves as a date (i.e. the first value is < 13), then Pandas reads it as a date, then converts it to a string, not an Excel date integer - it ends up as 'yyyy-mm-dd HH:MM:SS'. If it doesn't recognize the date (i.e. first value > 12), Pandas just reads it as a string - it ends up as 'dd/mm/yyyy'.
from datetime import datetime, timedelta
DATE_PATTERNS = ["%Y-%d-%m %H:%M:%S", "%d/%m/%Y"]
def excel_date_parser(val):
""" Parse Excel serial dates and text dates in dd/mm/yyyy format. """
# Check if it's a float or an int (Excel serial date)
if isinstance(val, float) or isinstance(val, int):
# Convert Excel date serial to datetime
return datetime(1899, 12, 30) + timedelta(days=val)
else:
# Try parsing as text in one of the formats from DATE_PATTERNS list above
try:
return try_to_datetime(val)
except Exception as e:
print(f"Error parsing {val}: {e}")
return None
def try_to_datetime(val):
for fmt in DATE_PATTERNS:
try:
return pd.to_datetime(val, format=fmt, errors='raise')
except:
continue
raise ValueError
df = pd.read_excel("file_with_Euro_dates.xlsx", dtype={"Date": str})
df["Date"] = df["Date"].apply(excel_date_parser)
df
You can use pd.to_datetime()
to manually convert the columns to a consistent date format after reading the file.
Here's the code.
import pandas as pd
from datetime import datetime, timedelta
# Function to parse Excel dates
def excel_date_parser(val):
""" Parse Excel serial dates and text dates in dd/mm/yyyy format. """
# Check if it's a float or an int (Excel serial date)
if isinstance(val, float) or isinstance(val, int):
# Convert Excel date serial to datetime
return datetime(1899, 12, 30) + timedelta(days=val)
else:
# Try parsing as text in the format 'dd/mm/yyyy'
try:
return pd.to_datetime(val, format='%d/%m/%Y', errors='raise')
except Exception as e:
print(f"Error parsing {val}: {e}")
return None
# Read the Excel file
df = pd.read_excel('your_file.xlsx', dtype=object)
# Convert the 'Date' column to datetime
df['Date'] = df['Date'].apply(excel_date_parser)
print(df)
I hope this will help you a little.