I have a huge number of excel files i need to extract data from, hopefully into a pandas df. The file contains a number of columns, one of which is time as a string like "16:30"
The filenames are "Monday 21st September 2020.xlsx" for example
I'm trying to loop through the files, and add a datetime column that includes the date from the filename, and the time from the column in the excel file. I've tried the following with the loop limited to one file
import pandas as pd
import datetime
import dateutil
import glob
import pathlib
folder = r"C:\temp\Friday 1st April 2022 (SB).xlsx"
for file in glob.glob(folder, recursive=False):
#read in the excel file
df = pd.read_excel(file, sheet_name="SB", usecols="B,I,J")
#workout the date from the file name
filedate = dateutil.parser.parse(pathlib.Path(file).stem.replace(" (SB)",""))
#print filedate because it doesnt end up in the df correctly!
print(type(filedate))
print(filedate)
df.insert(0,'Date', racedate.strftime('%d-$m-%Y'))
print(df)
And that gives, this..... So the date add to the df is going wrong somewhere
C:\temp\Friday 1st April 2022 (SB).xlsx
<class 'datetime.datetime'>
2022-04-01 00:00:00
Date Time R1 R2
0 01-$m-2022 16:30 9 5
1 01-$m-2022 16:30 5 5
2 01-$m-2022 16:30 6 5
3 01-$m-2022 16:30 3 6
4 01-$m-2022 16:30 3 3
.. ... ... .. ..
446 01-$m-2022 16:15 3 4
447 01-$m-2022 16:15 3 3
448 01-$m-2022 16:15 3 3
449 01-$m-2022 16:15 5 3
450 01-$m-2022 16:15 5 4
[451 rows x 4 columns]
Also once i get this sorted i want to merge the two Date and Time columns into one datetime object.
You can extract the date from the folder
variable like this.
from dateutil import parser
import re
folder = r"C:\temp\Friday 1st April 2022 (SB).xlsx"
date_string = re.findall(r'\d{1,2}[a-z]{2} \w+ \d{4}', folder)[0]
date_obj = parser.parse(date_string)
print(date_obj.strftime("%d-%m-%Y"))
This will print 01-04-2022
The date_string
variable is calculated by regex and you get it as 1st April 2022
which the dateutil library can easily parse then.
Regex explanation:
\d{1,2}
: one or two digits to match the day of month number
[a-z]{2}
: two lowercase letters to match the "st", "nd" and "rd"
\w+
: one or more word characters to match the month name
\d{4}
: four digits to match the year number
With the above, you can get the 1st April 2022
from the whole path.