pythonpandaspython-dateutil

Python - adding datetime column to dataframe


I have a huge number of excel files i need to extract data from, hopefully into a pandas df. The file contains a number of columns, one of which is time as a string like "16:30"

The filenames are "Monday 21st September 2020.xlsx" for example

I'm trying to loop through the files, and add a datetime column that includes the date from the filename, and the time from the column in the excel file. I've tried the following with the loop limited to one file

import pandas as pd
import datetime
import dateutil
import glob
import pathlib

folder = r"C:\temp\Friday 1st April 2022 (SB).xlsx"

for file in glob.glob(folder, recursive=False):
    #read in the excel file
    df = pd.read_excel(file, sheet_name="SB", usecols="B,I,J")
    
    #workout the date from the file name
    filedate = dateutil.parser.parse(pathlib.Path(file).stem.replace(" (SB)",""))
    
    #print filedate because it doesnt end up in the df correctly!
    print(type(filedate))
    print(filedate)
    df.insert(0,'Date', racedate.strftime('%d-$m-%Y'))
  
    print(df)
    

And that gives, this..... So the date add to the df is going wrong somewhere

C:\temp\Friday 1st April 2022 (SB).xlsx
<class 'datetime.datetime'>
2022-04-01 00:00:00
           Date   Time  R1  R2
0    01-$m-2022  16:30   9   5
1    01-$m-2022  16:30   5   5
2    01-$m-2022  16:30   6   5
3    01-$m-2022  16:30   3   6
4    01-$m-2022  16:30   3   3
..          ...    ...  ..  ..
446  01-$m-2022  16:15   3   4
447  01-$m-2022  16:15   3   3
448  01-$m-2022  16:15   3   3
449  01-$m-2022  16:15   5   3
450  01-$m-2022  16:15   5   4

[451 rows x 4 columns]

Also once i get this sorted i want to merge the two Date and Time columns into one datetime object.


Solution

  • You can extract the date from the folder variable like this.

    from dateutil import parser
    import re 
    
    folder = r"C:\temp\Friday 1st April 2022 (SB).xlsx"
    
    date_string = re.findall(r'\d{1,2}[a-z]{2} \w+ \d{4}', folder)[0]
    date_obj = parser.parse(date_string)
    print(date_obj.strftime("%d-%m-%Y"))
    

    This will print 01-04-2022

    The date_string variable is calculated by regex and you get it as 1st April 2022 which the dateutil library can easily parse then.

    Regex explanation:

    \d{1,2}: one or two digits to match the day of month number

    [a-z]{2}: two lowercase letters to match the "st", "nd" and "rd"

    \w+: one or more word characters to match the month name

    \d{4}: four digits to match the year number

    With the above, you can get the 1st April 2022 from the whole path.