rregexdatetimelubridatedata-wrangling

Working with difficult AM/PM formats and REGEX with lubridate in R


I am trying to work with some AM/PM formats in lubridate on R but I can't seem to come up with a proper solution.

I have a HUGE dataset that has date_time formats in a very rare way; the format goes as follows:

First a number that represents the day, second an abbreviation of the month OR even the month fully spelled out a 12H time format and the strings " a. m." OR "p. m." or even a combination of more spaces between or missing "dots" then such as "a. m".

To set an example please take a look at this vector:

dates<-c("02 dec 05:47 a. m", 
"7 November 09:47 p. m.",
"3  jul 12:28 a.m.", 
"23 sept 08:53 a m.", 
"7 may 09:05 PM")

These make up for more than 95% of the rare formats of datetime in the data set I have been trying to use lubridate on R I am trying to use the function

ydm_hm(paste(2021,dates))

This is because all dates are form 2021 but I get always:

[1] NA                        NA                        NA                       
[4] NA                        "2021-05-07 21:05:00 UTC"
Warning message:
 4 failed to parse. 

The 4 that fail to parse give me NAS and the only one that parses is correct I do notice that this one has PM or AM as uppercase letters without dots but most of the time my formats will be like this:

ydm_hm("7 may 09:05 p.m.")

and this gives me NAS...

So I feel as though the only way to get this dates to workout is to change the structure and using REGEX so convert all "a. m." combinations into "AM" and "PM" only after analyzing the data I realized all "p.m" or "a. m." strings come after ONE or TWO spaces after the 12H time format that always have a length of 5 characters and so what should be considered to come up with the patter of the REGEX is the following.

The string will begins with one or two numbers then spaces and then letters (for the month abbreviated or fully spelled out after that will have spaces) and then 5 characters (that's the 12H time format) and then will have letters spaces and dots for all possible a.m and p.m formats.

I have tried with no luck to convert the structure of the date. What can I try next?

My desired output will be:

"2021-12-02 05;47:00 UTC"
"2021-11-07 09:47:00 UTC"
"2021-07-03 12:28:00 UTC"
"2021-09-23 08:53:00 UTC"
"2021-05-07 21:05:00 UTC"

Solution

  • In this case, parse_date from parsedate works

    library(parsedate)
    parse_date(paste(2021, dates))
    

    -output

    [1] "2021-12-02 05:47:00 UTC"
    [2] "2021-11-07 09:47:00 UTC" 
    [3] "2021-07-03 12:28:00 UTC"
    [4] "2021-09-23 08:53:00 UTC" 
    [5] "2021-05-07 21:05:00 UTC"
    

    Or if the second value should be PM, use str_remove to remove the space

    library(stringr)
    parse_date(paste(2021, str_remove_all(dates,
        "(?<=[A-Za-z])[. ]+(?=[A-Za-z])")))
    [1] "2021-12-02 05:47:00 UTC" 
    [2] "2021-11-07 21:47:00 UTC" 
    [3] "2021-07-03 00:28:00 UTC" 
    [4] "2021-09-23 08:53:00 UTC" 
    [5] "2021-05-07 21:05:00 UTC"
    

    With ydm_hm, the issue is that one of the am/pm format showed spaces without the . and this may not get parsed. We could change the format by removing the spaces

    library(lubridate)
    library(stringr)
    ydm_hm(paste(2021, str_remove_all(dates,
        "(?<=[A-Za-z])[. ]+(?=[A-Za-z])")))
    [1] "2021-12-02 05:47:00 UTC" 
    [2] "2021-11-07 21:47:00 UTC" 
    [3] "2021-07-03 00:28:00 UTC" 
    [4] "2021-09-23 08:53:00 UTC" 
    [5] "2021-05-07 21:05:00 UTC"