rdplyrlubridateopenxlsx

How can I convert the date pattern like "40940" to 01/02/2012 in a R dataframe, given that there are other types of date format in it?


I'm trying to clean this messy database. Here is the link: https://www.dropbox.com/scl/fi/1crwk2mcag7udiyb5owvl/base_final_informex.xlsx?rlkey=ml66yz7qhr4g1n2ui5tvsfwcv&dl=0

The main problem is that the variable "inicio" has all sorts of different date input, like: "MAR 23", "1º Sem 2021", "OUT/21", "41365". I want to extract only the year of the dates. I've managed to do that with the function:

# Extracting the last two numbers (only the year)
pattern <- "\\b\\d{2}$"

# Function to extract years using regex
extract_years <- function(text, pattern) {
  matches <- regmatches(text, gregexpr(pattern, text))
  years <- sapply(matches, function(x) as.numeric(x))
  return(years)
}

# Use dplyr to extract years from the date_column
informex_clean_2 <- informex_clean |> 
  mutate(ano = extract_years(inicio, pattern))

That worked perfectly to the patterns that are not like "41365", which returned NA's.

I know that the problem is to somehow convert this number to date, but I don't know how to do that since there are dates in the dataframe that can't be numeric, or even as a normal date pattern.

Well, I've tried this function

# Custom function to convert to date if possible
convert_to_date_if_possible <- function(x) {
  if (is.numeric(x)) {
    # If the input is numeric, assume it's a date representation and try to convert it
    x_date <- as.Date(as.numeric(x), origin = "1899-12-30") # Excel's date origin is on 1899-12-30
    if (!is.na(x_date)) {
      # Conversion was successful; return the date object
      return(x_date)
    } else {
      # Conversion failed; return the original value as character
      return(x)
    }
  } else {
    # If the input is not numeric, return the original value as character
    return(x)
  }
}

I expected that it would give me what I wanted, but instead gave me NA's values to the "41365" pattern.


Solution

  • What about converting the date to a format that your regex would like and then pulling the years:

    x <- c("MAR 23", "1º Sem 2021", "OUT/21", "41365")
    
    
    get_year <- function(x){
      
      date <- suppressWarnings(
        as.Date(x = as.numeric(x), origin = "1899-12-30") |>
                  format("%m-%d-%y") |>
                  as.character()
        )
      date <- ifelse(is.na(date), x, date)
      sub(".*(\\d{2}$)", "\\1", date)
    }
    
    get_year(x)
    #> [1] "23" "21" "21" "13"