I'm trying to clean this messy database. Here is the link: https://www.dropbox.com/scl/fi/1crwk2mcag7udiyb5owvl/base_final_informex.xlsx?rlkey=ml66yz7qhr4g1n2ui5tvsfwcv&dl=0
The main problem is that the variable "inicio" has all sorts of different date input, like: "MAR 23", "1º Sem 2021", "OUT/21", "41365". I want to extract only the year of the dates. I've managed to do that with the function:
# Extracting the last two numbers (only the year)
pattern <- "\\b\\d{2}$"
# Function to extract years using regex
extract_years <- function(text, pattern) {
matches <- regmatches(text, gregexpr(pattern, text))
years <- sapply(matches, function(x) as.numeric(x))
return(years)
}
# Use dplyr to extract years from the date_column
informex_clean_2 <- informex_clean |>
mutate(ano = extract_years(inicio, pattern))
That worked perfectly to the patterns that are not like "41365", which returned NA's.
I know that the problem is to somehow convert this number to date, but I don't know how to do that since there are dates in the dataframe that can't be numeric, or even as a normal date pattern.
Well, I've tried this function
# Custom function to convert to date if possible
convert_to_date_if_possible <- function(x) {
if (is.numeric(x)) {
# If the input is numeric, assume it's a date representation and try to convert it
x_date <- as.Date(as.numeric(x), origin = "1899-12-30") # Excel's date origin is on 1899-12-30
if (!is.na(x_date)) {
# Conversion was successful; return the date object
return(x_date)
} else {
# Conversion failed; return the original value as character
return(x)
}
} else {
# If the input is not numeric, return the original value as character
return(x)
}
}
I expected that it would give me what I wanted, but instead gave me NA's values to the "41365" pattern.
What about converting the date to a format that your regex would like and then pulling the years:
x <- c("MAR 23", "1º Sem 2021", "OUT/21", "41365")
get_year <- function(x){
date <- suppressWarnings(
as.Date(x = as.numeric(x), origin = "1899-12-30") |>
format("%m-%d-%y") |>
as.character()
)
date <- ifelse(is.na(date), x, date)
sub(".*(\\d{2}$)", "\\1", date)
}
get_year(x)
#> [1] "23" "21" "21" "13"