r date format complextype

Complex date formatting in R


I would like to know the best practice, or a neat piece of code, for cleaning up years that were recorded in a very messy format in Excel. For example:

   bad_format = c(1969*,--1979--,1618, 19.42, 1111983, 1981, 1-9-3-2, 1983, 
                 "1977","1954", "1943", 1968, 2287 BC, 1998, ..1911.., 1961)

There are all sorts of issues: some years are recorded as strings, others are stored incorrectly, such as 1111983 (three extra 1s), others are in BC, etc.

The output should look like this:

   correct_format = c(1969,1979, 1618, 1942, 1983, 1981, 1932, 1983, 1977, 
                   1954, 1943, 1968, -2287, 1998, 1911, 1961)

I have no idea how to approach this task, and I am not able to write R code that would solve it, but I hope someone has an idea for neat code that could find these issues and correct them.


Solution

  • First, set BC to TRUE if the string ends in "BC" and FALSE otherwise. Then remove all non-digit characters and convert the result to numeric, giving digits. Finally, take digits modulo 10000 to keep only the last 4 digits, multiplying by -1 if BC is TRUE and by +1 otherwise.

    bad_format <- c("1969*", "--1979--", "1618", "19.42", "1111983", "1981", 
      "1-9-3-2", "1983", "1977", "1954", "1943", "1968", "2287 BC", "1998", 
      "..1911..", "1961")
    
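    # TRUE where the entry ends in "BC"
    BC <- grepl("BC$", bad_format)
    # strip every non-digit character, then convert to numeric
    digits <- as.numeric(gsub("\\D", "", bad_format))
    # keep the last 4 digits; negate the BC years
    ifelse(BC, -1, 1) * (digits %% 10000)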
    

    giving:

     [1]  1969  1979  1618  1942  1983  1981  1932  1983  1977  1954  1943  1968
    [13] -2287  1998  1911  1961
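
If the same cleanup has to be applied to several columns, the three steps above could be wrapped in a small function. The following is only a sketch, not part of the original answer: the name clean_year is hypothetical, and two behaviours are assumptions added here for illustration. Entries with fewer than 4 digits are returned as NA instead of a partial year, and trailing whitespace after "BC" is tolerated.

```r
# Hypothetical helper: same approach as above, wrapped in a function.
# Assumptions (not in the original answer): entries with fewer than
# 4 digits become NA, and whitespace after "BC" is ignored.
clean_year <- function(x) {
  x <- as.character(x)
  is_bc <- grepl("BC\\s*$", x)       # TRUE where the entry ends in "BC"
  digits <- gsub("\\D", "", x)       # keep only the digit characters
  year <- ifelse(nchar(digits) >= 4,
                 as.numeric(digits) %% 10000,  # last 4 digits
                 NA_real_)                     # too few digits for a year
  ifelse(is_bc, -year, year)         # BC years become negative
}

clean_year(c("1969*", "2287 BC", "19.42", "1111983", "n/a"))
```

Note that, like the original code, this keeps only the last 4 digits, so it cannot represent years with 5 or more digits, and an entry such as "1111983" is assumed to mean 1983.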