What is the best practice, or a neat piece of code, for cleaning up years that were recorded in Excel in a very messy format? For example:
bad_format = c(1969*,--1979--,1618, 19.42, 1111983, 1981, 1-9-3-2, 1983,
“1977”,“1954”, “1943”, 1968, 2287 BC, 1998, ..1911.., 1961)
There are all sorts of issues: some years are recorded as strings, others are stored incorrectly, such as 1111983 (three extra 1s), others are given in BC, and so on.
The output should look like this:
correct_format = c(1969,1979, 1618, 1942, 1983, 1981, 1932, 1983, 1977,
1954, 1943, 1968, -2287, 1998, 1911, 1961)
I have no idea how to approach this task and do not have the skills to write R code that solves it, but I hope someone can suggest a neat way to find these issues and correct them.
First set BC to TRUE if the string ends in "BC" and FALSE otherwise. Then remove non-digits and convert to numeric, giving digits. Finally use modulo to take the last 4 digits, multiplying by -1 if BC is TRUE and +1 otherwise.
bad_format <- c("1969*", "--1979--", "1618", "19.42", "1111983", "1981",
"1-9-3-2", "1983", "1977", "1954", "1943", "1968", "2287 BC", "1998",
"..1911..", "1961")
BC <- grepl("BC$", bad_format)
digits <- as.numeric(gsub("\\D", "", bad_format))
ifelse(BC, -1, 1) * (digits %% 10000)
giving:
[1] 1969 1979 1618 1942 1983 1981 1932 1983 1977 1954 1943 1968
[13] -2287 1998 1911 1961
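If this needs to be applied to more than one column, the same three steps could be wrapped in a small function and checked against the output the question asks for. This is only a sketch: the name clean_years is hypothetical, and the trimws() call is an added assumption (that trailing spaces after "BC" should not stop the match), not part of the code above.

```r
# Hypothetical helper wrapping the three steps; the name is illustrative only.
clean_years <- function(x) {
  # Assumption: strip surrounding whitespace so "2287 BC " still ends in "BC".
  x <- trimws(as.character(x))
  is_bc <- grepl("BC$", x)                   # TRUE when the entry ends in "BC"
  digits <- as.numeric(gsub("\\D", "", x))   # drop every non-digit character
  ifelse(is_bc, -1, 1) * (digits %% 10000)   # keep last 4 digits, negate for BC
}

bad_format <- c("1969*", "--1979--", "1618", "19.42", "1111983", "1981",
                "1-9-3-2", "1983", "1977", "1954", "1943", "1968", "2287 BC",
                "1998", "..1911..", "1961")

# Expected values taken from the question.
correct_format <- c(1969, 1979, 1618, 1942, 1983, 1981, 1932, 1983, 1977,
                    1954, 1943, 1968, -2287, 1998, 1911, 1961)

result <- clean_years(bad_format)

# Stops with an error if any cleaned value differs from the expected one.
stopifnot(isTRUE(all.equal(result, correct_format)))
```

Note that the modulo step only trims extra leading digits. An entry with fewer than four digits passes through unchanged, and an entry with no digits at all becomes NA, so it may be worth inspecting such values separately.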