I have this sample dataframe
df<-data.frame(old_farm=c("Yes", "Yes","Yes", "No", "No", "No", NA ),
env_year=c(2011, 2020,2019,2010,2010,2010, NA),
global_M=c("Yes", "Yes", "No", "Yes","Yes", "No",NA ),
audit_year=c(2014, NA,NA,2010,NA,NA, NA))
df
old_farm env_year global_M audit_year
1 Yes 2011 Yes 2014
2 Yes 2020 Yes NA
3 Yes 2019 No NA
4 No 2010 Yes 2010
5 No 2010 Yes NA
6 No 2010 No NA
7 <NA> NA <NA> NA
I want to get this output df_N which is the same as the previous one df, with the addition of the last column "Situation"
df_N<-df<-data.frame(old_farm=c("Yes", "Yes","Yes", "No", "No", "No" , NA ),
env_year=c(2011, 2020,2019,2010,2010,2010, NA),
global_M=c("Yes", "Yes", "No", "Yes","Yes", "No", NA ),
audit_year=c(2014, NA,NA,2010,NA,NA,NA),
Situation=c("New costat 2011, Closed antex 2014",
"New costat 2020",
"New costat 2019",
"Closed antex 2010",
"missing entry",
"Full exit", "Not at all"))
df_N
old_farm env_year global_M audit_year Situation
1 Yes 2011 Yes 2014 New costat 2011, Closed antex 2014
2 Yes 2020 Yes NA New costat 2020
3 Yes 2019 No NA New costat 2019
4 No 2010 Yes 2010 Closed antex 2010
5 No 2010 Yes NA missing entry
6 No 2010 No NA Full exit
7 <NA> NA <NA> NA Not at all
Explanation of how the situation column is created:
First row: When old_farm=Yes AND global_M=Yes AND I have a non-NA value in column audit_year, then in situation column, we must always have "New costat" and "Closed antex" under this condition, and we just add the year 2011 from "env_year" column next to "New costat" and the year 2014 from "audit_year" next to "Closed antex".
Second row: When old_farm=Yes AND global_M=Yes AND I have an-NA value in column audit_year, then in situation column, we must always have " "New costat" under this condition, and we just add the year 2020 from "env_year" column next to "New costat".
Third row: When old_farm=Yes AND global_M=No AND I have an-NA value in column audit_year, then in situation column, we must always have " "New costat" under this condition, and we just add the year 2019 from "env_year" column next to "New costat".
Fourth row: When old_farm=No AND global_M=Ye AND I have a non-NA value in column audit_year, then in situation column, we must always have "Closed antex" under this condition, and we just add the year 2010 from "audit_year" column next to "Closed antex".
Fifth row: When old_farm=No AND global_M=Yes AND I have an NA value in column audit_year, then in situation column, we must always have "missing entry"
Sixth row: When old_farm=No AND global_M=No AND I have an NA value in column audit_year, then in situation column, we must always have "Full exit". And the last row whic is self-explanatory.
I suspect that we can use the apply function but I am not sure how to do that.
In reality, I have thousands of such rows where these 7 conditions are repeated and so a general code would be convenient
You can use dplyr::case_when()
to control which values are assigned to a new Situation column. The dplyr::mutate()
function creates the new column. The .default = NA
will return NA where none of the conditions in case_when()
are met.
library(dplyr)
df_N <- df |>
mutate(Situation = case_when(
old_farm == "Yes" & global_M == "Yes" & !is.na(audit_year) ~
paste0("New costat ", env_year, ", Closed antex ", audit_year),
old_farm == "Yes" & global_M == "Yes" & is.na(audit_year) ~
paste0("New costat ", env_year),
old_farm == "Yes" & global_M == "No" & is.na(audit_year) ~
paste0("New costat ", env_year),
old_farm == "No" & global_M == "Yes" & !is.na(audit_year) ~
paste0("Closed antex ", audit_year),
old_farm == "No" & global_M == "Yes" & is.na(audit_year) ~
paste0("missing entry"),
old_farm == "No" & global_M == "No" & is.na(audit_year) ~
paste0("Full exit"),
rowSums(across(everything(), is.na)) == ncol(df) ~
"Not at all",
.default = NA
))
df_N
# old_farm env_year global_M audit_year Situation
# 1 Yes 2011 Yes 2014 New costat 2011, Closed antex 2014
# 2 Yes 2020 Yes NA New costat 2020
# 3 Yes 2019 No NA New costat 2019
# 4 No 2010 Yes 2010 Closed antex 2010
# 5 No 2010 Yes NA missing entry
# 6 No 2010 No NA Full exit
# 7 <NA> NA <NA> NA Not at all
This will work on your sample dataset, but as is often the case (pun intended), there may be data entry errors where the "Yes" and "No" strings are inconsistent in the old_farm and global_M columns. To counter this, you can use something like tolower()
:
# Example df with inconsistent values in old_farm and global_M
df <- data.frame(old_farm=c("Yes", "yes","YeS", "no", "nO", "No", NA),
env_year=c(2011, 2020,2019,2010,2010,2010,NA),
global_M=c("yes", "YEs", "no", "Yes","Yes", "no", NA),
audit_year=c(2014, NA,NA,2010,NA,NA,NA))
df_N <- df |>
mutate(Situation = case_when(
tolower(old_farm) == "yes" & tolower(global_M) == "yes" & !is.na(audit_year) ~
paste0("New costat ", env_year, ", Closed antex ", audit_year),
tolower(old_farm) == "yes" & tolower(global_M) == "yes" & is.na(audit_year) ~
paste0("New costat ", env_year),
tolower(old_farm) == "yes" & tolower(global_M) == "no" & is.na(audit_year) ~
paste0("New costat ", env_year),
tolower(old_farm) == "no" & tolower(global_M) == "yes" & !is.na(audit_year) ~
paste0("Closed antex ", audit_year),
tolower(old_farm) == "no" & tolower(global_M) == "yes" & is.na(audit_year) ~
paste0("missing entry"),
tolower(old_farm) == "no" & tolower(global_M) == "no" & is.na(audit_year) ~
paste0("Full exit"),
rowSums(across(everything(), is.na)) == ncol(df) ~
"Not at all",
.default = NA
))