rdatabase

Create a column conditional on multiple other character and numerical columns


I have this sample dataframe

df<-data.frame(old_farm=c("Yes", "Yes","Yes", "No", "No", "No",  NA ),
env_year=c(2011, 2020,2019,2010,2010,2010, NA),
global_M=c("Yes", "Yes", "No", "Yes","Yes", "No",NA ),
audit_year=c(2014, NA,NA,2010,NA,NA, NA))
 
df
old_farm env_year global_M audit_year
1      Yes     2011      Yes       2014
2      Yes     2020      Yes         NA
3      Yes     2019       No         NA
4       No     2010      Yes       2010
5       No     2010      Yes         NA
6       No     2010       No         NA
7     <NA>       NA     <NA>         NA

I want to get this output df_N which is the same as the previous one df, with the addition of the last column "Situation"

df_N<-df<-data.frame(old_farm=c("Yes", "Yes","Yes", "No", "No", "No" , NA ),
env_year=c(2011, 2020,2019,2010,2010,2010, NA),
global_M=c("Yes", "Yes", "No", "Yes","Yes", "No", NA ),
audit_year=c(2014, NA,NA,2010,NA,NA,NA),
Situation=c("New costat 2011, Closed antex 2014",
   "New costat 2020",
 "New costat 2019",
"Closed antex 2010",
"missing entry",
"Full exit", "Not at all"))



  df_N
      old_farm env_year global_M audit_year                          Situation
1      Yes     2011      Yes       2014 New costat 2011, Closed antex 2014
2      Yes     2020      Yes         NA                    New costat 2020
3      Yes     2019       No         NA                    New costat 2019
4       No     2010      Yes       2010                  Closed antex 2010
5       No     2010      Yes         NA                      missing entry
6       No     2010       No         NA                          Full exit
7     <NA>       NA     <NA>         NA                         Not at all

Explanation of how the situation column is created:

First row: When old_farm=Yes AND global_M=Yes AND I have a non-NA value in column audit_year, then in situation column, we must always have "New costat" and "Closed antex" under this condition, and we just add the year 2011 from "env_year" column next to "New costat" and the year 2014 from "audit_year" next to "Closed antex".

Second row: When old_farm=Yes AND global_M=Yes AND I have an-NA value in column audit_year, then in situation column, we must always have " "New costat" under this condition, and we just add the year 2020 from "env_year" column next to "New costat".

Third row: When old_farm=Yes AND global_M=No AND I have an-NA value in column audit_year, then in situation column, we must always have " "New costat" under this condition, and we just add the year 2019 from "env_year" column next to "New costat".

Fourth row: When old_farm=No AND global_M=Ye AND I have a non-NA value in column audit_year, then in situation column, we must always have "Closed antex" under this condition, and we just add the year 2010 from "audit_year" column next to "Closed antex".

Fifth row: When old_farm=No AND global_M=Yes AND I have an NA value in column audit_year, then in situation column, we must always have "missing entry"

Sixth row: When old_farm=No AND global_M=No AND I have an NA value in column audit_year, then in situation column, we must always have "Full exit". And the last row whic is self-explanatory.

I suspect that we can use the apply function but I am not sure how to do that.

In reality, I have thousands of such rows where these 7 conditions are repeated and so a general code would be convenient


Solution

  • You can use dplyr::case_when() to control which values are assigned to a new Situation column. The dplyr::mutate() function creates the new column. The .default = NA will return NA where none of the conditions in case_when() are met.

    library(dplyr)
    
    df_N <- df |>
      mutate(Situation = case_when(
        old_farm == "Yes" & global_M == "Yes" & !is.na(audit_year) ~
          paste0("New costat ", env_year, ", Closed antex ", audit_year),
        old_farm == "Yes" & global_M == "Yes" & is.na(audit_year) ~
          paste0("New costat ", env_year),
        old_farm == "Yes" & global_M == "No" & is.na(audit_year) ~
          paste0("New costat ", env_year),
        old_farm == "No" & global_M == "Yes" & !is.na(audit_year) ~
          paste0("Closed antex ", audit_year),
        old_farm == "No" & global_M == "Yes" & is.na(audit_year) ~
          paste0("missing entry"),
        old_farm == "No" & global_M == "No" & is.na(audit_year) ~
          paste0("Full exit"),
        rowSums(across(everything(), is.na)) == ncol(df) ~ 
          "Not at all",
        .default = NA
      ))
    
    df_N
    #   old_farm env_year global_M audit_year                          Situation
    # 1      Yes     2011      Yes       2014 New costat 2011, Closed antex 2014
    # 2      Yes     2020      Yes         NA                    New costat 2020
    # 3      Yes     2019       No         NA                    New costat 2019
    # 4       No     2010      Yes       2010                  Closed antex 2010
    # 5       No     2010      Yes         NA                      missing entry
    # 6       No     2010       No         NA                          Full exit
    # 7     <NA>       NA     <NA>         NA                         Not at all
    

    This will work on your sample dataset, but as is often the case (pun intended), there may be data entry errors where the "Yes" and "No" strings are inconsistent in the old_farm and global_M columns. To counter this, you can use something like tolower():

    # Example df with inconsistent values in old_farm and global_M
    df <- data.frame(old_farm=c("Yes", "yes","YeS", "no", "nO", "No", NA),
                     env_year=c(2011, 2020,2019,2010,2010,2010,NA),
                     global_M=c("yes", "YEs", "no", "Yes","Yes", "no", NA),
                     audit_year=c(2014, NA,NA,2010,NA,NA,NA))
    
    df_N <- df |>
      mutate(Situation = case_when(
        tolower(old_farm) == "yes" & tolower(global_M) == "yes" & !is.na(audit_year) ~
          paste0("New costat ", env_year, ", Closed antex ", audit_year),
        tolower(old_farm) == "yes" & tolower(global_M) == "yes" & is.na(audit_year) ~
          paste0("New costat ", env_year),
        tolower(old_farm) == "yes" & tolower(global_M) == "no" & is.na(audit_year) ~
          paste0("New costat ", env_year),
        tolower(old_farm) == "no" & tolower(global_M) == "yes" & !is.na(audit_year) ~
          paste0("Closed antex ", audit_year),
        tolower(old_farm) == "no" & tolower(global_M) == "yes" & is.na(audit_year) ~
          paste0("missing entry"),
        tolower(old_farm) == "no" & tolower(global_M) == "no" & is.na(audit_year) ~
          paste0("Full exit"),
        rowSums(across(everything(), is.na)) == ncol(df) ~ 
          "Not at all",
        .default = NA
      ))