dplyrmergetidyversenamissing-data

Replacing NA values with data from another Dataset


I have this dataset, here is an example:

df1

State    Year   ColA   ColB ColC 
OH         2000   200   190  180
OH         2002   218   200  176
OH         2010   218   205  185
KY         2000   200   190  NA
KY         2002   218   200  NA
KY         2010   218   205  NA
PA         2000   200   190  180
PA         2002   214   200  178
PA         2010   219   205  185
MI         2000   200   190  NA
MI         2002   218   200  NA
MI         2010   218   205  NA

As you see, I have a few missing values in ColC. I have these values in a different dataset:

df2

 State    Year   ColB ColC_Fill
 KY        2000   180  200
 KY        2002   176  210
 KY        2010   185  211
 MI        2000   200  211   
 MI        2002   200  206   
 MI        2010   200  205   

I want to get this:

df1

State    Year   ColA   ColB ColC 
OH         2000   200   190  180
OH         2002   218   200  176
OH         2010   218   205  185
KY         2000   200   190  200
KY         2002   218   200  210
KY         2010   218   205  211
PA         2000   200   190  180
PA         2002   214   200  178
PA         2010   219   205  185
MI         2000   200   190  211
MI         2002   218   200  206
MI         2010   218   205  205

I want to fill in the missing values of df1 with those in df2. I have tried this code:

 df$ColC[is.na(df$ColC)]<-left_join(df1[,c(1:2,5)], df2[,c(1:2,4)])

This code seemed to work a few days ago, but when I tried it again yesterday, it is replacing ColC with a list of all the variables. I have tried several iterations, but have had no luck. Any help would be appreciated! Thanks in advance.


Solution

  • Many roads lead to Rome. Here is a potential solution:

    dataset1 <- dataset1 %>%
      left_join(dataset2 %>% select(State, Year, ColC_Fill), by = c("State", "Year")) %>%
      mutate(ColC = ifelse(is.na(ColC), ColC_Fill, ColC)  ) %>%
      select(-ColC_Fill)  
    

    First, I select the relevant columns by name from dataset2.

    I then do a left_join() on "State", "Year" .

    Third, I replace ColC wil ColC_fill wehere ColC is NA.

    Finally, I drop ColC_Fill as it is not needed anymore.