Below are my two dataframes, df1 and df2
df1 <- data.frame(id=c("632592651","633322173","634703802","634927873","635812953","636004739","636101211","636157799","636263106","636752420"),text=c("asdf","cat","dog","mouse","elephant","goose","rat","mice","kitty","kitten"),response=c("y","y","y","n","n","y","y","n","n","y"))
id text response
1 632592651 asdf y
2 633322173 cat y
3 634703802 dog y
4 634927873 mouse n
5 635812953 elephant n
6 636004739 goose y
7 636101211 rat y
8 636157799 mice n
9 636263106 kitty n
10 636752420 kitten y
df2 <- data.frame(id=c("632592651","633322173","634703802","634927873","635812953","636004739","636101211","636157799","636263106","636752420","636809222","2004722036","2004894388","2005045755","2005535472","2005630542","2005788781","2005809679","2005838317","2005866692"),
text=c("asdf_xyz","cat","dog","mouse","elephant","goose","rat","mice","kitty","kitten","tiger_xyz","lion","leopard","ostrich","kangaroo","platypus","fish","reptile","mammals","amphibians_xyz"),
volume=c("1234","432","324","333","2223","412346","7456","3456","2345","2345","6","345","23","2","4778","234","8675","3459","8","9"))
id text volume
1 632592651 asdf_xyz 1234
2 633322173 cat 432
3 634703802 dog 324
4 634927873 mouse 333
5 635812953 elephant 2223
6 636004739 goose 412346
7 636101211 rat 7456
8 636157799 mice 3456
9 636263106 kitty 2345
10 636752420 kitten 2345
11 636809222 tiger_xyz 6
12 2004722036 lion 345
13 2004894388 leopard 23
14 2005045755 ostrich 2
15 2005535472 kangaroo 4778
16 2005630542 platypus 234
17 2005788781 fish 8675
18 2005809679 reptile 3459
19 2005838317 mammals 8
20 2005866692 amphibians_xyz 9
I would like to create a dataframe, df3, that return row values that match column 'id' and 'text' in both df1 and df2 but return NA to the mismatch in column 'text'
Desired output of df3:
id text response volume
1 632592651 NA y 1234
2 633322173 cat y 432
3 634703802 dog y 324
4 634927873 mouse n 333
5 635812953 elephant n 2223
6 636004739 goose y 412346
7 636101211 rat y 7456
8 636157799 mice n 3456
9 636263106 kitty n 2345
10 636752420 kitten y 2345
What I have tried:
df3 <- df1 %>%
anti_join(df2, by = c("id","text"))
I then change all non zero value in column 'text' to NA
df3$text[df3$text != 0] <- NA
Did inner_join between df1 and df2 to get df4 (match both "id" and "text")
df4 <- df1 %>%
inner_join(df2, by = c("id","text"))
And inner_join df5 with df3 to add "volume" column
df5 <- df3 %>%
inner_join(df2, by= c("id")) %>%
select(id,text.x,response,volume)
Change column name 'text.x' to 'text'
colnames(df5)[colnames(df5)%in%c("text.x")] <- c("text")
And finally binding the df4 and df5 to get the desired output:
df6 <- rbind(df5,df4)
id text response volume
1 632592651 <NA> y 1234
2 633322173 cat y 432
3 634703802 dog y 324
4 634927873 mouse n 333
5 635812953 elephant n 2223
6 636004739 goose y 412346
7 636101211 rat y 7456
8 636157799 mice n 3456
9 636263106 kitty n 2345
10 636752420 kitten y 2345
It's getting extremely difficult to deal with if I have many columns. Can someone advise how to simplify this method? Thanking in advance.
Using dplyr
and only joining by
"id" you can simplify it like this:
library(dplyr)
inner_join(x = df1,
y = df2,
by = "id") %>%
mutate_if(is.factor, as.character) %>%
mutate(text = ifelse(test = text.x != text.y,
yes = NA,
no = text.x)) %>%
select(id, text, response, volume)
#> id text response volume
#> 1 632592651 <NA> y 1234
#> 2 633322173 cat y 432
#> 3 634703802 dog y 324
#> 4 634927873 mouse n 333
#> 5 635812953 elephant n 2223
#> 6 636004739 goose y 412346
#> 7 636101211 rat y 7456
#> 8 636157799 mice n 3456
#> 9 636263106 kitty n 2345
#> 10 636752420 kitten y 2345