I have two data tables with species data. One, which we can call df1, looks like this:
name biovar breadth
Aegotheles novaezealandiae 01 5.2594280
Aegotheles novaezealandiae 05 1.7984211
Aegotheles novaezealandiae 06 7.1559822
Alauda arvensis 01 3.2075287
Alauda arvensis 05 5.2834796
Alauda arvensis 06 1.7818565
Alopochen kervazoi 01 0.0000000
Alopochen kervazoi 05 0.0000000
Alopochen kervazoi 06 0.0000000
Anas chlorotis 01 4.2080102
Anas chlorotis 05 6.6606529
Anas chlorotis 06 4.6723743
Anas crecca 01 14.0501483
Anas crecca 05 6.9732017
Anas crecca 06 37.8976837
And another, which we can call df2, which looks like this:
name Locomotion_Category biovar breadth
Aegotheles_novazelandiae Flightless 1 5.2594280
Aegotheles_novazelandiae Flightless 5 1.7984211
Aegotheles_novazelandiae Flightless 6 7.1559822
Alauda_arvensis True Fliers 1 3.2075287
Alauda_arvensis True Fliers 5 5.2834796
Alauda_arvensis True Fliers 6 1.7818565
Anas_chlorotis True Fliers 1 4.2080102
Anas_chlorotis True Fliers 5 6.6606529
Anas_chlorotis True Fliers 6 4.6723743
Anas_platyrhynchos True Fliers 1 29.5229514
Anas_platyrhynchos True Fliers 5 16.7722113
Anas_platyrhynchos True Fliers 6 51.6343828
As you can see, df1 has more species in it, but df2 included information of the Locomotion_Category. I want to merge the datasets to get something like this:
name Locomotion_Category biovar breadth
Aegotheles_novazelandiae Flightless 1 5.2594280
Aegotheles_novazelandiae Flightless 5 1.7984211
Aegotheles_novazelandiae Flightless 6 7.1559822
Alauda_arvensis True Fliers 1 3.2075287
Alauda_arvensis True Fliers 5 5.2834796
Alauda_arvensis True Fliers 6 1.7818565
Alopochen kervazoi NA 01 0.0000000
Alopochen kervazoi NA 05 0.0000000
Alopochen kervazoi NA 06 0.0000000
Anas_chlorotis True Fliers 1 4.2080102
Anas_chlorotis True Fliers 5 6.6606529
Anas_chlorotis True Fliers 6 4.6723743
Anas crecca NA 01 14.0501483
Anas crecca NA 05 6.9732017
Anas crecca NA 06 37.8976837
Anas_platyrhynchos True Fliers 1 29.5229514
Anas_platyrhynchos True Fliers 5 16.7722113
Anas_platyrhynchos True Fliers 6 51.6343828
When I try
merge(df1, df2)
I get an empty dataset. I also tried to see if I could filter out the data from df1 which was in df2, in order to try to at least isolate the species in df1 which are not included in df1, by doing
OldNames <- df2$name
df3 <- filter(df1, ! name %in% OldNames)
But that returns a dataset identify to df1. I also tried setting the row names to the species names, but I got an error message there, since row names cannot repeat.
After fixing the spelling error in df1$name
(Aegotheles novazelandiae), and replacing spaces with underscores, you can obtain your desired output by specifying all=TRUE
, which is an outer join.
df1$name <- sub(" ", "_", df1$name)
merge(df1, df2, all=TRUE)
name biovar breadth Locomotion_Category
1 Aegotheles_novazelandiae 1 5.259428 Flightless
2 Aegotheles_novazelandiae 5 1.798421 Flightless
3 Aegotheles_novazelandiae 6 7.155982 Flightless
4 Alauda_arvensis 1 3.207529 True Fliers
5 Alauda_arvensis 5 5.283480 True Fliers
6 Alauda_arvensis 6 1.781856 True Fliers
7 Alopochen_kervazoi 1 0.000000 <NA>
8 Alopochen_kervazoi 5 0.000000 <NA>
9 Alopochen_kervazoi 6 0.000000 <NA>
10 Anas_chlorotis 1 4.208010 True Fliers
11 Anas_chlorotis 5 6.660653 True Fliers
12 Anas_chlorotis 6 4.672374 True Fliers
13 Anas_crecca 1 14.050148 <NA>
14 Anas_crecca 5 6.973202 <NA>
15 Anas_crecca 6 37.897684 <NA>
16 Anas_platyrhynchos 1 29.522951 True Fliers
17 Anas_platyrhynchos 5 16.772211 True Fliers
18 Anas_platyrhynchos 6 51.634383 True Fliers
However, it is recommended to specify the "by" columns explicity, and not let R guess. if not specified, R uses the common columns, here they are name
, biovar
, and breadth
. Your actual data may have more columns.
Data:
df1 <- structure(list(name = c("Aegotheles novazelandiae", "Aegotheles novazelandiae",
"Aegotheles novazelandiae", "Alauda arvensis", "Alauda arvensis",
"Alauda arvensis", "Alopochen kervazoi", "Alopochen kervazoi",
"Alopochen kervazoi", "Anas chlorotis", "Anas chlorotis", "Anas chlorotis",
"Anas crecca", "Anas crecca", "Anas crecca"), biovar = c(1, 5,
6, 1, 5, 6, 1, 5, 6, 1, 5, 6, 1, 5, 6), breadth = c(5.259428,
1.7984211, 7.1559822, 3.2075287, 5.2834796, 1.7818565, 0, 0,
0, 4.2080102, 6.6606529, 4.6723743, 14.0501483, 6.9732017, 37.8976837
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-15L))
df2 <- structure(list(name = c("Aegotheles_novazelandiae", "Aegotheles_novazelandiae",
"Aegotheles_novazelandiae", "Alauda_arvensis", "Alauda_arvensis",
"Alauda_arvensis", "Anas_chlorotis", "Anas_chlorotis", "Anas_chlorotis",
"Anas_platyrhynchos", "Anas_platyrhynchos", "Anas_platyrhynchos"
), Locomotion_Category = c("Flightless", "Flightless", "Flightless",
"True Fliers", "True Fliers", "True Fliers", "True Fliers", "True Fliers",
"True Fliers", "True Fliers", "True Fliers", "True Fliers"),
biovar = c(1, 5, 6, 1, 5, 6, 1, 5, 6, 1, 5, 6), breadth = c(5.259428,
1.7984211, 7.1559822, 3.2075287, 5.2834796, 1.7818565, 4.2080102,
6.6606529, 4.6723743, 29.5229514, 16.7722113, 51.6343828)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -12L))