rdplyrmergedataset

How to merge two datasets by column value, including all new columns from both datasets and including NA values?


I have two data tables with species data. One, which we can call df1, looks like this:

                       name biovar      breadth
Aegotheles novaezealandiae     01    5.2594280
Aegotheles novaezealandiae     05    1.7984211
Aegotheles novaezealandiae     06    7.1559822
           Alauda arvensis     01    3.2075287
           Alauda arvensis     05    5.2834796
           Alauda arvensis     06    1.7818565
        Alopochen kervazoi     01    0.0000000
        Alopochen kervazoi     05    0.0000000
        Alopochen kervazoi     06    0.0000000
            Anas chlorotis     01    4.2080102
            Anas chlorotis     05    6.6606529
            Anas chlorotis     06    4.6723743
               Anas crecca     01   14.0501483
               Anas crecca     05    6.9732017
               Anas crecca     06   37.8976837

And another, which we can call df2, which looks like this:

                    name Locomotion_Category biovar      breadth
Aegotheles_novazelandiae          Flightless      1    5.2594280
Aegotheles_novazelandiae          Flightless      5    1.7984211
Aegotheles_novazelandiae          Flightless      6    7.1559822
         Alauda_arvensis         True Fliers      1    3.2075287
         Alauda_arvensis         True Fliers      5    5.2834796
         Alauda_arvensis         True Fliers      6    1.7818565
          Anas_chlorotis         True Fliers      1    4.2080102
          Anas_chlorotis         True Fliers      5    6.6606529
          Anas_chlorotis         True Fliers      6    4.6723743
      Anas_platyrhynchos         True Fliers      1   29.5229514
      Anas_platyrhynchos         True Fliers      5   16.7722113
      Anas_platyrhynchos         True Fliers      6   51.6343828

As you can see, df1 has more species in it, but df2 included information of the Locomotion_Category. I want to merge the datasets to get something like this:

                    name Locomotion_Category biovar      breadth
Aegotheles_novazelandiae          Flightless      1    5.2594280
Aegotheles_novazelandiae          Flightless      5    1.7984211
Aegotheles_novazelandiae          Flightless      6    7.1559822
         Alauda_arvensis         True Fliers      1    3.2075287
         Alauda_arvensis         True Fliers      5    5.2834796
         Alauda_arvensis         True Fliers      6    1.7818565
      Alopochen kervazoi                 NA      01    0.0000000
      Alopochen kervazoi                 NA      05    0.0000000
      Alopochen kervazoi                 NA      06    0.0000000
          Anas_chlorotis         True Fliers      1    4.2080102
          Anas_chlorotis         True Fliers      5    6.6606529
          Anas_chlorotis         True Fliers      6    4.6723743
             Anas crecca                  NA     01   14.0501483
             Anas crecca                  NA     05    6.9732017
             Anas crecca                  NA     06   37.8976837
      Anas_platyrhynchos         True Fliers      1   29.5229514
      Anas_platyrhynchos         True Fliers      5   16.7722113
      Anas_platyrhynchos         True Fliers      6   51.6343828

When I try

merge(df1, df2) 

I get an empty dataset. I also tried to see if I could filter out the data from df1 which was in df2, in order to try to at least isolate the species in df1 which are not included in df1, by doing

OldNames <- df2$name
df3 <- filter(df1, ! name %in% OldNames)

But that returns a dataset identify to df1. I also tried setting the row names to the species names, but I got an error message there, since row names cannot repeat.


Solution

  • After fixing the spelling error in df1$name (Aegotheles novazelandiae), and replacing spaces with underscores, you can obtain your desired output by specifying all=TRUE, which is an outer join.

    df1$name <- sub(" ", "_", df1$name)
    
    merge(df1, df2, all=TRUE)
    

                           name biovar   breadth Locomotion_Category
    1  Aegotheles_novazelandiae      1  5.259428          Flightless
    2  Aegotheles_novazelandiae      5  1.798421          Flightless
    3  Aegotheles_novazelandiae      6  7.155982          Flightless
    4           Alauda_arvensis      1  3.207529         True Fliers
    5           Alauda_arvensis      5  5.283480         True Fliers
    6           Alauda_arvensis      6  1.781856         True Fliers
    7        Alopochen_kervazoi      1  0.000000                <NA>
    8        Alopochen_kervazoi      5  0.000000                <NA>
    9        Alopochen_kervazoi      6  0.000000                <NA>
    10           Anas_chlorotis      1  4.208010         True Fliers
    11           Anas_chlorotis      5  6.660653         True Fliers
    12           Anas_chlorotis      6  4.672374         True Fliers
    13              Anas_crecca      1 14.050148                <NA>
    14              Anas_crecca      5  6.973202                <NA>
    15              Anas_crecca      6 37.897684                <NA>
    16       Anas_platyrhynchos      1 29.522951         True Fliers
    17       Anas_platyrhynchos      5 16.772211         True Fliers
    18       Anas_platyrhynchos      6 51.634383         True Fliers
    

    However, it is recommended to specify the "by" columns explicity, and not let R guess. if not specified, R uses the common columns, here they are name, biovar, and breadth. Your actual data may have more columns.


    Data:

    df1 <- structure(list(name = c("Aegotheles novazelandiae", "Aegotheles novazelandiae", 
    "Aegotheles novazelandiae", "Alauda arvensis", "Alauda arvensis", 
    "Alauda arvensis", "Alopochen kervazoi", "Alopochen kervazoi", 
    "Alopochen kervazoi", "Anas chlorotis", "Anas chlorotis", "Anas chlorotis", 
    "Anas crecca", "Anas crecca", "Anas crecca"), biovar = c(1, 5, 
    6, 1, 5, 6, 1, 5, 6, 1, 5, 6, 1, 5, 6), breadth = c(5.259428, 
    1.7984211, 7.1559822, 3.2075287, 5.2834796, 1.7818565, 0, 0, 
    0, 4.2080102, 6.6606529, 4.6723743, 14.0501483, 6.9732017, 37.8976837
    )), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
    -15L))
    
    df2 <- structure(list(name = c("Aegotheles_novazelandiae", "Aegotheles_novazelandiae", 
    "Aegotheles_novazelandiae", "Alauda_arvensis", "Alauda_arvensis", 
    "Alauda_arvensis", "Anas_chlorotis", "Anas_chlorotis", "Anas_chlorotis", 
    "Anas_platyrhynchos", "Anas_platyrhynchos", "Anas_platyrhynchos"
    ), Locomotion_Category = c("Flightless", "Flightless", "Flightless", 
    "True Fliers", "True Fliers", "True Fliers", "True Fliers", "True Fliers", 
    "True Fliers", "True Fliers", "True Fliers", "True Fliers"), 
        biovar = c(1, 5, 6, 1, 5, 6, 1, 5, 6, 1, 5, 6), breadth = c(5.259428, 
        1.7984211, 7.1559822, 3.2075287, 5.2834796, 1.7818565, 4.2080102, 
        6.6606529, 4.6723743, 29.5229514, 16.7722113, 51.6343828)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -12L))