rr-sf

filter rows in sf where geometry and several attributes are equal


Hi all!

I have two sf tables of point field observations from different sources. Records may overlap without being true duplicates (e.g., the same species and date, but a different author).

      gbifID date            year order family genus species author    Id name           geometry
 *    <int64> <chr>          <int> <chr> <chr>  <chr> <chr>   <chr>  <int> <chr>       <POINT [m]>
 1  899938536 2012-12-16T13…  2012 Arti… Cervi… Alces Alces … Kiril…    47 NA    (4194534 7522373)
 2  891131236 2013-03-12T18…  2013 Arti… Cervi… Alces Alces … Kiril…    47 NA    (4194126 7523307)
 3  891085934 2012-06-16T18…  2012 Arti… Cervi… Alces Alces … Kiril…    47 NA    (4193913 7523139)
 4  891072950 2012-02-18T14…  2012 Arti… Cervi… Alces Alces … Kiril…    47 NA    (4195533 7522884)
 5 4909248678 2024-07-08T18…  2024 Sori… Talpi… Talpa Talpa … Марга…    47 NA    (4194898 7522782)
 6 4846782212 2024-03-31T13…  2024 Carn… Muste… Must… Mustel… Danil…    47 NA    (4194659 7522416)
 7 4606861558 2024-03-31T13…  2024 Rode… Crice… Onda… Ondatr… Danil…    47 NA    (4194818 7522231)
 8 4516828012 2024-01-22T12…  2024 Rode… Casto… Cast… Castor… Георг…    47 NA    (4193830 7523007)
 9 4507665182 2023-12-29T15…  2023 Rode… Casto… Cast… Castor… batwo…    47 NA    (4193680 7523186)
10 4453881351 2023-11-27T12…  2023 Rode… Casto… Cast… Castor… Георг…    47 NA    (4194139 7522522)

   date       species               author                        Id name           geometry
 * <date>     <chr>                 <chr>                      <int> <chr>       <POINT [m]>
 1 2013-03-12 Alces alces           bushman_k                     47 NA    (4194129 7523300)
 2 2019-11-06 Apodemus agrarius     Субачев В.                    47 NA    (4194908 7522250)
 3 2022-07-26 Pipistrellus nathusii Крускоп С.В., Крускоп А.С.    47 NA    (4194207 7522606)
 4 2019-06-15 Rattus norvegicus     Субачев В.                    47 NA    (4195030 7522289)
 5 2012-05-06 Sciurus vulgaris      Irina Bobyleva                47 NA    (4195175 7523875)
 6 2021-06-16 Castor fiber          Мазаева Елена                 47 NA    (4194196 7522587)
 7 2012-02-18 Alces alces           bushman_k                     47 NA    (4195531 7522884)
 8 2022-07-26 Myotis daubentonii    Крускоп С.В., Крускоп А.С.    47 NA    (4193594 7523399)
 9 2018-06-17 Sorex araneus         Крускоп С.В.                  47 NA    (4195442 7522151)
10 2015-09-11 Sciurus vulgaris      melodi_96                     47 NA    (4194619 7523042)

How can I extract the rows that match across the two tables, i.e. almost the same location plus the same species and date? Desired output:

date        species      author     Id name  geometry
2011-02-19  Alces Alces  Kiril…     50 NA    (4201767 7523209)
2011-02-19  Alces alces  bushman_k  50 NA    (4201765 7523201)

Solution

  • In the end I found a solution that works well for me.

    Here are two toy tables standing in for GBIF and iNaturalist data:

    library(tidyverse)
    library(ggrepel)
    library(sf)
    
    # Toy GBIF records: full author names, planar x/y coordinates
    gbif <- tribble(
      ~species,    ~date,        ~author,    ~x,  ~y,
      "hare",      "2024-01-01", "Alicia",   1,   5,
      "tit",       "2024-01-01", "Alicia",   10,  15,
      "squirrel",  "2024-01-02", "Robert",   15,  10,
      "hare",      "2024-01-02", "Charles",  30,  40,
    ) |>
      mutate(date = as_date(date)) |>
      st_as_sf(coords = c("x", "y"))
    
    # Toy iNaturalist records: the same people under short names
    iNat <- tribble(
      ~species,    ~date,        ~author,    ~x,  ~y,
      "hare",      "2024-01-01", "Alice",    2,   6,
      "tit",       "2024-02-01", "Alice",    40,  35,
      "squirrel",  "2024-01-03", "Bob",      9,   15,
      "hare",      "2024-01-03", "Charlie",  50,  60,
    ) |>
      mutate(date = as_date(date)) |>
      st_as_sf(coords = c("x", "y"))
    
    ggplot() +
      geom_sf(data = gbif, alpha = .5) +
      geom_text_repel(data = gbif,
                      aes(label = paste(author, "\n", species, "\n", date),
                          geometry = geometry),
                      stat = "sf_coordinates",
                      size = 2, min.segment.length = 0) +
      geom_sf(data = iNat, color = "navy", alpha = .5) +
      geom_text_repel(data = iNat,
                      aes(label = paste(author, "\n", species, "\n", date),
                          geometry = geometry),
                      stat = "sf_coordinates",
                      size = 2, color = "navy", min.segment.length = 0)
    

    illustration of our points

    Here the same authors appear under full names in GBIF and under short names in iNaturalist, so the author column cannot serve as a matching key. The only real duplicate is the hare recorded on 2024-01-01 by Alicia (Alice). We can catch it like this:

    # Pair each GBIF point with the iNaturalist points within 20 units,
    # then keep only the pairs that agree on date and species
    duplicate <- st_join(gbif, iNat, left = FALSE, join = st_is_within_distance, dist = 20) |>
      filter(date.x == date.y & species.x == species.y)
    
    # GBIF is the more authoritative source, so clean the iNaturalist table
    iNat <- anti_join(as.data.frame(iNat), as.data.frame(duplicate),
                      by = join_by(species == species.y, date == date.y)) |>
      st_as_sf()
    
    rm(duplicate)
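    
    One caveat with the anti_join() step above: it matches on species and date only, so it would also drop an iNaturalist record of the same species on the same day that lies far from every GBIF point. A minimal sketch of a stricter variant follows, assuming a helper column inat_row (a hypothetical name, not part of the original tables) that tags each iNaturalist row before the join. The toy values are made up for illustration.

    ```r
    library(dplyr)
    library(sf)

    # One GBIF hare, and two iNaturalist hares on the same date:
    # only the first lies within 20 units of the GBIF point
    gbif_toy <- data.frame(species = "hare", date = as.Date("2024-01-01"),
                           author = "Alicia", x = 1, y = 5) |>
      st_as_sf(coords = c("x", "y"))

    inat_toy <- data.frame(species = c("hare", "hare"),
                           date = as.Date(c("2024-01-01", "2024-01-01")),
                           author = c("Alice", "Dan"),
                           x = c(2, 500), y = c(6, 500)) |>
      st_as_sf(coords = c("x", "y")) |>
      mutate(inat_row = row_number())  # hypothetical row tag

    # Same spatial join and attribute filter as before
    dup <- st_join(gbif_toy, inat_toy, left = FALSE,
                   join = st_is_within_distance, dist = 20) |>
      filter(date.x == date.y, species.x == species.y)

    # Drop only the iNaturalist rows that were actually paired with a GBIF point
    inat_clean <- inat_toy |>
      filter(!inat_row %in% dup$inat_row) |>
      select(-inat_row)
    ```

    With this variant the distant hare from Dan is kept, whereas matching on species and date alone would remove it together with Alice's record.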
    

    ...and build the combined table without duplicates:

    bind_rows(gbif, iNat)
    

    It takes more code, but gives more control.
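
    The tables in the question differ from the toy example in two ways: the GBIF date is a character timestamp (2012-12-16T13…), and the species capitalisation differs between sources (Alces Alces vs. Alces alces). Below is a rough sketch of normalising both before the join. It assumes the first ten characters of the GBIF timestamp are the ISO date; species_key is a hypothetical helper column, the time of day in the sample row is invented, and the coordinates and dates are taken from the desired output in the question.

    ```r
    library(dplyr)
    library(sf)

    # Stand-in for one GBIF row (time of day is an assumption)
    gbif_raw <- data.frame(species = "Alces Alces",
                           date = "2011-02-19T10:00:00",
                           x = 4201767, y = 7523209) |>
      st_as_sf(coords = c("x", "y"))

    # Stand-in for one row of the second table (date is already a Date)
    inat_raw <- data.frame(species = "Alces alces",
                           date = as.Date("2011-02-19"),
                           x = 4201765, y = 7523201) |>
      st_as_sf(coords = c("x", "y"))

    # Bring both tables to a common date type and a case-insensitive species key
    gbif_norm <- gbif_raw |>
      mutate(date = as.Date(substr(date, 1, 10)),
             species_key = tolower(species))
    inat_norm <- inat_raw |>
      mutate(species_key = tolower(species))

    # Nearby points that agree on date and normalised species
    matches <- st_join(gbif_norm, inat_norm, left = FALSE,
                       join = st_is_within_distance, dist = 20) |>
      filter(date.x == date.y, species_key.x == species_key.y)
    ```

    Comparing species.x and species.y directly would miss this pair, because the two strings differ in case.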