Hi all!
I have two sf tables of point field observations from different sources. Their records may overlap without being exact duplicates (e.g., the same species and date, but a different author).
gbifID date year order family genus species author Id name geometry
* <int64> <chr> <int> <chr> <chr> <chr> <chr> <chr> <int> <chr> <POINT [m]>
1 899938536 2012-12-16T13… 2012 Arti… Cervi… Alces Alces … Kiril… 47 NA (4194534 7522373)
2 891131236 2013-03-12T18… 2013 Arti… Cervi… Alces Alces … Kiril… 47 NA (4194126 7523307)
3 891085934 2012-06-16T18… 2012 Arti… Cervi… Alces Alces … Kiril… 47 NA (4193913 7523139)
4 891072950 2012-02-18T14… 2012 Arti… Cervi… Alces Alces … Kiril… 47 NA (4195533 7522884)
5 4909248678 2024-07-08T18… 2024 Sori… Talpi… Talpa Talpa … Марга… 47 NA (4194898 7522782)
6 4846782212 2024-03-31T13… 2024 Carn… Muste… Must… Mustel… Danil… 47 NA (4194659 7522416)
7 4606861558 2024-03-31T13… 2024 Rode… Crice… Onda… Ondatr… Danil… 47 NA (4194818 7522231)
8 4516828012 2024-01-22T12… 2024 Rode… Casto… Cast… Castor… Георг… 47 NA (4193830 7523007)
9 4507665182 2023-12-29T15… 2023 Rode… Casto… Cast… Castor… batwo… 47 NA (4193680 7523186)
10 4453881351 2023-11-27T12… 2023 Rode… Casto… Cast… Castor… Георг… 47 NA (4194139 7522522)
date species author Id name geometry
* <date> <chr> <chr> <int> <chr> <POINT [m]>
1 2013-03-12 Alces alces bushman_k 47 NA (4194129 7523300)
2 2019-11-06 Apodemus agrarius Субачев В. 47 NA (4194908 7522250)
3 2022-07-26 Pipistrellus nathusii Крускоп С.В., Крускоп А.С. 47 NA (4194207 7522606)
4 2019-06-15 Rattus norvegicus Субачев В. 47 NA (4195030 7522289)
5 2012-05-06 Sciurus vulgaris Irina Bobyleva 47 NA (4195175 7523875)
6 2021-06-16 Castor fiber Мазаева Елена 47 NA (4194196 7522587)
7 2012-02-18 Alces alces bushman_k 47 NA (4195531 7522884)
8 2022-07-26 Myotis daubentonii Крускоп С.В., Крускоп А.С. 47 NA (4193594 7523399)
9 2018-06-17 Sorex araneus Крускоп С.В. 47 NA (4195442 7522151)
10 2015-09-11 Sciurus vulgaris melodi_96 47 NA (4194619 7523042)
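How can I get the subset of such overlapping rows, i.e. records from both tables that share species and date and lie close to each other? Desired output: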
date species author Id name geometry
2011-02-19 Alces Alces Kiril… 50 NA (4201767 7523209)
2011-02-19 Alces alces bushman_k 50 NA (4201765 7523201)
In the end I found a solution that works well for me.
Here are two toy tables standing in for GBIF and iNaturalist data:
library(tidyverse)
library(ggrepel)
library(sf)
gbif <- tribble(
  ~species,    ~date,        ~author,   ~x, ~y,
  "hare",      "2024-01-01", "Alicia",   1,  5,
  "tit",       "2024-01-01", "Alicia",  10, 15,
  "squirrell", "2024-01-02", "Robert",  15, 10,
  "hare",      "2024-01-02", "Charles", 30, 40,
) |>
  mutate(date = as_date(date)) |>
  st_as_sf(coords = c("x", "y"))
iNat <- tribble(
  ~species,    ~date,        ~author,   ~x, ~y,
  "hare",      "2024-01-01", "Alice",    2,  6,
  "tit",       "2024-02-01", "Alice",   40, 35,
  "squirrell", "2024-01-03", "Bob",      9, 15,
  "hare",      "2024-01-03", "Charlie", 50, 60,
) |>
  mutate(date = as_date(date)) |>
  st_as_sf(coords = c("x", "y"))
ggplot() +
  geom_sf(data = gbif, alpha = .5) +
  geom_text_repel(data = gbif,
                  aes(label = paste(author, "\n", species, "\n", date),
                      geometry = geometry),
                  stat = "sf_coordinates",
                  size = 2, min.segment.length = 0) +
  geom_sf(data = iNat, color = "navy", alpha = .5) +
  geom_text_repel(data = iNat,
                  aes(label = paste(author, "\n", species, "\n", date),
                      geometry = geometry),
                  stat = "sf_coordinates",
                  size = 2, color = "navy", min.segment.length = 0)
In this example the same authors appear under their full names in GBIF and under short names in iNaturalist. The only real duplicate is the hare recorded on 2024-01-01 by Alicia (Alice). We can catch it like this:
# Pair every GBIF point with the iNaturalist points within 20 units,
# then keep only the pairs that agree on date and species
duplicate <- st_join(gbif, iNat, left = FALSE,
                     join = st_is_within_distance, dist = 20) |>
  filter(date.x == date.y & species.x == species.y)
# GBIF is the more trusted source, so let's clean the iNaturalist table
iNat <- anti_join(as.data.frame(iNat), as.data.frame(duplicate),
                  by = join_by(species == species.y, date == date.y)) |>
  st_as_sf()
rm(duplicate)
...and build a combined table without duplicates:
bind_rows(gbif, iNat)
More code, but more control.
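One caveat with the `anti_join()` step: it matches on species and date only, so it would also drop an iNaturalist record of the same species and date that lies far away from any GBIF point. If that matters for your data, a possible variant is to work with row indices instead. The sketch below is only an illustration under my own assumptions: the toy tables `gbif_pts` and `inat_pts` and the helper names `near`, `is_dup`, `inat_clean` and `combined` are made up for this example, and the 20-unit threshold is the same arbitrary value as above.

```r
library(sf)
library(dplyr)

# Hypothetical toy data (plain planar coordinates, no CRS)
gbif_pts <- tibble(
  species = c("hare", "tit"),
  date    = as.Date(c("2024-01-01", "2024-01-01")),
  x = c(1, 10), y = c(5, 15)
) |>
  st_as_sf(coords = c("x", "y"))

inat_pts <- tibble(
  species = c("hare", "hare", "tit"),
  date    = as.Date(c("2024-01-01", "2024-01-01", "2024-02-01")),
  x = c(2, 500, 40), y = c(6, 500, 35)
) |>
  st_as_sf(coords = c("x", "y"))

# For every iNaturalist row: indices of the GBIF rows within 20 units
near <- st_is_within_distance(inat_pts, gbif_pts, dist = 20)

# A row counts as a duplicate only if one of its spatial neighbours
# also has the same species and the same date
is_dup <- vapply(seq_along(near), function(i) {
  j <- near[[i]]
  any(gbif_pts$species[j] == inat_pts$species[i] &
        gbif_pts$date[j] == inat_pts$date[i])
}, logical(1))

# Drop only the flagged rows, then stack the two tables
inat_clean <- inat_pts[!is_dup, ]
combined <- bind_rows(gbif_pts, inat_clean)
```

Here the second hare in `inat_pts` has the same species and date as the GBIF hare but sits far away, so it is kept; only the nearby one is flagged and removed.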