I want to merge the information from 18 different databases providing information on the status of many species in different locations of the world. However, I only want to merge the rows where establishmentMeans for a specific taxonID and locationID are both introduced and uncertain. For these rows, I would then like to concatenate the values in the remaining columns of the data frame: degreeOfEstablishment, pathway, and source.
I have tried different solutions, but I end up either missing some values or adding values where they do not exist.
Here is an example of the data:
df1 <- data.frame(
taxonID = c(1, 1, 1, 1, 2, 2, 2),
locationID = c(1, 1, 1, 2, 3, 3, 4),
establishmentMeans = c("introduced", "uncertain", "vagrant", "uncertain", "introduced", "uncertain", "introduced"),
degreeOfEstablishment = c("established", "reproducing", NA, NA, "invasive", "failing", NA),
pathway = c("releasedForUse", "otherEscape", NA, NA, "unaided", "unaided", NA),
source = c("x", "y", "y", "x", "x", "x", "y"),
stringsAsFactors = FALSE
)
df2 <- data.frame(
taxonID = c(1, 1, 2, 2),
locationID = c(1, 2, 3, 5),
establishmentMeans = c("native", "native", "native", "native"),
source = c("z", "z", "z", "z"),
stringsAsFactors = FALSE
)
My code:
library(dplyr)

# merge data
dat <- merge(df1, df2, by = c("locationID", "taxonID","establishmentMeans"), all = TRUE)
# merge rows where a taxon is reported as introduced and uncertain in the same location
dat2 <- dat |>
  group_by(locationID, taxonID) |>
  mutate(
    establishmentMeans = if ("introduced" %in% establishmentMeans & "uncertain" %in% establishmentMeans) {
      "introduced; uncertain"
    } else {
      establishmentMeans
    }
  )
# merge the remaining information corresponding to the status of a species when introduced and uncertain
dat3 <- dat2 |>
  group_by(locationID, taxonID, establishmentMeans) |>
  mutate(
    across(
      c(starts_with("degreeOfEstablishment"), starts_with("pathway"), starts_with("source")),
      ~ paste(unique(na.omit(.)), collapse = "; "),
      .names = "{.col}"
    )
  ) |>
  ungroup()
This is what the output should look like:
out <- data.frame(
taxonID = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
locationID = c(1, 1, 1, 2, 2, 3, 3, 4, 5),
establishmentMeans = c("introduced; uncertain", "native", "vagrant", "uncertain", "native", "introduced; uncertain", "native", "introduced", "native"),
degreeOfEstablishment = c("established; reproducing", NA, NA, NA, NA, "invasive; failing", NA, NA, NA),
pathway = c("releasedForUse; otherEscape", NA, NA, NA, NA, "unaided", NA, NA, NA),
source = c("x; y", "z", "y", "x", "z", "x", "z", "y", "z"),
stringsAsFactors = FALSE
)
The first step when merging the rows for a species reported as introduced and uncertain is already causing issues, as some values go missing.
Another issue is that I am explicitly naming the columns whose values I want to concatenate: degreeOfEstablishment, pathway, and source. I should instead be concatenating the values from all remaining columns of both these rows, but I am not sure how to do this. Maybe using mutate(across(everything()))? Some of the original databases to merge have columns that other databases do not have, so I am not sure if this would cause issues at some point.
EDIT: fixed expected output, keeping the row with taxonID=1, locationID=2, establishmentMeans="uncertain"
Assuming that df1's row 4 should really be retained in the results, here's a dplyr pipe:
library(dplyr)
df1 |>
  mutate(
    .by = c(taxonID, locationID),
    g = establishmentMeans %in% c("introduced", "uncertain") &
      all(c("introduced", "uncertain") %in% establishmentMeans)
  ) |>
  reframe(
    .by = c(taxonID, locationID, g),
    across(everything(), ~ if (first(g)) paste(.x, collapse = "; ") else .x)
  ) |>
  select(-g) |>
  bind_rows(df2) |>
  arrange(taxonID, locationID)
# taxonID locationID establishmentMeans degreeOfEstablishment pathway source
# 1 1 1 introduced; uncertain established; reproducing releasedForUse; otherEscape x; y
# 2 1 1 vagrant <NA> <NA> y
# 3 1 1 native <NA> <NA> z
# 4 1 2 uncertain <NA> <NA> x
# 5 1 2 native <NA> <NA> z
# 6 2 3 introduced; uncertain invasive; failing unaided; unaided x; x
# 7 2 3 native <NA> <NA> z
# 8 2 4 introduced <NA> <NA> y
# 9 2 5 native <NA> <NA> z
The arrange(..) call is only there to order the rows for display here; it is not required for the process.
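Note that paste() keeps repeated values, and it would turn a missing value into the literal string "NA", which is why row 6 above shows unaided; unaided and x; x. If you would rather get the deduplicated values from your expected output, one option is to swap paste() for a small helper that drops NA and repeats before collapsing. This is only a sketch: collapse_unique is a name I made up for illustration, and it assumes every non-key column is character (a numeric column would need converting first, since collapsed and untouched groups must have the same type).

```r
library(dplyr)

df1 <- data.frame(
  taxonID = c(1, 1, 1, 1, 2, 2, 2),
  locationID = c(1, 1, 1, 2, 3, 3, 4),
  establishmentMeans = c("introduced", "uncertain", "vagrant", "uncertain",
                         "introduced", "uncertain", "introduced"),
  degreeOfEstablishment = c("established", "reproducing", NA, NA,
                            "invasive", "failing", NA),
  pathway = c("releasedForUse", "otherEscape", NA, NA, "unaided", "unaided", NA),
  source = c("x", "y", "y", "x", "x", "x", "y"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  taxonID = c(1, 1, 2, 2),
  locationID = c(1, 2, 3, 5),
  establishmentMeans = "native",
  source = "z",
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of dplyr): collapse the distinct non-missing
# values into one string, or return NA if nothing is left.
collapse_unique <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (length(vals) == 0) NA_character_ else paste(vals, collapse = "; ")
}

res <- df1 |>
  mutate(
    .by = c(taxonID, locationID),
    # TRUE only for the introduced/uncertain rows of a taxon/location
    # in which both statuses are present
    g = establishmentMeans %in% c("introduced", "uncertain") &
      all(c("introduced", "uncertain") %in% establishmentMeans)
  ) |>
  reframe(
    .by = c(taxonID, locationID, g),
    across(everything(), ~ if (first(g)) collapse_unique(.x) else .x)
  ) |>
  select(-g) |>
  bind_rows(df2) |>
  arrange(taxonID, locationID)
```

With this variant the merged row for taxonID 2, locationID 3 has pathway "unaided" and source "x" instead of the repeated values.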
Data:
df1 <- structure(list(taxonID = c(1, 1, 1, 1, 2, 2, 2), locationID = c(1, 1, 1, 2, 3, 3, 4), establishmentMeans = c("introduced", "uncertain", "vagrant", "uncertain", "introduced", "uncertain", "introduced"), degreeOfEstablishment = c("established", "reproducing", NA, NA, "invasive", "failing", NA), pathway = c("releasedForUse", "otherEscape", NA, NA, "unaided", "unaided", NA), source = c("x", "y", "y", "x", "x", "x", "y")), class = "data.frame", row.names = c(NA, -7L))
df2 <- structure(list(taxonID = c(1, 1, 2, 2), locationID = c(1, 2, 3, 5), establishmentMeans = c("native", "native", "native", "native"), source = c("z", "z", "z", "z")), class = "data.frame", row.names = c(NA, -4L))
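On the concern about the 18 databases not sharing the same columns: bind_rows() matches columns by name and fills the ones a table lacks with NA, so stacking everything first and collapsing afterwards should cope with that. Below is a rough sketch of that order of operations. The tables db_a, db_b, db_c and their values are invented stand-ins for your real sources, collapse_unique is again a made-up helper, and it assumes a column has the same type in every table that contains it.

```r
library(dplyr)

# Hypothetical stand-ins for the real databases; each has a different column set
db_a <- data.frame(
  taxonID = c(1, 1), locationID = c(1, 1),
  establishmentMeans = c("introduced", "uncertain"),
  pathway = c("releasedForUse", NA),
  source = c("x", "y"),
  stringsAsFactors = FALSE
)
db_b <- data.frame(
  taxonID = c(1, 2), locationID = c(1, 5),
  establishmentMeans = c("native", "native"),
  source = c("z", "z"),
  stringsAsFactors = FALSE
)
db_c <- data.frame(
  taxonID = 2, locationID = 5,
  establishmentMeans = "introduced",
  degreeOfEstablishment = "invasive",
  source = "w",
  stringsAsFactors = FALSE
)

# Hypothetical helper: distinct non-missing values as one string, else NA
collapse_unique <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (length(vals) == 0) NA_character_ else paste(vals, collapse = "; ")
}

# Stack all tables; columns missing from a table are filled with NA
stacked <- bind_rows(list(db_a, db_b, db_c))

combined <- stacked |>
  mutate(
    .by = c(taxonID, locationID),
    g = establishmentMeans %in% c("introduced", "uncertain") &
      all(c("introduced", "uncertain") %in% establishmentMeans)
  ) |>
  reframe(
    .by = c(taxonID, locationID, g),
    across(everything(), ~ if (first(g)) collapse_unique(.x) else .x)
  ) |>
  select(-g)
```

Because across(everything()) picks up whatever columns exist after stacking, nothing needs to be listed by name when a new database brings an extra column.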