r dataframe

Concatenate values depending on variable value


I want to merge information from 18 different databases that report the status of many species at different locations around the world. However, I only want to merge rows where, for a specific taxonID and locationID, establishmentMeans is reported as both introduced and uncertain. For these rows, I would then like to concatenate the values in the remaining columns of the data frame: degreeOfEstablishment, pathway, and source.

I have tried different solutions, but I end up either missing some values or adding values where they do not exist.

Here is an example of the data:

df1 <- data.frame(
  taxonID = c(1, 1, 1, 1, 2, 2, 2),
  locationID = c(1, 1, 1, 2, 3, 3, 4),
  establishmentMeans = c("introduced", "uncertain", "vagrant", "uncertain", "introduced", "uncertain", "introduced"),
  degreeOfEstablishment = c("established", "reproducing", NA, NA, "invasive", "failing", NA),
  pathway = c("releasedForUse", "otherEscape", NA, NA, "unaided", "unaided", NA),
  source = c("x", "y", "y", "x", "x", "x", "y"),
  stringsAsFactors = FALSE 
)
df2 <- data.frame(
  taxonID = c(1, 1, 2, 2),
  locationID = c(1, 2, 3, 5),
  establishmentMeans = c("native", "native", "native", "native"),
  source = c("z", "z", "z", "z"),
  stringsAsFactors = FALSE
)

My code:

library(dplyr)

# merge data
dat <- merge(df1, df2, by = c("locationID", "taxonID","establishmentMeans"), all = TRUE)

# merge rows where a taxon is reported as introduced and uncertain in the same location
dat2 <- dat |> 
  group_by(locationID, taxonID) |> 
  mutate(
    establishmentMeans = if ("introduced" %in% establishmentMeans & "uncertain" %in% establishmentMeans) {
      "introduced; uncertain"
    } else {
      establishmentMeans
    }
  )

# merge the remaining information corresponding to the status of a species when introduced and uncertain
dat3 <- dat2 |> 
  group_by(locationID, taxonID, establishmentMeans) |> 
  mutate(
    across(
      c(starts_with("degreeOfEstablishment"), starts_with("pathway"), starts_with("source")), 
      ~ paste(unique(na.omit(.)), collapse = "; "),
      .names = "{.col}"
    )
  )|> 
  ungroup()

And this is how the output should look:

out <- data.frame(
  taxonID = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  locationID = c(1, 1, 1, 2, 2, 3, 3, 4, 5),
  establishmentMeans = c("introduced; uncertain", "native", "vagrant", "uncertain", "native", "introduced; uncertain", "native", "introduced", "native"),
  degreeOfEstablishment = c("established; reproducing", NA, NA, NA, NA, "invasive; failing", NA, NA, NA),
  pathway = c("releasedForUse; otherEscape", NA, NA, NA, NA, "unaided", NA, NA, NA),
  source = c("x; y", "z", "y", "x", "z", "x", "z", "y", "z"),
  stringsAsFactors = FALSE
)

The first step, merging the rows for a species reported as both introduced and uncertain, is already causing issues, as some values go missing. Another issue is that I explicitly name the columns to concatenate (degreeOfEstablishment, pathway, and source), when I should really be concatenating the values of all remaining columns for these rows; I am just not sure how to do this. Maybe using mutate(across(everything()))? Some of the original databases have columns that the others do not, so I am not sure whether this would cause issues at some point.

EDIT: fixed expected output, keeping the row with taxonID=1, locationID=2, establishmentMeans="uncertain"


Solution

  • Assuming that df1's row 4 should really be retained in the results, here's a dplyr pipe:

    library(dplyr)
    df1 |>
      mutate(
        .by = c(taxonID, locationID),
        # flag introduced/uncertain rows in groups that contain both values
        g = establishmentMeans %in% c("introduced", "uncertain") &
          all(c("introduced", "uncertain") %in% establishmentMeans)
      ) |>
      reframe(
        .by = c(taxonID, locationID, g),
        # collapse flagged rows into one; leave all other rows unchanged
        across(everything(), ~ if (first(g)) paste(.x, collapse = "; ") else .x)
      ) |>
      select(-g) |>
      bind_rows(df2) |>
      arrange(taxonID, locationID)
    #   taxonID locationID    establishmentMeans    degreeOfEstablishment                     pathway source
    # 1       1          1 introduced; uncertain established; reproducing releasedForUse; otherEscape   x; y
    # 2       1          1               vagrant                     <NA>                        <NA>      y
    # 3       1          1                native                     <NA>                        <NA>      z
    # 4       1          2             uncertain                     <NA>                        <NA>      x
    # 5       1          2                native                     <NA>                        <NA>      z
    # 6       2          3 introduced; uncertain        invasive; failing            unaided; unaided   x; x
    # 7       2          3                native                     <NA>                        <NA>      z
    # 8       2          4            introduced                     <NA>                        <NA>      y
    # 9       2          5                native                     <NA>                        <NA>      z
    

    The arrange() call is only for display here; it is not required for the process.
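
    The pipe above keeps duplicates when pasting, so row 6 shows "unaided; unaided" and "x; x", whereas the expected output in the question has "unaided" and "x". Below is a minimal sketch of a variant that pastes only the unique, non-missing values. collapse_unique() is a hypothetical helper introduced here for illustration (it is not part of dplyr), and the sketch assumes every non-key column is character:

    ```r
    library(dplyr)  # .by and reframe() need dplyr >= 1.1.0

    df1 <- data.frame(
      taxonID = c(1, 1, 1, 1, 2, 2, 2),
      locationID = c(1, 1, 1, 2, 3, 3, 4),
      establishmentMeans = c("introduced", "uncertain", "vagrant", "uncertain",
                             "introduced", "uncertain", "introduced"),
      degreeOfEstablishment = c("established", "reproducing", NA, NA,
                                "invasive", "failing", NA),
      pathway = c("releasedForUse", "otherEscape", NA, NA, "unaided", "unaided", NA),
      source = c("x", "y", "y", "x", "x", "x", "y"),
      stringsAsFactors = FALSE
    )
    df2 <- data.frame(
      taxonID = c(1, 1, 2, 2),
      locationID = c(1, 2, 3, 5),
      establishmentMeans = rep("native", 4),
      source = rep("z", 4),
      stringsAsFactors = FALSE
    )

    # Hypothetical helper: paste the unique non-NA values of a character vector;
    # return NA (not an empty string) when nothing is left to paste.
    collapse_unique <- function(x) {
      vals <- unique(x[!is.na(x)])
      if (length(vals) == 0) NA_character_ else paste(vals, collapse = "; ")
    }

    res <- df1 |>
      mutate(
        .by = c(taxonID, locationID),
        # flag introduced/uncertain rows in groups that contain both values
        g = establishmentMeans %in% c("introduced", "uncertain") &
          all(c("introduced", "uncertain") %in% establishmentMeans)
      ) |>
      reframe(
        .by = c(taxonID, locationID, g),
        # flagged rows collapse to one row per group; other rows pass through
        across(everything(), ~ if (first(g)) collapse_unique(.x) else .x)
      ) |>
      select(-g) |>
      bind_rows(df2) |>
      arrange(taxonID, locationID)
    ```

    With the sample data, the merged row for taxonID 2 at locationID 3 then has pathway "unaided" and source "x". If a merged group had only NA in some column, that column would stay NA rather than becoming the string "NA".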


    Data:

    df1 <- structure(list(taxonID = c(1, 1, 1, 1, 2, 2, 2), locationID = c(1, 1, 1, 2, 3, 3, 4), establishmentMeans = c("introduced", "uncertain", "vagrant", "uncertain", "introduced", "uncertain", "introduced"), degreeOfEstablishment = c("established", "reproducing", NA, NA, "invasive", "failing", NA), pathway = c("releasedForUse", "otherEscape", NA, NA, "unaided", "unaided", NA), source = c("x", "y", "y", "x", "x", "x", "y")), class = "data.frame", row.names = c(NA, -7L))
    df2 <- structure(list(taxonID = c(1, 1, 2, 2), locationID = c(1, 2, 3, 5), establishmentMeans = c("native", "native", "native", "native"), source = c("z", "z", "z", "z")), class = "data.frame", row.names = c(NA, -4L))
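
    On the concern that some of the 18 databases have columns the others lack: bind_rows() accepts a list of data frames and fills any column missing from one of them with NA, so stacking should not fail for that reason. A small sketch follows, where db1, db2, db3 and the occurrenceStatus column are made-up stand-ins rather than the real databases:

    ```r
    library(dplyr)

    # Hypothetical stand-ins for three of the databases; each has a
    # slightly different set of columns.
    db1 <- data.frame(taxonID = 1, locationID = 1,
                      establishmentMeans = "introduced",
                      pathway = "unaided", source = "x",
                      stringsAsFactors = FALSE)
    db2 <- data.frame(taxonID = 1, locationID = 1,
                      establishmentMeans = "native", source = "z",
                      stringsAsFactors = FALSE)
    db3 <- data.frame(taxonID = 2, locationID = 3,
                      establishmentMeans = "uncertain",
                      occurrenceStatus = "present", source = "w",
                      stringsAsFactors = FALSE)

    # Stack everything; columns absent from a database are filled with NA.
    all_dbs <- bind_rows(list(db1, db2, db3))
    ```

    The flagging and collapsing steps could then be run on the stacked table. Columns that share a name but differ in type across databases (for example, taxonID stored as character in one and numeric in another) would need to be harmonised first, since bind_rows() refuses to combine them.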