rdatabasedataset

How do I find the number of instances two values co-occur across a dataset (from most to least)?


I have a relatively complex question that I have been trying to find a solution for hours, with no luck. I will do my best to explain it.

I have a dataset with the following columns:

$ bma_compound_name     : chr  "atrazine" "atrazine" "atrazine" "atrazine" ...
$ date                  : Date, format: "2020-06-24" "2020-05-20" "2020-01-09" ...
$ monitoring_location_id: chr  "NL14_20181" "NL14_20181" "NL14_20932" "NL14_20932" ...

where 'bma_compound_name' can be one of 82 different pesticides, 'date' is the date the pesticide was detected, and 'monitoring_location_id' is the location where the pesticide was detected.

Now, the problem: I first need to find how many times each combination of two pesticides co-occurs with each other (on the same date and location). Then, I want to order each combination of two pesticides from the most frequently co-occurring to the least frequently co-occurring.

For example, if pesticide A and pesticide B are detected on the same dates (throughout the year) 5 times in location 1, 4 times in location 2, and 1 time in location 3, then 5 + 4 + 1 = 10 times total.

For each combination, "glyphosate and AMPA" and "AMPA and glyphosate" for example, should not count as two unique combinations. Further, any combination of two pesticides that never co-occur should be removed.

I hope it all makes sense. I am really unsure how to solve this...any help in the right direction would be hugely appreciated.

Thank you in advance!


Solution

  • This is not a very elegant answer, but maybe it will spark an idea for you and/or others. My thinking is fist to map out all the combinations of chemicals. Then, to join the data to that key and filter out any occurrences where the two chemicals where not detected together. Lastly, recount all of the co-occurrences and arrange the data. Note that this is probably not the most efficient method. Also note, that it is helpful if you post a little data to get things going. I made up some data here as an example.

    library(tidyverse)
    
    #set up some example data
    set.seed(3252)
    dates <- lubridate::date(
      c("2022-01-01", "2022-02-01", "2022-01-04", "2023-01-01", "2023-02-01") 
    )
    locs <- c("my house", "your house", "their house", "kentucky")
      
    df <- tibble(
      bma_compound_name = sample(letters[1:4], 25, replace = TRUE),
      date = sample(dates, 25, replace = TRUE),
      monitoring_location_id = sample(locs, 25, replace = TRUE)
    )
    
    head(df)
    #> # A tibble: 6 x 3
    #>   bma_compound_name date       monitoring_location_id
    #>   <chr>             <date>     <chr>                 
    #> 1 a                 2023-02-01 kentucky              
    #> 2 a                 2022-01-04 your house            
    #> 3 b                 2022-02-01 their house           
    #> 4 c                 2022-01-01 my house              
    #> 5 a                 2022-02-01 their house           
    #> 6 c                 2023-02-01 my house
    
    #answer to question
    combo_key <- unique(df$bma_compound_name) |>
      combn(2) |>
      t() |>
      `colnames<-`(c("Chem_1", "Chem_2"))|>
      as_tibble()
    
    
    full_join(combo_key, df, 
              by = c('Chem_1' = 'bma_compound_name'),
              relationship = "many-to-many") |>
      left_join(df |> mutate(count = 1), 
                by = c('Chem_2' = 'bma_compound_name', 
                           'monitoring_location_id', 
                           'date'),
                relationship = "many-to-many") |>
      filter(count == 1) |>
      summarise(count = sum(count), 
                .by = c(Chem_1, Chem_2, date, monitoring_location_id)) |>
      mutate(total = sum(count), .by = c(Chem_1, Chem_2)) |>
      arrange(desc(total))
    #> # A tibble: 8 x 6
    #>   Chem_1 Chem_2 date       monitoring_location_id count total
    #>   <chr>  <chr>  <date>     <chr>                  <dbl> <dbl>
    #> 1 a      b      2022-02-01 their house                2     3
    #> 2 a      b      2022-01-01 your house                 1     3
    #> 3 a      c      2022-01-04 kentucky                   2     2
    #> 4 b      c      2022-01-01 my house                   1     2
    #> 5 b      c      2023-02-01 their house                1     2
    #> 6 c      d      2023-02-01 their house                1     2
    #> 7 c      d      2023-02-01 your house                 1     2
    #> 8 b      d      2023-02-01 their house                1     1