r tidyverse

Re-write tidyverse function for clarity and brevity


I use ChatGPT for some coding (shame me), and I have this piece of code that I don't understand:

escalc.df <- allmetadata %>%
  group_by(!!sym(testvariable)) %>%
  summarise(group = first(!!sym(testvariable)),
            ai = sum(!!sym(hydratelevel) == "Present" & modelpresence > 0),
            bi = sum(!!sym(hydratelevel) == "Present" & modelpresence == 0),
            ci = sum(!!sym(hydratelevel) == "Absent" & modelpresence > 0),
            di = sum(!!sym(hydratelevel) == "Absent" & modelpresence == 0))

And I don't know where to begin since I don't understand the function fully. Could someone explain this code to me or even share a shortened/more simplified version?

I expect the data frame allmetadata to be grouped by testvariable (a variable set earlier to the name of a specific allmetadata column). Within each group, I then want to count the rows where allmetadata$modelpresence is > 0 and the rows where it is == 0, separately for each value ("Present" or "Absent") of the column named by hydratelevel. The counts should be output into a new data frame escalc.df with four columns: $ai, $bi, $ci, and $di.

For example, testvariable can be $surfacelith and hydratelevel can be $AreaKnownHydrate.

> dput(head(allmetadata))
structure(list(feature.id = c("AB094456", "AB094457", "AB094458", 
"AB094459", "AB094460", "AB094461"), seq = c("cct", "cct", "cct", 
"cct", "cct", "cct"), author = c("Inagaki", "Inagaki", "Inagaki", 
"Inagaki", "Inagaki", "Inagaki"), yearPub = c(2003L, 2003L, 2003L, 
2003L, 2003L, 2003L), yearCollected = c(2001L, 2001L, 2001L, 
2001L, 2001L, 2001L), ocean = c("Pacific", "Pacific", "Pacific", 
"Pacific", "Pacific", "Pacific"), region = c("SeaOkhotsk", "SeaOkhotsk", 
"SeaOkhotsk", "SeaOkhotsk", "SeaOkhotsk", "SeaOkhotsk"), location = c("ShiretokoPeninsula", 
"ShiretokoPeninsula", "ShiretokoPeninsula", "ShiretokoPeninsula", 
"ShiretokoPeninsula", "ShiretokoPeninsula"), waterType = c("marine", 
"marine", "marine", "marine", "marine", "marine"), methaneForm = c("HYD", 
"HYD", "HYD", "HYD", "HYD", "HYD"), waterDepth = c(1225, 1225, 
1225, 1225, 1225, 1225), sedDepth = c("UNK", "UNK", "UNK", "UNK", 
"UNK", "UNK"), latitude = c(44.5275, 44.5275, 44.5275, 44.5275, 
44.5275, 44.5275), longitude = c(145.0041, 145.0041, 145.0041, 
145.0041, 145.0041, 145.0041), sedProfile = c("UNK", "UNK", "UNK", 
"UNK", "UNK", "UNK"), sampleType = c("sediment", "sediment", 
"sediment", "sediment", "sediment", "sediment"), porosity = c(81.5954, 
81.5954, 81.5954, 81.5954, 81.5954, 81.5954), surfaceTOC = c(2.1923, 
2.1923, 2.1923, 2.1923, 2.1923, 2.1923), surfacelith = c("clay", 
"clay", "clay", "clay", "clay", "clay"), locationUSGSdatabase = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), LatUSGSdatabase = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_), LongUSGSdatabase = c(NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_), AreaKnownHydrate = c("Absent", 
"Absent", "Absent", "Absent", "Absent", "Absent"), ExactHydratePresent = c("UNK", 
"UNK", "UNK", "UNK", "UNK", "UNK"), hydInfoSource = c("UNK", 
"UNK", "UNK", "UNK", "UNK", "UNK"), modelpresence = c(0, 0, 0, 
0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")

Solution

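  • In your code, sym() turns the string stored in testvariable (or hydratelevel) into a symbol and !! unquotes it, so dplyr uses the column with that name. Below is a variation that takes unquoted column names and uses the "curly curly" {{ }} syntax instead, explained in more detail at https://dplyr.tidyverse.org/articles/programming.html. Note that the .by argument requires dplyr 1.1.0 or later.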

    myFun <- function(testvariable, hydratelevel) {
      allmetadata %>%
        summarise(group = first( {{testvariable}} ),
                  ai = sum( {{hydratelevel}} == "Present" & modelpresence > 0),
                  bi = sum( {{hydratelevel}} == "Present" & modelpresence == 0),
                  ci = sum( {{hydratelevel}} == "Absent" & modelpresence > 0),
                  di = sum( {{hydratelevel}} == "Absent" & modelpresence == 0),
                  .by = {{testvariable}})
    }
    myFun(surfacelith, AreaKnownHydrate)
    

    Result

      surfacelith group ai bi ci di
    1        clay  clay  0  0  0  6
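
If the column names have to stay as character strings, as in the question, one possible alternative to !!sym() is the .data pronoun. The following is only a minimal sketch under some assumptions: the helper name count_cells and the toy data frame are hypothetical (the values are invented for illustration and are not from allmetadata), the redundant group column is dropped, and .by with all_of() assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data standing in for allmetadata (invented values)
toy <- data.frame(
  surfacelith      = c("clay", "clay", "clay", "sand", "sand"),
  AreaKnownHydrate = c("Present", "Present", "Absent", "Absent", "Absent"),
  modelpresence    = c(2, 0, 0, 1, 0)
)

# Hypothetical helper: testvariable and hydratelevel are column names
# passed as character strings, e.g. "surfacelith"
count_cells <- function(data, testvariable, hydratelevel) {
  data %>%
    summarise(
      # .data[[string]] looks up the column whose name is stored in the string
      ai = sum(.data[[hydratelevel]] == "Present" & modelpresence > 0),
      bi = sum(.data[[hydratelevel]] == "Present" & modelpresence == 0),
      ci = sum(.data[[hydratelevel]] == "Absent"  & modelpresence > 0),
      di = sum(.data[[hydratelevel]] == "Absent"  & modelpresence == 0),
      # all_of() selects the grouping column from a character string
      .by = all_of(testvariable)
    )
}

res <- count_cells(toy, "surfacelith", "AreaKnownHydrate")
print(res)
```

On this toy data the clay group gives ai = 1, bi = 1, ci = 0, di = 1, and the sand group gives ai = 0, bi = 0, ci = 1, di = 1. Whether strings or unquoted names are more convenient probably depends on whether the column names come from user input or are typed directly in the script.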