I use ChatGPT for some coding (shame me) and it gave me this code, which I don't understand:
escalc.df <- allmetadata %>%
  group_by(!!sym(testvariable)) %>%
  summarise(group = first(!!sym(testvariable)),
            ai = sum(!!sym(hydratelevel) == "Present" & modelpresence > 0),
            bi = sum(!!sym(hydratelevel) == "Present" & modelpresence == 0),
            ci = sum(!!sym(hydratelevel) == "Absent" & modelpresence > 0),
            di = sum(!!sym(hydratelevel) == "Absent" & modelpresence == 0))
And I don't know where to begin since I don't understand the function fully. Could someone explain this code to me or even share a shortened/more simplified version?
I am expecting the df allmetadata to be grouped by testvariable (set earlier to match a specific allmetadata column). Then, within each group, I want to count the rows where allmetadata$modelpresence is > 0 and the rows where it is == 0, separately for each value ("Present" or "Absent") of the column named by hydratelevel. The values should be output into a new df escalc.df with four columns: $ai, $bi, $ci, and $di.
For example, testvariable can be $surfacelith and hydratelevel can be $AreaKnownHydrate.
> dput(head(allmetadata))
structure(list(feature.id = c("AB094456", "AB094457", "AB094458",
"AB094459", "AB094460", "AB094461"), seq = c("cct", "cct", "cct",
"cct", "cct", "cct"), author = c("Inagaki", "Inagaki", "Inagaki",
"Inagaki", "Inagaki", "Inagaki"), yearPub = c(2003L, 2003L, 2003L,
2003L, 2003L, 2003L), yearCollected = c(2001L, 2001L, 2001L,
2001L, 2001L, 2001L), ocean = c("Pacific", "Pacific", "Pacific",
"Pacific", "Pacific", "Pacific"), region = c("SeaOkhotsk", "SeaOkhotsk",
"SeaOkhotsk", "SeaOkhotsk", "SeaOkhotsk", "SeaOkhotsk"), location = c("ShiretokoPeninsula",
"ShiretokoPeninsula", "ShiretokoPeninsula", "ShiretokoPeninsula",
"ShiretokoPeninsula", "ShiretokoPeninsula"), waterType = c("marine",
"marine", "marine", "marine", "marine", "marine"), methaneForm = c("HYD",
"HYD", "HYD", "HYD", "HYD", "HYD"), waterDepth = c(1225, 1225,
1225, 1225, 1225, 1225), sedDepth = c("UNK", "UNK", "UNK", "UNK",
"UNK", "UNK"), latitude = c(44.5275, 44.5275, 44.5275, 44.5275,
44.5275, 44.5275), longitude = c(145.0041, 145.0041, 145.0041,
145.0041, 145.0041, 145.0041), sedProfile = c("UNK", "UNK", "UNK",
"UNK", "UNK", "UNK"), sampleType = c("sediment", "sediment",
"sediment", "sediment", "sediment", "sediment"), porosity = c(81.5954,
81.5954, 81.5954, 81.5954, 81.5954, 81.5954), surfaceTOC = c(2.1923,
2.1923, 2.1923, 2.1923, 2.1923, 2.1923), surfacelith = c("clay",
"clay", "clay", "clay", "clay", "clay"), locationUSGSdatabase = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), LatUSGSdatabase = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), LongUSGSdatabase = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), AreaKnownHydrate = c("Absent",
"Absent", "Absent", "Absent", "Absent", "Absent"), ExactHydratePresent = c("UNK",
"UNK", "UNK", "UNK", "UNK", "UNK"), hydInfoSource = c("UNK",
"UNK", "UNK", "UNK", "UNK", "UNK"), modelpresence = c(0, 0, 0,
0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")
A variation using unquoted column names and the "curly curly" {{ }} syntax, explained more at https://dplyr.tidyverse.org/articles/programming.html.
myFun <- function(testvariable, hydratelevel) {
  allmetadata %>%
    summarise(group = first( {{testvariable}} ),
              ai = sum( {{hydratelevel}} == "Present" & modelpresence > 0),
              bi = sum( {{hydratelevel}} == "Present" & modelpresence == 0),
              ci = sum( {{hydratelevel}} == "Absent" & modelpresence > 0),
              di = sum( {{hydratelevel}} == "Absent" & modelpresence == 0),
              .by = {{testvariable}})
}
myFun(surfacelith, AreaKnownHydrate)
Result
  surfacelith group ai bi ci di
1        clay  clay  0  0  0  6
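If testvariable and hydratelevel are stored as character strings, as in the question, a string-based variant may be closer to the original code. The following is only a sketch under some assumptions: it needs dplyr 1.1.0 or later for .by, the function name count_cells and the toy data frame are hypothetical, and the toy values are made up for illustration rather than taken from allmetadata. It looks up each column by name with the .data pronoun instead of !!sym(), and drops the redundant group column.

```r
library(dplyr)

# Hypothetical toy stand-in for allmetadata (made-up values)
toy <- data.frame(
  surfacelith      = c("clay", "clay", "sand", "sand", "sand"),
  AreaKnownHydrate = c("Present", "Absent", "Present", "Present", "Absent"),
  modelpresence    = c(2, 0, 0, 3, 1)
)

# count_cells is a hypothetical helper name.
# testvariable and hydratelevel are column names passed as strings.
count_cells <- function(data, testvariable, hydratelevel) {
  data %>%
    summarise(
      # sum() over a logical vector counts the rows where the condition is TRUE
      ai = sum(.data[[hydratelevel]] == "Present" & modelpresence > 0),
      bi = sum(.data[[hydratelevel]] == "Present" & modelpresence == 0),
      ci = sum(.data[[hydratelevel]] == "Absent"  & modelpresence > 0),
      di = sum(.data[[hydratelevel]] == "Absent"  & modelpresence == 0),
      # all_of() selects the grouping column from a character vector
      .by = all_of(testvariable)
    )
}

escalc.df <- count_cells(toy, "surfacelith", "AreaKnownHydrate")
print(escalc.df)
```

Each of ai, bi, ci, and di is one cell of a 2x2 table (hydrate Present/Absent by modelpresence > 0 or == 0), computed once per level of the grouping column. Note that rows with NA in either column would make the sums NA unless na.rm = TRUE is added to sum().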