rreshapecalculated-columns

calculate annual cases after reshaping dataframe in R


I am working in R and I want to calculate the last column Annual_Cases based on the available years for each surgeon. The variable Years was reported per each 3-year time interval. Some surgeons were working in more than 1 hospital.

Here is my code that needs edits and my data

library(dplyr);library(tidyr)
df %>% drop_na(Cases) %>% group_by(Surgeon)%>% summarise(Annaul_Cases= sum(Cases))/n,  na.rm=T)-> df # I need to calculate n column (number of years the surgeon were in service)


structure(list(Serial.ID = c(215, 522, 903, 1210, 1591, 1898, 
2279, 2586, 2967, 3274, 3655, 3962, 4343, 4650, 5031, 5338, 330, 
1018, 1706, 2394, 3082, 3770, 4458, 5146), Surgeon = c("A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "B", "B", "B", "B", "B", "B", "B", "B"), Center = c("Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Mount Sinai Hospital", 
"Mount Sinai Hospital", "Mount Sinai Hospital", "Mount Sinai Hospital", 
"Mount Sinai Hospital", "Mount Sinai Hospital", "Mount Sinai Hospital", 
"Mount Sinai Hospital"), Years = c("1996-1998", "1996-1998", 
"1999-2001", "1999-2001", "2002-2004", "2002-2004", "2005-2007", 
"2005-2007", "2008-2010", "2008-2010", "2011-2013", "2011-2013", 
"2014-2016", "2014-2016", "2017-2019", "2017-2019", "1996-1998", 
"1999-2001", "2002-2004", "2005-2007", "2008-2010", "2011-2013", 
"2014-2016", "2017-2019"), Cases = c(377, 19, 223, NA, 27, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 110, 18, 13, 
17, 4, 1), Deaths = c("10", "1", "6", "NA", "0", "NA", "NA", 
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", 
"1", "0", "0", "0", "0", "0"), Annual_Cases = c("NA", "NA", "NA", 
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", 
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA")), row.names = c(NA, 
-24L), class = c("tbl_df", "tbl", "data.frame"))

Here is my desired output and my used manual formula for that.
enter image description here

Any advice will be greatly appreciated


Solution

  • First summarize to get sums for each year span, then summarize again to aggregate over all year spans.

    library(dplyr)
    library(tidyr)
    
    df %>% 
      drop_na(Cases) %>% 
      summarize(Cases = sum(Cases), .by = c(Surgeon, Years)) %>%
      summarize(Annual_Cases = sum(Cases) / (n() * 3), .by = Surgeon)
    

    Result:

    # A tibble: 2 × 2
      Surgeon Annual_Cases
      <chr>          <dbl>
    1 A              71.8 
    2 B               9.06
    

    This solution assumes each year span is always 3 years. If not, you’ll need to use an approach like @JonSpring’s using separate().