rdplyrdata-wrangling

Add values across dataframe columns


I have a dataframe where missingness in indicated by "Z" (there may also be some "z" and NA entries present in the data), and values are entered as characters ("0", "1", etc). I need to create scores ("updrs1", "updrs2", updrs3") that add up the non-missing values across columns selected by colname prefix ("NP1", "NP2", "NP3").

dummy data:

dummy_df <- data.frame(
  subject_id = seq(1,6,1), 
  OTHERV1 = c(1,1,0,0,1,1),
  NP1VAR1 = c("Z","0","Z","Z","Z","Z"),
  NP1VAR2 = c("Z","0","Z","Z","Z","Z"),
  NP1VAR3 = c("Z","3","Z","Z","Z","Z"),
  NP2VAR1 = c("Z","2","Z","Z","Z","Z"), 
  NP2VAR2 = c("Z","0","Z","Z","Z","Z"),
  NP2VAR3 = c("Z","0","Z","Z","Z","Z"),
  NP3VAR1 = c("Z","4","Z","Z","Z","Z"),
  NP3VAR2 = c("Z","0","Z","Z","z","Z"),
  NP3VAR3 = c("Z","0","Z","Z","Z",NA),
  OTHERV2 = c(NA,NA,NA,NA,NA,NA)
)

desired output:

subject_id updrs1 updrs2 updrs3
1 1 Z Z Z
2 2 3 2 4
3 3 Z Z Z
4 4 Z Z Z
5 5 Z Z Z
6 6 Z Z Z

NOTE: all values in the output are characters

NOTE: treating NA/Z/z as 0 (i.e., transforming all values with as.numeric()) is problematic.

I've tried variations on this answer, with no luck.

I have used a combination of tidyr::pivot_longer(), dplyr::group_by(), and summarize():

desired_output <- select(dummy_df, c(subject_id, starts_with("NP"))) %>%
  mutate(across(all_of(everything()), ~ifelse(. %in% c("Z", "z"), NA, .))) %>%
  pivot_longer(cols = starts_with("NP"), names_to = c(".value", "np_var"), names_sep = "VAR") %>%
  group_by(subject_id) %>%
  summarize(updrs1 = sum(as.numeric(NP1), na.rm = FALSE),
            updrs2 = sum(as.numeric(NP2), na.rm = FALSE), 
            updrs3 = sum(as.numeric(NP3), na.rm = FALSE), .groups = "drop") %>%
  mutate(across(all_of(everything()), as.character)) %>%
  replace(is.na(.), "Z")

This works. L Tyrone's answer also works, so I've accepted it. Thanks all.


Solution

  • If you don't care whether:

    then this works. There is likely to be a more direct way to do this, but I find a step-wise approach easier to follow. Note that it will return a warning, which you can ignore:

    library(dplyr)
    library(tidyr)
    
    dummy_df |>
      pivot_longer(
        cols = starts_with("NP"),
        names_to = c("grp", "var"),
        names_pattern = "^(NP\\d+)(VAR\\d+)$",
        values_to = "value") |>
      mutate(value = if_else(grepl("^[0-9]+$", value), as.integer(value), NA)) |>
      summarise(value = sum(value), .by = c(subject_id, grp)) |>
      pivot_wider(id_cols = subject_id,
                  names_from = grp,
                  values_from = value) |>
      rename(updrs1 = NP1, updrs2 = NP2, updrs3 = NP3) |>
      mutate(across(where(is.numeric), as.character),
             across(everything(), ~replace_na(.x, "Z")))
        
    # # A tibble: 6 × 4
    #   subject_id updrs1 updrs2 updrs3
    #   <chr>      <chr>  <chr>  <chr> 
    # 1 1          Z      Z      Z     
    # 2 2          3      2      4     
    # 3 3          Z      Z      Z     
    # 4 4          Z      Z      Z     
    # 5 5          Z      Z      Z     
    # 6 6          Z      Z      Z
    

    Because you want to combine (sum) multiple instances of the NP groups, you need some way of creating common values to group by. However, the NP groups have different suffixes. names_pattern = is useful in your case as it allows you break up the column names into two (or more) columns.

    The regular expression (regex) ^(NP\\d+)(VAR\\d+)$" works by using brackets () to define the boundaries of the patterns to 'capture' from the column names: