Add values across dataframe columns

I have a dataframe where missingness in indicated by "Z" (there may also be some "z" and NA entries present in the data), and values are entered as characters ("0", "1", etc). I need to create scores ("updrs1", "updrs2", updrs3") that add up the non-missing values across columns selected by colname prefix ("NP1", "NP2", "NP3").

dummy data:

dummy_df <- data.frame(
  subject_id = seq(1,6,1), 
  OTHERV1 = c(1,1,0,0,1,1),
  NP1VAR1 = c("Z","0","Z","Z","Z","Z"),
  NP1VAR2 = c("Z","0","Z","Z","Z","Z"),
  NP1VAR3 = c("Z","3","Z","Z","Z","Z"),
  NP2VAR1 = c("Z","2","Z","Z","Z","Z"), 
  NP2VAR2 = c("Z","0","Z","Z","Z","Z"),
  NP2VAR3 = c("Z","0","Z","Z","Z","Z"),
  NP3VAR1 = c("Z","4","Z","Z","Z","Z"),
  NP3VAR2 = c("Z","0","Z","Z","z","Z"),
  NP3VAR3 = c("Z","0","Z","Z","Z",NA),
  OTHERV2 = c(NA,NA,NA,NA,NA,NA)
)

desired output:

	subject_id	updrs1	updrs2	updrs3
1	1	Z	Z	Z
2	2	3	2	4
3	3	Z	Z	Z
4	4	Z	Z	Z
5	5	Z	Z	Z
6	6	Z	Z	Z

NOTE: all values in the output are characters

NOTE: treating NA/Z/z as 0 (i.e., transforming all values with as.numeric()) is problematic.

I've tried variations on this answer, with no luck.

I have used a combination of tidyr::pivot_longer(), dplyr::group_by(), and summarize():

desired_output <- select(dummy_df, c(subject_id, starts_with("NP"))) %>%
  mutate(across(all_of(everything()), ~ifelse(. %in% c("Z", "z"), NA, .))) %>%
  pivot_longer(cols = starts_with("NP"), names_to = c(".value", "np_var"), names_sep = "VAR") %>%
  group_by(subject_id) %>%
  summarize(updrs1 = sum(as.numeric(NP1), na.rm = FALSE),
            updrs2 = sum(as.numeric(NP2), na.rm = FALSE), 
            updrs3 = sum(as.numeric(NP3), na.rm = FALSE), .groups = "drop") %>%
  mutate(across(all_of(everything()), as.character)) %>%
  replace(is.na(.), "Z")

This works. L Tyrone's answer also works, so I've accepted it. Thanks all.

Solution

If you don't care whether:

the missingness values are upper or lower case
Z/z are the only missingness values
existing NA values can be represented as Z

then this works. There is likely to be a more direct way to do this, but I find a step-wise approach easier to follow. Note that it will return a warning, which you can ignore:

library(dplyr)
library(tidyr)

dummy_df |>
  pivot_longer(
    cols = starts_with("NP"),
    names_to = c("grp", "var"),
    names_pattern = "^(NP\\d+)(VAR\\d+)$",
    values_to = "value") |>
  mutate(value = if_else(grepl("^[0-9]+$", value), as.integer(value), NA)) |>
  summarise(value = sum(value), .by = c(subject_id, grp)) |>
  pivot_wider(id_cols = subject_id,
              names_from = grp,
              values_from = value) |>
  rename(updrs1 = NP1, updrs2 = NP2, updrs3 = NP3) |>
  mutate(across(where(is.numeric), as.character),
         across(everything(), ~replace_na(.x, "Z")))
    
# # A tibble: 6 × 4
#   subject_id updrs1 updrs2 updrs3
#   <chr>      <chr>  <chr>  <chr> 
# 1 1          Z      Z      Z     
# 2 2          3      2      4     
# 3 3          Z      Z      Z     
# 4 4          Z      Z      Z     
# 5 5          Z      Z      Z     
# 6 6          Z      Z      Z

Because you want to combine (sum) multiple instances of the NP groups, you need some way of creating common values to group by. However, the NP groups have different suffixes. names_pattern = is useful in your case as it allows you break up the column names into two (or more) columns.

The regular expression (regex) ^(NP\\d+)(VAR\\d+)$" works by using brackets () to define the boundaries of the patterns to 'capture' from the column names:

^ the location to start looking from
(NP\\d+) searches for literal "NP" followed by one or more digits d+. The \\ is needed so R knows d is not literal e.g. treat d as a special character. The () will capture characters until the first non-digit is found
(VAR\\d+) searches for literal "VAR" followed by one or more digits until the end $ of the string. Given the use of (), technically $ is not needed in this example