rpivotreshape

Sum up replicate columns in dataframe, keep rows unique


I thought this would be a relatively easy task to do but I couldn't find examples here that weren't focused on summing up rows based on column conditions. What I'm trying to achieve is to sum up column replicates, but keeping rows unique.

Here's what I mean:

                                 MKC100.1 MKC100.2 MKC100.3 MKC103.1 MKC103.2 MKC103.3 MKC104.2 MKC104.3
299fc0ac11fb4afd0da849a2c45583b3        0        0        0        0        0        0        0        1
9bc2bacdfadf4c1352ffbc991803287c     1183     1666     1318        0        0        0       10       20
38b782d9f01c69c3570fe0edd5864dc0      493      626      543       10        0        0        5        5
6d078397349f7d39c34d237a6ef4cb75    43735    51511    46876        0        0        0        1        0
c22e752b441ee4190f27a3690c5d1206        0        0        0     2795     1128     1956        1        1
f6513affb198fb9845741b61ece8db4b       59       58       82        0        0        0        0        0



structure(list(MKC100.1 = c(0L, 1183L, 493L, 43735L, 0L, 59L), 
    MKC100.2 = c(0L, 1666L, 626L, 51511L, 0L, 58L), MKC100.3 = c(0L, 
    1318L, 543L, 46876L, 0L, 82L), MKC103.1 = c(0L, 0L, 10L, 
    0L, 2795L, 0L), MKC103.2 = c(0L, 0L, 0L, 0L, 1128L, 0L), 
    MKC103.3 = c(0L, 0L, 0L, 0L, 1956L, 0L), MKC104.2 = c(0L, 
    10L, 5L, 1L, 1L, 0L), MKC104.3 = c(1L, 20L, 5L, 0L, 1L, 0L
    )), class = "data.frame", row.names = c("299fc0ac11fb4afd0da849a2c45583b3", 
"9bc2bacdfadf4c1352ffbc991803287c", "38b782d9f01c69c3570fe0edd5864dc0", 
"6d078397349f7d39c34d237a6ef4cb75", "c22e752b441ee4190f27a3690c5d1206", 
"f6513affb198fb9845741b61ece8db4b"))

The replicates typically come in three but will have an ".1" or ".2" or ".3" appended to the end of it. I need to essentially merge these all up so that I end up with a dataframe like below:

                                MKC100 MKC103   MKC104
299fc0ac11fb4afd0da849a2c45583b3     0        0       1
9bc2bacdfadf4c1352ffbc991803287c     4167     0       30
38b782d9f01c69c3570fe0edd5864dc0     1662     10      10
6d078397349f7d39c34d237a6ef4cb75     142122   0       1 
c22e752b441ee4190f27a3690c5d1206     0        5879    2   
f6513affb198fb9845741b61ece8db4b     199      0       0

EDIT Sometimes replicates are removed and I won't have all three replicates per sample in the dataframe. Any replicates should still be merged together. I've updated the sample data to reflect this scenario.


Solution

  • base R

    out <- lapply(split(names(data), sub("\\..*", "", names(data))),
                  function(nms) rowSums(data[,nms])) |>
      data.frame()
    out
    #                                  MKC100 MKC103
    # 299fc0ac11fb4afd0da849a2c45583b3      0      0
    # 9bc2bacdfadf4c1352ffbc991803287c   4167      0
    # 38b782d9f01c69c3570fe0edd5864dc0   1662     10
    # 6d078397349f7d39c34d237a6ef4cb75 142122      0
    # c22e752b441ee4190f27a3690c5d1206      0   5879
    # f6513affb198fb9845741b61ece8db4b    199      0
    

    dplyr+tidyr

    A slightly different approach, we'll pivot/summarize/unpivot:

    library(dplyr)
    library(tidyr)
    data |>
      tibble::rownames_to_column() |>
      pivot_longer(cols = -rowname) |>
      mutate(name = sub("\\..*", "", name)) |>
      summarize(value = sum(value), .by = c(rowname, name)) |>
      pivot_wider(id_cols = rowname)
    # # A tibble: 6 × 3
    #   rowname                          MKC100 MKC103
    #   <chr>                             <int>  <int>
    # 1 299fc0ac11fb4afd0da849a2c45583b3      0      0
    # 2 9bc2bacdfadf4c1352ffbc991803287c   4167      0
    # 3 38b782d9f01c69c3570fe0edd5864dc0   1662     10
    # 4 6d078397349f7d39c34d237a6ef4cb75 142122      0
    # 5 c22e752b441ee4190f27a3690c5d1206      0   5879
    # 6 f6513affb198fb9845741b61ece8db4b    199      0
    

    You can add |> tibble::column_to_rownames() if you need them back instead of a column.

    purrr

    library(purrr)
    out <- split(names(data), sub("\\..*", "", names(data))) |>
      map_dfc(~ rowSums(data[,.x])) |>
      as.data.frame()
    rownames(out) <- rownames(data)
    out
    #                                  MKC100 MKC103
    # 299fc0ac11fb4afd0da849a2c45583b3      0      0
    # 9bc2bacdfadf4c1352ffbc991803287c   4167      0
    # 38b782d9f01c69c3570fe0edd5864dc0   1662     10
    # 6d078397349f7d39c34d237a6ef4cb75 142122      0
    # c22e752b441ee4190f27a3690c5d1206      0   5879
    # f6513affb198fb9845741b61ece8db4b    199      0
    

    The use of as.data.frame is to change it from being a tibble, since tidyverse and tibbles specifically are biased against row names.

    You may want to consider either discarding the row names entirely, or bringing them into the frame as a column (as I did with rownames_to-column above, easily done in base R with data$rowname <- rownames(data)).