r, missing-data

How to handle complementary pairs of rows and fill missing values based on a reference column?


I have a genetic dataset, shown below, that contains replicate rows (rows with the same genomic position in the column pos). I want to group the data by pos and fill the missing cells (coded as "NN") within each group using the corresponding non-missing cells. If the alleles column differs between two duplicate rows, the filled value must also be converted to match the alleles of the row being filled. The conversion rule is based on DNA base complementarity:

TT ↔ AA
GG ↔ CC
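
As a minimal sketch of this rule in R (the helper name `complement_genotype` is made up for illustration), base R's `chartr()` can swap each base for its complement while leaving "NN" untouched:

```r
# Hypothetical helper: complement each base of a genotype string.
# chartr() translates character by character: A->T, C->G, G->C, T->A.
complement_genotype <- function(x) {
  chartr("ACGT", "TGCA", x)
}

complement_genotype(c("TT", "GG", "GA", "NN"))
#> [1] "AA" "CC" "CT" "NN"
```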

Dataset containing missing cells:

df <- data.frame(
  chr = c(1, 1, 2, 2),
  alleles = c("G/A", "C/T", "T/C", "T/C"),
  pos = c(13276, 13276, 56329, 56329),
  B_005 = c("GG", "CC", "TT", "NN"),
  B_087 = c("AA", "TT", "TT", "TT"),
  B_013 = c("GA", "NN", "TT", "TT"),
  B_140 = c("NN", "CC", "NN", "CC")
)

Desired output after filling:

df_filled <- data.frame(
  chr = c(1, 1, 2, 2),
  alleles = c("G/A", "C/T", "T/C", "T/C"),
  pos = c(13276, 13276, 56329, 56329),
  B_005 = c("GG", "CC", "TT", "TT"),
  B_087 = c("AA", "TT", "TT", "TT"),
  B_013 = c("GA", "CT", "TT", "TT"),
  B_140 = c("GG", "CC", "CC", "CC")
)

Solution

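  • This approach gives the desired output, but it is a little kludgy. It fills in two passes: first within rows that share the same alleles, then across rows with differing alleles using the complemented genotypes. Curious to see more elegant approaches.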

    library(tidyverse)
    # replace "NN" with NA and add a row id to match rows on later
    df2 <- df |>
      mutate(across(-(1:3), ~na_if(.x, "NN"))) |>
      mutate(row = row_number(), .before = 1)
    
    # table of complemented genotypes (A <-> T, C <-> G), filled within each position
    df_corresp <- df2 |>
      mutate(across(-(1:4), ~str_replace_all(    # great suggestion from @I_O
        .x, '[ACGT]', \(x) c(T='A', A='T', G='C', C='G')[x]))) |>
      group_by(chr, pos) |>
      fill(everything(), .direction = "downup") |>
      ungroup()
    
    df2 |>
      group_by(chr, pos, alleles) |>  # 1) fill matching alleles with same values
      fill(everything(), .direction = "downup") |>
      ungroup() |>                    # 2) patch remaining NAs with complemented values
      rows_patch(df_corresp, by = "row")
    

    Result

    # A tibble: 4 × 8
        row   chr alleles   pos B_005 B_087 B_013 B_140
      <int> <dbl> <chr>   <dbl> <chr> <chr> <chr> <chr>
    1     1     1 G/A     13276 GG    AA    GA    GG   
    2     2     1 C/T     13276 CC    TT    CT    CC   
    3     3     2 T/C     56329 TT    TT    TT    CC   
    4     4     2 T/C     56329 TT    TT    TT    CC
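
Step 2 relies on `rows_patch()` overwriting only the cells that are `NA` in the target, which is why genotypes already filled in step 1 are left alone. A toy sketch of that behavior (the tibbles `x` and `y` and their values are made up for illustration):

```r
library(dplyr)

# Made-up data: `x` has one missing value, `y` offers replacements.
x <- tibble(id = 1:3, val = c("a", NA, "c"))
y <- tibble(id = 2:3, val = c("B", "C"))

# Only the NA in row 2 is patched; the existing "c" in row 3 is kept.
patched <- rows_patch(x, y, by = "id")
patched$val
#> [1] "a" "B" "c"
```

Passing `by` explicitly avoids the message dplyr prints when it has to guess the key column.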