I have a genetic dataset, shown below, that contains replicates (rows with the same genomic position) in the column pos. I want to group the data by pos and fill the missing cells ("NN") within each group using the corresponding non-missing cells. If the alleles differ between two duplicate rows, the filled value must also be converted so that it matches the alleles of the row being filled. The rule for this is based on the DNA complementary sequence: A pairs with T, and G pairs with C.
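The complement rule can be sketched in base R with chartr(), assuming the standard pairing A<->T and G<->C (the same mapping used in the answer's code further down). The helper name complement is hypothetical, chosen only for illustration:

```r
# Hypothetical helper: swap every base for its complement (A<->T, G<->C).
# chartr() is vectorised, and characters outside "ACGT" (e.g. "N", "/")
# are left untouched.
complement <- function(genotype) chartr("ACGT", "TGCA", genotype)

complement(c("GA", "CC", "TT"))
#> [1] "CT" "GG" "AA"
```

So a "GA" call recorded against alleles G/A corresponds to "CT" in a replicate row recorded against alleles C/T.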
Dataset containing missing cells:
df <- data.frame(
  chr = c(1, 1, 2, 2),
  alleles = c("G/A", "C/T", "T/C", "T/C"),
  pos = c(13276, 13276, 56329, 56329),
  B_005 = c("GG", "CC", "TT", "NN"),
  B_087 = c("AA", "TT", "TT", "TT"),
  B_013 = c("GA", "NN", "TT", "TT"),
  B_140 = c("NN", "CC", "NN", "CC")
)
Desired output after filling:
df_filled <- data.frame(
  chr = c(1, 1, 2, 2),
  alleles = c("G/A", "C/T", "T/C", "T/C"),
  pos = c(13276, 13276, 56329, 56329),
  B_005 = c("GG", "CC", "TT", "TT"),
  B_087 = c("AA", "TT", "TT", "TT"),
  B_013 = c("GA", "CT", "TT", "TT"),
  B_140 = c("GG", "CC", "CC", "CC")
)
This approach gives the desired output, but is a little kludgey. Curious to see more elegant approaches.
library(dplyr)
library(tidyr)
library(stringr)

# replace "NN" with <NA> and add a row id to use as the patch key
df2 <- df |>
  mutate(across(-(1:3), ~if_else(.x == "NN", NA_character_, .x))) |>
  mutate(row = row_number(), .before = 1)

# table with corresponding (complemented) values for everything
df_corresp <- df2 |>
  mutate(across(-(1:4), ~str_replace_all( # great suggestion from @I_O
    .x, '[ACGT]', \(x) c(T='A', A='T', G='C', C='G')[x]))) |>
  group_by(chr, pos) |>
  fill(everything(), .direction = "downup") |>
  ungroup()

df2 |>
  group_by(chr, pos, alleles) |> # 1) fill matching alleles with same values
  fill(everything(), .direction = "downup") |>
  group_by(chr, pos) |> # 2) fill non-matching alleles with corresponding values
  rows_patch(df_corresp) |>
  ungroup()
# A tibble: 4 × 8
    row   chr alleles   pos B_005 B_087 B_013 B_140
  <int> <dbl> <chr>   <dbl> <chr> <chr> <chr> <chr>
1     1     1 G/A     13276 GG    AA    GA    GG
2     2     1 C/T     13276 CC    TT    CT    CC
3     3     2 T/C     56329 TT    TT    TT    CC
4     4     2 T/C     56329 TT    TT    TT    CC
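For comparison, here is a base-R sketch of the same fill logic written as an explicit loop. It is not the approach above, just one possible alternative. The function and helper names (fill_from_replicates, complement) are hypothetical, and it assumes that replicate rows with differing alleles differ only by strand, so the complement is always the right conversion:

```r
df <- data.frame(
  chr = c(1, 1, 2, 2),
  alleles = c("G/A", "C/T", "T/C", "T/C"),
  pos = c(13276, 13276, 56329, 56329),
  B_005 = c("GG", "CC", "TT", "NN"),
  B_087 = c("AA", "TT", "TT", "TT"),
  B_013 = c("GA", "NN", "TT", "TT"),
  B_140 = c("NN", "CC", "NN", "CC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: A<->T, G<->C
complement <- function(x) chartr("ACGT", "TGCA", x)

# Hypothetical function: fill "NN" cells from a replicate row at the same
# chr/pos, complementing the donor value when the alleles differ.
fill_from_replicates <- function(df, id_cols = c("chr", "alleles", "pos")) {
  sample_cols <- setdiff(names(df), id_cols)
  group <- paste(df$chr, df$pos)          # one key per genomic position
  for (col in sample_cols) {
    for (i in which(df[[col]] == "NN")) {
      # donors: rows at the same position with a called genotype
      donors <- which(group == group[i] & df[[col]] != "NN")
      if (length(donors) == 0) next       # nothing to fill from
      k <- donors[1]
      value <- df[[col]][k]
      same_alleles <- df$alleles[k] == df$alleles[i]
      df[[col]][i] <- if (same_alleles) value else complement(value)
    }
  }
  df
}

result <- fill_from_replicates(df)
```

On the example data, result has the same sample columns as df_filled. The loop is slower than the grouped fill() on large data, but the rule is stated in one place and no helper row column is needed.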