I have two files.
File A with intervals (regions on the genome)
chr startpos endpos nMajor nMinor
1 1 762273 120612006 1 0
2 1 144854594 187610698 2 1
3 1 193051685 249120684 1 1
4 2 45895 242836535 1 1
5 3 361508 197566254 1 1
6 4 86022 190862182 1 1
File B with positions (of mutations)
mutation_id chr start end ref_counts var_counts
1 1_3649563_G/T 1 3649563 3649564 551 159
2 1_6196895_G/C 1 6196895 6196895 85 30
3 1_10678395_C/T 1 10678395 10678395 274 60
4 1_11090913_G/C 1 11090913 11090914 70 41
5 1_11772423_G/A 1 11772423 11772423 146 55
6 1_12316528_C/G 1 12316528 12316528 110 88
Now I want to combine the two files to File C that adds the info of nMajor and nMinor from FileA to FileB if the position falls into the according interval.
So I will need to first check if the chromosome is the same and then if the start and end position in FileB are within the interval of FileA.
My output should be: File C
mutation_id chr start end ref_counts var_counts nMajor nMinor
1 1_3649563_G/T 1 3649563 3649563 551 159 1 0
2 1_6196895_G/C 1 6196895 6196895 85 30 1 0
3 1_10678395_C/T 1 10678395 10678395 274 60 1 0
4 1_11090913_G/C 1 11090913 11090913 70 41 1 0
5 1_11772423_G/A 1 11772423 11772423 146 55 1 0
6 1_12316528_C/G 1 12316528 12316528 110 88 1 0
For lines of FileB that cannot be found in an interval of FileB, I want to print "X" as a placeholder.
You could use a fuzzyjoin
for this task:
library(dplyr)
library(fuzzyjoin)
file_b %>%
fuzzy_left_join(file_a,
by = c("chr" = "chr",
"start" = "startpos",
"end" = "endpos",
"start" = "endpos",
"end" = "startpos"),
match_fun = list(`==`, `>`, `<`, `<`, `>`)) %>%
select(-startpos, -endpos, -chr.y) %>%
rename(chr = chr.x)
I didn't create an X
for a non-match, because this breaks the class of columns nMajor
, nMinor
and converts them into a string/character. I don't think it's a good idea here and NA
values can easily be handled.
This returns
# A tibble: 7 x 8
mutation_id chr start end ref_counts var_counts nMajor nMinor
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1_3649563_G/T 1 3649563 3649564 551 159 1 0
2 1_6196895_G/C 1 6196895 6196895 85 30 1 0
3 1_10678395_C/T 1 10678395 10678395 274 60 1 0
4 1_11090913_G/C 1 11090913 11090914 70 41 1 0
5 1_11772423_G/A 1 11772423 11772423 146 55 1 0
6 1_12316528_C/G 1 12316528 12316528 110 88 1 0
7 ABC 2 123 456 2 3 NA NA
file_a <- structure(list(chr = c(1, 1, 1, 2, 3, 4), startpos = c(762273,
144854594, 193051685, 45895, 361508, 86022), endpos = c(120612006,
187610698, 249120684, 242836535, 197566254, 190862182), nMajor = c(1,
2, 1, 1, 1, 1), nMinor = c(0, 1, 1, 1, 1, 1)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
In file_b
I added a fake dataset to provoke a non-match:
file_b <- structure(list(mutation_id = c("1_3649563_G/T", "1_6196895_G/C",
"1_10678395_C/T", "1_11090913_G/C", "1_11772423_G/A", "1_12316528_C/G",
"ABC"), chr = c(1, 1, 1, 1, 1, 1, 2), start = c(3649563, 6196895,
10678395, 11090913, 11772423, 12316528, 123), end = c(3649564,
6196895, 10678395, 11090914, 11772423, 12316528, 456), ref_counts = c(551,
85, 274, 70, 146, 110, 2), var_counts = c(159, 30, 60, 41, 55,
88, 3)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"
))