rconditional-statementsbioinformaticsgwas

Examining position intersect and conditionally drop row based on P-values


I'm working my way through post-analysis of GWAS, where I have defined a list of independent SNPs associated with my trait. I'm trying to group my indep. SNPs in loci of +- 500KB. There are some independent SNP's which locus will overlap each other and I'm trying to write a code that will drop the SNP with the higher P-value, if the two loci overlap.

   SNP       bp            p
1: rs1  8537289  1.33785e-08
2: rs2  12263919 5.03939e-11
3: rs3  19771438 1.83669e-20
4: rs4  19845279 3.60144e-18

Afterward adding two columns with position +- 500kb, where it becomes evident, that row 3 and 4 have overlapping genomic positions.

   SNP       bp           p    start     stop
1: rs1  8537289 1.33785e-08  8037289  9037289
2: rs2 12263919 5.03939e-11 11763919 12763919
3: rs3 19771438 1.83669e-20 19271438 20271438
4: rs4 19845279 3.60144e-18 19345279 20345279

What I am trying to do from here, is work my way in to a conditional code, that will keep the SNP with the lowest P-value (in this example rs3), and then drop the fourth row containing rs4.

So that i will get the following result:

  SNP       bp           p    start     stop
1: rs1  8537289 1.33785e-08  8037289  9037289
2: rs2 12263919 5.03939e-11 11763919 12763919
3: rs3 19771438 1.83669e-20 19271438 20271438

I'm well aware that plink clumping is the usual procedure for defining loci, but I'm trying this as an alternative and robust way to establish associations to the trait. I hope someone is willing to share a piece of their brilliant skill and I promise the best of good karma in return. Cheers

EDIT

Looking through the results, I noticed an unintentional fault produced with the great solution provided by @Axeman. My apologies if I didn't express myself properly. I ran the following code to print the "g-column" without dropping SNPs

d1 <- d %>% group_by(Chr,p, locus = cumsum(minus > lag(plus, default=1)))
%>% slice_min(p)

And noticed when the chromosome changes from n to n2, the non-overlapping genomic region is merged into one locus. So that the last locus at chromosome n and the first locus at n2 is "merged" (or dropped). Would you know how to work your way around it?


Solution

  • You can define groups for overlapping sections, then filter for the lowest p-value. E.g.:

    library(dplyr)
    d %>% 
      group_by(g = cumsum(start > lag(stop, default = 1))) %>% 
      slice_min(p)
    
    # A tibble: 3 × 6
    # Groups:   g [3]
      SNP         bp        p    start     stop     g
      <chr>    <int>    <dbl>    <int>    <int> <int>
    1 rs1    8537289 1.34e- 8  8037289  9037289     1
    2 rs2   12263919 5.04e-11 11763919 12763919     2
    3 rs3   19771438 1.84e-20 19271438 20271438     3
    

    This is greedy in the sense that it will group together SNPs that are not overlapping if there is a SNP between them that overlaps with both. Not sure what you want to do in the case of sequences of partially overlapping SNPs.