bashawkgenetics

extract rows across multiple files using conditions and variables


I have a text file of 5,000 rows. Each row corresponds to a unique FILE all labelled gwas*:

CHR     BP          FILE
chr1    12345678    gwas1
chr2    87654321    gwas2
...

I have 5,000 gwas* files all with unique file names (as found in column 4 above - FILE), for example, gwas1 looks like this:

CHR     BP                  
chr1    12345678    
chr1    12345679        
chr1    12356777    
...

I want to extract all rows from each gwas* file where the BP value is within 500,000 (up and down) of the BP value in the text file. The CHR value must also match. For example:

  1. gwas1 in the text file has the CHR value chr1 and the BP value 12,345,678.
  2. I want to extract all rows from gwas1 which have a chr1 value in the CHR column and which have a BP value between 11,845,678 (this value is 500,000 down from the BP value in the text file) and 12,845,678 (this value is 500,000 down from the BP value in the text file).

I can do this manually for a single gwas file using the below code (this does not use the text file of 5,000 rows however):

export CHR="chr11"
export BP=107459522
export WINDOW=500000

awk -v CHR=$CHR -v BP_pos=$(($BP + $WINDOW)) -v BP_neg=$(($BP - $WINDOW)) 'BEGIN{FS=OFS="\t"}FNR==1 || ($1 == CHR && $2 < BP_pos && $2 > BP_neg )' gwas1 > gwas1_extract

I want to be able to do this for all 5,000 gwas files listed in my text file. The output for every gwas file should only include the rows in which (i) the column CHR contained the value for that file (e.g., for gwas1 this is chr1) and the BP column contained values that were within 500,000 of the value given in the text file for the BP column (for gwas1 this is 500,000 either side of 12,345,678). The output file would look like this:

CHR     BP           
chr1    12345678  
chr1    12345679  
chr1    12356777

awk --version = GNU Awk 4.0.2

Any help would be great!


Solution

  • Sample inputs:

    $ cat file.txt
    CHR     BP      FILE
    chr1    12345678        gwas1
    chr2    87654321        gwas2
    chr4    99999999        gwas4             # file gwas4 does not exist
    
    $ cat gwas1
    CHR     BP
    chr1    12345678                          # match
    chr1    12345679                          # match
    chr1    12356777                          # match
    chr1    99999999
    
    $ cat gwas2
    CHR     BP
    chr1    12345678
    chr2    87650000                          # match
    
    $ cat gwas3                               # no matches since gwas3 not in file.txt
    CHR     BP
    chr3    2134234
    

    NOTE: comments do not exist in actual files

    A single GNU awk script to process all gwas* files:

    awk -v diff=500000 '
    
    function abs(x) { return (x < 0.0) ? -x : x }
    
    BEGIN         { FS=OFS="\t" }
    
    FNR==NR       { if (FNR>1)                      # 1st file; skip header and ...
                       bp_list[$3][$1]=$2           # save contents in our bp_list[FILE][CHR] array
                    next
                  }
    
    FNR==1        { close(outfile)                  # close previous output file
                    fn=FILENAME
                    outfile=fn "_extract"
                    if (fn in bp_list)              # if fn in 1st file then ...
                       print > outfile              # print header else ...
                    else                            # skip to next input file; also addresses gwas* matching on gwas*_extract files, ie, these will be skipped, too
                       nextfile
                    next
                  }
    
    fn in bp_list { if ($1 in bp_list[fn] && abs(bp_list[fn][$1] - $2) <= diff)
                       print > outfile
                  }
    ' file.txt gwas*
    

    NOTE: requires GNU awk for multidimensional arrays (aka array of arrays)

    This generates:

    $ head gwas*extract
    ==> gwas1_extract <==
    CHR     BP
    chr1    12345678
    chr1    12345679
    chr1    12356777
    
    ==> gwas2_extract <==
    CHR     BP
    chr2    87650000