I have a text file of 5,000 rows. Each row corresponds to a unique FILE
all labelled gwas*
:
CHR BP FILE
chr1 12345678 gwas1
chr2 87654321 gwas2
...
I have 5,000 gwas*
files all with unique file names (as found in column 4 above - FILE
), for example, gwas1
looks like this:
CHR BP
chr1 12345678
chr1 12345679
chr1 12356777
...
I want to extract all rows from each gwas*
file where the BP
value is within 500,000 (up and down) of the BP
value in the text file. The CHR
value must also match. For example:
gwas1
in the text file has the CHR
value chr1
and the BP
value 12,345,678
.gwas1
which have a chr1
value in the CHR
column and which have a BP
value between 11,845,678
(this value is 500,000 down from the BP
value in the text file) and 12,845,678
(this value is 500,000 down from the BP
value in the text file).I can do this manually for a single gwas file using the below code (this does not use the text file of 5,000 rows however):
export CHR="chr11"
export BP=107459522
export WINDOW=500000
awk -v CHR=$CHR -v BP_pos=$(($BP + $WINDOW)) -v BP_neg=$(($BP - $WINDOW)) 'BEGIN{FS=OFS="\t"}FNR==1 || ($1 == CHR && $2 < BP_pos && $2 > BP_neg )' gwas1 > gwas1_extract
I want to be able to do this for all 5,000 gwas files listed in my text file. The output for every gwas file should only include the rows in which (i) the column CHR
contained the value for that file (e.g., for gwas1
this is chr1
) and the BP
column contained values that were within 500,000 of the value given in the text file for the BP
column (for gwas1
this is 500,000 either side of 12,345,678
). The output file would look like this:
CHR BP
chr1 12345678
chr1 12345679
chr1 12356777
awk --version
= GNU Awk 4.0.2
Any help would be great!
Sample inputs:
$ cat file.txt
CHR BP FILE
chr1 12345678 gwas1
chr2 87654321 gwas2
chr4 99999999 gwas4 # file gwas4 does not exist
$ cat gwas1
CHR BP
chr1 12345678 # match
chr1 12345679 # match
chr1 12356777 # match
chr1 99999999
$ cat gwas2
CHR BP
chr1 12345678
chr2 87650000 # match
$ cat gwas3 # no matches since gwas3 not in file.txt
CHR BP
chr3 2134234
NOTE: comments do not exist in actual files
A single GNU awk
script to process all gwas*
files:
awk -v diff=500000 '
function abs(x) { return (x < 0.0) ? -x : x }
BEGIN { FS=OFS="\t" }
FNR==NR { if (FNR>1) # 1st file; skip header and ...
bp_list[$3][$1]=$2 # save contents in our bp_list[FILE][CHR] array
next
}
FNR==1 { close(outfile) # close previous output file
fn=FILENAME
outfile=fn "_extract"
if (fn in bp_list) # if fn in 1st file then ...
print > outfile # print header else ...
else # skip to next input file; also addresses gwas* matching on gwas*_extract files, ie, these will be skipped, too
nextfile
next
}
fn in bp_list { if ($1 in bp_list[fn] && abs(bp_list[fn][$1] - $2) <= diff)
print > outfile
}
' file.txt gwas*
NOTE: requires GNU awk
for multidimensional arrays (aka array of arrays)
This generates:
$ head gwas*extract
==> gwas1_extract <==
CHR BP
chr1 12345678
chr1 12345679
chr1 12356777
==> gwas2_extract <==
CHR BP
chr2 87650000