I am working on a bioinformatics project that involves very big NarrowPeak formated files that look like this:
(the columns are 'chrom ,chromStart, chromEnd, name, score ,strand, signalValue ,pValue, qValue ,peak')
chr1 713835 714424 chr1.1 1000 . 0.1621 10.6 -1 253
chr1 752775 753050 chr1.2 567 . 0.0365 2.09 -1 124
chr1 753373 753448 chr1.3 511 . 0.0243 1.31 -1 27
chr1 762057 763210 chr1.4 1000 . 0.1743 11.5 -1 846
chr1 793358 793612 chr1.5 574 . 0.0379 2.18 -1 121
chr1 805101 805538 chr1.6 783 . 0.0834 5.23 -1 200
chr1 839626 840461 chr1.7 1000 . 0.2079 13.8 -1 510
chr1 842125 842534 chr1.8 641 . 0.0526 3.15 -1 199
chr2 846334 846381 chr1.9 510 . 0.0241 1.3 -1 17
chr2 846545 846937 chr1.10 562 . 0.0355 2.03 -1 198
chr2 848219 848588 chr1.11 605 . 0.0448 2.64 -1 179
chr2 851784 852887 chr1.12 734 . 0.0728 4.51 -1 323
I have a way of indexing the file (using tabix in samtools) and querying a row by entering the chr number and range and getting the right row (I am using it because my teammate says it is the fastest way).
for example if my query is chr1:713835-714424 I will get the result:
chr1 713835 714424 chr1.1 1000 . 0.1621 10.6 -1 253
I want to be able to enter the chr num and range and get the score 253 alone.
I did it by turning the result to a data frame (with pandas).
def turn_query_results_into_df(query_results):
replaced=(query_results.replace('\t',','))
stripped = (replaced.strip())
lines = (replaced.split("\n") )
data=[]
for line in lines:
data.append((line.split(',')))
mydf1=pd.DataFrame(data,columns=('chrom chromStart chromEnd name score strand signalValue pValue qValue peak'.split()))
mydf1=mydf1.drop(mydf1.tail(1).index)
mydf1=mydf1.astype({'chromStart':'int32','chromEnd':'int32','score':'int32','signalValue':'float64','pValue':'float64','peak':'float64'})
return mydf1
print(peak_val = df1['peak'].values)
Is there a better way?
If there is a way using samtools, it would be the best, but I am having a hard time understanding it.
Thanks
Pandas is serious overkill. If you are using tabix
for your querying too, with a command sequence like:
$ tabix input.bed.gz # index the input
$ tabix input.bed.gz chr1:713835-714424 # query the input
chr1 713835 714424 chr1.1 1000 . 0.1621 10.6 -1 253
You can just pipe the output of your tabix
query to the Unix cut
utility to select only the 10th column:
$ tabix input.bed.gz chr1:713835-714424 | cut -f 10
253