I am running BLAST and would like to manipulate the output from using BLAST+6 format.
For example, I would like to take the E-value, query coverage, and identity for each hit and then plug them into an equation that weighs all three into a single "score". I would then like to take all scores and place them into a table so I can sort each hit by decreasing "score".
I would also like to generate an ORF for each BLAST hit on the database and add them to their corresponding place in the table.
Can someone point any resources/keywords I can search to learn about how to manipulate tabular data?
Example:
blastn -query genes.fasta -subject genome.fasta -outfmt "6 qseqid pident qcovs evalue"
Output:
qseqid pident qcovs evalue
0 moaC 100.00 0.0 161.0
1 moaC 99.38 1.0 161.0
I would like to take values from each column and use them as variables in an equation, then print that value in the corresponding row. I will be using a bash script or BioPython for BLAST, so I would like to make the data manipulation as part of that
Rather than a solution to this example, I would like to see if there's a resource where I can learn about this topic (up until this point I would use spreadsheet programs to manipulate tabular data)
For working with tabular data, I really recommend pandas.
First you want to convert your output to a pandas DataFrame
, which is a data structure that is well suited to store data that comes in tabular form.
For this example I'm using tblastn
and the example files four_human_proteins.fasta
and rhodopsin_nucs.fasta
.
>>> import pandas as pd
>>> from Bio.Blast.Applications import NcbiblastnCommandline
>>> cline = NcbiblastnCommandline(cmd='/path/to/BLAST+/2.8.1/bin/tblastn',
query='four_human_proteins.fasta',
subject='rhodopsin_nucs.fasta',
evalue='1e-10',
outfmt='"6 qseqid pident qcovs evalue"')
>>> print(cline)
/path/to/BLAST+/2.8.1/bin/tblastn -outfmt "6 qseqid pident qcovs evalue" -query four_human_proteins.fasta -evalue 1e-10 -subject rhodopsin_nucs.fasta
>>> blast_output = cline()[0].strip()
>>> print(blast_output)
sp|P08100|OPSD_HUMAN 96.552 100 0.0
sp|P08100|OPSD_HUMAN 93.391 100 0.0
sp|P08100|OPSD_HUMAN 95.092 94 0.0
sp|P08100|OPSD_HUMAN 84.795 98 0.0
sp|P08100|OPSD_HUMAN 82.164 98 0.0
sp|P08100|OPSD_HUMAN 96.396 89 2.65e-67
sp|P08100|OPSD_HUMAN 92.308 89 7.50e-36
sp|P08100|OPSD_HUMAN 93.220 89 1.81e-32
sp|P08100|OPSD_HUMAN 96.296 89 6.37e-32
sp|P08100|OPSD_HUMAN 88.462 89 4.64e-12
>>> headers = ['qseqid', 'pident', 'qcovs', 'evalue']
>>> rows = [line.split() for line in blast_output.splitlines()]
>>> df = pd.DataFrame(rows, columns=headers)
>>> print(df)
qseqid pident qcovs evalue
0 sp|P08100|OPSD_HUMAN 96.552 100 0.0
1 sp|P08100|OPSD_HUMAN 93.391 100 0.0
2 sp|P08100|OPSD_HUMAN 95.092 94 0.0
3 sp|P08100|OPSD_HUMAN 84.795 98 0.0
4 sp|P08100|OPSD_HUMAN 82.164 98 0.0
5 sp|P08100|OPSD_HUMAN 96.396 89 2.65e-67
6 sp|P08100|OPSD_HUMAN 92.308 89 7.50e-36
7 sp|P08100|OPSD_HUMAN 93.220 89 1.81e-32
8 sp|P08100|OPSD_HUMAN 96.296 89 6.37e-32
9 sp|P08100|OPSD_HUMAN 88.462 89 4.64e-12
First we need to tell pandas
which columns contain float
s.
>>> convert = {'pident': float,
'qcovs': float,
'evalue': float,
'qseqid': str}
>>> df = df.astype(convert)
Now can now easily perform columnwise operations on this DataFrame df
.
Define your score function and add the result as an extra column:
>>> df['score'] = df['qcovs'] / df['pident'] # adapt to your own needs
>>> print(df)
qseqid pident qcovs evalue score
0 sp|P08100|OPSD_HUMAN 96.552 100.0 0.000000e+00 1.035711
1 sp|P08100|OPSD_HUMAN 93.391 100.0 0.000000e+00 1.070767
2 sp|P08100|OPSD_HUMAN 95.092 94.0 0.000000e+00 0.988516
3 sp|P08100|OPSD_HUMAN 84.795 98.0 0.000000e+00 1.155729
4 sp|P08100|OPSD_HUMAN 82.164 98.0 0.000000e+00 1.192736
5 sp|P08100|OPSD_HUMAN 96.396 89.0 2.650000e-67 0.923275
6 sp|P08100|OPSD_HUMAN 92.308 89.0 7.500000e-36 0.964163
7 sp|P08100|OPSD_HUMAN 93.220 89.0 1.810000e-32 0.954731
8 sp|P08100|OPSD_HUMAN 96.296 89.0 6.370000e-32 0.924234
9 sp|P08100|OPSD_HUMAN 88.462 89.0 4.640000e-12 1.006082
And you can easily sort the DataFrame by this score
column
>>> df.sort_values(['score'], inplace=True)
>>> print(df)
qseqid pident qcovs evalue score
5 sp|P08100|OPSD_HUMAN 96.396 89.0 2.650000e-67 0.923275
8 sp|P08100|OPSD_HUMAN 96.296 89.0 6.370000e-32 0.924234
7 sp|P08100|OPSD_HUMAN 93.220 89.0 1.810000e-32 0.954731
6 sp|P08100|OPSD_HUMAN 92.308 89.0 7.500000e-36 0.964163
2 sp|P08100|OPSD_HUMAN 95.092 94.0 0.000000e+00 0.988516
9 sp|P08100|OPSD_HUMAN 88.462 89.0 4.640000e-12 1.006082
0 sp|P08100|OPSD_HUMAN 96.552 100.0 0.000000e+00 1.035711
1 sp|P08100|OPSD_HUMAN 93.391 100.0 0.000000e+00 1.070767
3 sp|P08100|OPSD_HUMAN 84.795 98.0 0.000000e+00 1.155729
4 sp|P08100|OPSD_HUMAN 82.164 98.0 0.000000e+00 1.192736