pythonbashtabularbiopythonblast

Resources to learn how to manage tabular data (from BLAST+6 format ) using Bash and/or Biopython


I am running BLAST and would like to manipulate the output from using BLAST+6 format.

For example, I would like to take the E-value, query coverage, and identity for each hit and then plug them into an equation that weighs all three into a single "score". I would then like to take all scores and place them into a table so I can sort each hit by decreasing "score".

I would also like to generate an ORF for each BLAST hit on the database and add them to their corresponding place in the table.

Can someone point any resources/keywords I can search to learn about how to manipulate tabular data?

Example:

blastn  -query genes.fasta  -subject genome.fasta  -outfmt "6 qseqid pident qcovs  evalue"

Output:

    qseqid pident qcovs evalue   
0   moaC     100.00    0.0       161.0      
1   moaC     99.38     1.0       161.0  

I would like to take values from each column and use them as variables in an equation, then print that value in the corresponding row. I will be using a bash script or BioPython for BLAST, so I would like to make the data manipulation as part of that

Rather than a solution to this example, I would like to see if there's a resource where I can learn about this topic (up until this point I would use spreadsheet programs to manipulate tabular data)


Solution

  • For working with tabular data, I really recommend pandas.

    First you want to convert your output to a pandas DataFrame, which is a data structure that is well suited to store data that comes in tabular form.

    For this example I'm using tblastn and the example files four_human_proteins.fasta and rhodopsin_nucs.fasta.

    >>> import pandas as pd
    >>> from Bio.Blast.Applications import NcbiblastnCommandline
    >>> cline = NcbiblastnCommandline(cmd='/path/to/BLAST+/2.8.1/bin/tblastn', 
                              query='four_human_proteins.fasta', 
                              subject='rhodopsin_nucs.fasta', 
                              evalue='1e-10',
                              outfmt='"6 qseqid pident qcovs evalue"')
    >>> print(cline)
    /path/to/BLAST+/2.8.1/bin/tblastn -outfmt "6 qseqid pident qcovs evalue" -query four_human_proteins.fasta -evalue 1e-10 -subject rhodopsin_nucs.fasta
    
    >>> blast_output = cline()[0].strip()
    >>> print(blast_output) 
    sp|P08100|OPSD_HUMAN    96.552  100 0.0
    sp|P08100|OPSD_HUMAN    93.391  100 0.0
    sp|P08100|OPSD_HUMAN    95.092  94  0.0
    sp|P08100|OPSD_HUMAN    84.795  98  0.0
    sp|P08100|OPSD_HUMAN    82.164  98  0.0
    sp|P08100|OPSD_HUMAN    96.396  89  2.65e-67
    sp|P08100|OPSD_HUMAN    92.308  89  7.50e-36
    sp|P08100|OPSD_HUMAN    93.220  89  1.81e-32
    sp|P08100|OPSD_HUMAN    96.296  89  6.37e-32
    sp|P08100|OPSD_HUMAN    88.462  89  4.64e-12
    
    
    >>> headers = ['qseqid', 'pident', 'qcovs', 'evalue']
    >>> rows = [line.split() for line in blast_output.splitlines()]    
    >>> df = pd.DataFrame(rows, columns=headers)
    >>> print(df)
        qseqid  pident  qcovs   evalue
    0   sp|P08100|OPSD_HUMAN    96.552  100     0.0
    1   sp|P08100|OPSD_HUMAN    93.391  100     0.0
    2   sp|P08100|OPSD_HUMAN    95.092  94  0.0
    3   sp|P08100|OPSD_HUMAN    84.795  98  0.0
    4   sp|P08100|OPSD_HUMAN    82.164  98  0.0
    5   sp|P08100|OPSD_HUMAN    96.396  89  2.65e-67
    6   sp|P08100|OPSD_HUMAN    92.308  89  7.50e-36
    7   sp|P08100|OPSD_HUMAN    93.220  89  1.81e-32
    8   sp|P08100|OPSD_HUMAN    96.296  89  6.37e-32
    9   sp|P08100|OPSD_HUMAN    88.462  89  4.64e-12
    

    First we need to tell pandas which columns contain floats.

    >>> convert = {'pident': float, 
                  'qcovs': float, 
                  'evalue': float,
                  'qseqid': str}
    >>> df = df.astype(convert) 
    

    Now can now easily perform columnwise operations on this DataFrame df. Define your score function and add the result as an extra column:

    >>> df['score'] = df['qcovs'] / df['pident']  # adapt to your own needs
    >>> print(df)
        qseqid  pident  qcovs   evalue  score
    0   sp|P08100|OPSD_HUMAN    96.552  100.0   0.000000e+00    1.035711
    1   sp|P08100|OPSD_HUMAN    93.391  100.0   0.000000e+00    1.070767
    2   sp|P08100|OPSD_HUMAN    95.092  94.0    0.000000e+00    0.988516
    3   sp|P08100|OPSD_HUMAN    84.795  98.0    0.000000e+00    1.155729
    4   sp|P08100|OPSD_HUMAN    82.164  98.0    0.000000e+00    1.192736
    5   sp|P08100|OPSD_HUMAN    96.396  89.0    2.650000e-67    0.923275
    6   sp|P08100|OPSD_HUMAN    92.308  89.0    7.500000e-36    0.964163
    7   sp|P08100|OPSD_HUMAN    93.220  89.0    1.810000e-32    0.954731
    8   sp|P08100|OPSD_HUMAN    96.296  89.0    6.370000e-32    0.924234
    9   sp|P08100|OPSD_HUMAN    88.462  89.0    4.640000e-12    1.006082
    

    And you can easily sort the DataFrame by this score column

    >>> df.sort_values(['score'], inplace=True)
    >>> print(df)
        qseqid  pident  qcovs   evalue  score
    5   sp|P08100|OPSD_HUMAN    96.396  89.0    2.650000e-67    0.923275
    8   sp|P08100|OPSD_HUMAN    96.296  89.0    6.370000e-32    0.924234
    7   sp|P08100|OPSD_HUMAN    93.220  89.0    1.810000e-32    0.954731
    6   sp|P08100|OPSD_HUMAN    92.308  89.0    7.500000e-36    0.964163
    2   sp|P08100|OPSD_HUMAN    95.092  94.0    0.000000e+00    0.988516
    9   sp|P08100|OPSD_HUMAN    88.462  89.0    4.640000e-12    1.006082
    0   sp|P08100|OPSD_HUMAN    96.552  100.0   0.000000e+00    1.035711
    1   sp|P08100|OPSD_HUMAN    93.391  100.0   0.000000e+00    1.070767
    3   sp|P08100|OPSD_HUMAN    84.795  98.0    0.000000e+00    1.155729
    4   sp|P08100|OPSD_HUMAN    82.164  98.0    0.000000e+00    1.192736