pythonpython-3.xbiopythonncbi

How to generate a PSSM matrix using PSI BLAST from BioPython


Is there any to generate PSSM matrix from PSI BLAST using the python package BioPython? Indeed, I have 8000 sequences in .fasta file. Every sequence length is also long?

I am using this below code:

for fasta in files:
alignment = AlignIO.read(fasta, "fasta")
summary_align = AlignInfo.SummaryInfo(alignment)
consensus = summary_align.dumb_consensus()
my_pssm = summary_align.pos_specific_score_matrix(consensus, chars_to_ignore = ['N', '-'])
file_pssm = fasta+"pssm"
with open(file_pssm) as f:
     f.write(my_pssm)

Is there any better way to do it? The matrix which is shown consisting of 0 and 1 only. I need actual PSSM scoring values (which are in normalized form)


Solution

  • I used the following code to extract PSSM matrices from a fasta file:

    from Bio.Blast.Applications import NcbipsiblastCommandline
    
    
    # Set the paths to the db
    db = "path/to/your/db"
    query = "path/to/the/fasta/file"
    output_dir = "output/dir/"
    
    # Set the parameters for the psiblast command
    psiblast_cline = NcbipsiblastCommandline(cmd="psiblast", query=query, db=db, num_iterations=3, evalue=0.01, num_threads=16, out_ascii_pssm=output_dir, save_each_pssm=True)
    
    # Run psiblast
    print("Running psiblast...")
    stdout, stderr = psiblast_cline()
    
    # Check the output for any error messages
    if stderr:
        print("Error running psiblast:", stderr)
    else:
        print("PSSM files generated successfully.")
    

    This should generate a bunch of files in the output directory, each containing the PSSM matrix of the corresponding sequence + some other information. You camn use the code below to extract the sequence and the PSSM matrix from and reutrn it as numpy array:

    def pssm_file_to_nparray(pssm_path):
        with open(pssm_path, 'r') as file:
            lines = file.readlines()
    
            # Skip the header lines
            lines = lines[3:-6]
    
            # Extract the PSSM values
            pssm_values = []
            pssm_seq = []
            for line in lines:
                line_values = line.strip().split()[2:22]
                seq_values = line.strip().split()[1]
                pssm_values.append([int(value) for value in line_values])
                pssm_seq.append(seq_values)
            
            return "".join(pssm_seq), np.array(pssm_values)
    

    Do know that you need to have the BLAST+ command line tool installed and its bin path set to be able to run the code. Do also note that the PSI-BLAST tool generates sometimes two pssm files for some sequence, so you would need to keep track of sequences that've you already extracted.