Is there any to generate PSSM matrix from PSI BLAST using the python package BioPython? Indeed, I have 8000 sequences in .fasta file. Every sequence length is also long?
I am using this below code:
for fasta in files:
alignment = AlignIO.read(fasta, "fasta")
summary_align = AlignInfo.SummaryInfo(alignment)
consensus = summary_align.dumb_consensus()
my_pssm = summary_align.pos_specific_score_matrix(consensus, chars_to_ignore = ['N', '-'])
file_pssm = fasta+"pssm"
with open(file_pssm) as f:
f.write(my_pssm)
Is there any better way to do it? The matrix which is shown consisting of 0 and 1 only. I need actual PSSM scoring values (which are in normalized form)
I used the following code to extract PSSM matrices from a fasta file:
from Bio.Blast.Applications import NcbipsiblastCommandline
# Set the paths to the db
db = "path/to/your/db"
query = "path/to/the/fasta/file"
output_dir = "output/dir/"
# Set the parameters for the psiblast command
psiblast_cline = NcbipsiblastCommandline(cmd="psiblast", query=query, db=db, num_iterations=3, evalue=0.01, num_threads=16, out_ascii_pssm=output_dir, save_each_pssm=True)
# Run psiblast
print("Running psiblast...")
stdout, stderr = psiblast_cline()
# Check the output for any error messages
if stderr:
print("Error running psiblast:", stderr)
else:
print("PSSM files generated successfully.")
This should generate a bunch of files in the output directory, each containing the PSSM matrix of the corresponding sequence + some other information. You camn use the code below to extract the sequence and the PSSM matrix from and reutrn it as numpy array:
def pssm_file_to_nparray(pssm_path):
with open(pssm_path, 'r') as file:
lines = file.readlines()
# Skip the header lines
lines = lines[3:-6]
# Extract the PSSM values
pssm_values = []
pssm_seq = []
for line in lines:
line_values = line.strip().split()[2:22]
seq_values = line.strip().split()[1]
pssm_values.append([int(value) for value in line_values])
pssm_seq.append(seq_values)
return "".join(pssm_seq), np.array(pssm_values)
Do know that you need to have the BLAST+ command line tool installed and its bin path set to be able to run the code. Do also note that the PSI-BLAST tool generates sometimes two pssm files for some sequence, so you would need to keep track of sequences that've you already extracted.