python bioinformatics protein-database

How to extract chain-IDs from PDB files?


I have all the PDB files stored on my local hard disk. The files are in pdbXXXX.ent.gz format.

I have a Python program that reads a text file, which must be in the following format:

pdb_id  chain_id  resolution

How can I prepare this plain text file from all those PDB files?


Solution

  • Biopython can parse a PDB file even when it is gzip-compressed. Just be sure to open the file in text mode ("rt"); otherwise you end up with a TypeError.

    I tested the following script on a small sample: 4 gzipped PDB entries in a local folder.

    import gzip
    import warnings
    from pathlib import Path
    from Bio.PDB.PDBExceptions import PDBConstructionWarning
    from Bio.PDB import PDBParser
    
    # Suppress construction warnings such as 'WARNING: Chain B is discontinuous at line 4059.'
    warnings.simplefilter('ignore', PDBConstructionWarning)
    
    parser = PDBParser()
    
    if __name__ == "__main__":
        pdb_zips = Path("zipped_pdbs").glob('**/*.ent.gz')
        for pdb_filename in pdb_zips:
            with gzip.open(pdb_filename, "rt") as file_handle:
                structure = parser.get_structure("?", file_handle)
            # You could also parse the PDB code from the file name,
            # but reading it from the header was easier to implement.
            pdb_code = structure.header.get("idcode")
            resolution = structure.header.get("resolution")
    
            for chain in structure.get_chains():
                print(f"{pdb_code}  {chain.id}  {resolution}")
    
    

    The output reads

    7LWV  A  3.12
    7LWV  B  3.12
    7LWV  C  3.12
    6U9D  A  3.19
    6U9D  B  3.19
    6U9D  C  3.19
    6U9D  D  3.19
    6U9D  E  3.19
    6U9D  F  3.19
    6U9D  G  3.19
    6U9D  H  3.19
    6U9D  I  3.19
    6U9D  J  3.19
    6U9D  K  3.19
    6U9D  L  3.19
    6U9D  M  3.19
    6U9D  N  3.19
    6U9D  O  3.19
    6U9D  P  3.19
    6U9D  Q  3.19
    6U9D  R  3.19
    6U9D  S  3.19
    6U9D  T  3.19
    6U9D  U  3.19
    6U9D  V  3.19
    6U9D  W  3.19
    6U9D  X  3.19
    1F34  A  2.45
    1F34  B  2.45
    2OXP  A  2.0
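
Since the question asks for a plain text file rather than console output, the rows can also be written straight to disk. The following is only a minimal sketch using the standard library: the helper names `pdb_code_from_filename` and `write_chain_table`, the output name `chains.txt`, and the choice to write a missing resolution as `NA` are all assumptions for illustration, not part of the original script or of Biopython. It also shows the alternative mentioned in the code comment, taking the PDB code from a file name of the form pdbXXXX.ent.gz.

```python
from pathlib import Path


def pdb_code_from_filename(path):
    """Return the upper-case PDB code from a name like 'pdb7lwv.ent.gz'."""
    name = Path(path).name
    if not (name.startswith("pdb") and name.endswith(".ent.gz")):
        raise ValueError(f"unexpected file name: {name}")
    # Strip the 'pdb' prefix and the '.ent.gz' suffix.
    return name[len("pdb"):-len(".ent.gz")].upper()


def write_chain_table(rows, out_path):
    """Write (pdb_id, chain_id, resolution) tuples, one per line."""
    with open(out_path, "w") as out:
        for pdb_id, chain_id, resolution in rows:
            # Assumption: a missing resolution (None) is written as 'NA'.
            res = "NA" if resolution is None else resolution
            out.write(f"{pdb_id}  {chain_id}  {res}\n")
```

To use it with the script above, collect `(pdb_code, chain.id, resolution)` tuples in a list instead of printing them, then call `write_chain_table(rows, "chains.txt")` once after the loop. Check how your downstream program expects a missing resolution to be represented before relying on the `NA` placeholder.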