Extract Columns from a Protein Data Bank (PDB) Text File


I want to make a plot with Matplotlib in Python, and for that I need to read data from a PDB (Protein Data Bank) file. I want to extract every column from the file and store each one in a separate vector. The PDB file has columns containing both text and floats. I'm very new to Matplotlib, and I have tried several suggested methods for extracting these columns, but nothing seems to work. What would be the best way to extract them? I'm going to load a lot of data at a later stage, so the method shouldn't be too inefficient.

The PDB file looks something like this:

ATOM      1  CA  MET A   1      38.012   8.932  -1.253
ATOM      2  CA  GLU A   2      39.809   5.652  -1.702
ATOM      3  CA  ALA A   3      43.007   5.013   0.368
ATOM      4  CA  ALA A   4      41.646   7.577   2.820
ATOM      5  CA  HIS A   5      42.611   4.898   5.481
ATOM      6  CA  SER A   6      46.191   5.923   5.090
ATOM      7  CA  LYS A   7      45.664   9.815   5.134
ATOM      8  CA  SER A   8      45.898  12.022   8.181
ATOM      9  CA  THR A   9      42.528  13.075   9.570
ATOM     10  CA  GLU A  10      43.330  16.633   8.378
ATOM     11  CA  GLU A  11      44.171  15.729   4.757
ATOM     12  CA  CYS A  12      40.589  14.150   4.745
ATOM     13  CA  LEU A  13      38.984  17.314   6.105
ATOM     14  CA  ALA A  14      40.633  19.053   3.220
ATOM     15  CA  TYR A  15      39.740  16.682   0.505
ATOM     16  CA  PHE A  16      36.138  17.421   1.566
ATOM     17  CA  GLY A  17      36.536  20.854   2.826
ATOM     18  CA  VAL A  18      34.184  20.012   5.553
ATOM     19  CA  SER A  19      34.483  20.966   9.177

Solution

  • Going off of @Kyle_S-C's recommendation, here's a way to do it using Biopython.

    First read your file into a Biopython Structure object:

    import Bio.PDB
    path = '/path/to/PDB/file' # your file path here
    p = Bio.PDB.PDBParser()
    structure = p.get_structure('myStructureName', path)
    

    Then, for example, you can get a list of just the Atom ids (for an Atom, the id is its name, e.g. 'CA') like this:

    ids = [a.get_id() for a in structure.get_atoms()]
    
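The same loop can fill one list per column, which is what the question asks for. The following is only a minimal sketch: `load_atoms` and `atom_columns` are hypothetical helper names, and the residue and chain values are assumed to come from each atom's parent objects via Biopython's `get_parent()`, `get_resname()` and `get_id()` accessors.

```python
def load_atoms(path, name="myStructureName"):
    """Parse a PDB file with Biopython and return its atoms as a list."""
    import Bio.PDB  # imported here so atom_columns() itself has no hard dependency

    parser = Bio.PDB.PDBParser(QUIET=True)  # QUIET=True silences parser warnings
    structure = parser.get_structure(name, path)
    return list(structure.get_atoms())


def atom_columns(atoms):
    """Return a dict mapping a column name to a list with one value per atom."""
    keys = ("serial", "name", "resname", "chain", "resseq", "x", "y", "z")
    columns = {key: [] for key in keys}
    for atom in atoms:
        residue = atom.get_parent()   # Residue that contains this atom
        chain = residue.get_parent()  # Chain that contains this residue
        x, y, z = atom.get_coord()
        columns["serial"].append(atom.get_serial_number())
        columns["name"].append(atom.get_name())
        columns["resname"].append(residue.get_resname())
        columns["chain"].append(chain.get_id())
        # A residue id is a tuple: (hetero flag, sequence number, insertion code)
        columns["resseq"].append(residue.get_id()[1])
        columns["x"].append(float(x))
        columns["y"].append(float(y))
        columns["z"].append(float(z))
    return columns
```

With a real file this would be called as `columns = atom_columns(load_atoms(path))`, after which `columns["x"]` and `columns["y"]` are plain lists that can be passed to Matplotlib's `plt.plot`.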

    See the Biopython Structural Bioinformatics FAQ for more, including the following methods for accessing the PDB columns for an Atom:

    How do I extract information from an Atom object?

    Using the following methods:

    # a.get_name()           # atom name (spaces stripped, e.g. 'CA')
    # a.get_id()             # id (equals atom name)
    # a.get_coord()          # atomic coordinates
    # a.get_vector()         # atomic coordinates as Vector object
    # a.get_bfactor()        # isotropic B factor
    # a.get_occupancy()      # occupancy
    # a.get_altloc()         # alternative location specifier
    # a.get_sigatm()         # std. dev. of atomic parameters
    # a.get_siguij()         # std. dev. of anisotropic B factor
    # a.get_anisou()         # anisotropic B factor
    # a.get_fullname()       # atom name (with spaces, e.g. ' CA ')
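If Biopython is not an option, the columns can also be sliced directly from the text, because ATOM records in the PDB format are fixed-width. This is a different technique from the Biopython approach above, shown only as a rough sketch: `parse_atom_line` and `read_columns` are hypothetical names, and occupancy and B factor (columns 55-60 and 61-66 in the format) are left out because the sample lines in the question do not include them.

```python
def parse_atom_line(line):
    """Slice one fixed-width ATOM/HETATM record into its columns."""
    return {
        "record": line[0:6].strip(),    # columns 1-6: record type
        "serial": int(line[6:11]),      # columns 7-11: atom serial number
        "name": line[12:16].strip(),    # columns 13-16: atom name
        "resname": line[17:20].strip(), # columns 18-20: residue name
        "chain": line[21],              # column 22: chain identifier
        "resseq": int(line[22:26]),     # columns 23-26: residue sequence number
        "x": float(line[30:38]),        # columns 31-38: x coordinate
        "y": float(line[38:46]),        # columns 39-46: y coordinate
        "z": float(line[46:54]),        # columns 47-54: z coordinate
    }


def read_columns(lines):
    """Collect ATOM/HETATM records into one list per column."""
    columns = {}
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue  # skip headers, remarks, TER, END, and so on
        for key, value in parse_atom_line(line).items():
            columns.setdefault(key, []).append(value)
    return columns


# The first three records from the question, with their original spacing
sample = [
    "ATOM      1  CA  MET A   1      38.012   8.932  -1.253",
    "ATOM      2  CA  GLU A   2      39.809   5.652  -1.702",
    "ATOM      3  CA  ALA A   3      43.007   5.013   0.368",
]
columns = read_columns(sample)
print(columns["resname"])
print(columns["x"])
```

For a file on disk, `read_columns(open(path))` would work the same way, since a file object yields one line at a time. Slicing by position rather than calling `line.split()` matters for real PDB files, where adjacent fields can touch without any whitespace between them.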