I'm in a bit unusual situation. There are seven different proteins stored in a single file according to their residues names. Each protein has different sequence length. Now I need to calculate the center of mass of each protein and generate a time series data.I know how to do with a single protein, but do not with multiple protein system. For single protein I can do something like this:
import MDAnalysis as mda
import numpy as np
u = mda.Universe('lp400start.gro')
u1 = mda.Merge(u.select_atoms("not resname W and not resname WF and not resname ION"))
u1.load_new('lp400.xtc')
protein = u1.select_atoms("protein")
arr = np.empty((protein.n_residues, u1.trajectory.n_frames, 3))
for ts in u.trajectory:
arr[:, ts.frame] = protein.center_of_mass(compound='residues')
the data files are publicly available here. The residue sequence in the topology file can be checked using grep "^ *1[A-Z]" -B1 lp400final.gro | grep -v "^ *1[A-Z]"
, the output:
38ALA BB 74 52.901 33.911 6.318
--
38ALA BB 148 41.842 29.381 7.211
--
137GLY BB 455 36.756 4.287 3.284
--
137GLY BB 762 44.721 60.377 3.112
--
252HIS SC3 1368 28.682 37.936 6.727
--
252HIS SC3 1974 18.533 46.506 6.314
--
576PHE SC3 3263 48.937 38.538 4.013
--
576PHE SC3 4552 18.513 25.948 3.800
--
1092PRO SC1 6470 42.510 40.992 6.775
--
1092PRO SC1 8388 14.709 4.759 6.370
--
1016LEU SC110524 57.264 56.308 2.632
--
1016LEU SC112660 50.716 14.698 2.728
--
1285LYS SC215345 0.793 33.529 1.509
First protein has sequence length of 38 residues and it has a copy of its own and then the second protein and so on. Now I want to have the COM of each protein at each time frame and build it into a time series. In addition to proteins topology file also contains the DPPC particles as well. Could someone help me how to do this?
Thanks!
To make sure the output trajectory is correct it looks something like this enter link description here
I would load the system from the TPR file to maintain the bond information. Then MDAnalysis can determine fragments (namely, your proteins). Then loop over the fragments to determine the COM time series:
import MDAnalysis as mda
import numpy as np
# files from https://doi.org/10.5281/zenodo.846428
TPR = "lp400.tpr"
XTC = "lp400.xtc"
# build reduced universe to match XTC
# (ignore warnings that no coordinates are found for the TPR)
u0 = mda.Universe(TPR)
u = mda.Merge(u0.select_atoms("not resname W and not resname WF and not resname ION"))
u.load_new(XTC)
# segments (exclude the last one, which is DPPC and not protein)
protein_segments = u.segments[:-1]
# build the fragments
# (a dictionary with the key as the protein name -- I am using an
# OrderedDict so that the order is the same as in the TPR)
from collections import OrderedDict
protein_fragments = OrderedDict((seg.segid[6:], seg.atoms.fragments) for seg in protein_segments)
# analyze trajectory (with a nice progress bar)
timeseries = []
for ts in mda.log.ProgressBar(u.trajectory):
coms = []
for name, proteins in protein_fragments.items():
# loop over all the different proteins;
# unwrap to get the true COM under PBC (double check!!)
coms.extend([p.center_of_mass(unwrap=True) for p in proteins])
timeseries.append(coms)
timeseries = np.array(timeseries)
Note
unwrap=True
is doing the right thing (and that it is necessary — I didn't check if any proteins were split across periodic boundaries).(N_timesteps, M_proteins, 3)
, namely (10001, 14, 3)
.protein_fragments
is
OrderedDict([('EPHA', (<AtomGroup with 74 atoms>, <AtomGroup with 74 atoms>)),
('OMPA', (<AtomGroup with 307 atoms>, <AtomGroup with 307 atoms>)),
('OMPG', (<AtomGroup with 606 atoms>, <AtomGroup with 606 atoms>)),
('BTUB', (<AtomGroup with 1289 atoms>, <AtomGroup with 1289 atoms>)),
('ATPS', (<AtomGroup with 1918 atoms>, <AtomGroup with 1918 atoms>)),
('GLPF', (<AtomGroup with 2136 atoms>, <AtomGroup with 2136 atoms>)),
('FOCA', (<AtomGroup with 2685 atoms>, <AtomGroup with 2685 atoms>))])