Parse a multicolumn string using Python


I'm trying to extract data from the text output of a cheminformatics program called NWChem. I've already extracted the part of the output that I'm interested in (the vibrational modes). Here is the string that I have extracted:

s = '''                   1           2           3           4           5           6

 P.Frequency       -0.00        0.00        0.00        0.00        0.00        0.00

           1    -0.23581     0.00000     0.00000     0.00000     0.01800    -0.04639
           2     0.00000     0.25004     0.00000     0.00000     0.00000     0.00000
           3    -0.00000     0.00000     0.00000     0.00000    -0.21968    -0.08522
           4    -0.23425     0.00000     0.00000     0.00000    -0.14541     0.37483
           5     0.00000     0.00000     0.99611     0.00000     0.00000     0.00000
           6     0.00192     0.00000     0.00000     0.00000    -0.42262     0.43789
           7    -0.23425     0.00000     0.00000     0.00000    -0.14541     0.37483
           8     0.00000     0.00000     0.00000     0.99611     0.00000     0.00000
           9    -0.00193     0.00000     0.00000     0.00000    -0.01674    -0.60834

                    7           8           9

 P.Frequency     1583.30     3661.06     3772.30

           1    -0.00000    -0.00000     0.06664
           2     0.00000     0.00000     0.00000
           3    -0.06754     0.04934     0.00000
           4     0.41551     0.56874    -0.52878
           5     0.00000     0.00000     0.00000
           6     0.53597    -0.39157     0.42577
           7    -0.41551    -0.56874    -0.52878
           8     0.00000     0.00000     0.00000
           9     0.53597    -0.39157    -0.42577'''

First I split the data into blocks (one per group of columns) with a regex.

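import re
# Split just before each header row of mode numbers (digits and spaces only)
# that is followed by a blank line and the P.Frequency line.
# A raw string keeps \d and \. from being treated as string escapes.
p = re.compile(r'\n +(?=[\d ]+\n\n P\.Frequency +)')
d = p.split(s)
print(d[0])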

                   1           2           3           4           5           6

 P.Frequency       -0.00        0.00        0.00        0.00        0.00        0.00

           1    -0.23581     0.00000     0.00000     0.00000     0.01800    -0.04639
           2     0.00000     0.25004     0.00000     0.00000     0.00000     0.00000
           3    -0.00000     0.00000     0.00000     0.00000    -0.21968    -0.08522
           4    -0.23425     0.00000     0.00000     0.00000    -0.14541     0.37483
           5     0.00000     0.00000     0.99611     0.00000     0.00000     0.00000
           6     0.00192     0.00000     0.00000     0.00000    -0.42262     0.43789
           7    -0.23425     0.00000     0.00000     0.00000    -0.14541     0.37483
           8     0.00000     0.00000     0.00000     0.99611     0.00000     0.00000
           9    -0.00193     0.00000     0.00000     0.00000    -0.01674    -0.60834

However, I can't figure out how to extract the vibrational modes, which are laid out vertically (one mode per column). I would like easy access to each vibrational mode as a list of lists, or maybe a NumPy array, like this:

[[-0.00, -0.23581, 0.00000, ..., -0.00193],
 [0.00, 0.00000, ..., 0.00000],
 ...
 [3772.30, 0.06664, ..., 0.00000, -0.42577]]

Solution

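  • With two np.genfromtxt reads I can load your data file into two arrays and concatenate them into one 9x9 array. The P.Frequency rows are skipped and the row-index column is dropped, so the array holds only the displacements: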

    In [134]: rows1 = np.genfromtxt('stack30874236.txt',names=None,skip_header=4,skip_footer=10)
    
    In [135]: rows2 =np.genfromtxt('stack30874236.txt',names=None,skip_header=17)
    
    In [137]: rows=np.concatenate([rows1[:,1:],rows2[:,1:]],axis=1)
    
    In [138]: rows
    Out[138]: 
    array([[-0.23581,  0.     ,  0.     ,  0.     ,  0.018  , -0.04639, -0.     , -0.     ,  0.06664],
           [ 0.     ,  0.25004,  0.     ,  0.     ,  0.     ,  0.     , 0.     ,  0.     ,  0.     ],
           ...
           [-0.00193,  0.     ,  0.     ,  0.     , -0.01674, -0.60834, 0.53597, -0.39157, -0.42577]])
    
    In [139]: rows.T
    Out[139]: 
    array([[-0.23581,  0.     , -0.     , -0.23425,  0.     ,  0.00192,  -0.23425,  0.     , -0.00193],
           [ 0.     ,  0.25004,  0.     ,  0.     ,  0.     ,  0.     ,
           ...
           [ 0.06664,  0.     ,  0.     , -0.52878,  0.     ,  0.42577, -0.52878,  0.     , -0.42577]])
    

    I had to choose the skip_header/skip_footer values by hand to fit the data file. Deducing them in code would take some more work.
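
For a version that does not rely on hand-picked skip values, here is a rough sketch that walks the text line by line instead of using np.genfromtxt. It is only an illustration under a few assumptions: every block has a line starting with P.Frequency, and every data row is a row index followed by one decimal value per column. The names parse_modes and sample are made up for this example, and sample is a cut-down snippet in the same layout as the question's string, not the full data.

```python
def parse_modes(text):
    """Return one list per mode: [frequency, displacement_1, ..., displacement_n]."""
    modes = []   # every mode collected so far, in column order
    block = []   # the modes (columns) of the block currently being read
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # blank line
        if tokens[0] == "P.Frequency":
            # A frequency line opens a new block with one column per value.
            block = [[float(tok)] for tok in tokens[1:]]
            modes.extend(block)  # same list objects, so later appends show up in modes
        elif block and len(tokens) == len(block) + 1 and "." in tokens[1]:
            # Data row: a row index followed by one decimal value per column.
            for column, tok in zip(block, tokens[1:]):
                column.append(float(tok))
        # Anything else (such as the header row of mode numbers) is ignored.
    return modes


# Hypothetical, shortened sample in the same layout as the question's string.
sample = """\
                   1           2

 P.Frequency       -0.00        0.00

           1    -0.23581     0.00000
           2     0.00000     0.25004

                   3

 P.Frequency     1583.30

           1    -0.00000
           2     0.00000
"""

modes = parse_modes(sample)
print(len(modes))  # 3
```

Run on the full string s from the question, this should give nine lists of ten numbers each (the frequency followed by the nine displacements), and np.array(parse_modes(s)) would turn that into a 9x10 array.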