Reading csv with specific formats via pandas read_csv


I am trying to read a file of atomic masses and related data from the AMDC site (https://www-nds.iaea.org/amdc/), using pandas.read_csv. My current reading code looks something like this:

import pandas as pd

# Import isotope masses; see: https://www-nds.iaea.org/amdc/
masses = pd.read_csv('isotope_data/mass_1.mas20.txt', skiprows=36, skipfooter=2,
                     names=['1N', '-Z', 'N', 'Z', 'A', 'EL', '0', 'Delta', 'eDelta', 'BE', 'eBE',
                            'DC', 'BeE', 'eBeE', 'AMU', 'AMU2', 'eAMU'],
                     sep=r'\s+', engine='python')  # skipfooter needs the Python engine

print(masses.head(100))

(Note: this call isn't quite right yet because the columns don't line up correctly; I'm still working on that part.) This approach will probably work, but one nice thing about the input file is that its comment section gives the exact format specification for every row of data:

   col 1     :  Fortran character control: 1 = page feed  0 = line feed
   format    :  a1,i3,i5,i5,i5,1x,a3,a4,1x,f14.6,f12.6,f13.5,1x,f10.5,1x,a2,f13.5,f11.5,1x,i3,1x,f13.6,f12.6

Unfortunately, I think this format specification is only directly usable in Fortran. Is there a way to read this formatting information and apply it to my read_csv call so that each variable is parsed with the right format?
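
To illustrate how the descriptors in that format line map onto column positions, here is a minimal sketch that turns the format string into (start, end) character ranges. The function name fortran_colspecs is hypothetical, and the sketch only understands the a, i, f and x descriptors that appear in this particular format line (plus an optional repeat count); it is not a general Fortran format parser.

```python
import re

# Format line copied from the comment section of the data file.
FORMAT = ("a1,i3,i5,i5,i5,1x,a3,a4,1x,f14.6,f12.6,f13.5,1x,f10.5,1x,a2,"
          "f13.5,f11.5,1x,i3,1x,f13.6,f12.6")

# One descriptor: optional repeat count, a letter, optional width, optional decimals.
_TOKEN = re.compile(r"(\d*)([aifx])(\d*)(?:\.(\d+))?", re.IGNORECASE)


def fortran_colspecs(fmt):
    """Return a (start, end, kind) tuple for every data field in fmt.

    Hypothetical helper: 'x' descriptors only advance the cursor and
    produce no field. Ranges are half-open, i.e. line[start:end].
    """
    specs = []
    pos = 0
    for token in fmt.strip("() ").split(","):
        match = _TOKEN.fullmatch(token.strip())
        if match is None:
            raise ValueError(f"unsupported descriptor: {token!r}")
        repeat, kind, width, _decimals = match.groups()
        kind = kind.lower()
        if kind == "x":
            pos += int(repeat or 1)  # nX skips n characters
            continue
        if not width:
            raise ValueError(f"descriptor needs a width: {token!r}")
        for _ in range(int(repeat or 1)):
            specs.append((pos, pos + int(width), kind))
            pos += int(width)
    return specs


colspecs = fortran_colspecs(FORMAT)
print(colspecs[:3])  # [(0, 1, 'a'), (1, 4, 'i'), (4, 9, 'i')]
```

The (start, end) pairs are half-open ranges, which is the form the colspecs argument of pandas.read_fwf expects, so they could be passed there instead of splitting on whitespace with read_csv. The # and * markers in the data would still need separate handling.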

Maybe this is partly a question of which data structures I plan to use after the call. Typically, I extract each column of the resulting pandas DataFrame into a NumPy array, because I'm more comfortable working with those.


Solution

  • I suggest using the fortranformat package. The following parser handles the fixed-width fields, and deals with the # and * characters by replacing each * with a space and each # with a decimal point before parsing:

    import fortranformat as ff

    # Format string copied from the comment section of the data file
    reader = ff.FortranRecordReader('(a1,i3,i5,i5,i5,1x,a3,a4,1x,f14.6,f12.6,f13.5,1x,f10.5,1x,a2,f13.5,f11.5,1x,i3,1x,f13.6,f12.6)')
    masses = []
    with open('mass_1.mas20.txt', 'r') as f:
        for count, line in enumerate(f, start=1):
            if count > 36:  # skip the 36 header lines
                # Replace '*' with a blank and '#' with '.' so the fields parse
                masses.append(reader.read(line.replace('*', ' ').replace('#', '.')))
    print(masses[:100])
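
The parser above yields one list per line of the file, whereas the question asks for one array per variable. A minimal sketch of the transposition follows, assuming every record has the same number of fields. The three rows here are shortened to four fields (N, Z, A, EL) for brevity and stand in for the full 17-field records; the name columns is illustrative only.

```python
# Column names taken from the question's names list (subset for brevity).
names = ["N", "Z", "A", "EL"]

# Stand-in records: neutron, hydrogen-1 and deuterium, truncated to four fields.
rows = [
    [1, 0, 1, "n"],
    [0, 1, 1, "H"],
    [1, 1, 2, "H"],
]

# zip(*rows) transposes the row-wise records into column-wise tuples.
columns = {name: list(values) for name, values in zip(names, zip(*rows))}

print(columns["A"])   # [1, 1, 2]
print(columns["EL"])  # ['n', 'H', 'H']
```

Each list can then be passed to numpy.array, or the rows can go straight into pandas.DataFrame(rows, columns=names) if a DataFrame is preferred.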